Today we learn how to fetch all the links on a website. We write a small script for that purpose. I came across this technique while reading a bash guide.

Example: microsoft.com. I want to get a list of all the domains that appear on microsoft.com and fetch their IP addresses.

Step 1: Fetch microsoft.com

Save the page as HTML, or simply wget it.
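If wget is not available, curl works just as well; a minimal equivalent is sketched below (note that curl needs -L to follow the redirects that wget follows by default):

curl -L -o index.html http://microsoft.com/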

root@ETHICALHACKX:~# wget microsoft.com
--2020-02-16 22:51:31--  http://microsoft.com/
Resolving microsoft.com (microsoft.com)... 13.77.161.179, 40.113.200.201, 40.112.72.205, ...
Connecting to microsoft.com (microsoft.com)|13.77.161.179|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://www.microsoft.com/ [following]
--2020-02-16 22:51:32--  https://www.microsoft.com/
Resolving www.microsoft.com (www.microsoft.com)... 106.51.146.24, 2600:140f:4:1a1::356e, 2600:140f:4:186::356e
Connecting to www.microsoft.com (www.microsoft.com)|106.51.146.24|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://www.microsoft.com/en-in/ [following]
--2020-02-16 22:51:32--  https://www.microsoft.com/en-in/
Reusing existing connection to www.microsoft.com:443.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘index.html’

index.html                   [ <=>                            ] 153.15K  --.-KB/s    in 0.1s    

2020-02-16 22:51:32 (1.48 MB/s) - ‘index.html’ saved [156821]

root@ETHICALHACKX:~# ls -l index.html
-rw-r--r-- 1 root root 156821 Feb 16 22:51 index.html
root@ETHICALHACKX:~# 

Step 2: We need the links, which appear in the page in the form:
<li><a href="http://xbox.microsoft.com/">XBox</a></li>
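As an optional sanity check, grep -c counts how many lines of the saved page contain an href, so we know roughly how much there is to filter:

grep -c "href=" index.html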

Step 3: Cut out the lines containing links
We use grep to pull out the lines that contain a link; in short, the lines with "href" in them.
grep "href=" index.html

root@ETHICALHACKX:~# grep "href=" index.html
        <link rel="dns-prefetch" href="https://assets.onestore.ms" />
        <link rel="preconnect" href="https://assets.onestore.ms" />
        <link rel="dns-prefetch" href="https://web.vortex.data.microsoft.com" />
        <link rel="preconnect" href="https://web.vortex.data.microsoft.com" />
        <link rel="dns-prefetch" href="https://mem.gfx.ms" />
        <link rel="preconnect" href="https://mem.gfx.ms" />
        <link rel="dns-prefetch" href="https://img-prod-cms-rt-microsoft-com.akamaized.net" />
        <link rel="preconnect" href="https://img-prod-cms-rt-microsoft-com.akamaized.net" />
        <link rel="dns-prefetch" href="https://microsoftwindows.112.2o7.net" />
        <link rel="preconnect" href="https://microsoftwindows.112.2o7.net" />
    <link rel="SHORTCUT ICON" href="https://c.s-microsoft.com/favicon.ico?v2" type="image/x-icon" />

Step 4: We further clean the output by stripping the surrounding text. Notice that if we split each line on the "/" character, the domain sits in the third field, so let's use that.
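A quick illustration on a made-up input line (the URL here is just an example): with "/" as the delimiter, field 1 is everything before "//", field 2 is the empty string between the two slashes, and field 3 is the bare hostname.

echo '<a href="https://example.com/page">' | cut -d "/" -f3

This prints example.com.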

root@ETHICALHACKX:~# grep "href=" index.html | cut -d "/" -f3
assets.onestore.ms" 
assets.onestore.ms" 
web.vortex.data.microsoft.com" 
web.vortex.data.microsoft.com" 
mem.gfx.ms" 
mem.gfx.ms" 
img-prod-cms-rt-microsoft-com.akamaized.net" 
img-prod-cms-rt-microsoft-com.akamaized.net" 
microsoftwindows.112.2o7.net" 
microsoftwindows.112.2o7.net" 
c.s-microsoft.com

www.microsoft.com
www.microsoft.com
www.microsoft.com
www.microsoft.com
www.microsoft.com
www.microsoft.com

Step 5: The output is now better than the previous one; let's put in a bit more effort to clean up the extra lines in the list.
We keep only the lines that contain a period "." (every domain name does), which drops the blank lines and other leftovers.
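The dot is escaped because a bare "." in grep matches any single character and would keep every non-empty line, while "\." matches a literal dot. A tiny made-up demonstration:

printf 'notadomain\nwww.microsoft.com\n' | grep "\."

Only www.microsoft.com makes it through.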

root@ETHICALHACKX:~# grep "href=" index.html | cut -d "/" -f3 | grep "\."
assets.onestore.ms" 
assets.onestore.ms" 
web.vortex.data.microsoft.com" 
web.vortex.data.microsoft.com" 
mem.gfx.ms" 
mem.gfx.ms" 
img-prod-cms-rt-microsoft-com.akamaized.net" 
img-prod-cms-rt-microsoft-com.akamaized.net" 
microsoftwindows.112.2o7.net" 
microsoftwindows.112.2o7.net" 
c.s-microsoft.com
www.microsoft.com
www.microsoft.com
www.microsoft.com
www.microsoft.com
www.microsoft.com
www.microsoft.com
www.microsoft.com" aria-label="Microsoft" data-m='{"cN":"GlobalNav_Logo_cont","cT":"Container","id":"c3c2m1r1a1","sN":3,"aN":"c2m1r1a1"}'>
products.office.com
www.microsoft.com
www.microsoft.com

Step 6: We make the output cleaner still by keeping only the part of each line before the first '"' character, i.e. field one with '"' as the delimiter.
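For example, the messy leftover line from the previous step loses everything after its first double quote (illustrative, run on a single line):

echo 'www.microsoft.com" aria-label="Microsoft"' | cut -d '"' -f1

This prints just www.microsoft.com.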

root@ETHICALHACKX:~# grep "href=" index.html | cut -d "/" -f3 | grep "\." | cut -d '"' -f1
assets.onestore.ms
assets.onestore.ms
web.vortex.data.microsoft.com
web.vortex.data.microsoft.com
mem.gfx.ms
mem.gfx.ms
img-prod-cms-rt-microsoft-com.akamaized.net
img-prod-cms-rt-microsoft-com.akamaized.net
microsoftwindows.112.2o7.net
microsoftwindows.112.2o7.net
c.s-microsoft.com
www.microsoft.com
www.microsoft.com
www.microsoft.com
www.microsoft.com
www.microsoft.com
www.microsoft.com
www.microsoft.com
products.office.com
www.microsoft.com
www.microsoft.com
www.xbox.com
support.microsoft.com

Step 7: We now have a clean list, but with duplicates; let's sort it with the -u (unique) flag.
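Equivalently, you could pipe through sort and then uniq; sort -u simply does both in one step:

grep "href=" index.html | cut -d "/" -f3 | grep "\." | cut -d '"' -f1 | sort | uniq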

root@ETHICALHACKX:~# grep "href=" index.html | cut -d "/" -f3 | grep "\." | cut -d '"' -f1 | sort -u
account.microsoft.com
assets.onestore.ms
azure.microsoft.com
careers.microsoft.com
channel9.msdn.com
choice.microsoft.com
c.s-microsoft.com
developer.microsoft.com
docs.microsoft.com
go.microsoft.com
img-prod-cms-rt-microsoft-com.akamaized.net
mem.gfx.ms
microsoftwindows.112.2o7.net
msdn.microsoft.com
news.microsoft.com
onedrive.live.com
outlook.live.com
privacy.microsoft.com
products.office.com
store.office.com
support.microsoft.com
technet.microsoft.com
twitter.com
visualstudio.microsoft.com
web.vortex.data.microsoft.com
www.facebook.com
www.microsoft.com
www.onenote.com
www.skype.com
www.xbox.com
www.youtube.com
root@ETHICALHACKX:~# 

Step 8: Export this list to a text file.

root@ETHICALHACKX:~# grep "href=" index.html | cut -d "/" -f3 | grep "\." | cut -d '"' -f1 | sort -u > microsoft.txt
root@ETHICALHACKX:~# 

Step 9: We now look up the IP address of each domain in the saved file with the host command.
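On its own, the lookup loop (before any filtering) looks like this; for each domain, host prints lines such as "<domain> has address <IP>", along with other records (IPv6 addresses, mail servers) that we don't need here:

for url in $(cat microsoft.txt); do host $url; done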

Step 10: The raw host output contains a lot we don't need, so let's trim it down: grep the output for "has address", cut out the address field, and sort unique again.

root@ETHICALHACKX:~# for url in $(cat microsoft.txt); do host $url; done | grep "has address" | cut  -d " " -f 4 | sort -u
104.121.243.97
104.121.246.126
104.122.12.200
104.244.42.65
106.51.144.228
106.51.144.82
106.51.146.105
106.51.146.24
13.107.42.11
13.107.42.13
13.107.42.16
13.235.141.20
13.235.224.156
13.92.199.137
172.217.160.142
172.217.163.110
172.217.163.174
172.217.163.206
172.217.163.46
172.217.163.78
172.217.166.110
172.217.167.142
172.217.26.174
172.217.26.206
172.217.31.206
184.29.11.224
192.237.225.141
202.83.22.200
202.83.22.218
216.58.196.174
216.58.197.46
216.58.197.78
216.58.200.142
23.8.183.228
23.8.185.225
23.8.187.90
23.8.188.96
31.13.79.35
40.77.226.250
52.109.120.67
52.109.56.1
52.113.194.133
65.52.210.213
root@ETHICALHACKX:~# 

So we got the IP addresses of all the domains appearing on microsoft.com without much trouble. The same approach can be applied in many other scenarios.
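Putting it all together, here is a sketch of the whole thing as one small script. The script name, the output file domains.txt, and the command-line argument are my own choices for the example, not anything fixed:

#!/bin/bash
# Sketch: fetch a page, extract the domains it links to, and resolve their IPs.
# Usage: ./get_domains_ips.sh microsoft.com
site="$1"

# Step 1: fetch the page quietly and save it as index.html
wget -q -O index.html "http://$site"

# Steps 3-8: pull the href lines, keep the hostname field, deduplicate
grep "href=" index.html | cut -d "/" -f3 | grep "\." | cut -d '"' -f1 | sort -u > domains.txt

# Steps 9-10: resolve every domain and keep only the unique IPv4 addresses
for url in $(cat domains.txt); do host "$url"; done | grep "has address" | cut -d " " -f4 | sort -u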

What do you think of this? Let me know in the comments.