Shell Script To Show All the Internal and External Links From a URL
  • Last Updated : 20 Apr, 2021

Webpage linking is how web developers connect pages to build a hierarchy of information among them. There are two types of webpage linking: internal and external. Internal links point to pages on the same website, creating navigation paths within the site, while external links point to another website or domain. External links play a vital role in ranking a website on search engines: increasing the number of external links that point to a website can improve its rank. Here we are asked to write a shell script that prints all of these links to the terminal. The only input to the script is the URL of the webpage whose links we want to fetch.

Note: A website can be accessed in two ways: through a graphical web browser, or through terminal commands, which support only a limited set of protocols. Because plain terminal commands have these limitations, we will also use a terminal-based web browser to connect to the website.

CLI:

For the command line, we are going to use the tool "lynx". Lynx is a terminal-based web browser that does not render images or other multimedia content, which makes it much faster than graphical browsers.

# sudo apt install lynx -y
Install lynx terminal browser

Let us list the links on the GeeksForGeeks projects page. But first, we must understand the options the lynx browser provides.

  • -dump: This dumps the formatted output of the document to standard output.
  • -listonly: This lists only the links present at the given URL. It is used together with -dump.

Now apply these options:



# lynx -dump -listonly https://www.geeksforgeeks.org/computer-science-projects/?ref=shm
dump all links on terminal

Or redirect this terminal output to any text file:

# lynx -dump -listonly https://www.geeksforgeeks.org/computer-science-projects/?ref=shm > links.txt

Now view the links using the cat command:

# cat links.txt


Shell Script

We can do all of the work above in a script, which is both easier and more flexible. There are different ways to extract the links; here we will use a regular expression with the "sed" command. First we download the webpage as text, and then apply the regular expression to the text file.
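Before putting the sed command into a script, we can try the pattern on a small HTML snippet. This is a quick sketch to show how the regex captures the value inside each href="..." attribute; the snippet URLs are made up for illustration.

```shell
# Feed two sample anchor tags to sed; the regex captures the text
# between href=" and the next double quote, and -n/p prints only
# lines where a substitution happened.
printf '%s\n' \
  '<a href="https://example.com/page1">One</a>' \
  '<a href="/about">About</a>' |
sed -n 's/.*href="\([^"]*\)".*/\1/p'
# prints:
# https://example.com/page1
# /about
```

Note that the pattern prints at most one link per input line, which is usually fine for downloaded HTML but can miss links when several anchors share a line.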

Now we will create a file using the nano editor. The code is explained in the comments below.

# nano returnLinks.sh


Below is the implementation:



#!/bin/bash
# Read the URL from standard input
read urL

# wget downloads the webpage; the -O option writes
# the content of the URL to the file webpage.txt
wget -O webpage.txt "$urL"

# Now apply the stream editor to extract every href="..." value from the file
sed -n 's/.*href="\([^"]*\)".*/\1/p' webpage.txt

Give permission to the file:

To execute a file from the terminal, we first make it executable by changing its permission bits. Here 777 grants read, write, and execute permissions to the owner, the group, and everyone else; more restrictive modes can be used to limit access to the file.

# chmod 777 returnLinks.sh
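As a side note, 777 makes the script world-writable, which is more than we need. A less permissive sketch (assuming returnLinks.sh was created in the previous step) is:

```shell
# 755 gives the owner read/write/execute and everyone else
# read/execute, which is enough to run the script.
chmod 755 returnLinks.sh
ls -l returnLinks.sh    # the permissions column should read -rwxr-xr-x
```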

Now execute the shell script and give the URL:

# ./returnLinks.sh
shell script returns links

You can also store the output in a file:

The script is the same; we only add output redirection to the stream editor command so that the output is stored in the file.

#!/bin/bash
# Read the URL from standard input
read urL
# wget downloads the webpage into the file webpage.txt
wget -O webpage.txt "$urL"
# Apply the stream editor to extract the links, redirecting the
# output to links.txt. All the other code is the same.
sed -n 's/.*href="\([^"]*\)".*/\1/p' webpage.txt > links.txt


Now open links.txt and check whether all the links are present in the file:

# cat links.txt
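The script above prints internal and external links mixed together. Since the article distinguishes the two, here is a hedged sketch of how links.txt could be split with grep; the domain value and the output file names internal.txt and external.txt are assumptions you would adapt to your own site.

```shell
# Assumed domain of the site being scanned; adjust as needed.
domain="geeksforgeeks.org"
# Internal: relative paths (starting with /) or absolute URLs on the same domain.
grep -E "^/|$domain" links.txt > internal.txt
# External: absolute http(s) URLs on any other domain.
grep -E '^https?://' links.txt | grep -v "$domain" > external.txt
# Show how many links fell into each bucket.
wc -l internal.txt external.txt
```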

