Shell Script To Show All the Internal and External Links From a URL
Web developers connect webpages to one another to build a hierarchy among pages or pieces of information; this is called webpage linking. There are two types of webpage linking: internal linking and external linking. Internal links point to another page on the same website, so a visitor can move around within the site. External links point to a different website or domain. External links also play a vital role in how a website ranks on search engines: a site's rank generally improves as more links from other websites point to it. Here we are asked to write a shell script that prints all these links on the terminal screen. The only input provided to the script is the URL of the webpage whose links we need to fetch.
Note: A website can be accessed in two ways: through a web browser, or through terminal commands that support a limited set of protocols. Because terminal commands have some limitations, we will also use a terminal-based web browser to help us connect to the website.
For the command line, we are going to use the tool “lynx”. Lynx is a terminal-based web browser that does not show images or other multimedia content, which makes it much faster than other browsers.
# sudo apt install lynx -y
Let us look at the links on the GeeksForGeeks project page. But first we must understand the relevant options of the lynx browser.
- -dump: This will dump the formatted output of the document.
- -listonly: This will list all the links present on the URL mentioned. This is used with -dump.
Now apply these options:
# lynx -dump -listonly "https://www.geeksforgeeks.org/computer-science-projects/?ref=shm"
Or redirect this terminal output to any text file:
# lynx -dump -listonly "https://www.geeksforgeeks.org/computer-science-projects/?ref=shm" > links.txt
Now view the links using the cat command:
# cat links.txt
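The numbered list that lynx prints can be reduced to bare URLs for further processing. The sketch below assumes the usual layout of a "References" heading followed by lines such as "1. https://..."; the helper name clean_lynx_list and the sample URLs are hypothetical, and the sample text stands in for real lynx output so the example runs without network access.

```shell
#!/bin/bash
# clean_lynx_list (hypothetical helper): reads `lynx -dump -listonly` style
# output on stdin and prints each unique URL once, one per line.
clean_lynx_list() {
    # Keep only lines that start with a number and a dot ("  12. URL"),
    # strip that numbering, then sort and remove duplicates.
    sed -n 's/^[[:space:]]*[0-9][0-9]*\.[[:space:]]*//p' | sort -u
}

# Sample input imitating the layout lynx prints (assumed, not fetched live).
sample='References

   1. https://example.com/a
   2. https://example.com/b
   3. https://example.com/a'

printf '%s\n' "$sample" | clean_lynx_list
```

With real output, the same function could be placed at the end of the pipeline, for example: lynx -dump -listonly "$url" | clean_lynx_list > links.txt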
We could easily do all of the work above from a script file, which is easier and more enjoyable as well. There are different ways to get the links, such as regular expressions (regex). We will use a regex with the “sed” command. First, we will download the webpage as text and then apply the regular expression to the text file.
Now we will create a file using the nano editor. The code is explained in its comments.
# nano returnLinks.sh
Below is the implementation:
#!/bin/bash

# Read the URL from standard input
read -r urL

# wget will now download this webpage into the file named webpage.txt
# The -O option writes the downloaded content to the file mentioned.
wget -O webpage.txt "$urL"

# Now we will apply the stream editor (sed) to filter the links from the file.
# -n suppresses normal output; the p flag prints only the lines where the substitution matched.
sed -n 's/.*href="\([^"]*\).*/\1/p' webpage.txt
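Because the .* at the start of the sed pattern is greedy, the sed command prints at most one link per line of HTML (the last href on that line). A minimal alternative sketch follows, assuming a grep that supports the -o option (GNU and BSD grep both do). The helper name extract_hrefs and the sample markup are hypothetical, and only double-quoted href values are handled.

```shell
#!/bin/bash
# extract_hrefs (hypothetical helper): reads HTML on stdin and prints every
# double-quoted href value, one per line.
extract_hrefs() {
    # grep -o prints each match on its own line, so several links on the
    # same line of HTML are all reported; sed then strips the href="..." wrapper.
    grep -o 'href="[^"]*"' | sed 's/^href="//; s/"$//'
}

# Sample markup with two links on a single line (assumed, not downloaded).
sample='<p><a href="/about">About</a> <a href="https://example.org/x">X</a></p>'

printf '%s\n' "$sample" | extract_hrefs
```

Applied to the downloaded page, this would be run as: extract_hrefs < webpage.txt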
Give permission to the file:
To execute a file from the terminal, we first make it executable by changing the access mode of the file. Here 777 grants read, write, and execute permission to the owner, the group, and all other users. More restrictive modes can be used to limit access to the file.
# chmod 777 returnLinks.sh
Now execute the shell script and give the URL when it waits for input:
# ./returnLinks.sh
You can also store the output in an external file:
The script stays the same; only an output redirection is added to the stream editor command so that the output is stored in a file.
#!/bin/bash

# Give the url
read urL

# wget will now download this webpage in the file named webpage.txt
wget -O webpage.txt "$urL"

# Now we will apply stream editor to filter the url from the file.
# Here we will use output redirection to a text file. All the other code is same.
sed -n 's/.*href="\([^"]*\).*/\1/p' webpage.txt > links.txt
Now open the file links.txt:
We will now open the file and check whether all the links are present.
# cat links.txt
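The collected links can then be split into internal and external ones by comparing each link's host with the host of the page. This is only a sketch under stated assumptions: the helper name classify_links is hypothetical, a link counts as internal only when its host matches the given host exactly (so subdomains and explicit ports are reported as external), and relative links are treated as internal.

```shell
#!/bin/bash
# classify_links (hypothetical helper): reads links on stdin, one per line,
# and labels each as internal, external, or other relative to the host in $1.
classify_links() {
    local host="$1" link link_host
    while IFS= read -r link; do
        case "$link" in
            "") continue ;;                      # skip empty lines
            http://*|https://*|//*)
                # Drop everything up to and including "//", then cut the
                # remainder at the first "/", "?" or "#" to isolate the host.
                link_host="${link#*//}"
                link_host="${link_host%%[/?#]*}"
                if [ "$link_host" = "$host" ]; then
                    printf 'internal %s\n' "$link"
                else
                    printf 'external %s\n' "$link"
                fi
                ;;
            mailto:*|tel:*|javascript:*)
                printf 'other %s\n' "$link"      # not a page link
                ;;
            *)
                printf 'internal %s\n' "$link"   # relative path or fragment
                ;;
        esac
    done
}

# Sample links (assumed for illustration, not taken from a real download).
printf '%s\n' \
    'https://www.geeksforgeeks.org/about/' \
    '/courses' \
    'https://example.com/page' |
    classify_links www.geeksforgeeks.org
```

With the file produced earlier, the call would look like: classify_links www.geeksforgeeks.org < links.txt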