Shell Script to Scrape the Definition of a Word From Wikipedia

  • Last Updated : 05 Jul, 2021

Web scraping is an interesting and powerful skill to have in a programmer's toolkit. It helps in collecting data from websites and analyzing it in various formats. Web scraping is the process of fetching a website's content and using patterns in its HTML tags to extract the desired content.

In this article, we aim to fetch the meaning of a word entered by the user from Wikipedia. We need to print just the meaning of the word, stripped of the HTML tags around it. To do this, we need a good understanding of HTML and some basic Linux tools such as cURL, grep, and sed.

Inspecting the Target Website:

Before scraping a website, it is important to inspect it and view its source code. For that, we can use the Inspect tool in the browser. Right-click on the page you want to scrape and a list of options appears. Select the Inspect option (or press Ctrl + Shift + I), which opens a side panel with many options. Select Elements from the top menu. The code you see is the source of the page. Edits made here change only your local view, not the website itself.


Next, we analyze the page to locate the content we want to scrape. Click the "Select an element in the page to inspect it" icon in the top-left corner of the panel. This lets you inspect the particular element you select on the page. You can now see the element's tag, id, class, and other attributes needed to fetch its content.




Accessing the Website from Command Line/Terminal:

With the website's structure understood, we can move on to scraping it. For that, we need the website's content on our local machine in a form that command-line tools can process, so we fetch it from the terminal rather than from the browser. A popular tool for this is cURL, which stands for client URL. It fetches the contents of the provided URL and has several options that modify its behavior and output. We can use the command

$ curl -o output.txt https://en.wikipedia.org/wiki/Data


The above command fetches the HTML page for the word Data and saves it to output.txt; it could be any word you are searching for.
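The fetch step can also be wrapped in a small function that reports failures. The following is only a sketch: the name fetch_page is hypothetical and not part of the original article, and the extra curl options (-s, -S, -f, -L) are an assumption about what is convenient here, not something the article prescribes.

```shell
# fetch_page is a hypothetical helper name used only for illustration.
# Usage: fetch_page URL OUTPUT_FILE
fetch_page() {
    # -s: hide the progress meter, -S: still print error messages,
    # -f: return a non-zero exit code on HTTP errors such as 404,
    # -L: follow redirects, -o: write the response body to a file.
    if ! curl -sSfL -o "$2" "$1"; then
        echo "Could not fetch $1" >&2
        return 1
    fi
}

# Example call (needs network access):
# fetch_page "https://en.wikipedia.org/wiki/Data" output.txt
```

Checking the exit status this way means a misspelled word that leads to a missing page stops the script early instead of passing an error page to the filtering step.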

Now we have to filter out the tag we identified while inspecting the page. First, grep keeps only the lines that contain a <p> tag, which is where the article text lives. Then sed uses a regular expression to delete every tag, that is, everything from < up to the next >, so that only plain text remains. The result may still contain some special characters and symbols.

$ grep "<p>" output.txt | sed 's/<[^>]*>//g'
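To see what this pipeline does without downloading anything, it can be tried on a few lines of sample HTML. This is a minimal sketch: the sample text and the function names strip_tags and clean_entities are made up for illustration. The entity cleanup assumes the leftover symbols are numeric entities such as &#91;1&#93; (a citation marker) and &#160; (a non-breaking space); real pages may contain others.

```shell
# strip_tags and clean_entities are illustrative names, not from the article.

# Keep lines containing the literal text <p>, then delete every <...> tag.
strip_tags() {
    grep "<p>" | sed 's/<[^>]*>//g'
}

# Assumption: leftovers are numeric entities like &#91;1&#93; and &#160;.
# Drop citation markers and turn non-breaking spaces into plain spaces.
clean_entities() {
    sed -e 's/&#91;[0-9]*&#93;//g' -e 's/&#160;/ /g'
}

# Made-up sample input standing in for a downloaded page.
sample='<h1>Data</h1>
<p><b>Data</b> are a collection of <a href="/wiki/Value">values</a>.&#91;1&#93;</p>
<div>Not a paragraph</div>'

printf '%s\n' "$sample" | strip_tags | clean_entities
# prints: Data are a collection of values.
```

Note that grep "<p>" matches only the exact text <p>, so a paragraph written with attributes, such as <p class="x">, is skipped by this filter.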


Making the Shell Script:

#!/bin/bash

# Require exactly one argument: the word to look up.
if [ $# -ne 1 ]; then
  echo "Usage: $(basename "$0") 'word'"
  exit 1
fi

outfile="output.txt"
word="$1"
url="https://en.wikipedia.org/wiki/$word"
echo "$url"

# Download the page for the word into $outfile.
curl -o "$outfile" "$url"

# Keep only lines containing <p>, then delete every HTML tag.
function strip_html(){
    grep "<p>" "$outfile" | sed 's/<[^>]*>//g' > temp.txt && mv temp.txt "$outfile"
}

# Print the cleaned text line by line.
function res(){
    echo "Answer"
    while IFS= read -r result; do
        echo "${result}"
    done < "$outfile"
}

strip_html
res
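Two small refinements could be layered on top of the script. Both are sketches under stated assumptions rather than part of the original: build_url and first_paragraph are hypothetical names, the first assumes the user may type a multi-word title (Wikipedia article URLs use underscores in place of spaces), and the second assumes the first non-blank line of the cleaned text is the definition we want.

```shell
# build_url and first_paragraph are hypothetical helpers for illustration.

# Build the article URL, replacing spaces in the title with underscores.
# Other special characters would need percent-encoding; that is not handled.
build_url() {
    printf 'https://en.wikipedia.org/wiki/%s\n' "$(printf '%s' "$1" | tr ' ' '_')"
}

# Print only the first non-blank line of the text read from standard input.
first_paragraph() {
    awk 'NF { print; exit }'
}

build_url "Web scraping"
# prints: https://en.wikipedia.org/wiki/Web_scraping
```

With these in place, the output file could be piped through first_paragraph (for example, first_paragraph < output.txt) to show a single paragraph instead of the whole page text.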
