Shell Script to Scrape the Definition of a Word From Wikipedia

Last Updated : 09 Jan, 2023

Web scraping is a useful skill to have in a programmer's toolkit. It helps in collecting data and extracting information in various formats. Web scraping is the process of fetching a website's content and pulling out the desired parts by matching patterns in its HTML tags.

In this article, we fetch the meaning of a word entered by the user from Wikipedia. We need to print just the meaning of the word, without the HTML tags around it. To do this, we need a good understanding of HTML and some basic Linux tools such as cURL, grep, and sed.

Inspecting the Target Website:

Before scraping a website, it is important to inspect it and view its source code. For that, we can use the Inspect tool in our browser. Right-click on the page you want to scrape, and a list of options appears. Select the Inspect option (or press Ctrl + Shift + I), which opens a side window with several tabs. Select Elements from the menu at the top. The code you see there is the source code of the page. Note that editing the code here changes only your local view, not the content of the actual website.


Now we have to find the content we want to scrape in that source. Click the “Select an element in the page to inspect it” icon in the top-left corner of the Inspect window. This lets you inspect the particular element you select on the webpage. You can then see the element's tag, id, class, and other attributes needed to fetch its content.


Accessing the Website from Command Line/Terminal:

Now that we understand the website's structure, we can move on to scraping it. For that, we need the website's content on our local machine, in a form that command-line tools can process, which the browser view does not give us. So let's use the command line. A popular tool for this is cURL, which stands for client URL. It fetches the contents of the provided URL, and it has several options that modify its behavior and output. The -o option writes the fetched content to the named file. We can use the command:

$ curl -o output.txt https://en.wikipedia.org/wiki/Data


The above command fetches the HTML page for the word Data and saves it to output.txt. You can replace Data with any word you are searching for.
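As a rough sketch of how the fetch step could be wrapped for reuse, the block below builds the article URL from a word and defines a download helper. The function names wiki_url and fetch_page are hypothetical, not part of the original script. Replacing spaces with underscores is an assumption about how multi-word Wikipedia titles are addressed, and the curl flags -f, -s, -S, and -L (fail on HTTP errors, hide the progress meter, still show errors, follow redirects) are additions to the plain -o call shown above.

```shell
#!/bin/bash

# Hypothetical helper: build the Wikipedia article URL for a word.
# Spaces are replaced with underscores (assumed title convention).
wiki_url() {
    printf 'https://en.wikipedia.org/wiki/%s\n' "$(printf '%s' "$1" | tr ' ' '_')"
}

# Hypothetical helper: download URL $1 into file $2.
# -f: fail on HTTP errors, -sS: quiet but still report errors,
# -L: follow redirects, -o: write the response body to the given file.
fetch_page() {
    curl -fsSL -o "$2" "$1"
}

wiki_url "Data"
# prints: https://en.wikipedia.org/wiki/Data
wiki_url "Information theory"
# prints: https://en.wikipedia.org/wiki/Information_theory

# To actually download the page (needs network access):
# fetch_page "$(wiki_url "Data")" output.txt
```

Because -f makes curl exit with a non-zero status on an HTTP error, a caller can test the result of fetch_page before moving on to the filtering step.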

Now we have to filter out the <p> tags that we identified while inspecting the page. We first use grep to keep only the lines that contain a <p> tag, then use sed with a regular expression to delete everything from < to the matching >, so the tags are removed and only the text between them remains. The result may still contain special characters and symbols, which would need a further filtering pass.

grep "<p>" output.txt | sed 's/<[^>]*>//g'
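To see what this pipeline does without downloading anything, the sketch below runs the same grep and sed steps on a small made-up HTML sample. The sample text and the function name strip_paragraphs are illustrative assumptions, not real Wikipedia output.

```shell
#!/bin/bash

# Made-up HTML standing in for the contents of output.txt.
sample='<html><body>
<h1 id="title">Data</h1>
<p>Data are <b>facts</b> collected for <a href="/wiki/Analysis">analysis</a>.</p>
<div class="nav">Menu</div>
</body></html>'

# Hypothetical helper: keep lines containing <p>, then delete every tag.
strip_paragraphs() {
    grep "<p>" | sed 's/<[^>]*>//g'
}

printf '%s\n' "$sample" | strip_paragraphs
# prints: Data are facts collected for analysis.
```

Note that grep "<p>" matches only lines containing the literal text <p>, so a paragraph whose opening tag carries attributes, or whose text continues on later lines, would be missed or cut short by this approach.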


Making the Shell Script:

#!/bin/bash

# Expect exactly one argument: the word to look up.
if [ $# -ne 1 ]; then
  echo "Usage: $(basename "$0") 'word'"
  exit 1
fi

outfile="output.txt"
word="$1"
url="https://en.wikipedia.org/wiki/$word"
echo "$url"

# Download the article's HTML into $outfile.
curl -o "$outfile" "$url"

# Keep only the lines containing <p>, then delete every HTML tag.
function strip_html(){
    grep "<p>" "$outfile" | sed 's/<[^>]*>//g' > temp.txt && mv temp.txt "$outfile"
}

# Print the cleaned text line by line.
function res(){
    echo "Answer"
    while IFS= read -r result; do
        echo "${result}"
    done < "$outfile"
}

strip_html
res
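The script prints every paragraph it finds, and the stripped text can still contain HTML entities. As one possible extra pass, the sketch below decodes a few common entities and prints only the first non-empty line, which is usually where an article's definition sits. The function names clean_entities and first_line and the sample text are hypothetical, and the assumption that citation markers appear as &#91;1&#93; after tag stripping should be checked against the actual page source.

```shell
#!/bin/bash

# Made-up tag-stripped text standing in for the cleaned output.txt.
stripped='
Data are facts &amp; figures&#91;1&#93; collected for analysis.
A second paragraph.'

# Hypothetical helper: drop bracketed citation markers (assumed to be
# encoded as &#91;...&#93;) and decode a few common entities.
# &amp; is decoded last so already-decoded text is not decoded twice.
clean_entities() {
    sed -e 's/&#91;[^&]*&#93;//g' \
        -e 's/&#160;/ /g' \
        -e 's/&lt;/</g' \
        -e 's/&gt;/>/g' \
        -e 's/&quot;/"/g' \
        -e 's/&amp;/\&/g'
}

# Hypothetical helper: print the first line that is not blank.
first_line() {
    awk 'NF { print; exit }'
}

printf '%s\n' "$stripped" | clean_entities | first_line
# prints: Data are facts & figures collected for analysis.
```

In the script, the same two steps could be applied to the output file after strip_html has run, for example by piping the file through clean_entities and first_line instead of calling res.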


meetgor