
Reading selected webpage content using Python Web Scraping


Prerequisite: Downloading files in Python, Web Scraping with BeautifulSoup

Python is an easy language to pick up, but much of its appeal comes from the large number of open-source libraries written for it. Requests is one of the most widely used of these libraries. It lets us fetch any HTTP/HTTPS page, send the same kinds of requests a browser would, and keep session state such as cookies across requests.
A webpage is just HTML code that the web server sends to our browser, which then renders it as the page we see. To pull specific content out of that HTML source, i.e. to find particular tags, we use a package called BeautifulSoup.
Installation:

pip3 install requests
pip3 install beautifulsoup4
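
With both packages installed, the session behaviour mentioned above can be tried without touching the network. The sketch below is only an illustration: the cookie name `visited` and the `example.com` address are made up, and the request is prepared but never sent.

```python
import requests

# A Session keeps cookies (and other settings) across requests.
session = requests.Session()

# Hypothetical cookie for a placeholder domain, set by hand for illustration.
session.cookies.set("visited", "yes", domain="example.com", path="/")

# Build a request and let the session prepare it; nothing is sent here.
request = requests.Request("GET", "https://example.com/news")
prepared = session.prepare_request(request)

# The session has attached its stored cookie to the outgoing request.
cookie_header = prepared.headers.get("Cookie")
print(cookie_header)
```

In a real scrape the cookies would normally come from the server's responses, and `session.get(...)` would send them back automatically on later requests.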

As an example, we read the headlines from a news site, Hindustan Times.

The code can be divided into three parts.

  • Requesting a webpage
  • Inspecting the tags
  • Print the appropriate contents

Steps:

  1. Requesting a webpage: First, right-click the news text in the browser and inspect it to view the source code.
  2. Inspecting the tags: We need to figure out which part of the source code contains the news section we want to scrape. It is the ul (unordered list) with the class “searchNews” that contains the news section.

    Note: The news text is in the text part of the anchor tags. A closer look shows that every news item sits in an li (list item) tag of that unordered list.

  3. Print the appropriate contents: The content is printed with the help of the code given below.




    import sys

    import requests
    from bs4 import BeautifulSoup

    def news(url):
        # open the target page with the GET method
        resp = requests.get(url, timeout=10)

        # HTTP status code 200 means OK
        if resp.status_code == 200:
            print("Successfully opened the web page")
            print("The news are as follow :-\n")

            # we need a parser; Python's built-in HTML parser is enough
            soup = BeautifulSoup(resp.text, 'html.parser')

            # news_list is the <ul> that holds all the news items
            news_list = soup.find("ul", {"class": "searchNews"})
            if news_list is None:
                print("Could not find the news list on this page")
                return

            # print only the text part of every anchor (<a>) in the list
            for link in news_list.find_all("a"):
                print(link.text)
        else:
            print("Error")

    if __name__ == "__main__":
        # the target we want to open is given on the command line,
        # i.e. the address of the news page to scrape
        news(sys.argv[1])

    
    

    Output

    Successfully opened the web page
    The news are as follow :-
    Govt extends toll tax suspension, use of old notes for utility bills extended till Nov 14
    Modi, Abe seal historic civil nuclear pact: What it means for India
    Rahul queues up at bank, says it is to show solidarity with common man
    IS kills over 60 in Mosul, victims dressed in orange and marked 'traitors'
    Rock On 2 review: Farhan Akhtar, Arjun Rampal's band hasn't lost its magic
    Rumours of shortage in salt supply spark panic among consumers in UP
    Worrying truth: India ranks first in pneumonia, diarrhoea deaths among kids
    To hell with romance, here's why being single is the coolest way to be
    India vs England: Cheteshwar Pujara, Murali Vijay make merry with tons in Rajkot
    Akshay-Bhumi, SRK-Alia, Ajay-Parineeti: Age difference doesn't matter anymore
    Currency ban: Only one-third have bank access; NE, backward regions worst hit
    Nepal's central bank halts transactions with Rs 500, Rs 1000 Indian notes
    Political upheaval in Punjab after SC tells it to share Sutlej water
    Let's not kid ourselves, with Trump, what we have seen is what we will get
    Want to colour your hair? Try rose gold, the hottest hair trend this winter
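
The tag-selection step can also be tried offline. The sketch below runs the same `find` / `find_all` logic on a small hand-written HTML string; the markup, headlines, and the helper name `extract_headlines` are hypothetical and only mimic the structure described in step 2 (a ul with class “searchNews” whose li items wrap anchors).

```python
from bs4 import BeautifulSoup

# Hypothetical markup shaped like the structure described in the steps:
# a <ul class="searchNews"> whose <li> items each wrap an <a> headline.
SAMPLE_HTML = """
<html><body>
  <ul class="menu"><li><a href="/home">Home</a></li></ul>
  <ul class="searchNews">
    <li><a href="/story-1">First headline</a></li>
    <li><a href="/story-2">Second headline</a></li>
  </ul>
</body></html>
"""

def extract_headlines(html):
    """Return (text, href) pairs for every anchor inside ul.searchNews."""
    soup = BeautifulSoup(html, "html.parser")
    news_list = soup.find("ul", {"class": "searchNews"})
    if news_list is None:
        # find() returns None when no matching tag exists
        return []
    return [(a.get_text(strip=True), a.get("href"))
            for a in news_list.find_all("a")]

headlines = extract_headlines(SAMPLE_HTML)
for text, href in headlines:
    print(text, "->", href)
```

Returning an empty list when the ul is missing keeps the loop from failing on a page whose layout has changed, which is a common situation with news sites.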
    


Last Updated : 11 Jul, 2022