
BeautifulSoup – Parsing only section of a document


BeautifulSoup is a Python module used to find specific content or tags in a web page that has been downloaded by another module, such as requests or scrapy. Remember that BeautifulSoup does not fetch a website itself; it parses the content fetched by other modules and presents it in a readable, searchable form. Let’s see how to parse only part of a page with an example.

Modules needed

First, we need to install these modules on our computer.

  • BeautifulSoup: Our primary module, which parses HTML and XML documents into a searchable tree.
pip install bs4
  • lxml: Helper library that BeautifulSoup uses as a fast parser for web pages.
pip install lxml
  • requests: Makes sending HTTP requests and reading the responses simple.
pip install requests

Let’s start by fetching a sample web page and see how to parse only one section of it.

Step 1: We import BeautifulSoup, SoupStrainer and requests. We declare a HEADERS dictionary that includes a user agent. This makes it less likely that the target website treats traffic from our program as spam and blocks it.

Python3




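from bs4 import BeautifulSoup, SoupStrainer
import requests

# Placeholder: the article does not name the target page; set this to the page you want to scrape.
URL = "https://example.com/"

# A browser-like User-Agent makes the request less likely to be rejected as automated traffic.
HEADERS = {
    'User-Agent': ('Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                   '(KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36'),
    'Accept-Language': 'en-US, en;q=0.5',
}

webpage = requests.get(URL, headers=HEADERS)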
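
To check which headers a request would carry without contacting any site, one option is to build the request and prepare it without sending it. The sketch below is only an illustration: the URL is a hypothetical placeholder, and the header values are the ones used in this article.

```python
import requests

# Hypothetical target; nothing is sent over the network in this sketch.
URL = "https://example.com/"

HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36"
    ),
    "Accept-Language": "en-US, en;q=0.5",
}

# Build the request and prepare it without sending it,
# so the headers it would carry can be inspected.
prepared = requests.Request("GET", URL, headers=HEADERS).prepare()

for name, value in prepared.headers.items():
    print(f"{name}: {value}")
```

Writing the user agent as two adjacent string literals inside parentheses avoids the stray spaces that a backslash continuation inside a quoted string would add to the header value.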


Step 2: Now we use SoupStrainer, which we imported in the first line, to keep only the parts of the page we need. In this case, we only want to parse span elements that have a class attribute of “mw-headline”. The keyword is written class_ with a trailing underscore because class is a reserved word in Python; when filtering elements by their id, no underscore is needed. The last line prints the parsed content in an indented, readable form.

The SoupStrainer class tells BeautifulSoup which parts of the document to extract, and the resulting parse tree consists of only these elements. We just have to pass the SoupStrainer object to the BeautifulSoup constructor as the parse_only argument.

Python3




# Parse only <span class="mw-headline"> elements; everything else is skipped.
soup = BeautifulSoup(webpage.content, "lxml",
                     parse_only=SoupStrainer('span', class_='mw-headline'))

print(soup.prettify())
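
The difference between filtering by class and filtering by id can be tried offline. The following is a minimal sketch that assumes a small made-up HTML string in place of a downloaded page, and uses Python's built-in "html.parser" so that it runs without lxml; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Small made-up HTML document so the example runs without network access.
HTML = """
<html><body>
  <div id="intro"><p>Welcome</p></div>
  <h2><span class="mw-headline">History</span></h2>
  <p>Some text that should be skipped.</p>
  <h2><span class="mw-headline">Products</span></h2>
</body></html>
"""

# Filter by class: note the trailing underscore, since `class` is a Python keyword.
by_class = SoupStrainer("span", class_="mw-headline")
class_soup = BeautifulSoup(HTML, "html.parser", parse_only=by_class)
headlines = [span.get_text() for span in class_soup.find_all("span")]
print(headlines)

# Filter by id: `id` is not a keyword, so no underscore is needed.
by_id = SoupStrainer(id="intro")
id_soup = BeautifulSoup(HTML, "html.parser", parse_only=by_id)
intro_text = id_soup.get_text(strip=True)
print(intro_text)
```

Note that parse_only is supported by the "html.parser" and "lxml" parsers, but not by html5lib.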


Complete Code:

Python3




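from bs4 import BeautifulSoup, SoupStrainer
import requests

# Placeholder: the article does not name the target page; set this to the page you want to scrape.
URL = "https://example.com/"

# A browser-like User-Agent makes the request less likely to be rejected as automated traffic.
HEADERS = {
    'User-Agent': ('Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                   '(KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36'),
    'Accept-Language': 'en-US, en;q=0.5',
}

webpage = requests.get(URL, headers=HEADERS)

# Parse only <span class="mw-headline"> elements; everything else is skipped.
soup = BeautifulSoup(webpage.content, "lxml",
                     parse_only=SoupStrainer('span', class_='mw-headline'))

print(soup.prettify())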


Output:

[Image: bs4 soupstrainer]
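
One way to see what the strained parse tree contains is to compare it with a full parse of the same document. This is a rough sketch on a made-up HTML string (not the page scraped above), again using the built-in "html.parser" so it runs without extra dependencies.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Made-up page used only for illustration.
HTML = """
<html><head><title>Demo</title></head><body>
  <h2><span class="mw-headline">History</span></h2>
  <p>Paragraph one.</p>
  <h2><span class="mw-headline">Products</span></h2>
  <p>Paragraph two.</p>
</body></html>
"""

# Full parse: every element of the document ends up in the tree.
full_soup = BeautifulSoup(HTML, "html.parser")

# Strained parse: only the matching <span> elements are kept.
strained_soup = BeautifulSoup(
    HTML, "html.parser",
    parse_only=SoupStrainer("span", class_="mw-headline"),
)

# find_all(True) returns every tag in a tree.
full_tags = [tag.name for tag in full_soup.find_all(True)]
strained_tags = [tag.name for tag in strained_soup.find_all(True)]

print(len(full_tags), "tags in the full tree")
print(len(strained_tags), "tags in the strained tree")
print(strained_soup.prettify())
```

The strained tree holds only the matching span elements, so the paragraphs, the title and the surrounding h2 tags never become part of it.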



Last Updated : 16 Mar, 2021