BeautifulSoup – Parsing only section of a document
  • Last Updated : 16 Mar, 2021

BeautifulSoup is a Python module for locating specific content and tags in HTML that has already been downloaded by another module, such as requests or scrapy. BeautifulSoup does not fetch a website itself; it parses the markup retrieved by other modules and presents it in a readable form. This article walks through an example that parses only one section of a page.

Modules needed

First, we need to install the following modules on our computer.

  • BeautifulSoup: Our primary module, which parses HTML and XML documents and lets us search the resulting tree.
pip install bs4
  • lxml: Helper library that BeautifulSoup can use as a fast parser for web pages.
pip install lxml
  • requests: Makes sending HTTP requests simple.
pip install requests
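After installing, a quick check can confirm that the three modules are importable. This is a minimal sketch using only the standard library; the helper name missing_modules and the REQUIRED list are illustrative and not part of any of the packages above.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the names from `names` that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


# Import names used in this article (hypothetical list for illustration).
REQUIRED = ["bs4", "lxml", "requests"]

if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are installed.")
```

If anything is reported as missing, run the matching pip install command from the list above.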

Let’s start by scraping a sample website and see how to parse only a section of the page.

Step 1: We import BeautifulSoup, SoupStrainer and requests, then declare a HEADERS dictionary that includes a user agent. This makes the request look like it comes from a regular browser, which reduces the chance that the target website treats traffic from our program as spam and blocks it.

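Python3

from bs4 import BeautifulSoup, SoupStrainer
import requests

# Placeholder: set this to the page you want to scrape. The example
# expects a page that contains <span class="mw-headline"> elements.
URL = "https://example.com/"

# Browser-like headers; the adjacent string literals are joined into one value
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
    'Accept-Language': 'en-US, en;q=0.5',
}

webpage = requests.get(URL, headers=HEADERS)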
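To see what the headers from Step 1 look like on an outgoing request without contacting any server, the request can be built and prepared but not sent. This is only a sketch: EXAMPLE_URL is a placeholder introduced here for illustration, and requests.Request(...).prepare() is used purely for inspection.

```python
import requests

HEADERS = {
    "User-Agent": ("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                   "(KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36"),
    "Accept-Language": "en-US, en;q=0.5",
}

# Placeholder URL (an assumption for this sketch); nothing is sent over the network.
EXAMPLE_URL = "https://example.com/"

# Build the request object and prepare it so the final headers can be inspected.
prepared = requests.Request("GET", EXAMPLE_URL, headers=HEADERS).prepare()

print(prepared.headers["User-Agent"])
print(prepared.headers["Accept-Language"])
```

Header lookups on the prepared request are case-insensitive, so "user-agent" and "User-Agent" refer to the same entry.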

Step 2: Now we use SoupStrainer, which we imported in the first line, to keep only the parts of the page we need. In this case, we only want to parse span elements that have a class attribute of “mw-headline”. Note the trailing underscore in class_: class is a reserved word in Python, so BeautifulSoup uses class_ as the keyword argument. No underscore is needed when filtering elements by id, because id is not a reserved word. The last line pretty-prints the parsed content.

The SoupStrainer class tells the parser which parts of the document to extract, and the resulting parse tree consists of those elements only. We just have to pass the SoupStrainer object to the BeautifulSoup constructor as the parse_only argument.
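The effect of parse_only can be checked offline on a small inline document. The sketch below assumes a made-up HTML snippet (SAMPLE_HTML) and uses Python's built-in "html.parser" so that it runs without lxml; the strainer itself is the same one used in this article.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Made-up markup for illustration only.
SAMPLE_HTML = """
<html><body>
  <h2><span class="mw-headline">History</span></h2>
  <p>Some paragraph text.</p>
  <h2><span class="mw-headline">Usage</span></h2>
  <span class="other">Not a headline</span>
</body></html>
"""

# Keep only <span> tags whose class attribute is "mw-headline".
only_headlines = SoupStrainer("span", class_="mw-headline")
soup = BeautifulSoup(SAMPLE_HTML, "html.parser", parse_only=only_headlines)

# Everything outside the matching spans was never added to the tree.
headlines = [span.get_text() for span in soup.find_all("span")]
print(headlines)
print(soup.find("p"))
```

Because the paragraph and the other span are skipped while parsing, searching the tree for them finds nothing.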

Python3

soup = BeautifulSoup(webpage.content, "lxml",
                     parse_only=SoupStrainer(
                         'span', class_='mw-headline'))

print(soup.prettify())
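For comparison, filtering by id uses the plain id keyword with no underscore. This is a small sketch on invented markup (the ids "content" and "footer" are assumptions for illustration), again using "html.parser" so it runs without lxml.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Invented markup for illustration only.
SAMPLE_HTML = """
<div id="content"><p>Main text</p></div>
<div id="footer"><p>Footer text</p></div>
"""

# `id` is not a reserved word in Python, so no trailing underscore is needed.
only_content = SoupStrainer("div", id="content")
soup = BeautifulSoup(SAMPLE_HTML, "html.parser", parse_only=only_content)

# The matching <div> is kept together with its children; the footer is skipped.
print(soup.get_text(strip=True))
print(soup.find(id="footer"))
```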

Complete Code:

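Python3

from bs4 import BeautifulSoup, SoupStrainer
import requests

# Placeholder: set this to the page you want to scrape. The example
# expects a page that contains <span class="mw-headline"> elements.
URL = "https://example.com/"

# Browser-like headers; the adjacent string literals are joined into one value
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
    'Accept-Language': 'en-US, en;q=0.5',
}

webpage = requests.get(URL, headers=HEADERS)

# Parse only the <span class="mw-headline"> elements of the response
soup = BeautifulSoup(webpage.content, "lxml",
                     parse_only=SoupStrainer(
                         'span', class_='mw-headline'))

print(soup.prettify())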

Output:

[Image: bs4 SoupStrainer output]
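If only the headline text is needed rather than the pretty-printed markup, the parsing step can be wrapped in a small function. This is a sketch under assumptions: the names extract_headlines and fetch_headlines are hypothetical helpers, and the timeout value and the raise_for_status() check are additions not present in the code above.

```python
import requests
from bs4 import BeautifulSoup, SoupStrainer


def extract_headlines(html, parser="lxml"):
    """Return the text of every <span class="mw-headline"> found in `html`."""
    only_headlines = SoupStrainer("span", class_="mw-headline")
    soup = BeautifulSoup(html, parser, parse_only=only_headlines)
    return [span.get_text(strip=True)
            for span in soup.find_all("span", class_="mw-headline")]


def fetch_headlines(url, headers=None, timeout=10):
    """Download `url` and return its headline texts (hypothetical helper)."""
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # stop early on HTTP error responses
    return extract_headlines(response.content)
```

Keeping the parsing separate from the download makes it possible to try the strainer on a saved or inline HTML string, for example extract_headlines(html_text, parser="html.parser").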
