
BeautifulSoup – Parsing only a section of a document

BeautifulSoup is a Python module for parsing HTML and XML and for finding specific tags or content in a page that has already been downloaded by another module, such as requests or scrapy. Remember that BeautifulSoup does not fetch a website itself; it parses the markup fetched by other modules into a tree that can be searched and printed in a readable form. To see how we can parse only one section of a page, we will work through an example.

Modules needed

First, we need to install all these modules on our computer.



pip install bs4
pip install lxml
pip install requests

Let’s start by scraping a sample website and see how to parse only a section of the page.

Step 1: We import BeautifulSoup and SoupStrainer from bs4, along with requests. We then declare HEADERS and add a User-Agent string. This makes the request look like it comes from a regular browser, so the target website is less likely to treat traffic from our program as spam and block it.






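from bs4 import BeautifulSoup, SoupStrainer
import requests

# Placeholder: this article does not name the target page, so replace
# this with the URL you want to scrape.
URL = "https://example.com"

# Browser-like headers sent with the request
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
    'Accept-Language': 'en-US, en;q=0.5',
}

webpage = requests.get(URL, headers=HEADERS)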
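To check which headers would be sent before anything goes over the network, one option is to build the request without sending it. The sketch below is only an illustration: the URL is a placeholder (this article does not name the target page), and build_request is a hypothetical helper, not part of requests.

```python
import requests

# Placeholder URL: the article does not specify the page being scraped.
URL = "https://example.com"

HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36"
    ),
    "Accept-Language": "en-US, en;q=0.5",
}


def build_request(url, headers):
    """Hypothetical helper: prepare a GET request without sending it."""
    return requests.Request("GET", url, headers=headers).prepare()


prepared = build_request(URL, HEADERS)

# Inspect the headers that would accompany the request.
for name, value in prepared.headers.items():
    print(f"{name}: {value}")
```

Passing the same dictionary to requests.get(URL, headers=HEADERS) sends these headers with the real request.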

Step 2: Now we use SoupStrainer, which we imported in the first line, to keep only the parts of the page we need. In this case, we just want to parse span elements that have a class attribute of “mw-headline”. Note that the keyword is written class_, with a trailing underscore, because class is a reserved word in Python; the underscore is not needed when we filter elements by their id. The last line prints the parsed content in a readable, indented form.

The SoupStrainer class tells BeautifulSoup which parts of the document to extract, and the resulting parse tree consists of these elements only. We just have to pass the SoupStrainer object to the BeautifulSoup constructor as the parse_only argument.




soup = BeautifulSoup(webpage.content, "lxml",
                     parse_only = SoupStrainer(
                       'span', class_ = 'mw-headline'))
  
print(soup.prettify())
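The effect of parse_only is easiest to see on a small, hard-coded document where no network request is involved. The following is a minimal sketch, assuming a made-up HTML snippet; it uses Python's built-in "html.parser" instead of lxml so that nothing beyond bs4 needs to be installed. It also shows the id-based variant, which takes a plain id argument with no underscore.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Made-up HTML used only for illustration.
html = """
<html><body>
  <h2><span class="mw-headline" id="History">History</span></h2>
  <p>Some paragraph text that we do not want to parse.</p>
  <h2><span class="mw-headline" id="Usage">Usage</span></h2>
  <p id="intro">Another paragraph.</p>
</body></html>
"""

# Keep only <span> tags whose class is "mw-headline".
only_headlines = SoupStrainer("span", class_="mw-headline")
soup = BeautifulSoup(html, "html.parser", parse_only=only_headlines)
headlines = [span.get_text() for span in soup.find_all("span")]
print(headlines)

# Keep only the element whose id is "intro".
only_intro = SoupStrainer(id="intro")
intro_soup = BeautifulSoup(html, "html.parser", parse_only=only_intro)
print(intro_soup.get_text())
```

Because the paragraphs never enter the first parse tree, searching that tree for p tags finds nothing; the strainer is applied while parsing rather than as a search afterwards.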

Complete Code:




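from bs4 import BeautifulSoup, SoupStrainer
import requests

# Placeholder: this article does not name the target page, so replace
# this with the URL you want to scrape.
URL = "https://example.com"

# Browser-like headers sent with the request
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
    'Accept-Language': 'en-US, en;q=0.5',
}

webpage = requests.get(URL, headers=HEADERS)

# Parse only <span> elements whose class is "mw-headline"
soup = BeautifulSoup(webpage.content, "lxml",
                     parse_only=SoupStrainer(
                         'span', class_='mw-headline'))

print(soup.prettify())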


