BeautifulSoup – Parsing only a section of a document

Last Updated : 16 Mar, 2021

BeautifulSoup is a Python module for finding specific content or tags in a web page that has already been downloaded by another module, such as requests or scrapy. Remember that BeautifulSoup does not fetch a website itself; it parses the content fetched by other modules and presents it in a readable form. Let's see how to parse only part of a page through an example.

Modules needed

First, we need to install all these modules on our computer.

BeautifulSoup: Our primary module, which parses the HTML of a downloaded webpage into a tree we can search.

pip install bs4

lxml: Helper library that BeautifulSoup can use as its parser for HTML and XML documents.

pip install lxml

requests: Makes sending HTTP requests straightforward. We use it to download the webpage.

pip install requests

Let’s start by downloading a sample webpage and see how to parse only a section of it.

Step 1: We import the BeautifulSoup module and requests. We declare a header dictionary and add a user agent to it. This makes it less likely that the target website treats traffic from our program as spam and blocks it.

Python3

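from bs4 import BeautifulSoup, SoupStrainer
import requests

URL = "https://en.wikipedia.org/wiki/Nike,_Inc."

# Adjacent string literals are joined by Python, so the User-Agent
# stays one string without a backslash continuation inside the quotes.
HEADERS = {
    'User-Agent': ('Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                   '(KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36'),
    'Accept-Language': 'en-US, en;q=0.5',
}

# Download the page, sending the browser-like headers with the request
webpage = requests.get(URL, headers=HEADERS)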
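Before parsing, it can help to confirm that the download actually succeeded. The following is a minimal sketch, not part of the original steps: the helper name fetch_html, its get parameter, and the 10-second timeout are assumptions made for illustration. It uses the timeout argument of requests.get and the raise_for_status() method of the response.

```python
import requests

HEADERS = {
    "User-Agent": ("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                   "(KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36"),
    "Accept-Language": "en-US, en;q=0.5",
}


def fetch_html(url, get=requests.get, timeout=10):
    """Hypothetical helper: return the raw bytes of `url`.

    `get` defaults to requests.get; it is a parameter only so that the
    helper can be exercised without network access.
    """
    response = get(url, headers=HEADERS, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx status codes
    response.raise_for_status()
    return response.content
```

Calling fetch_html(URL) would then give the same bytes as webpage.content in Step 1, but an error status raises an exception instead of passing an error page on to the parser.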

Step 2: Now we use SoupStrainer, which we imported in the first line, to keep only the parts of the page we need. In this case, we just want to parse span elements whose class attribute is “mw-headline”. Note that the keyword is written class_ with a trailing underscore because class is a reserved word in Python; when we filter elements by their id instead, no underscore is needed. The last line prints the parsed content in an indented, readable form.

The SoupStrainer class tells BeautifulSoup which parts to extract, and the parse tree consists of these elements only. We just have to pass the SoupStrainer object to the BeautifulSoup constructor as the parse_only argument.

Python3

# Build a parse tree that contains only <span class="mw-headline"> elements
soup = BeautifulSoup(webpage.content, "lxml",
                     parse_only=SoupStrainer(
                         'span', class_='mw-headline'))

print(soup.prettify())
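To see the difference between filtering by class and by id without downloading anything, here is a small sketch. The SAMPLE_HTML string and the variable names are made up for illustration, and Python's built-in "html.parser" is used instead of lxml only to keep the example dependency-free.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical markup standing in for a downloaded page
SAMPLE_HTML = """
<html><body>
  <h2><span class="mw-headline" id="History">History</span></h2>
  <p>Some paragraph text that should be skipped.</p>
  <h2><span class="mw-headline" id="Products">Products</span></h2>
  <span class="other" id="Footer">Not a headline</span>
</body></html>
"""

# Filter by class: the keyword is class_ because `class` is reserved
by_class = BeautifulSoup(
    SAMPLE_HTML, "html.parser",
    parse_only=SoupStrainer("span", class_="mw-headline"))

# Filter by id: `id` is not a reserved word, so no underscore is needed
by_id = BeautifulSoup(
    SAMPLE_HTML, "html.parser",
    parse_only=SoupStrainer(id="Products"))

# find_all(True) returns every tag that made it into the parse tree
ids_kept_by_class = [tag["id"] for tag in by_class.find_all(True)]
ids_kept_by_id = [tag["id"] for tag in by_id.find_all(True)]
paragraphs_kept = by_class.find_all("p")
```

Here ids_kept_by_class holds the ids of the two headline spans, ids_kept_by_id holds only "Products", and paragraphs_kept is empty because the paragraph never entered the tree.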

Complete Code:

Python3

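from bs4 import BeautifulSoup, SoupStrainer
import requests


URL = "https://en.wikipedia.org/wiki/Nike,_Inc."

# Adjacent string literals are joined by Python, so the User-Agent
# stays one string without a backslash continuation inside the quotes.
HEADERS = {
    'User-Agent': ('Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                   '(KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36'),
    'Accept-Language': 'en-US, en;q=0.5',
}

# Download the page, sending the browser-like headers with the request
webpage = requests.get(URL, headers=HEADERS)

# Build a parse tree that contains only <span class="mw-headline"> elements
soup = BeautifulSoup(webpage.content, "lxml",
                     parse_only=SoupStrainer(
                         'span', class_='mw-headline'))

print(soup.prettify())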
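If we want the headline text itself rather than the prettified markup, the strained soup can be turned into a plain list of strings. This is a sketch under assumptions: the function name headline_texts and the sample markup are hypothetical, and "html.parser" is the default only so the snippet runs without lxml installed.

```python
from bs4 import BeautifulSoup, SoupStrainer


def headline_texts(html, parser="html.parser"):
    """Hypothetical helper: return the text of each mw-headline span."""
    only_headlines = SoupStrainer("span", class_="mw-headline")
    soup = BeautifulSoup(html, parser, parse_only=only_headlines)
    # get_text(strip=True) drops surrounding whitespace in each span
    return [span.get_text(strip=True)
            for span in soup.find_all("span", class_="mw-headline")]


# Made-up markup used in place of webpage.content
sample = """
<h2><span class="mw-headline"> Origins </span></h2>
<p>Body text.</p>
<h2><span class="mw-headline">Logo</span></h2>
"""
titles = headline_texts(sample)
```

With the page from the complete code, headline_texts(webpage.content, "lxml") would return the text of whatever spans with that class the page contains.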