BeautifulSoup – Parsing only a section of a document
BeautifulSoup is a Python module used to find specific contents or tags in a web page that has already been downloaded by another module such as requests or scrapy. Remember that BeautifulSoup doesn’t fetch a website itself; it parses the content fetched by other modules and presents it in a readable form. To understand how we can parse only part of a page, we will work through an example.
First, we need to install all these modules on our computer.
- BeautifulSoup: Our primary module, which parses HTML or XML into a tree that we can search and navigate.
pip install bs4
- lxml: A parser library for HTML and XML that BeautifulSoup can use as its backend.
pip install lxml
- requests: Makes sending HTTP requests simple and gives us the page source to parse.
pip install requests
Let’s start by scraping a sample website and see how to scrape a section of the page only.
Step 1: We import the BeautifulSoup module and requests, then declare a headers dictionary containing a user agent. This makes it less likely that the target website treats traffic from our program as spam and blocks it.
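A minimal sketch of Step 1 could look like the following. The URL, the exact User-Agent string, and the helper name fetch_html are assumptions made for illustration; the original does not name the page being scraped.

```python
import requests
from bs4 import BeautifulSoup, SoupStrainer  # used in the parsing step

# Placeholder URL (an assumption; substitute the page you want to scrape).
URL = "https://en.wikipedia.org/wiki/Web_scraping"

# A browser-like User-Agent header; the exact string is illustrative only.
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
    )
}


def fetch_html(url=URL, headers=HEADERS, timeout=10.0):
    """Download a page and return its HTML source as a string."""
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return response.text
```

Calling fetch_html() returns the page source as text, which can then be handed to BeautifulSoup.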
Step 2: Now we use SoupStrainer, which we imported in the first line, to keep only the parts of the page we need. In this case, we just want to parse elements that have a class attribute of “mw-headline”. Note that the keyword is written class_ with a trailing underscore because class is a reserved word in Python; no underscore is needed when we filter elements by their id. The last line prints the parsed content in a prettified form.
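Step 2 might be sketched as below. To keep the example self-contained, it parses a small inline HTML string (an invented sample, not the real page) and uses Python's built-in "html.parser"; "lxml" can be substituted if it is installed.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Invented sample markup standing in for the downloaded page.
SAMPLE_HTML = """
<html><body>
<h2><span class="mw-headline" id="History">History</span></h2>
<p class="intro">Some introductory text.</p>
<h2><span class="mw-headline" id="Usage">Usage</span></h2>
</body></html>
"""

# Filter by class: note the trailing underscore in class_.
only_headlines = SoupStrainer(class_="mw-headline")
soup = BeautifulSoup(SAMPLE_HTML, "html.parser", parse_only=only_headlines)

# Filter by id: no underscore is needed here.
only_history = SoupStrainer(id="History")
history_soup = BeautifulSoup(SAMPLE_HTML, "html.parser", parse_only=only_history)

headline_texts = [tag.get_text() for tag in soup.find_all("span")]

# Print the filtered content in a prettified form.
print(soup.prettify())
```

The paragraph with class "intro" never enters the tree, so searching soup for it finds nothing.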
The SoupStrainer class tells the parser which parts of the document to extract, and the resulting parse tree consists of these elements only. We just have to pass the SoupStrainer object to the BeautifulSoup constructor as the parse_only argument.
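To illustrate that the strained tree really contains only the selected elements, the sketch below parses the same markup twice, once in full and once with a SoupStrainer that keeps only a tags. The HTML string and variable names are invented for the example, and "html.parser" is used so that nothing beyond bs4 is required.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Invented sample markup for illustration.
HTML = """
<html><body>
<div id="nav"><a href="/home">Home</a> <a href="/about">About</a></div>
<p>Body text with a <a href="/ref">reference</a>.</p>
</body></html>
"""

# Full parse: every tag ends up in the tree.
full_tree = BeautifulSoup(HTML, "html.parser")

# Strained parse: only <a> tags (and their contents) are kept.
only_links = SoupStrainer("a")
link_tree = BeautifulSoup(HTML, "html.parser", parse_only=only_links)

full_tag_names = sorted({tag.name for tag in full_tree.find_all(True)})
link_tag_names = sorted({tag.name for tag in link_tree.find_all(True)})
hrefs = [a["href"] for a in link_tree.find_all("a")]

print(full_tag_names)
print(link_tag_names)
print(hrefs)
```

Because the unwanted elements are dropped during parsing rather than afterwards, this approach can save memory and time on large documents. Note that parse_only is not supported by the html5lib parser.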