BeautifulSoup – Parsing only section of a document

Last Updated : 16 Mar, 2021

BeautifulSoup is a Python module used to extract specific contents/tags from a website's HTML, which must first be fetched by another module such as requests or Scrapy. Remember that BeautifulSoup doesn't scrape a website itself; it processes the content fetched by other modules and presents it in a readable form. To understand how we can parse only part of a page, let us work through an example.

Modules needed

First, we need to install all these modules on our computer.

  • BeautifulSoup: Our primary module; it parses the fetched HTML and lets us search it for specific tags.
pip install bs4
  • lxml: The parser library BeautifulSoup uses under the hood to process HTML and XML.
pip install lxml
  • requests: Makes sending HTTP requests straightforward; we use it to fetch the webpage we want to parse.
pip install requests

Let’s start by scraping a sample website and see how to parse only a section of the page.

Step 1: We import BeautifulSoup (along with SoupStrainer) and requests. We declare a HEADERS dictionary with a user agent so that the target website does not treat traffic from our program as spam and block it.

Python3




from bs4 import BeautifulSoup, SoupStrainer
import requests


# Illustrative example URL: any page that contains <span class="mw-headline">
# elements (e.g. a Wikipedia article) will work for this demonstration
URL = 'https://en.wikipedia.org/wiki/Python_(programming_language)'

# Spoof a browser user agent so the request is not rejected as automated traffic
HEADERS = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                         '(KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
           'Accept-Language': 'en-US, en;q=0.5'}

webpage = requests.get(URL, headers=HEADERS)
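
Optionally, we can confirm that the page was fetched successfully before parsing it; a minimal check using requests looks like this:

Python3


# A status code of 200 means the page was fetched successfully
print(webpage.status_code)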


Step 2: Now we use SoupStrainer, which we imported in the first line, to filter out only the parts of the website we need. In this case, we just want to parse elements that have a class attribute of “mw-headline”. Note that the trailing underscore in class_ is needed because class is a reserved word in Python; if we filtered elements by their id instead, no underscore would be needed (a sketch of this follows the code below). The last line prints the parsed content in a pretty form.

The SoupStrainer class tells BeautifulSoup which parts of the document to extract, and the parse tree is built from those elements only. We just have to pass the SoupStrainer object to the BeautifulSoup constructor as the parse_only argument.

Python3




soup = BeautifulSoup(webpage.content, "lxml",
                     parse_only=SoupStrainer(
                         'span', class_='mw-headline'))
  
print(soup.prettify())
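
For comparison, here is a minimal sketch of filtering by id instead of by class; the id value "History" used here is only an illustrative assumption, not a value taken from this walkthrough:

Python3


# No trailing underscore is needed for id, since "id" is not a reserved word in Python
# "History" is just an example value; substitute an id present on the page you fetched
soup_by_id = BeautifulSoup(webpage.content, "lxml",
                           parse_only=SoupStrainer(id="History"))

print(soup_by_id.prettify())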


Complete Code:

Python3




from bs4 import BeautifulSoup, SoupStrainer
import requests


# Illustrative example URL: any page that contains <span class="mw-headline">
# elements (e.g. a Wikipedia article) will work for this demonstration
URL = 'https://en.wikipedia.org/wiki/Python_(programming_language)'

# Spoof a browser user agent so the request is not rejected as automated traffic
HEADERS = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                         '(KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
           'Accept-Language': 'en-US, en;q=0.5'}

webpage = requests.get(URL, headers=HEADERS)

# Parse only the <span class="mw-headline"> elements of the page
soup = BeautifulSoup(webpage.content, "lxml",
                     parse_only=SoupStrainer(
                         'span', class_='mw-headline'))

print(soup.prettify())


Output:

[Output image: the prettified parse tree containing only the span elements with class "mw-headline"]
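
Since the strained tree contains only the matching spans, the section headings can be pulled out of it directly; a short sketch of that follow-up step:

Python3


# Every element in the strained tree is a <span class="mw-headline"> tag,
# so we can list the section heading text directly
for heading in soup.find_all('span', class_='mw-headline'):
    print(heading.get_text())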


