Prerequisites: Web scraping using Beautiful Soup, XML parsing
Scraping is an essential skill for every programmer to learn: it lets us extract data from a website or a file and reuse it in whatever way we need. In this article, we will learn how to extract a table from a website and parse XML from a file.
Here, we will scrape data using the Beautiful Soup Python module.
Modules Required:
- bs4: Beautiful Soup is a Python library for pulling data out of HTML and XML files. It can be installed using the below command:
pip install bs4
- lxml: It is a Python library that allows us to handle XML and HTML files. It can be installed using the below command:
pip install lxml
- requests: Requests allows you to send HTTP/1.1 requests extremely easily. It can be installed using the below command:
pip install requests
Step-by-step Approach to parse Tables:
Step 1: Firstly, we need to import modules and then assign the URL.
Python3
# import required modules
import bs4 as bs
import requests

# assign URL (placeholder; replace with the page you want to scrape)
URL = "https://example.com"
Step 2: Create a BeautifulSoap object for parsing.
Python3
# parsing
url_link = requests.get(URL)
file = bs.BeautifulSoup(url_link.text, "lxml")
Step 3: Then find the table and all of its rows.
Python3
# find the table and its rows
find_table = file.find('table')
rows = find_table.find_all('tr')
Step 4: Now loop over the rows, find all the td (table data) tags in each, and print their text.
Python3
# display table data
for i in rows:
    table_data = i.find_all('td')
    data = [j.text for j in table_data]
    print(data)
Below is the complete program based on the above approach:
Python3
# import required modules
import bs4 as bs
import requests

# assign URL (placeholder; replace with the page you want to scrape)
URL = "https://example.com"

# parsing
url_link = requests.get(URL)
file = bs.BeautifulSoup(url_link.text, "lxml")

# find the table and its rows
find_table = file.find('table')
rows = find_table.find_all('tr')

# display table data
for i in rows:
    table_data = i.find_all('td')
    data = [j.text for j in table_data]
    print(data)
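Since the article does not pin down a target URL, the same steps can be sketched against a small inline HTML string instead of a live page. This is only a self-contained illustration: the HTML snippet and variable names are made up, and it uses the built-in "html.parser" backend so neither a network request nor lxml is needed.

```python
from bs4 import BeautifulSoup

# a small inline HTML snippet standing in for a downloaded page
html = """
<table>
  <tr><th>Name</th><th>Score</th></tr>
  <tr><td>Alice</td><td>90</td></tr>
  <tr><td>Bob</td><td>85</td></tr>
</table>
"""

# same steps as above: find the table, then its rows, then the td cells
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")
for row in table.find_all("tr"):
    cells = [cell.text for cell in row.find_all("td")]
    print(cells)
```

The header row contains only th tags, so its td list prints as an empty list, followed by one list per data row.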
Output:
Step-by-step Approach to parse XML files:
Step 1: Before moving on, you can create your own XML file, or simply copy the code below and save it as test1.xml on your system.
<?xml version="1.0" ?>
<books>
  <book>
    <title>Introduction of Geeksforgeeks V1</title>
    <author>Gfg</author>
    <price>6.99</price>
  </book>
  <book>
    <title>Introduction of Geeksforgeeks V2</title>
    <author>Gfg</author>
    <price>8.99</price>
  </book>
  <book>
    <title>Introduction of Geeksforgeeks V3</title>
    <author>Gfg</author>
    <price>9.35</price>
  </book>
</books>
Step 2: Create a Python file and import the required module.
Python3
# import required modules
from bs4 import BeautifulSoup
Step 3: Read the content of the XML.
Python3
# reading content
with open("test1.xml", "r") as file:
    contents = file.read()
Step 4: Parse the content of the XML.
Python3
# parsing
soup = BeautifulSoup(contents, 'xml')
titles = soup.find_all('title')
Step 5: Display the content of the XML file.
Python3
# display the content of each title tag
for data in titles:
    print(data.get_text())
Below is the complete program based on the above approach:
Python3
# import required modules
from bs4 import BeautifulSoup

# reading content
with open("test1.xml", "r") as file:
    contents = file.read()

# parsing
soup = BeautifulSoup(contents, 'xml')
titles = soup.find_all('title')

# display content
for data in titles:
    print(data.get_text())
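As a cross-check, the same kind of document can also be parsed with the standard library's xml.etree.ElementTree, with no third-party install required. This is an alternative sketch, not the article's method; it inlines a shortened copy of the sample XML rather than reading test1.xml from disk, so it runs on its own.

```python
import xml.etree.ElementTree as ET

# a shortened inline copy of the sample document, for a self-contained run
xml_data = """<?xml version="1.0" ?>
<books>
  <book><title>Introduction of Geeksforgeeks V1</title><author>Gfg</author><price>6.99</price></book>
  <book><title>Introduction of Geeksforgeeks V2</title><author>Gfg</author><price>8.99</price></book>
</books>"""

# parse from a string and collect the text of every title element
root = ET.fromstring(xml_data)
titles = [book.findtext("title") for book in root.findall("book")]
for t in titles:
    print(t)
```

ElementTree only handles XML (not the HTML table case above), but for well-formed XML files it avoids the bs4 and lxml dependencies entirely.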
Output: