Parsing tables and XML with BeautifulSoup
Web scraping is an essential skill: it lets a programmer extract data from a website or a file and reuse it in other ways. In this article, we will learn how to extract HTML tables and parse XML files using the Beautiful Soup Python module.
Prerequisites
Modules Required
- bs4: Beautiful Soup is a Python library for pulling data out of HTML and XML files.
- lxml: It is a Python library that allows us to handle XML and HTML files.
- requests: It allows you to send HTTP/1.1 requests extremely easily.
pip install bs4
pip install lxml
pip install requests
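Once the packages are installed, a quick way to confirm that bs4 and lxml are working together is to parse a trivial snippet with the lxml parser. This is a minimal sanity-check sketch, not part of the scraping task itself:

```python
from bs4 import BeautifulSoup

# Parse a tiny inline snippet with the lxml parser; if either
# bs4 or lxml is missing or broken, this line raises an error.
soup = BeautifulSoup("<p>hello</p>", "lxml")
print(soup.p.text)  # hello
```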
Extract Tables With BeautifulSoup in Python
Below are the steps in which we will see how to extract tables with beautiful soup in Python:
Step 1: Import the Library and Define Target URL
Firstly, we need to import modules and then assign the URL.
Python3
import bs4 as bs
import requests

# Assign the URL of the page that contains the table, for example:
URL = "https://www.example.com/page-with-table"  # placeholder; replace with your target page
Step 2: Create Object for Parsing
In this step, we are creating a BeautifulSoup Object for parsing and further executions of extracting the tables.
Python3
url_link = requests.get(URL)
file = bs.BeautifulSoup(url_link.text, "lxml")
Step 3: Locating and Extracting Table Data
In this step, we are finding the table and its rows.
Python3
find_table = file.find('table', class_='numpy-table')
rows = find_table.find_all('tr')
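Note that find() returns None when no element matches, so it is worth checking before calling find_all() on the result. Below is a minimal, self-contained sketch using a hypothetical inline HTML snippet with the same numpy-table class as the article's target page:

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the downloaded page.
html = """
<table class="numpy-table">
  <tr><td>Function</td><td>Description</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

table = soup.find("table", class_="numpy-table")
if table is None:
    raise ValueError("table with class 'numpy-table' not found")

rows = table.find_all("tr")
print(len(rows))  # 1
```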
Step 4: Extracting Text from Table Cell
Now loop over the rows, find all the td tags in each row, and print the text of every table data cell.
Python3
for i in rows:
    table_data = i.find_all('td')
    data = [j.text for j in table_data]
    print(data)
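The same loop can be tried offline against an inline table. The two rows here are invented illustration data, not content from the real page:

```python
from bs4 import BeautifulSoup

# Hypothetical two-row table for demonstration.
html = ("<table>"
        "<tr><td>append()</td><td>Adds an element</td></tr>"
        "<tr><td>pop()</td><td>Removes an element</td></tr>"
        "</table>")
soup = BeautifulSoup(html, "html.parser")

for row in soup.find_all("tr"):
    cells = [td.text for td in row.find_all("td")]
    print(cells)
# ['append()', 'Adds an element']
# ['pop()', 'Removes an element']
```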
Complete Code
Below is the complete implementation of the above steps. In this code, we scrape a specific table (the numpy-table class) from a GeeksforGeeks page about Python lists. After locating the table rows, we iterate through each row to extract and print the cell data.
Python3
import bs4 as bs
import requests

# Assign the URL of the page that contains the table, for example:
URL = "https://www.example.com/page-with-table"  # placeholder; replace with your target page

url_link = requests.get(URL)
file = bs.BeautifulSoup(url_link.text, "lxml")

find_table = file.find('table', class_='numpy-table')
rows = find_table.find_all('tr')

for i in rows:
    table_data = i.find_all('td')
    data = [j.text for j in table_data]
    print(data)
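One caveat worth knowing: many tables put their header row in <th> cells rather than <td>, so the td-only loop above silently skips headers. A hedged sketch (again with invented inline markup) that captures both:

```python
from bs4 import BeautifulSoup

# Hypothetical table whose first row uses <th> header cells.
html = """
<table class="numpy-table">
  <tr><th>Method</th><th>Purpose</th></tr>
  <tr><td>append()</td><td>Add an item</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

for row in soup.find("table", class_="numpy-table").find_all("tr"):
    # Search for both header and data cells in document order.
    cells = [c.get_text(strip=True) for c in row.find_all(["th", "td"])]
    print(cells)
# ['Method', 'Purpose']
# ['append()', 'Add an item']
```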
Output: each table row is printed as a list of cell-text strings.
Parsing and Extracting XML files With BeautifulSoup
Below are the steps by which we can parse the XML files using BeautifulSoup in Python:
Step 1: Creating XML File
test1.xml: Before moving on, you can create your own XML file, or simply copy the code below and save it as test1.xml on your system.
<?xml version="1.0" ?>
<books>
<book>
<title>Introduction of Geeksforgeeks V1</title>
<author>Gfg</author>
<price>6.99</price>
</book>
<book>
<title>Introduction of Geeksforgeeks V2</title>
<author>Gfg</author>
<price>8.99</price>
</book>
<book>
<title>Introduction of Geeksforgeeks V2</title>
<author>Gfg</author>
<price>9.35</price>
</book>
</books>
Step 2: Creating a Python File
In this step, we will create a Python file and start writing our code. Now we will import modules.
Python3
from bs4 import BeautifulSoup
Step 3: Reading the XML Content
In this step, we will read the content of the XML.
Python3
# A with-block closes the file automatically after reading.
with open("test1.xml", "r") as file:
    contents = file.read()
Step 4: Parse the Content of the XML
In this step, we will parse the content of the XML.
Python3
soup = BeautifulSoup(contents, 'xml')
titles = soup.find_all('title')
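To see what find_all('title') returns, here is a self-contained sketch with a short inline XML string (invented for illustration). Note that the 'xml' feature requires lxml to be installed:

```python
from bs4 import BeautifulSoup

# Hypothetical XML document with two <title> elements.
xml = "<books><book><title>A</title></book><book><title>B</title></book></books>"
soup = BeautifulSoup(xml, "xml")

titles = soup.find_all("title")
print([t.text for t in titles])  # ['A', 'B']
```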
Step 5: Display the Content
In this step, we display the content of the XML file by looping over the extracted title tags and printing their text.
Python3
for data in titles:
    print(data.get_text())
Complete Code
Below is the complete implementation of the above steps. In this code, we read an XML file named "test1.xml" and parse its content using BeautifulSoup with the XML parser. We then extract all <title> tags from the XML and print their text content.
Python3
from bs4 import BeautifulSoup

# Read the XML file; the with-block closes it automatically.
with open("test1.xml", "r") as file:
    contents = file.read()

soup = BeautifulSoup(contents, 'xml')
titles = soup.find_all('title')

for data in titles:
    print(data.get_text())
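The same approach extends to pulling several fields per record. The sketch below inlines the test1.xml content (trimmed to two books) so it runs without the file on disk, and navigates from each <book> tag to its child elements:

```python
from bs4 import BeautifulSoup

# Inline copy of the test1.xml structure, trimmed for a runnable demo.
xml = """<?xml version="1.0" ?>
<books>
  <book><title>Introduction of Geeksforgeeks V1</title><author>Gfg</author><price>6.99</price></book>
  <book><title>Introduction of Geeksforgeeks V2</title><author>Gfg</author><price>8.99</price></book>
</books>"""

soup = BeautifulSoup(xml, "xml")  # the 'xml' feature requires lxml
for book in soup.find_all("book"):
    # tag.title / tag.price navigate to the first matching child element.
    print(book.title.text, "-", book.price.text)
# Introduction of Geeksforgeeks V1 - 6.99
# Introduction of Geeksforgeeks V2 - 8.99
```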
Output:
Introduction of Geeksforgeeks V1
Introduction of Geeksforgeeks V2
Introduction of Geeksforgeeks V2
Last Updated: 12 Jan, 2024