Parsing tables and XML with BeautifulSoup
Web scraping is an essential skill: it lets a programmer extract data from a website or a file and reuse it in other ways. In this article, we will learn how to extract HTML tables and parse XML files using the Beautiful Soup Python module.
Prerequisites
Modules Required
- bs4: Beautiful Soup is a Python library for pulling data out of HTML and XML files.
- lxml: It is a Python library that allows us to handle XML and HTML files.
- requests: It allows you to send HTTP/1.1 requests extremely easily.
pip install bs4
pip install lxml
pip install requests
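Once the packages are installed, a quick way to confirm that bs4 and lxml are working together is to parse a trivial snippet with the lxml parser. This is a minimal sanity-check sketch, not part of the scraping task itself:

```python
from bs4 import BeautifulSoup

# Parse a tiny inline snippet with the lxml parser; if either
# bs4 or lxml is missing or broken, this line raises an error.
soup = BeautifulSoup("<p>hello</p>", "lxml")
print(soup.p.text)  # hello
```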
Extract Tables With BeautifulSoup in Python
Below are the steps in which we will see how to extract tables with beautiful soup in Python:
Step 1: Import the Library and Define Target URL
Firstly, we need to import modules and then assign the URL.
Python3
import bs4 as bs
import requests

# Assign the URL of the page that contains the table, for example:
URL = "https://www.example.com/page-with-table"  # placeholder; replace with your target page
Step 2: Create Object for Parsing
In this step, we are creating a BeautifulSoup Object for parsing and further executions of extracting the tables.
Python3
url_link = requests.get(URL)
file = bs.BeautifulSoup(url_link.text, "lxml")
Step 3: Locating and Extracting Table Data
In this step, we are finding the table and its rows.
Python3
find_table = file.find('table', class_='numpy-table')
rows = find_table.find_all('tr')
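Note that find() returns None when no element matches, so it is worth checking before calling find_all() on the result. Below is a minimal, self-contained sketch using a hypothetical inline HTML snippet with the same numpy-table class as the article's target page:

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the downloaded page.
html = """
<table class="numpy-table">
  <tr><td>Function</td><td>Description</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

table = soup.find("table", class_="numpy-table")
if table is None:
    raise ValueError("table with class 'numpy-table' not found")

rows = table.find_all("tr")
print(len(rows))  # 1
```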
Step 4: Extracting Text from Table Cell
Now loop over the rows, find all the td tags in each row, and print the text of every table data cell.
Python3
for i in rows:
    table_data = i.find_all('td')
    data = [j.text for j in table_data]
    print(data)
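The same loop can be tried offline against an inline table. The two rows here are invented illustration data, not content from the real page:

```python
from bs4 import BeautifulSoup

# Hypothetical two-row table for demonstration.
html = ("<table>"
        "<tr><td>append()</td><td>Adds an element</td></tr>"
        "<tr><td>pop()</td><td>Removes an element</td></tr>"
        "</table>")
soup = BeautifulSoup(html, "html.parser")

for row in soup.find_all("tr"):
    cells = [td.text for td in row.find_all("td")]
    print(cells)
# ['append()', 'Adds an element']
# ['pop()', 'Removes an element']
```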
Complete Code
Below is the complete implementation of the above steps. In this code, we scrape a specific table (the numpy-table class) from a GeeksforGeeks page about Python lists. After locating the table rows, we iterate through each row to extract and print the cell data.
Python3
import bs4 as bs
import requests

# Assign the URL of the page that contains the table, for example:
URL = "https://www.example.com/page-with-table"  # placeholder; replace with your target page

url_link = requests.get(URL)
file = bs.BeautifulSoup(url_link.text, "lxml")

find_table = file.find('table', class_='numpy-table')
rows = find_table.find_all('tr')

for i in rows:
    table_data = i.find_all('td')
    data = [j.text for j in table_data]
    print(data)
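One caveat worth knowing: many tables put their header row in <th> cells rather than <td>, so the td-only loop above silently skips headers. A hedged sketch (again with invented inline markup) that captures both:

```python
from bs4 import BeautifulSoup

# Hypothetical table whose first row uses <th> header cells.
html = """
<table class="numpy-table">
  <tr><th>Method</th><th>Purpose</th></tr>
  <tr><td>append()</td><td>Add an item</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

for row in soup.find("table", class_="numpy-table").find_all("tr"):
    # Search for both header and data cells in document order.
    cells = [c.get_text(strip=True) for c in row.find_all(["th", "td"])]
    print(cells)
# ['Method', 'Purpose']
# ['append()', 'Add an item']
```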
Output: each table row is printed as a list of cell-text strings.
Parsing and Extracting XML files With BeautifulSoup
Below are the steps by which we can parse the XML files using BeautifulSoup in Python:
Step 1: Creating XML File
test1.xml: Before moving on, you can create your own XML file, or simply copy the code below and save it as test1.xml on your system.
<?xml version="1.0" ?>
<books>
<book>
<title>Introduction of Geeksforgeeks V1</title>
<author>Gfg</author>
<price>6.99</price>
</book>
<book>
<title>Introduction of Geeksforgeeks V2</title>
<author>Gfg</author>
<price>8.99</price>
</book>
<book>
<title>Introduction of Geeksforgeeks V2</title>
<author>Gfg</author>
<price>9.35</price>
</book>
</books>
Step 2: Creating a Python File
In this step, we will create a Python file and start writing our code. Now we will import modules.
Python3
from bs4 import BeautifulSoup
Step 3: Reading the XML Content
In this step, we will read the content of the XML.
Python3
# A with-block closes the file automatically after reading.
with open("test1.xml", "r") as file:
    contents = file.read()
Step 4: Parse the Content of the XML
In this step, we will parse the content of the XML.
Python3
soup = BeautifulSoup(contents, 'xml')
titles = soup.find_all('title')
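To see what find_all('title') returns, here is a self-contained sketch with a short inline XML string (invented for illustration). Note that the 'xml' feature requires lxml to be installed:

```python
from bs4 import BeautifulSoup

# Hypothetical XML document with two <title> elements.
xml = "<books><book><title>A</title></book><book><title>B</title></book></books>"
soup = BeautifulSoup(xml, "xml")

titles = soup.find_all("title")
print([t.text for t in titles])  # ['A', 'B']
```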
Step 5: Display the Content
In this step, we display the content of the XML file by looping over the extracted title tags and printing their text.
Python3
for data in titles:
    print(data.get_text())
Complete Code
Below is the complete implementation of the above steps. In this code, we read an XML file named "test1.xml" and parse its content using BeautifulSoup with the XML parser. We then extract all <title> tags from the XML and print their text content.
Python3
from bs4 import BeautifulSoup

# Read the XML file; the with-block closes it automatically.
with open("test1.xml", "r") as file:
    contents = file.read()

soup = BeautifulSoup(contents, 'xml')
titles = soup.find_all('title')

for data in titles:
    print(data.get_text())
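The same approach extends to pulling several fields per record. The sketch below inlines the test1.xml content (trimmed to two books) so it runs without the file on disk, and navigates from each <book> tag to its child elements:

```python
from bs4 import BeautifulSoup

# Inline copy of the test1.xml structure, trimmed for a runnable demo.
xml = """<?xml version="1.0" ?>
<books>
  <book><title>Introduction of Geeksforgeeks V1</title><author>Gfg</author><price>6.99</price></book>
  <book><title>Introduction of Geeksforgeeks V2</title><author>Gfg</author><price>8.99</price></book>
</books>"""

soup = BeautifulSoup(xml, "xml")  # the 'xml' feature requires lxml
for book in soup.find_all("book"):
    # tag.title / tag.price navigate to the first matching child element.
    print(book.title.text, "-", book.price.text)
# Introduction of Geeksforgeeks V1 - 6.99
# Introduction of Geeksforgeeks V2 - 8.99
```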
Output:
Introduction of Geeksforgeeks V1
Introduction of Geeksforgeeks V2
Introduction of Geeksforgeeks V2
Last Updated: 12 Jan, 2024