Parsing tables and XML with BeautifulSoup
  • Last Updated : 26 Nov, 2020

Prerequisites: Web scraping using Beautiful Soup, XML parsing

Scraping is a useful skill: it lets us extract data from a website or a file so that the data can be reused in another form. In this article, we will learn how to extract a table from a web page and how to parse XML from a file.
Here, we will scrape the data using the Beautiful Soup Python module.

Modules Required:

  • bs4: Beautiful Soup is a Python library for pulling data out of HTML and XML files. It is published on PyPI as beautifulsoup4 and imported as bs4. It can be installed using the command below:
pip install beautifulsoup4
  • lxml: A Python library for processing XML and HTML files; Beautiful Soup uses it here as the parser. It can be installed using the command below:
pip install lxml
  • requests: Requests lets you send HTTP/1.1 requests with very little code. It can be installed using the command below:
pip install requests

Step-by-step Approach to parse Tables:

Step 1: Import the required modules and assign the URL of the page to scrape.

Python3



# import required modules
import bs4 as bs
import requests
  
# assign URL (placeholder: the source does not give the address)
URL = "<address of the page to scrape>"



Step 2: Fetch the page and create a BeautifulSoup object for parsing.

Python3

# parsing
url_link = requests.get(URL)
file = bs.BeautifulSoup(url_link.text, "lxml")

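A request can fail or return an error page, so it may help to check the response before parsing it. The following is a minimal sketch rather than part of the original program: the helper names `make_soup` and `fetch_soup` and the 10-second timeout are assumptions chosen for illustration.

```python
import bs4 as bs
import requests


def make_soup(html_text):
    """Parse an HTML string with the lxml parser, as in Step 2."""
    return bs.BeautifulSoup(html_text, "lxml")


def fetch_soup(url, timeout=10):
    """Download a page and parse it (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of parsing an error page
    response.raise_for_status()
    return make_soup(response.text)
```

With these helpers, `fetch_soup(URL)` could stand in for the two parsing lines shown in Step 2.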


Step 3: Find the first table on the page and all of its rows.

Python3

# find the first table and all of its rows
find_table = file.find('table')
rows = find_table.find_all('tr')

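Note that `find('table')` returns only the first table in the document, or `None` when there is none, whereas `find_all('table')` returns every table. The sketch below shows the difference on an inline HTML string; the sample markup and variable names are made up for illustration, and Python's built-in `html.parser` is used so the snippet runs without a network request or lxml.

```python
import bs4 as bs

# Hypothetical sample page with two tables
html = """
<table id="first"><tr><td>a</td></tr></table>
<table id="second"><tr><td>b</td></tr><tr><td>c</td></tr></table>
"""
soup = bs.BeautifulSoup(html, "html.parser")

first_table = soup.find("table")      # first match only, or None
all_tables = soup.find_all("table")   # every match, in document order

if first_table is None:
    print("no table found")
else:
    print(first_table["id"], len(first_table.find_all("tr")))

# Number of rows in each table
row_counts = [len(table.find_all("tr")) for table in all_tables]
print(row_counts)
```

Checking for `None` before calling `find_all('tr')` avoids an `AttributeError` on pages that contain no table.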


Step 4: Now loop over the rows, collect the td tags in each one, and print their text.

Python3

# print the td cell texts of each row
for i in rows:
    table_data = i.find_all('td')
    data = [j.text for j in table_data]
    print(data)

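Header rows usually hold `th` cells rather than `td` cells, so a loop that looks only for `td` yields an empty list for them. One possible variation, sketched below on a made-up table, collects both kinds of cell and strips surrounding whitespace; the function name `table_to_rows` is hypothetical, and the built-in `html.parser` is used so the snippet runs without lxml.

```python
import bs4 as bs


def table_to_rows(table):
    """Return the text of every th/td cell, one list per table row."""
    rows = []
    for tr in table.find_all("tr"):
        cells = tr.find_all(["th", "td"])
        rows.append([cell.get_text(strip=True) for cell in cells])
    return rows


# Made-up sample table for illustration
html = """
<table>
  <tr><th>Name</th><th>Price</th></tr>
  <tr><td> Pen </td><td>1.50</td></tr>
  <tr><td>Book</td><td>6.99</td></tr>
</table>
"""
soup = bs.BeautifulSoup(html, "html.parser")
result = table_to_rows(soup.find("table"))
for row in result:
    print(row)
```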


Below is the complete program based on the above approach:

Python3



# import required modules
import bs4 as bs
import requests

# assign URL (placeholder: the source does not give the address)
URL = "<address of the page to scrape>"

# fetch the page and parse it
url_link = requests.get(URL)
file = bs.BeautifulSoup(url_link.text, "lxml")

# find the first table and all of its rows
find_table = file.find('table')
rows = find_table.find_all('tr')

# print the td cell texts of each row
for i in rows:
    table_data = i.find_all('td')
    data = [j.text for j in table_data]
    print(data)



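Output: the program prints one list per table row, holding the text of that row's td cells. The actual values depend on the page behind URL.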

Step-by-step Approach to parse XML files:

Step 1: Before moving on, create your own XML file, or copy the code below and save it as test1.xml on your system.

<?xml version="1.0" ?>
<books>
  <book>
    <title>Introduction of Geeksforgeeks V1</title>
    <author>Gfg</author>
    <price>6.99</price>
  </book>
  <book>
    <title>Introduction of Geeksforgeeks V2</title>
    <author>Gfg</author>
    <price>8.99</price>
  </book>
  <book>
    <title>Introduction of Geeksforgeeks V2</title>
    <author>Gfg</author>
    <price>9.35</price>
  </book>
</books>

Step 2: Create a Python file and import the required module.

Python3

# import required modules
from bs4 import BeautifulSoup



Step 3: Read the content of the XML.

Python3

# reading content (the with block closes the file afterwards)
with open("test1.xml", "r") as file:
    contents = file.read()

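Opening a file inside a `with` block closes it automatically, even if reading fails. The sketch below is a self-contained illustration: it first writes a small made-up XML document to a temporary directory, so it does not depend on test1.xml being present, and then reads it back. The file name `sample.xml` and the explicit UTF-8 encoding are assumptions.

```python
import os
import tempfile

# Made-up XML content standing in for test1.xml
sample = '<?xml version="1.0" ?>\n<books><book><title>T1</title></book></books>\n'

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.xml")

    # Write the sample so there is something to read
    with open(path, "w", encoding="utf-8") as file:
        file.write(sample)

    # Read the whole file into a string; the file is closed on leaving the block
    with open(path, "r", encoding="utf-8") as file:
        contents = file.read()

print(file.closed)
print(contents == sample)
```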


Step 4: Parse the content of the XML.

Python3

# parsing
soup = BeautifulSoup(contents, 'xml')
titles = soup.find_all('title')

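Passing `'xml'` as the second argument asks Beautiful Soup to use lxml's XML parser, and `find_all('title')` then returns the matching tag objects. The following sketch parses a shortened copy of the sample data from an inline string, so no file is needed; the variable names mirror the steps above, and the shortened data is an assumption made for brevity.

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample data, inlined so the snippet needs no file
contents = """<?xml version="1.0" ?>
<books>
  <book><title>Introduction of Geeksforgeeks V1</title><price>6.99</price></book>
  <book><title>Introduction of Geeksforgeeks V2</title><price>8.99</price></book>
</books>
"""

soup = BeautifulSoup(contents, "xml")
titles = soup.find_all("title")

print(len(titles))           # number of <title> elements found
print(titles[0].name)        # tag name of the first match
print(titles[0].get_text())  # text inside the first match
```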


Step 5: Display the content of the XML file.

Python3

# display content
for data in titles:
    print(data.get_text())

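Beyond the titles, each `book` element can be turned into a small record by looking up its child tags. This is one possible approach, sketched on an inline copy of the first two sample books; the `records` list and its dictionary keys are illustrative choices, not part of the original program.

```python
from bs4 import BeautifulSoup

# Inline copy of the first two books from the sample file
contents = """<?xml version="1.0" ?>
<books>
  <book>
    <title>Introduction of Geeksforgeeks V1</title>
    <author>Gfg</author>
    <price>6.99</price>
  </book>
  <book>
    <title>Introduction of Geeksforgeeks V2</title>
    <author>Gfg</author>
    <price>8.99</price>
  </book>
</books>
"""

soup = BeautifulSoup(contents, "xml")

# Build one dictionary per <book>, converting the price to a number
records = []
for book in soup.find_all("book"):
    records.append({
        "title": book.find("title").get_text(),
        "author": book.find("author").get_text(),
        "price": float(book.find("price").get_text()),
    })

for record in records:
    print(record)
```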


Below is the complete program based on the above approach:

Python3

# import required modules
from bs4 import BeautifulSoup
  
# reading content (the with block closes the file afterwards)
with open("test1.xml", "r") as file:
    contents = file.read()
  
# parsing
soup = BeautifulSoup(contents, 'xml')
titles = soup.find_all('title')
  
# display content
for data in titles:
    print(data.get_text())



Output:

Introduction of Geeksforgeeks V1
Introduction of Geeksforgeeks V2
Introduction of Geeksforgeeks V2

