Scrape Tables From Any Website Using Python

Last Updated : 06 Aug, 2021

Web scraping is a useful way to collect data that a website does not expose in a more convenient form. Extracting a table by hand with a general-purpose parser such as Beautiful Soup can be tedious, because you have to locate the table and walk through its rows and cells yourself. This article uses the html-table-parser-python3 library instead, which collects the tables it finds in a page's HTML. You do not need to inspect the page's elements first; you only provide the URL, and the tables are parsed in a few lines of code. Note that this applies to tables present in the HTML the server returns; tables that a page builds afterwards with JavaScript will not be in the response.

Installation

You can use pip to install this library:

pip install html-table-parser-python3

Getting Started

Step 1: Import the required libraries

# Library for opening url and creating 
# requests
import urllib.request

# pretty-print python data structures
from pprint import pprint

# for parsing all the tables present 
# on the website
from html_table_parser.parser import HTMLTableParser

# for converting the parsed data in a
# pandas dataframe
import pandas as pd

Step 2: Define a function to get the contents of the website

def url_get_contents(url):
    """Open a URL and return its raw bytes (the HTTP response body)."""

    # Build the request for the given URL
    req = urllib.request.Request(url=url)

    # Send the request; the with-block closes the connection afterwards
    with urllib.request.urlopen(req) as f:
        # Read the full response body
        return f.read()
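Some servers reject requests that carry urllib's default User-Agent, and a request without a timeout can hang indefinitely. The following is a minimal sketch of a more defensive fetch, not part of the original tutorial: the function name `fetch_html`, the User-Agent string, and the 10-second timeout are all illustrative assumptions.

```python
import urllib.request


def fetch_html(url, timeout=10.0):
    """Hypothetical variant of url_get_contents that returns decoded text."""
    request = urllib.request.Request(
        url,
        # Assumed header value; some servers reject urllib's default User-Agent
        headers={"User-Agent": "Mozilla/5.0 (compatible; table-scraper-example)"},
    )
    # The with-block closes the connection even if read() raises
    with urllib.request.urlopen(request, timeout=timeout) as response:
        raw = response.read()
        # Prefer the charset declared in Content-Type, fall back to UTF-8
        charset = response.headers.get_content_charset() or "utf-8"
    # Replace undecodable bytes instead of raising UnicodeDecodeError
    return raw.decode(charset, errors="replace")
```

Because this sketch already returns a string, its result could be passed straight to the parser's `feed` method without a separate `.decode('utf-8')` call.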

Now that the function is ready, we need to specify the URL of the website whose tables we want to parse.

Note: This example uses the moneycontrol.com website, since its pages contain many tables and make the approach easier to follow. The exact page URL appears in the complete code.

Step 3: Parse the tables

# Fetch the page and decode the bytes to text.
# Replace 'Link' with the URL of the page to scrape.
xhtml = url_get_contents('Link').decode('utf-8')

# Create the HTMLTableParser object
p = HTMLTableParser()

# Feed the HTML text to the parser
p.feed(xhtml)

# p.tables holds every table found on the page;
# index 1 selects the second one
pprint(p.tables[1])
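To make the shape of the parsed result concrete without touching the network, here is a simplified sketch built on the standard library's `html.parser.HTMLParser`. It is not the html-table-parser-python3 implementation; the class name `MiniTableParser` and the sample HTML are made up for illustration, and the sketch ignores nested tables and cells with omitted closing tags.

```python
from html.parser import HTMLParser


class MiniTableParser(HTMLParser):
    """Hypothetical, simplified table collector (not the library's code)."""

    def __init__(self):
        super().__init__()
        self.tables = []   # list of tables; each table is a list of rows
        self._row = None   # row being built, or None outside <tr>
        self._cell = None  # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None
            self._cell = None


# Made-up HTML containing two tables
sample_html = """
<table><tr><th>Name</th><th>Price</th></tr>
<tr><td>Alpha</td><td>10</td></tr></table>
<table><tr><td>x</td></tr></table>
"""

parser = MiniTableParser()
parser.feed(sample_html)
print(parser.tables)
# → [[['Name', 'Price'], ['Alpha', '10']], [['x']]]
```

In this sketch, `parser.tables[1]` is the second table, `[['x']]`. The index to use on a real page depends on that page's layout, so it is worth printing the whole list once before settling on a number.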

Each table is stored as a list of rows, and each row is a list of cell values. This nested list can be converted to a pandas DataFrame and used for further analysis.
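When the first row of a parsed table holds the column names, it can be promoted to the DataFrame header. The sketch below uses made-up rows in the same list-of-lists shape; the column names and values are assumptions for illustration, not data from the example site.

```python
import pandas as pd

# Made-up rows in the same shape as one entry of p.tables
rows = [
    ["Metric", "Value"],
    ["Open", "2,450.00"],
    ["Volume", "1,234,567"],
]

# Use the first row as column labels and the rest as data
df = pd.DataFrame(rows[1:], columns=rows[0])

# Parsed cells are strings; strip thousands separators before converting
df["Value"] = pd.to_numeric(df["Value"].str.replace(",", "", regex=False))

print(df)
```

Whether the first row really is a header depends on the table, so it helps to check the parsed rows before relying on this.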

Complete Code:


# Library for opening url and creating 
# requests
import urllib.request
 
# pretty-print python data structures
from pprint import pprint
 
# for parsing all the tables present 
# on the website
from html_table_parser.parser import HTMLTableParser
 
# for converting the parsed data in a
# pandas dataframe
import pandas as pd
 
 
def url_get_contents(url):
    """Open a URL and return its raw bytes (the HTTP response body)."""

    # Build the request for the given URL
    req = urllib.request.Request(url=url)

    # Send the request; the with-block closes the connection afterwards
    with urllib.request.urlopen(req) as f:
        # Read the full response body
        return f.read()
 
# Fetch the page and decode the bytes to text
xhtml = url_get_contents(
    'https://www.moneycontrol.com/india'
    '/stockpricequote/refineries/relianceindustries/RI'
).decode('utf-8')
 
# Defining the HTMLTableParser object
p = HTMLTableParser()
 
# feeding the html contents in the
# HTMLTableParser object
p.feed(xhtml)
 
# p.tables holds every table found on the page;
# index 1 selects the second one
pprint(p.tables[1])
 
# converting the parsed data to
# dataframe
print("\n\nPANDAS DATAFRAME\n")
print(pd.DataFrame(p.tables[1]))
