Scrape Tables From any website using Python
  • Last Updated : 29 Dec, 2020

Web scraping is a useful skill for collecting data from websites. Scraping and parsing a table can be tedious if we use a general-purpose parser such as Beautiful Soup. This article therefore describes the html-table-parser-python3 library, which extracts the tables present in the HTML that a website returns. With this method you don't have to inspect the elements of the page; you only have to provide its URL, and the parser collects every table it finds.

Installation

You can use pip to install this library:

pip install html-table-parser-python3

Getting Started

Step 1: Import the required libraries

# Library for opening url and creating 
# requests
import urllib.request

# pretty-print python data structures
from pprint import pprint

# for parsing all the tables present 
# on the website
from html_table_parser import HTMLTableParser

# for converting the parsed data in a
# pandas dataframe
import pandas as pd

Step 2: Define a function to get the contents of the website

# Opens a website and reads its
# binary contents (HTTP response body)
def url_get_contents(url):

    # making a request to the website
    req = urllib.request.Request(url=url)
    f = urllib.request.urlopen(req)

    # reading the contents of the website
    return f.read()

Now that the function is ready, we have to specify the URL of the website whose tables we want to parse.
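
The function above sends a bare request and leaves decoding to the caller. Some servers reject requests without a browser-like User-Agent header, and not every page is encoded as UTF-8. The following is a minimal sketch of a more defensive variant, not part of the original tutorial. The names `build_request`, `decode_body`, and `url_get_text`, as well as the User-Agent string and the timeout value, are assumptions chosen for illustration.

```python
import urllib.request
from typing import Optional

# Hypothetical values for illustration; adjust them for your use case.
USER_AGENT = "Mozilla/5.0 (compatible; table-scraper-example)"
TIMEOUT_SECONDS = 10


def build_request(url: str) -> urllib.request.Request:
    """Create a request that identifies itself with a User-Agent header."""
    return urllib.request.Request(url=url, headers={"User-Agent": USER_AGENT})


def decode_body(body: bytes, charset: Optional[str] = None) -> str:
    """Decode the response body, falling back to UTF-8 when no charset is given."""
    return body.decode(charset or "utf-8", errors="replace")


def url_get_text(url: str) -> str:
    """Fetch a page and return its contents as text."""
    request = build_request(url)
    with urllib.request.urlopen(request, timeout=TIMEOUT_SECONDS) as response:
        # The charset comes from the Content-Type header, if the server sent one.
        charset = response.headers.get_content_charset()
        return decode_body(response.read(), charset)
```

Calling `url_get_text(url)` returns a string that can be passed directly to the parser's `feed` method, so the separate `.decode('utf-8')` call is no longer needed.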



Note: This example uses the moneycontrol.com website because it contains many tables, which makes the result easier to follow.

Step 3: Parse the tables

# Getting the HTML contents of a URL
# (replace 'Link' with the URL of the page)
xhtml = url_get_contents('Link').decode('utf-8')

# Creating the HTMLTableParser object
p = HTMLTableParser()

# Feeding the HTML contents to the
# HTMLTableParser object
p.feed(xhtml)

# Finally, printing the required table
# (index 1 is the second table on the page)
pprint(p.tables[1])
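
The index `1` above is specific to the example page. On another page, the table you want may sit at a different position. A minimal sketch for finding it is shown below, assuming `p.tables` is a list of tables in which each table is a list of rows. The helper name `summarize_tables` and the sample data are made up for illustration; in practice you would pass `p.tables` instead.

```python
def summarize_tables(tables):
    """Return (index, row count, first row) for every non-empty table."""
    summary = []
    for index, rows in enumerate(tables):
        if rows:
            summary.append((index, len(rows), rows[0]))
    return summary


# Made-up sample data standing in for p.tables.
sample_tables = [
    [["Home", "Markets", "News"]],
    [["Company", "Price"], ["Alpha Ltd", "101.5"], ["Beta Ltd", "98.2"]],
    [],
]

for index, row_count, first_row in summarize_tables(sample_tables):
    print(f"table {index}: {row_count} rows, first row = {first_row}")
```

Scanning the first row of each table is usually enough to recognise the one you need before printing it in full.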

Each row of the table is stored as a list, and each table is a list of such rows. This structure can easily be converted into a pandas DataFrame for further analysis.
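
When the first row of a table holds the column titles, it can be promoted to the DataFrame header. The following is a small sketch of that idea; the `rows` data is made up and stands in for one entry of `p.tables`, and the column name `Price` is an assumption for illustration.

```python
import pandas as pd

# Made-up rows standing in for one entry of p.tables.
rows = [
    ["Company", "Price"],
    ["Alpha Ltd", "101.5"],
    ["Beta Ltd", "98.2"],
]

# Use the first row as column names and the remaining rows as data.
df = pd.DataFrame(rows[1:], columns=rows[0])

# Parsed cells are strings, so numeric columns need an explicit conversion.
df["Price"] = pd.to_numeric(df["Price"], errors="coerce")

print(df)
```

With `errors="coerce"`, cells that cannot be read as numbers become `NaN` instead of raising an error, which is convenient for scraped data that may contain placeholders.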

Complete Code:

Python3




# Library for opening url and creating 
# requests
import urllib.request
  
# pretty-print python data structures
from pprint import pprint
  
# for parsing all the tables present 
# on the website
from html_table_parser import HTMLTableParser
  
# for converting the parsed data in a
# pandas dataframe
import pandas as pd
  
  
# Opens a website and reads its
# binary contents (HTTP response body)
def url_get_contents(url):
  
    # making a request to the website
    req = urllib.request.Request(url=url)
    f = urllib.request.urlopen(req)
  
    # reading the contents of the website
    return f.read()
  
# Getting the HTML contents of a URL.
xhtml = url_get_contents(
    'https://www.moneycontrol.com/india'
    '/stockpricequote/refineries/relianceindustries/RI'
).decode('utf-8')
  
# Defining the HTMLTableParser object
p = HTMLTableParser()
  
# feeding the html contents in the
# HTMLTableParser object
p.feed(xhtml)
  
# Now finally obtaining the data of
# the table required
pprint(p.tables[1])
  
# converting the parsed data to a
# pandas DataFrame
print("\n\nPANDAS DATAFRAME\n")
print(pd.DataFrame(p.tables[1]))

Output:
