Scrape Tables From Any Website Using Python
Scraping is a useful skill for getting data out of websites. Scraping and parsing a table can be tedious if we use the standard Beautiful Soup parser. This article therefore describes a library, html-table-parser-python3, that extracts the HTML tables from a page. With this method you don't have to inspect the page's elements; you only have to provide the URL of the website, and the tables are parsed within seconds.
You can use pip to install this library:
pip install html-table-parser-python3
Step 1: Import the libraries required for the task
# Library for opening URLs and creating requests
import urllib.request

# Pretty-print Python data structures
from pprint import pprint

# For parsing all the tables present on the website
from html_table_parser.parser import HTMLTableParser

# For converting the parsed data into a pandas DataFrame
import pandas as pd
Step 2: Defining a function to get the contents of the website
def url_get_contents(url):
    """Open a website and read its binary contents (HTTP response body)."""
    # Make a request to the website
    req = urllib.request.Request(url=url)
    # Read the contents and close the connection when done
    with urllib.request.urlopen(req) as f:
        return f.read()
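Some servers reject the default urllib client, hang without responding, or serve pages that are not UTF-8. The following is a minimal sketch of a more defensive variant, not part of the original tutorial: the name fetch_html, the User-Agent value, and the 10-second timeout are all illustrative assumptions.

```python
import urllib.request


def fetch_html(url, timeout=10):
    """Fetch a URL and return its body decoded to text.

    fetch_html, the User-Agent string, and the timeout default are
    hypothetical choices made for this sketch.
    """
    # Some servers refuse urllib's default User-Agent, so send a browser-like one.
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    # The context manager closes the connection even if reading fails.
    with urllib.request.urlopen(req, timeout=timeout) as response:
        # Prefer the charset declared in the Content-Type header; fall back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Because this variant already returns text, its result can be fed to the parser directly, without a separate .decode('utf-8') call.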
Now that the function is ready, we specify the URL of the website whose tables we want to parse.
Note: We use the moneycontrol.com website as the example here, since it has many tables and gives a better understanding of the output.
Step 3: Parsing tables
# Get the HTML contents of a URL (replace 'Link' with the URL of the page)
xhtml = url_get_contents('Link').decode('utf-8')

# Define the HTMLTableParser object
p = HTMLTableParser()

# Feed the HTML contents into the HTMLTableParser object
p.feed(xhtml)

# Finally, print the data of the parsed tables
pprint(p.tables)
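To see the shape of the parsed result without fetching a live page, the sketch below feeds a small inline HTML snippet to a simplified stand-in parser built on the standard library's html.parser module. SimpleTableParser and sample_html are hypothetical names for illustration; this is not the library's actual code, and it ignores nested tables and other edge cases.

```python
from html.parser import HTMLParser


class SimpleTableParser(HTMLParser):
    """Illustrative stand-in: collects each <table> as a list of rows,
    where each row is a list of cell strings. Nested tables are not handled."""

    def __init__(self):
        super().__init__()
        self.tables = []    # finished tables
        self._table = None  # rows of the table currently being read
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._table = []
        elif tag == "tr" and self._table is not None:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only text inside a cell is kept; whitespace between tags is ignored.
        if self._cell is not None:
            self._cell.append(data.strip())

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append(" ".join(part for part in self._cell if part))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self._table.append(self._row)
            self._row = None
        elif tag == "table" and self._table is not None:
            self.tables.append(self._table)
            self._table = None


# Made-up sample data, not taken from any real website.
sample_html = """
<table>
  <tr><th>Company</th><th>Price</th></tr>
  <tr><td>ABC Ltd</td><td>101.5</td></tr>
  <tr><td>XYZ Corp</td><td>87.2</td></tr>
</table>
"""

parser = SimpleTableParser()
parser.feed(sample_html)
print(parser.tables)
# [[['Company', 'Price'], ['ABC Ltd', '101.5'], ['XYZ Corp', '87.2']]]
```

The outer list holds one entry per table found on the page, so a page with several tables yields several lists of rows.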
p.tables is a list of tables: each table is a list of rows, and each row is a list of cell values. Any of these tables can be converted into a pandas DataFrame easily and then used for further analysis.
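As a rough sketch of that conversion, the example below builds a DataFrame from one table. The table variable is made-up sample data standing in for a single entry of p.tables, and it assumes the first row holds the column headers, which will not be true for every table on every site.

```python
import pandas as pd

# Hypothetical sample standing in for one entry of p.tables:
# a list of rows, with the column headers assumed to be in the first row.
table = [
    ["Company", "Price"],
    ["ABC Ltd", "101.5"],
    ["XYZ Corp", "87.2"],
]

# Use the first row as column names and the remaining rows as data.
df = pd.DataFrame(table[1:], columns=table[0])

# Parsed cells are strings, so numeric columns need an explicit conversion.
df["Price"] = pd.to_numeric(df["Price"])

print(df)
```

If a table has no header row, passing the rows to pd.DataFrame without the columns argument gives default integer column labels instead.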