Scraping Wikipedia table with Pandas using read_html()
In this article, we discuss read_html(), a Pandas function that reads HTML tables directly from a webpage into DataFrames, so you do not need to know how to scrape a website’s HTML yourself. It is useful for quickly combining tables from several websites. The extracted data usually needs further cleaning, so let’s see how we can work with it.
What is pd.read_html?
Pandas read_html() is one of the easiest ways to scrape web data. The data can further be cleaned as per the requirements of the user.
Syntax of pandas.read_html()
Syntax: pandas.read_html(io) where io can be an HTML string, a file, or a URL.
Example 1: Using an HTML string
In this example, we store a multiline string, written with triple quotes ('''), in a variable called html_string. Then we call read_html() on that HTML. The function extracts all the HTML tables it finds and returns them as a list of DataFrames.
Python3
from io import StringIO

import pandas as pd

html_string = '''
<table>
  <tr> <th>Company</th> <th>Contact</th> <th>Country</th> </tr>
  <tr> <td>Alfreds Futterkiste</td> <td>Maria Anders</td> <td>Germany</td> </tr>
  <tr> <td>Centro comercial Moctezuma</td> <td>Francisco Chang</td> <td>Mexico</td> </tr>
</table>
'''

# Wrap the literal HTML in StringIO; recent pandas versions
# deprecate passing a raw HTML string directly.
df_1 = pd.read_html(StringIO(html_string))
df_1
Output:

Further, if you want to look at the datatypes, you can do so by calling the info() function as follows:
df_1[0].info()

Example 2: Reading HTML Data From URL
In this example, we read HTML from a web page: the Wikipedia page Demographics_of_India (https://en.wikipedia.org/wiki/Demographics_of_India). From this page, we want to scrape the contents of the following table and extract the highlighted columns:

There are 37 tables on the webpage (the number may change as the page is edited). read_html() returns them as a list of DataFrames, so the len() function gives the number of tables found. To pick out a particular table, we can use the parameter “match”, shown in the next example.
Python3
import pandas as pd

# read_html() returns a list with one DataFrame per table on the page
dfs = pd.read_html('https://en.wikipedia.org/wiki/Demographics_of_India')
len(dfs)
Output:
37
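With that many tables, it can help to list each table's position, shape, and column names before choosing one. The following is a minimal sketch of that idea; it uses a small made-up HTML snippet (an assumption, not the Wikipedia page) so that it runs without network access.

```python
from io import StringIO

import pandas as pd

# Hypothetical two-table document standing in for a real web page.
html = """
<table>
  <tr><th>Name</th><th>Value</th></tr>
  <tr><td>a</td><td>1</td></tr>
  <tr><td>b</td><td>2</td></tr>
</table>
<table>
  <tr><th>Key</th></tr>
  <tr><td>x</td></tr>
</table>
"""

tables = pd.read_html(StringIO(html))

# Print the list index, (rows, columns) shape, and header of every table.
for i, table in enumerate(tables):
    print(i, table.shape, list(table.columns))

shapes = [table.shape for table in tables]
```

The same loop works on the list returned for a URL; the index printed for the table you want can then be used directly, for example dfs[i].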
Example 3: Find the specific table from a webpage
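The table we want is captioned “Population distribution by states/union territories (2011)”. Let us pass the text “Population distribution by states/union territories” to the parameter match, so that read_html() returns only the tables containing text that matches it. The value is treated as a regular expression, which is why the “(2011)” part is left out: parentheses have a special meaning in regular expressions.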
Python3
my_table = pd.read_html(
    'https://en.wikipedia.org/wiki/Demographics_of_India',
    match='Population distribution by states/union territories')
my_table[0].head()
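To see the effect of match in isolation, here is a small sketch on a made-up HTML snippet with two captioned tables. The captions, names, and numbers are illustrative assumptions, not data from the Wikipedia page.

```python
from io import StringIO

import pandas as pd

# Hypothetical page with two tables; all values are made up for illustration.
html = """
<table>
  <caption>Population by state</caption>
  <tr><th>State</th><th>Population</th></tr>
  <tr><td>Alpha</td><td>100</td></tr>
</table>
<table>
  <caption>Literacy by state</caption>
  <tr><th>State</th><th>Literacy</th></tr>
  <tr><td>Alpha</td><td>90</td></tr>
</table>
"""

# Without match, every table is returned.
all_tables = pd.read_html(StringIO(html))

# With match, only tables containing text that matches the pattern are kept.
matched = pd.read_html(StringIO(html), match="Literacy")

print(len(all_tables), len(matched))
print(list(matched[0].columns))
```

Even with match, the result is still a list, so the table is accessed with an index such as matched[0].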

Example 4: Fetch column data
Next, we get the column ‘State/UT’ and the column ‘Population’ from the table. First, the column ‘State/UT’:
Python3
states = my_table[0]['State/UT']
states

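Similarly, we get the column Population. Its header on the page carries a citation marker, so the column name in the DataFrame is ‘Population[57]’.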
Python3
population = my_table[0]['Population[57]']
population
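Citation markers such as [57] can change whenever the page is edited, so hard-coded column names may break. One possible way to handle this, sketched below on a small made-up DataFrame (the values are assumptions for illustration), is to strip bracketed numbers from all column names before selecting columns.

```python
import pandas as pd

# Hypothetical frame mimicking headers scraped with citation markers.
raw = pd.DataFrame({"State/UT": ["Alpha", "Beta"],
                    "Population[57]": [100, 200]})

# Remove bracketed footnote markers such as "[57]" from every column name.
clean = raw.copy()
clean.columns = clean.columns.str.replace(r"\[\d+\]", "", regex=True)

print(list(clean.columns))
```

After this step the column can be selected as clean['Population'], regardless of which footnote number the page currently shows.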

Example 5: Merging two columns
Let us store the two columns in a new DataFrame.
Python3
df1 = pd.DataFrame({'State': states,
                    'Population': population})
df1
Example 6: Dropping row data
Let’s remove the last row, which holds the total, with the help of drop() in Pandas.
Python3
# Drop the last row (the total) in place
df1.drop(df1.tail(1).index, inplace=True)
df1
Output:
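As an alternative to dropping in place, the last row can also be left out by slicing or by filtering on its label, which returns a new DataFrame and leaves the original untouched. A minimal sketch on made-up data follows; the "Total" label is an assumption about how the summary row is named.

```python
import pandas as pd

# Hypothetical data whose last row is a summary row.
df = pd.DataFrame({"State": ["Alpha", "Beta", "Total"],
                   "Population": [100, 200, 300]})

# Option 1: keep everything except the last row, by position.
without_last = df.iloc[:-1]

# Option 2: filter by label, which does not depend on row order.
without_total = df[df["State"] != "Total"]

print(without_last)
print(without_total)
```

Filtering by label is a little more robust if the summary row is not guaranteed to be last.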
Example 7: Data Visualisation of table
Here we are using the Matplotlib module to plot the given HTML data in a graphic format.
Python3
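import matplotlib.pyplot as plt

# Horizontal bar chart of population per state
df1.plot(x='State', y='Population',
         kind="barh", figsize=(10, 8))
plt.show()  # needed when running as a script rather than in a notebook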

Example 8: Writing HTML Tables with Python’s Pandas
Here, we create a DataFrame and convert it into an HTML file. We also pass some HTML attributes to style the table.
Python3
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
# Write the table to write_html.html without the index column
df.to_html('write_html.html', index=False,
           border=3, justify='center')
Output:
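When no file name is given, to_html() returns the markup as a string instead of writing a file. The sketch below uses that to do a round trip: it converts a DataFrame to HTML and reads it back with read_html(). It is only an illustration of how the two functions fit together.

```python
from io import StringIO

import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})

# With no path argument, to_html() returns the HTML as a string.
html = df.to_html(index=False, border=3, justify="center")

# Read the generated table back into a DataFrame.
round_trip = pd.read_html(StringIO(html))[0]

print(round_trip)
```

Because index=False was used, the index is not written to the HTML, and the DataFrame read back gets a fresh default index.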
Example 9: Error while rendering an HTML page
If the HTML page doesn’t contain any tables, read_html() raises a ValueError.
Python3
import pandas as pd

# This page has no <table> elements, so read_html() raises a ValueError
dfs = pd.read_html('https://codebestway.wordpress.com/')
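In a script that processes many pages, it may be preferable to catch this error and continue. The following is a minimal sketch; the helper name read_tables_or_empty is hypothetical, and flavor="lxml" is an assumption made here so that the example relies on a single parser being installed.

```python
from io import StringIO

import pandas as pd


def read_tables_or_empty(html):
    """Return the tables found in html, or an empty list if there are none."""
    try:
        # flavor="lxml" pins one parser (assumes lxml is installed).
        return pd.read_html(StringIO(html), flavor="lxml")
    except ValueError:
        # Raised by read_html() when no tables are found.
        return []


no_tables = "<html><body><p>No tables here.</p></body></html>"
one_table = "<table><tr><th>A</th></tr><tr><td>1</td></tr></table>"

print(len(read_tables_or_empty(no_tables)))
print(len(read_tables_or_empty(one_table)))
```

Returning an empty list keeps the calling code simple, since it can loop over the result without a separate check.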
