Read Html File In Python Using Pandas

Last Updated : 15 Feb, 2024

In Python, Pandas is a powerful library commonly used for data manipulation and analysis. While it’s primarily used for working with structured data such as CSV files, Excel spreadsheets, and databases, it’s also capable of reading HTML files and extracting tabular data from them. In this article, we’ll explore how to read an HTML file in Python using Pandas, along with practical examples and explanations.

Read HTML Files in Python Using Pandas

Below are four approaches to reading HTML files in Python using Pandas.

Using read_html() Function
Using BeautifulSoup with read_html()
Using requests with read_html()
Using lxml parser with read_html()

Read HTML Files Using read_html() Function

This approach directly uses the read_html() function provided by pandas. This function is specifically designed to parse HTML tables and return a list of DataFrames corresponding to the tables found in the HTML content. It’s a convenient method when dealing with simple HTML files containing tabular data.

Python3

import pandas as pd
 
def read_html_with_read_html(file_path):
    # Read HTML file into DataFrame using read_html()
    df = pd.read_html(file_path)[0]
    return df
 
# File path
html_file_path = 'data/geeks_for_geeks.html'
 
# Read HTML file using read_html() function
df = read_html_with_read_html(html_file_path)
 
# Display DataFrame
print("Approach 1 Output:")
print(df)

HTML

<!DOCTYPE html>
<html>
<head>
    <title>Table Example</title>
</head>
<body>
 
<table border="1">
    <tr>
        <th>Name</th>
        <th>Topic</th>
        <th>Difficulty</th>
    </tr>
    <tr>
        <td>Introduction to Python</td>
        <td>Python</td>
        <td>Beginner</td>
    </tr>
    <tr>
        <td>Data Structures</td>
        <td>Algorithms</td>
        <td>Intermediate</td>
    </tr>
    <tr>
        <td>Machine Learning Basics</td>
        <td>Machine Learning</td>
        <td>Advanced</td>
    </tr>
</table>
</body>
</html>

Output:

Approach 1 Output:

                      Name             Topic    Difficulty
0   Introduction to Python            Python      Beginner
1          Data Structures        Algorithms  Intermediate
2  Machine Learning Basics  Machine Learning      Advanced
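Because read_html() returns a list of DataFrames, a page with several tables yields several entries, and indexing with [0] only picks the first one. The sketch below illustrates this with a small two-table document that is invented for the example (it is not one of the article's sample files), and uses the match parameter to keep only the table whose text contains a given string.

```python
from io import StringIO

import pandas as pd

# Hypothetical two-table document, made up for this illustration.
html = """
<table>
  <tr><th>Name</th><th>Difficulty</th></tr>
  <tr><td>Introduction to Python</td><td>Beginner</td></tr>
</table>
<table>
  <tr><th>Author</th><th>Articles</th></tr>
  <tr><td>Jane Smith</td><td>12</td></tr>
</table>
"""

# read_html() returns one DataFrame per <table> element it finds.
tables = pd.read_html(StringIO(html))
print(len(tables))

# match= keeps only the tables whose text matches the string or regex.
author_tables = pd.read_html(StringIO(html), match="Author")
print(author_tables[0])
```

Selecting with match is usually more robust than a fixed index when the position of a table on the page may change.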

Read HTML Files Using BeautifulSoup with read_html()

In this approach, we first use the BeautifulSoup library to parse the HTML file and extract tables from it. BeautifulSoup provides more flexibility in navigating and extracting specific elements from HTML documents. We then pass the extracted tables to the read_html() function to convert them into DataFrames.

Python3

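import pandas as pd
from bs4 import BeautifulSoup
from io import StringIO

def read_html_with_beautiful_soup(file_path):
    # Read HTML file
    with open(file_path, 'r') as f:
        # Parse HTML using BeautifulSoup
        soup = BeautifulSoup(f, 'html.parser')
    # Find all tables in the HTML
    tables = soup.find_all('table')
    # Join the markup of the extracted tables into one HTML string
    html = ''.join(str(table) for table in tables)
    # Wrap the string in StringIO: passing literal HTML to read_html()
    # is deprecated since pandas 2.1
    df = pd.read_html(StringIO(html))[0]
    return df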
 
# File path
html_file_path = 'data/geeks_for_geeks.html'
 
# Read HTML file using BeautifulSoup with read_html()
df = read_html_with_beautiful_soup(html_file_path)
 
# Display DataFrame
print("Approach 2 Output:")
print(df)

HTML

<!DOCTYPE html>
<html>
<head>
    <title>Programming Languages</title>
</head>
<body>
 
<table border="1">
    <tr>
        <th>Code</th>
        <th>Language</th>
        <th>Difficulty</th>
    </tr>
    <tr>
        <td>HTML</td>
        <td>HTML/CSS</td>
        <td>Beginner</td>
    </tr>
    <tr>
        <td>Python</td>
        <td>Python</td>
        <td>Intermediate</td>
    </tr>
    <tr>
        <td>JavaScript</td>
        <td>JavaScript</td>
        <td>Advanced</td>
    </tr>
</table>
 
</body>
</html>

Output:

Approach 2 Output:

         Code    Language    Difficulty
0        HTML    HTML/CSS      Beginner
1      Python      Python  Intermediate
2  JavaScript  JavaScript      Advanced
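The extra flexibility of BeautifulSoup matters most when a page contains tables you do not want, such as layout or navigation tables. The following is a minimal sketch, assuming a hypothetical document in which the table of interest carries id="languages"; the markup and the id are invented for illustration.

```python
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical page: a navigation table followed by the data table.
html = """
<table id="nav"><tr><td>Home</td><td>About</td></tr></table>
<table id="languages">
  <tr><th>Code</th><th>Language</th></tr>
  <tr><td>HTML</td><td>HTML/CSS</td></tr>
  <tr><td>Python</td><td>Python</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Pick exactly one table by its id attribute.
target = soup.find("table", id="languages")

# Convert only the selected table's markup into a DataFrame.
df = pd.read_html(StringIO(str(target)))[0]
print(df)
```

For simple attribute filters like this one, read_html() also accepts an attrs dictionary (for example attrs={"id": "languages"}), so BeautifulSoup is mainly worth the extra step when the selection logic is more involved.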

Read HTML Files Using requests with read_html()

This approach involves fetching the HTML content from a URL using the requests library and then passing the content directly to the read_html() function for parsing. It’s useful when the HTML content is available online and can be accessed via URL. This approach enables automation in data retrieval and is suitable for reading data from remote sources. However, it requires an internet connection to fetch HTML content, dependency on external servers for data retrieval, and potential security risks when fetching data from untrusted sources.

Python3

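import pandas as pd
import requests
from io import StringIO

def read_html_with_requests(file_url):
    # Fetch HTML content using requests; the timeout avoids hanging forever
    response = requests.get(file_url, timeout=10)
    # Stop on HTTP errors (4xx/5xx) instead of parsing an error page
    response.raise_for_status()
    # Wrap the decoded text in StringIO and read it into a DataFrame
    df = pd.read_html(StringIO(response.text))[0]
    return df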
 
# File URL
html_file_url = 'https://media.geeksforgeeks.org/wp-content/uploads/20240213175028/geeks_for_geeks.html'
 
# Read HTML file using requests with read_html()
df = read_html_with_requests(html_file_url)
 
# Display DataFrame
print("Approach 3 Output:")
print(df)

HTML

<!DOCTYPE html>
<html>
<head>
    <title>Topics in Different Categories</title>
</head>
<body>
 
<table border="1">
    <tr>
        <th>Category</th>
        <th>Topic</th>
        <th>Difficulty</th>
    </tr>
    <tr>
        <td>Data Structures</td>
        <td>Algorithms</td>
        <td>Beginner</td>
    </tr>
    <tr>
        <td>Web Development</td>
        <td>HTML/CSS</td>
        <td>Intermediate</td>
    </tr>
    <tr>
        <td>Machine Learning</td>
        <td>Python</td>
        <td>Advanced</td>
    </tr>
</table>
 
</body>
</html>

Output:

Approach 3 Output:

           Category       Topic    Difficulty
0   Data Structures  Algorithms      Beginner
1   Web Development    HTML/CSS  Intermediate
2  Machine Learning      Python      Advanced
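Since this approach depends on a network connection and on an external server, it helps to decide up front what should happen when the request fails. Here is one possible sketch: the helper names fetch_tables and parse_tables are hypothetical, and returning an empty list on failure is an assumption made for this example rather than a recommended policy.

```python
from io import StringIO

import pandas as pd
import requests


def parse_tables(html_text):
    """Parse already-downloaded HTML text into a list of DataFrames."""
    return pd.read_html(StringIO(html_text))


def fetch_tables(url, timeout=10):
    """Download a page and return its tables, or [] if the request fails."""
    try:
        response = requests.get(url, timeout=timeout)
        # Turn 4xx/5xx responses into exceptions.
        response.raise_for_status()
    except requests.RequestException as exc:
        # Covers connection errors, timeouts, bad URLs and HTTP errors.
        print(f"Could not fetch {url}: {exc}")
        return []
    return parse_tables(response.text)
```

Keeping the download and the parsing in separate functions also makes the parsing step easy to exercise without network access, for example parse_tables("<table><tr><th>A</th></tr><tr><td>1</td></tr></table>").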

Read HTML Files Using lxml parser with read_html()

In this approach, we explicitly select the lxml parser through the flavor parameter of read_html(). (When flavor is not given, read_html() already tries lxml first.) The lxml parser is known for its speed and its ability to handle large HTML files efficiently, so this approach suits cases where performance is a concern. The trade-offs are that the lxml library must be installed separately and that it offers less control over parsing than BeautifulSoup.

Python3

import pandas as pd
 
# Approach 4: Using lxml parser with read_html()
def read_html_with_lxml(file_path):
    # Read HTML file into DataFrame using read_html() with 'lxml' parser
    df = pd.read_html(file_path, flavor='lxml')[0]
    return df
 
# File path
html_file_path = 'data/geeks_for_geeks.html'
 
# Read HTML file using lxml parser with read_html()
df = read_html_with_lxml(html_file_path)
 
# Display DataFrame
print("Approach 4 Output:")
print(df)

HTML

<!DOCTYPE html>
<html>
<head>
    <title>Book Information</title>
</head>
<body>
 
<table border="1">
    <tr>
        <th>Title</th>
        <th>Author</th>
        <th>Difficulty</th>
    </tr>
    <tr>
        <td>Python Basics</td>
        <td>John Doe</td>
        <td>Beginner</td>
    </tr>
    <tr>
        <td>Data Analysis</td>
        <td>Jane Smith</td>
        <td>Intermediate</td>
    </tr>
    <tr>
        <td>Machine Learning Algorithms</td>
        <td>David Johnson</td>
        <td>Advanced</td>
    </tr>
</table>
 
</body>
</html>

Output:

Approach 4 Output:

                         Title         Author    Difficulty
0                Python Basics       John Doe      Beginner
1                Data Analysis     Jane Smith  Intermediate
2  Machine Learning Algorithms  David Johnson      Advanced
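Because flavor='lxml' fails with an ImportError when lxml is not installed, a script that has to run in different environments may want to check for the package first. The sketch below is one way to do that; the helper name available_flavor and the inline sample table are assumptions made for illustration.

```python
from importlib.util import find_spec
from io import StringIO

import pandas as pd


def available_flavor():
    """Return 'lxml' if the lxml package is importable, otherwise 'bs4'."""
    # Note: the 'bs4' flavor needs both beautifulsoup4 and html5lib.
    return "lxml" if find_spec("lxml") is not None else "bs4"


# Small sample table, invented for this example.
html = """
<table>
  <tr><th>Title</th><th>Author</th><th>Difficulty</th></tr>
  <tr><td>Python Basics</td><td>John Doe</td><td>Beginner</td></tr>
  <tr><td>Data Analysis</td><td>Jane Smith</td><td>Intermediate</td></tr>
</table>
"""

flavor = available_flavor()
df = pd.read_html(StringIO(html), flavor=flavor)[0]
print(flavor)
print(df)
```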

Conclusion

In conclusion, Pandas can read HTML tables in several ways, all built on the read_html() function. Calling read_html() directly is the straightforward option for simple HTML tables. Combining it with BeautifulSoup gives more control over complex HTML structures. For web-based content, requests fetches the HTML from a URL before parsing, and selecting the lxml parser favors performance on large files.
