Read Html File In Python Using Pandas

Last Updated : 15 Feb, 2024

In Python, Pandas is a powerful library commonly used for data manipulation and analysis. While it’s primarily used for working with structured data such as CSV files, Excel spreadsheets, and databases, it’s also capable of reading HTML files and extracting tabular data from them. In this article, we’ll explore how to read an HTML file in Python using Pandas, along with practical examples and explanations.

Read HTML Files in Python Using Pandas

Below are the approaches covered in this article for reading HTML files in Python using Pandas.

  • Using read_html() Function
  • Using BeautifulSoup with read_html()
  • Using requests with read_html()
  • Using lxml parser with read_html()

Read HTML Files Using read_html() Function

This approach directly uses the read_html() function provided by pandas. The function parses the HTML tables in a document and returns a list of DataFrames, one per table found, which is why the example below takes element [0] to get the first table. It needs an HTML parser library such as lxml (or BeautifulSoup together with html5lib) to be installed, and it's a convenient method when dealing with simple HTML files containing tabular data.

Python3




import pandas as pd
 
def read_html_with_read_html(file_path):
    # Read HTML file into DataFrame using read_html()
    df = pd.read_html(file_path)[0]
    return df
 
# File path
html_file_path = 'data/geeks_for_geeks.html'
 
# Read HTML file using read_html() function
df = read_html_with_read_html(html_file_path)
 
# Display DataFrame
print("Approach 1 Output:")
print(df)


HTML




<!DOCTYPE html>
<html>
<head>
    <title>Table Example</title>
</head>
<body>
 
<table border="1">
    <tr>
        <th>Name</th>
        <th>Topic</th>
        <th>Difficulty</th>
    </tr>
    <tr>
        <td>Introduction to Python</td>
        <td>Python</td>
        <td>Beginner</td>
    </tr>
    <tr>
        <td>Data Structures</td>
        <td>Algorithms</td>
        <td>Intermediate</td>
    </tr>
    <tr>
        <td>Machine Learning Basics</td>
        <td>Machine Learning</td>
        <td>Advanced</td>
    </tr>
</table>
</body>
</html>


Output:

Approach 1 Output:

                      Name             Topic    Difficulty
0   Introduction to Python            Python      Beginner
1          Data Structures        Algorithms  Intermediate
2  Machine Learning Basics  Machine Learning      Advanced
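
Because read_html() returns a list with one DataFrame per table, a page with several tables needs some way to pick the right one. The following is a minimal sketch of two options: indexing the list, and narrowing it with the `match` parameter. The in-memory document and the `html_doc` name are assumptions made for illustration and are not part of the example file above.

```python
from io import StringIO

import pandas as pd

# Hypothetical document with two tables, used only for illustration.
html_doc = """
<html><body>
<table>
  <tr><th>Name</th><th>Difficulty</th></tr>
  <tr><td>Data Structures</td><td>Intermediate</td></tr>
</table>
<table>
  <tr><th>Title</th><th>Author</th></tr>
  <tr><td>Python Basics</td><td>John Doe</td></tr>
</table>
</body></html>
"""

# read_html() returns a list with one DataFrame per <table> it finds.
all_tables = pd.read_html(StringIO(html_doc))
print(len(all_tables))

# match keeps only the tables whose text matches the given string or regex.
book_tables = pd.read_html(StringIO(html_doc), match="Author")
print(book_tables[0])
```

Wrapping the string in StringIO makes it a file-like object, which is the form recent pandas versions expect for in-memory HTML.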

Read HTML Files Using BeautifulSoup with read_html()

In this approach, we first use the BeautifulSoup library to parse the HTML file and extract the tables from it. BeautifulSoup provides more flexibility in navigating and extracting specific elements from HTML documents. We then pass the markup of the extracted tables to the read_html() function to convert them into DataFrames.

Python3




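import pandas as pd
from bs4 import BeautifulSoup
from io import StringIO
 
def read_html_with_beautiful_soup(file_path):
    # Read HTML file
    with open(file_path, 'r') as f:
        # Parse HTML using BeautifulSoup
        soup = BeautifulSoup(f, 'html.parser')
    # Find all tables in the HTML
    tables = soup.find_all('table')
    # Join the markup of each table (str() on the whole result list would
    # add list brackets and commas around the HTML)
    tables_html = ''.join(str(table) for table in tables)
    # Wrap the markup in StringIO: recent pandas versions deprecate
    # passing literal HTML strings to read_html()
    df = pd.read_html(StringIO(tables_html))[0]
    return df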
 
# File path
html_file_path = 'data/geeks_for_geeks.html'
 
# Read HTML file using BeautifulSoup with read_html()
df = read_html_with_beautiful_soup(html_file_path)
 
# Display DataFrame
print("Approach 2 Output:")
print(df)


HTML




<!DOCTYPE html>
<html>
<head>
    <title>Programming Languages</title>
</head>
<body>
 
<table border="1">
    <tr>
        <th>Code</th>
        <th>Language</th>
        <th>Difficulty</th>
    </tr>
    <tr>
        <td>HTML</td>
        <td>HTML/CSS</td>
        <td>Beginner</td>
    </tr>
    <tr>
        <td>Python</td>
        <td>Python</td>
        <td>Intermediate</td>
    </tr>
    <tr>
        <td>JavaScript</td>
        <td>JavaScript</td>
        <td>Advanced</td>
    </tr>
</table>
 
</body>
</html>


Output:

Approach 2 Output:

         Code    Language    Difficulty
0        HTML    HTML/CSS      Beginner
1      Python      Python  Intermediate
2  JavaScript  JavaScript      Advanced
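
The extra control that BeautifulSoup offers is most useful when a page holds several tables and only one of them contains data. As a rough sketch, the code below selects a single table by its `id` attribute before handing it to read_html(). The document, the `id` values, and the variable names are hypothetical and chosen only for this illustration.

```python
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical page: the id values are assumptions made for this sketch.
html_doc = """
<html><body>
<table id="nav"><tr><td>Home</td><td>About</td></tr></table>
<table id="languages">
  <tr><th>Code</th><th>Difficulty</th></tr>
  <tr><td>HTML</td><td>Beginner</td></tr>
  <tr><td>Python</td><td>Intermediate</td></tr>
</table>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Select exactly one table by its id attribute instead of taking index 0.
table = soup.find("table", id="languages")
if table is None:
    raise ValueError("no table with id='languages' in the document")

# str(table) is that table's markup; StringIO makes it file-like for pandas.
df = pd.read_html(StringIO(str(table)))[0]
print(df)
```

For a simple attribute filter like this one, read_html() also accepts an `attrs` dictionary (for example `attrs={"id": "languages"}`), so BeautifulSoup is mainly worth the extra step when the selection logic is more involved.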

Read HTML Files Using requests with read_html()

This approach fetches the HTML content from a URL using the requests library and then passes the response body to the read_html() function for parsing. It's useful when the HTML content is available online, and it makes it easy to automate data retrieval from remote sources. The trade-offs are that it needs an internet connection, depends on external servers being available, and carries potential security risks when fetching data from untrusted sources.

Python3




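import pandas as pd
import requests
from io import StringIO
 
def read_html_with_requests(file_url):
    # Fetch HTML content using requests (the 10-second timeout is an arbitrary choice)
    response = requests.get(file_url, timeout=10)
    # Raise an exception on HTTP errors instead of parsing an error page
    response.raise_for_status()
    # Wrap the decoded text in StringIO: recent pandas versions deprecate
    # passing literal HTML to read_html()
    df = pd.read_html(StringIO(response.text))[0]
    return df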
 
# File URL
 
# Read HTML file using requests with read_html()
df = read_html_with_requests(html_file_url)
 
# Display DataFrame
print("Approach 3 Output:")
print(df)


HTML




<!DOCTYPE html>
<html>
<head>
    <title>Topics in Different Categories</title>
</head>
<body>
 
<table border="1">
    <tr>
        <th>Category</th>
        <th>Topic</th>
        <th>Difficulty</th>
    </tr>
    <tr>
        <td>Data Structures</td>
        <td>Algorithms</td>
        <td>Beginner</td>
    </tr>
    <tr>
        <td>Web Development</td>
        <td>HTML/CSS</td>
        <td>Intermediate</td>
    </tr>
    <tr>
        <td>Machine Learning</td>
        <td>Python</td>
        <td>Advanced</td>
    </tr>
</table>
 
</body>
</html>


Output:

Approach 3 Output:

           Category       Topic    Difficulty
0   Data Structures  Algorithms      Beginner
1   Web Development    HTML/CSS  Intermediate
2  Machine Learning      Python      Advanced
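
Since remote servers can be slow, return errors, or serve pages without any tables, a slightly more defensive version of this approach may be helpful. The sketch below is one possible shape, not a definitive implementation: the `fetch_tables` name, the 10-second default timeout, and the optional `session` parameter are all assumptions introduced here.

```python
from io import StringIO

import pandas as pd
import requests


def fetch_tables(url, timeout=10.0, session=None):
    """Download a page and return every HTML table in it as a DataFrame.

    `session` is an optional requests.Session (or any object with a
    compatible get() method); it is a hypothetical convenience parameter.
    """
    http = session if session is not None else requests
    response = http.get(url, timeout=timeout)
    # Turn HTTP errors (404, 500, ...) into exceptions instead of parsing
    # an error page as if it were data.
    response.raise_for_status()
    try:
        return pd.read_html(StringIO(response.text))
    except ValueError:
        # pandas raises ValueError when the page contains no tables.
        return []
```

Calling `fetch_tables(url)` with the address of a page returns a list of DataFrames, or an empty list if the page has no tables. Passing a `requests.Session` is useful when several pages from the same site are read in a row.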

Read HTML Files Using lxml parser with read_html()

In this approach, we pass flavor='lxml' to the read_html() function so that the lxml parser handles the HTML file. The lxml parser is known for its speed and its ability to handle large HTML files efficiently, so this approach suits cases where performance is a concern. On the other hand, it requires the lxml library to be installed, and it offers less control over parsing than BeautifulSoup.

Python3




import pandas as pd
 
# Approach 4: Using lxml parser with read_html()
def read_html_with_lxml(file_path):
    # Read HTML file into DataFrame using read_html() with 'lxml' parser
    df = pd.read_html(file_path, flavor='lxml')[0]
    return df
 
# File path
html_file_path = 'data/geeks_for_geeks.html'
 
# Read HTML file using lxml parser with read_html()
df = read_html_with_lxml(html_file_path)
 
# Display DataFrame
print("Approach 4 Output:")
print(df)


HTML




<!DOCTYPE html>
<html>
<head>
    <title>Book Information</title>
</head>
<body>
 
<table border="1">
    <tr>
        <th>Title</th>
        <th>Author</th>
        <th>Difficulty</th>
    </tr>
    <tr>
        <td>Python Basics</td>
        <td>John Doe</td>
        <td>Beginner</td>
    </tr>
    <tr>
        <td>Data Analysis</td>
        <td>Jane Smith</td>
        <td>Intermediate</td>
    </tr>
    <tr>
        <td>Machine Learning Algorithms</td>
        <td>David Johnson</td>
        <td>Advanced</td>
    </tr>
</table>
 
</body>
</html>


Output:

Approach 4 Output:

                         Title         Author    Difficulty
0                Python Basics       John Doe      Beginner
1                Data Analysis     Jane Smith  Intermediate
2  Machine Learning Algorithms  David Johnson      Advanced
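
When no flavor is given, read_html() tries lxml first and falls back to BeautifulSoup with html5lib. If a script should state explicitly which parser it uses, one option is to check what is installed before calling read_html(). The helper below is a small sketch of that idea; `available_flavor` and the sample `html_doc` are hypothetical names introduced for this example.

```python
import importlib.util
from io import StringIO

import pandas as pd


def available_flavor():
    """Return a read_html() flavor whose parser libraries are installed."""
    if importlib.util.find_spec("lxml") is not None:
        return "lxml"
    # The bs4 flavor needs both beautifulsoup4 and html5lib.
    if all(importlib.util.find_spec(name) for name in ("bs4", "html5lib")):
        return "bs4"
    raise ImportError("install lxml, or beautifulsoup4 together with html5lib")


# Hypothetical table used only for illustration.
html_doc = """
<table>
  <tr><th>Title</th><th>Author</th></tr>
  <tr><td>Python Basics</td><td>John Doe</td></tr>
</table>
"""

flavor = available_flavor()
df = pd.read_html(StringIO(html_doc), flavor=flavor)[0]
print(flavor)
print(df)
```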

Conclusion

In conclusion, the read_html() function in Pandas can be used on its own or combined with other libraries to read HTML tables in Python, offering flexibility based on specific requirements. On its own, read_html() is a straightforward option for parsing simple HTML tables. BeautifulSoup allows more control over complex HTML structures, requests enables fetching HTML from URLs, and the lxml parser helps performance with large files.


