How to Scrape Data From Local HTML Files using Python?

Last Updated : 21 Apr, 2021

The BeautifulSoup module in Python allows us to scrape data from local HTML files. Web pages are sometimes saved locally (in an offline environment), and data may need to be extracted from them later, sometimes from several locally stored HTML files at once. HTML files usually contain tags such as <h1>, <h2>,…<p>, and <div>. Using BeautifulSoup, we can scrape their contents and extract the necessary details.

Installation

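It can be installed by typing the commands below in the terminal. The lxml parser used in the examples is a separate package, so it is installed as well.

pip install beautifulsoup4
pip install lxml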

Getting Started

If an HTML file is stored locally and we need to scrape its content in Python using BeautifulSoup, lxml is a good choice of parser, as it is built for parsing XML and HTML. It supports both one-step parsing and step-by-step parsing.
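The parser is selected by the second argument passed to BeautifulSoup. The following is a minimal sketch, assuming lxml may or may not be installed on the machine. The helper name make_soup is hypothetical (it is not part of Beautiful Soup), and it falls back to Python's built-in html.parser when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Parse markup with the preferred parser, else fall back to html.parser.

    make_soup is a hypothetical helper written for this illustration.
    """
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # Raised by Beautiful Soup when the requested parser is not installed
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<h1>Title</h1><p>First paragraph</p>")
print(soup.h1.get_text())
print(soup.p.get_text())
```

Either parser gives the same result for well-formed markup like this; the two can differ in how they repair broken HTML.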

The prettify() method in BeautifulSoup helps to view the tags and their nesting.

Example: Let’s create a sample HTML file by downloading a page and saving it locally.

Python3

# Necessary import
import urllib.request

# As an example, the author's published
# article list page is downloaded
with urllib.request.urlopen('https://auth.geeksforgeeks.org/user/priyarajtt/articles') as webPageResponse:
    outputHtml = webPageResponse.read()

# read() returns bytes, so the file is opened
# in binary mode and the bytes are written
# unchanged. (Printing the bytes object into a
# text file would store its b'...' representation
# instead of the actual HTML.)
# samplehtml.html is used for the next set
# of examples
with open('samplehtml.html', 'wb') as f:
    f.write(outputHtml)


Now, use the prettify() method to view the tags and content more easily.

Python3

# BeautifulSoup lives in the bs4 module
from bs4 import BeautifulSoup

# Open the HTML file in binary mode so that
# BeautifulSoup can detect the encoding. If the
# file is in a different location, give its
# full path. The with block closes the file
with open("samplehtml.html", "rb") as HTMLFileToBeOpened:

    # Read the file and store it in a variable
    contents = HTMLFileToBeOpened.read()

# Create a BeautifulSoup object and
# specify the parser
beautifulSoupText = BeautifulSoup(contents, 'lxml')

# prettify() returns the markup as an indented
# string, which makes the tags and their
# nesting easier to see
print(beautifulSoupText.body.prettify())


In this way, we can load the HTML data. Now let us perform some operations on the data and draw some insights from it.
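As one illustration of such operations, the sketch below pulls a heading and a list of links out of a parsed document. The inline markup is hypothetical and stands in for a saved page, and html.parser is used here only so that the snippet runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical inline markup standing in for a locally saved page
html = """
<html><body>
  <h1>Articles</h1>
  <div class="post"><a href="/a1">First post</a></div>
  <div class="post"><a href="/a2">Second post</a></div>
  <p>No link here</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag (or None when nothing matches)
heading = soup.find("h1").get_text(strip=True)

# find_all() returns every match; href=True keeps only <a> tags that have an href
links = [(a.get_text(strip=True), a["href"]) for a in soup.find_all("a", href=True)]

print(heading)
print(links)
```

For a real file, the same calls apply once the file contents have been passed to BeautifulSoup in place of the inline string.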

Example 1:

We can use the find() methods, but since HTML content changes dynamically, we may not know the exact tag names in advance. In that case, we can use find_all(True) (also available under the older name findAll(True)) to get every tag first, and then perform any kind of manipulation. For example, get each tag's name and the length of its text.
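Before working with a real file, the idea can be tried on a tiny document. The sketch below is only an illustration with hypothetical inline markup: it uses find_all(True) to discover which tag names are present and counts how often each one occurs.

```python
from collections import Counter

from bs4 import BeautifulSoup

# Hypothetical inline markup used only for illustration
html = "<html><body><h1>Hi</h1><p>one</p><p>two</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# find_all(True) matches every tag in the document, whatever its name
tag_counts = Counter(tag.name for tag in soup.find_all(True))

for name, count in tag_counts.items():
    print(name, ":", count)
```

Once the tag names are known, any of them can be passed to find() or find_all() for more targeted extraction.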

Python3

# BeautifulSoup lives in the bs4 module
from bs4 import BeautifulSoup

# Open the HTML file in binary mode so that
# BeautifulSoup can detect the encoding. If the
# file is in a different location, give its
# full path
with open("samplehtml.html", "rb") as HTMLFileToBeOpened:

    # Read the file and store it in a variable
    contents = HTMLFileToBeOpened.read()

# Create a BeautifulSoup object and
# specify the parser
beautifulSoupText = BeautifulSoup(contents, 'lxml')

# find_all(True) returns every tag in the HTML.
# Print each tag's name and the length of its
# own text (find(tag.name) would always measure
# the first tag with that name instead)
for tag in beautifulSoupText.find_all(True):
    print(tag.name, " : ", len(tag.text))


Example 2:

Now, instead of scraping one HTML file, we want to scrape all the HTML files present in a directory. This can be necessary when, for example, a particular directory is filled with online data daily and scraping has to be carried out as a batch process.

We can use the functionality of the “os” module. Let us take all the HTML files in the current directory for our example.

So our task is to scrape every HTML file in the folder. This can be achieved in the way below: the HTML files are scraped one by one, the text length of the tags in each file is retrieved, and the result is showcased in the attached video.

Python3

# Necessary imports for getting
# directory and file names
import os
from bs4 import BeautifulSoup

# Get the current working directory
directory = os.getcwd()

# For all the files present in that
# directory
for filename in os.listdir(directory):

    # Check whether the file has the
    # .html extension, which can be
    # done with the endswith() method
    if filename.endswith('.html'):

        # os.path.join() joins one or more
        # path components, giving the full
        # path of the file
        fname = os.path.join(directory, filename)
        print("Current file name ..", os.path.abspath(fname))

        # Open the file in binary mode so that
        # BeautifulSoup can detect the encoding
        with open(fname, 'rb') as file:

            beautifulSoupText = BeautifulSoup(file.read(), 'html.parser')

            # Parse the HTML as you wish
            for tag in beautifulSoupText.find_all(True):
                print(tag.name, " : ", len(tag.text))
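For a batch process it can be more convenient to collect the results than to print them. The following is a sketch of one possible variation, not the article's own method: it uses pathlib instead of os, the function name text_lengths_by_file is hypothetical, and summing the text lengths per tag name is an assumed way of aggregating repeated tags.

```python
from pathlib import Path

from bs4 import BeautifulSoup


def text_lengths_by_file(directory):
    """Map each .html file name in directory to {tag name: total text length}.

    Hypothetical helper for illustration; lengths of repeated tags are summed.
    """
    results = {}
    for path in sorted(Path(directory).glob("*.html")):
        # Reading bytes lets Beautiful Soup detect the encoding itself
        soup = BeautifulSoup(path.read_bytes(), "html.parser")
        lengths = {}
        for tag in soup.find_all(True):
            lengths[tag.name] = lengths.get(tag.name, 0) + len(tag.get_text())
        results[path.name] = lengths
    return results


# Usage: summarise every HTML file in the current directory
for name, lengths in text_lengths_by_file(".").items():
    print(name, lengths)
```

Returning a dictionary keeps the scraping step separate from reporting, so the same function could feed a CSV writer or a database insert in a daily job.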

