How to write the output to HTML file with Python BeautifulSoup?

Last Updated : 08 Apr, 2021

In this article, we will write parsed output to an HTML file using Python's BeautifulSoup. BeautifulSoup is a Python library used mainly for web scraping; here we focus on saving the parsed page to an HTML file.

Modules needed and installation:

pip install beautifulsoup4 requests
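
To confirm that the installation worked, a quick check such as the following can help. It is a minimal sketch that only imports the two libraries used in this article and prints their version strings.

```python
import bs4
import requests

# Both packages expose a __version__ string
print("beautifulsoup4:", bs4.__version__)
print("requests:", requests.__version__)
```

If either import fails with ModuleNotFoundError, the package is not installed in the interpreter you are running.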

Approach:

Import the required libraries.
Make a GET request to the desired URL and extract the page content.
Write the output to a new file using Python's built-in file handling.

Steps to be followed:

Step 1: Import the required libraries.

Python3

# Import libraries 
from bs4 import BeautifulSoup 
import requests

Step 2: Perform a GET request to the target page (a GeeksforGeeks article in this example), extract its content, and create a soup object by passing that content to BeautifulSoup with html.parser as the parser.

Note: If you are parsing an XML page, set the parser to xml, which requires the lxml package to be installed.

Python3

# set the url to perform the get request 
URL = 'https://www.geeksforgeeks.org/how-to-scrape-all-pdf-files-in-a-website/'
page = requests.get(URL) 
  
# load the page content 
text = page.content 
  
# make a soup object by using beautiful 
# soup and set the markup as html parser 
soup = BeautifulSoup(text, "html.parser") 
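
To see what the soup object holds without making a network request, here is a minimal sketch that parses a hard-coded byte string. The sample markup and the variable names are illustrative assumptions standing in for page.content.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for page.content (bytes)
sample_html = b"<html><head><title>Demo page</title></head><body><p>Hello</p></body></html>"

# Parse with Python's built-in HTML parser
soup = BeautifulSoup(sample_html, "html.parser")

# Navigate the parsed tree by tag name
title_text = soup.title.string
paragraph_text = soup.p.get_text()

print(title_text)
print(paragraph_text)
```

BeautifulSoup accepts both bytes and str, which is why page.content can be passed to it directly.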

Step 3: Open the output file with Python's built-in open() function, setting the encoding to UTF-8, and write the soup object to it. We call .prettify() on the soup object, which returns the markup as an indented, easier-to-read string, and write that string to the file.

The output file is saved as output.html in the current working directory.

Python3

# open the file in w mode 
# set encoding to UTF-8 
with open("output.html", "w", encoding = 'utf-8') as file: 
    
    # prettify() already returns an indented string,
    # so no extra str() conversion is needed
    file.write(soup.prettify())
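
The write step can also be tried offline. The following is a minimal sketch, assuming a small hard-coded HTML string and a temporary directory (both are illustrative choices, not part of the original example); it writes the prettified markup and reads it back to confirm what was saved.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Hypothetical sample markup used instead of a downloaded page
sample_html = "<html><body><p>Hello</p></body></html>"
soup = BeautifulSoup(sample_html, "html.parser")

pretty = soup.prettify()  # indented, one tag per line
compact = str(soup)       # markup without added whitespace

# Write to a temporary directory so no existing file is overwritten
out_path = os.path.join(tempfile.mkdtemp(), "output.html")
with open(out_path, "w", encoding="utf-8") as file:
    file.write(pretty)

# Read the file back to check its contents
with open(out_path, encoding="utf-8") as file:
    saved = file.read()

print(saved)
```

Note that prettify() adds whitespace for readability, so use str(soup) instead if the saved markup should stay close to what was parsed.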

Below is the full implementation:

Python3

# Import libraries 
from bs4 import BeautifulSoup 
import requests 
  
# set the url to perform the get request 
URL = 'https://www.geeksforgeeks.org/how-to-scrape-all-pdf-files-in-a-website/'
page = requests.get(URL) 
  
# load the page content 
text = page.content 
  
# make a soup object by using 
# beautiful soup and set the markup as html parser 
soup = BeautifulSoup(text, "html.parser") 
  
# open the file in w mode 
# set encoding to UTF-8 
with open("output.html", "w", encoding = 'utf-8') as file: 
    
    # prettify() already returns an indented string,
    # so no extra str() conversion is needed
    file.write(soup.prettify())