Extract JSON from HTML using BeautifulSoup in Python

In this article, we are going to extract JSON from HTML using BeautifulSoup in Python.

Modules needed

bs4: Beautiful Soup (bs4) is a Python library for pulling data out of HTML and XML files. This module does not come built in with Python. To install it, type the command below in the terminal.

pip install bs4

requests: Requests allows you to send HTTP/1.1 requests extremely easily. This module also does not come built in with Python. To install it, type the command below in the terminal.

pip install requests

Approach:

Import all the required modules.
Pass the URL to the requests.get() function, which sends a GET request to that URL and returns a response object.

Syntax: requests.get(url, params=None, **kwargs)
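As a minimal sketch of this step, the request can be wrapped in a small helper. The name fetch_html and the 10-second timeout are assumptions made for illustration and are not part of the original article; raise_for_status() is added so that an HTTP error is reported instead of being parsed as if it were a normal page.

```python
import requests


def fetch_html(url, timeout=10):
    """Hypothetical helper: send a GET request and return the body as text."""
    # timeout stops the call from waiting forever on a slow server
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    return response.text


# Example call (needs network access):
# html = fetch_html("https://books.toscrape.com/catalogue/page-1.html")
```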

Now parse the HTML content using bs4.

Syntax: BeautifulSoup(page.text, 'html.parser')

Parameters:

page.text: the raw HTML content of the response.

html.parser: the name of the HTML parser we want to use.
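To see the parsing step in isolation, here is a small sketch that parses an inline HTML string instead of a downloaded page. The markup and the variable names are invented for illustration only.

```python
from bs4 import BeautifulSoup

# Invented markup standing in for page.text
html = (
    "<html><head><title>Sample page</title></head>"
    "<body><p id='intro'>Hello</p></body></html>"
)

# Build the parse tree with Python's built-in HTML parser
soup = BeautifulSoup(html, "html.parser")

page_title = soup.title.string                      # text inside <title>
intro_text = soup.find("p", id="intro").get_text()  # text inside <p id="intro">

print(page_title)
print(intro_text)
```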

Now get the required data with the find() and find_all() functions.

Find the list of books through its li, a, and p tags, each identified by a unique class or id. You can open the webpage in the browser and inspect the relevant element by right-clicking it, as shown in the figure.
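The lookup by tag and class can be tried on a small, self-contained snippet first. The markup below is made up to resemble a product list and is not copied from a real page; the class names in it are assumptions chosen to match the ones discussed in this article.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration; not copied from a real page.
sample_html = """
<ol>
  <li class="col-xs-6 col-sm-4">
    <a href="book-one/index.html"><img src="one.jpg" alt="Book One"></a>
    <p class="star-rating Three"></p>
    <p class="price_color">10.00</p>
  </li>
  <li class="col-xs-6 col-sm-4">
    <a href="book-two/index.html"><img src="two.jpg" alt="Book Two"></a>
    <p class="star-rating Five"></p>
    <p class="price_color">12.50</p>
  </li>
</ol>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# find_all() returns every matching tag; class_ matches any single CSS class.
items = soup.find_all("li", class_="col-xs-6")

rows = []
for item in items:
    # find() returns only the first matching tag inside this <li>
    rows.append({
        "title": item.find("img")["alt"],
        "link": item.find("a")["href"],
        "price": item.find("p", class_="price_color").text,
    })

print(rows)
```

Searching inside each li element, rather than in the whole document, keeps the title, link, and price of one book together.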

Create a JSON file and use the json.dump() method to convert the Python objects into the appropriate JSON objects.
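A short sketch of the export step on its own: a list of dictionaries is written with json.dump() and read back with json.load() to confirm the round trip. The sample record, the temporary directory, and the file name sample.json are assumptions for illustration.

```python
import json
import os
import tempfile

# Invented sample record standing in for the scraped data
records = [{"book no": "1", "title": "Sample Book", "price": "£10.00"}]

# Write to a temporary directory so the sketch leaves the working folder alone
path = os.path.join(tempfile.mkdtemp(), "sample.json")

with open(path, "w", encoding="utf-8") as f:
    # ensure_ascii=False keeps characters such as £ readable in the file
    json.dump(records, f, indent=4, ensure_ascii=False)

with open(path, encoding="utf-8") as f:
    loaded = json.load(f)

print(loaded == records)
```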

Below is the full implementation:

Python3

# Import the required modules
import requests
from bs4 import BeautifulSoup
import json
 
# Function will return a list of dictionaries,
# each containing the information of one book.
def json_from_html_using_bs4(base_url):

    # requests.get(url) returns a response that is saved
    # in a response object called page.
    page = requests.get(base_url)

    # page.text gives us the response body as text; we pass
    # it to BeautifulSoup along with "html.parser", which
    # builds a parsed tree in soup.
    soup = BeautifulSoup(page.text, "html.parser")
 
    # soup.find_all finds every <li> element whose class is
    # "col-xs-6 col-sm-4 col-md-3 col-lg-3" (one per book);
    # the results are stored in books.
    books = soup.find_all(
        'li', attrs={'class': 'col-xs-6 col-sm-4 col-md-3 col-lg-3'})
 
    # Initialise the required variables
    star = ['One', 'Two', 'Three', 'Four', 'Five']
    res, book_no = [], 1

    # Iterate over books and check the given tags
    # to get the information of each book.

    for book in books:

        # Title of book in <img> tag with "alt" key.
        title = book.find('img')['alt']

        # Link of book in <a> tag with "href" key; the first
        # 37 characters of base_url are
        # "https://books.toscrape.com/catalogue/".
        link = base_url[:37] + book.find('a')['href']
 
        # Rating of book from <p> tag; stays None if no
        # star-rating class is found.
        stars = None
        for index in range(5):
            find_stars = book.find(
                'p', attrs={'class': 'star-rating ' + star[index]})

            # Check which star-rating class is not
            # returning None and then break the loop
            if find_stars is not None:
                stars = star[index] + " out of 5"
                break
 
        # Price of book from <p> tag in price_color class
        price = book.find('p', attrs={'class': 'price_color'}).text
 
        # Stock status of book from <p> tag in
        # instock availability class.
        instock = book.find(
            'p', attrs={'class': 'instock availability'}).text.strip()

        # Create a dictionary with the above book information
        data = {'book no': str(book_no), 'title': title,
                'rating': stars, 'price': price, 'link': link,
                'stock': instock}

        # Append the dictionary to the list
        res.append(data)
        book_no += 1

    return res
 
# Main Function
if __name__ == "__main__":

    # Enter the url of website
    base_url = "https://books.toscrape.com/catalogue/page-1.html"

    # Function will return a list of dictionaries
    res = json_from_html_using_bs4(base_url)

    # Convert the Python objects into JSON and export
    # them to the books.json file.
    with open('books.json', 'w', encoding='latin-1') as f:
        json.dump(res, f, indent=8, ensure_ascii=False)

    print("Created Json File")

Output:

Created Json File
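The implementation builds each book link by slicing the first 37 characters of base_url, which only works for this exact URL. As an alternative sketch (not the method used above), the standard library's urllib.parse.urljoin resolves a relative href against the page URL regardless of its length. The relative path below is a hypothetical example value.

```python
from urllib.parse import urljoin

base_url = "https://books.toscrape.com/catalogue/page-1.html"

# Hypothetical relative href, as it might appear in an <a> tag
relative_href = "some-book_1/index.html"

# urljoin replaces the last path segment (page-1.html) with the relative path
absolute_link = urljoin(base_url, relative_href)

print(absolute_link)
```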

Our JSON file output:
