
Convert HTML source code to JSON Object using Python

In this post, we will see how to convert HTML source code into a JSON object. JSON is easy to transfer between systems and is supported by most modern programming languages. JavaScript in particular can parse JSON into a native object, and JavaScript can in turn generate HTML for your web pages.

We will use the xmltojson module in this post. Its parse function takes the HTML as input and returns the parsed result as a JSON string.
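One practical caveat: as the name suggests, xmltojson expects XML, so it is reasonable to assume that HTML which is not well-formed XML (for example, an unclosed <br> tag) will be rejected. The sketch below is not part of xmltojson; it uses the standard library's expat parser to check a string before conversion. The helper name is_well_formed_xml and both sample strings are made up for illustration.

```python
from xml.parsers import expat


def is_well_formed_xml(markup: str) -> bool:
    """Return True if the markup parses as well-formed XML."""
    parser = expat.ParserCreate()
    try:
        # The second argument marks this as the final chunk of input
        parser.Parse(markup, True)
    except expat.ExpatError:
        return False
    return True


# Every tag is closed, so this is well-formed XML
strict_html = "<html><body><p>Hello</p><br/></body></html>"

# Typical loose HTML: <p> and <br> are never closed
loose_html = "<html><body><p>Hello<br></body></html>"

print(is_well_formed_xml(strict_html))  # True
print(is_well_formed_xml(loose_html))   # False
```

If a page fails this check, it would need to be cleaned up into well-formed markup before being parsed.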



Syntax: xmltojson.parse(xml_input, xml_attribs=True, item_depth=0, item_callback)

Parameters:



  • xml_input can be either a file or a string.
  • xml_attribs controls attributes: if True, element attributes are included in the output; if False, they are ignored.
  • item_depth is the depth of children at which the item_callback function is called.
  • item_callback is a callback function that is called for each item found at item_depth.

Environment Setup:

Install the required modules:

pip install xmltojson
pip install requests

Steps:




import xmltojson
import json
import requests




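# Sample URL to fetch the html page
# (hypothetical placeholder; replace it with the page you want to convert)
url = "https://example.com/sample.html"
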
# Headers to mimic the browser
headers = {
    'User-Agent': (
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 '
        '(KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'
    )
}

# Get the page through get() method
html_response = requests.get(url=url, headers=headers)
  
# Save the page content as sample.html
with open("sample.html", "w") as html_file:
    html_file.write(html_response.text)




# Read the saved page and convert it to a JSON string
with open("sample.html", "r") as html_file:
    html = html_file.read()
    json_ = xmltojson.parse(html)




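# parse() already returns a JSON string, so write it as-is;
# json.dump() would encode the string a second time
with open("data.json", "w") as file:
    file.write(json_)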
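
Because parse returns a JSON string rather than a Python object, how that string is written to disk matters. The sketch below uses a hand-made string as a stand-in for the parse result. It shows that passing an already serialized string through json.dumps (json.dump behaves the same way) encodes it a second time, whereas the string itself is already valid JSON.

```python
import json

# Stand-in for the string returned by xmltojson.parse()
json_text = json.dumps({"p": "hi"})

# Serializing the string again wraps it in quotes and escapes it
double_encoded = json.dumps(json_text)
print(double_encoded)  # "{\"p\": \"hi\"}"

# Loading the double-encoded form gives back a str, not a dict
print(type(json.loads(double_encoded)).__name__)  # str

# Loading the original string gives the expected dict
print(json.loads(json_text))  # {'p': 'hi'}
```

Writing the string directly with file.write keeps data.json loadable as an object in a single json.load call.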




print(json_)

Complete Code:




import xmltojson
import json
import requests
  
  
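# Sample URL to fetch the html page
# (hypothetical placeholder; replace it with the page you want to convert)
url = "https://example.com/sample.html"
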
# Headers to mimic the browser
headers = {
    'User-Agent': (
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 '
        '(KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'
    )
}

# Get the page through get() method
html_response = requests.get(url=url, headers=headers)
  
# Save the page content as sample.html
with open("sample.html", "w") as html_file:
    html_file.write(html_response.text)
      
# Read the saved page and convert it to a JSON string
with open("sample.html", "r") as html_file:
    html = html_file.read()
    json_ = xmltojson.parse(html)
      
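# parse() already returns a JSON string, so write it as-is;
# json.dump() would encode the string a second time
with open("data.json", "w") as file:
    file.write(json_)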
      
print(json_)

Output:

{"html": {"@lang": "en", "head": {"title": "Document"}, "body": {"div": {"h1": "Geeks For Geeks", "p": "Welcome to the world of programming geeks!", "input": [{"@type": "text", "@placeholder": "Enter your name"}, {"@type": "button", "@value": "submit"}]}}}}
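
To work with the result in Python, the JSON string can be loaded back into a dictionary. The following is a minimal sketch that pastes the output above in as a literal; in the script itself it would be the json_ variable. Attributes appear under keys prefixed with @, and repeated elements such as the two input tags become a list.

```python
import json

# The output shown above, pasted in as a string literal
json_text = (
    '{"html": {"@lang": "en", "head": {"title": "Document"}, '
    '"body": {"div": {"h1": "Geeks For Geeks", '
    '"p": "Welcome to the world of programming geeks!", '
    '"input": [{"@type": "text", "@placeholder": "Enter your name"}, '
    '{"@type": "button", "@value": "submit"}]}}}}'
)

data = json.loads(json_text)
div = data["html"]["body"]["div"]

# Attribute of the <html> element
print(data["html"]["@lang"])  # en

# Text content of the <h1> element
print(div["h1"])  # Geeks For Geeks

# The two <input> elements are collected into a list
input_types = [field["@type"] for field in div["input"]]
print(input_types)  # ['text', 'button']
```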

