Remove all style, scripts, and HTML tags using BeautifulSoup

  • Last Updated : 25 Feb, 2021

Prerequisite: BeautifulSoup, Requests

Beautiful Soup is a Python library for pulling data out of HTML and XML files. This article shows how to remove all style, script, and HTML tags from a document using Beautiful Soup, leaving only the text.


Required Modules:



  • bs4: Beautiful Soup (bs4) is a Python library primarily used to extract data from HTML, XML, and other markup languages. It is one of the most widely used libraries for web scraping.
    Run the following command in the terminal to install this library:
pip install bs4
  • requests: This library is used for making HTTP requests in Python.
    Run the following command in the terminal to install this library:
pip install requests

Approach:

  • Import bs4 library
  • Create an HTML doc
  • Parse the content into a BeautifulSoup object
  • Iterate over the style and script tags and remove them from the document using the decompose() method
  • Use the stripped_strings generator to retrieve the remaining text
  • Print the extracted data
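
To make the decompose() step concrete, here is a minimal sketch of what it does to the parsed tree. The sample markup and the variable names are made up for illustration and are not part of the article's example.

```python
from bs4 import BeautifulSoup

# Illustrative markup (an assumption, not the article's HTML_DOC)
html = "<div><p>Hello</p><script>alert('hi')</script><style>p {color: red;}</style></div>"

soup = BeautifulSoup(html, "html.parser")

# Calling the soup like a function is shorthand for soup.find_all(...)
for element in soup(["script", "style"]):
    # decompose() removes the tag and everything inside it from the tree
    element.decompose()

# Markup that is left once the script and style tags are gone
remaining_markup = str(soup)

# stripped_strings yields each remaining text node with whitespace trimmed
text = " ".join(soup.stripped_strings)

print(remaining_markup)
print(text)
```

After the loop, the script and style elements no longer exist in the tree, so only the paragraph text is left to collect.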

Implementation:

Python3




# Import Module
from bs4 import BeautifulSoup
  
# HTML Document
HTML_DOC = """
              <html>
                <head>
                    <title> Geeksforgeeks </title>
                    <style>.call {background-color:black;} </style>
                    <script>getit</script>
                </head>
                <body>
                    is a
                    <div>Computer Science portal.</div>
                </body>
              </html>
            """
  
# Function to remove tags
def remove_tags(html):
  
    # parse html content
    soup = BeautifulSoup(html, "html.parser")
  
    for data in soup(['style', 'script']):
        # Remove tags
        data.decompose()
  
    # return data by retrieving the tag content
    return ' '.join(soup.stripped_strings)
  
  
# Print the extracted data
print(remove_tags(HTML_DOC))

Output:

Geeksforgeeks is a Computer Science portal.
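
As a possible alternative to joining stripped_strings by hand, Beautiful Soup's get_text() accepts a separator and a strip flag and should give the same result here. The function name remove_tags_get_text below is hypothetical and only used for this sketch.

```python
from bs4 import BeautifulSoup


def remove_tags_get_text(html):
    """Variant of remove_tags() that lets get_text() do the joining."""
    soup = BeautifulSoup(html, "html.parser")

    # Drop style and script elements together with their contents
    for element in soup(["style", "script"]):
        element.decompose()

    # strip=True trims each text node and skips empty ones;
    # separator=" " puts a single space between the pieces
    return soup.get_text(separator=" ", strip=True)


print(remove_tags_get_text("<h1> Title </h1><p>Body <b>text</b></p><script>x</script>"))
```

Which form to use is mostly a matter of taste: stripped_strings is convenient when the individual pieces are needed, while get_text() is shorter when only the final string matters.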

Removing all style, scripts, and HTML tags from a URL

Approach:

  • Import the bs4 and requests libraries
  • Get the content from the given URL using requests.get()
  • Parse the content into a BeautifulSoup object
  • Iterate over the style and script tags and remove them from the document using the decompose() method
  • Use the stripped_strings generator to retrieve the remaining text
  • Print the extracted data
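
The fetch step in the list above can fail or hang on a real network, so one way to sketch it is with a timeout and a status check. The helper name fetch_text and the 10-second default are assumptions made for this example, not something the article prescribes.

```python
from bs4 import BeautifulSoup
import requests


def remove_tags(html):
    """Return the text of an HTML document given as str or bytes."""
    soup = BeautifulSoup(html, "html.parser")
    for element in soup(["style", "script"]):
        element.decompose()
    return " ".join(soup.stripped_strings)


def fetch_text(url, timeout=10):
    """Hypothetical helper: download `url` and return its text without tags."""
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError on 4xx/5xx instead of parsing an error page
    response.raise_for_status()
    # response.content is bytes; BeautifulSoup works out the encoding itself
    return remove_tags(response.content)
```

Calling fetch_text() with the address of a page you are allowed to scrape would then return that page's text as a single string.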

Implementation:

Python3




# Import Module
from bs4 import BeautifulSoup
import requests
  
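# Website URL
# Placeholder address (an assumption); replace it with the page you want to scrape
URL = "https://example.com/"
  
# Page content from Website URL
page = requests.get(URL)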
  
# Function to remove tags
def remove_tags(html):
  
    # parse html content
    soup = BeautifulSoup(html, "html.parser")
  
    for data in soup(['style', 'script']):
        # Remove tags
        data.decompose()
  
    # return data by retrieving the tag content
    return ' '.join(soup.stripped_strings)
  
  
# Print the extracted data
print(remove_tags(page.content))

Output:

The printed text depends on the content of the page at the given URL.
