
Clean Web Scraping Data Using clean-text in Python

If you like to play with APIs or scrape data from various websites, you have probably run into random, annoying text, numbers, and keywords that come along with the data. Cleaning scraped data to get at the content you actually want can be complicated and frustrating.

In this article, we are going to explore a Python library called clean-text, which helps you clean scraped data with a single function call instead of long, hand-written code. Let's begin.



Installation

Install the library with the following command:

pip install clean-text

Note: The clean-text package requires Python 3.7 or greater.



Syntax

cleantext.clean(text, **operations)

  • text: the raw input string
  • operations: keyword arguments that switch individual cleaning steps on or off (described below)

Different clean-text operations:

The clean function accepts a range of keyword arguments that specify how the raw input text is cleaned, and it returns the cleaned text as a string. The Code Implementation example passes the main arguments explicitly.
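To make a few of these flags concrete, here is a rough standard-library approximation of what lower, no_urls, and no_emails do. This is not clean-text's implementation: the function name rough_clean is hypothetical, the regular expressions are deliberately simplistic, and the order of the steps is just this sketch's choice. It only illustrates the kind of substitution the library performs for you.

```python
import re

# Deliberately simplistic patterns, for illustration only.
URL_RE = re.compile(r"https?://\S+")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def rough_clean(text, lower=True, no_urls=False, no_emails=False,
                replace_with_url="<URL>", replace_with_email="<EMAIL>"):
    """Hypothetical stand-in that mimics three clean-text options."""
    if lower:
        # Lowercase first so the placeholders below keep their casing.
        text = text.lower()
    if no_urls:
        text = URL_RE.sub(replace_with_url, text)
    if no_emails:
        text = EMAIL_RE.sub(replace_with_email, text)
    return text


sample = "Mail Admin@Example.com or see https://example.com/docs today"
print(rough_clean(sample, no_urls=True, no_emails=True))
```

A real scraper meets many more edge cases (broken Unicode, phone numbers, currency symbols), which is where a dedicated library saves effort.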

Code Implementation:




# import library
from cleantext import clean
 
# input string
text = """
    A bunch of \\u2018new\\u2019 references,
    including [Moana]. »Yóù àré rïght <3!«
    """
 
print(clean(text=text,
            fix_unicode=True,
            to_ascii=True,
            lower=True,
            no_line_breaks=False,
            no_urls=False,
            no_emails=False,
            no_phone_numbers=False,
            no_numbers=False,
            no_digits=False,
            no_currency_symbols=False,
            no_punct=False,
            replace_with_punct="",
            replace_with_url="This is a URL",
            replace_with_email="Email",
            replace_with_phone_number="",
            replace_with_number="123",
            replace_with_digit="0",
            replace_with_currency_symbol="$",
            lang="en"
            ))

Output:

 

