Python – Efficient Text Data Cleaning

Gone are the days when data came mostly in row-column, or structured, form. Today, much of the data being collected is unstructured: text, images, audio and so on. The ratio of structured to unstructured data has fallen over the years, and unstructured data is estimated to grow at 55-65% every year.

Thus, we need to learn how to work with unstructured data so that we can extract relevant information from it and make it useful. When working with text data, it is very important to pre-process it before using it for analysis or prediction.
In this article, we will walk through various text data cleaning techniques using Python.

Let’s take a tweet for example:

I enjoyd the event which took place yesteday &amp; I lovdddd itttt ! The link to the show is
http://t.co/4ftYom0i It's awesome you'll luv it #HadFun #Enjoyed BFN GN

We will be performing data cleaning on this tweet step-wise.

Steps for Data Cleaning

1) Clear out HTML entities: HTML entities like &nbsp;, &amp;, &lt; etc. can be found in much of the data available on the web, and we need to get rid of them. You can do this in two ways:



  • By using specific regular expressions, or
  • By using a module or package already available (Python's built-in html module)
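
The regular-expression option can be sketched like this. A minimal example: unlike real HTML unescaping, it simply drops any `&name;` or `&#number;` entity rather than decoding it, which is sometimes all you need.

```python
import re

# Drop (rather than decode) HTML entities such as &amp; or &#39;
def strip_entities(text):
    return re.sub(r'&#?\w+;', '', text)

print(strip_entities("fish &amp; chips"))
```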

We will be using the module already available in Python.

Code:


#Unescape HTML entities
#html.unescape replaces HTMLParser().unescape, which was removed in Python 3.9
import html

tweet = "I enjoyd the event which took place yesteday &amp; I lovdddd itttt ! The link to the show is http://t.co/4ftYom0i It's awesome you'll luv it #HadFun #Enjoyed BFN GN"
tweet = html.unescape(tweet)
print("After removing HTML characters the tweet is:-\n{}".format(tweet))


Output:

2) Encoding & Decoding Data: This is the process of converting text between character representations. There are different encodings like UTF-8, ASCII etc. available for text data, and we should keep our data in a standard encoding format; the most common is UTF-8.

The given tweet is already in UTF-8, so we encode it to ASCII bytes and then decode it back to a UTF-8 string to illustrate the process.

Code:


#encode the UTF-8 string to ASCII bytes, dropping characters ASCII cannot represent
encode_tweet = tweet.encode('ascii', 'ignore')
print("encode_tweet = \n{}".format(encode_tweet))

#decode the bytes back to a UTF-8 string
decode_tweet = encode_tweet.decode('utf-8')
print("decode_tweet = \n{}".format(decode_tweet))


Output:



3) Removing URLs, Hashtags and Styles: Our text may contain hyperlinks, hashtags, or style markers such as the old-style "RT" retweet prefix in Twitter data. These provide no relevant information and can be removed. For hashtags, only the hash sign '#' will be removed. We will use the re module to perform the regular expression operations.

Code:


#library for regular expressions
import re    
  
# remove hyperlinks
tweet = re.sub(r'https?://\S+', '', tweet)
  
# remove hashtags
# only removing the hash # sign from the word
tweet = re.sub(r'#', '', tweet)
  
# remove old style retweet text "RT"
tweet = re.sub(r'^RT[\s]+', '', tweet)
  
print("After removing Hashtags,URLs and Styles the tweet is:-\n{}".format(tweet))


Output:

4) Contraction Replacement: The text may contain apostrophes used for contractions, for example "didn't" for "did not". Left as-is, these can change the sense of a word or sentence, so we need to expand them into their standard forms. To do so, we can use a dictionary that maps each contraction to the value with which it should be replaced.

Few of the contractions used are:-
n't --> not        'll --> will
's  --> is        'd  --> would
'm  --> am        've --> have
're --> are

Code:


#dictionary consisting of the contraction and the actual value
Apos_dict={"'s":" is","n't":" not","'m":" am","'ll":" will",
           "'d":" would","'ve":" have","'re":" are"}
  
#replace the contractions
for key,value in Apos_dict.items():
    if key in tweet:
        tweet=tweet.replace(key,value)
  
print("After Contraction replacement the tweet is:-\n{}".format(tweet))


Output:

5) Split attached words: Some words are joined together, for example "ForTheWin". These need to be separated so that meaning can be extracted from them. After splitting, it becomes "For The Win".



Code:


import re
#split on capitalized word boundaries, e.g. "HadFun" -> "Had Fun"
tweet = " ".join([s for s in re.split("([A-Z][a-z]+[^A-Z]*)", tweet) if s])
print("After splitting attached words the tweet is:-\n{}".format(tweet))


Output:

6) Convert to lower case: Convert the text to lower case to avoid case-sensitivity issues, e.g. "Love" and "love" being counted as different words.

Code:


#convert to lower case
tweet=tweet.lower()
print("After converting to lower case the tweet is:-\n{}".format(tweet))


Output:

7) Slang lookup: Many slang words are used nowadays and can be found in text data, so we need to replace them with their meanings. We can use a dictionary of slang words, as we did for the contraction replacement, or we can keep the slang words in a file. Examples of slang words are:-

asap --> as soon as possible
b4   --> before
lol  --> laugh out loud
luv  --> love
wtg  --> way to go

We are using a file, slang.txt, which stores one word=meaning pair per line.

Code:




#open the file slang.txt and read its contents
with open("slang.txt", "r") as file:
    slang = file.read()

#separate each line present in the file
slang = slang.split('\n')

tweet_tokens = tweet.split()
slang_word = []
meaning = []

#store each slang word and its meaning in separate lists
for line in slang:
    temp = line.split("=")
    slang_word.append(temp[0])
    meaning.append(temp[-1])
  
#replace the slang word with meaning
for i,word in enumerate(tweet_tokens):
    if word in slang_word:
        idx=slang_word.index(word)
        tweet_tokens[i]=meaning[idx]
          
tweet=" ".join(tweet_tokens)
print("After slang replacement the tweet is:-\n{}".format(tweet))


Output:
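
As noted above, the same lookup can also be done with an in-memory dictionary instead of a file. A sketch using only the example words listed earlier:

```python
# Slang words and their meanings (the examples from above)
slang_dict = {"asap": "as soon as possible", "b4": "before",
              "lol": "laugh out loud", "luv": "love", "wtg": "way to go"}

def replace_slang(text):
    # Replace each whole token that appears in the dictionary
    return " ".join(slang_dict.get(tok, tok) for tok in text.split())

print(replace_slang("luv u b4 noon"))
```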

8) Standardizing and Spell Check: There might be spelling errors in the text, or it might not be in the correct format, for example "drivng" for "driving" or "I misssss this" for "I miss this". We can correct these by using the autocorrect Python library; other spelling libraries are available as well. First, install the library with the command-

#install autocorrect library
 pip install autocorrect

Code:


import itertools
#collapse any character repeated more than twice in a row, e.g. "itttt" -> "itt"
tweet = ''.join(''.join(s)[:2] for _, s in itertools.groupby(tweet))
print("After standardizing the tweet is:-\n{}".format(tweet))

from autocorrect import Speller
spell = Speller(lang='en')
#spell check each word
tweet = spell(tweet)
print("After Spell check the tweet is:-\n{}".format(tweet))


Output:

9) Remove Stopwords: Stop words are words which occur frequently in the text but add no significant meaning to it. For this, we will use the nltk library, which provides modules for pre-processing data along with a ready-made list of stop words. You can also create your own stopwords list according to the use case.

First, make sure you have the nltk library installed. If not then download it using the command-

#install nltk library
 pip install nltk

Code:


import nltk
#download the stopwords from nltk using
nltk.download('stopwords')
#import stopwords
from nltk.corpus import stopwords
  
#import english stopwords list from nltk
stopwords_eng = stopwords.words('english')
  
tweet_tokens=tweet.split()
tweet_list=[]
#remove stopwords
for word in tweet_tokens:
    if word not in stopwords_eng:
        tweet_list.append(word)
  
print("tweet_list = {}".format(tweet_list))


Output:
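
As mentioned, the stopword list can be tailored to your use case. A minimal sketch with a small hand-made base list standing in for nltk's (the additions "bfn" and "gn" are hypothetical chat abbreviations):

```python
# A small hand-made base list standing in for stopwords.words('english')
base_stopwords = {"a", "an", "the", "is", "to", "not"}

custom_stopwords = set(base_stopwords)
custom_stopwords.update({"bfn", "gn"})   # add chat abbreviations
custom_stopwords.discard("not")          # keep negations for sentiment tasks

tokens = [t for t in "bfn gn it is not fun".split() if t not in custom_stopwords]
print(tokens)
```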



10) Remove Punctuation: Punctuation consists of characters such as !, <, @, #, &, $ etc. Python's string.punctuation constant lists all of them.

Code:


#for string operations
import string          
clean_tweet=[]
#remove punctuations
for word in tweet_list:
    if word not in string.punctuation:
        clean_tweet.append(word)
  
print("clean_tweet = {}".format(clean_tweet))


Output:
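
Note that the loop above only drops tokens that are exactly a punctuation character; punctuation attached to a word (e.g. "it!") survives. A sketch of a stricter variant using str.translate:

```python
import string

# Strip punctuation characters wherever they appear inside tokens,
# not only when a token is a lone punctuation mark
def strip_punct(text):
    return text.translate(str.maketrans('', '', string.punctuation))

print(strip_punct("it's awesome!"))
```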

These were some of the data cleaning techniques we usually perform on text data. You can also apply more advanced cleaning, such as grammar checking.
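
Putting it together, the steps above can be strung into a single helper function. This is only a sketch: contraction replacement and the file-based slang lookup are omitted for brevity, and the stopword list is passed in by the caller rather than taken from nltk.

```python
import html
import itertools
import re
import string

def clean_text(text, stopword_list=None):
    text = html.unescape(text)                       # 1) unescape HTML entities
    text = text.encode('ascii', 'ignore').decode()   # 2) keep ASCII only
    text = re.sub(r'https?://\S+', '', text)         # 3) remove URLs
    text = re.sub(r'#', '', text)                    # 3) remove hash signs
    # 5) split attached CamelCase words, then 6) lower-case
    text = " ".join(s for s in re.split(r'([A-Z][a-z]+[^A-Z]*)', text) if s)
    text = text.lower()
    # 8) collapse characters repeated more than twice in a row
    text = ''.join(''.join(g)[:2] for _, g in itertools.groupby(text))
    # 9) drop stopwords and 10) lone punctuation tokens
    stopword_list = stopword_list or []
    tokens = [t for t in text.split()
              if t not in stopword_list and t not in string.punctuation]
    return " ".join(tokens)

print(clean_text("Loved it &amp; the show!!! http://t.co/x #Fun", ["the"]))
```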
