How to preprocess string data within a Pandas DataFrame?
Sometimes the data we are working with is packed into a single column, but to analyze it we need the values spread across separate columns, each with an appropriate data type. When everything is combined in one string, that string must be preprocessed first. This article shows how to preprocess string data within a Pandas DataFrame.
Method 1: Using the pandas Series.str.extract() function
Syntax:
Series.str.extract(pat, flags=0, expand=True)
Parameters:
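- pat: regular expression pattern with one or more capture groups; each capture group becomes a column.
- flags: int, default 0 (no flags). Accepts flags from the re module, such as re.IGNORECASE.
- expand: bool, default True. If True, returns a DataFrame with one column per capture group. If False, returns a Series (or Index) when the pattern has a single capture group, and a DataFrame when it has several.
Returns:
A DataFrame, or a Series/Index when expand=False and the pattern has one capture group.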
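The effect of expand is easiest to see on a small Series. The snippet below is a minimal sketch; the two sample strings are made up for illustration and are not part of the article's dataset.

```python
import pandas as pd

# Hypothetical sample values, used only to illustrate the return types
s = pd.Series(['Anhui 1.0', 'Beijing 14.0'])

# expand=True (the default): a DataFrame with one column per capture group
as_frame = s.str.extract(r'([A-Za-z]+)', expand=True)

# expand=False with a single capture group: a Series
as_series = s.str.extract(r'([A-Za-z]+)', expand=False)

# Two capture groups: a DataFrame with two columns
both = s.str.extract(r'([A-Za-z]+) (\d+\.\d)')

print(type(as_frame).__name__)
print(type(as_series).__name__)
print(both.shape)
```

When the result is assigned to a single new column, expand=False is often the more natural choice, since it yields a Series directly.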
Step 1: Import packages
The pandas package is imported.
Python3
# import packages
import pandas as pd
Step 2: Create dataframe:
pd.DataFrame() creates a DataFrame from the given dictionary. This is the DataFrame that needs preprocessing: at the start, all the data sits in a single column as strings.
Python3
# creating data
data = {'CovidData': ['Anhui 1.0 2020-01-22 17:00:00',
                      'Beijing 14.0 2020-01-22 17:00:00',
                      'Washington 1.0 2020-01-24 17:00:00',
                      'Victoria 3.0 2020-01-31 23:59:00',
                      'Macau 10.0 2020-02-06 14:23:04']}

# creating a pandas dataframe
dataset = pd.DataFrame(data)
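str.extract() takes a regular expression and other parameters, and extracts the captured data into columns. The pattern '(....-..-.. ..:..:..)' extracts timestamps of the form yyyy-mm-dd hh:mm:ss. Each . matches any single character, so the pattern relies only on the positions of the hyphens, the space and the colons. The extracted values are still strings, not datetime objects.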
Python3
dataset['LastUpdated'] = dataset['CovidData'].str.extract(
    '(....-..-.. ..:..:..)', expand=True)
dataset['LastUpdated']
Output:
Next, str.extract() is given the pattern '([A-Za-z]+)', which captures the first run of one or more letters in each string. Here that is the state name.
Python3
dataset['State'] = dataset['CovidData'].str.extract(
    '([A-Za-z]+)', expand=True)
dataset['State']
Output:
'(\d+\.\d)' is used to match decimal numbers: \d+ matches one or more digits before the decimal point, \. matches the point itself, and \d matches one digit after it, e.g. 12.1 or 3.5. The dot must be escaped, because an unescaped . matches any character.
Python3
dataset['confirmed_cases'] = dataset['CovidData'].str.extract(
    r'(\d+\.\d)', expand=True)
dataset['confirmed_cases']
Output:
Dataframe before preprocessing:
Dataframe after preprocessing:
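The three extractions above can also be combined into a single call, followed by a conversion of each column to a suitable data type. The following is a minimal sketch rather than the article's original code: the combined pattern, the stricter \d-based date expression and the variable name parts are assumptions made for illustration.

```python
import pandas as pd

dataset = pd.DataFrame({'CovidData': [
    'Anhui 1.0 2020-01-22 17:00:00',
    'Beijing 14.0 2020-01-22 17:00:00',
    'Washington 1.0 2020-01-24 17:00:00',
    'Victoria 3.0 2020-01-31 23:59:00',
    'Macau 10.0 2020-02-06 14:23:04',
]})

# One pattern with three named groups; each group name becomes a column name
pattern = (
    r'(?P<State>[A-Za-z]+)\s+'
    r'(?P<confirmed_cases>\d+\.\d+)\s+'
    r'(?P<LastUpdated>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})'
)
parts = dataset['CovidData'].str.extract(pattern)

# extract() returns strings, so convert each column explicitly
parts['confirmed_cases'] = parts['confirmed_cases'].astype(float)
parts['LastUpdated'] = pd.to_datetime(
    parts['LastUpdated'], format='%Y-%m-%d %H:%M:%S')

print(parts.dtypes)
```

Note that [A-Za-z]+ stops at the first non-letter, so a multi-word name such as "New York" would need a different pattern.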
Method 2: Using apply() function
In this method, we preprocess a dataset of movie reviews, the Rotten Tomatoes dataset. The pandas, re and stop_words packages are imported, and the English stop words are stored in a variable called stop_words. The dataset is read with pd.read_csv(). We then use apply() to preprocess the string data: str.lower converts every string to lower case, re.sub(r'[^\w\s]', '', x) removes punctuation marks, and finally the stop words are removed from each string. Because the CSV file is large, only a small part of the data is displayed to show the difference.
Python3
# import packages
import pandas as pd
from stop_words import get_stop_words
import re

# stop words
stop_words = get_stop_words('en')

# reading the csv file
data = pd.read_csv('test.csv')

print('Before string processing : ')
print(data[(data['PhraseId'] >= 157139) & (data['PhraseId'] <= 157141)]['Phrase'])

# converting all text to lower case in the Phrase column
data['Phrase'] = data['Phrase'].apply(str.lower)

# using regex to remove punctuation
data['Phrase'] = data['Phrase'].apply(lambda x: re.sub(r'[^\w\s]', '', x))

# removing stop words
data['Phrase'] = data['Phrase'].apply(
    lambda x: ' '.join(w for w in x.split() if w not in stop_words))

print('After string processing : ')
data[(data['PhraseId'] >= 157139) & (data['PhraseId'] <= 157141)]['Phrase']
Output:
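The same three steps can be tried without the CSV file or the stop_words package. The sketch below is an alternative formulation, not the article's code: it uses the vectorized .str accessor for lower-casing and punctuation removal, and keeps apply() only for the stop-word filter. The two phrases and the small stop-word set are hypothetical stand-ins for the real data.

```python
import pandas as pd

# Hypothetical phrases and a tiny hand-written stop-word set, for illustration only
data = pd.DataFrame({'Phrase': ['The movie is GREAT, really!',
                                'An odd, but a fun film.']})
stop_words = {'the', 'is', 'an', 'but', 'a'}

cleaned = (
    data['Phrase']
    .str.lower()                              # lower-case every phrase
    .str.replace(r'[^\w\s]', '', regex=True)  # drop punctuation
    .str.split()                              # split on whitespace into word lists
    .apply(lambda words: ' '.join(w for w in words if w not in stop_words))
)
print(cleaned.tolist())  # ['movie great really', 'odd fun film']
```

Storing the stop words in a set rather than a list makes each membership test faster, which can matter on a large file.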