How to preprocess string data within a Pandas DataFrame?
Sometimes the data we are working with is packed into a single column, but to work with it we need it spread across several columns, each with an appropriate data type. When all the data is combined in a single string, that string needs to be preprocessed. This article covers preprocessing string data within a Pandas DataFrame.
Method 1: Using the Pandas Series.str.extract() function
Series.str.extract(pat, flags=0, expand=True)
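- pat: a regular expression with capture groups; each capture group becomes one output column.
- flags: int, default 0 (no flags). Flags from the re module, such as re.IGNORECASE, can be passed here.
- expand: if True, returns a DataFrame with one column per capture group.
The method returns a DataFrame, or a Series when expand=False and the pattern has a single capture group.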
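As a minimal sketch of how the expand parameter changes the return type, the snippet below runs str.extract() on a small made-up Series. The values and variable names are assumptions chosen only for illustration.

```python
import pandas as pd

# Hypothetical sample values, used only to illustrate the return types
s = pd.Series(["a1", "b2", "c3"])

# expand=True (the default): a DataFrame with one column per capture group
as_frame = s.str.extract(r"([a-z])(\d)")

# expand=False with a single capture group: a Series
as_series = s.str.extract(r"([a-z])", expand=False)

print(type(as_frame).__name__, as_frame.shape)
print(type(as_series).__name__, list(as_series))
```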
Step 1: Import packages
The pandas package is imported.
Step 2: Create the DataFrame
The pd.DataFrame() constructor builds a DataFrame from the given dictionary. We create a DataFrame that needs to be preprocessed: at the start, all the data resides in a single column in string format.
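A minimal sketch of these two steps might look like the following. The column name "data" and the sample rows (a timestamp, a name and a decimal value packed into one string) are assumptions, since the original data is not shown here.

```python
import pandas as pd

# Hypothetical rows: timestamp, name and decimal value combined in one string
raw = pd.DataFrame({
    "data": [
        "2021-05-01 10:15:30 apple 12.5",
        "2021-05-02 11:45:00 banana 3.7",
        "2021-05-03 09:05:10 cherry 20.1",
    ]
})

# Everything sits in a single string column at this point
print(raw)
print(raw.dtypes)
```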
str.extract() takes a regular expression string and other parameters to extract data into columns. The pattern (....-..-.. ..:..:..) extracts timestamps of the form yyyy-mm-dd hh:mm:ss, the format in which datetime values are usually written.
With the pattern "([A-Za-z]+)", str.extract() extracts the first run of alphabetic characters in each string.
The pattern '(\d+.\d)' is used to match decimals: \d+ matches one or more digits before the decimal point and \d matches one digit after it, e.g. 12.1 or 3.5. Note that an unescaped . matches any character, so '(\d+\.\d)' is the stricter form when the string also contains other groups of digits.
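The three patterns described above can be combined roughly as follows. This is a sketch on assumed sample data, not the article's original code: the column names date, name and value are hypothetical, and the decimal pattern uses an escaped dot (\.) because the unescaped version would match part of the timestamp first in strings like these.

```python
import pandas as pd

# Hypothetical rows: timestamp, name and decimal value combined in one string
df = pd.DataFrame({
    "data": [
        "2021-05-01 10:15:30 apple 12.5",
        "2021-05-02 11:45:00 banana 3.7",
        "2021-05-03 09:05:10 cherry 20.1",
    ]
})

# Timestamp in the form yyyy-mm-dd hh:mm:ss, converted to a datetime column
date_text = df["data"].str.extract(r"(....-..-.. ..:..:..)", expand=False)
df["date"] = pd.to_datetime(date_text)

# First run of alphabetic characters
df["name"] = df["data"].str.extract(r"([A-Za-z]+)", expand=False)

# Decimal number; the dot is escaped so it only matches a literal "."
df["value"] = df["data"].str.extract(r"(\d+\.\d)", expand=False).astype(float)

print(df[["date", "name", "value"]])
print(df.dtypes)
```

Each extracted column now has its own data type, which is the point of splitting the original string.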
Dataframe before preprocessing:
Dataframe after preprocessing:
Method 2: Using apply() function
In this method, we preprocess a dataset of movie reviews, the Rotten Tomatoes dataset. The pandas, re and stop_words packages are imported, and the stop words are stored in a variable called stop_words. The dataset is loaded with the pd.read_csv() method. We then use the apply() method to preprocess the string data: str.lower converts all the text to lower case, re.sub(r'[^\w\s]', '', x) removes punctuation marks, and finally the stop words are removed from the text. As the CSV file is large, only part of the data is displayed to show the difference.
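The steps above can be sketched as follows. To keep the example self-contained, it makes two assumptions: a couple of made-up reviews in a hypothetical Phrase column stand in for the CSV file, and a small hand-written set stands in for the list that the stop_words package would provide.

```python
import re
import pandas as pd

# Stand-in stop-word list (assumption); the article takes it from the stop_words package
stop_words = {"the", "a", "an", "is", "was", "and", "of", "it", "this", "but"}

# Hypothetical reviews standing in for the data loaded with pd.read_csv()
reviews = pd.DataFrame({
    "Phrase": [
        "This movie is GREAT, and the cast was superb!",
        "A dull plot... but the music is lovely.",
    ]
})

# 1. Convert everything to lower case
reviews["clean"] = reviews["Phrase"].apply(str.lower)

# 2. Remove punctuation (anything that is not a word character or whitespace)
reviews["clean"] = reviews["clean"].apply(lambda x: re.sub(r"[^\w\s]", "", x))

# 3. Drop stop words
reviews["clean"] = reviews["clean"].apply(
    lambda x: " ".join(word for word in x.split() if word not in stop_words)
)

print(reviews)
```

Chaining three separate apply() calls keeps each step easy to inspect; they could equally be merged into one function passed to a single apply().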