Clean the string data in the given Pandas DataFrame
Last Updated: 31 Jul, 2023
Data analytics is now used by companies of all kinds. Real-world data often contains the names of entities such as products or people, and these names are frequently not in a consistent format: they may carry stray whitespace or mixed capitalization. In this post, we discuss approaches to cleaning such data.
Suppose we are working with data from an e-commerce website in which the product names are not properly formatted. The goal is to format the data so that the names have no leading or trailing whitespace and each product name starts with a capital letter.
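As a quick illustration of the two string operations this task relies on, the sketch below applies Python's built-in str.strip() and str.capitalize() to individual values. The helper name clean_name is hypothetical and used only for this illustration.

```python
def clean_name(name: str) -> str:
    """Remove surrounding whitespace, then capitalize the first letter.

    Note: str.capitalize() also lowercases every other character.
    """
    return name.strip().capitalize()


# Hypothetical sample values mirroring the product names used in this post
samples = [" UMbreLla", " maTtress", "BaDmintoN ", "Shuttle"]
cleaned = [clean_name(s) for s in samples]
print(cleaned)  # ['Umbrella', 'Mattress', 'Badminton', 'Shuttle']
```

Because capitalize() only uppercases the first character of the whole string, a multi-word name such as "table lamp" would become "Table lamp", not "Table Lamp".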
Creating DataFrame
Let's consider an example DataFrame in which the product names are not formatted correctly.
Python3
import pandas as pd

df = pd.DataFrame({'Date': ['10/2/2011', '11/2/2011', '12/2/2011', '13/2/2011'],
                   'Product': [' UMbreLla', ' maTtress', 'BaDmintoN ', 'Shuttle'],
                   'Updated_Price': [1250, 1450, 1550, 400],
                   'Discount': [10, 8, 15, 10]})
print(df)
Output:
        Date     Product  Updated_Price  Discount
0  10/2/2011    UMbreLla           1250        10
1  11/2/2011    maTtress           1450         8
2  12/2/2011  BaDmintoN            1550        15
3  13/2/2011     Shuttle            400        10
Creating custom function
We often run into situations that call for a function tailored to the task at hand. Here we write a custom function that loops over the rows of the DataFrame and cleans each product name in place.
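Python3
def Format_data(df):
    # Column position 1 is 'Product'; update each row in place
    for i in range(df.shape[0]):
        # strip() trims whitespace; capitalize() uppercases the first letter
        df.iat[i, 1] = df.iat[i, 1].strip().capitalize()


Format_data(df)
print(df)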
Output:
        Date    Product  Updated_Price  Discount
0  10/2/2011   Umbrella           1250        10
1  11/2/2011   Mattress           1450         8
2  12/2/2011  Badminton           1550        15
3  13/2/2011    Shuttle            400        10
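The function above hard-codes column position 1 and modifies the DataFrame it receives. As a minimal sketch of a more reusable variant, the version below looks the column up by name and returns a cleaned copy. The name format_column and its parameters are assumptions made for illustration, not part of the original function.

```python
import pandas as pd


def format_column(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Return a copy of df with the given column stripped and capitalized."""
    result = df.copy()                        # leave the caller's data untouched
    col_idx = result.columns.get_loc(column)  # resolve the position from the name
    for i in range(len(result)):
        result.iat[i, col_idx] = result.iat[i, col_idx].strip().capitalize()
    return result


# Hypothetical sample data mirroring the example in this post
raw = pd.DataFrame({'Product': [' UMbreLla', ' maTtress', 'BaDmintoN ', 'Shuttle'],
                    'Discount': [10, 8, 15, 10]})
clean = format_column(raw, 'Product')
print(clean['Product'].tolist())
```

Returning a copy keeps the raw values available for comparison, at the cost of some extra memory. This sketch assumes the column name is unique, so that get_loc() returns a single integer position.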
Using Pandas Apply Function
Now we will see a more concise approach, which is also generally faster than an explicit loop over rows, using the Pandas apply() function.
Selecting df['Product'] yields a Series, so the method called here is Series.apply(), which calls a function on every value in the column. We pass it a lambda function that strips and capitalizes each product name.
Python3
import pandas as pd

df = pd.DataFrame({'Date': ['10/2/2011', '11/2/2011', '12/2/2011', '13/2/2011'],
                   'Product': [' UMbreLla', ' maTtress', 'BaDmintoN ', 'Shuttle'],
                   'Updated_Price': [1250, 1450, 1550, 400],
                   'Discount': [10, 8, 15, 10]})
df['Product'] = df['Product'].apply(lambda x: x.strip().capitalize())
print(df)
Output:
        Date    Product  Updated_Price  Discount
0  10/2/2011   Umbrella           1250        10
1  11/2/2011   Mattress           1450         8
2  12/2/2011  Badminton           1550        15
3  13/2/2011    Shuttle            400        10
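The lambda above calls strip() on every value, so it would raise an AttributeError if the column contained a missing value. As a hedged alternative, the sketch below uses the vectorized .str accessor, which applies the same two operations and leaves missing entries as missing. The sample data, including the None entry, is an assumption added for illustration.

```python
import pandas as pd

# Hypothetical data with one missing product name
df = pd.DataFrame({'Product': [' UMbreLla', ' maTtress', None, 'BaDmintoN ']})

# Series.str exposes string methods element-wise and skips missing values
df['Product'] = df['Product'].str.strip().str.capitalize()
print(df)
```

Chaining .str.strip() and .str.capitalize() expresses the cleaning step without a lambda, which can be easier to read when several string operations are combined.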