Large datasets often contain text data, and that text is frequently messy. We need to clean it before we can do anything meaningful with it. The corpus is usually too large to list manually every string we want to replace, so in those cases we use regular expressions to handle text that follows a pattern.
A previous article discussed how to replace known string values in a dataframe. In this post, we will use regular expressions to replace strings that follow a pattern.
Using Dataframe.replace() Function
Problem #1: You are given a dataframe that contains the details about various events in different cities. For cities whose names start with ‘New’ or ‘new’, change that prefix to ‘New_’.
Solution: We are going to use a regular expression to detect such names, and then use the Dataframe.replace() function to replace them.
Python3
import pandas as pd

# Create the dataframe
df = pd.DataFrame({'City': ['New York', 'Parague', 'New Delhi', 'Venice', 'new Orleans'],
                   'Event': ['Music', 'Poetry', 'Theatre', 'Comedy', 'Tech_Summit'],
                   'Cost': [10000, 5000, 15000, 2000, 12000]})

# Create the index
index_ = [pd.Period('02-2018'), pd.Period('04-2018'),
          pd.Period('06-2018'), pd.Period('10-2018'), pd.Period('12-2018')]

# Set the index
df.index = index_
print(df)
Output :
                City        Event   Cost
2018-02     New York        Music  10000
2018-04      Parague       Poetry   5000
2018-06    New Delhi      Theatre  15000
2018-10       Venice       Comedy   2000
2018-12  new Orleans  Tech_Summit  12000
Now we will write a regular expression that matches ‘New’ or ‘new’ and pass it to the Dataframe.replace() function with regex = True to replace those names.
Python3
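# '^' anchors the match at the start of the string, so only values
# that begin with 'New' or 'new' are changed
df_updated = df.replace(to_replace=r'^[nN]ew', value='New_', regex=True)
print(df_updated)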
Output :
                 City        Event   Cost
2018-02     New_ York        Music  10000
2018-04       Parague       Poetry   5000
2018-06    New_ Delhi      Theatre  15000
2018-10        Venice       Comedy   2000
2018-12  New_ Orleans  Tech_Summit  12000
As we can see in the output, ‘New’ and ‘new’ at the start of each city name have been replaced with ‘New_’.
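Note that Dataframe.replace() with regex = True scans every string column, not only the city names. The following is a minimal sketch, assuming the same column names as above, of one way to restrict the replacement to the 'City' column with Series.str.replace(). The name df_city is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'City': ['New York', 'Parague', 'New Delhi', 'Venice', 'new Orleans'],
                   'Event': ['Music', 'Poetry', 'Theatre', 'Comedy', 'Tech_Summit'],
                   'Cost': [10000, 5000, 15000, 2000, 12000]})

# Work on a copy so the original dataframe is left untouched
# (df_city is a hypothetical name used only for this sketch)
df_city = df.copy()

# '^' anchors the match at the start of the string, so only names that
# begin with 'New' or 'new' change; the other columns are not scanned
df_city['City'] = df_city['City'].str.replace(r'^[nN]ew', 'New_', regex=True)

print(df_city['City'].tolist())
# ['New_ York', 'Parague', 'New_ Delhi', 'Venice', 'New_ Orleans']
```

Limiting the operation to one column avoids accidental changes if another text column, such as 'Event', happens to contain a value starting with 'New'.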
Problem #2: You are given a dataframe containing details about various events in different cities. The names of certain cities contain some additional details enclosed in a bracket. Search for such names and remove the additional details.
Solution: For this task, we will write our own customized function that uses a regular expression to identify and update the names of those cities. Additionally, we will use the Series.apply() function to apply our customized function to each value in the 'City' column.
Python3
import pandas as pd

# Create the dataframe
df = pd.DataFrame({'City': ['New York (City)', 'Parague', 'New Delhi (Delhi)', 'Venice', 'new Orleans'],
                   'Event': ['Music', 'Poetry', 'Theatre', 'Comedy', 'Tech_Summit'],
                   'Cost': [10000, 5000, 15000, 2000, 12000]})

# Create the index
index_ = [pd.Period('02-2018'), pd.Period('04-2018'),
          pd.Period('06-2018'), pd.Period('10-2018'), pd.Period('12-2018')]

# Set the index
df.index = index_
print(df)
Output :
                      City        Event   Cost
2018-02    New York (City)        Music  10000
2018-04            Parague       Poetry   5000
2018-06  New Delhi (Delhi)      Theatre  15000
2018-10             Venice       Comedy   2000
2018-12        new Orleans  Tech_Summit  12000
Now we will write our own customized function to match the description in the names of the cities.
Python3
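import re

def Clean_names(City_name):
    # Look for an opening bracket and everything that follows it
    match = re.search(r'\(.*', City_name)
    if match:
        # Keep the text before the bracket and drop the trailing space
        return City_name[:match.start()].rstrip()
    # No bracket found, so return the name unchanged
    return City_name

df['City'] = df['City'].apply(Clean_names)
print(df)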
Output :
                City        Event   Cost
2018-02     New York        Music  10000
2018-04      Parague       Poetry   5000
2018-06    New Delhi      Theatre  15000
2018-10       Venice       Comedy   2000
2018-12  new Orleans  Tech_Summit  12000
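The same cleanup can also be done without a custom function. The following is a minimal sketch, assuming the same 'City' values as above, that uses Series.str.replace() with a regular expression. It is an alternative to the apply() approach, not a replacement for it.

```python
import pandas as pd

df = pd.DataFrame({'City': ['New York (City)', 'Parague', 'New Delhi (Delhi)',
                            'Venice', 'new Orleans']})

# r'\s*\(.*' matches optional whitespace, an opening bracket,
# and everything after it; replacing the match with '' removes it
df['City'] = df['City'].str.replace(r'\s*\(.*', '', regex=True)

print(df['City'].tolist())
# ['New York', 'Parague', 'New Delhi', 'Venice', 'new Orleans']
```

Including `\s*` in the pattern removes the space before the bracket as well, so no separate strip step is needed.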
Last Updated : 31 Jul, 2023