String Munging In Pandas Dataframe
In this article, we are going to learn about String Munging In Pandas Dataframe. Munging is known as cleaning up anything which was messy by transforming them. In technical terms, we can say that transforming the data in the database into a useful form.
Example: “no-one@example.com”, becomes “no-one at example dot com”
Approach:
Step 1: import the library
Python3
import pandas as pd
import numpy as np
import re as re
|
Step 2: creating Dataframe
Now create a dictionary and pass it through pd.DataFrame to create a Dataframe.
Python3
raw_data = { "first_name" : [ "Jason" , "Molly" , "Tina" , "Jake" , "Amy" ],
"last_name" : [ "Miller" , "Jacobson" , "Ali" , "Milner" , "Cooze" ],
"email" : [ "jas203@gmail.com" , "momomolly@gmail.com" , np.NAN,
"battler@milner.com" , "Ames1234@yahoo.com" ]}
df = pd.DataFrame(raw_data, columns = [ "first_name" , "last_name" , "email" ])
print ()
print (df)
|
Step 3: Applying Different Munging Operation
First, check that in feature “email” which string contains “Gmail”.
Python3
print (df[ "email" ]. str .contains( "gmail" ))
|
Now we want to separate the email into parts such that characters before “@” becomes one string and after and before “.” becomes one. At last, the remaining becomes the one string.
Python3
pattern = "([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})"
print (df[ "email" ]. str .findall(pattern, flags = re.IGNORECASE))
|
Below is the implementation:
Python3
def ProjectPro_Ex_136():
print ()
print ( '**How we can do string munging in Pandas**' )
import pandas as pd
import numpy as np
import re as re
raw_data = { 'first_name' : [ 'Jason' , 'Molly' , 'Tina' , 'Jake' , 'Amy' ],
'last_name' : [ 'Miller' , 'Jacobson' , 'Ali' , 'Milner' , 'Cooze' ],
'email' : [ 'jas203@gmail.com' , 'momomolly@gmail.com' , np.NAN,
'battler@milner.com' , 'Ames1234@yahoo.com' ]}
df = pd.DataFrame(raw_data, columns = [ 'first_name' , 'last_name' , 'email' ])
print ()
print (df)
print ()
print (df[ 'email' ]. str .contains( 'gmail' ))
pattern = '([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\\.([A-Z]{2,4})'
print ()
print (df[ 'email' ]. str .findall(pattern, flags = re.IGNORECASE))
ProjectPro_Ex_136()
|
Output:
Last Updated :
03 Jan, 2021
Like Article
Save Article
Share your thoughts in the comments
Please Login to comment...