String Munging In Pandas Dataframe
  • Last Updated : 03 Jan, 2021

In this article, we look at string munging in a pandas DataFrame. Munging means cleaning up messy data by transforming it. In technical terms, it is the process of converting raw data into a more useful form.

Example: an email address becomes “no-one at example dot com”.
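
As a minimal sketch of the transformation in this example, assuming the input address is `no-one@example.com` (a hypothetical value, since the original address is not shown here), the rewrite can be done with two literal replacements:

```python
import pandas as pd

# Hypothetical input address, used for illustration only
emails = pd.Series(["no-one@example.com"])

# Replace the literal "@" and "." characters; regex=False keeps "."
# from being treated as a regular-expression wildcard
spoken = (emails.str.replace("@", " at ", regex=False)
                .str.replace(".", " dot ", regex=False))
print(spoken[0])
```

Passing `regex=False` is a deliberate choice in this sketch: with a regular expression, an unescaped `.` would match every character.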


Step 1: Import the libraries


import pandas as pd
import numpy as np
import re

Step 2: Create the DataFrame

Now create a dictionary and pass it to pd.DataFrame to create a DataFrame.


raw_data = {"first_name": ["Jason", "Molly", "Tina", "Jake", "Amy"],
            "last_name": ["Miller", "Jacobson", "Ali", "Milner", "Cooze"],
            "email": ["", "", np.nan,
                      "", ""]}
df = pd.DataFrame(raw_data, columns=["first_name", "last_name", "email"])

Step 3: Apply different munging operations

First, check which strings in the “email” column contain “gmail”.
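
One way to run this check is `Series.str.contains`. The sketch below is an illustration under assumptions: the addresses are hypothetical sample values rather than the article's data, and the `case=False` and `na=False` options are choices made for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical addresses for illustration; not the article's data
emails = pd.Series(["jason@gmail.com", "molly@example.org",
                    np.nan, "jake@gmail.com"])

# case=False ignores letter case; na=False reports missing values as False
has_gmail = emails.str.contains("gmail", case=False, na=False)
print(has_gmail.tolist())
```

Without `na=False`, the missing entry would stay missing in the result instead of becoming `False`, which matters if the result is used as a boolean mask.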



Next, we split each email into three parts: the characters before “@”, the domain name between “@” and the last “.”, and the remaining characters after that “.”.


pattern = r"([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})"
print(df["email"].str.findall(pattern, flags=re.IGNORECASE))
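
To see what this pattern returns, the sketch below applies it to a few hypothetical addresses (assumed sample values, not the article's data). It also shows `Series.str.extract` as an alternative that places each capture group in its own column. The column names `user`, `domain` and `suffix` are labels chosen for this example.

```python
import re

import numpy as np
import pandas as pd

# Hypothetical addresses for illustration; not the article's data
emails = pd.Series(["jason@gmail.com", "molly@example.org", np.nan])

# Raw string so that "\." reaches the regex engine unchanged
pattern = r"([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})"

# findall returns, for each row, a list of (user, domain, suffix) tuples
matches = emails.str.findall(pattern, flags=re.IGNORECASE)

# extract returns one column per capture group instead
parts = emails.str.extract(pattern, flags=re.IGNORECASE)
parts.columns = ["user", "domain", "suffix"]
print(parts)
```

Missing values stay missing in both results. Note that `{2,4}` limits the final group to two to four letters, so longer suffixes would only be captured in part.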

Below is the implementation:


def ProjectPro_Ex_136():
    print('**How we can do string munging in Pandas**')
    # loading libraries
    import pandas as pd
    import numpy as np
    import re
    # Creating dataframe
    raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
                'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'],
                'email': ['', '', np.nan,
                          '', '']}
    df = pd.DataFrame(raw_data, columns=['first_name', 'last_name', 'email'])
    # Check which strings in the
    # email column contain 'gmail'
    print(df['email'].str.contains('gmail'))
    # Create a regular expression pattern that
    # breaks apart emails
    pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
    # Find everything in the column that
    # matches that pattern
    print(df['email'].str.findall(pattern, flags=re.IGNORECASE))

