Split a String into columns using regex in pandas DataFrame

Last Updated : 08 Jan, 2019

Given some mixed data containing multiple values as a string, let’s see how can we divide the strings using regex and make multiple columns in Pandas DataFrame.

Method #1:
In this method we will use re.search(pattern, string, flags=0). Here pattern refers to the pattern that we want to search. It takes in a string with the following values:

\w matches alphanumeric characters
\d matches digits, which means 0-9
\s matches whitespace characters
\S matches non-whitespace characters
. matches any character except the new line character \n
* matches 0 or more instances of a pattern

# import the regex library 
import pandas as pd 
import re 
  
# Create a list with all the strings 
movie_data = ["Name: The_Godfather Year: 1972 Rating: 9.2", 
            "Name: Bird_Box Year: 2018 Rating: 6.8", 
            "Name: Fight_Club Year: 1999 Rating: 8.8"] 
  
# Create a dictionary with the required columns  
# Used later to convert to DataFrame 
movies = {"Name":[], "Year":[], "Rating":[]} 
  
for item in movie_data: 
      
    # For Name field 
    name_field = re.search("Name: .*",item) 
      
    if name_field is not None: 
        name = re.search('\w*\s\w*',name_field.group()) 
    else: 
        name = None
    movies["Name"].append(name.group()) 
      
    # For Year field 
    year_field = re.search("Year: .*",item) 
    if year_field is not None: 
        year = re.search('\s\d\d\d\d',year_field.group()) 
    else: 
        year = None
    movies["Year"].append(year.group().strip()) 
      
    # For rating field 
    rating_field = re.search("Rating: .*",item) 
    if rating_field is not None:  
        rating = re.search('\s\d.\d',rating_field.group()) 
    else:  
        rating - None
    movies["Rating"].append(rating.group().strip()) 
  
# Creating DataFrame 
df = pd.DataFrame(movies) 
print(df) 

Output:

Explanation:

In the code above, we use a for loop to iterate through movie data so we can work with each movie in turn. We create a dictionary, movies, that will hold all the details of each detail, such as the rating and name.
We then find the entire Name field using the re.search() function. The . means any character except \n, and * extends it to the end of the line. Assign this to the variable name_field.
But, data isn’t always straightforward. It can contain surprises. For instance, what if there’s no Name: field? The script would throw an error and break. We pre-empt errors from this scenario and check for a not None case.
Again we use the re.search() function to extract the final required string from the name_field. For the name we use \w* to represent the first word, \s to represent the space in between and \w* for the second word.
Do the same for year and rating and get the final required dictionary.

Method #2:
To break up the string we will use Series.str.extract(pat, flags=0, expand=True) function. Here pat refers to the pattern that we want to search for.

import pandas as pd 
  
dict = {'movie_data':['The Godfather 1972 9.2', 
                    'Bird Box 2018 6.8', 
                    'Fight Club 1999 8.8'] } 
  
# Convert the dictionary to a dataframe 
df = pd.DataFrame(dict) 
  
# Extract name from the string  
df['Name'] = df['movie_data'].str.extract('(\w*\s\w*)', expand=True) 
   
# Extract year from the string  
df['Year'] = df['movie_data'].str.extract('(\d\d\d\d)', expand=True) 
  
# Extract rating from the string  
df['Rating'] = df['movie_data'].str.extract('(\d\.\d)', expand=True) 
print(df) 

Output:

Suggest improvement

Getting Unique values from a column in Pandas dataframe

Getting frequency counts of a columns in Pandas DataFrame

Share your thoughts in the comments

Pandas DataFrame Practice Exercises

Pandas Dataframe Rows Practice Exercise

Pandas Dataframe Columns Practice Exercise

Pandas Series Practice Exercise

Pandas Date and Time Practice Exercise

DataFrame String Manipulation

Accessing and Manipulating Data in DataFrame

DataFrame Visualization and Exporting

Data Aggregation and Grouping

Merging and Joining

Filtering and Selecting Data

Select Rows With Multiple Filters in Pandas

Selection and Slicing

Miscellaneous DataFrame Operations

Data Cleaning and Manipulation

Concatenation and Manipulation

DataFrame Sorting and Reordering

DataFrame Transformation and Conversion

DataFrame Filtering and Selection

DataFrame Conversion and Reshaping

Split a String into columns using regex in pandas DataFrame

Please Login to comment...

Similar Reads

What kind of Experience do you want to share?