Open In App
Related Articles

Split a String into columns using regex in pandas DataFrame

Like Article
Save Article
Report issue
Given some mixed data containing multiple values as a string, let’s see how can we divide the strings using regex and make multiple columns in Pandas DataFrame. Method #1: In this method we will use, string, flags=0). Here pattern refers to the pattern that we want to search. It takes in a string with the following values:
  • \w matches alphanumeric characters
  • \d matches digits, which means 0-9
  • \s matches whitespace characters
  • \S matches non-whitespace characters
  • . matches any character except the new line character \n
  • * matches 0 or more instances of a pattern
# import the regex library
import pandas as pd
import re
# Create a list with all the strings
movie_data = ["Name: The_Godfather Year: 1972 Rating: 9.2",
            "Name: Bird_Box Year: 2018 Rating: 6.8",
            "Name: Fight_Club Year: 1999 Rating: 8.8"]
# Create a dictionary with the required columns 
# Used later to convert to DataFrame
movies = {"Name":[], "Year":[], "Rating":[]}
for item in movie_data:
    # For Name field
    name_field ="Name: .*",item)
    if name_field is not None:
        name ='\w*\s\w*',
        name = None
    # For Year field
    year_field ="Year: .*",item)
    if year_field is not None:
        year ='\s\d\d\d\d',
        year = None
    # For rating field
    rating_field ="Rating: .*",item)
    if rating_field is not None
        rating ='\s\d.\d',
        rating - None
# Creating DataFrame
df = pd.DataFrame(movies)

Output: Explanation:
  • In the code above, we use a for loop to iterate through movie data so we can work with each movie in turn. We create a dictionary, movies, that will hold all the details of each detail, such as the rating and name.
  • We then find the entire Name field using the function. The . means any character except \n, and * extends it to the end of the line. Assign this to the variable name_field.
  • But, data isn’t always straightforward. It can contain surprises. For instance, what if there’s no Name: field? The script would throw an error and break. We pre-empt errors from this scenario and check for a not None case.
  • Again we use the function to extract the final required string from the name_field. For the name we use \w* to represent the first word, \s to represent the space in between and \w* for the second word.
  • Do the same for year and rating and get the final required dictionary.
  Method #2: To break up the string we will use Series.str.extract(pat, flags=0, expand=True) function. Here pat refers to the pattern that we want to search for.
import pandas as pd
dict = {'movie_data':['The Godfather 1972 9.2',
                    'Bird Box 2018 6.8',
                    'Fight Club 1999 8.8'] }
# Convert the dictionary to a dataframe
df = pd.DataFrame(dict)
# Extract name from the string 
df['Name'] = df['movie_data'].str.extract('(\w*\s\w*)', expand=True)
# Extract year from the string 
df['Year'] = df['movie_data'].str.extract('(\d\d\d\d)', expand=True)
# Extract rating from the string 
df['Rating'] = df['movie_data'].str.extract('(\d\.\d)', expand=True)


Last Updated : 08 Jan, 2019
Like Article
Save Article
Share your thoughts in the comments
Similar Reads