How to split a column with comma separated values in PySpark’s Dataframe?

Last Updated : 28 Dec, 2022

In this article, we are going to learn how to split a column with comma-separated values in a data frame in PySpark using Python.

This is a common data-preparation task: after ingesting raw data, we often find a column that holds several comma-separated values in each row, which is difficult to analyze or visualize directly. Separating that data into different columns first makes visualization and further processing much easier.
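
As a quick illustration of the goal (borrowing the data from Example 1 below), one column holding comma-separated names is turned into three separate columns:

Full_Name                         First Name | Middle Name | Last Name
'Arun, Kumar, Chaudhary'   -->    Arun       | Kumar       | Chaudhary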

Steps to split a column with comma-separated values in PySpark’s Dataframe

Below are the steps to split a column whose values are comma-separated.

Step 1: First of all, import the required names, i.e., SparkSession and functions. SparkSession is used to create the session, while the functions module gives access to all the built-in functions available for the data frame.

from pyspark.sql import SparkSession
from pyspark.sql import functions

Step 2: Now, create a Spark session using the getOrCreate() function.

spark_session = SparkSession.builder.getOrCreate()

Step 3: Then, create a Spark context from the session (it is used in Example 1 below to parallelize an in-memory dataset).

sc = spark_session.sparkContext

Step 4: Read the CSV file, or create the data frame using the createDataFrame() function.

data_frame = spark_session.read.csv('#Path of CSV file',
                                    sep=',', inferSchema=True, header=True)

or

data_frame = spark_session.createDataFrame(Declare_the_dataset)

Step 5: Split the column values on the comma separator and put the resulting array in a new column; functions.split() turns each comma-separated string into an array of values. Keep the identifying column (the placeholder 'Key_Column' here) alongside it.

df2 = data_frame.select('Key_Column', functions.split('column_to_be_split', ', ').alias('column_to_be_split'))

Step 6: Obtain the number of values in each row's array using the functions.size() function.

df_sizes = df2.select(functions.size('column_to_be_split').alias('column_to_be_split'))

Step 7: In this step, we get the maximum array size across all rows; this determines how many new columns are needed.

nb_columns = df_sizes.agg(functions.max('column_to_be_split')).collect()[0][0]

Step 8: Here, we expand the array column into separate columns in the data frame, keeping the key column. A row with fewer values than nb_columns gets null in the surplus columns.

split_df = df2.select('Key_Column', *[df2['column_to_be_split'][i] for i in range(nb_columns)])

Step 9: Next, create a list defining the column names which you want to give to the split columns.

names = ['column_name_1', 'column_name_2', 'column_name_3']

Step 10: Now, obtain all the column names of a data frame in a list.

total_columns = split_df.columns

Step 11: Then, run a loop to rename the split columns of the data frame.

for i in range(1, len(total_columns)):
    split_df = split_df.withColumnRenamed(total_columns[i], names[i - 1])
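
Alternatively, since the toDF() method renames all the columns of a data frame positionally, the renaming loop can be replaced with a single call, assuming the names list covers every generated column and the key column keeps its original name:

split_df = split_df.toDF(total_columns[0], *names)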

Step 12: Finally, display the updated data frame.

split_df.show()
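
If the number of comma-separated parts is fixed and known in advance, steps 6 to 11 can be collapsed: functions.split() returns an array column, and its getItem() method extracts each element directly. A minimal sketch of this shortcut, assuming a data frame with the 'Roll_no' and 'Full_Name' columns of Example 1 below:

name_parts = functions.split(data_frame['Full_Name'], ', ')
split_df = data_frame.select(
    'Roll_no',
    name_parts.getItem(0).alias('First Name'),
    name_parts.getItem(1).alias('Middle Name'),
    name_parts.getItem(2).alias('Last Name'))
split_df.show()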

Example 1:

In this example, we have created a data frame with one column, 'Full_Name', whose values consist of a First_Name, Middle_Name, and Last_Name separated by commas, as follows:

Roll_no | Full_Name
--------|-----------------------
1       | Arun, Kumar, Chaudhary
2       | Aniket, Singh, Rajpoot
3       | Ishita, Rai, Pundir

We split the 'Full_Name' column into an array of values, obtained the maximum array size across all rows, and expanded the array into separate columns by running the for loop over that size. Finally, we prepared the new column names in a list and renamed the generated columns accordingly.

Python3




# Python program to split a column
# with comma-separated values

# Import SparkSession and the functions module
from pyspark.sql import SparkSession
from pyspark.sql import functions

# Create a Spark session using the getOrCreate() function
spark_session = SparkSession.builder.getOrCreate()

# Create a Spark context
sc = spark_session.sparkContext

# Create the data frame using the createDataFrame function
df = spark_session.createDataFrame(
    sc.parallelize([['1', 'Arun, Kumar, Chaudhary'],
                    ['2', 'Aniket, Singh, Rajpoot'],
                    ['3', 'Ishita, Rai, Pundir']]),
    ["Roll_no", "Full_Name"])

# Split the column values on ', ' into an array
df2 = df.select('Roll_no',
                functions.split('Full_Name', ', ').alias('Full_Name'))

# Obtain the number of split values in each row
df_sizes = df2.select(functions.size('Full_Name').alias('Full_Name'))

# Get the maximum size among all rows
nb_columns = df_sizes.agg(functions.max('Full_Name')).collect()[0][0]

# Expand the array into separate columns, keeping the key column
split_df = df2.select('Roll_no',
                      *[df2['Full_Name'][i] for i in range(nb_columns)])

# Names to give to the split columns
names = ['First Name', 'Middle Name', 'Last Name']

# Get all the column names of the data frame in a list
total_columns = split_df.columns

# Run a loop to rename the split columns of the data frame
for i in range(1, len(total_columns)):
    split_df = split_df.withColumnRenamed(total_columns[i],
                                          names[i - 1])

# Display the split data frame
split_df.show()


Output:

+-------+----------+-----------+---------+
|Roll_no|First Name|Middle Name|Last Name|
+-------+----------+-----------+---------+
|      1|      Arun|      Kumar|Chaudhary|
|      2|    Aniket|      Singh|  Rajpoot|
|      3|    Ishita|        Rai|   Pundir|
+-------+----------+-----------+---------+

Example 2:

In this example, we have uploaded the CSV file (link), i.e., a 6×5 dataset in which one column, 'Car Details', holds multiple values separated by commas, as follows:

[Image: the input data frame read from car_data.csv, with a 'Car Id' column and a comma-separated 'Car Details' column]

We split the 'Car Details' column into an array of values, obtained the maximum array size across all rows, and expanded the array into separate columns by running the for loop over that size. Finally, we prepared the new column names in a list and renamed the generated columns accordingly.

Python3




# Python program to split a column
# with comma-separated values

# Import SparkSession and the functions module
from pyspark.sql import SparkSession
from pyspark.sql import functions

# Create a Spark session using the getOrCreate() function
spark_session = SparkSession.builder.getOrCreate()

# Read the CSV file
df = spark_session.read.csv('/content/car_data.csv',
                            sep=',', inferSchema=True, header=True)

# Split the column values on ', ' into an array
df2 = df.select('Car Id',
                functions.split('Car Details', ', ').alias('Car Details'))

# Obtain the number of split values in each row
df_sizes = df2.select(functions.size('Car Details').alias('Car Details'))

# Get the maximum size among all rows
nb_columns = df_sizes.agg(functions.max('Car Details')).collect()[0][0]

# Expand the array into separate columns, keeping the key column
split_df = df2.select('Car Id',
                      *[df2['Car Details'][i] for i in range(nb_columns)])

# Names to give to the split columns
names = ['Car Name', 'Door Count', 'Fuel Type', 'Car Type']

# Get all the column names of the data frame in a list
total_columns = split_df.columns

# Run a loop to rename the split columns of the data frame
for i in range(1, len(total_columns)):
    split_df = split_df.withColumnRenamed(total_columns[i], names[i - 1])

# Display the split data frame
split_df.show()


Output:

[Image: the split data frame with 'Car Id', 'Car Name', 'Door Count', 'Fuel Type' and 'Car Type' columns]
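
Note that this approach relies on the CSV file quoting the fields that contain commas; otherwise the reader would split 'Car Details' itself across several columns while parsing. Spark's CSV reader treats double quotes as the quoting character by default, so a row of car_data.csv would look something like this (the values shown are hypothetical):

Car Id,Car Details
1,"Swift, 4, Petrol, Hatchback"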