How to split a column with comma separated values in PySpark’s Dataframe?

Last Updated : 28 Dec, 2022

In this article, we are going to learn how to split a column with comma-separated values in a data frame in PySpark using Python.

This is a common data-preparation task: after ingesting raw data, we often find a column that holds several comma-separated values in each row, which is difficult to analyze or visualize directly. Separating that data into different columns first makes visualization and further processing much easier.
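
As a quick illustration of the goal (borrowing the data from Example 1 below), one column holding comma-separated names is turned into three separate columns:

Full_Name                         First Name | Middle Name | Last Name
'Arun, Kumar, Chaudhary'   -->    Arun       | Kumar       | Chaudhary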

Steps to split a column with comma-separated values in PySpark’s Dataframe

Below are the steps to split a column whose values are comma-separated.

Step 1: First of all, import the required names, i.e., SparkSession and functions. SparkSession is used to create the session, while the functions module gives access to all the built-in functions available for the data frame.

from pyspark.sql import SparkSession
from pyspark.sql import functions

Step 2: Now, create a Spark session using the getOrCreate() function.

spark_session = SparkSession.builder.getOrCreate()

Step 3: Then, create a Spark context from the session (it is used in Example 1 below to parallelize an in-memory dataset).

sc = spark_session.sparkContext

Step 4: Read the CSV file, or create the data frame using the createDataFrame() function.

data_frame = spark_session.read.csv('#Path of CSV file',
                                    sep=',', inferSchema=True, header=True)

or

data_frame = spark_session.createDataFrame(Declare_the_dataset)

Step 5: Split the column values on the comma separator and put the resulting array in a new column; functions.split() turns each comma-separated string into an array of values. Keep the identifying column (the placeholder 'Key_Column' here) alongside it.

df2 = data_frame.select('Key_Column', functions.split('column_to_be_split', ', ').alias('column_to_be_split'))

Step 6: Obtain the number of values in each row's array using the functions.size() function.

df_sizes = df2.select(functions.size('column_to_be_split').alias('column_to_be_split'))

Step 7: In this step, we get the maximum array size across all rows; this determines how many new columns are needed.

nb_columns = df_sizes.agg(functions.max('column_to_be_split')).collect()[0][0]

Step 8: Here, we expand the array column into separate columns in the data frame, keeping the key column. A row with fewer values than nb_columns gets null in the surplus columns.

split_df = df2.select('Key_Column', *[df2['column_to_be_split'][i] for i in range(nb_columns)])

Step 9: Next, create a list defining the column names which you want to give to the split columns.

names = ['column_name_1', 'column_name_2', 'column_name_3']

Step 10: Now, obtain all the column names of a data frame in a list.

total_columns = split_df.columns

Step 11: Then, run a loop to rename the split columns of the data frame.

for i in range(1, len(total_columns)):
    split_df = split_df.withColumnRenamed(total_columns[i], names[i - 1])
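
Alternatively, since the toDF() method renames all the columns of a data frame positionally, the renaming loop can be replaced with a single call, assuming the names list covers every generated column and the key column keeps its original name:

split_df = split_df.toDF(total_columns[0], *names)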

Step 12: Finally, display the updated data frame.

split_df.show()
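
If the number of comma-separated parts is fixed and known in advance, steps 6 to 11 can be collapsed: functions.split() returns an array column, and its getItem() method extracts each element directly. A minimal sketch of this shortcut, assuming a data frame with the 'Roll_no' and 'Full_Name' columns of Example 1 below:

name_parts = functions.split(data_frame['Full_Name'], ', ')
split_df = data_frame.select(
    'Roll_no',
    name_parts.getItem(0).alias('First Name'),
    name_parts.getItem(1).alias('Middle Name'),
    name_parts.getItem(2).alias('Last Name'))
split_df.show()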

Example 1:

In this example, we have created a data frame with one column, 'Full_Name', whose values consist of a First_Name, Middle_Name, and Last_Name separated by commas, as follows:

Roll_no | Full_Name
--------|-----------------------
1       | Arun, Kumar, Chaudhary
2       | Aniket, Singh, Rajpoot
3       | Ishita, Rai, Pundir

We split the 'Full_Name' column into an array of values, obtained the maximum array size across all rows, and expanded the array into separate columns by running the for loop over that size. Finally, we prepared the new column names in a list and renamed the generated columns accordingly.

Python3




# Python program to split a column
# with comma-separated values

# Import SparkSession and the functions module
from pyspark.sql import SparkSession
from pyspark.sql import functions

# Create a Spark session using the getOrCreate() function
spark_session = SparkSession.builder.getOrCreate()

# Create a Spark context
sc = spark_session.sparkContext

# Create the data frame using the createDataFrame function
df = spark_session.createDataFrame(
    sc.parallelize([['1', 'Arun, Kumar, Chaudhary'],
                    ['2', 'Aniket, Singh, Rajpoot'],
                    ['3', 'Ishita, Rai, Pundir']]),
    ["Roll_no", "Full_Name"])

# Split the column values on ', ' into an array
df2 = df.select('Roll_no',
                functions.split('Full_Name', ', ').alias('Full_Name'))

# Obtain the number of split values in each row
df_sizes = df2.select(functions.size('Full_Name').alias('Full_Name'))

# Get the maximum size among all rows
nb_columns = df_sizes.agg(functions.max('Full_Name')).collect()[0][0]

# Expand the array into separate columns, keeping the key column
split_df = df2.select('Roll_no',
                      *[df2['Full_Name'][i] for i in range(nb_columns)])

# Names to give to the split columns
names = ['First Name', 'Middle Name', 'Last Name']

# Get all the column names of the data frame in a list
total_columns = split_df.columns

# Run a loop to rename the split columns of the data frame
for i in range(1, len(total_columns)):
    split_df = split_df.withColumnRenamed(total_columns[i],
                                          names[i - 1])

# Display the split data frame
split_df.show()


Output:

+-------+----------+-----------+---------+
|Roll_no|First Name|Middle Name|Last Name|
+-------+----------+-----------+---------+
|      1|      Arun|      Kumar|Chaudhary|
|      2|    Aniket|      Singh|  Rajpoot|
|      3|    Ishita|        Rai|   Pundir|
+-------+----------+-----------+---------+

Example 2:

In this example, we have uploaded the CSV file (link), i.e., a 6×5 dataset in which one column, 'Car Details', holds multiple values separated by commas, as follows:

[Image: the input data frame read from car_data.csv, with a 'Car Id' column and a comma-separated 'Car Details' column]

We split the 'Car Details' column into an array of values, obtained the maximum array size across all rows, and expanded the array into separate columns by running the for loop over that size. Finally, we prepared the new column names in a list and renamed the generated columns accordingly.

Python3




# Python program to split a column
# with comma-separated values

# Import SparkSession and the functions module
from pyspark.sql import SparkSession
from pyspark.sql import functions

# Create a Spark session using the getOrCreate() function
spark_session = SparkSession.builder.getOrCreate()

# Read the CSV file
df = spark_session.read.csv('/content/car_data.csv',
                            sep=',', inferSchema=True, header=True)

# Split the column values on ', ' into an array
df2 = df.select('Car Id',
                functions.split('Car Details', ', ').alias('Car Details'))

# Obtain the number of split values in each row
df_sizes = df2.select(functions.size('Car Details').alias('Car Details'))

# Get the maximum size among all rows
nb_columns = df_sizes.agg(functions.max('Car Details')).collect()[0][0]

# Expand the array into separate columns, keeping the key column
split_df = df2.select('Car Id',
                      *[df2['Car Details'][i] for i in range(nb_columns)])

# Names to give to the split columns
names = ['Car Name', 'Door Count', 'Fuel Type', 'Car Type']

# Get all the column names of the data frame in a list
total_columns = split_df.columns

# Run a loop to rename the split columns of the data frame
for i in range(1, len(total_columns)):
    split_df = split_df.withColumnRenamed(total_columns[i], names[i - 1])

# Display the split data frame
split_df.show()


Output:

[Image: the split data frame with 'Car Id', 'Car Name', 'Door Count', 'Fuel Type' and 'Car Type' columns]
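
Note that this approach relies on the CSV file quoting the fields that contain commas; otherwise the reader would split 'Car Details' itself across several columns while parsing. Spark's CSV reader treats double quotes as the quoting character by default, so a row of car_data.csv would look something like this (the values shown are hypothetical):

Car Id,Car Details
1,"Swift, 4, Petrol, Hatchback"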