Split a List to Multiple Columns in Pyspark

Last Updated : 02 Jan, 2023

Have you ever had the data for several columns packed into a single column and wondered how to split it apart? This can be achieved in PySpark in several ways. In this article, we will discuss three of them.

Modules Required:

PySpark: PySpark is the Python API for Apache Spark, an open-source, distributed computing framework and set of libraries for real-time, large-scale data processing. The module can be installed through the following command in Python:

pip install pyspark
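
Once installed, you can confirm that the module imports correctly; a minimal sanity check (the printed version string depends on your installation):

import pyspark
print(pyspark.__version__)  # prints the installed PySpark version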

Methods to split a list into multiple columns in Pyspark:

  • Using expr in a list comprehension
  • Splitting data frame row-wise and appending in columns
  • Splitting data frame column-wise

Method 1: Using expr in a list comprehension

Step 1: First of all, import the required libraries: SparkSession, which is used to create the session; expr, an SQL function used to evaluate SQL-like expressions; and pyspark.sql.types, which provides all of PySpark's data types.

from pyspark.sql import SparkSession
from pyspark.sql.functions import expr 
from pyspark.sql.types import *

Step 2: Now, create a spark session using the getOrCreate function.

spark_session = SparkSession.builder.getOrCreate()

Step 3: Then, define the schema for creating the data frame with an array-typed column.

mySchema = StructType([StructField("Heading", StringType(), True),
                       StructField("Column", ArrayType(IntegerType(),True))])

Step 4: Later on, create the data frame that needs to be split into multiple columns.

data_frame = spark_session.createDataFrame([['column_heading1', [column1_data]],
                                            ['column_heading2', [column2_data]]],
                                            schema=mySchema)

Step 5: Finally, split the list into columns using the expr() function in a list comprehension.

data_frame.select([expr('Column[' + str(x) + ']') for x in range(0, number_of_columns)]).show()

Example:

In this example, we have defined the schema for the data frame, created the data frame from a list of data according to that schema, and finally split the array column into separate columns using the expr() function in a list comprehension.

Python3




# Python program to split a list into multiple columns
# in Pyspark using expr in a list comprehension
 
# Import the SparkSession, expr and types libraries
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr
from pyspark.sql.types import *
 
# Create a spark session using getOrCreate() function
spark_session = SparkSession.builder.getOrCreate()
 
# Define schema to create DataFrame with an array typed column.
mySchema = StructType([StructField("Heading", StringType(), True),
                       StructField("Column", ArrayType(IntegerType(), True))])
 
# Create the dataframe that needs to be split into multiple columns
data_frame = spark_session.createDataFrame(
    [['A', [1, 2, 3]], ['B', [4, 5, 6]], ['C', [7, 8, 9]]],
    schema=mySchema)
 
# Split the list into columns using 'expr()' in a list comprehension
data_frame.select([expr('Column[' + str(x) + ']') for x in range(0, 3)]).show()


Output:

+---------+---------+---------+
|Column[0]|Column[1]|Column[2]|
+---------+---------+---------+
|        1|        2|        3|
|        4|        5|        6|
|        7|        8|        9|
+---------+---------+---------+
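
The headers Column[0], Column[1], ... are taken verbatim from the expressions. If you want friendlier names, expr() also accepts an SQL alias inside the expression string; a minimal variant of the select above (the col_ prefix is just an illustrative choice):

data_frame.select([expr('Column[{0}] as col_{0}'.format(x))
                   for x in range(0, 3)]).show()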

Method 2: Splitting data frame row-wise and appending in columns

Step 1: First of all, import the required libraries: SparkSession, which is used to create the session; Row, which represents a row of the data frame; and col, which refers to a column of the data frame.

from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.sql.functions import col

Step 2: Now, create a spark session using the getOrCreate function.

spark_session = SparkSession.builder.getOrCreate()

Step 3: Then, declare an array that you need to split into multiple columns.

arr = [[row1_data], [row2_data], [row3_data]]

Step 4: Later on, create the data frame with one row per inner list.

data_frame = spark_session.createDataFrame([Row(index=1, finalArray = arr[0]),
                                             Row(index=2, finalArray = arr[1]),
                                             Row(index=3, finalArray = arr[2])])

Step 5: Finally, append the columns to the data frame.

data_frame.select([(col("finalArray")[x]).alias("Column "+str(x+1)) for x in range(0, 3)]).show()

Example:

In this example, we have declared a nested list, created a data frame with one row per inner list, and then expanded each row's array into separate columns for display.

Python3




# Python program to split a list into multiple columns in Pyspark by
# splitting the data frame row-wise and appending in columns
 
# Import the SparkSession, Row and col libraries
from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.sql.functions import col
 
# Create a spark session using getOrCreate() function
spark_session = SparkSession.builder.getOrCreate()
 
# Declare the list
arr = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
 
# Create the dataframe with one row per inner list
data_frame = spark_session.createDataFrame(
    [Row(index=1, finalArray=arr[0]),
     Row(index=2, finalArray=arr[1]),
     Row(index=3, finalArray=arr[2])])
 
# Split each row's array into separate columns
data_frame.select([(col("finalArray")[x]).alias("Column "+str(x+1))
                   for x in range(0, 3)]).show()


Output:

+--------+--------+--------+
|Column 1|Column 2|Column 3|
+--------+--------+--------+
|       1|       2|       3|
|       4|       5|       6|
|       7|       8|       9|
+--------+--------+--------+
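
Note that range(0, 3) hard-codes the number of columns. If every array has the same length, one way to avoid the hard-coded count is to read the length from the first row; a small sketch under that equal-length assumption:

n = len(data_frame.first()["finalArray"])
data_frame.select([(col("finalArray")[x]).alias("Column " + str(x + 1))
                   for x in range(n)]).show()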

Method 3: Splitting data frame column-wise

Step 1: First of all, import the required library, SparkSession, which is used to create the session.

from pyspark.sql import SparkSession

Step 2: Now, create a spark session using the getOrCreate function.

spark_session = SparkSession.builder.getOrCreate()

Step 3: Then, obtain a Spark context from the session.

sc = spark_session.sparkContext

Step 4: Later on, create the data frame that needs to be split into multiple columns.

data_frame = spark_session.createDataFrame(sc.parallelize([['column_heading1', [column1_data]],
                                                            ['column_heading2', [column2_data]]]),
                                                            ["key", "value"])

Step 5: Finally, split the data frame column-wise.

data_frame.select("key", data_frame.value[0], data_frame.value[1], data_frame.value[2]).show()

Example:

In this example, we have parallelized a nested list with the Spark context, created a data frame from it, and then split the array column into multiple columns for display.

Python3




# Python program to split a list into multiple columns
# in Pyspark by splitting the data frame column-wise
 
# Import the SparkSession library
from pyspark.sql import SparkSession
 
# Create a spark session using getOrCreate() function
spark_session = SparkSession.builder.getOrCreate()
 
# Create a spark context
sc = spark_session.sparkContext
 
# Create the dataframe from the list that needs to be split into multiple columns
data_frame = spark_session.createDataFrame(
    sc.parallelize([['Column 1', [1, 2, 3]],
                    ['Column 2', [4, 5, 6]],
                    ['Column 3', [7, 8, 9]]]),
    ["key", "value"])
 
# Splitting the dataframe column-wise
data_frame.select(
    "key", data_frame.value[0], data_frame.value[1],
    data_frame.value[2]).show()


Output:

+--------+--------+--------+--------+
|     key|value[0]|value[1]|value[2]|
+--------+--------+--------+--------+
|Column 1|       1|       2|       3|
|Column 2|       4|       5|       6|
|Column 3|       7|       8|       9|
+--------+--------+--------+--------+
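
When the array lengths are not known up front, an alternative worth knowing (not one of the methods above) is to explode each array together with element positions and then pivot the positions back into columns; a rough sketch using posexplode, which emits pos and col columns:

from pyspark.sql.functions import first, posexplode

exploded = data_frame.select("key", posexplode("value"))
exploded.groupBy("key").pivot("pos").agg(first("col")).show()

Here each output column is named after its position (0, 1, 2), and the values are matched back up by key.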

