Spark dataframe – Split struct column into two columns

Last Updated : 05 Feb, 2023

In this article, we are going to learn how to split the struct column into two columns using PySpark in Python.

Spark is an open-source, distributed processing system that is widely used for big data workloads. It is designed to be fast, easy to use, and flexible, and it provides a wide range of functionality for data processing, including data transformation, aggregation, and analysis.

What is a data frame?

In the context of Spark, a data frame is a distributed collection of data organized into rows and columns. It is similar to a table in a traditional relational database, but it is distributed across a cluster of machines and is designed to handle large amounts of data efficiently. DataFrames in Spark can be created from a variety of sources, including structured and semi-structured data stored in databases, flat files, and streams. They can be manipulated using a rich set of functions and APIs, and they can be used to perform various types of data processing tasks, such as filtering, aggregating, and transforming data.

What is a struct?

In Spark, a struct is a complex data type that allows the storage of multiple fields together within a single column. The fields within a struct can be of different data types and can be nested as well. Structs are similar to structs in C or structs in Python.

A struct column in a DataFrame is defined using the StructType class and its fields are defined using the StructField class. Each field within a struct column has a name, data type, and a Boolean flag indicating whether the field is nullable or not.

For example, a struct column named “address” with fields “city“, and “zip” can be defined as:

StructType(List(
 StructField("city", StringType, true),
 StructField("zip", IntegerType, true)
))

In this example, “address” is the struct column name, city and zip are the fields of the struct column, their respective data types are StringType and IntegerType, and all the fields are nullable.

Why Split the struct column into two columns in a DataFrame?

There are a few reasons why we might want to split a struct column into multiple columns in a DataFrame:

Ease of Use: Struct columns can be difficult to work with, especially when we need to access individual fields within the struct. Splitting the struct column into separate columns makes it easier to access and manipulate the data.

Performance: When working with large datasets, accessing individual fields within a struct can be slow. Splitting the struct column into separate columns allows Spark to access the fields directly and can improve performance.

Joining: Joining data frames on struct columns can be challenging. Splitting the struct column into separate columns allows for more flexibility when joining DataFrames.

Data Analysis: Some data analysis tools are not able to handle struct columns and require the data to be in separate columns. Splitting the struct column allows to use these tools more easily.

Data Governance: Some data governance policies require data to be stored in a normalized format, which means the struct column must be split into multiple columns in order to comply with the policy.

Problem Description:

There may be a case in which we have a data frame in which struct columns are present with multiple values and we need to split that column into two for data processing. For example, consider a DataFrame that contains customer information, with a struct column named “address” that contains the fields “city”, and “zip”. However, it is difficult to work with the data in this format, especially when we need to access individual fields within the struct, or when we need to join this data frame with another data frame on the “city” field or for data processing. So, to do so we need to split that data into two columns.

Spark dataframe - Split struct column into two columns

The desired outcome is to split the struct column “address” into two separate columns, one for each field. The resulting data frame would look like this:

Splitting struct column into two columns using PySpark

To perform the splitting on the struct column firstly we create a data frame with the struct column which has multiple values and then split that column into two columns. Below is the stepwise implementation to do so.

Step 1: The first line imports the SparkSession class from the pyspark.sql module, which is used to create a SparkSession

Step 2: The second line imports the StructType, StructField, StringType, and IntegerType classes from the pyspark.sql.types module, which is used to define the schema of the DataFrame

Step 3: Then we create a SparkSession using the builder.appName(“SplitStructExample”).getOrCreate() function.

Step 4: Next, we create a sample data frame that consists of two records and each record contains name and address fields where the address field is a tuple.

Step 5: The schema of the DataFrame is defined using the StructType, and StructField classes.

Step 6: The createDataFrame() method is used to create a DataFrame from the data and schema.

Step 7: Then, we use the select method along with the alias() function to split the struct column “address” into separate columns “city” and “zip”.

Step 8: Finally, the show method is used to display the new data frame which contains separate columns for the struct fields.

Python3

from pyspark.sql import SparkSession 
from pyspark.sql.types import *
  
# create a SparkSession 
spark = SparkSession.builder.appName("SplitStructExample").getOrCreate() 
  
# create a sample data 
data = [("Alice", ("NYC", 10001)),  
        ("Bob", ("NYC", 10002))] 
  
# define the schema of the DataFrame 
schema = StructType([ 
    StructField("name", StringType()), 
    StructField("address", StructType([ 
        StructField("city", StringType()), 
        StructField("zip", IntegerType()) 
    ])) 
]) 
  
# create DataFrame using createDataFrame method 
df = spark.createDataFrame(data, schema) 
print("Data frame before splitting:") 
df.show() 
  
# Split struct column into separate columns 
df2 = df.select("name", df["address.city"].alias("city"), df["address.zip"].alias("zip")) 
  
# show the new DataFrame 
print("Data frame after splitting:") 
df2.show() 

Output:

Data frame before splitting:
+-----+------------+
| name|     address|
+-----+------------+
|Alice|{NYC, 10001}|
|  Bob|{NYC, 10002}|
+-----+------------+

Data frame after splitting:
+-----+----+-----+
| name|city|  zip|
+-----+----+-----+
|Alice| NYC|10001|
|  Bob| NYC|10002|
+-----+----+-----+

Suggest improvement

Split single column into multiple columns in PySpark DataFrame

Share your thoughts in the comments