
Custom row (List of CustomTypes) to PySpark dataframe

Last Updated : 05 Feb, 2023

In this article, we are going to learn how to convert a custom row (a list of custom-type objects) into a PySpark data frame in Python.

We will explore how to create a PySpark data frame from a list of custom objects, where each object represents a row in the data frame. PySpark data frames are a powerful and efficient tool for working with large datasets in a distributed computing environment. They are similar to a table in a relational database or a data frame in R or Python. By creating a data frame from a list of custom objects, we can easily convert structured data into a format that can be analyzed and processed using PySpark’s built-in functions and libraries.

Syntax of the CustomType class used to create a PySpark data frame:

class CustomType:
    def __init__(self, name, age, salary):
        self.name = name
        self.age = age
        self.salary = salary

Explanation: 

  • The keyword class is used to define a new class.
  • CustomType is the name of the class.
  • Inside the class block, we have a special method called __init__, which is used to initialize the object when it is created. The __init__ method takes three arguments: name, age, and salary, and assigns them to the object’s properties with the same name.
  • self is a reference to the object itself, which is passed to the method automatically when the object is created.
  • The properties name, age, and salary are defined using the self.property_name = value notation (see the short example after this list).
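As a quick illustration (the values here are arbitrary example data), an object of this class can be created and its attributes read back like those of any ordinary Python object:

# Create a CustomType instance and access its attributes
person = CustomType("Alice", 28, 4500)
print(person.name, person.age, person.salary)  # Alice 28 4500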

Approach 1: 

In the example below, we create a PySpark data frame from a list of custom objects, where each object represents a row in the data frame. The custom objects contain information about a person: their name, age, and salary. We first convert the list of custom objects to a list of Row objects using a list comprehension, and then create a data frame from the list of Row objects using the createDataFrame() method.

Step 1: The first line imports the Row class from the pyspark.sql module, which is used to create a row object for the data frame.

Step 2: A custom class called CustomType is defined with a constructor that takes in three parameters: name, age, and salary. These will represent the columns of the data frame.

Step 3: A list of CustomType objects is created with three instances, each with a different name, age, and salary.

Step 4: A list comprehension is used to convert the list of CustomType objects into a list of Row objects, where each CustomType object is mapped to a Row object with the same name, age, and salary.

Step 5: The createDataFrame() method is called on the SparkSession object (spark) with the list of Row objects as input, creating a DataFrame.

Step 6: The data frame is displayed using the show() method.

Python3
# Importing required modules
from pyspark.sql import Row
from pyspark.sql import SparkSession
  
# Create a SparkSession
spark = SparkSession.builder.appName("MyApp").getOrCreate()
# Define a custom class to represent a row in the dataframe
class CustomType:
    def __init__(self, name, age, salary):
        self.name = name
        self.age = age
        self.salary = salary
  
# Create a list of CustomType objects
data = [CustomType("John", 30, 5000),
        CustomType("Mary", 25, 6000),
        CustomType("Mike", 35, 7000)]
  
# Convert the list of CustomType
# objects to a list of Row objects
rows = [Row(name=d.name, age=d.age, salary=d.salary) for d in data]
  
# Create a dataframe from the list of Row objects
df = spark.createDataFrame(rows)
  
# Show the dataframe
df.show()


Output:

[Output image: a data frame with three rows and the columns name, age, and salary]
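As an optional follow-up, the column names and types that Spark inferred from the Row objects can be checked with the standard printSchema() method on the resulting data frame:

# Print the schema Spark inferred for the data frame
df.printSchema()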

 

Approach 2: 

In this example, we convert the list of custom objects directly to an RDD and then convert it to a data frame using the createDataFrame() method.

Step 1: The first line imports the Row class from the pyspark.sql module, although it is not actually used in the code below.

Step 2: A custom class called CustomType is defined with a constructor that takes in three parameters: name, age, and salary. These will represent the columns of the data frame.

Step 3: A list of CustomType objects is created with three instances, each with a different name, age, and salary.

Step 4: The parallelize() method of the SparkContext is called with the list of CustomType objects as input, creating an RDD (Resilient Distributed Dataset).

Step 5: The createDataFrame() method is called on the SparkSession object (spark) with the RDD as input, creating a DataFrame.

Step 6: The data frame is displayed using the show() method.

Python3
from pyspark.sql import Row
from pyspark.sql import SparkSession
# Create a SparkSession
spark = SparkSession.builder.appName("Metadata").getOrCreate()
# Define a custom class to represent a row in the dataframe
class CustomType:
    def __init__(self, name, age, salary):
        self.name = name
        self.age = age
        self.salary = salary
  
# Create a list of CustomType objects
data = [CustomType("John", 30, 5000),
        CustomType("Mary", 25, 6000),
        CustomType("Mike", 35, 7000)]
  
# Convert the list of CustomType objects into an RDD
rdd = spark.sparkContext.parallelize(data)
  
# Create a dataframe from the rdd
df = spark.createDataFrame(rdd)
  
# Show the dataframe
df.show()


Output:

[Output image: a data frame with three rows and the columns name, age, and salary]
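If you prefer to make the field names explicit rather than letting Spark infer them from the objects' attributes, one common variation (a minimal sketch; df2 is just an illustrative variable name) is to map the RDD of CustomType objects to an RDD of Row objects before creating the data frame, which also puts the imported Row class to use:

# Map each CustomType object to a Row with explicit field names
row_rdd = rdd.map(lambda d: Row(name=d.name, age=d.age, salary=d.salary))

# Build and display the data frame from the RDD of Row objects
df2 = spark.createDataFrame(row_rdd)
df2.show()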

 

Approach 3:

In this approach, we first define the schema for the data frame using the StructType class, creating three fields, name, age, and salary, with the types StringType, IntegerType, and IntegerType respectively. Then we create a list of custom objects, where each object is a Python dictionary whose keys correspond to the field names in the schema. Finally, we use the createDataFrame() method with the list of custom objects and the schema to create the data frame, and display it using the show() method.

Step 1: Define the schema for the data frame using the StructType class. This class lets you define the structure and types of the columns in the data frame; the name and type of each column are specified with the StructField class.
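For clarity, each StructField takes three arguments: the column name, the column's data type, and a boolean flag indicating whether the column is allowed to contain null values. A minimal example:

# The three arguments of StructField: column name, data type,
# and nullability (True means null values are allowed)
from pyspark.sql.types import StructField, IntegerType
salary_field = StructField("salary", IntegerType(), True)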

Step 2: Create a list of custom objects: The custom objects can be in the form of Python dictionaries, where each dictionary represents a row in the data frame and the keys of the dictionary correspond to the column names defined in the schema.

Step 3: Create the data frame: Use the createDataFrame method and pass in the list of custom objects and the schema to create the data frame.

Step 4: Show the data frame: To display the data frame, use the show() method on the data frame object.

Python3
# Importing required modules
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
  
# Create a SparkSession
spark = SparkSession.builder.appName("Myapp").getOrCreate()
  
# step 1: Define the schema for the dataframe
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("salary", IntegerType(), True)
])
  
# step 2: Create a list of custom objects
data = [{"name": "John", "age": 30, "salary": 5000},
        {"name": "Mary", "age": 25, "salary": 6000},
        {"name": "Mike", "age": 35, "salary": 7000}]
  
# step 3: Create the dataframe
df = spark.createDataFrame(data, schema)
  
# step 4: Show the dataframe
df.show()


Output:

[Output image: a data frame with three rows and the columns name, age, and salary]
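Because the schema declares age and salary as IntegerType, numeric operations can be applied directly to those columns. As a small illustration (the threshold value is arbitrary), the standard filter() method keeps only the rows matching a condition:

# Keep only the rows whose salary is greater than 5500
df.filter(df.salary > 5500).show()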

 

All three approaches achieve the same result: a data frame with three rows and three columns named “name”, “age”, and “salary”. The data in the data frame is the same as the data in the original list of custom objects.
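When you are done experimenting with any of these examples, the SparkSession created at the beginning can optionally be stopped to release its resources:

# Stop the SparkSession (optional cleanup)
spark.stop()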


