
Custom row (List of CustomTypes) to PySpark dataframe

Last Updated : 05 Feb, 2023

In this article, we are going to learn how to convert a custom row (a list of custom-type objects) into a PySpark data frame in Python.

We will explore how to create a PySpark data frame from a list of custom objects, where each object represents a row in the data frame. PySpark data frames are a powerful and efficient tool for working with large datasets in a distributed computing environment. They are similar to a table in a relational database or a data frame in R or Python. By creating a data frame from a list of custom objects, we can easily convert structured data into a format that can be analyzed and processed using PySpark’s built-in functions and libraries.

Syntax of the CustomType class used to create a PySpark data frame:

class CustomType:
    def __init__(self, name, age, salary):
        self.name = name
        self.age = age
        self.salary = salary

Explanation: 

  • The keyword class is used to define a new class.
  • CustomType is the name of the class.
  • Inside the class block, we have a special method called __init__, which is used to initialize the object when it is created. The __init__ method takes three arguments: name, age, and salary, and assigns them to the object’s properties with the same name.
  • self is a reference to the object itself, which is passed to the method automatically when the object is created.
  • The properties name, age, and salary are defined using the self.property_name = value notation (see the short example after this list).
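As a quick illustration (the values here are arbitrary example data), an object of this class can be created and its attributes read back like those of any ordinary Python object:

# Create a CustomType instance and access its attributes
person = CustomType("Alice", 28, 4500)
print(person.name, person.age, person.salary)  # Alice 28 4500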

Approach 1: 

In the example below, we create a PySpark data frame from a list of custom objects, where each object represents a row in the data frame. The custom objects contain information about a person: their name, age, and salary. We first convert the list of custom objects to a list of Row objects using a list comprehension, and then create a data frame from the list of Row objects using the createDataFrame() method.

Step 1: The first line imports the Row class from the pyspark.sql module, which is used to create a row object for the data frame.

Step 2: A custom class called CustomType is defined with a constructor that takes in three parameters: name, age, and salary. These will represent the columns of the data frame.

Step 3: A list of CustomType objects is created with three instances, each with a different name, age, and salary.

Step 4: A list comprehension is used to convert the list of CustomType objects into a list of Row objects, where each CustomType object is mapped to a Row object with the same name, age, and salary.

Step 5: The createDataFrame() method is called on the SparkSession object (spark) with the list of Row objects as input, creating a DataFrame.

Step 6: The data frame is displayed using the show() method.

Python3
# Importing required modules
from pyspark.sql import Row
from pyspark.sql import SparkSession
  
# Create a SparkSession
spark = SparkSession.builder.appName("MyApp").getOrCreate()
# Define a custom class to represent a row in the dataframe
class CustomType:
    def __init__(self, name, age, salary):
        self.name = name
        self.age = age
        self.salary = salary
  
# Create a list of CustomType objects
data = [CustomType("John", 30, 5000),
        CustomType("Mary", 25, 6000),
        CustomType("Mike", 35, 7000)]
  
# Convert the list of CustomType
# objects to a list of Row objects
rows = [Row(name=d.name, age=d.age, salary=d.salary) for d in data]
  
# Create a dataframe from the list of Row objects
df = spark.createDataFrame(rows)
  
# Show the dataframe
df.show()


Output:

[Output image: a data frame with three rows and the columns name, age, and salary]
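As an optional follow-up, the column names and types that Spark inferred from the Row objects can be checked with the standard printSchema() method on the resulting data frame:

# Print the schema Spark inferred for the data frame
df.printSchema()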

 

Approach 2: 

In this example, we convert the list of custom objects directly to an RDD and then convert it to a data frame using the createDataFrame() method.

Step 1: The first line imports the Row class from the pyspark.sql module, although it is not actually used in the code below.

Step 2: A custom class called CustomType is defined with a constructor that takes in three parameters: name, age, and salary. These will represent the columns of the data frame.

Step 3: A list of CustomType objects is created with three instances, each with a different name, age, and salary.

Step 4: The parallelize() method of the SparkContext is called with the list of CustomType objects as input, creating an RDD (Resilient Distributed Dataset).

Step 5: The createDataFrame() method is called on the SparkSession object (spark) with the RDD as input, creating a DataFrame.

Step 6: The data frame is displayed using the show() method.

Python3
from pyspark.sql import Row
from pyspark.sql import SparkSession
# Create a SparkSession
spark = SparkSession.builder.appName("Metadata").getOrCreate()
# Define a custom class to represent a row in the dataframe
class CustomType:
    def __init__(self, name, age, salary):
        self.name = name
        self.age = age
        self.salary = salary
  
# Create a list of CustomType objects
data = [CustomType("John", 30, 5000),
        CustomType("Mary", 25, 6000),
        CustomType("Mike", 35, 7000)]
  
# Convert the list of CustomType objects into an RDD
rdd = spark.sparkContext.parallelize(data)
  
# Create a dataframe from the rdd
df = spark.createDataFrame(rdd)
  
# Show the dataframe
df.show()


Output:

[Output image: a data frame with three rows and the columns name, age, and salary]
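If you prefer to make the field names explicit rather than letting Spark infer them from the objects' attributes, one common variation (a minimal sketch; df2 is just an illustrative variable name) is to map the RDD of CustomType objects to an RDD of Row objects before creating the data frame, which also puts the imported Row class to use:

# Map each CustomType object to a Row with explicit field names
row_rdd = rdd.map(lambda d: Row(name=d.name, age=d.age, salary=d.salary))

# Build and display the data frame from the RDD of Row objects
df2 = spark.createDataFrame(row_rdd)
df2.show()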

 

Approach 3:

In this approach, we first define the schema for the data frame using the StructType class, creating three fields, name, age, and salary, with the types StringType, IntegerType, and IntegerType respectively. Then we create a list of custom objects, where each object is a Python dictionary whose keys correspond to the field names in the schema. Finally, we use the createDataFrame() method with the list of custom objects and the schema to create the data frame, and display it using the show() method.

Step 1: Define the schema for the data frame using the StructType class. This class lets you define the structure and types of the columns in the data frame; the name and type of each column are specified with the StructField class.
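For clarity, each StructField takes three arguments: the column name, the column's data type, and a boolean flag indicating whether the column is allowed to contain null values. A minimal example:

# The three arguments of StructField: column name, data type,
# and nullability (True means null values are allowed)
from pyspark.sql.types import StructField, IntegerType
salary_field = StructField("salary", IntegerType(), True)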

Step 2: Create a list of custom objects: The custom objects can be in the form of Python dictionaries, where each dictionary represents a row in the data frame and the keys of the dictionary correspond to the column names defined in the schema.

Step 3: Create the data frame: Use the createDataFrame method and pass in the list of custom objects and the schema to create the data frame.

Step 4: Show the data frame: To display the data frame, use the show() method on the data frame object.

Python3
# Importing required modules
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
  
# Create a SparkSession
spark = SparkSession.builder.appName("Myapp").getOrCreate()
  
# step 1: Define the schema for the dataframe
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("salary", IntegerType(), True)
])
  
# step 2: Create a list of custom objects
data = [{"name": "John", "age": 30, "salary": 5000},
        {"name": "Mary", "age": 25, "salary": 6000},
        {"name": "Mike", "age": 35, "salary": 7000}]
  
# step 3: Create the dataframe
df = spark.createDataFrame(data, schema)
  
# step 4: Show the dataframe
df.show()


Output:

[Output image: a data frame with three rows and the columns name, age, and salary]
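Because the schema declares age and salary as IntegerType, numeric operations can be applied directly to those columns. As a small illustration (the threshold value is arbitrary), the standard filter() method keeps only the rows matching a condition:

# Keep only the rows whose salary is greater than 5500
df.filter(df.salary > 5500).show()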

 

All three approaches achieve the same result: a data frame with three rows and three columns named “name”, “age”, and “salary”. The data in the data frame is the same as the data in the original list of custom objects.
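When you are done experimenting with any of these examples, the SparkSession created at the beginning can optionally be stopped to release its resources:

# Stop the SparkSession (optional cleanup)
spark.stop()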


