
Append data to an empty dataframe in PySpark

Last Updated : 05 Apr, 2022

In this article, we are going to see how to append data to an empty DataFrame in PySpark using Python.

Method 1: Create an empty DataFrame and take a union with a non-empty DataFrame that has the same schema

The union() function is central to this operation. It combines two DataFrames that have the same column schema.

Syntax : FirstDataFrame.union(SecondDataFrame)

Returns : DataFrame with rows of both DataFrames.

Example:

In this example, we create a DataFrame with a particular schema and data, create an empty DataFrame with the same schema, and combine the two DataFrames using the union() function.

Python




# Importing PySpark and the SparkSession
# DataType functionality
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import *
 
# Creating a spark session
spark_session = SparkSession.builder.appName(
    'Spark_Session').getOrCreate()
 
# Creating an empty RDD to make a
# DataFrame with no data
emp_RDD = spark_session.sparkContext.emptyRDD()
 
# Defining the schema of the DataFrame
columns1 = StructType([StructField('Name', StringType(), False),
                       StructField('Salary', IntegerType(), False)])
 
# Creating an empty DataFrame
first_df = spark_session.createDataFrame(data=emp_RDD,
                                         schema=columns1)
 
# Printing the DataFrame with no data
first_df.show()
 
# Hardcoded data for the second DataFrame
rows = [['Ajay', 56000], ['Srikanth', 89078],
        ['Reddy', 76890], ['Gursaidutt', 98023]]
columns = ['Name', 'Salary']
 
# Creating the DataFrame
second_df = spark_session.createDataFrame(rows, columns)
 
# Printing the non-empty DataFrame
second_df.show()
 
# Storing the union of first_df and
# second_df in first_df
first_df = first_df.union(second_df)
 
# Our first DataFrame that was empty,
# now has data
first_df.show()


Output : 

+----+------+
|Name|Salary|
+----+------+
+----+------+

+----------+------+
|      Name|Salary|
+----------+------+
|      Ajay| 56000|
|  Srikanth| 89078|
|     Reddy| 76890|
|Gursaidutt| 98023|
+----------+------+

+----------+------+
|      Name|Salary|
+----------+------+
|      Ajay| 56000|
|  Srikanth| 89078|
|     Reddy| 76890|
|Gursaidutt| 98023|
+----------+------+
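
One caveat worth noting: union() matches columns by position, not by name. If the two DataFrames have the same columns in a different order, unionByName() (available since Spark 2.3) matches them by name instead. A minimal sketch, reusing spark_session and first_df from the example above (the extra rows are hypothetical):

Python

# Hypothetical rows with the columns in the opposite order
swapped_rows = [[45000, 'Pooja'], [67000, 'Rahul']]
swapped_df = spark_session.createDataFrame(swapped_rows,
                                           ['Salary', 'Name'])

# unionByName() matches columns by name rather than by
# position, so the different column order is handled safely
result_df = first_df.unionByName(swapped_df)
result_df.show()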

Method 2: Add a single row to an empty DataFrame by converting the row into a DataFrame

We can use createDataFrame() to convert a single row, supplied as a Python list, into a DataFrame. The details of createDataFrame() are : 

Syntax : CurrentSession.createDataFrame(data, schema=None, samplingRatio=None, verifySchema=True)

Parameters

  • data : RDD or iterable: The input data used to build the DataFrame.
  • schema : str/list/StructType, optional: A datatype string, a list of column names, or a StructType describing the columns.
  • samplingRatio : float, optional: The ratio of rows sampled when inferring the schema.
  • verifySchema : bool, optional: Verify the data types of every row against the specified schema. The value is True by default.
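
Before the full example, here is a minimal, self-contained sketch of these parameters (the city data is an illustrative placeholder, not part of the example below):

Python

from pyspark.sql import SparkSession
from pyspark.sql.types import (StructType, StructField,
                               StringType, IntegerType)

spark = SparkSession.builder.appName('Sketch').getOrCreate()

# An explicit schema passed through the 'schema' parameter
schema = StructType([StructField('City', StringType(), False),
                     StructField('Population', IntegerType(), False)])

# verifySchema=True (the default) checks each row's types
# against the schema above
city_df = spark.createDataFrame(data=[['Delhi', 19000000]],
                                schema=schema,
                                verifySchema=True)
city_df.show()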

Example: 

In this example, we create a DataFrame with a particular schema and a single row, create an empty DataFrame with the same schema using createDataFrame(), take a union of the two DataFrames with the union() function, store the result back in the empty DataFrame, and use show() to see the changes.

Python




# Importing PySpark and the SparkSession,
# DataType functionality
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import *
 
# Creating a spark session
spark_session = SparkSession.builder.appName(
    'Spark_Session').getOrCreate()
 
# Creating an empty RDD to make a DataFrame
# with no data
emp_RDD = spark_session.sparkContext.emptyRDD()
 
# Defining the schema of the DataFrame
columns = StructType([StructField('Stadium', StringType(), False),
                      StructField('Capacity', IntegerType(), False)])
 
# Creating an empty DataFrame
df = spark_session.createDataFrame(data=emp_RDD,
                                   schema=columns)
 
# Printing the DataFrame with no data
df.show()
 
# Hardcoded row for the second DataFrame
added_row = [['Motera Stadium', 132000]]
 
# Creating the DataFrame
added_df = spark_session.createDataFrame(added_row, columns)
 
# Storing the union of df and added_df
# back in df
df = df.union(added_df)
 
# Our first DataFrame that was empty,
# now has data
df.show()


Output : 

+-------+--------+
|Stadium|Capacity|
+-------+--------+
+-------+--------+

+--------------+--------+
|       Stadium|Capacity|
+--------------+--------+
|Motera Stadium|  132000|
+--------------+--------+
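
A closely related variant, not used in the original example: the single row can also be expressed as a pyspark.sql.Row object instead of a nested list. A minimal sketch, reusing spark_session and columns from above and assuming df is still the empty DataFrame:

Python

from pyspark.sql import Row

# The same row as a Row object; passing the 'columns'
# schema keeps the column types identical to df's
stadium_row = Row(Stadium='Motera Stadium', Capacity=132000)
row_df = spark_session.createDataFrame([stadium_row],
                                       schema=columns)

# Union with the empty DataFrame, exactly as before
df = df.union(row_df)
df.show()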

Method 3: Convert the empty DataFrame into a Pandas DataFrame and use the append() function

We will use toPandas() to convert the PySpark DataFrame to a Pandas DataFrame. Note that toPandas() collects the entire DataFrame onto the driver, so it is only suitable when the data fits in driver memory. Its syntax is : 

Syntax : PySparkDataFrame.toPandas()

Returns : Corresponding Pandas DataFrame

We will then use the Pandas append() function (deprecated since pandas 1.4 and removed in pandas 2.0; see the note at the end of this method). Its syntax is : 

Syntax : PandasDataFrame.append(other, ignore_index=False, verify_integrity=False, sort=False)

Parameters

  • other : DataFrame, Series, or list of these: The data to be appended.
  • ignore_index : bool: If True, the resulting DataFrame gets a fresh index with no relation to the original indexes.
  • sort : bool: Sort the columns if the columns of other and the calling DataFrame are not aligned.
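
As a standalone illustration of these parameters (this sketch requires pandas < 2.0, where append() still exists):

Python

import pandas as pd

# Requires pandas < 2.0; DataFrame.append() was removed
# in pandas 2.0
empty = pd.DataFrame(columns=['Stadium', 'Capacity'])
row = pd.DataFrame([['Motera Stadium', 132000]],
                   columns=['Stadium', 'Capacity'])

# ignore_index=True discards the incoming index and gives
# the result a fresh 0..n-1 RangeIndex
result = empty.append(row, ignore_index=True)
print(result)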

Example:

Here we create an empty DataFrame where data is to be added. We then convert the data to be added into a Spark DataFrame using createDataFrame(), convert both DataFrames to Pandas DataFrames using toPandas(), and use the append() function to add the non-empty DataFrame to the empty one, ignoring the indexes since we are building a new DataFrame. Finally, we convert the resulting Pandas DataFrame back to a Spark DataFrame using createDataFrame().

Python




# Importing PySpark and the SparkSession,
# DataType functionality
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import *
 
# Creating a spark session
spark_session = SparkSession.builder.appName(
    'Spark_Session').getOrCreate()
 
# Creating an empty RDD to make a DataFrame
# with no data
emp_RDD = spark_session.sparkContext.emptyRDD()
 
# Defining the schema of the DataFrame
columns = StructType([StructField('Stadium', StringType(), False),
                      StructField('Capacity', IntegerType(), False)])
 
# Creating an empty DataFrame
df = spark_session.createDataFrame(data=emp_RDD,
                                   schema=columns)
 
# Printing the DataFrame with no data
df.show()
 
# Hardcoded row for the second DataFrame
added_row = [['Motera Stadium', 132000]]
 
# Creating the DataFrame whose data
# needs to be added
added_df = spark_session.createDataFrame(added_row,
                                         columns)
 
# converting our PySpark DataFrames to
# Pandas DataFrames
pandas_added = added_df.toPandas()
df = df.toPandas()
 
# using append() to add the data (append() was
# removed in pandas 2.0; see the note below)
df = df.append(pandas_added, ignore_index=True)
 
# reconverting our DataFrame back
# to a PySpark DataFrame
df = spark_session.createDataFrame(df)
 
# Printing resultant DataFrame
df.show()


Output : 

+-------+--------+
|Stadium|Capacity|
+-------+--------+
+-------+--------+

+--------------+--------+
|       Stadium|Capacity|
+--------------+--------+
|Motera Stadium|  132000|
+--------------+--------+
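
Note: DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0. On newer pandas versions, the append step above can be replaced with pd.concat(). A minimal sketch, reusing df and pandas_added from the example:

Python

import pandas as pd

# pd.concat() is the modern replacement for append();
# ignore_index=True again produces a fresh 0..n-1 index
df = pd.concat([df, pandas_added], ignore_index=True)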

