How to create an empty PySpark DataFrame?
Last Updated :
11 Aug, 2021
In this article, we are going to see how to create an empty PySpark DataFrame. An empty PySpark DataFrame is a DataFrame that contains no data; it may or may not specify a schema.
Creating an empty RDD without schema
We'll first create an empty RDD and turn it into a DataFrame by specifying an empty schema.
- emptyRDD() method creates an RDD without any data.
- createDataFrame() method creates a pyspark dataframe with the specified data and schema of the dataframe.
Code:
Python3
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType

# Start (or reuse) a Spark session
spark = SparkSession.builder.appName('Empty_Dataframe').getOrCreate()

# An RDD with no elements
emp_RDD = spark.sparkContext.emptyRDD()
# A schema with no fields
columns = StructType([])

# Build the DataFrame from the empty RDD and the empty schema
data = spark.createDataFrame(data=emp_RDD,
                             schema=columns)

print('Dataframe :')
data.show()
print('Schema :')
data.printSchema()
Output:
Dataframe :
++
||
++
++
Schema :
root
Creating an empty RDD with schema
Sometimes no input file arrives for processing, yet the rest of the pipeline still expects a DataFrame with the appropriate schema. In that case we create the DataFrame manually.
- Define the schema as a StructType with the columns 'Name', 'Age' and 'Gender'.
- Create an empty RDD and pass it to createDataFrame() together with the expected schema.
Code:
Python3
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

# Start (or reuse) a Spark session
spark = SparkSession.builder.appName('Empty_Dataframe').getOrCreate()

# An RDD with no elements
emp_RDD = spark.sparkContext.emptyRDD()

# Three nullable string columns
columns = StructType([StructField('Name', StringType(), True),
                      StructField('Age', StringType(), True),
                      StructField('Gender', StringType(), True)])

# Build the DataFrame from the empty RDD and the schema
df = spark.createDataFrame(data=emp_RDD,
                           schema=columns)

print('Dataframe :')
df.show()
print('Schema :')
df.printSchema()
Output:
Dataframe :
+----+---+------+
|Name|Age|Gender|
+----+---+------+
+----+---+------+
Schema :
root
|-- Name: string (nullable = true)
|-- Age: string (nullable = true)
|-- Gender: string (nullable = true)
Creating an empty dataframe without schema
- Create an empty schema and store it in columns.
- Pass empty data ([]) and columns as the schema to the createDataFrame() method.
Code:
Python3
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType

# Start (or reuse) a Spark session
spark = SparkSession.builder.appName('Empty_Dataframe').getOrCreate()

# A schema with no fields
columns = StructType([])

# An empty list as data, so no RDD is needed
df = spark.createDataFrame(data=[],
                           schema=columns)

print('Dataframe :')
df.show()
print('Schema :')
df.printSchema()
Output:
Dataframe :
++
||
++
++
Schema :
root
Creating an empty dataframe with schema
- Define the schema as a StructType with the columns 'Name', 'Age' and 'Gender'.
- Pass empty data ([]) and columns as the schema to the createDataFrame() method.
Code:
Python3
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

# Start (or reuse) a Spark session
spark = SparkSession.builder.appName('Empty_Dataframe').getOrCreate()

# Three nullable string columns
columns = StructType([StructField('Name', StringType(), True),
                      StructField('Age', StringType(), True),
                      StructField('Gender', StringType(), True)])

# An empty list as data, combined with the schema
df = spark.createDataFrame(data=[],
                           schema=columns)

print('Dataframe :')
df.show()
print('Schema :')
df.printSchema()
Output:
Dataframe :
+----+---+------+
|Name|Age|Gender|
+----+---+------+
+----+---+------+
Schema :
root
|-- Name: string (nullable = true)
|-- Age: string (nullable = true)
|-- Gender: string (nullable = true)