PySpark – Apply custom schema to a DataFrame
Last Updated: 23 Jan, 2023
In this article, we are going to apply a custom schema to a data frame using PySpark in Python.
A PySpark data frame is a distributed collection of rows under named columns. Usually, the schema of a PySpark data frame is inferred from the data itself, but PySpark also lets you customize the schema according to your needs. This is done by defining the new schema and loading it into the respective data frame. Read on for the details.
What is Schema?
The schema in PySpark is the structure of the data frame, which you can view by calling the printSchema() method on the data frame object. The schema defines the structure of the data frame: the data type of each column and a boolean indication of whether the column's values can be null. A schema is defined using the StructType class, which is a collection of StructField objects that define the column name, column type, nullability, and metadata.
Methods to apply custom schema to a Pyspark DataFrame
- Applying custom schema by changing the name.
- Applying custom schema by changing the type.
- Applying custom schema by changing the metadata.
Method 1: Applying custom schema by changing the name
Whenever we create a data frame or load a CSV file, it gets some predefined schema; if we don't want that schema and want to change it according to our needs, this is known as applying a custom schema. The custom schema has two fields, 'column_name' and 'column_type'. In this method, we will see how to apply a customized schema to the data frame by changing the column names in the schema.
Syntax: StructType(StructField(‘column_name_1’, column_type(), Boolean_indication))
Parameters:
- column_name_1, column_name_2: The column names given to the data frame while applying the custom schema.
- column_type: The types to be given to the columns while applying the custom schema.
- Boolean_indication: Takes 'True' or 'False', defining whether the column can contain null values.
Example:
In this example, we have defined a customized schema with columns 'Student_Name' of StringType, 'Student_Age' of IntegerType, 'Student_Subject' of StringType, 'Student_Class' of IntegerType, and 'Student_Fees' of IntegerType. Then, we loaded the CSV file (link).
Finally, we applied the customized schema to that CSV file, changing the names, and displayed the updated schema of the data frame.
Python3
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark_session = SparkSession.builder.getOrCreate()

# Define the custom schema with the new column names
schema = StructType([
    StructField('Student_Name', StringType(), True),
    StructField('Student_Age', IntegerType(), True),
    StructField('Student_Subject', StringType(), True),
    StructField('Student_Class', IntegerType(), True),
    StructField('Student_Fees', IntegerType(), True)
])

# Load the CSV file with the custom schema applied
df = spark_session.read.format('csv').schema(schema).option(
    'header', True).load('/content/student_data.csv')

df.printSchema()
Output:
root
|-- Student_Name: string (nullable = true)
|-- Student_Age: integer (nullable = true)
|-- Student_Subject: string (nullable = true)
|-- Student_Class: integer (nullable = true)
|-- Student_Fees: integer (nullable = true)
Method 2: Applying custom schema by changing the type
As noted above, the custom schema has two fields, 'column_name' and 'column_type'. In the previous method, we saw how to change the names in the schema of the data frame; in this method, we will see how to apply a customized schema to the data frame by changing the column types.
Example:
In this example, we have read the CSV file (link), a 5x5 dataset.
Then, we applied a custom schema by changing the type of column ‘fees‘ from Integer to Float using the cast function and printed the updated schema of the data frame.
Python3
from pyspark.sql import SparkSession

spark_session = SparkSession.builder.getOrCreate()

# Read the CSV file, letting Spark infer the initial schema
data_frame = spark_session.read.csv(
    '/content/student_data.csv',
    sep=',', inferSchema=True, header=True)

# Change the type of the 'fees' column from integer to float
data_frame1 = data_frame.withColumn(
    'fees', data_frame['fees'].cast('float'))

data_frame1.printSchema()
Output:
root
|-- name: string (nullable = true)
|-- age: integer (nullable = true)
|-- subject: string (nullable = true)
|-- class: integer (nullable = true)
|-- fees: float (nullable = true)
Method 3: Applying custom schema by changing the metadata
The custom schema usually has the two fields 'column_name' and 'column_type', but we can also define one more field, 'metadata'. The metadata is a small description of the column. In this method, we will see how to apply a customized schema with metadata to the data frame.
Example:
In this example, we have defined a customized schema with columns 'Student_Name' of StringType with metadata 'Name of the student', 'Student_Age' of IntegerType with metadata 'Age of the student', 'Student_Subject' of StringType with metadata 'Subject of the student', 'Student_Class' of IntegerType with metadata 'Class of the student', and 'Student_Fees' of IntegerType with metadata 'Fees of the student'. Then, we loaded the CSV file (link).
Finally, we applied the customized schema to that CSV file and displayed the schema of the data frame along with the metadata.
Python3
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark_session = SparkSession.builder.getOrCreate()

# Define the custom schema, attaching a metadata description to each column
schema = StructType([
    StructField('Student_Name', StringType(), True,
                metadata={"desc": "Name of the student"}),
    StructField('Student_Age', IntegerType(), True,
                metadata={"desc": "Age of the student"}),
    StructField('Student_Subject', StringType(), True,
                metadata={"desc": "Subject of the student"}),
    StructField('Student_Class', IntegerType(), True,
                metadata={"desc": "Class of the student"}),
    StructField('Student_Fees', IntegerType(), True,
                metadata={"desc": "Fees of the student"})
])

# Load the CSV file with the custom schema applied
df = spark_session.read.format('csv').schema(schema).option(
    'header', True).load('/content/student_data.csv')

df.printSchema()

# Print the metadata description of each column
for i in range(len(df.columns)):
    desc = df.schema.fields[i].metadata["desc"]
    print('Column', i + 1, ':', desc)
Output:
root
|-- Student_Name: string (nullable = true)
|-- Student_Age: integer (nullable = true)
|-- Student_Subject: string (nullable = true)
|-- Student_Class: integer (nullable = true)
|-- Student_Fees: integer (nullable = true)
Column 1 : Name of the student
Column 2 : Age of the student
Column 3 : Subject of the student
Column 4 : Class of the student
Column 5 : Fees of the student