Defining DataFrame Schema with StructField and StructType
In this article, we will learn how to define a DataFrame schema with StructField and StructType.
- StructType and StructField are used to define the schema of a DataFrame, or a part of it. The schema specifies the name, data type, and nullable flag of each column.
- A StructType object is a collection of StructField objects. It is a built-in data type that holds a list of StructField objects.
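- Syntax: pyspark.sql.types.StructType(fields=None) and pyspark.sql.types.StructField(name, dataType, nullable=True)
- fields – list of StructField objects (the parameter of StructType).
- name – name of the column.
- dataType – data type of the column, e.g. IntegerType(), StringType(), FloatType().
- nullable – whether the column may contain NULL/None values (True by default).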
To define a schema, we create a StructType() object and pass it a list of StructField() objects, each of which holds the name of the column, the data type of the column, and the nullable flag. The general form is:
schema = StructType([
    StructField(column_name1, datatype(), nullable_flag),
    StructField(column_name2, datatype(), nullable_flag),
    StructField(column_name3, datatype(), nullable_flag)
])
Example 1: Defining a DataFrame with a schema using StructType and StructField.
In the above code, we set the nullable flag to True. With the flag set to True, the DataFrame is still created when a field value is NULL/None, and the missing value is stored as null.
Example 2: Defining a DataFrame schema with a nested StructType.
Example 3: Changing the structure of a DataFrame and adding a new column using the PySpark Column class.
- In the above example, we change the structure of the DataFrame using the struct() function: the existing columns are copied into a new struct, which is added as the ‘Product’ column using the withColumn() function.
- After copying ‘Product Name’, ‘Product ID’, ‘Rating’, and ‘Product Price’ into the new struct ‘Product’, we add the new column ‘Price Range’ using the withColumn() function. Its value is one of three categories: Low if ‘Product Price’ is less than 1000, Medium if it is at least 1000 but less than 7000, and High otherwise.
- After creating the new struct ‘Product’ and adding the new column ‘Price Range’, we drop the ‘Product Name’, ‘Product ID’, ‘Rating’, and ‘Product Price’ columns using the drop() function. Finally, we print the schema to show the changed DataFrame structure and the added columns.
Example 4: Defining a DataFrame schema using the JSON format and StructType().
Note: You can also store the schema in JSON format in a file and use that file to define the schema. The code is the same as above, except that the JSON is read from the file (for example with json.load() on the open file, instead of json.loads() on a string). In the above example, the schema in JSON format is stored in a variable, and we use that variable to define the schema.
Example 5: Defining a DataFrame schema using StructType() with ArrayType() and MapType().