
How to introduce the schema in a Row in Spark?

Last Updated : 09 Jun, 2023

A schema is a structured definition of a dataset: it specifies the field names and the data type of each field in a table. In Spark, the structure of the rows in a DataFrame is defined by its schema, and a schema is necessary to carry out many tasks, including filtering, joining, and querying the data.

Concepts related to the topic

  1. StructType: StructType is the class that specifies the schema of a DataFrame. It holds a list of StructField objects, one for each field in the DataFrame (a more compact way to express a schema is sketched just after this list).
  2. StructField: StructField is the class that specifies a single field of a DataFrame: its name, its data type, and a nullable flag indicating whether the field may contain null values.
  3. DataFrame: A DataFrame is a distributed collection of data organized into named columns. It is similar to a table in a relational database and can be manipulated using various SQL operations.
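
A schema does not have to be built from StructType objects by hand. As a minimal sketch (an addition to the original article, assuming a running SparkSession named spark), createDataFrame() also accepts a DDL-formatted string as the schema:

Python3

# Sketch: a DDL-formatted string is a compact alternative to
# building StructType/StructField objects explicitly
df = spark.createDataFrame(
    [(101, "Sravan", 23)],
    schema="id INT, name STRING, age INT"
)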

Example 1:

Step 1: Load the necessary libraries and functions and Create a SparkSession object 

Python3




from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
from pyspark.sql import Row
  
# Create a SparkSession object
spark = SparkSession.builder.appName("Schema").getOrCreate()
spark  # displaying the session object (e.g. in a notebook) shows its details


Output:

SparkSession - in-memory

SparkContext

Spark UI
Version: v3.3.1
Master: local[*]
AppName: Schema

Step 2: Define the schema

Python3




# Define the schema
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
])
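
The third argument of each StructField is the nullable flag. As a short illustrative sketch (not part of the original steps), a field declared with nullable=False will, by default, cause createDataFrame() to reject rows where that field is None:

Python3

# Illustration: nullable=False means the field may not contain nulls
strict_schema = StructType([
    StructField("id", IntegerType(), False)
])
spark.createDataFrame([[1]], schema=strict_schema).show()
# spark.createDataFrame([[None]], schema=strict_schema)  # raises an error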


Step 3: Create a list of employee data with 5 rows

Python3




# list of employee data with 5 row values
data = [[101, "Sravan", 23],
        [102, "Akshat", 25],
        [103, "Pawan"25],
        [104, "Gunjan", 24],
        [105, "Ritesh", 26]]


Step 4: Create a DataFrame from the data and the schema, and display it

Python3




# Create a DataFrame from the data and the schema
df = spark.createDataFrame(data, schema=schema)
# Show the DataFrame
df.show()


Output:

+---+------+---+
| id|  name|age|
+---+------+---+
|101|Sravan| 23|
|102|Akshat| 25|
|103| Pawan| 25|
|104|Gunjan| 24|
|105|Ritesh| 26|
+---+------+---+
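
With the schema in place, the typed columns can be used directly for the filtering and querying mentioned earlier. A brief sketch (an addition, not one of the original steps):

Python3

# Because "age" is an IntegerType column, numeric comparisons work directly
df.filter(df["age"] > 24).show()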

Step 5: Print the schema

Python3




# print the schema
df.printSchema()


Output:

root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
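
Besides printSchema(), the schema can also be inspected programmatically. A short sketch (added here for illustration):

Python3

# df.schema returns the StructType; df.dtypes lists (name, type) pairs
print(df.schema.fieldNames())  # ['id', 'name', 'age']
print(df.dtypes)               # [('id', 'int'), ('name', 'string'), ('age', 'int')]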

Step 6: Stop the SparkSession

Python3




# Stop the SparkSession
spark.stop()


Example 2:

Steps needed

  1. Create a list of StructField objects representing each column in the DataFrame.
  2. Create a StructType object from that list to define the schema of the DataFrame.
  3. Create a Row object by passing the values of the columns in the same order as the schema.
  4. Create a DataFrame from the Row object and the schema using the createDataFrame() function.

Creating a DataFrame with multiple columns of different types using a schema:

Python3




from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
from pyspark.sql import Row
  
# Create a SparkSession object
spark = SparkSession.builder.appName("example").getOrCreate()
  
# Define the schema
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
])
  
# Create a Row object
row = Row(id=100, name="Akshat", age=19)
  
# Create a DataFrame from the Row object and the schema
df = spark.createDataFrame([row], schema=schema)
  
# Show the DataFrame
df.show()
  
# print the schema
df.printSchema()
  
# Stop the SparkSession
spark.stop()


Output:

+---+------+---+
| id|  name|age|
+---+------+---+
|100|Akshat| 19|
+---+------+---+

root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
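
Row objects themselves behave much like named tuples, so their fields can be read by attribute, by key, or as a dictionary. A small sketch (added for illustration; it needs no active SparkSession):

Python3

from pyspark.sql import Row

# Fields of a Row can be accessed in several equivalent ways
row = Row(id=100, name="Akshat", age=19)
print(row.name)      # Akshat
print(row["age"])    # 19
print(row.asDict())  # {'id': 100, 'name': 'Akshat', 'age': 19}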

