
How to introduce the schema in a Row in Spark?

A schema is a structured definition of a dataset: it specifies the field names, field types, and nullability of the data in a table. In Spark, the structure of each row in a DataFrame is defined by its schema. A schema is necessary to carry out many tasks, including filtering, joining, and querying the data.

Concepts related to the topic

  1. StructType: a class that defines a DataFrame’s schema as an ordered list of StructField objects; each StructField in the list corresponds to one column of the DataFrame (a minimal sketch follows this list).
  2. StructField: a class that defines a single field’s name, data type, and nullable flag.
  3. DataFrame: a distributed collection of data organized into named columns. It is similar to a table in a relational database and can be manipulated with SQL-style operations.
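
A minimal sketch of how these pieces fit together (the field names here are illustrative):

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# A StructType is an ordered collection of StructField objects
schema = StructType([
    StructField("id", IntegerType(), True),    # name, data type, nullable
    StructField("name", StringType(), True)
])

# The StructType behaves like a container of its fields
print(schema.fieldNames())      # ['id', 'name']
print(schema["name"].dataType)  # StringType (repr varies by Spark version)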

Example 1:

Step 1: Load the necessary libraries and functions, and create a SparkSession object

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
from pyspark.sql import Row
  
# Create a SparkSession object
spark = SparkSession.builder.appName("Schema").getOrCreate()
spark

Output:

SparkSession - in-memory

SparkContext

Spark UI
Version: v3.3.1
Master: local[*]
AppName: Schema

Step 2: Define the schema

# Define the schema
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
])
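
As an aside, the same schema can be written as a DDL-style string and passed directly to createDataFrame() (supported since Spark 2.3); the StructType above is equivalent to this minimal sketch:

# Equivalent schema expressed as a DDL string
ddl_schema = "id INT, name STRING, age INT"
# later: df = spark.createDataFrame(data, schema=ddl_schema)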

Step 3: Create a list of employee data with 5 rows

# list of employee data with 5 row values
data = [[101, "Sravan", 23],
        [102, "Akshat", 25],
        [103, "Pawan"25],
        [104, "Gunjan", 24],
        [105, "Ritesh", 26]]

Step 4: Create a DataFrame from the data and the schema, and show the DataFrame

# Create a DataFrame from the data and the schema
df = spark.createDataFrame(data, schema=schema)
# Show the DataFrame
df.show()

Output:

+---+------+---+
| id|  name|age|
+---+------+---+
|101|Sravan| 23|
|102|Akshat| 25|
|103| Pawan| 25|
|104|Gunjan| 24|
|105|Ritesh| 26|
+---+------+---+
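
Because the schema declares age as an integer column, typed operations such as filtering (mentioned in the introduction) work directly on it. A quick illustration, not part of the original steps:

# Filter on the integer 'age' column defined by the schema
df.filter(df.age > 24).show()  # keeps rows 102, 103 and 105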

Step 5: Print the schema

# print the schema
df.printSchema()

Output:

root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
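
The schema is also available programmatically via df.schema, which returns the StructType the DataFrame was built with; this is handy for sanity checks in pipelines. A small illustrative check (run before Step 6 stops the session):

# df.schema returns the StructType used to build the DataFrame
print(df.schema == schema)     # True
print(df.schema.fieldNames())  # ['id', 'name', 'age']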

Step 6: Stop the SparkSession

# Stop the SparkSession
spark.stop()

Example 2:

Steps needed

  1. Create a StructType object defining the schema of the DataFrame.
  2. Create a list of StructField objects representing each column in the DataFrame.
  3. Create a Row object by passing the values of the columns in the same order as the schema.
  4. Create a DataFrame from the Row object and the schema using the createDataFrame() function.

Creating a DataFrame with multiple columns of different types using a schema:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
from pyspark.sql import Row
  
# Create a SparkSession object
spark = SparkSession.builder.appName("example").getOrCreate()
  
# Define the schema
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
])
  
# Create a Row object
row = Row(id=100, name="Akshat", age=19)
  
# Create a DataFrame from the Row object and the schema
df = spark.createDataFrame([row], schema=schema)
  
# Show the DataFrame
df.show()
  
# print the schema
df.printSchema()
  
# Stop the SparkSession
spark.stop()

Output:

+---+------+---+
| id|  name|age|
+---+------+---+
|100|Akshat| 19|
+---+------+---+

root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
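
One last point worth knowing: createDataFrame() verifies each row against the schema by default (its verifySchema parameter defaults to True), so a value of the wrong type fails fast. A minimal self-contained sketch, with an illustrative app name:

from pyspark.sql import SparkSession, Row
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.appName("verify").getOrCreate()

schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
])

# "100" is a string, but the schema declares id as IntegerType,
# so schema verification raises a TypeError
try:
    spark.createDataFrame([Row(id="100", name="Akshat", age=19)], schema=schema)
except TypeError as e:
    print(e)

spark.stop()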
