
How to introduce the schema in a Row in Spark?

A schema is a structured definition of a dataset: it specifies the field names, field types, and nullability of the data in a table. In Spark, the structure of each row in a DataFrame is defined by its schema. A schema is necessary to carry out many tasks, including filtering, joining, and querying the data.

Concepts related to the topic

  1. StructType: a class that defines a DataFrame’s schema as an ordered list of StructField objects; each StructField in the list corresponds to one column of the DataFrame (a minimal sketch follows this list).
  2. StructField: a class that defines a single field’s name, data type, and nullable flag.
  3. DataFrame: a distributed collection of data organized into named columns. It is similar to a table in a relational database and can be manipulated with SQL-style operations.
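
A minimal sketch of how these pieces fit together (the field names here are illustrative):

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# A StructType is an ordered collection of StructField objects
schema = StructType([
    StructField("id", IntegerType(), True),    # name, data type, nullable
    StructField("name", StringType(), True)
])

# The StructType behaves like a container of its fields
print(schema.fieldNames())      # ['id', 'name']
print(schema["name"].dataType)  # StringType (repr varies by Spark version)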

Example 1:

Step 1: Load the necessary libraries and functions, and create a SparkSession object

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
from pyspark.sql import Row
  
# Create a SparkSession object
spark = SparkSession.builder.appName("Schema").getOrCreate()
spark

Output:

SparkSession - in-memory

SparkContext

Spark UI
Version: v3.3.1
Master: local[*]
AppName: Schema

Step 2: Define the schema

# Define the schema
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
])
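
As an aside, the same schema can be written as a DDL-style string and passed directly to createDataFrame() (supported since Spark 2.3); the StructType above is equivalent to this minimal sketch:

# Equivalent schema expressed as a DDL string
ddl_schema = "id INT, name STRING, age INT"
# later: df = spark.createDataFrame(data, schema=ddl_schema)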

Step 3: Create a list of employee data with 5 rows

# list of employee data with 5 row values
data = [[101, "Sravan", 23],
        [102, "Akshat", 25],
        [103, "Pawan"25],
        [104, "Gunjan", 24],
        [105, "Ritesh", 26]]

Step 4: Create a DataFrame from the data and the schema, and show the DataFrame

# Create a DataFrame from the data and the schema
df = spark.createDataFrame(data, schema=schema)
# Show the DataFrame
df.show()

Output:

+---+------+---+
| id|  name|age|
+---+------+---+
|101|Sravan| 23|
|102|Akshat| 25|
|103| Pawan| 25|
|104|Gunjan| 24|
|105|Ritesh| 26|
+---+------+---+
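
Because the schema declares age as an integer column, typed operations such as filtering (mentioned in the introduction) work directly on it. A quick illustration, not part of the original steps:

# Filter on the integer 'age' column defined by the schema
df.filter(df.age > 24).show()  # keeps rows 102, 103 and 105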

Step 5: Print the schema

# print the schema
df.printSchema()

Output:

root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
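
The schema is also available programmatically via df.schema, which returns the StructType the DataFrame was built with; this is handy for sanity checks in pipelines. A small illustrative check (run before Step 6 stops the session):

# df.schema returns the StructType used to build the DataFrame
print(df.schema == schema)     # True
print(df.schema.fieldNames())  # ['id', 'name', 'age']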

Step 6: Stop the SparkSession

# Stop the SparkSession
spark.stop()

Example 2:

Steps needed

  1. Create a StructType object defining the schema of the DataFrame.
  2. Create a list of StructField objects representing each column in the DataFrame.
  3. Create a Row object by passing the values of the columns in the same order as the schema.
  4. Create a DataFrame from the Row object and the schema using the createDataFrame() function.

Creating a DataFrame with multiple columns of different types using a schema:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
from pyspark.sql import Row
  
# Create a SparkSession object
spark = SparkSession.builder.appName("example").getOrCreate()
  
# Define the schema
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
])
  
# Create a Row object
row = Row(id=100, name="Akshat", age=19)
  
# Create a DataFrame from the Row object and the schema
df = spark.createDataFrame([row], schema=schema)
  
# Show the DataFrame
df.show()
  
# print the schema
df.printSchema()
  
# Stop the SparkSession
spark.stop()

Output:

+---+------+---+
| id|  name|age|
+---+------+---+
|100|Akshat| 19|
+---+------+---+

root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
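
One last point worth knowing: createDataFrame() verifies each row against the schema by default (its verifySchema parameter defaults to True), so a value of the wrong type fails fast. A minimal self-contained sketch, with an illustrative app name:

from pyspark.sql import SparkSession, Row
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.appName("verify").getOrCreate()

schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
])

# "100" is a string, but the schema declares id as IntegerType,
# so schema verification raises a TypeError
try:
    spark.createDataFrame([Row(id="100", name="Akshat", age=19)], schema=schema)
except TypeError as e:
    print(e)

spark.stop()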
