Update Pyspark Dataframe Metadata

In this article, we will discuss how to update the metadata of a PySpark data frame in Python. Specifically, we will cover the following topics:

  • Understanding the importance of metadata in PySpark DataFrames
  • How to access and view the metadata of a PySpark DataFrame
  • Different ways to update the metadata of a PySpark DataFrame
  • Best practices for managing metadata in PySpark DataFrames

By the end of this article, we will have a solid understanding of how to update the metadata of a PySpark DataFrame and how to effectively manage metadata in PySpark projects.

Importance of metadata in PySpark DataFrames

Metadata in a PySpark DataFrame refers to the information about the data such as column names, data types, and constraints. It is important because it provides crucial information about the structure and content of the data. This information is used by PySpark during operations such as querying, filtering, and joining. If the metadata is incorrect or inconsistent, it can lead to errors and unexpected results in PySpark operations. Furthermore, accurate metadata can improve the performance of PySpark operations by allowing the optimizer to make better decisions. It is important to keep the metadata accurate and up-to-date to ensure the proper functioning of PySpark DataFrames and the overall integrity of data.

How to access and view the metadata of a PySpark DataFrame

In PySpark, we can access the metadata of a DataFrame using the .schema attribute. This returns a StructType object, which contains the metadata for the DataFrame. We can view the metadata by calling the printSchema() method on the DataFrame. This will print the metadata in a tree format, showing the column names, data types, and whether a column is nullable or not.

Here is an example of accessing and viewing the metadata of a data frame:

Python3




# Importing required modules
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql import SparkSession
  
# Create a SparkSession
spark = SparkSession.builder.appName("Metadata").getOrCreate()
  
# Define schema of data frame
schema = StructType([
    StructField("name", StringType()),
    StructField("age", IntegerType())
])
  
# Create data frame
data = [("Alice", 25),
        ("Bob", 30), 
        ("Charlie", 35)]
df = spark.createDataFrame(data, schema)
  
# Access and view the metadata
print(df.schema)
df.printSchema()


Output: The first line of output is the StructType object, followed by the tree format of the data frame’s metadata.

StructType([StructField('name', StringType(), True), StructField('age', IntegerType(), True)])
root
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)

We can also use the dtypes attribute to get the column name and data type information in tuple format.

df.dtypes

This will return a list of tuples, each containing the column name and data type.

[('name', 'string'), ('age', 'int')]
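Each column also carries its own metadata dictionary on its StructField, which is empty by default. A minimal sketch, continuing with the df created above:

# Access a field by name and inspect its per-column metadata
field = df.schema["name"]
print(field.name)       # name
print(field.dataType)   # StringType()
print(field.metadata)   # {} when no metadata has been attached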

Different ways to update the metadata of a PySpark DataFrame

There are several ways to update the metadata of a PySpark DataFrame, depending on the specific change we need to make. Here are a few examples:

Change column names of a data frame in PySpark

The withColumnRenamed() method is used to change the name of a column in the data frame. Here, we are going to rename the “name” column to “username”. Below are the steps to change the column name.

Step 1: First, we import the required modules and then create a SparkSession.

Step 2: Create a PySpark data frame with data and column names “name” and “age”.

Step 3: Use the withColumnRenamed() method to change the name of the “name” column to “username”.

Step 4: Call the printSchema() method to print the schema of the DataFrame after the change, which shows that the column name has been changed to “username”.

Python3




# Importing required modules
from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.appName("Metadata").getOrCreate()

# Create a DataFrame
data = [("Alice", 25),
        ("Bob", 30),
        ("Charlie", 35)]
df = spark.createDataFrame(data,
                           ["name", "age"])

# Print schema of data frame before the change
df.printSchema()
  
# Change column names
df = df.withColumnRenamed("name", "username")
df.printSchema()


Output before changing the column name : 

root
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)

Output after changing the column name : 

root
 |-- username: string (nullable = true)
 |-- age: long (nullable = true)
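If several columns need new names at once, the toDF() method replaces all column names in a single call. A minimal sketch, continuing with the data frame above (the new names are illustrative):

# Rename all columns at once by passing the new names in order
df = df.toDF("username", "age_in_years")
df.printSchema()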

Change data types in a data frame in PySpark

The cast() method is used to change the data type of a column. For example, to change the data type of the “age” column from long to double, we follow the steps below.

Step 1: First, we import the required modules and then create a SparkSession.

Step 2: Create a data frame with data and column names “name” and “age”.

Step 3: Use the withColumn() method along with the cast() method to change the data type of the “age” column to double.

Step 4: Call the printSchema() method to print the schema of the DataFrame after the change, which shows that the data type of the “age” column has been changed to double.

Python3




# Import required modules
from pyspark.sql import SparkSession
from pyspark.sql.types import DoubleType

# Create a SparkSession
spark = SparkSession.builder.appName("Metadata").getOrCreate()

# Create a DataFrame
data = [("Alice", 25),
        ("Bob", 30),
        ("Charlie", 35)]
df = spark.createDataFrame(data,
                           ["name", "age"])

# Print schema of data frame before the change
df.printSchema()
  
# Change the data type of the "age" column
df = df.withColumn("age",
                   df["age"].cast(DoubleType()))
  
df.printSchema()


Output before changing data type : 

root
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)

Output after changing data type : 

root
 |-- name: string (nullable = true)
 |-- age: double (nullable = true)
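The cast() method also accepts the type name as a string, which avoids importing the type class. A minimal sketch of the equivalent cast:

from pyspark.sql.functions import col

# Equivalent cast using the type name as a string
df = df.withColumn("age", col("age").cast("double"))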

Add new columns in a data frame in PySpark

The withColumn() method, together with the lit() function, is used to add a new column to the data frame. Here, we are going to add a new column “gender” of string type.

Step 1: First, we import the required modules and then create a SparkSession.

Step 2: Create a data frame with data and column names “name” and “age”.

Step 3: Use the withColumn() method along with the lit() function to add a new column “gender” of string type with the default value “unknown”.

Step 4: Use the printSchema() method to print the schema of the DataFrame after the change, which shows that a new column “gender” of string type has been added to the data frame.

Python3




# Import required modules
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

# Create a SparkSession
spark = SparkSession.builder.appName("Metadata").getOrCreate()

# Create a DataFrame
data = [("Alice", 25),
        ("Bob", 30),
        ("Charlie", 35)]
df = spark.createDataFrame(data,
                           ["name", "age"])

# Print schema of data frame before the change
df.printSchema()
  
# Add a new "gender" column with a default value
df = df.withColumn("gender",
                   lit("unknown"))
  
# Print Schema of data frame
df.printSchema()
df.show()


Output before new column added : 

root
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)

Output after new column added  : 

root
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)
 |-- gender: string (nullable = false)

+-------+---+-------+
|   name|age| gender|
+-------+---+-------+
|  Alice| 25|unknown|
|    Bob| 30|unknown|
|Charlie| 35|unknown|
+-------+---+-------+
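A new column can also be derived from existing columns instead of a constant. A minimal sketch that adds a boolean column computed from “age” (the column name “is_adult” is illustrative):

from pyspark.sql.functions import col

# Derive a new column from an existing one
df = df.withColumn("is_adult", col("age") >= 18)
df.printSchema()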

Drop columns of a data frame in PySpark

The drop() method is used to remove a column from the data frame. Here, we are going to delete the “gender” column from a data frame that contains “name”, “age”, and “gender” columns. Below are the steps to do so.

Step 1: First, we import the required modules and then create a SparkSession.

Step 2: Create a data frame with data and column names “name”, “age”, and “gender”.

Step 3: Use the drop() method to remove the “gender” column from the data frame.

Step 4: Call the printSchema() method to print the schema of the DataFrame after the change, which shows that the “gender” column has been removed.

Python3




# Importing required modules
from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.appName("Metadata").getOrCreate()

# Create a DataFrame
data = [("Alice", 25, "female"),
        ("Bob", 30, "male"),
        ("Charlie", 35, "male")]
df = spark.createDataFrame(data,
                           ["name", "age", "gender"])

# Print schema of data frame before the change
df.printSchema()
  
# Remove column
df = df.drop("gender")
df.printSchema()


Output before dropping the gender column: 

root
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)
 |-- gender: string (nullable = true)

Output after dropping the gender column: 

root
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)
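The drop() method also accepts several column names at once, so multiple columns can be removed in a single call. A minimal sketch (starting from the three-column data frame above):

# Remove multiple columns in a single call
df_trimmed = df.drop("age", "gender")
df_trimmed.printSchema()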

Change column metadata of a data frame in PySpark

In this, we are going to change the column metadata of the data frame, turning “nullable = true” into “nullable = false” for every column. This can be done by creating a new schema object using the StructType class and passing it to the createDataFrame() method. Here are the steps to do so.

Step 1: First, we import the required modules and then create a SparkSession.

Step 2: Create a data frame with data and column names “name” and “age”.

Step 3: Create a new schema object from a list of fields with updated metadata, specifically marking the columns as not nullable.

Step 4: Create a new data frame using the createDataFrame() method, passing the RDD of the original data frame and the new schema to it, which updates the metadata of the data frame.

Python3




# Importing required modules
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql import SparkSession
  
# Create a SparkSession
spark = SparkSession.builder.appName("Metadata").getOrCreate()
  
# Create a DataFrame
data = [("Alice", 25),
        ("Bob", 30),
        ("Charlie", 35)]
  
df = spark.createDataFrame(data,
                           ["name", "age"])
df.printSchema()
  
# Change column metadata
fields = [StructField(field.name,
          field.dataType,
          False) for field in df.schema.fields]
  
# Store changed data frame in new_schema
new_schema = StructType(fields)
df = spark.createDataFrame(df.rdd,
                           new_schema)
df.printSchema()


Output before changing column metadata : 

root
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)

Output after changing column metadata : 

root
 |-- name: string (nullable = false)
 |-- age: long (nullable = false)

Update schema using the withMetadata() function in PySpark

PySpark 3.3 and later versions provide a built-in DataFrame.withMetadata(columnName, metadata) method, but it only attaches a metadata dictionary to a single column. In this example, we instead define our own withMetadata() function that combines the schema updates shown above and stores additional metadata in the DataFrame itself.
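As a brief aside, here is a minimal sketch of the built-in column-level method (assuming PySpark 3.3 or newer); the attached dictionary becomes visible on that column’s StructField:

# Built-in column-level metadata (PySpark 3.3+)
df2 = df.withMetadata("age", {"comment": "age in years"})
print(df2.schema["age"].metadata)   # {'comment': 'age in years'}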

Step 1: We start by importing the necessary modules, including lit from pyspark.sql.functions and the json module.

Step 2: Create a DataFrame with data and column names “name” and “age” using the createDataFrame() method.

Step 3: We define a function withMetadata() that takes two arguments: the DataFrame and a dictionary of metadata.

Step 4: Inside the function, we update the metadata of the DataFrame using different operations such as renaming columns, changing data types, adding and dropping columns, and changing column metadata.

Step 5: Convert the metadata passed as a dictionary to a JSON string using the json.dumps() method.

Step 6: Add the metadata to the DataFrame by adding a new column “metadata” whose value is the passed metadata as a JSON string, using the withColumn() method and the lit() function.

Step 7: Call the withMetadata() function and pass the DataFrame and the metadata as arguments.

Step 8: Call the printSchema() method to print the schema of the DataFrame after the changes, which shows that the new column “metadata” has been added with the passed metadata in JSON string format.

Here is the complete example:

Python3




# Import required modules
import json

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit
from pyspark.sql.types import StructType, StructField

# Create a SparkSession
spark = SparkSession.builder.appName("Metadata").getOrCreate()

# Create a DataFrame
data = [("Alice", 25),
        ("Bob", 30),
        ("Charlie", 35)]
df = spark.createDataFrame(data,
                           ["name", "age"])
df.printSchema()
  
# Define a function to update the metadata
def withMetadata(df, metadata):
    # Update the metadata of the DataFrame
    df = df.withColumnRenamed("name",
                              "username")
    df = df.withColumn("age",
                       df["age"].cast("double"))
    df = df.withColumn("gender",
                       lit("unknown"))
    df = df.drop("gender")
    fields = [StructField(field.name,
              field.dataType,
              False) for field in df.schema.fields]
    new_schema = StructType(fields)
    df = spark.createDataFrame(df.rdd,
                               new_schema)
      
    # Add the metadata to the DataFrame
    df = df.withColumn("metadata",
                       lit(json.dumps(metadata)))
    return df
  
# Update the metadata of the DataFrame
df = withMetadata(df, {"source": "file",
                       "date": "2022-01-01"})
df.printSchema()


Output before calling withMetadata() function : 

root
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)

Output after calling withMetadata() function : 

root
 |-- username: string (nullable = false)
 |-- age: double (nullable = false)
 |-- metadata: string (nullable = false)

Conclusion

To summarize, metadata in PySpark DataFrames refers to information about the data such as column names, data types, and constraints. It is important because it describes the structure and content of the data and is used by PySpark during operations such as querying, filtering, and joining. To update the metadata of a PySpark DataFrame, we can rename columns with withColumnRenamed(), change data types with cast(), add or drop columns with withColumn() and drop(), and rebuild the schema with StructType and createDataFrame() to change properties such as nullability.



Last Updated : 30 Jan, 2023