Open In App

Create new column with function in Spark Dataframe

Last Updated : 05 Feb, 2023
Improve
Improve
Like Article
Like
Save
Share
Report

In this article, we are going to learn how to create a new column with a function in the PySpark data frame in Python.

PySpark is a popular Python library for distributed data processing that provides high-level APIs for working with big data. The data frame class is a key component of PySpark, as it allows you to manipulate tabular data with distributed computing. In this article, we will need to have PySpark installed and be familiar with basic data frame operations.

Define a function:

In this, we have defined a function that we can use to create a new column. This function takes an age as input and returns a string indicating whether the person is a ” child ” or an ” adult “.

Python3




# Defining a function
def age_group(age):
    if age < 18:
        return "child"
    else:
        return "adult"
  
# Storing return output of age_group function
# in age_group_udf
age_group_udf = udf(age_group)


Create a new column with a function using the withColumn() method in PySpark

In this column, we are going to add a new column to a data frame by defining a custom function and applying it to the data frame using a UDF. The UDF takes a column of the data frame as input, applies the custom function to it, and returns the result as a new column.

Here are the steps for using the withColumn() method to create a new column called “age_group” in our data frame:

Python3




from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
  
# create a SparkSession
spark = SparkSession.builder.getOrCreate()
  
# create a list of tuples
data = [("Alice", 1), ("Bob", 2), ("Charlie", 3)]
  
# create a DataFrame from the list
df = spark.createDataFrame(data, ["name", "age"])
print("Dataframe Before adding new col : ")
# view the DataFrame
df.show()
  
  
def age_group(age):
    if age < 18:
        return "child"
    else:
        return "adult"
  
age_group_udf = udf(age_group)
  
# create a new column called "age_group"
  
df = df.withColumn("age_group", age_group_udf(df.age))
  
print("Dataframe After adding new col : ")
# view the DataFrame with the new column
df.show()


Output : 

 

Create a new column with a function using the PySpark UDFs method

In this approach, we are going to add a new column to a data frame by defining a custom function and registering it as a UDF using the spark.udf.register() method. Then using selectExpr() method of the data frame to select the columns of the data frame and the new column which is created by applying the UDF on a column of DataFrame. Here is a step-by-step explanation of the code:

Python3




# Import required modules
from pyspark.sql import SparkSession
  
# Creating the SparkSession
spark = SparkSession.builder.appName(
  "Creating new column using UDF").getOrCreate()
  
# Creating the DataFrame
data = [("John", 25), ("Mike", 30),
        ("Emily", 35)]
columns = ["name", "age"]
df = spark.createDataFrame(data, columns)
print("Dataframe Before adding new col : ")
df.show()
  
# Define the UDF
def my_udf(age):
    if age < 30:
        return "Young"
    else:
        return "Old"
  
# Register the UDF as a function
udf_age_group = spark.udf.register(
  "age_group", my_udf)
  
# Use the UDF in a select statement
df = df.selectExpr("name","age",
                   "age_group(age) as age_group")
  
print("Dataframe After adding new col : ")
# Show the DataFrame
df.show()


Output : 

 

 



Like Article
Suggest improvement
Share your thoughts in the comments

Similar Reads