Create new column with function in Spark Dataframe

Last Updated : 05 Feb, 2023

In this article, we are going to learn how to create a new column in a PySpark data frame by applying a function.

PySpark is a popular Python library for distributed data processing that provides high-level APIs for working with big data. The data frame class is a key component of PySpark, as it allows you to manipulate tabular data with distributed computing. To follow along, you need to have PySpark installed and be familiar with basic data frame operations.

Define a function:

First, we define a function that we can use to create a new column. This function takes an age as input and returns a string indicating whether the person is a “child” or an “adult”. We then wrap it with udf() so that Spark can apply it to a column.

Python3

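from pyspark.sql.functions import udf

# Define a plain Python function that classifies an age
def age_group(age):
    if age < 18:
        return "child"
    else:
        return "adult"

# Wrap the function as a Spark UDF; with no return type given,
# udf() defaults to a string result
age_group_udf = udf(age_group)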
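
One thing the function above does not cover is missing data. Spark passes a null column value to a Python UDF as None, and in Python 3 the comparison None < 18 raises a TypeError. Below is a minimal sketch of a null-tolerant variant. The name age_group_null_safe is hypothetical, and returning None (so that the new column stays null) is an assumed policy, not something this article prescribes.

```python
from typing import Optional


def age_group_null_safe(age: Optional[int]) -> Optional[str]:
    """Classify an age, passing missing values through unchanged."""
    # Spark hands null column values to a Python UDF as None.
    if age is None:
        return None  # assumed policy: keep the result null
    return "child" if age < 18 else "adult"


# The function is plain Python, so it can be checked without a Spark session.
print(age_group_null_safe(5))
print(age_group_null_safe(18))
print(age_group_null_safe(None))
```

The variant can be wrapped with udf() in the same way as age_group.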

Create a new column with a function using the withColumn() method in PySpark

In this approach, we add a new column to a data frame by defining a custom function and applying it to the data frame through a UDF. The UDF takes a column of the data frame as input, applies the custom function to each value, and returns the result as a new column.

The following code uses the withColumn() method to create a new column called “age_group” in our data frame:

Python3

from pyspark.sql import SparkSession 
from pyspark.sql.functions import udf 
  
# create a SparkSession 
spark = SparkSession.builder.getOrCreate() 
  
# create a list of tuples 
data = [("Alice", 1), ("Bob", 2), ("Charlie", 3)] 
  
# create a DataFrame from the list 
df = spark.createDataFrame(data, ["name", "age"]) 
print("Dataframe Before adding new col : ") 
# view the DataFrame 
df.show() 
  
  
def age_group(age): 
    if age < 18: 
        return "child"
    else: 
        return "adult"
  
age_group_udf = udf(age_group) 
  
# create a new column called "age_group" 
  
df = df.withColumn("age_group", age_group_udf(df.age)) 
  
print("Dataframe After adding new col : ") 
# view the DataFrame with the new column 
df.show()


Create a new column with a function using a registered UDF and the selectExpr() method

In this approach, we add a new column to a data frame by defining a custom function and registering it as a UDF with the spark.udf.register() method. We then use the selectExpr() method of the data frame to select the existing columns together with a new column, which is created by calling the registered UDF by name on a column of the data frame. The following code shows the complete example:

Python3

# Import required modules 
from pyspark.sql import SparkSession 
  
# Creating the SparkSession 
spark = SparkSession.builder.appName( 
  "Creating new column using UDF").getOrCreate() 
  
# Creating the DataFrame 
data = [("John", 25), ("Mike", 30), 
        ("Emily", 35)] 
columns = ["name", "age"] 
df = spark.createDataFrame(data, columns) 
print("Dataframe Before adding new col : ") 
df.show() 
  
# Define the Python function that will be registered as a UDF
def my_udf(age): 
    if age < 30: 
        return "Young"
    else: 
        return "Old"
  
# Register the function under the SQL name "age_group"
udf_age_group = spark.udf.register(
  "age_group", my_udf)

# Call the registered UDF by name in a SQL expression
df = df.selectExpr("name","age",
                   "age_group(age) as age_group")
  
print("Dataframe After adding new col : ") 
# Show the DataFrame 
df.show()