Create new column with function in Spark Dataframe
Last Updated :
05 Feb, 2023
In this article, we are going to learn how to create a new column with a function in the PySpark data frame in Python.
PySpark is a popular Python library for distributed data processing that provides high-level APIs for working with big data. The data frame class is a key component of PySpark, as it allows you to manipulate tabular data with distributed computing. In this article, we will need to have PySpark installed and be familiar with basic data frame operations.
Define a function:
In this, we have defined a function that we can use to create a new column. This function takes an age as input and returns a string indicating whether the person is a ” child ” or an ” adult “.
Python3
def age_group(age):
if age < 18 :
return "child"
else :
return "adult"
age_group_udf = udf(age_group)
|
Create a new column with a function using the withColumn() method in PySpark
In this column, we are going to add a new column to a data frame by defining a custom function and applying it to the data frame using a UDF. The UDF takes a column of the data frame as input, applies the custom function to it, and returns the result as a new column.
Here are the steps for using the withColumn() method to create a new column called “age_group” in our data frame:
Python3
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
spark = SparkSession.builder.getOrCreate()
data = [( "Alice" , 1 ), ( "Bob" , 2 ), ( "Charlie" , 3 )]
df = spark.createDataFrame(data, [ "name" , "age" ])
print ( "Dataframe Before adding new col : " )
df.show()
def age_group(age):
if age < 18 :
return "child"
else :
return "adult"
age_group_udf = udf(age_group)
df = df.withColumn( "age_group" , age_group_udf(df.age))
print ( "Dataframe After adding new col : " )
df.show()
|
Output :
Create a new column with a function using the PySpark UDFs method
In this approach, we are going to add a new column to a data frame by defining a custom function and registering it as a UDF using the spark.udf.register() method. Then using selectExpr() method of the data frame to select the columns of the data frame and the new column which is created by applying the UDF on a column of DataFrame. Here is a step-by-step explanation of the code:
Python3
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName(
"Creating new column using UDF" ).getOrCreate()
data = [( "John" , 25 ), ( "Mike" , 30 ),
( "Emily" , 35 )]
columns = [ "name" , "age" ]
df = spark.createDataFrame(data, columns)
print ( "Dataframe Before adding new col : " )
df.show()
def my_udf(age):
if age < 30 :
return "Young"
else :
return "Old"
udf_age_group = spark.udf.register(
"age_group" , my_udf)
df = df.selectExpr( "name" , "age" ,
"age_group(age) as age_group" )
print ( "Dataframe After adding new col : " )
df.show()
|
Output :
Share your thoughts in the comments
Please Login to comment...