Apply function to all values in array column in PySpark
Last Updated :
23 Jan, 2023
A PySpark data frame is a distributed collection of data grouped into named columns. Its columns can be of any type: IntegerType, StringType, ArrayType, and so on. For an ArrayType column, you can apply a function to every value in the array. This can be achieved by creating a user-defined function (UDF) and calling it to create a new column in the data frame. In this article, we discuss how to do exactly that.
Creating a data frame for demonstration:
In this example, we import SparkSession and Row, then create a data frame row-wise via the Spark context with four columns: ‘Roll_Number‘, ‘Full_Name‘, ‘Marks‘, and ‘Subjects‘. The ‘Full_Name‘, ‘Marks‘, and ‘Subjects‘ columns are ArrayType columns containing three elements each.
Python3
from pyspark.sql import SparkSession, Row

# Create a SparkSession and get the underlying Spark context
spark_session = SparkSession.builder.getOrCreate()
sc = spark_session.sparkContext

# Build the data frame row-wise; the last three columns are ArrayType
df = sc.parallelize([Row(Roll_Number=1,
                         Full_Name=['Arun', 'Kumar', 'Chaudhary'],
                         Marks=[95, 58, 63],
                         Subjects=['Maths', 'Physics', 'Chemistry']),
                     Row(Roll_Number=2,
                         Full_Name=['Aniket', 'Singh', 'Rajpoot'],
                         Marks=[87, 69, 56],
                         Subjects=['History', 'Geography', 'Arts']),
                     Row(Roll_Number=3,
                         Full_Name=['Ishita', 'Rai', 'Pundir'],
                         Marks=[49, 75, 98],
                         Subjects=['Accounts', 'Business Studies',
                                   'Maths'])]).toDF()
df.show(truncate=False)
Output:
Example 1:
In this example, we define a UDF that subtracts 3 from each mark, performing the operation on every element of the array. We then call it to create the new column ‘Updated Marks‘ and display the data frame.
Python3
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, IntegerType

# UDF that subtracts 3 from every element of an integer array;
# the return type must be declared as ArrayType(IntegerType())
reduce_marks = udf(lambda x: [i - 3 for i in x], ArrayType(IntegerType()))

updated_df = df.withColumn('Updated Marks', reduce_marks('Marks'))
updated_df.show(truncate=False)
Output:
Example 2:
In this example, we define a UDF that converts each string to upper case, performing the operation on every element of the array. We then call it to create the new column ‘Updated_Full_Name‘ and display the data frame.
Python3
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

# UDF that upper-cases every string in an array;
# the return type must be declared as ArrayType(StringType())
upper_case = udf(lambda x: [i.upper() for i in x],
                 ArrayType(StringType()))

updated_df = df.withColumn('Updated_Full_Name', upper_case('Full_Name'))
updated_df.show(truncate=False)
Output:
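This example, too, can be written without a UDF on Spark 3.1+: the built-in column function `upper` can be passed directly to `transform`, since it takes one column argument and returns a column. A minimal sketch, assuming a Spark 3.1+ session (the single-row data frame here is only for illustration):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import transform, upper

spark = SparkSession.builder.getOrCreate()

# A small data frame with the same shape as the 'Full_Name' column above
df = spark.createDataFrame([(1, ['Arun', 'Kumar', 'Chaudhary'])],
                           ['Roll_Number', 'Full_Name'])

# upper() is a native column function, so no Python UDF is needed
updated_df = df.withColumn('Updated_Full_Name',
                           transform('Full_Name', upper))
updated_df.show(truncate=False)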