PySpark Row-Wise Function Composition
PySpark is the Python interface for Apache Spark. While coding in PySpark, have you ever needed to apply a function row-wise and produce a result, without knowing how to achieve it? In this article, we discuss how to apply row-wise function composition to a PySpark data frame in Python.
The udf() method wraps a Python function (here, a lambda) so it can be applied to data frame columns. Its first argument is the function to run for each row, and its second argument declares the type of value the function returns.
Syntax: udf(lambda #parameters: #action_to_perform_on_parameters, IntegerType())
First, import the required names, i.e. SparkSession, SQLContext, udf, struct, and IntegerType. SparkSession is used to create the session, while SQLContext creates the main entry point for the data frame. The udf function wraps Python code so it can be called as though it were a SQL function, struct combines multiple columns into a single struct column (so that a whole row can be passed to the UDF), and IntegerType declares that the UDF returns an integer.
Now, create a Spark session using the getOrCreate function. Then, create the main entry point for the data frame using SQLContext. Next, either create a data frame using the createDataFrame function or read a CSV file using the read.csv function. After that, define the function that will compute the value for the new column. Finally, call that function inside withColumn to create the column with the desired heading, and display the updated data frame.
Implementation:
In this example, we have created a 4*3 data frame with values of 0 and 1. Then, we created two functions: count_zeros, which counts the number of zeros in each row, and count_ones, which counts the number of ones. Finally, we created two new columns, 'Zero count' and 'One count', by calling the respective functions. Note that count_ones runs after the Zero count column has been added, so that column's value is included in its count (e.g. the row (0, 1, 1) with Zero count 1 reports One count 3).
Python3
from pyspark.sql import SparkSession, SQLContext
from pyspark.sql.functions import udf, struct
from pyspark.sql.types import IntegerType

# Create the session and the entry point for the data frame
spark_session = SparkSession.builder.getOrCreate()
sqlContext = SQLContext(spark_session)

# Create a 4-row data frame of 0s and 1s
data_frame = sqlContext.createDataFrame(
    [(0, 0, 0), (1, 0, 0), (0, 1, 1), (1, 1, 1)],
    ("X", "Y", "Z"))

# UDFs that count the zeros and ones in a row
count_zeros = udf(lambda row: len([i for i in row if i == 0]),
                  IntegerType())
count_ones = udf(lambda row: len([j for j in row if j == 1]),
                 IntegerType())

# Pass a struct of all columns to each UDF to get the row-wise counts
updated_data_frame_1 = data_frame.withColumn(
    "Zero count",
    count_zeros(struct([data_frame[x] for x in data_frame.columns])))
updated_data_frame_2 = updated_data_frame_1.withColumn(
    "One count",
    count_ones(struct([updated_data_frame_1[x]
                       for x in updated_data_frame_1.columns])))

updated_data_frame_2.show()
Output:
+---+---+---+----------+---------+
| X| Y| Z|Zero count|One count|
+---+---+---+----------+---------+
| 0| 0| 0| 3| 0|
| 1| 0| 0| 2| 1|
| 0| 1| 1| 1| 3|
| 1| 1| 1| 0| 3|
+---+---+---+----------+---------+
Last Updated: 28 Dec, 2022