
PySpark Row-Wise Function Composition


PySpark is the Python interface for Apache Spark. While coding in PySpark, have you ever needed to apply a function row-wise and produce a new column from the result? Not sure how to achieve this? Continue reading. In this article, we discuss how to apply row-wise function composition to a PySpark data frame in Python.

PySpark Row-Wise Function Composition

The udf() method wraps an ordinary Python function (here, a lambda) so that Spark can apply it to each row. Its first argument is the function to apply, and its second argument declares the return type of the result.

Syntax: udf(lambda #parameters: #action_to_perform_on_parameters, IntegerType())
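Because udf() just wraps a plain Python callable, the lambda's logic can be sketched and checked outside Spark first. The name count_zeros_fn below is a hypothetical stand-in for illustration; it is exactly the zero-counting logic later registered as a UDF:

```python
# Plain-Python version of the lambda that will be wrapped by udf():
# given a row-like sequence, count the elements equal to zero.
count_zeros_fn = lambda row: len([i for i in row if i == 0])

print(count_zeros_fn((0, 1, 0)))  # 2
print(count_zeros_fn((1, 1, 1)))  # 0
```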

First, import the required libraries, i.e. SparkSession, SQLContext, udf, struct, and IntegerType. The SparkSession library is used to create the session, while SQLContext creates the main entry point for the data frame. The udf function wraps Python code so that it can be called as though it were a SQL function, struct combines several columns into a single struct column (which lets a UDF see the whole row at once), and IntegerType declares that the UDF returns an integer.

Now, create a Spark session using the getOrCreate function. Then, create a main entry point for the data frame using the SQLContext function. Next, either create a data frame using the createDataFrame function or read a CSV file using the read.csv function. Then, define a UDF that computes the value of the new column. Further, call that UDF with withColumn to create the column under a suitable heading. Finally, display the updated data frame.

Implementation:

In this example, we have created a data frame of 4 rows and 3 columns filled with values 0 and 1. Then, we created two UDFs: count_zeros, which counts the number of zeros in each row, and count_ones, which counts the number of ones in each row. Finally, we created two new columns, 'Zero count' and 'One count', by calling the respective functions.

Python3




# Python program to implement Pyspark
# row-wise function composition
  
# Import the SparkSession, SQLContext,
# udf, struct IntegerType libraries
from pyspark.sql import SparkSession, SQLContext
from pyspark.sql.functions import udf, struct
from pyspark.sql.types import IntegerType
  
# Create a spark session using getOrCreate() function
spark_session = SparkSession.builder.getOrCreate()
  
# Create a main entry point for data
# frame using SQLContext function
sqlContext = SQLContext(spark_session)
  
# Create a data frame using createDataFrame function
data_frame = sqlContext.createDataFrame(
    [(0, 0, 0), (1, 0, 0), (0, 1, 1), (1, 1, 1)],
    ("X", "Y", "Z"))
  
# Create a function to calculate the zeros
count_zeros = udf(lambda row: len([i for i in row if i == 0]),
                  IntegerType())
  
# Create a function to calculate the ones
count_ones = udf(lambda row: len([j for j in row if j == 1]),
                 IntegerType())
  
# Call count_zeros on a struct of all the
# columns and create the column 'Zero count'
updated_data_frame_1 = data_frame.withColumn(
    "Zero count",
    count_zeros(struct([data_frame[x] for x in data_frame.columns])))
  
# Call count_ones on a struct of the original X, Y, Z columns
# only (not the newly added 'Zero count' column, which would
# otherwise be scanned too) and create the column 'One count'
updated_data_frame_2 = updated_data_frame_1.withColumn(
    "One count",
    count_ones(struct([updated_data_frame_1[x] for x in data_frame.columns])))
  
# Show the updated data frame
updated_data_frame_2.show()


Output:

+---+---+---+----------+---------+
|  X|  Y|  Z|Zero count|One count|
+---+---+---+----------+---------+
|  0|  0|  0|         3|        0|
|  1|  0|  0|         2|        1|
|  0|  1|  1|         1|        2|
|  1|  1|  1|         0|        3|
+---+---+---+----------+---------+
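As a sanity check, the same row-wise counts can be reproduced in plain Python on the identical sample rows, counting over the original X, Y, Z values only. This is a standalone sketch, not part of the Spark job:

```python
# Plain-Python cross-check of the row-wise counts, using the
# same sample rows as the Spark data frame above.
rows = [(0, 0, 0), (1, 0, 0), (0, 1, 1), (1, 1, 1)]

zero_counts = [sum(1 for v in row if v == 0) for row in rows]
one_counts = [sum(1 for v in row if v == 1) for row in rows]

print(zero_counts)  # [3, 2, 1, 0]
print(one_counts)   # [0, 1, 2, 3]
```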


Last Updated : 28 Dec, 2022