Applying function to PySpark Dataframe Column

Last Updated : 05 Feb, 2023

In this article, we will learn how to apply a function to a column of a PySpark DataFrame.

Apache Spark can be used from Python through the PySpark library. PySpark is an open-source Python library commonly used for data analytics and data science. pandas is powerful for data analysis, but what sets PySpark apart is its capacity to handle big data.

Note: When following the article on installing PySpark, install Python instead of Scala; the rest of the steps are the same.

Install required modules

Run the commands below in the command prompt or terminal to install PySpark and pandas. pandas UDFs also require PyArrow, so install it as well:

pip install pyspark
pip install pandas
pip install pyarrow

Applying a Function on a PySpark DataFrame Column

Here we will look at how to apply a function to a PySpark DataFrame column. For this purpose, we will use ‘pandas_udf()’ from ‘pyspark.sql.functions’.

Syntax:
# defining function
# (the string passed to pandas_udf is the return type of the UDF)
@pandas_udf('return_type')
def function_name(argument: argument_type) -> result_type:
    function_content
# applying function
DataFrame.select(function_name(specific_DataFrame_column)).show()
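
Before the decorator is applied, the function in this template is ordinary pandas code, so its logic can be checked locally without starting Spark. The sketch below uses a hypothetical function, shout, that is not one of the article's examples, and assumes only that pandas is installed. It also checks that the result has the same length as the input, which a Series-to-Series pandas UDF is expected to preserve.

```python
import pandas as pd

# Hypothetical function body, written as it would appear under
# @pandas_udf('string'), but left undecorated so it runs without Spark.
def shout(s: pd.Series) -> pd.Series:
    # Series.str applies the string method to every element
    return s.str.upper()

sample = pd.Series(["apple", "banana", "orange"])
result = shout(sample)

# The output must contain one value per input value.
same_length = len(result) == len(sample)

print(result.tolist())
print(same_length)
```

Once the plain function behaves as intended, adding the decorator is the only change needed to use it in .select().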

Example 1: Adding ‘s’ to every element in the column of DataFrame

Here we apply a function that returns each element with an additional ‘s’ appended. Let’s look at the steps:

Import SparkSession and Row from PySpark.
Import pandas_udf from pyspark.sql.functions.
Import pandas as pd (the function's type hints refer to pd.Series).
Initialize the SparkSession.
Create a DataFrame.
Use pandas_udf as the decorator.
Define the function.
Call .select on the DataFrame, passing the function applied to the column you want to transform.

Python3

# importing pandas, used in the type hints of the UDFs
import pandas as pd
# importing SparkSession to initialize session
from pyspark.sql import SparkSession
# importing pandas_udf
from pyspark.sql.functions import pandas_udf
# importing Row to create DataFrame
from pyspark.sql import Row

# initialising spark session
spark = SparkSession.builder.getOrCreate()

# creating DataFrame
df = spark.createDataFrame([
    Row(fruits='apple', quantity=1),
    Row(fruits='banana', quantity=2),
    Row(fruits='orange', quantity=4)
])

# printing our created DataFrame
df.show()

Output:

+------+--------+
|fruits|quantity|
+------+--------+
| apple|       1|
|banana|       2|
|orange|       4|
+------+--------+

Now, let’s apply the function to the ‘fruits’ column of this DataFrame.

Python3

# pandas UDF whose return type is 'string'
@pandas_udf('string')
def adding_s(s: pd.Series) -> pd.Series:
    # append 's' to every string in the Series
    return s + 's'

# applying the above function to the
# 'fruits' column of the 'df' DataFrame
df.select(adding_s('fruits')).show()

Output:

+----------------+
|adding_s(fruits)|
+----------------+
|          apples|
|         bananas|
|         oranges|
+----------------+
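
Spark does not hand the whole column to adding_s at once: it splits the column into batches and calls the function once per batch, passing each batch as a pandas Series. The sketch below imitates that with plain pandas, so no Spark session is needed. The batch size of 2 and the helper name apply_in_batches are assumptions made for illustration and do not reflect Spark's actual internals.

```python
import pandas as pd

def adding_s_plain(s: pd.Series) -> pd.Series:
    # same body as adding_s, without the pandas_udf decorator
    return s + "s"

def apply_in_batches(func, column: pd.Series, batch_size: int) -> pd.Series:
    # Hypothetical helper: slice the column into consecutive batches,
    # call func on each batch, then stitch the results back together.
    batches = [
        column.iloc[start:start + batch_size]
        for start in range(0, len(column), batch_size)
    ]
    return pd.concat([func(batch) for batch in batches])

fruits = pd.Series(["apple", "banana", "orange"])

batched = apply_in_batches(adding_s_plain, fruits, batch_size=2)
whole = adding_s_plain(fruits)

print(batched.tolist())
print(batched.tolist() == whole.tolist())
```

Because the function works element by element, batching does not change the result. A function that depends on the whole column, for example one that subtracts the column mean, would see only one batch at a time and could give different results.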

Example 2: Capitalizing each element in the ‘fruits’ column

Here we will capitalize each element in the ‘fruits’ column of the same DataFrame from the previous example. Let’s look at the steps:

Use pandas_udf as the decorator.
Define the function.
Call .select on the DataFrame, passing the function applied to the column we want to transform.

Python3

@pandas_udf('string')
def capitalize(s1: pd.Series) -> pd.Series:
    # s1 is a pandas Series, which has no capitalize()
    # method of its own. String methods are reached
    # through the .str accessor, so we call
    # s1.str.capitalize() to capitalize every element.
    return s1.str.capitalize()

df.select(capitalize('fruits')).show()

Output:

+------------------+
|capitalize(fruits)|
+------------------+
|             Apple|
|            Banana|
|            Orange|
+------------------+
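
The comment in the function above explains why the .str accessor is needed. A minimal check with plain pandas (no Spark required) makes the difference visible; the variable names here are illustrative only.

```python
import pandas as pd

fruits = pd.Series(["apple", "banana", "orange"])

# A Series does not expose capitalize() directly...
has_direct_method = hasattr(fruits, "capitalize")

# ...string methods live under the .str accessor and act on each element.
capitalized = fruits.str.capitalize()

print(has_direct_method)
print(capitalized.tolist())
```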

Example 3: Square of each element in the ‘quantity’ column of ‘df’ DataFrame

Here we will create a function that returns the squares of the numbers in the ‘quantity’ column. This time the function receives an iterator of pandas Series (one Series per batch) and yields one result Series for each. Let’s look at the steps:

Import Iterator from typing.
Use pandas_udf() as the decorator.
Define the function.
Call .select on the DataFrame, passing the function applied to the column you want to transform.

Python3

from typing import Iterator

@pandas_udf('long')
def square(iterator: Iterator[pd.Series]) -> Iterator[pd.Series]:
    # each x is a pandas Series holding one batch of the column
    for x in iterator:
        # element-wise square of the batch
        yield x * x

df.select(square('quantity')).show()

Output:

+----------------+
|square(quantity)|
+----------------+
|               1|
|               4|
|              16|
+----------------+
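
To see how the iterator form behaves, the same generator can be driven by hand with plain pandas. The sketch below is not how Spark is invoked; the two hand-made batches are an assumption standing in for whatever batches Spark would supply.

```python
from typing import Iterator

import pandas as pd

def square_plain(iterator: Iterator[pd.Series]) -> Iterator[pd.Series]:
    # Code placed here runs once, before the first batch is processed,
    # which is where one-time setup would go.
    for x in iterator:
        # one output Series is yielded per input Series
        yield x * x

# Two hand-made batches standing in for the 'quantity' column
batches = iter([pd.Series([1, 2]), pd.Series([4])])

squared = [batch.tolist() for batch in square_plain(batches)]
print(squared)
```

The generator yields exactly one result per batch, so the number and length of the outputs match the inputs.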

Example 4: Multiplying each element of the ‘quantity’ column by 10

We will follow the same steps as above, changing only the body of the function.

Python3

from typing import Iterator

@pandas_udf('long')
def multiply_by_10(iterator: Iterator[pd.Series]) -> Iterator[pd.Series]:
    for x in iterator:
        # multiplying each element by 10
        yield x * 10

df.select(multiply_by_10('quantity')).show()

Output:

+------------------------+
|multiply_by_10(quantity)|
+------------------------+
|                      10|
|                      20|
|                      40|
+------------------------+
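
Examples 3 and 4 differ only in the arithmetic inside the loop, so the pattern can be parameterized. The sketch below builds such functions with a hypothetical factory, make_multiplier, and tests the result with plain pandas. Wrapping the returned function with pandas_udf('long') is expected to give an equivalent UDF, but that step is an assumption and is not exercised here.

```python
from typing import Callable, Iterator

import pandas as pd

BatchFunc = Callable[[Iterator[pd.Series]], Iterator[pd.Series]]

def make_multiplier(factor: int) -> BatchFunc:
    # Hypothetical factory: returns a batch-wise function that
    # multiplies every element by the given factor.
    def multiply(iterator: Iterator[pd.Series]) -> Iterator[pd.Series]:
        for x in iterator:
            yield x * factor
    return multiply

multiply_by_10_plain = make_multiplier(10)

# One hand-made batch standing in for the 'quantity' column
batches = iter([pd.Series([1, 2, 4])])

scaled = [batch.tolist() for batch in multiply_by_10_plain(batches)]
print(scaled)
```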
