
PySpark map() Transformation

Last Updated : 17 Apr, 2023

In this article, we are going to learn about the PySpark map() transformation in Python.

PySpark is the Python API for Apache Spark, an open-source framework that allows developers to use Python for big data processing. We will focus on one of the key transformations provided by PySpark, the map() transformation, which enables users to apply a function to each element in a dataset. This article will explain how the map() transformation works with examples.

How the map() transformation works

The map() transformation in PySpark is used to apply a function to each element in a dataset (an RDD). The function takes a single element as input and returns a transformed element as output, and map() returns a new dataset consisting of the transformed elements. Like other transformations, map() is evaluated lazily: the work is only performed when an action such as collect() is called.

Syntax:

rdd.map(map_function)

Here is a simple example of using the map() transformation to multiply each element in a dataset by 2:

# 'sc' is an existing SparkContext
data = [1, 2, 3, 4]
rdd = sc.parallelize(data)
rdd_transformed = rdd.map(lambda x: x * 2)

The resulting transformed dataset, rdd_transformed, would contain the following elements:

[2, 4, 6, 8]
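
The snippet above assumes an existing SparkContext named sc. A minimal, self-contained version of the same example (creating the context through a SparkSession and using collect() to materialize the result) might look like this; the application name is arbitrary:

from pyspark.sql import SparkSession

# Create a SparkSession and get its underlying SparkContext
spark = SparkSession.builder.appName("SimpleMapExample").getOrCreate()
sc = spark.sparkContext

# Build an RDD and double every element with map()
data = [1, 2, 3, 4]
rdd = sc.parallelize(data)
rdd_transformed = rdd.map(lambda x: x * 2)

# map() is lazy; collect() triggers the computation
print(rdd_transformed.collect())  # [2, 4, 6, 8]

spark.stop()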

PySpark map() transformation with a CSV file

In this example, the map() transformation is used to apply the normalize() function to each element of the rdd that was created from the data frame. The resulting transformed rdd, rdd_normalized, contains the normalized feature values for each row of the data frame.

Download dataset: data.csv 

Step 1: Import the required library and create a SparkSession.

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("MapTransformationExample").getOrCreate()

Step 2: Read the dataset from a CSV file using the following line of code.

df = spark.read.csv("data.csv", header=True)
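
Note that, read this way, every column is loaded as a string, which is why a later step converts the values with float(). If you prefer to let Spark type the numeric columns automatically, the file could instead be read with inferSchema=True (an optional variant, not used in the rest of this walkthrough):

# Optional variant: let Spark infer the column types while reading
df = spark.read.csv("data.csv", header=True, inferSchema=True)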

Step 3: The next step is to use the map() function to apply a function to each row of the data frame. In this case, the function is called extract_features() and is defined as follows:

def extract_features(row):
    return (float(row.feature1), float(row.feature2))

Step 4: This function takes a single row of the data frame as input and returns a tuple of two float values, which are the values of the “feature1” and “feature2” columns. The resulting transformed dataset is stored in the variable rdd.

rdd = df.rdd.map(extract_features)

Step 5: The next step is to use the map() transformation again to apply a function to each element of the rdd. In this case, the function is called normalize() and is defined as follows:

def normalize(row):
    try:
        return (row[0] / 10, row[1] / 100)
    except TypeError:
        # Return a default value if the input values are not numeric
        return (0, 0)

Step 6: This function takes a single tuple from the previous rdd as input, and returns a tuple of the same values divided by 10 and 100, respectively. The resulting transformed dataset is stored in the variable rdd_normalized.

rdd_normalized = rdd.map(normalize)

Step 7: The last step is to use the collect() action to retrieve the transformed elements of the rdd and print out the resulting normalized data using a for loop.

normalized_data = rdd_normalized.collect()
for row in normalized_data:
    print(row)

Step 8: Finally, the SparkSession is stopped with the following line of code:

spark.stop()

Code Implementation: 

Python3

# Import required library
from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.appName(
    "MapTransformationExample").getOrCreate()

# Read in a dataset
df = spark.read.csv("data.csv", header=True)

# Use map() to apply a function
# to each row of the data frame
def extract_features(row):
    return (float(row.feature1), float(row.feature2))

rdd = df.rdd.map(extract_features)

# Use the map() transformation to apply
# a function to each element of the rdd
def normalize(row):
    try:
        return (row[0] / 10, row[1] / 100)
    except TypeError:
        # Return a default value if the
        # input values are not numeric
        return (0, 0)

rdd_normalized = rdd.map(normalize)

# Use the collect() action to retrieve
# the transformed elements of the rdd
normalized_data = rdd_normalized.collect()

# Print the transformed data
for row in normalized_data:
    print(row)

# Stop the SparkSession
spark.stop()


Dataset before map() transformation : 

[Screenshot: feature values from data.csv before the map() transformation]

After applying map() transformation: 

[Screenshot: normalized feature values after the map() transformation]

PySpark map() transformation with a data frame

In this example, we create a data frame directly instead of reading a CSV file and then apply the map() transformation to it.

Step 1: Import the necessary module:

from pyspark.sql import SparkSession

Step 2: Create a SparkSession. This creates a new SparkSession with the name “map_example”.

spark = SparkSession.builder.appName("map_example").getOrCreate()

Step 3: Create a data frame with sample data. This creates a data frame with two columns, “name” and “age”, and three rows of sample data.

data = [("Alice", 1), ("Bob", 2), ("Charlie", 3)]
df = spark.createDataFrame(data, ["name", "age"])

Step 4: Define a function to be applied to each row. This function takes an age as input and returns age + 1.

def add_one(age):
    return age + 1

Step 5: Use the map() transformation to apply the function to the “age” column. This applies the add_one() function to each value in the “age” column using the map() transformation. The resulting data frame contains the modified “age” values.

df = df.rdd.map(lambda x: (x[0], add_one(x[1]))).toDF(["name", "age"])

Step 6: Print the resulting data frame.

df.show()

Code Implementation: 

Python3

from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.appName(
    "map_example").getOrCreate()

# Create a DataFrame with sample data
data = [("Alice", 1), ("Bob", 2), ("Charlie", 3)]
df = spark.createDataFrame(data, ["name", "age"])

# Define a function to be applied to each row
def add_one(age):
    return age + 1

# Use the map() transformation to apply
# the function to the "age" column
df = df.rdd.map(lambda x: (x[0], add_one(x[1]))).toDF(["name", "age"])

# Show the resulting DataFrame
df.show()


Output:

+-------+---+
|   name|age|
+-------+---+
|  Alice|  2|
|    Bob|  3|
|Charlie|  4|
+-------+---+

Conclusion

The map() transformation is a valuable tool in PySpark for applying a function to each element in a dataset. It can be used to transform data in many ways and is an important part of many PySpark programs. Whether we are working with large datasets or just need to apply a simple transformation, its flexibility and power make it an essential part of any PySpark workflow.
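
For instance, the same call pattern covers reshaping records into strings just as easily as scaling numbers. A minimal sketch, assuming an existing SparkSession named spark:

# Turn (name, age) pairs into formatted strings with map()
rdd = spark.sparkContext.parallelize([("Alice", 2), ("Bob", 3)])
labels = rdd.map(lambda pair: f"{pair[0]} is {pair[1]} years old")
print(labels.collect())
# ['Alice is 2 years old', 'Bob is 3 years old']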


