
Convert pair to value using map() in Pyspark

Last Updated : 05 Feb, 2023

In this article, we are going to learn how to use map() to extract only the keys or only the values from (key, value) pairs using PySpark in Python.

PySpark is the Python API for Apache Spark. It lets you interact with a Spark cluster from Python and provides a simple, easy-to-use interface for distributed data processing, machine learning, and graph processing.

map() function

The map() function is one of the core operations in PySpark. It is a transformation that applies a function to each element of an RDD (Resilient Distributed Dataset) and returns a new RDD containing the results. The function passed as an argument to map() takes a single argument, an element of the RDD, and returns a new element.
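For instance, here is a minimal sketch (assuming a running SparkContext named sc, as created in the examples below) that uses map() to square every element of an RDD:

Python3

# Square each element of an RDD with map()
nums = sc.parallelize([1, 2, 3, 4])
squares = nums.map(lambda x: x * x)

print(squares.collect())

Output:

[1, 4, 9, 16]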

Example 1

In this example, we convert key-value pairs into values only. First, we import the required module and create a pair RDD kv_rdd with the elements (1, 'a'), (2, 'b'), and (3, 'c'). We then apply map() to kv_rdd: the lambda function passed to map() takes a single argument x, a key-value pair, and returns only the value x[1]. This creates a new RDD containing only the values of the original RDD. Finally, the collect() method retrieves the values of the new RDD, and we store them in the variable values.

Python3
# Import required module
from pyspark import SparkContext

sc = SparkContext()

# Create a key-value pair RDD
kv_rdd = sc.parallelize([(1, 'a'),
                         (2, 'b'),
                         (3, 'c')])

# Use map() to convert the RDD to an
# RDD containing only the values
value_rdd = kv_rdd.map(lambda x: x[1])

# Collect the values and print them
values = value_rdd.collect()
print(values)


Output:

['a', 'b', 'c']
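As a side note, pair RDDs also provide a built-in values() transformation, so the same result can be obtained without writing the lambda yourself; a sketch using the kv_rdd defined above:

Python3

# Built-in alternative: values() returns only
# the value of each pair
value_rdd = kv_rdd.values()

print(value_rdd.collect())

Output:

['a', 'b', 'c']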

Example 2

In this example, we convert key-value pairs to keys only. The procedure is the same as in the first example, except that the lambda expression passed to map() returns the key x[0] of each pair instead of the value.

Python3
# Import required module
from pyspark import SparkContext

sc = SparkContext()

# Create a key-value pair RDD
kv_rdd = sc.parallelize([(1, 'a'),
                         (2, 'b'),
                         (3, 'c')])

# Use map() to convert the RDD to an
# RDD containing only the keys
keys_rdd = kv_rdd.map(lambda x: x[0])

# Collect the keys and print them
keys = keys_rdd.collect()
print(keys)


Output:

[1, 2, 3]
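Similarly, pair RDDs offer a built-in keys() transformation that is equivalent to the map() call above; a sketch using the kv_rdd defined above:

Python3

# Built-in alternative: keys() returns only
# the key of each pair
keys_rdd = kv_rdd.keys()

print(keys_rdd.collect())

Output:

[1, 2, 3]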

Example 3

In this example, we flatten the key-value pairs into a single list of alternating keys and values. The lambda function passed to map() takes a single argument x, a key-value pair, and returns it unchanged. After calling collect(), a list comprehension flattens the resulting pairs into one list containing both the keys and the values.

Python3
# Import required module
from pyspark import SparkContext

sc = SparkContext()

# Create a key-value pair RDD
kv_rdd = sc.parallelize([(1, 'a'),
                         (2, 'b'),
                         (3, 'c')])

# Use map() to return each
# key-value pair unchanged
value_rdd = kv_rdd.map(lambda x: x)

# Collect the pairs and flatten them
# into a single list of keys and values
values = [item for t in value_rdd.collect() for item in t]

print(values)


Output:

[1, 'a', 2, 'b', 3, 'c']
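The same flattening can also be done in a single step with flatMap(), which applies the function to each element and then flattens each returned iterable; a sketch using the kv_rdd defined above:

Python3

# flatMap() flattens each returned pair, so no
# list comprehension is needed after collect()
flat_rdd = kv_rdd.flatMap(lambda x: x)

print(flat_rdd.collect())

Output:

[1, 'a', 2, 'b', 3, 'c']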

