Convert pair to value using map() in PySpark
Last Updated :
05 Feb, 2023
In this article, we are going to learn how to use map() to convert (key, value) pairs to values only, to keys only, and to a flat list of both, using PySpark in Python.
PySpark is the Python library for Spark programming. It is an API for interacting with the Spark cluster using the Python programming language. PySpark provides a simple and easy-to-use API for distributed data processing, machine learning, and graph processing using the power of Apache Spark.
map() function
The map() function is one of the core operations in PySpark. It is a transformation operation that applies a function to each element of an RDD (Resilient Distributed Dataset) and returns a new RDD containing the results. The function passed as an argument to map() takes a single argument, an element of the RDD, and returns a new element.
Example 1
In this example, we are going to convert key-value pairs into values only. First, we import the required module, then store the key-value pairs (1, 'a'), (2, 'b'), and (3, 'c') in the RDD kv_rdd. The map() function is then applied to kv_rdd with a lambda that takes a single argument x, a key-value pair, and returns only the value x[1]. This creates a new RDD containing only the values of the original RDD. Finally, the collect() method retrieves the values of the new RDD and stores them in the variable "values".
Python3
from pyspark import SparkContext

sc = SparkContext()

# Create an RDD of (key, value) pairs
kv_rdd = sc.parallelize([(1, 'a'),
                         (2, 'b'),
                         (3, 'c')])

# Keep only the value from each pair
value_rdd = kv_rdd.map(lambda x: x[1])

values = value_rdd.collect()
print(values)
Output:
['a', 'b', 'c']
Example 2
In this example, we are going to convert key-value pairs to keys only. We follow the same procedure as in the first example, but the lambda passed to map() returns the key x[0] of each key-value pair instead of the value.
Python3
from pyspark import SparkContext

sc = SparkContext()

# Create an RDD of (key, value) pairs
kv_rdd = sc.parallelize([(1, 'a'),
                         (2, 'b'),
                         (3, 'c')])

# Keep only the key from each pair
keys_rdd = kv_rdd.map(lambda x: x[0])

keys = keys_rdd.collect()
print(keys)
Output:
[1, 2, 3]
Example 3
In this example, we are going to convert the key-value pairs into a single flat list of keys and values. Here the lambda passed to map() takes a single argument x, a key-value pair, and returns the pair unchanged. After calling collect(), we flatten the keys and values into one list with a list comprehension.
Python3
from pyspark import SparkContext

sc = SparkContext()

# Create an RDD of (key, value) pairs
kv_rdd = sc.parallelize([(1, 'a'),
                         (2, 'b'),
                         (3, 'c')])

# Return each pair unchanged
value_rdd = kv_rdd.map(lambda x: x)

# Flatten the collected pairs into a single list
values = [item for t in value_rdd.collect() for item in t]
print(values)
Output:
[1, 'a', 2, 'b', 3, 'c']