In this article, we are going to sort the rows of an RDD by value in PySpark.
Creating RDD for demonstration:
# importing module
from pyspark.sql import SparkSession, Row
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
# create 3 Rows with 3 columns
data = [Row(First_name="Sravan", Last_name="Kumar", age=23),
        Row(First_name="Ojaswi", Last_name="Pinkey", age=16),
        Row(First_name="Rohith", Last_name="Devi", age=7)]
# create rdd from the rows
rdd = spark.sparkContext.parallelize(data)
# display data
rdd.collect()
Output:
[Row(First_name='Sravan', Last_name='Kumar', age=23), Row(First_name='Ojaswi', Last_name='Pinkey', age=16), Row(First_name='Rohith', Last_name='Devi', age=7)]
Method 1: Using sortBy()
sortBy() is a method available on an RDD. It sorts the elements by a key computed from each element and returns a new RDD, in ascending order by default.
Syntax: rdd.sortBy(lambda expression)
It uses a lambda expression to sort the data based on columns.
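lambda expression: lambda x: x[column_index], where column_index is the zero-based position of the column (0 for the first column, 2 for the third)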
Example 1: Sort the data by values based on column 1
# sort the data by values based on column 1
rdd.sortBy(lambda x: x[0]).collect()
Output:
[Row(First_name='Ojaswi', Last_name='Pinkey', age=16), Row(First_name='Rohith', Last_name='Devi', age=7), Row(First_name='Sravan', Last_name='Kumar', age=23)]
Example 2: Sort data based on column 3 values
# sort the data by values based on column 3 (index 2, age)
rdd.sortBy(lambda x: x[2]).collect()
Output:
[Row(First_name='Rohith', Last_name='Devi', age=7), Row(First_name='Ojaswi', Last_name='Pinkey', age=16), Row(First_name='Sravan', Last_name='Kumar', age=23)]
Method 2: Using takeOrdered()
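takeOrdered() is a method available on an RDD. It returns the first n elements in ascending order of the values in a particular column. Unlike sortBy(), it returns a Python list rather than an RDD, so collect() is not needed.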
Syntax: rdd.takeOrdered(n, lambda expression)
where n is the number of rows to return after sorting
Sort values based on a particular column using takeOrdered function
# sort values based on column 1 using takeOrdered function
print(rdd.takeOrdered(3, lambda x: x[0]))
# sort values based on column 3 using takeOrdered function
print(rdd.takeOrdered(3, lambda x: x[2]))
Output:
[Row(First_name='Ojaswi', Last_name='Pinkey', age=16), Row(First_name='Rohith', Last_name='Devi', age=7), Row(First_name='Sravan', Last_name='Kumar', age=23)]
[Row(First_name='Rohith', Last_name='Devi', age=7), Row(First_name='Ojaswi', Last_name='Pinkey', age=16), Row(First_name='Sravan', Last_name='Kumar', age=23)]