How to sort by value in PySpark?

Last Updated : 18 Jul, 2021

In this article, we are going to sort the rows of a PySpark RDD by the values in a chosen column, using sortBy() and takeOrdered().

Creating RDD for demonstration:

Python3




# importing module
from pyspark.sql import SparkSession, Row
  
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
  
# create 3 Rows with 3 columns each
# (the list brackets are needed so that all three rows end up in data)
data = [Row(First_name="Sravan", Last_name="Kumar", age=23),
        Row(First_name="Ojaswi", Last_name="Pinkey", age=16),
        Row(First_name="Rohith", Last_name="Devi", age=7)]
  
# create an RDD from the rows
rdd = spark.sparkContext.parallelize(data)
  
# display data
rdd.collect()


Output:

[Row(First_name='Sravan', Last_name='Kumar', age=23),
Row(First_name='Ojaswi', Last_name='Pinkey', age=16),
Row(First_name='Rohith', Last_name='Devi', age=7)]

Method 1: Using sortBy()

sortBy() is an RDD method that sorts the data by the value returned from a key function. It returns a new RDD, sorted in ascending order by default.

Syntax: rdd.sortBy(lambda expression)

It uses a lambda expression to sort the data based on columns.

lambda expression: lambda x: x[column_index], where column_index is zero-based (0 is the first column)

Example 1: Sort the data by the first column (First_name, index 0)

Python3




# sort the data by the first column (First_name, index 0)
rdd.sortBy(lambda x: x[0]).collect()


Output:

[Row(First_name='Ojaswi', Last_name='Pinkey', age=16),
Row(First_name='Rohith', Last_name='Devi', age=7),
Row(First_name='Sravan', Last_name='Kumar', age=23)]

Example 2: Sort the data by the third column (age, index 2)

Python3




# sort the data by the third column (age, index 2)
rdd.sortBy(lambda x: x[2]).collect()


Output:

[Row(First_name='Rohith', Last_name='Devi', age=7),
Row(First_name='Ojaswi', Last_name='Pinkey', age=16),
Row(First_name='Sravan', Last_name='Kumar', age=23)]

Method 2: Using takeOrdered()

takeOrdered() is an RDD method that returns the first n rows after sorting by the values in a particular column. Unlike sortBy(), it returns a Python list rather than an RDD.

Syntax: rdd.takeOrdered(n, lambda expression)

where n is the number of rows to return after sorting

Sort values based on a particular column using takeOrdered function

Python3




# sort values based on the first column
# (First_name, index 0) using takeOrdered function
print(rdd.takeOrdered(3, lambda x: x[0]))
  
# sort values based on the third column
# (age, index 2) using takeOrdered function
print(rdd.takeOrdered(3, lambda x: x[2]))


Output:

[Row(First_name='Ojaswi', Last_name='Pinkey', age=16), Row(First_name='Rohith', Last_name='Devi', age=7), Row(First_name='Sravan', Last_name='Kumar', age=23)]

[Row(First_name='Rohith', Last_name='Devi', age=7), Row(First_name='Ojaswi', Last_name='Pinkey', age=16), Row(First_name='Sravan', Last_name='Kumar', age=23)]


