How to sort by value in PySpark?

  • Last Updated : 18 Jul, 2021

This article shows how to sort the elements of a PySpark RDD by the values in a chosen column.

Creating RDD for demonstration:


# importing module
from pyspark.sql import SparkSession, Row
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
# create 3 Rows with 3 columns each, collected in a list
data = [Row(First_name="Sravan", Last_name="Kumar", age=23),
        Row(First_name="Ojaswi", Last_name="Pinkey", age=16),
        Row(First_name="Rohith", Last_name="Devi", age=7)]
# create an RDD from the list of rows
rdd = spark.sparkContext.parallelize(data)
# display data
rdd.collect()


[Row(First_name='Sravan', Last_name='Kumar', age=23),
Row(First_name='Ojaswi', Last_name='Pinkey', age=16),
Row(First_name='Rohith', Last_name='Devi', age=7)]

Method 1: Using sortBy()

sortBy() is an RDD method that returns a new RDD whose elements are sorted by the value a key function computes for each element.

Syntax: rdd.sortBy(keyfunc, ascending=True, numPartitions=None)

keyfunc is usually a lambda expression that selects the column to sort on. Column indexes start at 0, so index 0 is the first column.

lambda expression: lambda x: x[column_index]

Example 1: Sort the data by the values in column 1 (First_name, index 0)


# sort the data by values based on column 1
rdd.sortBy(lambda x: x[0]).collect()


[Row(First_name='Ojaswi', Last_name='Pinkey', age=16),
Row(First_name='Rohith', Last_name='Devi', age=7),
Row(First_name='Sravan', Last_name='Kumar', age=23)]

Example 2: Sort the data by the values in column 3 (age, index 2)


# sort the data by values based on column 3 (index 2)
rdd.sortBy(lambda x: x[2]).collect()


[Row(First_name='Rohith', Last_name='Devi', age=7),
Row(First_name='Ojaswi', Last_name='Pinkey', age=16),
Row(First_name='Sravan', Last_name='Kumar', age=23)]

Method 2: Using takeOrdered()

takeOrdered() is an RDD action that returns the first n elements, in ascending order of a key, as a Python list on the driver.

Syntax: rdd.takeOrdered(n, lambda expression)

where n is the number of rows to return after sorting, and the lambda expression selects the column to sort on

Example: Sort values based on a particular column using takeOrdered()


# sort values based on
# column 1 using takeOrdered function
print(rdd.takeOrdered(3,lambda x: x[0]))
# sort values based on
# column 3 using takeOrdered function
print(rdd.takeOrdered(3,lambda x: x[2]))


[Row(First_name='Ojaswi', Last_name='Pinkey', age=16), Row(First_name='Rohith', Last_name='Devi', age=7), Row(First_name='Sravan', Last_name='Kumar', age=23)]

[Row(First_name='Rohith', Last_name='Devi', age=7), Row(First_name='Ojaswi', Last_name='Pinkey', age=16), Row(First_name='Sravan', Last_name='Kumar', age=23)]
