Sort PySpark Dataframe Within Groups

Last Updated : 03 Jan, 2023

PySpark offers many functions for working with data sets in Python. One common task is to group a data frame by one or more columns, aggregate each group, and then arrange the result in ascending or descending order. This article shows how to do that with the sort() and orderBy() functions.

Modules Required:

PySpark: The Python API for Apache Spark, an open-source, distributed computing framework and set of libraries for large-scale data processing. It can be installed with the following command:

pip install pyspark

Methods to sort a PySpark data frame within groups

Using sort() function
Using orderBy() function

Method 1: Using sort() function

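In this method, we use the sort() function to sort the data frame in PySpark. It takes the columns to sort by and an optional ascending keyword argument that selects ascending or descending order. The order can also be set per column with the asc() and desc() functions.

Syntax:
DataFrame.sort(*cols, **kwargs)

Parameters:

cols: Column objects or column names to sort by

ascending: Keyword argument; a Boolean or a list of Booleans (default True). Pass False to sort in descending order, or a list with one value per column.

There is no na.last parameter in PySpark. The position of null values is controlled per column instead, for example with Column.asc_nulls_last().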

Stepwise Implementation:

Step 1: First, import the required names: SparkSession from pyspark.sql, and sum and desc from pyspark.sql.functions. SparkSession is used to create the session, sum adds up the values of a column within each group, and desc sorts a column in descending order.

from pyspark.sql import SparkSession
from pyspark.sql.functions import sum, desc

Step 2: Now, create a spark session using the getOrCreate function.

spark_session = SparkSession.builder.getOrCreate()

Step 3: Then, read the CSV file which you want to sort within groups.

data_frame = spark_session.read.csv('#Path of CSV file', sep=',', inferSchema=True, header=True)

Step 4: Finally, group the data frame with the groupby function, aggregate each group with the agg function, and arrange the result in ascending or descending order with the sort function.

data_frame.groupby("#column-name").agg(sum("#column-name").alias("#column-name")).sort(desc("#column-name")).show()

Example 1:

In this example, we take a 3×6 data set, group it by the class and name columns using the groupby function, and sort the result in ascending order by the class column using the sort and asc functions. The agg and sum functions add up the marks of the rows that fall into the same group.

Python3

# Python program to sort pyspark
# data within groups by single column
 
# Import SparkSession and the sum and asc functions
from pyspark.sql import SparkSession
from pyspark.sql.functions import sum, asc

# Create a spark session using getOrCreate() function
spark_session = SparkSession.builder.getOrCreate()

# Read the CSV file
data_frame = spark_session.read.csv(
    '/content/marks_data.csv', sep=',',
    inferSchema=True, header=True)
 
# Make the groups of dataframe using groupby function and
# sort in ascending order by using sort function
data_frame.groupby("class", "name").agg(
    sum("marks").alias("marks")).sort(asc("class")).show()

Example 2:

In this example, we take the same 3×6 data set, group it by the class and name columns using the groupby function, and sort the result by two keys: first in descending order by the class column using desc, then in ascending order by the marks column using asc. The agg and sum functions add up the marks of the rows that fall into the same group.

Python3

# Python program to sort pyspark
# data within groups by multiple columns
 
# Import SparkSession and the sum, desc and asc functions
from pyspark.sql import SparkSession
from pyspark.sql.functions import desc, sum, asc

# Create a spark session using getOrCreate() function
spark_session = SparkSession.builder.getOrCreate()

# Read the CSV file
data_frame = spark_session.read.csv(
    '/content/marks_data.csv', sep=',',
    inferSchema=True, header=True)

# Make the groups of dataframe using groupby function and sort
# by class (descending), then marks (ascending) using sort function
data_frame.groupby("class", "name").agg(sum("marks").alias(
    "marks")).sort(desc("class"), asc("marks")).show()

Method 2: Using orderBy() function

In this method, we use the orderBy() function to sort the data frame in PySpark. orderBy() is an alias of sort(): it sorts the rows by the values of the specified columns and accepts the same arguments.

Syntax: DataFrame.orderBy(*cols, **kwargs)

Parameters :

cols: Column objects or column names to order by

ascending: Keyword argument; a Boolean or a list of Booleans (default True) that specifies the sorting order, ascending or descending, of the columns listed in cols

Return type: Returns a new DataFrame sorted by the specified columns.

Stepwise Implementation:

Step 1: First, import the required names: SparkSession from pyspark.sql, and sum, desc, and col from pyspark.sql.functions. SparkSession is used to create the session, and sum adds up the values of a column within each group. desc sorts a column in descending order, while col returns a Column object for the given column name.

from pyspark.sql import SparkSession
from pyspark.sql.functions import sum, desc, col

Step 2: Now, create a spark session using the getOrCreate function.

spark_session = SparkSession.builder.getOrCreate()

Step 3: Then, read the CSV file which you want to sort within groups.

data_frame = spark_session.read.csv('#Path of CSV file', sep=',', inferSchema=True, header=True)

Step 4: Finally, group the data frame with the groupby function, aggregate each group with the agg function, and arrange the result in ascending or descending order with the orderBy function.

data_frame.groupby("#column-name").agg(sum("#column-name").alias("#column-name")).orderBy(col("#column-name").desc()).show()

Example 1:

In this example, we take a 3×6 data set, group it by the class and name columns using the groupby function, and sort the result in ascending order by the class column using the orderBy function and the asc method of the column. The agg and sum functions add up the marks of the rows that fall into the same group.

Python3

# Python program to sort pyspark
# data within groups by single column
 
# Import SparkSession and the sum and col functions
from pyspark.sql import SparkSession
from pyspark.sql.functions import sum, col

# Create a spark session using getOrCreate() function
spark_session = SparkSession.builder.getOrCreate()

# Read the CSV file
data_frame = spark_session.read.csv(
    '/content/marks_data.csv',
    sep=',', inferSchema=True, header=True)

# Make the groups of dataframe using groupby function and
# sort in ascending order by using orderBy function
data_frame.groupby("class", "name").agg(
    sum("marks").alias("marks")).orderBy(col("class").asc()).show()

Example 2:

In this example, we take the same 3×6 data set, group it by the class and name columns using the groupby function, and order the result by two keys with the orderBy function: first in descending order by the class column using the desc method, then in ascending order by the marks column using the asc method. The agg and sum functions add up the marks of the rows that fall into the same group.

Python3

# Python program to sort pyspark
# data within groups by multiple columns
 
# Import SparkSession and the sum and col functions
from pyspark.sql import SparkSession
from pyspark.sql.functions import sum, col

# Create a spark session using getOrCreate() function
spark_session = SparkSession.builder.getOrCreate()

# Read the CSV file
data_frame = spark_session.read.csv(
    '/content/marks_data.csv',
    sep=',', inferSchema=True, header=True)

# Make the groups of dataframe using groupby function and order
# by class (descending), then marks (ascending) using orderBy function
data_frame.groupby("class", "name").agg(sum("marks").alias("marks")).orderBy(
    col("class").desc(), col("marks").asc()).show()
