If you like working with data in Python, and with PySpark DataFrames in particular, you probably already know many of the functions that can be applied to a dataset. But did you know that you can also sort the data in ascending or descending order after grouping columns of a PySpark DataFrame? Read on to learn how to achieve this in detail.
Modules Required:
Pyspark: PySpark is the Python API for Apache Spark, an open-source, distributed computing framework and set of libraries for real-time, large-scale data processing. This module can be installed with the following command:
pip install pyspark
Methods to sort Pyspark data frame within groups
- Using sort function
- Using orderBy function
Method 1: Using sort() function
In this method, we use the sort() function to sort the DataFrame in PySpark. It takes one or more column names (or Column expressions) and an optional Boolean argument that controls whether the sort is ascending or descending.
Syntax:
DataFrame.sort(*cols, ascending=True)
Parameters:
- cols: list of Column objects or column names to sort by
- ascending: Boolean, or a list of Booleans (one per column); True sorts in ascending order, False in descending order
Stepwise Implementation:
Step 1: First, import the required libraries, i.e., SparkSession, sum, and desc. SparkSession is used to create the session, sum aggregates the columns on which groupby is applied, and desc sorts the result in descending order.
from pyspark.sql import SparkSession
from pyspark.sql.functions import sum, desc
Step 2: Now, create a spark session using the getOrCreate function.
spark_session = SparkSession.builder.getOrCreate()
Step 3: Then, read the CSV file which you want to sort within groups.
data_frame = spark_session.read.csv('#Path of CSV file', sep=',', inferSchema=True, header=True)
Step 4: Finally, group the DataFrame using the groupby function and arrange it in ascending or descending order using the sort function. We also use the agg function to aggregate the columns on which groupby is applied.
data_frame.groupby("#column-name").agg(sum("#column-name").alias("#column-name")).sort(desc("#column-name")).show()
Example 1:
In this example, we took the data frame (link), i.e., the dataset of 3×6, grouped it by the class and name columns using the groupby function, and sorted it in ascending order on the class column using the sort and asc functions. We also used the agg and sum functions to total the marks column for rows that fall into the same group.
Python3
from pyspark.sql import SparkSession
from pyspark.sql.functions import asc, sum

spark_session = SparkSession.builder.getOrCreate()

data_frame = spark_session.read.csv('/content/marks_data.csv',
                                    sep=',', inferSchema=True, header=True)

data_frame.groupby("class", "name").agg(
    sum("marks").alias("marks")).sort(asc("class")).show()
Output:
Example 2:
In this example, we took the data frame (link), i.e., the dataset of 3×6, grouped it by the class and name columns using the groupby function, sorted it in descending order on the class column using the sort and desc functions, and in ascending order on the marks column using the asc function. We also used the agg and sum functions to total the marks column for rows that fall into the same group.
Python3
from pyspark.sql import SparkSession
from pyspark.sql.functions import asc, desc, sum

spark_session = SparkSession.builder.getOrCreate()

data_frame = spark_session.read.csv('/content/marks_data.csv',
                                    sep=',', inferSchema=True, header=True)

data_frame.groupby("class", "name").agg(sum("marks").alias("marks")).sort(
    desc("class"), asc("marks")).show()
Output:
Method 2: Using orderBy() function
In this method, we use the orderBy() function to sort the DataFrame in PySpark. It returns a new DataFrame sorted by the specified columns.
Syntax: DataFrame.orderBy(*cols, ascending=True)
Parameters:
- cols: list of Column objects or column names to order by
- ascending: Boolean, or a list of Booleans (one per column), specifying the sorting order (ascending or descending) of the columns listed in cols
Return type: Returns a new DataFrame sorted by the specified columns.
Stepwise Implementation:
Step 1: First, import the required libraries, i.e., SparkSession, sum, desc, and col. SparkSession is used to create the session, sum aggregates the columns on which groupby is applied, desc sorts the result in descending order, and col returns a Column based on the given column name.
from pyspark.sql import SparkSession
from pyspark.sql.functions import sum, desc, col
Step 2: Now, create a spark session using the getOrCreate function.
spark_session = SparkSession.builder.getOrCreate()
Step 3: Then, read the CSV file which you want to sort within groups.
data_frame = spark_session.read.csv('#Path of CSV file', sep=',', inferSchema=True, header=True)
Step 4: Finally, group the DataFrame using the groupby function and arrange it in ascending or descending order using the orderBy function. We also use the agg function to aggregate the columns on which groupby is applied.
data_frame.groupby("#column-name").agg(sum("#column-name").alias("#column-name")).orderBy(col("#column-name").desc()).show()
Example 1:
In this example, we took the data frame (link), i.e., the dataset of 3×6, grouped it by the class and name columns using the groupby function, and sorted it in ascending order on the class column using orderBy with col("class").asc(). We also used the agg and sum functions to total the marks column for rows that fall into the same group.
Python3
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum

spark_session = SparkSession.builder.getOrCreate()

data_frame = spark_session.read.csv('/content/marks_data.csv',
                                    sep=',', inferSchema=True, header=True)

data_frame.groupby("class", "name").agg(
    sum("marks").alias("marks")).orderBy(col("class").asc()).show()
Output:
Example 2:
In this example, we took the data frame (link), i.e., the dataset of 3×6, grouped it by the class and name columns using the groupby function, sorted it in descending order on the class column using orderBy with col("class").desc(), and in ascending order on the marks column with col("marks").asc(). We also used the agg and sum functions to total the marks column for rows that fall into the same group.
Python3
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum

spark_session = SparkSession.builder.getOrCreate()

data_frame = spark_session.read.csv('/content/marks_data.csv',
                                    sep=',', inferSchema=True, header=True)

data_frame.groupby("class", "name").agg(sum("marks").alias("marks")).orderBy(
    col("class").desc(), col("marks").asc()).show()
Output: