In this article, we will see how to sort a PySpark DataFrame by multiple columns using Python.
Create the dataframe for demonstration:
```python
# importing module
import pyspark

# importing SparkSession from the pyspark.sql module
from pyspark.sql import SparkSession

# creating a SparkSession and giving it an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of employee data
data = [["1", "sravan", "company 1"],
        ["2", "ojaswi", "company 1"],
        ["3", "rohith", "company 2"],
        ["4", "sridevi", "company 1"],
        ["5", "bobby", "company 1"]]

# specify column names
columns = ['ID', 'NAME', 'Company']

# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)

dataframe.show()
```
Output:
Ordering means sorting the DataFrame by one or more columns, in either ascending or descending order. We can do this using the following methods.
Method 1: Using orderBy()
This function returns the DataFrame sorted by the given columns. Rows are ordered by the first column listed, with each subsequent column used to break ties.
Syntax:
- Ascending order: dataframe.orderBy(['column1', 'column2', …, 'column n'], ascending=True).show()
- Descending order: dataframe.orderBy(['column1', 'column2', …, 'column n'], ascending=False).show()
where:
- dataframe is the input PySpark DataFrame
- ascending=True sorts the DataFrame in ascending order
- ascending=False sorts the DataFrame in descending order
Example 1: Sort the PySpark dataframe in ascending order with orderBy().
```python
# importing module
import pyspark

# importing SparkSession from the pyspark.sql module
from pyspark.sql import SparkSession

# creating a SparkSession and giving it an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of employee data
data = [["1", "sravan", "company 1"],
        ["2", "ojaswi", "company 1"],
        ["3", "rohith", "company 2"],
        ["4", "sridevi", "company 1"],
        ["5", "bobby", "company 1"]]

# specify column names
columns = ['ID', 'NAME', 'Company']

# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)

# orderBy dataframe in ascending order
dataframe.orderBy(['NAME', 'ID', 'Company'],
                  ascending=True).show()
```
Output:
Example 2: Sort the PySpark dataframe in descending order with orderBy().
```python
# importing module
import pyspark

# importing SparkSession from the pyspark.sql module
from pyspark.sql import SparkSession

# creating a SparkSession and giving it an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of employee data
data = [["1", "sravan", "company 1"],
        ["2", "ojaswi", "company 1"],
        ["3", "rohith", "company 2"],
        ["4", "sridevi", "company 1"],
        ["5", "bobby", "company 1"]]

# specify column names
columns = ['ID', 'NAME', 'Company']

# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)

# orderBy dataframe in descending order
dataframe.orderBy(['NAME', 'ID', 'Company'],
                  ascending=False).show()
```
Output:
Method 2: Using sort()
This function also returns the DataFrame sorted by the given columns; in PySpark, sort() is an alias of orderBy(), so the two behave identically.
Syntax:
- Ascending order: dataframe.sort(['column1', 'column2', …, 'column n'], ascending=True).show()
- Descending order: dataframe.sort(['column1', 'column2', …, 'column n'], ascending=False).show()
where,
- dataframe is the input PySpark DataFrame
- ascending=True sorts the DataFrame in ascending order
- ascending=False sorts the DataFrame in descending order
Example 1: Sort the PySpark dataframe in ascending order with sort().
```python
# importing module
import pyspark

# importing SparkSession from the pyspark.sql module
from pyspark.sql import SparkSession

# creating a SparkSession and giving it an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of employee data
data = [["1", "sravan", "company 1"],
        ["2", "ojaswi", "company 1"],
        ["3", "rohith", "company 2"],
        ["4", "sridevi", "company 1"],
        ["5", "bobby", "company 1"]]

# specify column names
columns = ['ID', 'NAME', 'Company']

# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)

# sort dataframe in ascending order
dataframe.sort(['NAME', 'ID', 'Company'],
               ascending=True).show()
```
Output:
Example 2: Sort the PySpark dataframe in descending order with sort().
```python
# importing module
import pyspark

# importing SparkSession from the pyspark.sql module
from pyspark.sql import SparkSession

# creating a SparkSession and giving it an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of employee data
data = [["1", "sravan", "company 1"],
        ["2", "ojaswi", "company 1"],
        ["3", "rohith", "company 2"],
        ["4", "sridevi", "company 1"],
        ["5", "bobby", "company 1"]]

# specify column names
columns = ['ID', 'NAME', 'Company']

# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)

# sort dataframe in descending order
dataframe.sort(['NAME', 'ID', 'Company'],
               ascending=False).show()
```
Output: