PySpark – Order by multiple columns
Last Updated :
19 Dec, 2021
In this article, we will see how to order a PySpark DataFrame by multiple columns in Python.
First, create a DataFrame for demonstration:
Python3
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('sparkdf').getOrCreate()

data = [["1", "sravan", "company 1"],
        ["2", "ojaswi", "company 1"],
        ["3", "rohith", "company 2"],
        ["4", "sridevi", "company 1"],
        ["5", "bobby", "company 1"]]

columns = ['ID', 'NAME', 'Company']

dataframe = spark.createDataFrame(data, columns)
dataframe.show()
Output:
Ordering means we sort the DataFrame by one or more columns, in ascending or descending order. We can do this using the following methods.
Method 1: Using orderBy()
This function returns the DataFrame sorted on the given columns: it sorts by the first column listed, then breaks ties using each subsequent column.
Syntax:
- Ascending order: dataframe.orderBy(['column1', 'column2', …, 'column n'], ascending=True).show()
- Descending order: dataframe.orderBy(['column1', 'column2', …, 'column n'], ascending=False).show()
where:
- dataframe is the input PySpark DataFrame
- ascending=True sorts the DataFrame in ascending order
- ascending=False sorts the DataFrame in descending order
Example 1: Sort the PySpark DataFrame in ascending order with orderBy().
Python3
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('sparkdf').getOrCreate()

data = [["1", "sravan", "company 1"],
        ["2", "ojaswi", "company 1"],
        ["3", "rohith", "company 2"],
        ["4", "sridevi", "company 1"],
        ["5", "bobby", "company 1"]]

columns = ['ID', 'NAME', 'Company']

dataframe = spark.createDataFrame(data, columns)

# sort by NAME, then ID, then Company, all ascending
dataframe.orderBy(['NAME', 'ID', 'Company'],
                  ascending=True).show()
Output:
Example 2: Sort the PySpark DataFrame in descending order with orderBy().
Python3
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('sparkdf').getOrCreate()

data = [["1", "sravan", "company 1"],
        ["2", "ojaswi", "company 1"],
        ["3", "rohith", "company 2"],
        ["4", "sridevi", "company 1"],
        ["5", "bobby", "company 1"]]

columns = ['ID', 'NAME', 'Company']

dataframe = spark.createDataFrame(data, columns)

# sort by NAME, then ID, then Company, all descending
dataframe.orderBy(['NAME', 'ID', 'Company'],
                  ascending=False).show()
Output:
Method 2: Using sort()
Like orderBy(), this function returns the DataFrame sorted on the given columns: it sorts by the first column listed, then breaks ties using each subsequent column.
Syntax:
- Ascending order: dataframe.sort(['column1', 'column2', …, 'column n'], ascending=True).show()
- Descending order: dataframe.sort(['column1', 'column2', …, 'column n'], ascending=False).show()
where:
- dataframe is the input PySpark DataFrame
- ascending=True sorts the DataFrame in ascending order
- ascending=False sorts the DataFrame in descending order
Example 1: Sort the PySpark DataFrame in ascending order with sort().
Python3
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('sparkdf').getOrCreate()

data = [["1", "sravan", "company 1"],
        ["2", "ojaswi", "company 1"],
        ["3", "rohith", "company 2"],
        ["4", "sridevi", "company 1"],
        ["5", "bobby", "company 1"]]

columns = ['ID', 'NAME', 'Company']

dataframe = spark.createDataFrame(data, columns)

# sort by NAME, then ID, then Company, all ascending
dataframe.sort(['NAME', 'ID', 'Company'],
               ascending=True).show()
Output:
Example 2: Sort the PySpark DataFrame in descending order with sort().
Python3
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('sparkdf').getOrCreate()

data = [["1", "sravan", "company 1"],
        ["2", "ojaswi", "company 1"],
        ["3", "rohith", "company 2"],
        ["4", "sridevi", "company 1"],
        ["5", "bobby", "company 1"]]

columns = ['ID', 'NAME', 'Company']

dataframe = spark.createDataFrame(data, columns)

# sort by NAME, then ID, then Company, all descending
dataframe.sort(['NAME', 'ID', 'Company'],
               ascending=False).show()
Output: