Open In App

Multiple criteria for aggregation on PySpark Dataframe

In this article, we will discuss how to do Multiple criteria aggregation on PySpark Dataframe.

Data frame in use:



In PySpark,  groupBy() is used to collect the identical data into groups on the PySpark DataFrame and perform aggregate functions on the grouped data. So by this we can do multiple aggregations at a time.

Syntax:



dataframe.groupBy(‘column_name_group’).agg(functions)

where, 

Lets understand what are the aggregations first. They are available in functions module in pyspark.sql, so we need to import it to start with. The aggregate functions are:

Syntax:

functions.count(‘column_name’)

Syntax:

functions.mean(‘column_name’)

Syntax:

functions.max(‘column_name’)

Syntax:

functions.min(‘column_name’)

Syntax:

functions.sum(‘column_name’)

Syntax:

functions.avg(‘column_name’)

We can aggregate multiple functions using the following syntax.

Syntax:

dataframe.groupBy(‘column_name_group’).agg(functions….)

Example: Multiple aggregations on DEPT column with FEE column




# importing module
import pyspark
  
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
  
#import functions
from pyspark.sql import functions
  
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
  
# list  of student  data
data = [["1", "sravan", "IT", 45000],
        ["2", "ojaswi", "CS", 85000],
        ["3", "rohith", "CS", 41000],
        ["4", "sridevi", "IT", 56000],
        ["5", "bobby", "ECE", 45000],
        ["6", "gayatri", "ECE", 49000],
        ["7", "gnanesh", "CS", 45000],
        ["8", "bhanu", "Mech", 21000]
        ]
  
# specify column names
columns = ['ID', 'NAME', 'DEPT', 'FEE']
  
# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)
  
  
# aggregating DEPT column with min.max,sum,mean,avg and count functions
dataframe.groupBy('DEPT').agg(functions.min('FEE'),
                              functions.max('FEE'),
                              functions.sum('FEE'), 
                              functions.mean('FEE'),
                              functions.count('FEE'),
                              functions.avg('FEE')).show()

Output:

Example 2: Multiple aggregation in grouping dept and name column




# importing module
import pyspark
  
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
  
#import functions
from pyspark.sql import functions
  
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
  
# list  of student  data
data = [["1", "sravan", "IT", 45000],
        ["2", "ojaswi", "CS", 85000],
        ["3", "rohith", "CS", 41000],
        ["4", "sridevi", "IT", 56000],
        ["5", "bobby", "ECE", 45000],
        ["6", "gayatri", "ECE", 49000],
        ["7", "gnanesh", "CS", 45000],
        ["8", "bhanu", "Mech", 21000]
        ]
  
# specify column names
columns = ['ID', 'NAME', 'DEPT', 'FEE']
  
# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)
  
  
# aggregating DEPT, NAME column with min.max,
# sum,mean,avg and count functions
dataframe.groupBy('DEPT', 'NAME').agg(functions.min('FEE'), 
                                      functions.max('FEE'), 
                                      functions.sum('FEE'),
                                      functions.mean('FEE'), 
                                      functions.count('FEE'), 
                                      functions.avg('FEE')).show()

Output:


Article Tags :