
Pyspark GroupBy DataFrame with Aggregation or Count

Last Updated : 21 Mar, 2023

Pyspark is a powerful tool for working with large datasets in a distributed environment using Python. One of the most common tasks in data manipulation is grouping data by one or more columns. This can be accomplished using the groupBy() function in Pyspark, which allows you to group a DataFrame based on the values in one or more columns. In this article, we will explore how to use the groupBy() function in Pyspark with aggregation or count.

Syntax of groupBy() Function

The groupBy() function in Pyspark is a powerful tool for working with large datasets. It allows you to group a DataFrame based on the values in one or more columns. The syntax of the groupBy() function and its parameters is given below:

Syntax: DataFrame.groupBy(*cols)

Parameters: cols – one or more column names (as strings) or Column expressions to group the DataFrame by.

Returns: a GroupedData object, which exposes aggregation methods such as count(), sum(), avg(), min(), max() and agg(). DataFrame.groupby() is an alias for groupBy().
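As a quick, minimal sketch (the tiny DataFrame and app name below are only for illustration and are not part of the main example), note that groupBy() by itself returns a GroupedData object; it only becomes a DataFrame again once you apply an aggregation:

Python3

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('groupby_sketch').getOrCreate()

# a tiny throwaway DataFrame, used only to show what groupBy() returns
df = spark.createDataFrame([("IT", 1), ("CS", 2), ("CS", 3)], ["dept", "id"])

# groupBy() alone returns a GroupedData object, not a DataFrame
grouped = df.groupBy("dept")
print(type(grouped))  # <class 'pyspark.sql.group.GroupedData'>

# applying an aggregation such as count() turns it back into a DataFrame
grouped.count().show()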

Creating a Pyspark DataFrame 

Here, we will create a simple DataFrame containing student data, with columns for the student’s ID, NAME, DEPT, and FEE. We first import the necessary packages, create a SparkSession with the app name “sparkdf”, and then build the data frame using the spark.createDataFrame() function.

Python3




# importing module
import pyspark
  
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
  
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
  
# list of student data
data = [["1", "sravan", "IT", 45000],
        ["2", "ojaswi", "CS", 85000],
        ["3", "rohith", "CS", 41000],
        ["4", "sridevi", "IT", 56000],
        ["5", "bobby", "ECE", 45000],
        ["6", "gayatri", "ECE", 49000],
        ["7", "gnanesh", "CS", 45000],
        ["8", "bhanu", "Mech", 21000]
        ]
  
# specify column names
columns = ['ID', 'NAME', 'DEPT', 'FEE']
  
# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)
  
# display
dataframe.show()


Output:

[Output: the DataFrame of 8 student rows with columns ID, NAME, DEPT and FEE]
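As an optional check that is not part of the original snippet, you can inspect the schema Spark inferred for this DataFrame before grouping it (FEE is typically inferred as a long, the other columns as strings):

Python3

# continuing from the snippet above: print the inferred schema
dataframe.printSchema()

# expected shape of the schema:
# root
#  |-- ID: string (nullable = true)
#  |-- NAME: string (nullable = true)
#  |-- DEPT: string (nullable = true)
#  |-- FEE: long (nullable = true)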

 

Pyspark groupBy DataFrame with count

Here, we use count(), which returns the number of rows in each group.

Python3




# importing module
import pyspark
  
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
  
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
  
# list of student data
data = [["1", "sravan", "IT", 45000],
        ["2", "ojaswi", "CS", 85000],
        ["3", "rohith", "CS", 41000],
        ["4", "sridevi", "IT", 56000],
        ["5", "bobby", "ECE", 45000],
        ["6", "gayatri", "ECE", 49000],
        ["7", "gnanesh", "CS", 45000],
        ["8", "bhanu", "Mech", 21000]
        ]
  
# specify column names
columns = ['ID', 'NAME', 'DEPT', 'FEE']
  
# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)
  
# Groupby with DEPT with count()
dataframe.groupBy('DEPT').count().show()


Output:

[Output: one row per DEPT with its row count – CS: 3, IT: 2, ECE: 2, Mech: 1]
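If you want the result in a predictable order, or a more descriptive column name than the default count, you can extend the same example with the standard withColumnRenamed() and orderBy() DataFrame methods (a small variant, not part of the original snippet):

Python3

# continuing from the snippet above: rename the count column and
# sort the departments by size in descending order
(dataframe.groupBy('DEPT')
          .count()
          .withColumnRenamed('count', 'num_students')
          .orderBy('num_students', ascending=False)
          .show())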

 

Pyspark GroupBy DataFrame with Aggregation

Here, we import the aggregate functions from the pyspark.sql.functions module. By grouping on DEPT and calling agg() with sum(), min(), max(), mean() and count(), we collect the rows with identical DEPT values into groups and compute these aggregates over each group’s FEE column.

Python3




# importing module
import pyspark
  
# import the sum, max, min, avg, count and mean aggregate functions
# (note: these imports shadow Python's built-in sum, max and min in this script)
from pyspark.sql.functions import sum, max, min, avg, count, mean
  
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
  
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
  
# list of student data
data = [["1", "sravan", "IT", 45000],
        ["2", "ojaswi", "CS", 85000],
        ["3", "rohith", "CS", 41000],
        ["4", "sridevi", "IT", 56000],
        ["5", "bobby", "ECE", 45000],
        ["6", "gayatri", "ECE", 49000],
        ["7", "gnanesh", "CS", 45000],
        ["8", "bhanu", "Mech", 21000]
        ]
  
# specify column names
columns = ['ID', 'NAME', 'DEPT', 'FEE']
  
# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)
  
# Group by DEPT and compute the max, sum, min, mean and count of FEE
dataframe.groupBy("DEPT").agg(max("FEE"), sum("FEE"),
                              min("FEE"), mean("FEE"),
                              count("FEE")).show()


Output:

[Output: one row per DEPT with columns max(FEE), sum(FEE), min(FEE), avg(FEE) and count(FEE)]
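The generated column names such as max(FEE) and sum(FEE) can be awkward to reference in later transformations. A common variant, sketched below using the same imports as the snippet above, gives each aggregate an explicit name with alias():

Python3

# continuing from the snippet above: the same aggregation, but with
# readable column names assigned via alias()
dataframe.groupBy("DEPT").agg(max("FEE").alias("max_fee"),
                              sum("FEE").alias("total_fee"),
                              min("FEE").alias("min_fee"),
                              mean("FEE").alias("avg_fee"),
                              count("FEE").alias("num_students")).show()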

 

In conclusion, the groupBy() function in Pyspark lets you collect identical values of one or more columns into groups and then apply aggregate functions such as count(), sum(), min(), max() and mean() to each group, which makes it a key building block for summarising large DataFrames.


