How to count unique ID after groupBy in PySpark Dataframe ?

In this article, we will discuss how to count unique ID after group by in PySpark Dataframe.

For this, we will use two different methods:

Using distinct().count() method.
Using SQL Query.

But at first, let’s Create Dataframe for demonstration:

Python3

# importing module 

import pyspark 

# importing sparksession from pyspark.sql 
# module 

from pyspark.sql import SparkSession 

# creating sparksession and giving an app name 

spark = SparkSession.builder.appName('sparkdf').getOrCreate() 

# list  of students  data 

data = [["1", "sravan", "vignan", 95], 

        ["2", "ojaswi", "vvit", 78], 

        ["3", "rohith", "vvit", 89], 

        ["2", "ojaswi", "vvit", 100], 

        ["4", "sridevi", "vignan", 88], 

        ["1", "sravan", "vignan", 78], 

        ["4", "sridevi", "vignan", 90], 

        ["5", "gnanesh", "iit", 67]] 

# specify column names 

columns = ['student ID', 'student NAME', 

           'college', 'subject marks'] 

# creating a dataframe from the lists of data 

dataframe = spark.createDataFrame(data, columns) 

print("the data is ") 
dataframe.show()

Output:

Method 1 : Using groupBy() and distinct().count() method

groupBy(): Used to group the data based on column name

Syntax: dataframe=dataframe.groupBy(‘column_name1’).sum(‘column name 2’)

distinct().count(): Used to count and display the distinct rows form the dataframe

Syntax: dataframe.distinct().count()

Example 1:

Python3

# group by studentID by marks  

dataframe = dataframe.groupBy( 

  'student ID').sum('subject marks') 

# display count of unique ID 

print("Unique ID count after Group By : ", 

      dataframe.distinct().count()) 

print("the data is ") 

# display  values  of unique ID 
dataframe.distinct().show()

Output:

Unique ID count after Group By :  5
the data is 
+----------+------------------+
|student ID|sum(subject marks)|
+----------+------------------+
|         3|                89|
|         5|                67|
|         1|               173|
|         4|               178|
|         2|               178|
+----------+------------------+

Example 2: Count and display a unique ID of single columns:

Python3

# group by studentID by marks  

dataframe = dataframe.groupBy( 

  'student ID').sum('subject marks') 

# display count of unique ID 

print("Unique ID count after Group By : ", 

      dataframe.distinct().count()) 

print("the data is ") 

# display  values  of unique ID 

dataframe.select('student ID').distinct().show()

Output:

Unique ID count after Group By :  5
the data is 
+----------+
|student ID|
+----------+
|         3|
|         5|
|         1|
|         4|
|         2|
+----------+

Method 2: Using SQL query

We can get a unique ID count by using spark.sql

Syntax:

spark.sql(“sql query”).show()

Python3

# importing module 

import pyspark 

# importing sparksession from pyspark.sql  
# module 

from pyspark.sql import SparkSession 

# creating sparksession and giving an app name 

spark = SparkSession.builder.appName('sparkdf').getOrCreate() 

# list  of students  data 

data = [["1", "sravan", "vignan", 95], 

        ["2", "ojaswi", "vvit", 78], 

        ["3", "rohith", "vvit", 89], 

        ["2", "ojaswi", "vvit", 100], 

        ["4", "sridevi", "vignan", 88], 

        ["1", "sravan", "vignan", 78], 

        ["4", "sridevi", "vignan", 90], 

        ["5", "gnanesh", "iit", 67]] 

# specify column names 

columns = ['student ID', 'student NAME', 

           'college', 'subject marks'] 

# creating a dataframe from the lists of data 

dataframe = spark.createDataFrame(data, columns) 

# group by studentID by marks 

dataframe = dataframe.groupBy('student ID').sum('subject marks') 

# create view for the ablve dataframe and 
# view name is "DATA" 

dataframe.createOrReplaceTempView("DATA") 

# count unique data with sql query 

spark.sql("SELECT DISTINCT(COUNT('student ID'))  \ 

FROM DATA GROUP BY 'subject marks'").show()

Output:

+-----------------+
|count(student ID)|
+-----------------+
|                5|
+-----------------+

Article Tags :

Python

Python-Pyspark