In this article, we will discuss how to find the distinct values of multiple columns in a PySpark DataFrame.
Let's create a sample DataFrame for demonstration:
# importing module
import pyspark

# importing SparkSession from the pyspark.sql module
from pyspark.sql import SparkSession

# creating a SparkSession and giving it an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of employee data
data = [["1", "Tezas", "Google"],
        ["2", "Mohit Rawat", "Rakuten"],
        ["3", "rohith", "Geeksforgeeks"],
        ["4", "Nancy", "IBM"],
        ["1", "Raghav", "Wipro"],
        ["4", "Komal", "Amazon"]]

# specify column names
columns = ['ID', 'NAME', 'Company']

# creating a DataFrame from the list of data
dataframe = spark.createDataFrame(data, columns)

dataframe.show()
Output:
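Method 1: Using the distinct() method
The distinct() method returns a new DataFrame with duplicate rows removed. It takes no arguments and always compares entire rows, so to get the distinct values of particular columns, select those columns first.
Syntax: df.distinct()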
Example 1: Get the distinct rows of the whole DataFrame.
dataframe.distinct().show()
Output:
Example 2: Get the distinct values of a single column.
Select the column by name, then call distinct() on the result.
dataframe.select('NAME').distinct().show()
Output:
Example 3: Get the distinct values of multiple columns.
Pass the column names to select() as separate arguments, then call distinct(). The result contains one row for each unique combination of values in those columns.
dataframe.select('ID', 'NAME').distinct().show()
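Method 2: Using the dropDuplicates() method
The dropDuplicates() method also removes duplicate rows. Called without arguments, it behaves like distinct(). Given a list of column names, it keeps one row for each unique combination of values in those columns.
Syntax: df.dropDuplicates(subset=None), where subset is an optional list of column names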
Example 1: Get the distinct rows of the whole DataFrame.
dataframe.dropDuplicates().show()
Output:
Example 2: Get the distinct values of a single column.
Select the column by name, then call dropDuplicates() on the result.
dataframe.select("NAME").dropDuplicates().show()
Output:
Example 3: Get the distinct values of multiple columns.
Pass the column names to dropDuplicates() as a list. Because dropDuplicates() keeps all columns of the DataFrame, follow it with select() to return only the columns of interest.
dataframe.dropDuplicates(["NAME", "ID"]).select(["ID", "NAME"]).show()
Output: