Show distinct column values in PySpark dataframe
Last Updated: 06 Jun, 2021
In this article, we display the distinct values of one or more columns of a PySpark dataframe in Python. To do this, we use the distinct() and dropDuplicates() functions together with select().
Let’s create a sample dataframe.
Python3
import pyspark
from pyspark.sql import SparkSession

# Create (or reuse) a Spark session named 'sparkdf'
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# Sample employee rows; the first two rows appear twice
data = [["1", "sravan", "company 1"],
        ["3", "bobby", "company 3"],
        ["2", "ojaswi", "company 2"],
        ["1", "sravan", "company 1"],
        ["3", "bobby", "company 3"],
        ["4", "rohith", "company 2"],
        ["5", "gnanesh", "company 1"]]

# Column names for the dataframe
columns = ['Employee ID', 'Employee NAME', 'Company Name']

# Build the dataframe and display it
dataframe = spark.createDataFrame(data, columns)
dataframe.show()
Output:
Method 1: Using distinct()
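distinct() returns a new dataframe that contains only the distinct rows. Applied after select(), it gives the distinct values of the selected column. The order of the returned rows is not guaranteed.
Syntax: dataframe.select("column_name").distinct().show()
Example 1: For a single column.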
Python3
dataframe.select("Employee ID").distinct().show()
Output:
+-----------+
|Employee ID|
+-----------+
|          3|
|          5|
|          1|
|          4|
|          2|
+-----------+
Example 2: For multiple columns.
Python code to display the unique combinations of values from two columns using the distinct() function.
Syntax: dataframe.select(["column_name 1", "column_name 2"]).distinct().show()
Code:
Python3
dataframe.select(["Employee ID",
                  "Employee NAME"]).distinct().show()
Output:
+-----------+-------------+
|Employee ID|Employee NAME|
+-----------+-------------+
|          5|      gnanesh|
|          4|       rohith|
|          1|       sravan|
|          2|       ojaswi|
|          3|        bobby|
+-----------+-------------+
Method 2: Using dropDuplicates()
dropDuplicates() returns a new dataframe with duplicate rows removed. Called without arguments after select(), it behaves like distinct() and gives the unique values of the selected column.
Syntax: dataframe.select("column_name").dropDuplicates().show()
Example 1: For a single column.
Python3
dataframe.select("Employee ID").dropDuplicates().show()
Output:
+-----------+
|Employee ID|
+-----------+
|          3|
|          5|
|          1|
|          4|
|          2|
+-----------+
Example 2: For multiple columns.
Python code to display the unique combinations of values from two columns using the dropDuplicates() function.
Python3
dataframe.select(["Employee ID",
                  "Employee NAME"]).dropDuplicates().show()
Output:
+-----------+-------------+
|Employee ID|Employee NAME|
+-----------+-------------+
|          5|      gnanesh|
|          4|       rohith|
|          1|       sravan|
|          2|       ojaswi|
|          3|        bobby|
+-----------+-------------+