Drop duplicate rows in PySpark DataFrame
Last Updated :
29 Aug, 2022
In this article, we drop duplicate rows from a PySpark DataFrame using the distinct() and dropDuplicates() functions.
Let's create a sample DataFrame:
Python3
import pyspark
from pyspark.sql import SparkSession

# create a SparkSession
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# sample data with two duplicate rows
data = [["1", "sravan", "company 1"],
        ["2", "ojaswi", "company 1"],
        ["3", "rohith", "company 2"],
        ["4", "sridevi", "company 1"],
        ["1", "sravan", "company 1"],
        ["4", "sridevi", "company 1"]]
columns = ['Employee ID', 'Employee NAME', 'Company']

dataframe = spark.createDataFrame(data, columns)

print('Actual data in dataframe')
dataframe.show()
Output:
Method 1: distinct()
distinct() returns a new DataFrame containing only the unique rows, removing every fully duplicated row.
Syntax: dataframe.distinct()
where dataframe is the DataFrame created from the nested lists using PySpark.
Python3
print('Distinct data after dropping duplicate rows')
dataframe.distinct().show()
Output:
We can use the select() function together with distinct() to get the distinct values of particular columns.
Syntax: dataframe.select(['column 1', 'column n']).distinct().show()
Python3
dataframe.select(['Employee ID', 'Employee NAME']).distinct().show()
Output:
Method 2: dropDuplicates()
dropDuplicates() also removes duplicate rows from the DataFrame.
Syntax: dataframe.dropDuplicates()
where dataframe is the DataFrame created from the nested lists using PySpark.
Python3
dataframe.dropDuplicates().show()
Output:
To remove duplicate values in specific columns, select those columns first and then drop the duplicates:
Python3
dataframe.select(['Employee ID', 'Employee NAME']).dropDuplicates().show()
Output: