Remove duplicates from a dataframe in PySpark

In this article, we will remove duplicate rows from a DataFrame using PySpark in Python.

Before starting, we will create a DataFrame for demonstration. Note that the rows for Employee ID 1 and Employee ID 4 each appear twice:

Python3

# importing module
import pyspark

# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession

# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of employee data
data = [["1", "sravan", "company 1"],
        ["2", "ojaswi", "company 1"],
        ["3", "rohith", "company 2"],
        ["4", "sridevi", "company 1"],
        ["1", "sravan", "company 1"],
        ["4", "sridevi", "company 1"]]

# specify column names
columns = ['Employee ID', 'Employee NAME', 'Company']

# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)

print('Actual data in dataframe')
dataframe.show()

Output:

Method 1: Using distinct() method

distinct() returns a new DataFrame that contains only the unique rows. The original DataFrame is not modified.

Syntax: dataframe.distinct()

where dataframe is the input PySpark DataFrame (here, the one created from the nested lists above)

Example 1: Python program to drop duplicate data using distinct() function

Python3

print('distinct data after dropping duplicate rows')
 
# display distinct data
dataframe.distinct().show()

Output:

Example 2: Python program to select distinct data in only two columns.

We can use the select() function together with distinct() to get the distinct values of particular columns.

Syntax: dataframe.select(['column 1', 'column n']).distinct().show()

Python3

# display distinct data in
# Employee ID and Employee NAME
dataframe.select(['Employee ID',
                  'Employee NAME']).distinct().show()

Output:

Method 2: Using dropDuplicates() method

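It returns a new DataFrame with the duplicate rows removed. Called without arguments, it compares all columns and gives the same result as distinct(). It also accepts an optional list of column names to compare.

Syntax: dataframe.dropDuplicates()

where dataframe is the input PySpark DataFrame (here, the one created from the nested lists above)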

Example 1: Python program to remove duplicate data from the employee table.

Python3

# remove duplicate data
# using the dropDuplicates() function
dataframe.dropDuplicates().show()

Output:

Example 2: Python program to remove duplicate values in specific columns

Python3

# remove duplicate data
# using the dropDuplicates() function
# in two columns
dataframe.select(['Employee ID',
                  'Employee NAME']).dropDuplicates().show()

Output:
