
Remove duplicates from a dataframe in PySpark

In this article, we will drop duplicate rows from a dataframe using PySpark in Python.

Before starting, we will create a dataframe for demonstration:

# importing module
import pyspark
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
# list of employee data
data = [["1", "sravan", "company 1"],
        ["2", "ojaswi", "company 1"],
        ["3", "rohith", "company 2"],
        ["4", "sridevi", "company 1"],
        ["1", "sravan", "company 1"],
        ["4", "sridevi", "company 1"]]
# specify column names
columns = ['Employee ID','Employee NAME','Company']
# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data,columns)
print('Actual data in dataframe')
# display the dataframe, including the duplicate rows
dataframe.show()


Method 1: Using distinct() method

It returns a new dataframe with the duplicate rows removed; the original dataframe is left unchanged.

Syntax: dataframe.distinct()

Where, dataframe is the dataframe name created from the nested lists using pyspark

Example 1: Python program to drop duplicate data using distinct() function

print('distinct data after dropping duplicate rows')
# display distinct data
dataframe.distinct().show()


Example 2: Python program to select distinct data in only two columns.

We can use the select() function along with the distinct() function to get the distinct values from particular columns.

Syntax: dataframe.select(['column 1', 'column n']).distinct().show()

# display distinct data in
# Employee ID and Employee NAME
dataframe.select(['Employee ID',
                  'Employee NAME']).distinct().show()


Method 2: Using dropDuplicates() method

Syntax: dataframe.dropDuplicates()

where, dataframe is the dataframe name created from the nested lists using pyspark

Example 1: Python program to remove duplicate data from the employee table.

# remove duplicate data
# using the dropDuplicates() function
dataframe.dropDuplicates().show()


Example 2: Python program to remove duplicate values in specific columns

# remove duplicate data
# using the dropDuplicates() function
# in two columns
dataframe.select(['Employee ID',
                  'Employee NAME']).dropDuplicates().show()

