Remove duplicates from a dataframe in PySpark

  • Last Updated : 17 Jun, 2021

In this article, we will remove duplicate rows from a DataFrame using PySpark in Python.

Before starting, we create a DataFrame for demonstration:


# importing module
import pyspark
# importing SparkSession from the pyspark.sql module
from pyspark.sql import SparkSession
# creating a SparkSession and giving it an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
# list of employee data
data = [["1", "sravan", "company 1"],
        ["2", "ojaswi", "company 1"],
        ["3", "rohith", "company 2"],
        ["4", "sridevi", "company 1"],
        ["1", "sravan", "company 1"],
        ["4", "sridevi", "company 1"]]
# specify column names
columns = ['Employee ID', 'Employee NAME', 'Company']
# creating a DataFrame from the list of data
dataframe = spark.createDataFrame(data, columns)
print('Actual data in dataframe')
# display the DataFrame, including the duplicate rows
dataframe.show()


Method 1: Using distinct() method

distinct() returns a new DataFrame with the duplicate rows removed. Two rows count as duplicates only if they match in every column.

Syntax: dataframe.distinct()

where dataframe is the DataFrame created from the nested list using PySpark

Example 1: Python program to drop duplicate data using distinct() function


print('distinct data after dropping duplicate rows')
# display distinct data
dataframe.distinct().show()


Example 2: Python program to select distinct data in only two columns.

We can use the select() function together with distinct() to get the distinct values of particular columns.

Syntax: dataframe.select(['column 1', 'column n']).distinct().show()


# display distinct data in
# Employee ID and Employee NAME
dataframe.select(['Employee ID',
                  'Employee NAME']).distinct().show()


Method 2: Using dropDuplicates() method

Syntax: dataframe.dropDuplicates()

where dataframe is the DataFrame created from the nested list using PySpark

Example 1: Python program to remove duplicate data from the employee table.


# remove duplicate data
# using the dropDuplicates() function
dataframe.dropDuplicates().show()


Example 2: Python program to remove duplicate values in specific columns


# remove duplicate data
# using the dropDuplicates() function
# in two columns
dataframe.select(['Employee ID',
                  'Employee NAME']).dropDuplicates().show()

