Remove duplicates from a dataframe in PySpark

Last Updated: 16 Dec, 2021

In this article, we are going to drop duplicate rows from a DataFrame using PySpark in Python.

Before starting, we create a DataFrame for demonstration. Note that the rows for employees 1 and 4 each appear twice:

Python3




# importing module
import pyspark
 
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
 
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
 
# list of employee data
data = [["1", "sravan", "company 1"],
        ["2", "ojaswi", "company 1"],
        ["3", "rohith", "company 2"],
        ["4", "sridevi", "company 1"],
        ["1", "sravan", "company 1"],
        ["4", "sridevi", "company 1"]]
 
# specify column names
columns = ['Employee ID', 'Employee NAME', 'Company']
 
# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)
 
print('Actual data in dataframe')
dataframe.show()

Output: all six rows are displayed, including the repeated rows for employees 1 and 4.

Method 1: Using distinct() method

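distinct() returns a new DataFrame with the duplicate rows removed. Rows are compared across all columns, and the original DataFrame is left unchanged.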

Syntax: dataframe.distinct()

where dataframe is the DataFrame created above from the nested lists.

Example 1: Python program to drop duplicate data using distinct() function

Python3




print('distinct data after dropping duplicate rows')
 
# display distinct data
dataframe.distinct().show()

Output: the four unique rows (employees 1 to 4) are displayed. The row order is not guaranteed.

Example 2: Python program to select distinct data in only two columns.

We can use the select() function together with distinct() to get the distinct values from particular columns.

Syntax: dataframe.select(['column 1', 'column n']).distinct().show()

Python3




# display distinct data in
# Employee ID and Employee NAME
dataframe.select(['Employee ID',
                  'Employee NAME']).distinct().show()

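Output: the four unique (Employee ID, Employee NAME) pairs are displayed. The Company column is not part of the result.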

Method 2: Using dropDuplicates() method

Syntax: dataframe.dropDuplicates()

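where dataframe is the DataFrame created above from the nested lists. Called without arguments, dropDuplicates() compares rows across all columns, so it gives the same result as distinct().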

Example 1: Python program to remove duplicate data from the employee table.

Python3




# remove duplicate data
# using the dropDuplicates() function
dataframe.dropDuplicates().show()

Output: the four unique rows (employees 1 to 4) are displayed. The row order is not guaranteed.

Example 2: Python program to remove duplicate rows after selecting specific columns

Python3




# remove duplicate data
# using the dropDuplicates() function
# on two selected columns
dataframe.select(['Employee ID',
                  'Employee NAME']).dropDuplicates().show()

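Output: the four unique (Employee ID, Employee NAME) pairs are displayed. The Company column is not part of the result.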

