
Remove duplicates from a dataframe in PySpark


In this article, we are going to drop duplicate rows from a DataFrame using PySpark in Python.

Before starting, we are going to create a DataFrame for demonstration:

Python3




# importing module
import pyspark
 
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
 
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
 
# list of employee data
data = [["1","sravan","company 1"],
        ["2","ojaswi","company 1"],
        ["3","rohith","company 2"],
        ["4","sridevi","company 1"],
        ["1","sravan","company 1"],
        ["4","sridevi","company 1"]]
 
# specify column names
columns = ['Employee ID','Employee NAME','Company']
 
# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data,columns)
 
print('Actual data in dataframe')
dataframe.show()


Output: the DataFrame shows all six rows, including the repeated rows for employee IDs 1 and 4.

Method 1: Using distinct() method

It returns a new DataFrame that contains only the unique rows. The original DataFrame is not modified.

Syntax: dataframe.distinct()

where dataframe is the input DataFrame, here the one created from the nested lists using PySpark

Example 1: Python program to drop duplicate data using distinct() function

Python3




print('distinct data after dropping duplicate rows')
 
# display distinct data
dataframe.distinct().show()


Output: four rows, one for each of employee IDs 1 to 4. The order of the rows is not guaranteed.

Example 2: Python program to select distinct data in only two columns.

We can use the select() function together with distinct() to get the distinct combinations of values from particular columns.

Syntax: dataframe.select(['column 1', 'column n']).distinct().show()

Python3




# display distinct data in
# Employee ID and Employee NAME
dataframe.select(['Employee ID',
                  'Employee NAME']).distinct().show()


Output: four rows with only the Employee ID and Employee NAME columns.

Method 2: Using dropDuplicates() method

Syntax: dataframe.dropDuplicates()

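where dataframe is the input DataFrame, here the one created from the nested lists using PySpark. Called with no arguments, dropDuplicates() removes rows that are duplicated across all columns, which gives the same result as distinct().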

Example 1: Python program to remove duplicate data from the employee table.

Python3




# remove duplicate rows
# using the dropDuplicates() function
dataframe.dropDuplicates().show()


Output: the same four unique rows that distinct() returns.

Example 2: Python program to select specific columns and remove the duplicate rows from the result

Python3




# select two columns, then remove
# duplicate rows from the result
# using the dropDuplicates() function
dataframe.select(['Employee ID',
                  'Employee NAME']).dropDuplicates().show()


Output: four rows with only the Employee ID and Employee NAME columns.



Last Updated : 16 Dec, 2021