
How to drop duplicates and keep one in PySpark dataframe

Last Updated : 17 Jun, 2021

In this article, we will discuss how to handle duplicate values in a PySpark dataframe. A dataset may contain repeated rows or repeated data points that are not useful for our task. Such repeated values in our dataframe are called duplicate values.

To handle duplicate values, we may use a strategy in which we keep one occurrence of each value and drop the rest.

dropDuplicates(): A PySpark dataframe provides the dropDuplicates() function, which is used to drop duplicate occurrences of data inside a dataframe.

Syntax: dataframe_name.dropDuplicates([column_names])

The function takes an optional list of column names as a parameter; duplicates are detected with respect to those columns (or all columns, if none are given).
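The keep-one behavior can be illustrated with a rough plain-Python sketch of the semantics (this is only an illustration, not Spark's actual distributed implementation, and Spark does not guarantee which matching row is kept):

```python
# Plain-Python sketch: keep the first row seen for each distinct
# value in the key column, and drop later repeats.

def drop_duplicates_keep_first(rows, key_index):
    """Keep the first row seen for each distinct key value."""
    seen = set()
    kept = []
    for row in rows:
        key = row[key_index]
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept

rows = [("Ritika", 20, 94), ("Ritika", 20, 84), ("Reshav", 18, 56)]
print(drop_duplicates_keep_first(rows, 0))
# [('Ritika', 20, 94), ('Reshav', 18, 56)]
```

The second "Ritika" row is dropped because the key ("Ritika") was already seen; only one row per key survives.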

Creating Dataframe for demonstration:

Python3




# importing module
import pyspark
  
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
from pyspark.sql.types import (StructType, StructField,
                               StringType, IntegerType)
  
# Start spark session
spark = SparkSession.builder.appName("Student_Info").getOrCreate()
  
# Initialize our data
data2 = [("Pulkit", 12, "CS32", 82, "Programming"),
         ("Ritika", 20, "CS32", 94, "Writing"),
         ("Ritika", 20, "CS32", 84, "Writing"),
         ("Atirikt", 4, "BB21", 58, "Doctor"),
         ("Atirikt", 4, "BB21", 78, "Doctor"),
         ("Ghanshyam", 4, "DD11", 38, "Lawyer"),
         ("Reshav", 18, "EE43", 56, "Timepass")
         ]
  
# Define schema
schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Roll Number", IntegerType(), True),
    StructField("Class ID", StringType(), True),
    StructField("Marks", IntegerType(), True),
    StructField("Extracurricular", StringType(), True)
])
  
# create the dataframe
df = spark.createDataFrame(data=data2, schema=schema)
df.show()


Output:

Example 1: This example illustrates the working of the dropDuplicates() function with a single column parameter. The dataset is custom-built, so we defined the schema and used the spark.createDataFrame() function to create the dataframe.

Python3




# drop duplicates, keeping one row per Roll Number
df.dropDuplicates(['Roll Number']).show()


Output:

From the above output, it is clear that rows with a duplicate Roll Number were removed and only one occurrence of each was kept in the dataframe. (Note that dropDuplicates() keeps one arbitrary matching row; for a small, single-partition dataframe this is typically the first occurrence.)

Example 2: This example illustrates the working of the dropDuplicates() function with multiple column parameters. The dataset is custom-built, so we defined the schema and used the spark.createDataFrame() function to create the dataframe.

Python3




# drop duplicates on the (Roll Number, Name) pair
df.dropDuplicates(['Roll Number', 'Name']).show()
  
# stop the session
spark.stop()


Output:

From the above output, it is clear that rows with a duplicate (Roll Number, Name) pair were removed and only one occurrence of each was kept in the dataframe.

Note: Only rows where all the given columns are duplicated together are removed. In the above example, "Ghanshyam" shares a Roll Number with other rows, but the Name is unique, so that row was not removed from the dataframe. Thus, the function considers all the given columns together, not just one of them.
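The multi-column check can be sketched the same way in plain Python: a row is dropped only when the full tuple of selected column values has been seen before. Using the (Name, Roll Number) pairs from the article's data (an illustration of the semantics, not Spark's implementation):

```python
# Deduplicate on the (Name, Roll Number) pair: a row is dropped only
# when BOTH values have already appeared together.
data = [("Pulkit", 12), ("Ritika", 20), ("Ritika", 20),
        ("Atirikt", 4), ("Atirikt", 4), ("Ghanshyam", 4), ("Reshav", 18)]

seen = set()
kept = []
for row in data:
    if row not in seen:   # the whole tuple is the key
        seen.add(row)
        kept.append(row)

# ("Ghanshyam", 4) survives: Roll Number 4 repeats elsewhere, but the
# (Name, Roll Number) pair is unique.
print(kept)
# [('Pulkit', 12), ('Ritika', 20), ('Atirikt', 4), ('Ghanshyam', 4), ('Reshav', 18)]
```

This mirrors why "Ghanshyam" remains in Example 2 even though his Roll Number is duplicated.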


