Removing duplicate rows based on specific column in PySpark DataFrame
In this article, we are going to drop duplicate rows based on a specific column from a DataFrame using PySpark in Python. Here, duplicate data means rows that share the same values in the chosen column(s). For this, we use the dropDuplicates() method:
Syntax: dataframe.dropDuplicates(['column 1', 'column 2', ..., 'column n']).show()
where,
- dataframe is the input DataFrame and the list contains the name(s) of the specific column(s) to check for duplicates
- the show() method is used to display the DataFrame
Let’s create the dataframe.
Python3
# import the PySpark module and create a SparkSession
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# sample student data; note that the first and fifth rows are identical
data = [["1", "sravan", "vignan"], ["2", "ojaswi", "vvit"],
        ["3", "rohith", "vvit"], ["4", "sridevi", "vignan"],
        ["1", "sravan", "vignan"], ["5", "gnanesh", "iit"]]

# column names for the DataFrame
columns = ['student ID', 'student NAME', 'college']

# create the DataFrame and display it
dataframe = spark.createDataFrame(data, columns)

print('Actual data in dataframe')
dataframe.show()
Output:
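With the six rows defined above, show() prints a table along these lines (the repeated "sravan" row is the duplicate we will drop):

+----------+------------+-------+
|student ID|student NAME|college|
+----------+------------+-------+
|         1|      sravan| vignan|
|         2|      ojaswi|   vvit|
|         3|      rohith|   vvit|
|         4|     sridevi| vignan|
|         1|      sravan| vignan|
|         5|     gnanesh|    iit|
+----------+------------+-------+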
Dropping based on one column
Python3
dataframe.dropDuplicates(['college']).show()
Output:
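Note that dropDuplicates() keeps one arbitrary row from each group of duplicates, so which student is shown for each college is not guaranteed across runs. If a deterministic result is needed, one common alternative (a minimal sketch, not part of the original example) is to rank the rows inside each group with a window function and keep only the first one:
Python3
from pyspark.sql import Window
from pyspark.sql.functions import col, row_number

# rank the rows within each college by student ID and keep only the first,
# so the surviving row per college is predictable
w = Window.partitionBy('college').orderBy(col('student ID'))

dataframe.withColumn('rn', row_number().over(w)) \
         .filter(col('rn') == 1) \
         .drop('rn') \
         .show()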
Dropping based on multiple columns
Python3
dataframe.dropDuplicates(['college', 'student ID']).show()
Output:
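Finally, if no column list is passed, dropDuplicates() removes only rows that are identical across every column; distinct() behaves the same way. A quick sketch using the same DataFrame:
Python3
# with no argument, only fully identical rows are removed,
# so just one of the two duplicate "sravan" rows is dropped
dataframe.dropDuplicates().show()

# distinct() gives the same result
dataframe.distinct().show()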