Removing duplicate rows based on specific column in PySpark DataFrame
In this article, we drop duplicate rows from a PySpark DataFrame in Python based on one or more specific columns. Here, "duplicate" means rows that share the same values in the chosen column(s). For this we use the dropDuplicates() method:
Syntax: dataframe.dropDuplicates(['column 1', 'column 2', …, 'column n']).show()
- dataframe is the input DataFrame, and the list names the column(s) used to identify duplicates
- show() is used to display the resulting DataFrame
Let’s create the dataframe.
Dropping based on one column
Dropping based on multiple columns