Removing duplicate rows based on specific column in PySpark DataFrame
In this article, we drop duplicate rows from a PySpark DataFrame based on one or more specific columns. Here, "duplicate" means rows that share the same values in the chosen columns. For this, we use the dropDuplicates() method:
Syntax: dataframe.dropDuplicates(['column 1', 'column 2', 'column n']).show()
- dataframe is the input dataframe, and the column names passed in the list are the columns to check for duplicates
- show() is used to display the dataframe
Let’s create the dataframe.
Dropping based on one column
Dropping based on multiple columns