Find duplicate rows in a DataFrame based on all or selected columns
In this article, we will discuss how to find duplicate rows in a DataFrame based on all columns or on a selected list of columns. For this we will use the
DataFrame.duplicated() method of Pandas.
Syntax : DataFrame.duplicated(subset=None, keep=‘first’)
subset: A column label or a list of column labels; the default is None. If given, only these columns are considered when identifying duplicates; otherwise all columns are used.
keep: Controls which occurrence of a duplicated row is treated as unique. It accepts three distinct values, with ‘first’ as the default.
- If ‘first’, the first occurrence is treated as unique and the remaining matches are marked as duplicates.
- If ‘last’, the last occurrence is treated as unique and the remaining matches are marked as duplicates.
- If False (the boolean, not a string), all occurrences of the same row are marked as duplicates.
Returns: A Boolean Series denoting duplicate rows.
Let’s create a simple DataFrame from a dictionary of lists, with the column names ‘Name’, ‘Age’ and ‘City’.
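A sketch of building such a DataFrame (the specific names, ages, and cities below are illustrative assumptions, not taken from the original article):

```python
import pandas as pd

# Illustrative sample data containing repeated rows (values are assumed)
employees = {
    'Name': ['Riti', 'Riti', 'Aadi', 'Sonia', 'Riti', 'Sonia'],
    'Age':  [30, 30, 45, 32, 30, 32],
    'City': ['Delhi', 'Delhi', 'Mumbai', 'Delhi', 'Delhi', 'Mumbai'],
}
df = pd.DataFrame(employees)
print(df)
```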
Example 1 : Select duplicate rows based on all columns.
Here we do not pass any arguments, so both take their default values, i.e. subset=None and keep=‘first’.
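A minimal sketch of this default call, using the illustrative DataFrame from above (the data values are assumptions):

```python
import pandas as pd

# Illustrative data (assumed values)
df = pd.DataFrame({
    'Name': ['Riti', 'Riti', 'Aadi', 'Sonia', 'Riti', 'Sonia'],
    'Age':  [30, 30, 45, 32, 30, 32],
    'City': ['Delhi', 'Delhi', 'Mumbai', 'Delhi', 'Delhi', 'Mumbai'],
})

# Default call: subset=None (all columns), keep='first'.
# The first occurrence of each repeated row is kept as unique,
# so only the later repeats are flagged.
dup_mask = df.duplicated()
print(df[dup_mask])  # rows 1 and 4, which repeat row 0
```

Boolean indexing with the returned mask (`df[dup_mask]`) is the usual way to materialize the duplicate rows themselves.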
Example 2 : Select duplicate rows based on all columns, treating the last occurrence as unique.
If you want to consider all duplicates except the last one, pass keep=‘last’ as an argument.
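A sketch of the keep=‘last’ variant on the same illustrative data (values are assumptions):

```python
import pandas as pd

# Illustrative data (assumed values)
df = pd.DataFrame({
    'Name': ['Riti', 'Riti', 'Aadi', 'Sonia', 'Riti', 'Sonia'],
    'Age':  [30, 30, 45, 32, 30, 32],
    'City': ['Delhi', 'Delhi', 'Mumbai', 'Delhi', 'Delhi', 'Mumbai'],
})

# keep='last': the LAST occurrence of each repeated row is kept as unique,
# so the earlier repeats are the ones flagged instead.
dup_mask = df.duplicated(keep='last')
print(df[dup_mask])  # rows 0 and 1, whose last repeat is row 4
```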
Example 3 : If you want to select duplicate rows based only on some selected columns, pass the list of column names as the subset argument.
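A sketch restricting the duplicate check to a single column, ‘City’, on the same illustrative data (values are assumptions):

```python
import pandas as pd

# Illustrative data (assumed values)
df = pd.DataFrame({
    'Name': ['Riti', 'Riti', 'Aadi', 'Sonia', 'Riti', 'Sonia'],
    'Age':  [30, 30, 45, 32, 30, 32],
    'City': ['Delhi', 'Delhi', 'Mumbai', 'Delhi', 'Delhi', 'Mumbai'],
})

# Only the 'City' column is considered; the other columns are ignored,
# so any repeated city flags a row as a duplicate.
dup_mask = df.duplicated(subset=['City'])
print(df[dup_mask])  # every row after the first 'Delhi' and first 'Mumbai'
```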
Example 4 : Select duplicate rows based on more than one column.
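A sketch passing multiple column names in subset, again on the illustrative data (values are assumptions):

```python
import pandas as pd

# Illustrative data (assumed values)
df = pd.DataFrame({
    'Name': ['Riti', 'Riti', 'Aadi', 'Sonia', 'Riti', 'Sonia'],
    'Age':  [30, 30, 45, 32, 30, 32],
    'City': ['Delhi', 'Delhi', 'Mumbai', 'Delhi', 'Delhi', 'Mumbai'],
})

# Rows are duplicates when BOTH 'Name' and 'Age' repeat together;
# 'City' plays no part in the comparison here.
dup_mask = df.duplicated(subset=['Name', 'Age'])
print(df[dup_mask])  # repeats of ('Riti', 30) and ('Sonia', 32)
```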