
Find duplicate rows in a Dataframe based on all or selected columns

  • Last Updated : 02 Jul, 2020

In this article, we will discuss how to find duplicate rows in a DataFrame based on all columns or on a selected list of columns. For this, we will use the DataFrame.duplicated() method of Pandas.

Syntax : DataFrame.duplicated(subset=None, keep='first')

Parameters:
subset: Takes a column label or a list of column labels. Its default value is None, which means all columns are used. When columns are passed, only those columns are considered when identifying duplicates.

keep: Controls which occurrences are marked as duplicates. It accepts one of three values, and the default is 'first'.

  • If 'first', the first occurrence is treated as unique and all later occurrences are marked as duplicates.
  • If 'last', the last occurrence is treated as unique and all earlier occurrences are marked as duplicates.
  • If False (the boolean value, not the string 'False'), every occurrence is marked as a duplicate.

Returns: Boolean Series denoting duplicate rows.
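The effect of the three keep values can be seen directly on the Boolean Series that duplicated() returns. The following is a minimal sketch; the tiny DataFrame used here is hypothetical illustration data, not the employee data used in the rest of the article.

```python
import pandas as pd

# Hypothetical illustration data: rows 0, 2 and 3 are identical
demo = pd.DataFrame({"Name": ["A", "B", "A", "A"],
                     "Age": [1, 2, 1, 1]})

# keep='first' is the default: later occurrences are flagged
first_mask = demo.duplicated()

# keep='last': earlier occurrences are flagged
last_mask = demo.duplicated(keep="last")

# keep=False: every occurrence of a repeated row is flagged
all_mask = demo.duplicated(keep=False)

print(first_mask.tolist())  # [False, False, True, True]
print(last_mask.tolist())   # [True, False, True, False]
print(all_mask.tolist())    # [True, False, True, True]
```

Because the result is a Boolean Series aligned with the DataFrame's index, it can be used directly as a row filter, which is what the examples below do.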



Let's create a simple DataFrame from a list of tuples, with the column names 'Name', 'Age' and 'City'.




# Import pandas library
import pandas as pd
  
# List of Tuples
employees = [('Stuti', 28, 'Varanasi'),
            ('Saumya', 32, 'Delhi'),
            ('Aaditya', 25, 'Mumbai'),
            ('Saumya', 32, 'Delhi'),
            ('Saumya', 32, 'Delhi'),
            ('Saumya', 32, 'Mumbai'),
            ('Aaditya', 40, 'Dehradun'),
            ('Seema', 32, 'Delhi')
            ]
  
# Creating a DataFrame object
df = pd.DataFrame(employees, 
                  columns = ['Name', 'Age', 'City'])
  
# Print the DataFrame
print(df)

Output :
[Image: the resulting DataFrame]

Example 1 : Select duplicate rows based on all columns.
Here, we do not pass any arguments, so both parameters take their default values, i.e. subset = None and keep = 'first'.




# Import pandas library
import pandas as pd
  
# List of Tuples
employees = [('Stuti', 28, 'Varanasi'),
            ('Saumya', 32, 'Delhi'),
            ('Aaditya', 25, 'Mumbai'),
            ('Saumya', 32, 'Delhi'),
            ('Saumya', 32, 'Delhi'),
            ('Saumya', 32, 'Mumbai'),
            ('Aaditya', 40, 'Dehradun'),
            ('Seema', 32, 'Delhi')
            ]
  
# Creating a DataFrame object
df = pd.DataFrame(employees,
                  columns = ['Name', 'Age', 'City'])
  
# Selecting duplicate rows except first 
# occurrence based on all columns
duplicate = df[df.duplicated()]
  
print("Duplicate Rows :")
  
# Print the resultant DataFrame
print(duplicate)

Output :
[Image: duplicate rows]

Example 2 : Select duplicate rows based on all columns, keeping the last occurrence.
If you want to mark every duplicate except the last occurrence, pass keep = 'last' as an argument.




# Import pandas library
import pandas as pd
  
# List of Tuples
employees = [('Stuti', 28, 'Varanasi'),
            ('Saumya', 32, 'Delhi'),
            ('Aaditya', 25, 'Mumbai'),
            ('Saumya', 32, 'Delhi'),
            ('Saumya', 32, 'Delhi'),
            ('Saumya', 32, 'Mumbai'),
            ('Aaditya', 40, 'Dehradun'),
            ('Seema', 32, 'Delhi')
            ]
  
# Creating a DataFrame object
df = pd.DataFrame(employees, 
                  columns = ['Name', 'Age', 'City'])
  
# Selecting duplicate rows except last 
# occurrence based on all columns.
duplicate = df[df.duplicated(keep = 'last')]
  
print("Duplicate Rows :")
  
# Print the resultant DataFrame
print(duplicate)

Output :
[Image: duplicate rows, keep = 'last']
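The third option, keep = False, is described in the parameter list but is not covered by the examples. A minimal sketch using the same employee data, which selects every row that has at least one exact copy:

```python
import pandas as pd

# Same employee data as in the examples above
employees = [('Stuti', 28, 'Varanasi'),
             ('Saumya', 32, 'Delhi'),
             ('Aaditya', 25, 'Mumbai'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Mumbai'),
             ('Aaditya', 40, 'Dehradun'),
             ('Seema', 32, 'Delhi')]

df = pd.DataFrame(employees, columns=['Name', 'Age', 'City'])

# keep=False flags all occurrences, including the first and the last
all_duplicates = df[df.duplicated(keep=False)]

print("All occurrences of duplicated rows :")
print(all_duplicates)
```

With this data, the rows at index 1, 3 and 4 are selected, since all three are ('Saumya', 32, 'Delhi').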

Example 3 : Select duplicate rows based on selected columns only.
To consider only some columns, pass a column name or a list of column names as the subset argument.




# import pandas library
import pandas as pd
  
# List of Tuples
employees = [('Stuti', 28, 'Varanasi'),
            ('Saumya', 32, 'Delhi'),
            ('Aaditya', 25, 'Mumbai'),
            ('Saumya', 32, 'Delhi'),
            ('Saumya', 32, 'Delhi'),
            ('Saumya', 32, 'Mumbai'),
            ('Aaditya', 40, 'Dehradun'),
            ('Seema', 32, 'Delhi')
            ]
  
# Creating a DataFrame object
df = pd.DataFrame(employees, 
                  columns = ['Name', 'Age', 'City'])
  
# Selecting duplicate rows based
# on 'City' column
duplicate = df[df.duplicated('City')]
  
print("Duplicate Rows based on City :")
  
# Print the resultant DataFrame
print(duplicate)

Output :
[Image: duplicate rows based on City]
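Since subset accepts either a single label or a list of labels, the call above could also be written with the keyword and a one-element list. A short sketch that checks the two spellings give the same mask (the variable names are chosen here for illustration):

```python
import pandas as pd

# Same employee data as in the examples above
employees = [('Stuti', 28, 'Varanasi'),
             ('Saumya', 32, 'Delhi'),
             ('Aaditya', 25, 'Mumbai'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Mumbai'),
             ('Aaditya', 40, 'Dehradun'),
             ('Seema', 32, 'Delhi')]

df = pd.DataFrame(employees, columns=['Name', 'Age', 'City'])

# Positional single label, as used in Example 3
mask_label = df.duplicated('City')

# Explicit keyword with a one-element list
mask_list = df.duplicated(subset=['City'])

print(mask_label.equals(mask_list))     # True
print(df[mask_label].index.tolist())    # [3, 4, 5, 7]
```

Writing subset= explicitly tends to make the intent clearer when keep is also passed.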

Example 4 : Select duplicate rows based on more than one column name.




# import pandas library
import pandas as pd
  
# List of Tuples
employees = [('Stuti', 28, 'Varanasi'),
            ('Saumya', 32, 'Delhi'),
            ('Aaditya', 25, 'Mumbai'),
            ('Saumya', 32, 'Delhi'),
            ('Saumya', 32, 'Delhi'),
            ('Saumya', 32, 'Mumbai'),
            ('Aaditya', 40, 'Dehradun'),
            ('Seema', 32, 'Delhi')
            ]
  
# Creating a DataFrame object
df = pd.DataFrame(employees, 
                  columns = ['Name', 'Age', 'City'])
  
# Selecting duplicate rows based
# on list of column names
duplicate = df[df.duplicated(['Name', 'Age'])]
  
print("Duplicate Rows based on Name and Age :")
  
# Print the resultant DataFrame
print(duplicate)

Output :
[Image: duplicate rows based on Name and Age]
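The subset and keep parameters can also be combined. The sketch below, again using the employee data, flags every row that shares its 'Name' and 'Age' with another row and counts how many rows were flagged. Summing the mask works because True counts as 1.

```python
import pandas as pd

# Same employee data as in the examples above
employees = [('Stuti', 28, 'Varanasi'),
             ('Saumya', 32, 'Delhi'),
             ('Aaditya', 25, 'Mumbai'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Mumbai'),
             ('Aaditya', 40, 'Dehradun'),
             ('Seema', 32, 'Delhi')]

df = pd.DataFrame(employees, columns=['Name', 'Age', 'City'])

# Flag all rows whose (Name, Age) pair occurs more than once
mask = df.duplicated(subset=['Name', 'Age'], keep=False)

# Number of flagged rows
flagged_count = int(mask.sum())

print("Rows sharing Name and Age with another row :", flagged_count)
print(df[mask])
```

Here four rows are flagged (index 1, 3, 4 and 5), including the ('Saumya', 32, 'Mumbai') row, which differs from the others only in 'City'.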
