Find duplicate rows in a Dataframe based on all or selected columns

In this article, we will discuss how to find duplicate rows in a DataFrame based on all columns or on a selected list of columns. For this, we will use the DataFrame.duplicated() method of Pandas.

Syntax : DataFrame.duplicated(subset = None, keep = 'first')

Parameters:
subset: Takes a column label or a list of column labels. Its default value is None. If columns are passed, only those columns are considered when looking for duplicates.

keep: Controls which occurrences are marked as duplicates. It accepts three distinct values, and the default is 'first'.

  • If 'first', the first occurrence is treated as unique and the remaining identical rows are marked as duplicates.
  • If 'last', the last occurrence is treated as unique and the remaining identical rows are marked as duplicates.
  • If False (the boolean, not the string 'False'), all identical rows are marked as duplicates.

Returns: Boolean Series denoting duplicate rows.
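
To see what the method itself returns, here is a minimal sketch (using a small made-up DataFrame, not the employee data from the examples below):

# Import pandas library
import pandas as pd

# A tiny DataFrame in which the value 2 repeats
small = pd.DataFrame({'val': [1, 2, 2, 3, 2]})

# Default (keep = 'first'): first occurrence is kept as unique
print(small.duplicated())              # False, False, True, False, True

# keep = 'last': last occurrence is kept as unique
print(small.duplicated(keep = 'last')) # False, True, True, False, False

# keep = False: every repeated row is marked as a duplicate
print(small.duplicated(keep = False))  # False, True, True, False, True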



Let's create a simple DataFrame from a list of tuples, with the column names 'Name', 'Age' and 'City'.


# Import pandas library
import pandas as pd
  
# List of Tuples
employees = [('Stuti', 28, 'Varanasi'),
            ('Saumya', 32, 'Delhi'),
            ('Aaditya', 25, 'Mumbai'),
            ('Saumya', 32, 'Delhi'),
            ('Saumya', 32, 'Delhi'),
            ('Saumya', 32, 'Mumbai'),
            ('Aaditya', 40, 'Dehradun'),
            ('Seema', 32, 'Delhi')
            ]
  
# Creating a DataFrame object
df = pd.DataFrame(employees, 
                  columns = ['Name', 'Age', 'City'])
  
# Print the Dataframe
df

Output :
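      Name  Age      City
0    Stuti   28  Varanasi
1   Saumya   32     Delhi
2  Aaditya   25    Mumbai
3   Saumya   32     Delhi
4   Saumya   32     Delhi
5   Saumya   32    Mumbai
6  Aaditya   40  Dehradun
7    Seema   32     Delhi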

Example 1 : Select duplicate rows based on all columns.
Here we do not pass any arguments, so the method uses the default values for both parameters, i.e. subset = None and keep = 'first'.


# Import pandas library
import pandas as pd
  
# List of Tuples
employees = [('Stuti', 28, 'Varanasi'),
            ('Saumya', 32, 'Delhi'),
            ('Aaditya', 25, 'Mumbai'),
            ('Saumya', 32, 'Delhi'),
            ('Saumya', 32, 'Delhi'),
            ('Saumya', 32, 'Mumbai'),
            ('Aaditya', 40, 'Dehradun'),
            ('Seema', 32, 'Delhi')
            ]
  
# Creating a DataFrame object
df = pd.DataFrame(employees,
                  columns = ['Name', 'Age', 'City'])
  
# Selecting duplicate rows except first 
# occurrence based on all columns
duplicate = df[df.duplicated()]
  
print("Duplicate Rows :")
  
# Print the resultant Dataframe
duplicate

Output :
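Duplicate Rows :
     Name  Age   City
3  Saumya   32  Delhi
4  Saumya   32  Delhi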

Example 2 : Select duplicate rows based on all columns.
If you want to mark all duplicates except the last occurrence, pass keep = 'last' as an argument.


# Import pandas library
import pandas as pd
  
# List of Tuples
employees = [('Stuti', 28, 'Varanasi'),
            ('Saumya', 32, 'Delhi'),
            ('Aaditya', 25, 'Mumbai'),
            ('Saumya', 32, 'Delhi'),
            ('Saumya', 32, 'Delhi'),
            ('Saumya', 32, 'Mumbai'),
            ('Aaditya', 40, 'Dehradun'),
            ('Seema', 32, 'Delhi')
            ]
  
# Creating a DataFrame object
df = pd.DataFrame(employees, 
                  columns = ['Name', 'Age', 'City'])
  
# Selecting duplicate rows except last 
# occurrence based on all columns.
duplicate = df[df.duplicated(keep = 'last')]
  
print("Duplicate Rows :")
  
# Print the resultant Dataframe
duplicate

Output :
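Duplicate Rows :
     Name  Age   City
1  Saumya   32  Delhi
3  Saumya   32  Delhi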

Example 3 : If you want to select duplicate rows based only on certain columns, pass the column name or list of column names as the subset argument. Here duplicates are selected based on the 'City' column.


# import pandas library
import pandas as pd
  
# List of Tuples
employees = [('Stuti', 28, 'Varanasi'),
            ('Saumya', 32, 'Delhi'),
            ('Aaditya', 25, 'Mumbai'),
            ('Saumya', 32, 'Delhi'),
            ('Saumya', 32, 'Delhi'),
            ('Saumya', 32, 'Mumbai'),
            ('Aaditya', 40, 'Dehradun'),
            ('Seema', 32, 'Delhi')
            ]
  
# Creating a DataFrame object
df = pd.DataFrame(employees, 
                  columns = ['Name', 'Age', 'City'])
  
# Selecting duplicate rows based
# on 'City' column
duplicate = df[df.duplicated('City')]
  
print("Duplicate Rows based on City :")
  
# Print the resultant Dataframe
duplicate

Output :
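Duplicate Rows based on City :
     Name  Age    City
3  Saumya   32   Delhi
4  Saumya   32   Delhi
5  Saumya   32  Mumbai
7   Seema   32   Delhi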

Example 4 : Select duplicate rows based on more than one column name.


# import pandas library
import pandas as pd
  
# List of Tuples
employees = [('Stuti', 28, 'Varanasi'),
            ('Saumya', 32, 'Delhi'),
            ('Aaditya', 25, 'Mumbai'),
            ('Saumya', 32, 'Delhi'),
            ('Saumya', 32, 'Delhi'),
            ('Saumya', 32, 'Mumbai'),
            ('Aaditya', 40, 'Dehradun'),
            ('Seema', 32, 'Delhi')
            ]
  
# Creating a DataFrame object  
df = pd.DataFrame(employees, 
                   columns = ['Name', 'Age', 'City'])
  
# Selecting duplicate rows based
# on list of column names
duplicate = df[df.duplicated(['Name', 'Age'])]
  
print("Duplicate Rows based on Name and Age :")
  
# Print the resultant Dataframe
duplicate

Output :
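Duplicate Rows based on Name and Age :
     Name  Age    City
3  Saumya   32   Delhi
4  Saumya   32   Delhi
5  Saumya   32  Mumbai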



