
Find duplicate rows in a Dataframe based on all or selected columns


Duplicate rows in a DataFrame are rows that contain identical values in all columns, or in a specified subset of columns. Identifying them is a common requirement when cleaning or analysing tabular data. In this article, we will be discussing how to find duplicate rows in a Dataframe based on all or a list of columns. For this, we will use the Dataframe.duplicated() method of Pandas.

Creating a Sample Pandas DataFrame

Let’s create a simple DataFrame from a list of tuples, with the column names ‘Name’, ‘Age’, and ‘City’.

Python3




# Import pandas library
import pandas as pd
 
# List of Tuples
employees = [('Stuti', 28, 'Varanasi'),
             ('Saumya', 32, 'Delhi'),
             ('Aaditya', 25, 'Mumbai'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Mumbai'),
             ('Aaditya', 40, 'Dehradun'),
             ('Seema', 32, 'Delhi')
             ]
 
# Creating a DataFrame object
df = pd.DataFrame(employees,
                  columns=['Name', 'Age', 'City'])
 
# Print the Dataframe
df


Output

      Name  Age      City
0    Stuti   28  Varanasi
1   Saumya   32     Delhi
2  Aaditya   25    Mumbai
3   Saumya   32     Delhi
4   Saumya   32     Delhi
5   Saumya   32    Mumbai
6  Aaditya   40  Dehradun
7    Seema   32     Delhi
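
Before selecting rows, it helps to see what DataFrame.duplicated() actually returns: a boolean Series with True for every row that is an exact repeat of an earlier row. Below is a minimal sketch run on the DataFrame created above (reusing the same df).

Python3

# duplicated() returns a boolean mask:
# True marks rows that repeat an earlier row across all columns
mask = df.duplicated()
print(mask)

Output

0    False
1    False
2    False
3     True
4     True
5    False
6    False
7    False
dtype: bool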

Find All Duplicate Rows in a Pandas Dataframe

Below are the approaches we will use to select duplicate rows in a DataFrame:

  • Select Duplicate Rows Based on All Columns
  • Get List of Duplicate Last Rows Based on All Columns
  • Select List Of Duplicate Rows Using Single Columns
  • Select List Of Duplicate Rows Using Multiple Columns
  • Select Duplicate Rows Using Sort Values

Select Duplicate Rows Based on All Columns

Here, we do not pass any arguments, so duplicated() uses the default values for both parameters, i.e. subset=None and keep='first': a row is marked as a duplicate when an identical row has already appeared earlier, comparing all columns.

Python3




# Import pandas library
import pandas as pd
 
# List of Tuples
employees = [('Stuti', 28, 'Varanasi'),
             ('Saumya', 32, 'Delhi'),
             ('Aaditya', 25, 'Mumbai'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Mumbai'),
             ('Aaditya', 40, 'Dehradun'),
             ('Seema', 32, 'Delhi')
             ]
 
# Creating a DataFrame object
df = pd.DataFrame(employees,
                  columns=['Name', 'Age', 'City'])
 
# Selecting duplicate rows except first
# occurrence based on all columns
duplicate = df[df.duplicated()]
 
print("Duplicate Rows :")
 
# Print the resultant Dataframe
duplicate


Output

Duplicate Rows :
     Name  Age   City
3  Saumya   32  Delhi
4  Saumya   32  Delhi

Get List of Duplicate Last Rows Based on All Columns

If you want to mark all duplicate occurrences except the last one, pass keep='last' as an argument.

Python3




# Import pandas library
import pandas as pd
 
# List of Tuples
employees = [('Stuti', 28, 'Varanasi'),
             ('Saumya', 32, 'Delhi'),
             ('Aaditya', 25, 'Mumbai'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Mumbai'),
             ('Aaditya', 40, 'Dehradun'),
             ('Seema', 32, 'Delhi')
             ]
 
# Creating a DataFrame object
df = pd.DataFrame(employees,
                  columns=['Name', 'Age', 'City'])
 
# Selecting duplicate rows except last
# occurrence based on all columns.
duplicate = df[df.duplicated(keep='last')]
 
print("Duplicate Rows :")
 
# Print the resultant Dataframe
duplicate


Output

Duplicate Rows :
     Name  Age   City
1  Saumya   32  Delhi
3  Saumya   32  Delhi
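
Besides 'first' and 'last', you can also pass keep=False to mark every occurrence of a duplicated row, including the first and last ones; the sort-values example later in this article relies on this behaviour. A short sketch, reusing the same df as above:

Python3

# keep=False marks all occurrences of duplicated rows
duplicate_all = df[df.duplicated(keep=False)]
print(duplicate_all)

Output

     Name  Age   City
1  Saumya   32  Delhi
3  Saumya   32  Delhi
4  Saumya   32  Delhi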

Select List Of Duplicate Rows Using Single Columns

If you want to find duplicate rows based only on selected columns, pass the column name (for a single column) or a list of column names as the subset argument. Here, we check for duplicates in the ‘City’ column alone.

Python3




# import pandas library
import pandas as pd
 
# List of Tuples
employees = [('Stuti', 28, 'Varanasi'),
            ('Saumya', 32, 'Delhi'),
            ('Aaditya', 25, 'Mumbai'),
            ('Saumya', 32, 'Delhi'),
            ('Saumya', 32, 'Delhi'),
            ('Saumya', 32, 'Mumbai'),
            ('Aaditya', 40, 'Dehradun'),
            ('Seema', 32, 'Delhi')
            ]
 
# Creating a DataFrame object
df = pd.DataFrame(employees,
                columns = ['Name', 'Age', 'City'])
 
# Selecting duplicate rows based
# on 'City' column
duplicate = df[df.duplicated('City')]
 
print("Duplicate Rows based on City :")
 
# Print the resultant Dataframe
duplicate


Output

Duplicate Rows based on City :
     Name  Age    City
3  Saumya   32   Delhi
4  Saumya   32   Delhi
5  Saumya   32  Mumbai
7   Seema   32   Delhi

Select List Of Duplicate Rows Using Multiple Columns

In this example, a pandas DataFrame is created from a list of employee tuples with columns ‘Name,’ ‘Age,’ and ‘City.’ The code identifies and displays duplicate rows based on the ‘Name’ and ‘Age’ columns, highlighting instances where individuals share the same name and age.

Python3




# import pandas library
import pandas as pd
 
# List of Tuples
employees = [('Stuti', 28, 'Varanasi'),
             ('Saumya', 32, 'Delhi'),
             ('Aaditya', 25, 'Mumbai'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Mumbai'),
             ('Aaditya', 40, 'Dehradun'),
             ('Seema', 32, 'Delhi')
             ]
 
# Creating a DataFrame object
df = pd.DataFrame(employees,
                  columns=['Name', 'Age', 'City'])
 
# Selecting duplicate rows based
# on list of column names
duplicate = df[df.duplicated(['Name', 'Age'])]
 
print("Duplicate Rows based on Name and Age :")
 
# Print the resultant Dataframe
duplicate


Output

Duplicate Rows based on Name and Age :
     Name  Age    City
3  Saumya   32   Delhi
4  Saumya   32   Delhi
5  Saumya   32  Mumbai

Select Duplicate Rows Using Sort Values

In this example, a pandas DataFrame is created from a list of employee tuples, and duplicate rows based on the ‘Name’ and ‘Age’ columns are identified with keep=False, so every occurrence of a duplicated pair is marked (not just the repeats). The resulting DataFrame is then sorted by the ‘Age’ column, showing how to find and organize duplicate entries in a tabular data structure.

Python3




import pandas as pd
 
# List of Tuples
employees = [('Stuti', 28, 'Varanasi'),
             ('Saumya', 32, 'Delhi'),
             ('Aaditya', 25, 'Mumbai'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Mumbai'),
             ('Aaditya', 40, 'Dehradun'),
             ('Seema', 32, 'Delhi')
             ]
 
# Creating a DataFrame object
df = pd.DataFrame(employees,
                  columns=['Name', 'Age', 'City'])
 
# Finding and sorting duplicate rows based on 'Name' and 'Age'
duplicate_sorted = df[df.duplicated(['Name', 'Age'], keep=False)].sort_values('Age')
 
print("Duplicate Rows based on Name and Age (sorted):")
 
# Print the resultant DataFrame
print(duplicate_sorted)


Output

Duplicate Rows based on Name and Age (sorted):
     Name  Age    City
1  Saumya   32   Delhi
3  Saumya   32   Delhi
4  Saumya   32   Delhi
5  Saumya   32  Mumbai
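
If you also want to know how many copies of each duplicated pair exist, one possible follow-up is to group the marked rows by the same subset of columns; a small sketch (using groupby, reusing the df above):

Python3

# Count how many times each duplicated (Name, Age) pair occurs
dup_counts = (
    df[df.duplicated(['Name', 'Age'], keep=False)]
    .groupby(['Name', 'Age'])
    .size()
)

# The only duplicated pair in this data is ('Saumya', 32),
# which appears 4 times
print(dup_counts)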


