Find duplicate rows in a Dataframe based on all or selected columns


Duplicate rows in a DataFrame are rows whose values repeat those of another row, either across all columns or across a selected subset of columns. Detecting them is a common data-cleaning step before analysis or further processing. In this article, we will discuss how to find duplicate rows in a DataFrame based on all columns or a list of columns, using the Dataframe.duplicated() method of Pandas.
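Before the employee examples below, it helps to see what Dataframe.duplicated() actually returns: a boolean Series, one value per row, that can be used directly as a row mask. A minimal sketch on a small made-up frame (not the article's employee data):

```python
import pandas as pd

# A tiny frame in which row 2 repeats row 0 exactly
df = pd.DataFrame({'Name': ['A', 'B', 'A'],
                   'Val': [1, 2, 1]})

# duplicated() returns one boolean per row: True marks rows that
# repeat an earlier row; the first occurrence stays False
mask = df.duplicated()
print(mask.tolist())  # [False, False, True]

# Indexing with the mask keeps only the duplicate rows
print(df[mask])
```

Every example in this article follows this same pattern: build the mask with duplicated(), then filter with df[mask].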

Creating a Sample Pandas DataFrame

Let’s create a simple DataFrame from a list of tuples, with column names ‘Name’, ‘Age’, and ‘City’.

Python3

# Import pandas library
import pandas as pd
 
# List of Tuples
employees = [('Stuti', 28, 'Varanasi'),
             ('Saumya', 32, 'Delhi'),
             ('Aaditya', 25, 'Mumbai'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Mumbai'),
             ('Aaditya', 40, 'Dehradun'),
             ('Seema', 32, 'Delhi')
             ]
 
# Creating a DataFrame object
df = pd.DataFrame(employees,
                  columns=['Name', 'Age', 'City'])
 
# Print the Dataframe
df

Output

      Name  Age      City
0    Stuti   28  Varanasi
1   Saumya   32     Delhi
2  Aaditya   25    Mumbai
3   Saumya   32     Delhi
4   Saumya   32     Delhi
5   Saumya   32    Mumbai
6  Aaditya   40  Dehradun
7    Seema   32     Delhi

Find All Duplicate Rows in a Pandas Dataframe

The following examples show the ways we can select duplicate rows in a DataFrame:

  • Select Duplicate Rows Based on All Columns
  • Get List of Duplicate Last Rows Based on All Columns
  • Select List Of Duplicate Rows Using Single Columns
  • Select List Of Duplicate Rows Using Multiple Columns
  • Select Duplicate Rows Using Sort Values

Select Duplicate Rows Based on All Columns

Here we do not pass any arguments, so duplicated() uses its defaults: subset=None (compare all columns) and keep=‘first’ (mark every occurrence except the first).

Python3

# Import pandas library
import pandas as pd
 
# List of Tuples
employees = [('Stuti', 28, 'Varanasi'),
             ('Saumya', 32, 'Delhi'),
             ('Aaditya', 25, 'Mumbai'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Mumbai'),
             ('Aaditya', 40, 'Dehradun'),
             ('Seema', 32, 'Delhi')
             ]
 
# Creating a DataFrame object
df = pd.DataFrame(employees,
                  columns=['Name', 'Age', 'City'])
 
# Selecting duplicate rows except first
# occurrence based on all columns
duplicate = df[df.duplicated()]
 
print("Duplicate Rows :")
 
# Print the resultant Dataframe
duplicate

Output

Duplicate Rows :
     Name  Age   City
3  Saumya   32  Delhi
4  Saumya   32  Delhi

Get List of Duplicate Last Rows Based on All Columns

If you want to consider all duplicates except the last one, pass keep = ‘last’ as an argument.

Python3

# Import pandas library
import pandas as pd
 
# List of Tuples
employees = [('Stuti', 28, 'Varanasi'),
             ('Saumya', 32, 'Delhi'),
             ('Aaditya', 25, 'Mumbai'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Mumbai'),
             ('Aaditya', 40, 'Dehradun'),
             ('Seema', 32, 'Delhi')
             ]
 
# Creating a DataFrame object
df = pd.DataFrame(employees,
                  columns=['Name', 'Age', 'City'])
 
# Selecting duplicate rows except last
# occurrence based on all columns.
duplicate = df[df.duplicated(keep='last')]
 
print("Duplicate Rows :")
 
# Print the resultant Dataframe
duplicate

Output

Duplicate Rows :
     Name  Age   City
3  Saumya   32  Delhi
4  Saumya   32  Delhi
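Besides keep=‘first’ and keep=‘last’, duplicated() also accepts keep=False, which marks every member of a duplicate group, first and last occurrences included. A short sketch on a trimmed version of the employee data:

```python
import pandas as pd

# Trimmed-down employee data: 'Saumya, 32, Delhi' appears three times
employees = [('Stuti', 28, 'Varanasi'),
             ('Saumya', 32, 'Delhi'),
             ('Aaditya', 25, 'Mumbai'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Delhi')]

df = pd.DataFrame(employees, columns=['Name', 'Age', 'City'])

# keep=False marks all occurrences of a duplicated row, so the
# first occurrence (index 1) is returned along with indices 3 and 4
all_dups = df[df.duplicated(keep=False)]
print(all_dups)
```

This is the variant to use when you want to inspect the whole duplicate group rather than only the "extra" copies.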

Select List Of Duplicate Rows Using Single Columns

If you want to select duplicate rows based only on some selected columns, pass the column name (or a list of column names) as the subset argument.

Python3

# import pandas library
import pandas as pd
 
# List of Tuples
employees = [('Stuti', 28, 'Varanasi'),
             ('Saumya', 32, 'Delhi'),
             ('Aaditya', 25, 'Mumbai'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Mumbai'),
             ('Aaditya', 40, 'Dehradun'),
             ('Seema', 32, 'Delhi')
             ]
 
# Creating a DataFrame object
df = pd.DataFrame(employees,
                  columns=['Name', 'Age', 'City'])
 
# Selecting duplicate rows based
# on 'City' column
duplicate = df[df.duplicated('City')]
 
print("Duplicate Rows based on City :")
 
# Print the resultant Dataframe
duplicate

Output

Duplicate Rows based on City :
     Name  Age    City
3  Saumya   32   Delhi
4  Saumya   32   Delhi
5  Saumya   32  Mumbai
7   Seema   32   Delhi

Select List Of Duplicate Rows Using Multiple Columns

In this example, a pandas DataFrame is created from a list of employee tuples with columns ‘Name,’ ‘Age,’ and ‘City.’ The code identifies and displays duplicate rows based on the ‘Name’ and ‘Age’ columns, highlighting instances where individuals share the same name and age.

Python3

# import pandas library
import pandas as pd
 
# List of Tuples
employees = [('Stuti', 28, 'Varanasi'),
             ('Saumya', 32, 'Delhi'),
             ('Aaditya', 25, 'Mumbai'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Mumbai'),
             ('Aaditya', 40, 'Dehradun'),
             ('Seema', 32, 'Delhi')
             ]
 
# Creating a DataFrame object
df = pd.DataFrame(employees,
                  columns=['Name', 'Age', 'City'])
 
# Selecting duplicate rows based
# on list of column names
duplicate = df[df.duplicated(['Name', 'Age'])]
 
print("Duplicate Rows based on Name and Age :")
 
# Print the resultant Dataframe
duplicate

Output

Duplicate Rows based on Name and Age :
     Name  Age    City
3  Saumya   32   Delhi
4  Saumya   32   Delhi
5  Saumya   32  Mumbai

Select Duplicate Rows Using Sort Values

In this example, a pandas DataFrame is created from a list of employee tuples, and duplicate rows based on the ‘Name’ and ‘Age’ columns are identified with keep=False, so every occurrence is returned. The resulting DataFrame is then sorted by the ‘Age’ column, showing how to find and organize duplicate entries in a tabular data structure.

Python3

import pandas as pd
 
# List of Tuples
employees = [('Stuti', 28, 'Varanasi'),
             ('Saumya', 32, 'Delhi'),
             ('Aaditya', 25, 'Mumbai'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Mumbai'),
             ('Aaditya', 40, 'Dehradun'),
             ('Seema', 32, 'Delhi')
             ]
 
# Creating a DataFrame object
df = pd.DataFrame(employees,
                  columns=['Name', 'Age', 'City'])
 
# Finding and sorting duplicate rows based on 'Name' and 'Age'
duplicate_sorted = df[df.duplicated(['Name', 'Age'], keep=False)].sort_values('Age')
 
print("Duplicate Rows based on Name and Age (sorted):")
 
# Print the resultant DataFrame
print(duplicate_sorted)

Output

Duplicate Rows based on Name and Age (sorted):
     Name  Age    City
1  Saumya   32   Delhi
3  Saumya   32   Delhi
4  Saumya   32   Delhi
5  Saumya   32  Mumbai
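Since duplicated() yields booleans, summing the mask gives a quick count of duplicate rows without building the filtered DataFrame at all. This is a small extension beyond the examples above, sketched on a made-up three-row frame:

```python
import pandas as pd

# Row 1 repeats row 0 exactly; row 2 is unique
df = pd.DataFrame({'Name': ['Saumya', 'Saumya', 'Stuti'],
                   'Age': [32, 32, 28]})

# True counts as 1 when summed, so this counts the rows flagged as
# duplicates (all but the first occurrence of each duplicate group)
n_dupes = int(df.duplicated().sum())
print(n_dupes)  # 1
```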



Last Updated : 04 Dec, 2023