
Find duplicate rows in a Dataframe based on all or selected columns

Last Updated : 04 Dec, 2023

Duplicate rows in a DataFrame are rows whose values repeat those of another row, either across all columns or across a chosen subset of columns. Finding them is a common step when cleaning or validating tabular data. In this article, we will be discussing how to find duplicate rows in a Dataframe based on all columns or on a list of selected columns. For this, we will use the Dataframe.duplicated() method of Pandas.
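
Before the examples, here is a minimal sketch (using a small toy DataFrame, not the employee data used below) of what duplicated() actually returns: a boolean Series that is True for every row repeating an earlier row, which can then be used as a mask to select those rows.

Python3

# Minimal sketch: duplicated() returns a boolean mask
import pandas as pd

# Toy DataFrame, for illustration only
toy = pd.DataFrame({'A': [1, 1, 2],
                    'B': ['x', 'x', 'y']})

# True for every row that repeats an earlier row
mask = toy.duplicated()
print(mask)       # 0: False, 1: True, 2: False

# Boolean indexing with the mask keeps only the duplicate rows
print(toy[mask])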

Creating a Sample Pandas DataFrame

Let’s create a simple Dataframe from a list of tuples, with column names: ‘Name’, ‘Age’, and ‘City’.

Python3




# Import pandas library
import pandas as pd
 
# List of Tuples
employees = [('Stuti', 28, 'Varanasi'),
             ('Saumya', 32, 'Delhi'),
             ('Aaditya', 25, 'Mumbai'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Mumbai'),
             ('Aaditya', 40, 'Dehradun'),
             ('Seema', 32, 'Delhi')
             ]
 
# Creating a DataFrame object
df = pd.DataFrame(employees,
                  columns=['Name', 'Age', 'City'])
 
# Print the Dataframe
df


Output

      Name  Age      City
0    Stuti   28  Varanasi
1   Saumya   32     Delhi
2  Aaditya   25    Mumbai
3   Saumya   32     Delhi
4   Saumya   32     Delhi
5   Saumya   32    Mumbai
6  Aaditya   40  Dehradun
7    Seema   32     Delhi

Find All Duplicate Rows in a Pandas Dataframe

Below are the examples by which we can select duplicate rows in a DataFrame:

  • Select Duplicate Rows Based on All Columns
  • Select Duplicate Rows Except the Last Occurrence
  • Select Duplicate Rows Based on a Single Column
  • Select Duplicate Rows Based on Multiple Columns
  • Select Duplicate Rows Using Sort Values

Select Duplicate Rows Based on All Columns

Here, we do not pass any arguments, so duplicated() uses its default values: subset=None and keep=‘first’. This marks every row that repeats an earlier row across all columns, leaving the first occurrence unflagged.

Python3




# Import pandas library
import pandas as pd
 
# List of Tuples
employees = [('Stuti', 28, 'Varanasi'),
             ('Saumya', 32, 'Delhi'),
             ('Aaditya', 25, 'Mumbai'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Mumbai'),
             ('Aaditya', 40, 'Dehradun'),
             ('Seema', 32, 'Delhi')
             ]
 
# Creating a DataFrame object
df = pd.DataFrame(employees,
                  columns=['Name', 'Age', 'City'])
 
# Selecting duplicate rows except first
# occurrence based on all columns
duplicate = df[df.duplicated()]
 
print("Duplicate Rows :")
 
# Print the resultant Dataframe
duplicate


Output

Duplicate Rows :
     Name  Age   City
3  Saumya   32  Delhi
4  Saumya   32  Delhi

Select Duplicate Rows Except the Last Occurrence

If you want to mark all duplicates except the last occurrence, pass keep=‘last’ as an argument.

Python3




# Import pandas library
import pandas as pd
 
# List of Tuples
employees = [('Stuti', 28, 'Varanasi'),
             ('Saumya', 32, 'Delhi'),
             ('Aaditya', 25, 'Mumbai'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Mumbai'),
             ('Aaditya', 40, 'Dehradun'),
             ('Seema', 32, 'Delhi')
             ]
 
# Creating a DataFrame object
df = pd.DataFrame(employees,
                  columns=['Name', 'Age', 'City'])
 
# Selecting duplicate rows except last
# occurrence based on all columns.
duplicate = df[df.duplicated(keep='last')]
 
print("Duplicate Rows :")
 
# Print the resultant Dataframe
duplicate


Output

Duplicate Rows :
     Name  Age   City
1  Saumya   32  Delhi
3  Saumya   32  Delhi
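
As a side note (an addition, not part of the original example), passing keep=False flags every occurrence of a duplicated row, first and last included. A minimal sketch on the same employee data:

Python3

# Import pandas library
import pandas as pd

# Same employee data as above
employees = [('Stuti', 28, 'Varanasi'),
             ('Saumya', 32, 'Delhi'),
             ('Aaditya', 25, 'Mumbai'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Mumbai'),
             ('Aaditya', 40, 'Dehradun'),
             ('Seema', 32, 'Delhi')
             ]

df = pd.DataFrame(employees,
                  columns=['Name', 'Age', 'City'])

# keep=False marks every occurrence of a fully duplicated row,
# so the three identical (Saumya, 32, Delhi) rows are all returned
duplicate_all = df[df.duplicated(keep=False)]
print(duplicate_all)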

Select Duplicate Rows Based on a Single Column

If you want to select duplicate rows based only on some selected columns, pass a column name (or a list of column names) as the subset argument. Here we use the single column ‘City’.

Python3




# import pandas library
import pandas as pd
 
# List of Tuples
employees = [('Stuti', 28, 'Varanasi'),
             ('Saumya', 32, 'Delhi'),
             ('Aaditya', 25, 'Mumbai'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Mumbai'),
             ('Aaditya', 40, 'Dehradun'),
             ('Seema', 32, 'Delhi')
             ]
 
# Creating a DataFrame object
df = pd.DataFrame(employees,
                  columns=['Name', 'Age', 'City'])
 
# Selecting duplicate rows based
# on 'City' column
duplicate = df[df.duplicated('City')]
 
print("Duplicate Rows based on City :")
 
# Print the resultant Dataframe
duplicate


Output

Duplicate Rows based on City :
     Name  Age    City
3  Saumya   32   Delhi
4  Saumya   32   Delhi
5  Saumya   32  Mumbai
7   Seema   32   Delhi
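
As a complementary sketch (not from the original article), subset and keep can be combined: passing keep=False together with the ‘City’ column returns every row whose city appears more than once, first occurrences included.

Python3

# Import pandas library
import pandas as pd

# Same employee data as above
employees = [('Stuti', 28, 'Varanasi'),
             ('Saumya', 32, 'Delhi'),
             ('Aaditya', 25, 'Mumbai'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Mumbai'),
             ('Aaditya', 40, 'Dehradun'),
             ('Seema', 32, 'Delhi')
             ]

df = pd.DataFrame(employees,
                  columns=['Name', 'Age', 'City'])

# Every row whose 'City' value occurs more than once,
# including the first occurrence of each repeated city
all_city_dupes = df[df.duplicated('City', keep=False)]
print(all_city_dupes)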

Select Duplicate Rows Based on Multiple Columns

In this example, a pandas DataFrame is created from a list of employee tuples with columns ‘Name,’ ‘Age,’ and ‘City.’ The code identifies and displays duplicate rows based on the ‘Name’ and ‘Age’ columns, highlighting instances where individuals share the same name and age.

Python3




# import pandas library
import pandas as pd
 
# List of Tuples
employees = [('Stuti', 28, 'Varanasi'),
             ('Saumya', 32, 'Delhi'),
             ('Aaditya', 25, 'Mumbai'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Mumbai'),
             ('Aaditya', 40, 'Dehradun'),
             ('Seema', 32, 'Delhi')
             ]
 
# Creating a DataFrame object
df = pd.DataFrame(employees,
                  columns=['Name', 'Age', 'City'])
 
# Selecting duplicate rows based
# on list of column names
duplicate = df[df.duplicated(['Name', 'Age'])]
 
print("Duplicate Rows based on Name and Age :")
 
# Print the resultant Dataframe
duplicate


Output

Duplicate Rows based on Name and Age :
     Name  Age    City
3  Saumya   32   Delhi
4  Saumya   32   Delhi
5  Saumya   32  Mumbai
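
A related sketch (an addition, not part of the original example): to see how many times each (Name, Age) pair occurs, group on those columns and keep only the groups with more than one row.

Python3

# Import pandas library
import pandas as pd

# Same employee data as above
employees = [('Stuti', 28, 'Varanasi'),
             ('Saumya', 32, 'Delhi'),
             ('Aaditya', 25, 'Mumbai'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Mumbai'),
             ('Aaditya', 40, 'Dehradun'),
             ('Seema', 32, 'Delhi')
             ]

df = pd.DataFrame(employees,
                  columns=['Name', 'Age', 'City'])

# Count rows per (Name, Age) pair and keep pairs that occur more than once
counts = df.groupby(['Name', 'Age']).size()
print(counts[counts > 1])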

Select Duplicate Rows Using Sort Values

In this example, a pandas DataFrame is created from a list of employee tuples, and duplicate rows based on the ‘Name’ and ‘Age’ columns are identified with keep=False (so every occurrence is flagged) and displayed, with the resulting DataFrame sorted by the ‘Age’ column. The code shows how to find and organize duplicate entries in a tabular data structure.

Python3




import pandas as pd
 
# List of Tuples
employees = [('Stuti', 28, 'Varanasi'),
             ('Saumya', 32, 'Delhi'),
             ('Aaditya', 25, 'Mumbai'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Mumbai'),
             ('Aaditya', 40, 'Dehradun'),
             ('Seema', 32, 'Delhi')
             ]
 
# Creating a DataFrame object
df = pd.DataFrame(employees,
                  columns=['Name', 'Age', 'City'])
 
# Finding and sorting duplicate rows based on 'Name' and 'Age'
duplicate_sorted = df[df.duplicated(['Name', 'Age'], keep=False)].sort_values('Age')
 
print("Duplicate Rows based on Name and Age (sorted):")
 
# Print the resultant DataFrame
print(duplicate_sorted)


Output

Duplicate Rows based on Name and Age (sorted):
     Name  Age    City
1  Saumya   32   Delhi
3  Saumya   32   Delhi
4  Saumya   32   Delhi
5  Saumya   32  Mumbai


