Find Duplicate Rows in a DataFrame Based on All or Selected Columns
Duplicate rows in a DataFrame are rows whose values repeat those of another row, either across all columns or across a chosen subset of columns. Detecting them is a routine step in data cleaning and validation. In this article, we discuss how to find duplicate rows in a DataFrame based on all columns or a list of columns, using the pandas DataFrame.duplicated() method.
Creating a Sample Pandas DataFrame
Let’s create a simple DataFrame from a list of tuples, with column names ‘Name’, ‘Age’, and ‘City’.
Python3
import pandas as pd

employees = [('Stuti', 28, 'Varanasi'),
             ('Saumya', 32, 'Delhi'),
             ('Aaditya', 25, 'Mumbai'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Mumbai'),
             ('Aaditya', 40, 'Dehradun'),
             ('Seema', 32, 'Delhi')]

df = pd.DataFrame(employees,
                  columns=['Name', 'Age', 'City'])
df
Output
Name Age City
0 Stuti 28 Varanasi
1 Saumya 32 Delhi
2 Aaditya 25 Mumbai
3 Saumya 32 Delhi
4 Saumya 32 Delhi
5 Saumya 32 Mumbai
6 Aaditya 40 Dehradun
7 Seema 32 Delhi
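Before looking at the examples, it helps to see what duplicated() actually returns: a boolean Series with one entry per row, where True marks a row that repeats an earlier row. The selection patterns below all use this mask inside df[...]. A minimal sketch with the sample data:

```python
import pandas as pd

employees = [('Stuti', 28, 'Varanasi'),
             ('Saumya', 32, 'Delhi'),
             ('Aaditya', 25, 'Mumbai'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Mumbai'),
             ('Aaditya', 40, 'Dehradun'),
             ('Seema', 32, 'Delhi')]
df = pd.DataFrame(employees, columns=['Name', 'Age', 'City'])

# duplicated() returns a boolean Series: True marks a row that repeats
# an earlier row (with the default keep='first').
mask = df.duplicated()
print(mask.tolist())
# [False, False, False, True, True, False, False, False]
```

Rows 3 and 4 are True because they repeat row 1 in every column; row 5 differs in ‘City’, so it is not flagged.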
Find All Duplicate Rows in a Pandas DataFrame
Below are the examples by which we can select duplicate rows in a DataFrame:
- Select Duplicate Rows Based on All Columns
- Get Duplicate Rows Except the Last Occurrence Based on All Columns
- Select Duplicate Rows Using a Single Column
- Select Duplicate Rows Using Multiple Columns
- Select Duplicate Rows Using Sort Values
Select Duplicate Rows Based on All Columns
Here we pass no arguments, so duplicated() uses the default values for both parameters: subset=None (compare all columns) and keep='first' (mark every occurrence after the first).
Python3
import pandas as pd

employees = [('Stuti', 28, 'Varanasi'),
             ('Saumya', 32, 'Delhi'),
             ('Aaditya', 25, 'Mumbai'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Mumbai'),
             ('Aaditya', 40, 'Dehradun'),
             ('Seema', 32, 'Delhi')]

df = pd.DataFrame(employees,
                  columns=['Name', 'Age', 'City'])

# Rows that repeat an earlier row across all columns
duplicate = df[df.duplicated()]
print("Duplicate Rows :")
duplicate
Output
Duplicate Rows :
Name Age City
3 Saumya 32 Delhi
4 Saumya 32 Delhi
Get Duplicate Rows Except the Last Occurrence Based on All Columns
If you want to mark all duplicates except the last occurrence, pass keep='last' as an argument.
Python3
import pandas as pd

employees = [('Stuti', 28, 'Varanasi'),
             ('Saumya', 32, 'Delhi'),
             ('Aaditya', 25, 'Mumbai'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Mumbai'),
             ('Aaditya', 40, 'Dehradun'),
             ('Seema', 32, 'Delhi')]

df = pd.DataFrame(employees,
                  columns=['Name', 'Age', 'City'])

# keep='last' marks all occurrences except the final one
duplicate = df[df.duplicated(keep='last')]
print("Duplicate Rows :")
duplicate
Output
Duplicate Rows :
Name Age City
1 Saumya 32 Delhi
3 Saumya 32 Delhi
Select Duplicate Rows Using a Single Column
If you want to select duplicate rows based only on selected columns, pass the column name (or a list of names) as the subset argument. Here we check duplicates in the ‘City’ column alone.
Python3
import pandas as pd

employees = [('Stuti', 28, 'Varanasi'),
             ('Saumya', 32, 'Delhi'),
             ('Aaditya', 25, 'Mumbai'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Mumbai'),
             ('Aaditya', 40, 'Dehradun'),
             ('Seema', 32, 'Delhi')]

df = pd.DataFrame(employees,
                  columns=['Name', 'Age', 'City'])

# Compare only the 'City' column when looking for duplicates
duplicate = df[df.duplicated('City')]
print("Duplicate Rows based on City :")
duplicate
Output
Duplicate Rows based on City :
Name Age City
3 Saumya 32 Delhi
4 Saumya 32 Delhi
5 Saumya 32 Mumbai
7 Seema 32 Delhi
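As a quick cross-check (not part of the original example), value_counts() explains why exactly four rows are flagged: a city appearing n times contributes n - 1 rows to df[df.duplicated('City')].

```python
import pandas as pd

employees = [('Stuti', 28, 'Varanasi'),
             ('Saumya', 32, 'Delhi'),
             ('Aaditya', 25, 'Mumbai'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Mumbai'),
             ('Aaditya', 40, 'Dehradun'),
             ('Seema', 32, 'Delhi')]
df = pd.DataFrame(employees, columns=['Name', 'Age', 'City'])

# Occurrences per city: Delhi appears 4 times and Mumbai twice,
# so (4 - 1) + (2 - 1) = 4 rows are marked as duplicates.
counts = df['City'].value_counts()
print(counts)
```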
Select Duplicate Rows Using Multiple Columns
In this example, the code identifies and displays duplicate rows based on the ‘Name’ and ‘Age’ columns, highlighting rows where individuals share the same name and age even if their cities differ.
Python3
import pandas as pd

employees = [('Stuti', 28, 'Varanasi'),
             ('Saumya', 32, 'Delhi'),
             ('Aaditya', 25, 'Mumbai'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Mumbai'),
             ('Aaditya', 40, 'Dehradun'),
             ('Seema', 32, 'Delhi')]

df = pd.DataFrame(employees,
                  columns=['Name', 'Age', 'City'])

# Compare the 'Name' and 'Age' columns together
duplicate = df[df.duplicated(['Name', 'Age'])]
print("Duplicate Rows based on Name and Age :")
duplicate
Output
Duplicate Rows based on Name and Age :
Name Age City
3 Saumya 32 Delhi
4 Saumya 32 Delhi
5 Saumya 32 Mumbai
Select Duplicate Rows Using Sort Values
In this example, duplicate rows based on the ‘Name’ and ‘Age’ columns are identified with keep=False, so every occurrence of a duplicated pair is marked, and the result is then sorted by the ‘Age’ column with sort_values().
Python3
import pandas as pd

employees = [('Stuti', 28, 'Varanasi'),
             ('Saumya', 32, 'Delhi'),
             ('Aaditya', 25, 'Mumbai'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Mumbai'),
             ('Aaditya', 40, 'Dehradun'),
             ('Seema', 32, 'Delhi')]

df = pd.DataFrame(employees,
                  columns=['Name', 'Age', 'City'])

# keep=False marks every occurrence of a duplicated (Name, Age) pair
duplicate_sorted = df[df.duplicated(['Name', 'Age'],
                                    keep=False)].sort_values('Age')
print("Duplicate Rows based on Name and Age (sorted):")
print(duplicate_sorted)
Output
Duplicate Rows based on Name and Age (sorted):
Name Age City
1 Saumya 32 Delhi
3 Saumya 32 Delhi
4 Saumya 32 Delhi
5 Saumya 32 Mumbai
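Once duplicates have been found, a common follow-up step (a brief aside, not covered above) is removing them with DataFrame.drop_duplicates(), which accepts the same subset and keep arguments as duplicated():

```python
import pandas as pd

employees = [('Stuti', 28, 'Varanasi'),
             ('Saumya', 32, 'Delhi'),
             ('Aaditya', 25, 'Mumbai'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Delhi'),
             ('Saumya', 32, 'Mumbai'),
             ('Aaditya', 40, 'Dehradun'),
             ('Seema', 32, 'Delhi')]
df = pd.DataFrame(employees, columns=['Name', 'Age', 'City'])

# Drop fully identical repeats, keeping the first occurrence of each row;
# this removes rows 3 and 4 (copies of row 1).
deduped = df.drop_duplicates()
print(len(df), '->', len(deduped))
# 8 -> 6
```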
Last Updated :
04 Dec, 2023