Drop Duplicates Ignoring One Column-Pandas

Last Updated : 07 Mar, 2024

Pandas provides many features for working with datasets. One of them is dropping duplicate rows, which can be done with the drop_duplicates function. Sometimes you want to remove duplicates while leaving one column out of the comparison, so that two rows count as duplicates when they match in every other column. We will explore four approaches to drop duplicates ignoring one column in pandas.

Using the subset parameter
Using duplicated() and boolean indexing
Using drop_duplicates and keep parameter
Using groupby() and first()

Using the subset parameter

The drop_duplicates function has a parameter called subset, which restricts the duplicate check to the specified columns. In this method, we drop the duplicates ignoring one column by passing every other column, that is, the columns we do not want to ignore, as a list in the subset parameter.

Syntax:

dropped_df = df.drop_duplicates(subset=['#column-1', '#column-2'])

Here,

column-1, column-2: These are the columns that you don’t want to ignore.

column-3: It is the column that you want to ignore. It is simply left out of the subset list.

df: It is the data frame from which duplicates need to be dropped.

In the example below, we have defined the data with three columns, first_name, last_name, and fees. Then, we have removed the duplicates ignoring the first_name column, by stating the last_name and fees columns in the subset parameter. The rows for Ishita and Vinayak share the same last_name and fees, so only the first of them is kept.

Python3

# Import the pandas library
import pandas as pd
 
# Define the data
data = {'first_name': ['Arun', 'Ishita', 'Ruchir', 'Vinayak'], 'last_name': [
    'Kumar', 'Rai', 'Jha', 'Rai'], 'fees': [5000, 6000, 5000, 6000]}
 
# Convert data to Pandas dataframe
df = pd.DataFrame(data)
 
# Print the actual data frame
print('Actual DataFrame:\n', df)
 
# Defining the list of columns that you want to consider
dropped_df = df.drop_duplicates(subset=['last_name', 'fees'])
 
# Print the dataframe without duplicates
print('DataFrame after removing duplicates:\n', dropped_df)

Output:

Actual DataFrame:
   first_name last_name  fees
0       Arun     Kumar  5000
1     Ishita       Rai  6000
2     Ruchir       Jha  5000
3    Vinayak       Rai  6000
DataFrame after removing duplicates:
   first_name last_name  fees
0       Arun     Kumar  5000
1     Ishita       Rai  6000
2     Ruchir       Jha  5000
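
The rows that survive keep their original index labels, which leaves a gap whenever the dropped row is not the last one. The following is a minimal sketch of how the result could be renumbered, assuming pandas 1.0 or later, where drop_duplicates accepts the ignore_index parameter. The reordered data is a made-up variant of the sample above, used only so that the gap is visible.

```python
import pandas as pd

# Hypothetical variant of the sample data: the duplicate row sits in the middle
df = pd.DataFrame({
    'first_name': ['Arun', 'Ishita', 'Vinayak', 'Ruchir'],
    'last_name': ['Kumar', 'Rai', 'Rai', 'Jha'],
    'fees': [5000, 6000, 6000, 5000],
})

# Default behaviour: the surviving rows keep their original labels
kept = df.drop_duplicates(subset=['last_name', 'fees'])

# ignore_index=True relabels the surviving rows from 0 upwards
renumbered = df.drop_duplicates(subset=['last_name', 'fees'], ignore_index=True)

print(kept.index.tolist())
print(renumbered.index.tolist())
```

Calling reset_index(drop=True) on the result is an equivalent alternative.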

Using duplicated() and Boolean Indexing

The duplicated() function returns a boolean Series that is True for every row that repeats an earlier row and False otherwise. The ~ operator inverts that Series, and boolean indexing with the inverted mask keeps only the rows that are not duplicates. In this approach, we will see how to drop duplicates ignoring one column using duplicated() and boolean indexing.

Syntax:

dropped_df = df[~df.duplicated(subset=['#column-1', '#column-2'])]

Here,

column-1, column-2: These are the columns that you don’t want to ignore.

df: It is the data frame from which duplicates need to be dropped.

In the example below, we have defined the data with three columns, first_name, last_name, and fees. Then, we removed the duplicates ignoring the first_name column, using duplicated() and boolean indexing.

Python3

# Import the pandas library
import pandas as pd
 
# Define the data
data = {'first_name': ['Arun', 'Ishita', 'Ruchir', 'Vinayak'], 'last_name': [
    'Kumar', 'Rai', 'Jha', 'Rai'], 'fees': [5000, 6000, 5000, 6000]}
 
# Convert data to Pandas dataframe
df = pd.DataFrame(data)
 
# Print the actual data frame
print('Actual DataFrame:\n', df)
 
# Dropping the duplicates using duplicated and boolean indexing
dropped_df = df[~df.duplicated(subset=['last_name', 'fees'])]
 
# Print the dataframe without duplicates
print('DataFrame after removing duplicates:\n', dropped_df)

Output:

Actual DataFrame:
   first_name last_name  fees
0       Arun     Kumar  5000
1     Ishita       Rai  6000
2     Ruchir       Jha  5000
3    Vinayak       Rai  6000
DataFrame after removing duplicates:
   first_name last_name  fees
0       Arun     Kumar  5000
1     Ishita       Rai  6000
2     Ruchir       Jha  5000
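
One advantage of this approach is that the mask can be inspected before anything is removed. The sketch below reuses the sample data to show the mask itself, the rows that would be discarded, and the effect of keep=False, which marks every member of a duplicate group. The variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'first_name': ['Arun', 'Ishita', 'Ruchir', 'Vinayak'],
    'last_name': ['Kumar', 'Rai', 'Jha', 'Rai'],
    'fees': [5000, 6000, 5000, 6000],
})

# True marks a row whose (last_name, fees) pair already appeared earlier
mask = df.duplicated(subset=['last_name', 'fees'])
print(mask.tolist())

# Rows that filtering with ~mask would discard
dropped_rows = df[mask]
print(dropped_rows)

# keep=False marks every member of a duplicate group, including the first one
all_members = df[df.duplicated(subset=['last_name', 'fees'], keep=False)]
print(all_members['first_name'].tolist())
```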

Using drop_duplicates and keep Parameter

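The function df.columns.difference() returns the column labels of the data frame except the ones passed to it. In this method, we pass the column to be ignored to difference() and use the resulting labels as the subset parameter of drop_duplicates, so the other columns never have to be listed by hand. The keep parameter of drop_duplicates, which defaults to 'first', decides which row of each group of duplicates is retained.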

Syntax:

dropped_df = df.drop_duplicates(subset=df.columns.difference(['#column-3']))

Here,

column-1, column-2: These are the columns that you don’t want to ignore. They are not listed explicitly because difference() returns them.

column-3: It is the column that you want to ignore

df: It is the data frame from which duplicates need to be dropped.

In the example below, we have defined the data with three columns, first_name, last_name, and fees. Then, we removed the duplicates ignoring the first_name column, by stating the first_name column in the difference function.

Python3

# Import the pandas library
import pandas as pd
 
# Define the data
data = {'first_name': ['Arun', 'Ishita', 'Ruchir', 'Vinayak'], 'last_name': [
    'Kumar', 'Rai', 'Jha', 'Rai'], 'fees': [5000, 6000, 5000, 6000]}
 
# Convert data to Pandas dataframe
df = pd.DataFrame(data)
 
# Print the actual data frame
print('Actual DataFrame:\n', df)
 
# Stating the column that you want to ignore
dropped_df = df.drop_duplicates(subset=df.columns.difference(['first_name']))
 
# Print the dataframe without duplicates
print('DataFrame after removing duplicates:\n', dropped_df)

Output:

Actual DataFrame:
   first_name last_name  fees
0       Arun     Kumar  5000
1     Ishita       Rai  6000
2     Ruchir       Jha  5000
3    Vinayak       Rai  6000
DataFrame after removing duplicates:
   first_name last_name  fees
0       Arun     Kumar  5000
1     Ishita       Rai  6000
2     Ruchir       Jha  5000
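
The example above relies on the default keep='first'. As a rough sketch of the other documented values, the snippet below builds the subset once with columns.difference() and compares keep='first', keep='last', and keep=False on the same sample data. Converting the labels to a list with tolist() is a choice made here for readability, not a requirement.

```python
import pandas as pd

df = pd.DataFrame({
    'first_name': ['Arun', 'Ishita', 'Ruchir', 'Vinayak'],
    'last_name': ['Kumar', 'Rai', 'Jha', 'Rai'],
    'fees': [5000, 6000, 5000, 6000],
})

# Every column label except the ignored one (difference() sorts the labels)
subset_cols = df.columns.difference(['first_name']).tolist()
print(subset_cols)

# keep='first' (the default) retains the first row of each duplicate group
keep_first = df.drop_duplicates(subset=subset_cols)

# keep='last' retains the last row of each duplicate group instead
keep_last = df.drop_duplicates(subset=subset_cols, keep='last')

# keep=False discards every row that has a duplicate
keep_none = df.drop_duplicates(subset=subset_cols, keep=False)

print(keep_first['first_name'].tolist())
print(keep_last['first_name'].tolist())
print(keep_none['first_name'].tolist())
```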

Using groupby() and first() Function

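The groupby() function groups the rows that share the same values in the given columns, and first() returns the first non-null value of each remaining column within every group. Grouping by all columns except the ignored one therefore leaves one row per group. Unlike drop_duplicates, the result uses the grouping columns as its index and is sorted by them by default. In this method, we will see how to drop duplicates ignoring one column using the groupby() and first() functions.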

Syntax:

dropped_df = df.groupby(['#column-1', '#column-2']).first()

Here,

column-1, column-2: These are the columns that you don’t want to ignore.

df: It is the data frame from which duplicates need to be dropped.

In the example below, we have defined the data with three columns, first_name, last_name, and fees. Then, we removed the duplicates ignoring the first_name column, using groupby() and first().

Python3

# Import the pandas library
import pandas as pd
 
# Define the data
data = {'first_name': ['Arun', 'Ishita', 'Ruchir', 'Vinayak'], 'last_name': [
    'Kumar', 'Rai', 'Jha', 'Rai'], 'fees': [5000, 6000, 5000, 6000]}
 
# Convert data to Pandas dataframe
df = pd.DataFrame(data)
 
# Print the actual data frame
print('Actual DataFrame:\n', df)
 
# Dropping the duplicates using groupby and first
dropped_df = df.groupby(['last_name', 'fees']).first()
 
# Print the dataframe without duplicates
print('DataFrame after removing duplicates:\n', dropped_df)

Output:

Actual DataFrame:
   first_name last_name  fees
0       Arun     Kumar  5000
1     Ishita       Rai  6000
2     Ruchir       Jha  5000
3    Vinayak       Rai  6000
DataFrame after removing duplicates:
                 first_name
last_name fees
Jha       5000     Ruchir
Kumar     5000       Arun
Rai       6000     Ishita
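
If a flat data frame in the original row and column order is preferred, groupby() can be told not to sort the groups and not to move the grouping columns into the index. The following is a minimal sketch under those assumptions, reusing the sample data. The name flat is illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'first_name': ['Arun', 'Ishita', 'Ruchir', 'Vinayak'],
    'last_name': ['Kumar', 'Rai', 'Jha', 'Rai'],
    'fees': [5000, 6000, 5000, 6000],
})

# sort=False keeps groups in order of first appearance;
# as_index=False returns the grouping columns as ordinary columns
flat = df.groupby(['last_name', 'fees'], as_index=False, sort=False).first()

# The grouping columns come first in the result, so restore the original order
flat = flat[df.columns.tolist()]

print(flat)
```

Note that first() picks the first non-null value per column, so on data with missing values the result can differ from drop_duplicates, which keeps whole rows.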
