
Drop Duplicates Ignoring One Column-Pandas

Last Updated : 07 Mar, 2024

Pandas provides many operations for cleaning datasets. One of them is removing duplicate rows, which is done with the drop_duplicates function. Sometimes rows should count as duplicates even when they differ in one column, so that column has to be left out of the comparison. We will explore four approaches to dropping duplicates while ignoring one column in Pandas.

Drop Duplicates Ignoring One Column-Pandas

  • Using the subset parameter
  • Using duplicated and boolean indexing
  • Using drop_duplicates and columns.difference
  • Using group by and first

Using the subset parameter

The drop_duplicates function has a subset parameter that restricts the duplicate check to the listed columns. To ignore one column, pass every other column as a list to subset. Rows that match on all the listed columns are treated as duplicates, and by default the first occurrence is kept.

Syntax:

dropped_df = df.drop_duplicates(subset=['column-1', 'column-2'])

Here,

  • column-1, column-2: These are the columns that you don't want to ignore.
  • column-3: It is the column that you want to ignore. It is simply left out of the subset list.
  • df: It is the data frame from which duplicates need to be dropped.

In the below example, we have defined the data with three columns, first_name, last_name, and fees. Then, we have removed the duplicates ignoring the first_name column, by stating the last_name and fees columns in the subset parameter.

Python3




# Import the pandas library
import pandas as pd
 
# Define the data
data = {'first_name': ['Arun', 'Ishita', 'Ruchir', 'Vinayak'], 'last_name': [
    'Kumar', 'Rai', 'Jha', 'Rai'], 'fees': [5000, 6000, 5000, 6000]}
 
# Convert data to Pandas dataframe
df = pd.DataFrame(data)
 
# Print the actual data frame
print('Actual DataFrame:\n', df)
 
# Defining the list of columns that you want to consider
dropped_df = df.drop_duplicates(subset=['last_name', 'fees'])
 
# Print the dataframe without duplicates
print('DataFrame after removing duplicates:\n', dropped_df)


Output:

Actual DataFrame:
first_name last_name fees
0 Arun Kumar 5000
1 Ishita Rai 6000
2 Ruchir Jha 5000
3 Vinayak Rai 6000
DataFrame after removing duplicates:
first_name last_name fees
0 Arun Kumar 5000
1 Ishita Rai 6000
2 Ruchir Jha 5000
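
The example above relies on the default behaviour of keeping the first row of each duplicate group. As a minimal sketch using the same data, the keep parameter of drop_duplicates can change which row survives; the variable names kept_first, kept_last and kept_none are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'first_name': ['Arun', 'Ishita', 'Ruchir', 'Vinayak'],
    'last_name': ['Kumar', 'Rai', 'Jha', 'Rai'],
    'fees': [5000, 6000, 5000, 6000],
})

# Columns that take part in the duplicate check (first_name is ignored)
cols = ['last_name', 'fees']

# keep='first' (the default) keeps the earliest row of each duplicate group
kept_first = df.drop_duplicates(subset=cols, keep='first')

# keep='last' keeps the latest row of each duplicate group instead
kept_last = df.drop_duplicates(subset=cols, keep='last')

# keep=False drops every row that has a duplicate, keeping none of them
kept_none = df.drop_duplicates(subset=cols, keep=False)

print(kept_first['first_name'].tolist())  # ['Arun', 'Ishita', 'Ruchir']
print(kept_last['first_name'].tolist())   # ['Arun', 'Ruchir', 'Vinayak']
print(kept_none['first_name'].tolist())   # ['Arun', 'Ruchir']
```

Because the ignored column can differ between duplicate rows, the choice of keep decides which first_name value ends up in the result.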

Using duplicated() and Boolean Indexing

The duplicated() function returns a boolean Series that is True for each row that repeats an earlier row and False otherwise. The ~ operator inverts that mask, so indexing the data frame with the inverted mask keeps only the non-duplicate rows. In this approach, we will see how to drop duplicates ignoring one column using duplicated and boolean indexing.

Syntax:

dropped_df = df[~df.duplicated(subset=['column-1', 'column-2'])]

Here,

  • column-1, column-2: These are the columns that you don’t want to ignore.
  • df: It is the data frame from which duplicates need to be dropped.

In the below example, we have defined the data with three columns, first_name, last_name, and fees. Then, we removed the duplicates ignoring the first_name column, using duplicated and boolean indexing.

Python3




# Import the pandas library
import pandas as pd
 
# Define the data
data = {'first_name': ['Arun', 'Ishita', 'Ruchir', 'Vinayak'], 'last_name': [
    'Kumar', 'Rai', 'Jha', 'Rai'], 'fees': [5000, 6000, 5000, 6000]}
 
# Convert data to Pandas dataframe
df = pd.DataFrame(data)
 
# Print the actual data frame
print('Actual DataFrame:\n', df)
 
# Dropping the duplicates using duplicated and boolean indexing
dropped_df = df[~df.duplicated(subset=['last_name', 'fees'])]
 
# Print the dataframe without duplicates
print('DataFrame after removing duplicates:\n', dropped_df)


Output:

Actual DataFrame:
first_name last_name fees
0 Arun Kumar 5000
1 Ishita Rai 6000
2 Ruchir Jha 5000
3 Vinayak Rai 6000
DataFrame after removing duplicates:
first_name last_name fees
0 Arun Kumar 5000
1 Ishita Rai 6000
2 Ruchir Jha 5000
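
To make the boolean indexing step easier to follow, the sketch below stores the mask in a separate variable before applying it. The names mask and removed are hypothetical and used only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'first_name': ['Arun', 'Ishita', 'Ruchir', 'Vinayak'],
    'last_name': ['Kumar', 'Rai', 'Jha', 'Rai'],
    'fees': [5000, 6000, 5000, 6000],
})

# True marks a row whose (last_name, fees) pair already appeared earlier
mask = df.duplicated(subset=['last_name', 'fees'])
print(mask.tolist())  # [False, False, False, True]

# ~mask flips True and False, so only first occurrences are selected
dropped_df = df[~mask]

# The same mask, not inverted, shows which rows were removed
removed = df[mask]
print(removed['first_name'].tolist())  # ['Vinayak']

# The result matches drop_duplicates with the same subset
same = dropped_df.equals(df.drop_duplicates(subset=['last_name', 'fees']))
print(same)  # True
```

Keeping the mask around is useful when the removed rows need to be inspected or logged before they are discarded.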

Using drop_duplicates and columns.difference

The columns.difference() method returns the data frame's column labels with the given labels removed. Passing that result to the subset parameter of drop_duplicates compares every column except the ignored one, so the remaining columns do not have to be listed by hand.

Syntax:

dropped_df = df.drop_duplicates(subset=df.columns.difference(['column-3']))

Here,

  • column-1, column-2: These are the columns that you don’t want to ignore
  • column-3: It is the column that you want to ignore
  • df: It is the data frame from which duplicates need to be dropped.

In the below example, we have defined the data with three columns, first_name, last_name, and fees. Then, we removed the duplicates ignoring the first_name column, by stating the first_name column in the difference function.

Python3




# Import the pandas library
import pandas as pd
 
# Define the data
data = {'first_name': ['Arun', 'Ishita', 'Ruchir', 'Vinayak'], 'last_name': [
    'Kumar', 'Rai', 'Jha', 'Rai'], 'fees': [5000, 6000, 5000, 6000]}
 
# Convert data to Pandas dataframe
df = pd.DataFrame(data)
 
# Print the actual data frame
print('Actual DataFrame:\n', df)
 
# Stating the column that you want to ignore
dropped_df = df.drop_duplicates(subset=df.columns.difference(['first_name']))
 
# Print the dataframe without duplicates
print('DataFrame after removing duplicates:\n', dropped_df)


Output:

Actual DataFrame:
first_name last_name fees
0 Arun Kumar 5000
1 Ishita Rai 6000
2 Ruchir Jha 5000
3 Vinayak Rai 6000
DataFrame after removing duplicates:
first_name last_name fees
0 Arun Kumar 5000
1 Ishita Rai 6000
2 Ruchir Jha 5000
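
One detail worth knowing is that columns.difference() returns the remaining labels in sorted order by default rather than in their original order. This does not affect which rows are dropped, as the sketch below illustrates. The names ignored, sorted_cols and ordered_cols are hypothetical, and the list comprehension is just one possible alternative for keeping the original column order.

```python
import pandas as pd

df = pd.DataFrame({
    'first_name': ['Arun', 'Ishita', 'Ruchir', 'Vinayak'],
    'last_name': ['Kumar', 'Rai', 'Jha', 'Rai'],
    'fees': [5000, 6000, 5000, 6000],
})

ignored = 'first_name'

# difference() removes the ignored label and sorts what is left
sorted_cols = df.columns.difference([ignored])
print(list(sorted_cols))  # ['fees', 'last_name']

# A list comprehension keeps the columns in their original order
ordered_cols = [c for c in df.columns if c != ignored]
print(ordered_cols)  # ['last_name', 'fees']

# The order of the subset does not change which rows count as duplicates
by_sorted = df.drop_duplicates(subset=sorted_cols)
by_ordered = df.drop_duplicates(subset=ordered_cols)
print(by_sorted.equals(by_ordered))  # True
```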

Using groupby() and first() Function

The groupby() function groups rows that share the same values in the given columns, and first() returns the first non-null value of each remaining column within each group. Grouping by every column except the ignored one therefore leaves one row per group. Unlike drop_duplicates, the result uses the grouping columns as its index and is sorted by them.

Syntax:

dropped_df = df.groupby(['column-1', 'column-2']).first()

Here,

  • column-1, column-2: These are the columns that you don’t want to ignore.
  • df: It is the data frame from which duplicates need to be dropped.

In the below example, we have defined the data with three columns, first_name, last_name, and fees. Then, we removed the duplicates ignoring the first_name column, using group by and first.

Python3




# Import the pandas library
import pandas as pd
 
# Define the data
data = {'first_name': ['Arun', 'Ishita', 'Ruchir', 'Vinayak'], 'last_name': [
    'Kumar', 'Rai', 'Jha', 'Rai'], 'fees': [5000, 6000, 5000, 6000]}
 
# Convert data to Pandas dataframe
df = pd.DataFrame(data)
 
# Print the actual data frame
print('Actual DataFrame:\n', df)
 
# Dropping the duplicates using groupby and first
dropped_df = df.groupby(['last_name', 'fees']).first()
 
# Print the dataframe without duplicates
print('DataFrame after removing duplicates:\n', dropped_df)


Output:

Actual DataFrame:
first_name last_name fees
0 Arun Kumar 5000
1 Ishita Rai 6000
2 Ruchir Jha 5000
3 Vinayak Rai 6000
DataFrame after removing duplicates:
first_name
last_name fees
Jha 5000 Ruchir
Kumar 5000 Arun
Rai 6000 Ishita
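
By default, groupby().first() moves the grouping columns into the index and sorts the groups, so its result has a different shape from the one drop_duplicates returns. As a minimal sketch, passing as_index=False and sort=False and then restoring the column order gives a comparable layout. The names keys, grouped and expected are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    'first_name': ['Arun', 'Ishita', 'Ruchir', 'Vinayak'],
    'last_name': ['Kumar', 'Rai', 'Jha', 'Rai'],
    'fees': [5000, 6000, 5000, 6000],
})

keys = ['last_name', 'fees']

# as_index=False keeps the keys as ordinary columns;
# sort=False keeps groups in order of first appearance
grouped = df.groupby(keys, as_index=False, sort=False).first()

# The keys come first in the result, so restore the original column order
grouped = grouped[df.columns]
print(grouped)

# For comparison: drop_duplicates with a fresh 0..n-1 index
expected = df.drop_duplicates(subset=keys).reset_index(drop=True)
print(grouped.values.tolist() == expected.values.tolist())  # True
```

The two approaches agree here because the data has no missing values. Since first() picks the first non-null value per column, a group containing NaN can yield a row that mixes values from different original rows, and rows whose grouping keys are NaN are dropped by default, so drop_duplicates is usually the safer choice for plain de-duplication.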

