Python | Pandas dataframe.drop_duplicates()

Python is a great language for doing data analysis, primarily because of the fantastic ecosystem of data-centric Python packages. Pandas is one of those packages, and it makes importing and analyzing data much easier.

An important part of data analysis is finding and removing duplicate values. The Pandas drop_duplicates() method removes duplicate rows from a data frame.

Syntax: DataFrame.drop_duplicates(subset=None, keep='first', inplace=False)

Parameters:
subset: Column label or list of column labels to consider when identifying duplicates. Its default value is None, meaning all columns are used; after passing columns, only those columns are considered for duplicates.
keep: Controls how duplicate values are treated. It has only three distinct values, and the default is 'first'.

  • If 'first', the first occurrence is treated as unique and the rest of the identical rows as duplicates.
  • If 'last', the last occurrence is treated as unique and the rest of the identical rows as duplicates.
  • If False, all identical rows are treated as duplicates.

inplace: Boolean value; if True, duplicate rows are removed from the data frame in place and the method returns None.

Return type: DataFrame with duplicate rows removed, depending on the arguments passed.
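The effect of the three keep values can be seen on a small illustrative frame (the column names here are made up for the sketch, not taken from the article's CSV):

```python
import pandas as pd

# a small frame in which rows 0 and 2 are identical
df = pd.DataFrame({"name": ["Ann", "Bob", "Ann", "Cam"],
                   "score": [1, 2, 1, 3]})

print(len(df.drop_duplicates(keep="first")))  # 3 rows: first "Ann" row kept
print(len(df.drop_duplicates(keep="last")))   # 3 rows: last "Ann" row kept
print(len(df.drop_duplicates(keep=False)))    # 2 rows: both "Ann" rows dropped
```

With keep='first' or keep='last' one copy of each duplicated row survives; keep=False removes every copy.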


Example #1: Removing rows with the same First Name
In the following example, rows having the same First Name are removed from the data frame in place.


# importing pandas package
import pandas as pd
  
# making data frame from csv file
data = pd.read_csv("employees.csv")
  
# sorting by first name
data.sort_values("First Name", inplace = True)
  
# dropping ALL duplicate values
data.drop_duplicates(subset ="First Name",
                     keep = False, inplace = True)
  
# displaying data
data

Output:
Rows that shared a First Name with another row were removed from the data frame, leaving only first names that occur exactly once.
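Since the linked CSV may not be available, the same idea can be sketched on a small hand-made frame (the names and salaries below are hypothetical stand-ins for employees.csv):

```python
import pandas as pd

# stand-in for employees.csv: two rows share the First Name "Douglas"
data = pd.DataFrame({"First Name": ["Douglas", "Thomas", "Douglas", "Maria"],
                     "Salary": [97308, 61933, 132940, 130590]})

# sorting by first name
data.sort_values("First Name", inplace=True)

# keep=False drops every row whose First Name occurs more than once
data.drop_duplicates(subset="First Name", keep=False, inplace=True)

print(data)  # only the "Maria" and "Thomas" rows remain
```

Note that with subset="First Name" the two "Douglas" rows count as duplicates even though their salaries differ, because only the subset column is compared.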

Example #2: Removing rows in which all values are duplicated
In this example, rows whose every value matches another row are removed. Since the csv file doesn't contain such a row, a random row is duplicated and inserted into the data frame first.


# importing pandas package
import pandas as pd
  
# making data frame from csv file
data = pd.read_csv("employees.csv")
  
# length before adding row
length1 = len(data)
  
# manually inserting a duplicate of row 440
data.loc[1001] = [data["First Name"][440],
                  data["Gender"][440],
                  data["Start Date"][440],
                  data["Last Login Time"][440],
                  data["Salary"][440],
                  data["Bonus %"][440],
                  data["Senior Management"][440],
                  data["Team"][440]]
  
# length after adding row
length2 = len(data)
  
# sorting by first name
data.sort_values("First Name", inplace=True)
  
# dropping duplicate values
data.drop_duplicates(keep=False, inplace=True)
  
# length after removing duplicates
length3 = len(data)
  
# printing all data frame lengths
print(length1, length2, length3)

Output:
As the printed lengths show, the data frame went from 1000 rows to 1001 after inserting the duplicate, and down to 999 after removing duplicates: since the keep parameter was set to False, both copies of the duplicated row were removed.
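The same length bookkeeping can be reproduced on a tiny hand-made frame (the names and values are illustrative, not from employees.csv):

```python
import pandas as pd

# small stand-in frame with three unique rows
data = pd.DataFrame({"name": ["Ann", "Bob", "Cam"],
                     "salary": [10, 20, 30]})
length1 = len(data)                  # 3

# append a copy of row 0 under a new label, as in the example above
data.loc[len(data)] = data.loc[0]
length2 = len(data)                  # 4

# keep=False removes both copies of the duplicated row
data.drop_duplicates(keep=False, inplace=True)
length3 = len(data)                  # 2

print(length1, length2, length3)     # 3 4 2
```

Inserting one duplicate grows the frame by one row, but dropping with keep=False shrinks it by two, since both copies go.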
