Python | Pandas dataframe.drop_duplicates()

Python is a great language for doing data analysis, primarily because of the fantastic ecosystem of data-centric Python packages. Pandas is one of those packages, and it makes importing and analyzing data much easier.

An important part of data analysis is finding and removing duplicate values. The Pandas drop_duplicates() method removes duplicate rows from a data frame.

Syntax: DataFrame.drop_duplicates(subset=None, keep='first', inplace=False)

Parameters:
subset: Column label or list of column labels to consider when identifying duplicates. Its default value is None, meaning all columns are used; after passing columns, only those columns are considered for duplicates.
keep: Controls how duplicate values are treated. It has only three distinct values, and the default is 'first'.

  • If 'first', the first occurrence is treated as unique and the rest of the identical rows as duplicates.
  • If 'last', the last occurrence is treated as unique and the rest of the identical rows as duplicates.
  • If False, all identical rows are treated as duplicates.

inplace: Boolean value; if True, duplicate rows are removed from the data frame in place and the method returns None.

Return type: DataFrame with duplicate rows removed, depending on the arguments passed.
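The effect of the three keep values can be seen on a small illustrative frame (the column names here are made up for the sketch, not taken from the article's CSV):

```python
import pandas as pd

# a small frame in which rows 0 and 2 are identical
df = pd.DataFrame({"name": ["Ann", "Bob", "Ann", "Cam"],
                   "score": [1, 2, 1, 3]})

print(len(df.drop_duplicates(keep="first")))  # 3 rows: first "Ann" row kept
print(len(df.drop_duplicates(keep="last")))   # 3 rows: last "Ann" row kept
print(len(df.drop_duplicates(keep=False)))    # 2 rows: both "Ann" rows dropped
```

With keep='first' or keep='last' one copy of each duplicated row survives; keep=False removes every copy.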


Example #1: Removing rows with the same First Name
In the following example, rows having the same First Name are removed from the data frame in place.


# importing pandas package
import pandas as pd
  
# making data frame from csv file
data = pd.read_csv("employees.csv")
  
# sorting by first name
data.sort_values("First Name", inplace = True)
  
# dropping ALL duplicate values
data.drop_duplicates(subset ="First Name",
                     keep = False, inplace = True)
  
# displaying data
data

Output:
Rows that shared a First Name with another row were removed from the data frame, leaving only first names that occur exactly once.
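Since the linked CSV may not be available, the same idea can be sketched on a small hand-made frame (the names and salaries below are hypothetical stand-ins for employees.csv):

```python
import pandas as pd

# stand-in for employees.csv: two rows share the First Name "Douglas"
data = pd.DataFrame({"First Name": ["Douglas", "Thomas", "Douglas", "Maria"],
                     "Salary": [97308, 61933, 132940, 130590]})

# sorting by first name
data.sort_values("First Name", inplace=True)

# keep=False drops every row whose First Name occurs more than once
data.drop_duplicates(subset="First Name", keep=False, inplace=True)

print(data)  # only the "Maria" and "Thomas" rows remain
```

Note that with subset="First Name" the two "Douglas" rows count as duplicates even though their salaries differ, because only the subset column is compared.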

Example #2: Removing rows in which all values are duplicated
In this example, rows whose every value matches another row are removed. Since the csv file doesn't contain such a row, a random row is duplicated and inserted into the data frame first.


# importing pandas package
import pandas as pd
  
# making data frame from csv file
data = pd.read_csv("employees.csv")
  
# length before adding row
length1 = len(data)
  
# manually inserting a duplicate of row 440
data.loc[1001] = [data["First Name"][440],
                  data["Gender"][440],
                  data["Start Date"][440],
                  data["Last Login Time"][440],
                  data["Salary"][440],
                  data["Bonus %"][440],
                  data["Senior Management"][440],
                  data["Team"][440]]
  
# length after adding row
length2 = len(data)
  
# sorting by first name
data.sort_values("First Name", inplace=True)
  
# dropping duplicate values
data.drop_duplicates(keep=False, inplace=True)
  
# length after removing duplicates
length3 = len(data)
  
# printing all data frame lengths
print(length1, length2, length3)

Output:
As the printed lengths show, the data frame went from 1000 rows to 1001 after inserting the duplicate, and down to 999 after removing duplicates: since the keep parameter was set to False, both copies of the duplicated row were removed.
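The same length bookkeeping can be reproduced on a tiny hand-made frame (the names and values are illustrative, not from employees.csv):

```python
import pandas as pd

# small stand-in frame with three unique rows
data = pd.DataFrame({"name": ["Ann", "Bob", "Cam"],
                     "salary": [10, 20, 30]})
length1 = len(data)                  # 3

# append a copy of row 0 under a new label, as in the example above
data.loc[len(data)] = data.loc[0]
length2 = len(data)                  # 4

# keep=False removes both copies of the duplicated row
data.drop_duplicates(keep=False, inplace=True)
length3 = len(data)                  # 2

print(length1, length2, length3)     # 3 4 2
```

Inserting one duplicate grows the frame by one row, but dropping with keep=False shrinks it by two, since both copies go.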
