The pandas drop_duplicates() method removes duplicate rows from a pandas DataFrame in Python.
Syntax of df.drop_duplicates()
Syntax: DataFrame.drop_duplicates(subset=None, keep='first', inplace=False)
Parameters:
- subset: A column label or a list of column labels. The default is None, which means all columns are compared. When columns are passed, only those columns are considered when identifying duplicates.
- keep: Controls which of the duplicate rows, if any, are kept. It accepts three values, and the default is 'first'.
- If 'first', the first occurrence is kept and all later occurrences are dropped as duplicates.
- If 'last', the last occurrence is kept and all earlier occurrences are dropped as duplicates.
- If False, every occurrence of a duplicated row is dropped.
- inplace: Boolean, default False. If True, the DataFrame is modified in place and the method returns None instead of a new DataFrame.
Return type: DataFrame with duplicate rows removed according to the arguments passed, or None if inplace=True.
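As a rough illustration of the parameters described above, the sketch below applies each keep option to a small made-up DataFrame and then checks what an inplace=True call returns. The data and the variable names (first, last, unique_only, copy, result) are assumptions chosen for this illustration only.

```python
import pandas as pd

# Hypothetical sample data: rows 0 and 4 are identical, as are rows 1 and 2.
df = pd.DataFrame({
    "A": ["TeamA", "TeamB", "TeamB", "TeamC", "TeamA"],
    "B": [50, 40, 40, 30, 50],
})

first = df.drop_duplicates(keep="first")       # keeps the first row of each group
last = df.drop_duplicates(keep="last")         # keeps the last row of each group
unique_only = df.drop_duplicates(keep=False)   # drops every duplicated row

print(list(first.index))        # [0, 1, 3]
print(list(last.index))         # [2, 3, 4]
print(list(unique_only.index))  # [3]

# With inplace=True the call returns None and modifies the frame itself.
copy = df.copy()
result = copy.drop_duplicates(inplace=True)
print(result, len(copy))        # None 3
```

Note that the original index labels are preserved, which is why the surviving rows are not renumbered.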
Example:
In the output, one of the two TeamA rows and one of the two TeamB rows has been dropped, because each duplicates an earlier row.
Python3
import pandas as pd
# Rows 0 and 4 are identical, as are rows 1 and 2
data = {
    "A": ["TeamA", "TeamB", "TeamB", "TeamC", "TeamA"],
    "B": [50, 40, 40, 30, 50],
    "C": [True, False, False, False, True],
}
df = pd.DataFrame(data)
# The default keep='first' retains the first row of each duplicate group
print(df.drop_duplicates())
Output:
       A   B      C
0  TeamA  50   True
1  TeamB  40  False
3  TeamC  30  False
The examples below read their data from a CSV file named employees.csv.
Example 1: Removing rows with the same First Name
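In the following example, every row whose First Name occurs more than once is removed. Because keep=False, none of the duplicated rows is retained, and because inplace=True, the data frame is modified in place rather than a new one being returned.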
Python3
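import pandas as pd
# Load the employee data
data = pd.read_csv("employees.csv")
# Sort by first name so that rows with equal names are adjacent
data.sort_values("First Name", inplace=True)
# keep=False drops every row whose first name occurs more than once
data.drop_duplicates(subset="First Name", keep=False, inplace=True)
print(data)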
Output:
All rows whose first name occurred more than once were removed from the data frame.
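Since employees.csv may not be at hand, the same idea can be tried on a small in-memory frame. This is only a sketch: the employees frame below and its values are made up, and only the "First Name" column name is taken from the example above.

```python
import pandas as pd

# Hypothetical stand-in for employees.csv; names and teams are made up.
employees = pd.DataFrame({
    "First Name": ["Ann", "Bob", "Ann", "Cara"],
    "Team": ["Sales", "Legal", "Finance", "Sales"],
})

# Only "First Name" is compared; keep=False drops every row whose name repeats.
unique_names = employees.drop_duplicates(subset="First Name", keep=False)
print(list(unique_names["First Name"]))  # ['Bob', 'Cara']

# For comparison, keep="first" retains one row per name.
one_per_name = employees.drop_duplicates(subset="First Name", keep="first")
print(list(one_per_name.index))          # [0, 1, 3]
```

The two "Ann" rows differ in the Team column, yet both are treated as duplicates because subset restricts the comparison to "First Name".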
Example 2: Removing rows with all duplicate values
In this example, rows that duplicate another row in every column are removed. Since the CSV file contains no such rows, an existing row (index 440) is first copied and inserted into the data frame under a new index label.
Python3
import pandas as pd
# Reload the data so that the original row 440 is present
data = pd.read_csv("employees.csv")
length1 = len(data)
# Insert a copy of row 440 under the new index label 1001
data.loc[1001] = [data["First Name"][440],
                  data["Gender"][440],
                  data["Start Date"][440],
                  data["Last Login Time"][440],
                  data["Salary"][440],
                  data["Bonus %"][440],
                  data["Senior Management"][440],
                  data["Team"][440]]
length2 = len(data)
data.sort_values("First Name", inplace=True)
# keep=False removes both the original row and its copy
data.drop_duplicates(keep=False, inplace=True)
length3 = len(data)
print(length1, length2, length3)
Output:
The length after removing duplicates is 999. Since the keep parameter was set to False, both the original row and its inserted copy were removed, so the final length is two less than the length after the insertion.
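The same length bookkeeping can be reproduced without the CSV file. The sketch below uses a made-up three-row frame (the frame, before, after_insert and deduped names are assumptions for illustration) and copies one row to create a single fully duplicated pair.

```python
import pandas as pd

# Hypothetical data with no fully duplicated rows to start with.
frame = pd.DataFrame({
    "First Name": ["Ann", "Bob", "Cara"],
    "Salary": [100, 200, 300],
})
before = len(frame)

# Copy row 1 under the new index label 3, creating one fully duplicated pair.
frame.loc[3] = frame.loc[1]
after_insert = len(frame)

# keep=False removes both members of the pair, not just the copy.
deduped = frame.drop_duplicates(keep=False)
print(before, after_insert, len(deduped))  # 3 4 2
```

With keep='first' or keep='last' instead, only one of the two "Bob" rows would be dropped and the final length would be 3.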