How to Find & Drop duplicate columns in a Pandas DataFrame?
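Let’s look at how to find and drop duplicate columns in a Pandas DataFrame. First, we create a simple DataFrame with the columns ‘Name’, ‘Age’, ‘Domicile’, and ‘Marks’, where ‘Marks’ holds the same values as ‘Age’. The examples that deal with duplicate column names use ‘Age’ as the fourth label instead of ‘Marks’.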
Find duplicate columns from a DataFrame
To find duplicate columns, we iterate over every column of the DataFrame and compare it with each column to its right. If a later column has the same contents, its name is added to a set of duplicate column names. At the end, the function returns that set as a list.
Python3
import pandas as pd


def getDuplicateColumns(df):
    # Set of names of columns whose contents duplicate an earlier column
    duplicateColumnNames = set()

    # Iterate through all the columns of the DataFrame
    for x in range(df.shape[1]):
        # Take the column at index x
        col = df.iloc[:, x]

        # Compare it with every column from index x + 1 to the last index
        for y in range(x + 1, df.shape[1]):
            # Take the column at index y
            otherCol = df.iloc[:, y]

            # If the columns at x and y are equal,
            # add the name of the later column to the set
            if col.equals(otherCol):
                duplicateColumnNames.add(df.columns.values[y])

    # Return the list of unique column names whose contents are duplicates
    return list(duplicateColumnNames)


# Driver code
if __name__ == "__main__":
    # List of tuples
    students = [
        ('Ankit', 34, 'Uttar pradesh', 34),
        ('Riti', 30, 'Delhi', 30),
        ('Aadi', 16, 'Delhi', 16),
        ('Riti', 30, 'Delhi', 30),
        ('Riti', 30, 'Delhi', 30),
        ('Riti', 30, 'Mumbai', 30),
        ('Ankita', 40, 'Bihar', 40),
        ('Sachin', 30, 'Delhi', 30),
    ]

    # Create a DataFrame object
    df = pd.DataFrame(students,
                      columns=['Name', 'Age', 'Domicile', 'Marks'])

    # Get the list of duplicate columns
    duplicateColNames = getDuplicateColumns(df)
    for column in duplicateColNames:
        print('Column Name:', column)
Output:
Column Name: Marks
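The nested loop above can also be written without explicit iteration. The following is a minimal sketch, not part of the original function: it transposes the DataFrame so that each column becomes a row and lets `DataFrame.duplicated()` flag rows that repeat an earlier one. The names `is_duplicate` and `duplicate_names` are illustrative, and the sample data is a shortened version of the `students` list.

```python
import pandas as pd

# Shortened version of the article's sample data (assumption: three rows are enough here)
students = [
    ('Ankit', 34, 'Uttar pradesh', 34),
    ('Riti', 30, 'Delhi', 30),
    ('Aadi', 16, 'Delhi', 16),
]
df = pd.DataFrame(students, columns=['Name', 'Age', 'Domicile', 'Marks'])

# After transposing, every original column is a row, so duplicated()
# marks each column whose contents repeat an earlier column.
is_duplicate = df.T.duplicated()

# Keep only the labels where the flag is True
duplicate_names = is_duplicate[is_duplicate].index.tolist()
print(duplicate_names)
```

Transposing copies the data and converts mixed column types to `object`, so for very wide or very large DataFrames the explicit loop may still be the more predictable choice.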
Remove duplicate columns from a DataFrame
Method 1: Drop duplicate columns from a DataFrame using drop_duplicates()
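The Pandas drop_duplicates() method removes duplicate rows from a DataFrame. To apply it to columns, transpose the DataFrame with .T so that the columns become rows, drop the duplicates, and transpose back. Note that transposing a DataFrame with mixed column types converts the values to the object dtype.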
Python3
# Drop duplicate columns
df2 = df.T.drop_duplicates().T
print(df2)
Output:
     Name  Age       Domicile
0   Ankit   34  Uttar pradesh
1    Riti   30          Delhi
2    Aadi   16          Delhi
3    Riti   30          Delhi
4    Riti   30          Delhi
5    Riti   30         Mumbai
6  Ankita   40          Bihar
7  Sachin   30          Delhi
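One side effect of the double transpose is worth checking in your own data: numeric columns can come back with the `object` dtype. The sketch below illustrates this on a small hypothetical DataFrame and shows one possible alternative, a boolean mask built from `df.T.duplicated()`, which selects columns from the original DataFrame and so leaves their dtypes alone. The variable names `via_transpose` and `via_mask` are made up for this example.

```python
import pandas as pd

# Small hypothetical frame: 'Marks' repeats the contents of 'Age'
df = pd.DataFrame({
    'Name': ['Ankit', 'Riti'],
    'Age': [34, 30],
    'Marks': [34, 30],
})

# Double transpose: duplicates are dropped, but values pass through object dtype
via_transpose = df.T.drop_duplicates().T
print(via_transpose.dtypes)

# Boolean mask: columns are picked from the original frame, dtypes are preserved
keep = ~df.T.duplicated().to_numpy()
via_mask = df.loc[:, keep]
print(via_mask.dtypes)
```

If the transposed result is what you already have, calling `infer_objects()` on it is another way to try to recover the narrower dtypes.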
Method 2: Remove duplicate columns from a DataFrame using df.loc[]
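Pandas df.loc[] selects rows and columns by label(s) or by a boolean array. df.columns.duplicated() marks every column name that repeats an earlier one, so negating it with ~ keeps only the first column for each name. This method compares column names, not contents, so the example assumes df was created with columns=['Name', 'Age', 'Domicile', 'Age'] rather than with 'Marks' as the last label.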
Python3
# Keep only the first occurrence of each column name
df2 = df.loc[:, ~df.columns.duplicated()]
print(df2)
Output:
     Name  Age       Domicile
0   Ankit   34  Uttar pradesh
1    Riti   30          Delhi
2    Aadi   16          Delhi
3    Riti   30          Delhi
4    Riti   30          Delhi
5    Riti   30         Mumbai
6  Ankita   40          Bihar
7  Sachin   30          Delhi
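Because this method looks only at column labels, it helps to see the mask it builds. The sketch below constructs a DataFrame with two columns named 'Age' (the name `df_dup` and the shortened data are assumptions for illustration) and contrasts it with the 'Marks' labelling, where no name repeats and nothing would be removed.

```python
import pandas as pd

# Shortened sample data; the fourth column reuses the label 'Age'
students = [
    ('Ankit', 34, 'Uttar pradesh', 34),
    ('Riti', 30, 'Delhi', 30),
    ('Aadi', 16, 'Delhi', 16),
]
df_dup = pd.DataFrame(students, columns=['Name', 'Age', 'Domicile', 'Age'])

# True marks a label that repeats an earlier label
mask = df_dup.columns.duplicated()
print(mask)

# Negate the mask to keep the first column for each label
df_unique = df_dup.loc[:, ~mask]
print(list(df_unique.columns))

# With unique labels the mask is all False, even though 'Marks' duplicates 'Age' in content
unique_labels = pd.Index(['Name', 'Age', 'Domicile', 'Marks'])
print(unique_labels.duplicated().any())
```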
Method 3: Remove duplicate columns from a DataFrame using df.columns.duplicated()
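The df.columns.duplicated() method returns a boolean array that is True for every column name that repeats an earlier name. Indexing df.columns with this array gives the duplicated names, and passing them to df.drop() removes every column that carries one of those names, including the first occurrence. That is why both ‘Age’ columns are missing from the output below. As in Method 2, the example assumes df has two columns named ‘Age’.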
Python3
# Use DataFrame.columns.duplicated() to drop duplicate columns
duplicate_cols = df.columns[df.columns.duplicated()]
df.drop(columns=duplicate_cols, inplace=True)
print(df)
Output:
     Name       Domicile
0   Ankit  Uttar pradesh
1    Riti          Delhi
2    Aadi          Delhi
3    Riti          Delhi
4    Riti          Delhi
5    Riti         Mumbai
6  Ankita          Bihar
7  Sachin          Delhi
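Which occurrence counts as the duplicate is controlled by the `keep` parameter of `Index.duplicated()`. The sketch below, using a hypothetical one-row DataFrame named `df_dup`, prints the mask for each setting and confirms that dropping by label removes all columns sharing that label.

```python
import pandas as pd

cols = pd.Index(['Name', 'Age', 'Domicile', 'Age'])

# keep='first' (default): later repeats are marked
first = cols.duplicated()
# keep='last': earlier repeats are marked
last = cols.duplicated(keep='last')
# keep=False: every occurrence of a repeated label is marked
every = cols.duplicated(keep=False)
print(first, last, every)

# Dropping by label removes every column with that label
df_dup = pd.DataFrame([('Ankit', 34, 'Uttar pradesh', 34)], columns=cols)
dropped = df_dup.drop(columns=cols[first])
print(list(dropped.columns))
```

If the goal is to keep one copy of each repeated column, the boolean selection from Method 2 is the safer choice; dropping by label fits the case where every copy should go.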
Method 4: Drop duplicate columns in a DataFrame using df.drop
To remove the duplicate columns, we can pass the list of duplicate column names returned by our user-defined function getDuplicateColumns() to the DataFrame.drop() method.
Python3
# import pandas library
import pandas as pd


def getDuplicateColumns(df):
    # Set of names of columns whose contents duplicate an earlier column
    duplicateColumnNames = set()

    # Iterate through all the columns of the DataFrame
    for x in range(df.shape[1]):
        # Take the column at index x
        col = df.iloc[:, x]

        # Compare it with every column from index x + 1 to the last index
        for y in range(x + 1, df.shape[1]):
            # Take the column at index y
            otherCol = df.iloc[:, y]

            # If the columns at x and y are equal,
            # add the name of the later column to the set
            if col.equals(otherCol):
                duplicateColumnNames.add(df.columns.values[y])

    # Return the list of unique column names whose contents are duplicates
    return list(duplicateColumnNames)


# Driver code
if __name__ == "__main__":
    # List of tuples
    students = [
        ('Ankit', 34, 'Uttar pradesh', 34),
        ('Riti', 30, 'Delhi', 30),
        ('Aadi', 16, 'Delhi', 16),
        ('Riti', 30, 'Delhi', 30),
        ('Riti', 30, 'Delhi', 30),
        ('Riti', 30, 'Mumbai', 30),
        ('Ankita', 40, 'Bihar', 40),
        ('Sachin', 30, 'Delhi', 30),
    ]

    # Create a DataFrame object
    df = pd.DataFrame(students,
                      columns=['Name', 'Age', 'Domicile', 'Marks'])

    # Drop the duplicate columns
    rslt_df = df.drop(columns=getDuplicateColumns(df))

    # Show the resulting DataFrame
    print("Resultant Dataframe :")
    print(rslt_df)
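Output:
Resultant Dataframe :
     Name  Age       Domicile
0   Ankit   34  Uttar pradesh
1    Riti   30          Delhi
2    Aadi   16          Delhi
3    Riti   30          Delhi
4    Riti   30          Delhi
5    Riti   30         Mumbai
6  Ankita   40          Bihar
7  Sachin   30          Delhi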