How to Find & Drop duplicate columns in a Pandas DataFrame?

Let’s discuss how to find and drop duplicate columns in a Pandas DataFrame. First, let’s create a simple DataFrame with the columns ‘Name’, ‘Age’, ‘Domicile’, and ‘Marks’. In this sample data, ‘Marks’ holds the same values as ‘Age’, so it is a duplicate column.

# Import pandas library 
import pandas as pd
  
# List of Tuples
students = [
            ('Ankit', 34, 'Uttar pradesh', 34),
            ('Riti', 30, 'Delhi', 30),
            ('Aadi', 16, 'Delhi', 16),
            ('Riti', 30, 'Delhi', 30),
            ('Riti', 30, 'Delhi', 30),
            ('Riti', 30, 'Mumbai', 30),
            ('Ankita', 40, 'Bihar', 40),
            ('Sachin', 30, 'Delhi', 30)
         ]
  
# Create a DataFrame object
df = pd.DataFrame(students, columns =['Name', 'Age', 'Domicile', 'Marks'])
  
# Display the original DataFrame
df

Output:
Dataframe_1
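
Before writing a general function, it can help to confirm by hand that two columns really hold the same contents. The following is a minimal sketch that rebuilds the sample data and compares individual columns with Series.equals(), the same method the function in Code 1 relies on. The variable names age_equals_marks and name_equals_age are introduced here only for illustration.

```python
import pandas as pd

# Same sample data as in the article
students = [
    ('Ankit', 34, 'Uttar pradesh', 34),
    ('Riti', 30, 'Delhi', 30),
    ('Aadi', 16, 'Delhi', 16),
    ('Riti', 30, 'Delhi', 30),
    ('Riti', 30, 'Delhi', 30),
    ('Riti', 30, 'Mumbai', 30),
    ('Ankita', 40, 'Bihar', 40),
    ('Sachin', 30, 'Delhi', 30),
]
df = pd.DataFrame(students, columns=['Name', 'Age', 'Domicile', 'Marks'])

# Series.equals() compares the contents of two columns;
# the column names are not part of the comparison.
age_equals_marks = df['Age'].equals(df['Marks'])
name_equals_age = df['Name'].equals(df['Age'])

print(age_equals_marks)  # True: 'Marks' repeats 'Age'
print(name_equals_age)   # False: the contents differ
```

Because only the contents are compared, two columns with different names can still count as duplicates, which is the case for ‘Age’ and ‘Marks’ here.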

Code 1: Find duplicate columns in a DataFrame.
To find duplicate columns, we iterate over all columns of the DataFrame and, for each column, check whether any later column has the same contents. If it does, the name of that later column is added to a set of duplicate column names. At the end, the function returns this set as a list of column names.

# import pandas library 
import pandas as pd
  
# This function takes a DataFrame
# as a parameter and returns a list
# of column names whose contents
# duplicate those of an earlier column.
def getDuplicateColumns(df):
  
    # Create an empty set
    duplicateColumnNames = set()
      
    # Iterate through all the columns 
    # of dataframe
    for x in range(df.shape[1]):
          
        # Take column at xth index.
        col = df.iloc[:, x]
          
        # Iterate through all the columns in
        # DataFrame from (x + 1)th index to
        # last index
        for y in range(x + 1, df.shape[1]):
              
            # Take column at yth index.
            otherCol = df.iloc[:, y]
              
            # Check if two columns at x & y
            # index are equal or not,
            # if equal then adding 
            # to the set
            if col.equals(otherCol):
                duplicateColumnNames.add(df.columns.values[y])
                  
    # Return list of unique column names 
    # whose contents are duplicates.
    return list(duplicateColumnNames)
  
# Driver code
if __name__ == "__main__" :
  
    # List of Tuples
    students = [
            ('Ankit', 34, 'Uttar pradesh', 34),
            ('Riti', 30, 'Delhi', 30),
            ('Aadi', 16, 'Delhi', 16),
            ('Riti', 30, 'Delhi', 30),
            ('Riti', 30, 'Delhi', 30),
            ('Riti', 30, 'Mumbai', 30),
            ('Ankita', 40, 'Bihar', 40),
            ('Sachin', 30, 'Delhi', 30)
          ]
  
    # Create a DataFrame object
    df = pd.DataFrame(students, 
                         columns =['Name', 'Age', 'Domicile', 'Marks'])
  
  
    # Get list of duplicate columns
    duplicateColNames = getDuplicateColumns(df)
  
    print('Duplicate Columns are :')
        
    # Iterate through duplicate
    # column names
    for column in duplicateColNames:
        print('Column Name : ', column)

Output:
Duplicate Columns are :
Column Name :  Marks
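
As an alternative to the nested loop, the same check can be sketched with built-in pandas methods. This is not the approach used in Code 1; it assumes that transposing the DataFrame is acceptable, which may be costly for large data. The names is_duplicate and duplicate_columns are hypothetical.

```python
import pandas as pd

students = [
    ('Ankit', 34, 'Uttar pradesh', 34),
    ('Riti', 30, 'Delhi', 30),
    ('Aadi', 16, 'Delhi', 16),
    ('Riti', 30, 'Delhi', 30),
    ('Riti', 30, 'Delhi', 30),
    ('Riti', 30, 'Mumbai', 30),
    ('Ankita', 40, 'Bihar', 40),
    ('Sachin', 30, 'Delhi', 30),
]
df = pd.DataFrame(students, columns=['Name', 'Age', 'Domicile', 'Marks'])

# After transposing, each original column becomes a row, so
# DataFrame.duplicated() flags every column whose contents repeat
# an earlier column (keep='first' is the default).
is_duplicate = df.T.duplicated()

# Keep only the labels that were flagged as duplicates
duplicate_columns = is_duplicate[is_duplicate].index.tolist()

print(duplicate_columns)  # ['Marks']
```

As in getDuplicateColumns(), the first occurrence (‘Age’) is treated as the original and only the later column (‘Marks’) is reported.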

Code 2: Drop duplicate columns in a DataFrame.
To remove the duplicate columns, we can pass the list of duplicate column names returned by our user-defined function getDuplicateColumns() to the DataFrame.drop() method.

# import pandas library 
import pandas as pd
  
  
# This function takes a DataFrame
# as a parameter and returns a list
# of column names whose contents
# duplicate those of an earlier column.
def getDuplicateColumns(df):
  
    # Create an empty set
    duplicateColumnNames = set()
      
    # Iterate through all the columns 
    # of dataframe
    for x in range(df.shape[1]):
          
        # Take column at xth index.
        col = df.iloc[:, x]
          
        # Iterate through all the columns in
        # DataFrame from (x + 1)th index to
        # last index
        for y in range(x + 1, df.shape[1]):
              
            # Take column at yth index.
            otherCol = df.iloc[:, y]
              
            # Check if two columns at x & y
            # index are equal or not,
            # if equal then adding 
            # to the set
            if col.equals(otherCol):
                duplicateColumnNames.add(df.columns.values[y])
                  
    # Return list of unique column names 
    # whose contents are duplicates.
    return list(duplicateColumnNames)
  
# Driver code
if __name__ == "__main__" :
  
    # List of Tuples
    students = [
            ('Ankit', 34, 'Uttar pradesh', 34),
            ('Riti', 30, 'Delhi', 30),
            ('Aadi', 16, 'Delhi', 16),
            ('Riti', 30, 'Delhi', 30),
            ('Riti', 30, 'Delhi', 30),
            ('Riti', 30, 'Mumbai', 30),
            ('Ankita', 40, 'Bihar', 40),
            ('Sachin', 30, 'Delhi', 30)
          ]
  
    # Create a DataFrame object
    df = pd.DataFrame(students, 
                        columns =['Name', 'Age', 'Domicile', 'Marks'])
  
    # Dropping duplicate columns
    rslt_df = df.drop(columns = getDuplicateColumns(df))
  
    print("Resultant Dataframe :")
  
    # Show the DataFrame; a bare expression is not
    # displayed inside this block, so print it explicitly
    print(rslt_df)

Output:

Dataframe
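
For comparison, dropping the duplicate columns can also be sketched without a helper function by building a boolean mask over the columns. This is an alternative to Code 2, not the method shown there, and it again assumes that transposing the DataFrame is affordable. The name result is hypothetical.

```python
import pandas as pd

students = [
    ('Ankit', 34, 'Uttar pradesh', 34),
    ('Riti', 30, 'Delhi', 30),
    ('Aadi', 16, 'Delhi', 16),
    ('Riti', 30, 'Delhi', 30),
    ('Riti', 30, 'Delhi', 30),
    ('Riti', 30, 'Mumbai', 30),
    ('Ankita', 40, 'Bihar', 40),
    ('Sachin', 30, 'Delhi', 30),
]
df = pd.DataFrame(students, columns=['Name', 'Age', 'Domicile', 'Marks'])

# df.T.duplicated() is True for columns that repeat an earlier column;
# negating it gives a mask of the columns to keep.
result = df.loc[:, ~df.T.duplicated()]

print(result.columns.tolist())  # ['Name', 'Age', 'Domicile']
```

The mask is only used to select columns from the original DataFrame, so all rows are kept and only ‘Marks’ is removed.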



