Data Duplication Removal from Dataset Using Python

Last Updated : 27 Mar, 2024

Duplicate data is a common issue in datasets that can lead to inaccuracies and bias in analysis. Removing duplicates is an essential step in data cleaning and preprocessing, ensuring that the data is accurate and reliable for further analysis or modeling. In this article, we’ll explore how to identify and remove duplicates from a dataset using Python.

The steps below show how to remove duplicate data from a dataset using Python:

Step 1: Generating a Sample Dataset with Duplicates

First, let’s generate a small sample dataset that contains duplicates. We’ll use the pandas library to create a DataFrame in which two of the records are repeated.

Python3

import pandas as pd

# Create a DataFrame with duplicate records
data = {
    'ID': [1, 2, 3, 4, 5, 1, 6, 2, 7],
    'Name': ['John', 'Alice', 'Bob', 'Charlie', 'Emma', 'John', 'Eva', 'Alice', 'David'],
    'Age': [25, 30, 35, 40, 45, 25, 50, 30, 55]
}

df = pd.DataFrame(data)
print(df)

Output:

   ID     Name  Age
0   1     John   25
1   2    Alice   30
2   3      Bob   35
3   4  Charlie   40
4   5     Emma   45
5   1     John   25
6   6      Eva   50
7   2    Alice   30
8   7    David   55

Step 2: Identifying Duplicates

With the dataset loaded into a DataFrame, we can identify duplicate records using the pandas duplicated() method. It returns a Boolean Series that is True for each row that exactly matches an earlier row. By default (keep='first'), the first occurrence of a repeated row is not marked, only the later copies are.

Python3

# Identify duplicate records
duplicate_mask = df.duplicated()
duplicates = df[duplicate_mask]
print(duplicates)

Output:

   ID   Name  Age
5   1   John   25
7   2  Alice   30
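Because the default only flags later copies, the first occurrence of each repeated row does not appear in the output above. The following is a minimal sketch, assuming the same sample data as Step 1, of how to count the duplicates and how keep=False can be used to list every copy. The names n_duplicates and all_copies are illustrative only.

```python
import pandas as pd

# Same sample data as in Step 1
df = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5, 1, 6, 2, 7],
    'Name': ['John', 'Alice', 'Bob', 'Charlie', 'Emma', 'John', 'Eva', 'Alice', 'David'],
    'Age': [25, 30, 35, 40, 45, 25, 50, 30, 55],
})

# Count rows that repeat an earlier row (default keep='first')
n_duplicates = int(df.duplicated().sum())

# keep=False flags every member of a duplicate group, including the first one
all_copies = df[df.duplicated(keep=False)].sort_values('ID')

print(n_duplicates)
print(all_copies)
```

Listing every copy side by side can make it easier to check that the rows really are identical before anything is dropped.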

Step 3: Removing Duplicates

Once duplicates are identified, we can remove them using the pandas drop_duplicates() method. By default, it keeps the first occurrence of each duplicated record and removes the later ones. It returns a new DataFrame and leaves the original df unchanged, so the result must be assigned to a variable.

Python3

# Remove duplicates, keeping the first occurrence of each row
deduplicated_df = df.drop_duplicates()
print(deduplicated_df)

Output:

   ID     Name  Age
0   1     John   25
1   2    Alice   30
2   3      Bob   35
3   4  Charlie   40
4   5     Emma   45
6   6      Eva   50
8   7    David   55
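Full-row comparison only catches exact copies. When records should be treated as duplicates based on a key column, drop_duplicates() also accepts the subset, keep, and ignore_index parameters. The sketch below uses a small hypothetical DataFrame (not the article's data) in which the same ID appears with two different ages. Treating ID as the key is an assumption made for illustration.

```python
import pandas as pd

# Hypothetical data: ID 1 appears twice with different ages
df = pd.DataFrame({
    'ID': [1, 2, 1, 3],
    'Name': ['John', 'Alice', 'John', 'Bob'],
    'Age': [25, 30, 26, 35],
})

# Full-row comparison finds nothing to drop, because the Age values differ
full_row = df.drop_duplicates()

# Compare on ID only, keep the last occurrence, and renumber the index
by_id = df.drop_duplicates(subset=['ID'], keep='last', ignore_index=True)

print(len(full_row))
print(by_id)
```

Whether to keep the first or the last occurrence depends on the data. keep='last' is a reasonable choice when later rows are more recent, but that is a judgment about the dataset rather than a rule.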

Step 4: Saving the Cleaned Dataset

Finally, we can save the deduplicated DataFrame to a new file using the pandas to_csv() method. Passing index=False leaves the row index out of the file.

Python3

# Save the cleaned dataset to a new CSV file
deduplicated_df.to_csv('cleaned_dataset.csv', index=False)

Output:

cleaned_dataset.csv

ID,Name,Age
1,John,25
2,Alice,30
3,Bob,35
4,Charlie,40
5,Emma,45
6,Eva,50
7,David,55
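To confirm that the saved file is clean, it can be read back and checked again. The following is a minimal sketch under stated assumptions: it uses a shortened version of the sample data and writes to a temporary directory (rather than the working directory) so that it leaves no files behind. The names reloaded and has_duplicates are illustrative.

```python
import os
import tempfile

import pandas as pd

# Shortened sample data with one repeated row
df = pd.DataFrame({
    'ID': [1, 2, 3, 1],
    'Name': ['John', 'Alice', 'Bob', 'John'],
    'Age': [25, 30, 35, 25],
})
deduplicated_df = df.drop_duplicates()

# Write to a temporary directory so the example leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'cleaned_dataset.csv')
    deduplicated_df.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# The reloaded frame gets a fresh default index and should have no repeated rows
has_duplicates = bool(reloaded.duplicated().any())
print(reloaded.shape, has_duplicates)
```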
