Data Duplication Removal from Dataset Using Python

Last Updated : 27 Mar, 2024

Duplicate data is a common issue in datasets that can lead to inaccuracies and bias in analysis. Removing duplicates is an essential step in data cleaning and preprocessing, ensuring that the data is accurate and reliable for further analysis or modeling. In this article, we’ll explore how to identify and remove duplicates from a dataset using Python.

The steps below walk through removing duplicate rows from a dataset using Python and pandas:

Step 1: Generating a Sample Dataset with Duplicates

Here, let’s generate a sample dataset with duplicates for demonstration purposes. We’ll use Python’s Pandas library to create a DataFrame with duplicate records.

Python3
import pandas as pd

# Create a DataFrame with duplicate records
data = {
    'ID': [1, 2, 3, 4, 5, 1, 6, 2, 7],
    'Name': ['John', 'Alice', 'Bob', 'Charlie', 'Emma', 'John', 'Eva', 'Alice', 'David'],
    'Age': [25, 30, 35, 40, 45, 25, 50, 30, 55]
}

df = pd.DataFrame(data)
print(df)

Output
   ID     Name  Age
0   1     John   25
1   2    Alice   30
2   3      Bob   35
3   4  Charlie   40
4   5     Emma   45
5   1     John   25
6   6      Eva   50
7   2    Alice   30
8   7    David   55

Step 2: Identifying Duplicates

With the dataset loaded into a DataFrame, the next step is to identify duplicate records. The pandas duplicated() method returns a boolean Series that is True for every row that repeats an earlier row. By default it compares all columns and does not flag the first occurrence.

Python3
# Identify duplicate records
duplicate_mask = df.duplicated()
duplicates = df[duplicate_mask]
print(duplicates)

Output:

   ID   Name  Age
5   1   John   25
7   2  Alice   30
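
The default call only flags the repeats, not the first copy of each record. As a minimal sketch on the same sample data, the keep and subset parameters of duplicated() can widen or narrow what counts as a duplicate. The variable names all_copies, n_repeats, and id_repeats are illustrative and not part of the original walkthrough.

```python
import pandas as pd

# Same sample data as in Step 1
df = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5, 1, 6, 2, 7],
    'Name': ['John', 'Alice', 'Bob', 'Charlie', 'Emma', 'John', 'Eva', 'Alice', 'David'],
    'Age': [25, 30, 35, 40, 45, 25, 50, 30, 55]
})

# keep=False flags every member of a duplicate group, including the first copy
all_copies = df[df.duplicated(keep=False)]
print(all_copies.sort_values('ID'))

# Summing the boolean mask counts the rows that repeat an earlier row
n_repeats = int(df.duplicated().sum())
print(n_repeats)

# subset limits the comparison to the listed columns (here, only 'ID')
id_repeats = df[df.duplicated(subset=['ID'])]
print(id_repeats)
```

Viewing all copies side by side is useful when you want to inspect the duplicated records before deciding which one to keep.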

Step 3: Removing Duplicates

Once duplicates are identified, we can remove them with the pandas drop_duplicates() method. By default, it keeps the first occurrence of each duplicated record and removes later occurrences. It returns a new DataFrame and leaves the original unchanged.

Python3
# Remove duplicates, keeping the first occurrence of each record
deduplicated_df = df.drop_duplicates()
print(deduplicated_df)

Output:

   ID     Name  Age
0   1     John   25
1   2    Alice   30
2   3      Bob   35
3   4  Charlie   40
4   5     Emma   45
6   6      Eva   50
8   7    David   55
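
The result keeps the original row labels, so the index has gaps at 5 and 7. The following sketch, again on the sample data, shows a few common variations: keeping the last occurrence, deduplicating on a single column, and renumbering the index. The names last_kept, by_id, and tidy are illustrative only.

```python
import pandas as pd

# Same sample data as in Step 1
df = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5, 1, 6, 2, 7],
    'Name': ['John', 'Alice', 'Bob', 'Charlie', 'Emma', 'John', 'Eva', 'Alice', 'David'],
    'Age': [25, 30, 35, 40, 45, 25, 50, 30, 55]
})

# keep='last' retains the final occurrence of each duplicated record
last_kept = df.drop_duplicates(keep='last')
print(last_kept)

# subset=['ID'] treats rows with the same ID as duplicates,
# even if other columns were to differ
by_id = df.drop_duplicates(subset=['ID'])
print(by_id)

# reset_index(drop=True) renumbers the rows 0..n-1 after removal
tidy = df.drop_duplicates().reset_index(drop=True)
print(tidy)
```

Which copy to keep depends on the data: if later rows represent newer records, keep='last' is often the safer choice.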

Step 4: Saving the Cleaned Dataset

Finally, we can save the cleaned dataset to a new file using the pandas to_csv() method. Passing index=False leaves the row labels out of the file.

Python3
# Save the cleaned dataset to a new CSV file
deduplicated_df.to_csv('cleaned_dataset.csv', index=False)

Output:

cleaned_dataset.csv

ID,Name,Age
1,John,25
2,Alice,30
3,Bob,35
4,Charlie,40
5,Emma,45
6,Eva,50
7,David,55
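
As an optional sanity check, the saved file can be read back to confirm that no duplicates remain. This is a minimal sketch, not part of the original procedure: it writes to a temporary directory (an assumption made so the example cleans up after itself), and the names reloaded and no_duplicates_left are illustrative.

```python
import os
import tempfile

import pandas as pd

# Same sample data as in Step 1, deduplicated as in Step 3
df = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5, 1, 6, 2, 7],
    'Name': ['John', 'Alice', 'Bob', 'Charlie', 'Emma', 'John', 'Eva', 'Alice', 'David'],
    'Age': [25, 30, 35, 40, 45, 25, 50, 30, 55]
})
deduplicated_df = df.drop_duplicates()

# Write to a temporary directory, then read the file back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'cleaned_dataset.csv')
    deduplicated_df.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# The reloaded data should have 7 rows, 3 columns, and no repeated rows
no_duplicates_left = not reloaded.duplicated().any()
print(reloaded.shape)
print(no_duplicates_left)
```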