Identify and Remove Duplicate Data in R

Last Updated : 01 Aug, 2023

A dataset can contain duplicate values. To keep it accurate and free of redundancy, duplicate entries need to be identified and removed. This article shows how to identify and remove duplicate data in R: first we check whether duplicates are present, and if they are, we remove them.

Identifying Duplicate Data in a vector

The duplicated() function returns a logical vector that marks as TRUE each element repeating an earlier one. Passing that result to sum() gives the number of duplicate values in the vector.

R

# Create a sample vector with duplicate elements
vector_data <- c(1, 2, 3, 4, 4, 5)
 
# Identify duplicate elements
duplicated(vector_data)
 
# count of duplicated data
sum(duplicated(vector_data))

Output:

[1] FALSE FALSE FALSE FALSE  TRUE FALSE

[1] 1
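
Beyond counting, it is often useful to see where the duplicates are and what they are. The following is a minimal sketch using only base R; the variable names dup_positions, dup_values and all_copies are illustrative and not part of the original example.

```r
vector_data <- c(1, 2, 3, 4, 4, 5)

# Positions of elements that repeat an earlier element
dup_positions <- which(duplicated(vector_data))
dup_positions

# The duplicated values themselves
dup_values <- vector_data[duplicated(vector_data)]
dup_values

# Flag every copy of a repeated value, including its first occurrence,
# by also scanning from the end of the vector with fromLast = TRUE
all_copies <- duplicated(vector_data) | duplicated(vector_data, fromLast = TRUE)
which(all_copies)
```

By default duplicated() leaves the first occurrence unmarked, which is why the second scan with fromLast = TRUE is needed to catch it.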

Removing Duplicate Data in a vector

We can remove duplicate data from a vector with the unique() function, which returns each value only once and keeps the first occurrence.

R

# Create a sample vector with duplicate elements
vector_data <- c(1, 2, 3, 4, 4, 5)
 
# Remove duplicate elements
unique(vector_data)

Output:

[1] 1 2 3 4 5
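
The same result can be obtained by negating duplicated() and using it as an index, which makes the link between the two functions explicit. This is a small sketch in base R; the name deduplicated is only illustrative.

```r
vector_data <- c(1, 2, 3, 4, 4, 5)

# Keep only the elements that are NOT flagged as duplicates
deduplicated <- vector_data[!duplicated(vector_data)]
deduplicated

# Check that this matches the result of unique()
identical(deduplicated, unique(vector_data))
```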

Identifying Duplicate Data in a data frame

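To identify duplicates, we use the duplicated() function. For a data frame it returns a logical vector with one value per row, where TRUE marks a row that repeats an earlier row. Summing that vector gives the count of duplicate rows.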

Syntax:

duplicated(dataframe)

Approach:

Create a data frame
Pass it to the duplicated() function
The function returns a logical vector in which TRUE marks each duplicated row
Apply the sum() function to the result to get the number of duplicate rows

Data in use:

    name maths science history
1    Ram     7       5       7
2  Geeta     8       7       7
3   John     8       6       7
4   Paul     9       8       7
5 Cassie    10       9       7
6  Geeta     8       7       7
7   Paul     9       8       7

Example:

R

# Create a sample data frame of students
# and their marks in respective subjects.
student_result <- data.frame(name = c("Ram", "Geeta", "John", "Paul",
                                      "Cassie", "Geeta", "Paul"),
                             maths = c(7, 8, 8, 9, 10, 8, 9),
                             science = c(5, 7, 6, 8, 9, 7, 8),
                             history = c(7, 7, 7, 7, 7, 7, 7))

# Print the data
student_result

# Flag rows that repeat an earlier row
duplicated(student_result)

# Count the duplicate rows
sum(duplicated(student_result))

Output:

duplicated(student_result)
[1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE

sum(duplicated(student_result))
[1] 2
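
The logical vector can also be used as a row index to inspect the duplicate rows themselves. The sketch below uses base R only; repeated_rows and all_copies are illustrative names, not part of the original example.

```r
student_result <- data.frame(name = c("Ram", "Geeta", "John", "Paul",
                                      "Cassie", "Geeta", "Paul"),
                             maths = c(7, 8, 8, 9, 10, 8, 9),
                             science = c(5, 7, 6, 8, 9, 7, 8),
                             history = c(7, 7, 7, 7, 7, 7, 7))

# Rows that repeat an earlier row (here rows 6 and 7)
repeated_rows <- student_result[duplicated(student_result), ]
repeated_rows

# Every copy of a duplicated row, including its first occurrence
is_copy <- duplicated(student_result) |
  duplicated(student_result, fromLast = TRUE)
all_copies <- student_result[is_copy, ]
all_copies
```

Looking at all copies side by side is a quick way to confirm that the flagged rows really are exact repeats before removing them.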

Removing Duplicate Data in a data frame

Approach:

Create a data frame
Select the rows that are unique
Retrieve those rows
Display the result

Method 1: Using unique()

When applied to a data frame, unique() returns the data frame with duplicate rows removed, keeping the first occurrence of each row.

Syntax:

unique(dataframe)

Example:

R

# Create a sample data frame of students
# and their marks in respective subjects.
student_result <- data.frame(name = c("Ram", "Geeta", "John", "Paul",
                                      "Cassie", "Geeta", "Paul"),
                             maths = c(7, 8, 8, 9, 10, 8, 9),
                             science = c(5, 7, 6, 8, 9, 7, 8),
                             history = c(7, 7, 7, 7, 7, 7, 7))

# Print the data
student_result

# Remove duplicate rows
unique(student_result)

Output:

    name maths science history
1    Ram     7       5       7
2  Geeta     8       7       7
3   John     8       6       7
4   Paul     9       8       7
5 Cassie    10       9       7
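
As with vectors, the same rows can be selected by indexing with the negated result of duplicated(). This is a minimal base R sketch; the name deduplicated is illustrative.

```r
student_result <- data.frame(name = c("Ram", "Geeta", "John", "Paul",
                                      "Cassie", "Geeta", "Paul"),
                             maths = c(7, 8, 8, 9, 10, 8, 9),
                             science = c(5, 7, 6, 8, 9, 7, 8),
                             history = c(7, 7, 7, 7, 7, 7, 7))

# Keep only the rows that are NOT flagged as duplicates
deduplicated <- student_result[!duplicated(student_result), ]
deduplicated

# Compare the number of rows before and after
nrow(student_result)
nrow(deduplicated)

# Check that the result matches unique()
all.equal(deduplicated, unique(student_result))
```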

Method 2: Using distinct()

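The distinct() function comes from the “dplyr” package, which is part of the “tidyverse”. Install dplyr (or the whole tidyverse) and load it with library(dplyr) or library(tidyverse) before calling distinct(). The function keeps only the rows with distinct values.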

Syntax:

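distinct(.data, ..., .keep_all = FALSE)

Parameters:

.data: the data frame in use

...: optional columns used to determine uniqueness; if omitted, all columns are used

.keep_all: if TRUE, all columns of .data are kept in the result; when several rows share the same values in the selected columns, the first of them is kept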

Example:

R

# Load the tidyverse, which attaches dplyr (the package that provides distinct())
library(tidyverse)

# Create a sample data frame of students and
# their marks in respective subjects.
student_result <- data.frame(name = c("Ram", "Geeta", "John", "Paul",
                                      "Cassie", "Geeta", "Paul"),
                             maths = c(7, 8, 8, 9, 10, 8, 9),
                             science = c(5, 7, 6, 8, 9, 7, 8),
                             history = c(7, 7, 7, 7, 7, 7, 7))

# Print the data
student_result

# Keep only the distinct rows
distinct(student_result)

Output:

    name maths science history
1    Ram     7       5       7
2  Geeta     8       7       7
3   John     8       6       7
4   Paul     9       8       7
5 Cassie    10       9       7
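
distinct() can also compare only some of the columns. The sketch below assumes dplyr is installed; it shows that when .keep_all is left at its default of FALSE, only the columns named in the call are returned. The names maths_only and name_maths are illustrative.

```r
library(dplyr)

student_result <- data.frame(name = c("Ram", "Geeta", "John", "Paul",
                                      "Cassie", "Geeta", "Paul"),
                             maths = c(7, 8, 8, 9, 10, 8, 9),
                             science = c(5, 7, 6, 8, 9, 7, 8),
                             history = c(7, 7, 7, 7, 7, 7, 7))

# Compare only the maths column; the result contains only that column
maths_only <- distinct(student_result, maths)
maths_only

# Compare the combination of name and maths; both columns are returned
name_maths <- distinct(student_result, name, maths)
name_maths
```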

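Example 2: Keeping rows that are unique in the maths column

Here only the maths column is compared, so John (maths = 8) is dropped because Geeta already has that score, even though his full row is not a duplicate.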

R

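# Load the tidyverse, which attaches dplyr (the package that provides distinct())
library(tidyverse)

# Create a sample data frame of students and
# their marks in respective subjects.
student_result <- data.frame(name = c("Ram", "Geeta", "John", "Paul",
                                      "Cassie", "Geeta", "Paul"),
                             maths = c(7, 8, 8, 9, 10, 8, 9),
                             science = c(5, 7, 6, 8, 9, 7, 8),
                             history = c(7, 7, 7, 7, 7, 7, 7))

# Print the data
student_result

# Keep the first row for each distinct maths value, with all columns
distinct(student_result, maths, .keep_all = TRUE)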

Output:

    name maths science history
1    Ram     7       5       7
2  Geeta     8       7       7
3   Paul     9       8       7
4 Cassie    10       9       7
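
If dplyr is not available, a similar column-based filter can be written in base R by applying duplicated() to the chosen column or columns. This is only a sketch of one possible equivalent; by_maths and by_maths_science are illustrative names. Unlike distinct(), base R subsetting keeps the original row names.

```r
student_result <- data.frame(name = c("Ram", "Geeta", "John", "Paul",
                                      "Cassie", "Geeta", "Paul"),
                             maths = c(7, 8, 8, 9, 10, 8, 9),
                             science = c(5, 7, 6, 8, 9, 7, 8),
                             history = c(7, 7, 7, 7, 7, 7, 7))

# Keep the first row for each distinct maths value, with all columns
by_maths <- student_result[!duplicated(student_result$maths), ]
by_maths

# Keep the first row for each distinct (maths, science) combination
key_columns <- student_result[, c("maths", "science")]
by_maths_science <- student_result[!duplicated(key_columns), ]
by_maths_science
```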
