Identify and Remove Duplicate Data in R
  • Last Updated : 26 Mar, 2021

A dataset can contain duplicate rows. To keep it accurate and free of redundancy, those rows need to be identified and removed. This article shows how to identify and remove duplicate data in R: first we check whether duplicate rows are present, and if they are, we remove them.

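Data in use (a data frame of students and their marks, created in each example):

    name maths science history
1    Ram     7       5       7
2  Geeta     8       7       7
3   John     8       6       7
4   Paul     9       8       7
5 Cassie    10       9       7
6  Geeta     8       7       7
7   Paul     9       8       7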

Identifying Duplicate Data

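To identify duplicates, we use the duplicated() function. It returns a logical vector with one element per row, where TRUE marks a row that repeats an earlier row. Summing that vector gives the number of duplicate rows.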

Syntax:



duplicated(dataframe)

Approach:

  • Create data frame
  • Pass it to duplicated() function
  • The function returns a logical vector in which TRUE marks each row that duplicates an earlier row
  • Apply sum() to that vector to count the duplicate rows

Example:

R




# Creating a sample data frame of students 
# and their marks in respective subjects.
student_result=data.frame(name=c("Ram","Geeta","John","Paul",
                                 "Cassie","Geeta","Paul"),
                          maths=c(7,8,8,9,10,8,9),
                          science=c(5,7,6,8,9,7,8),
                          history=c(7,7,7,7,7,7,7))
  
# Printing data
student_result

# Logical vector: TRUE marks a row that repeats an earlier row
duplicated(student_result)

# Number of duplicate rows
sum(duplicated(student_result))

Output:

> duplicated(student_result)
[1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE

> sum(duplicated(student_result))
[1] 2
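
To see which rows were flagged, and not only how many, the logical vector can be used as a row index. The following is a minimal sketch on the same student_result data. The variable names (dup_flags, dup_rows, and so on) are illustrative, and the fromLast argument of duplicated() is an option the article does not otherwise use; it is shown here only to flag the earlier copies instead of the later ones.

```r
# Same sample data as in the article
student_result <- data.frame(
  name    = c("Ram", "Geeta", "John", "Paul", "Cassie", "Geeta", "Paul"),
  maths   = c(7, 8, 8, 9, 10, 8, 9),
  science = c(5, 7, 6, 8, 9, 7, 8),
  history = c(7, 7, 7, 7, 7, 7, 7)
)

# Logical vector: TRUE marks a row that repeats an earlier row
dup_flags <- duplicated(student_result)

# The rows that were flagged as duplicates
dup_rows <- student_result[dup_flags, ]
print(dup_rows)

# Row positions of the flagged rows
dup_positions <- which(dup_flags)
print(dup_positions)

# fromLast = TRUE scans from the bottom, so the earlier copies are flagged
dup_from_last <- duplicated(student_result, fromLast = TRUE)
print(which(dup_from_last))
```

With this data, the default scan flags the second "Geeta" and "Paul" rows, while fromLast = TRUE flags the first ones.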

Removing Duplicate Data

Approach

  • Create data frame
  • Select rows which are unique
  • Retrieve those rows
  • Display result
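
The four steps above can be sketched directly with logical indexing. This is one possible way to follow the approach, not the only one; the names is_unique and deduplicated are illustrative.

```r
# Step 1: create the data frame
student_result <- data.frame(
  name    = c("Ram", "Geeta", "John", "Paul", "Cassie", "Geeta", "Paul"),
  maths   = c(7, 8, 8, 9, 10, 8, 9),
  science = c(5, 7, 6, 8, 9, 7, 8),
  history = c(7, 7, 7, 7, 7, 7, 7)
)

# Step 2: select rows that do not repeat an earlier row
is_unique <- !duplicated(student_result)

# Step 3: retrieve those rows
deduplicated <- student_result[is_unique, ]

# Step 4: display the result
print(deduplicated)
```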

Method 1: Using unique()

unique() returns the data frame with duplicate rows removed, keeping the first occurrence of each row.

Syntax:

unique(dataframe)

Example:

R




# Creating a sample data frame of students 
# and their marks in respective subjects.
student_result=data.frame(name=c("Ram","Geeta","John","Paul",
                                 "Cassie","Geeta","Paul"),
                          maths=c(7,8,8,9,10,8,9),
                          science=c(5,7,6,8,9,7,8),
                          history=c(7,7,7,7,7,7,7))
  
# Printing data
student_result

# Data frame with duplicate rows removed
unique(student_result)

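Output:

> unique(student_result)
    name maths science history
1    Ram     7       5       7
2  Geeta     8       7       7
3   John     8       6       7
4   Paul     9       8       7
5 Cassie    10       9       7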

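A quick sanity check after deduplicating is to compare row counts and confirm that no duplicates remain. The sketch below uses base R's anyDuplicated(), which the article does not cover; it returns the index of the first duplicated row, or 0 if there is none. The variable names are illustrative.

```r
student_result <- data.frame(
  name    = c("Ram", "Geeta", "John", "Paul", "Cassie", "Geeta", "Paul"),
  maths   = c(7, 8, 8, 9, 10, 8, 9),
  science = c(5, 7, 6, 8, 9, 7, 8),
  history = c(7, 7, 7, 7, 7, 7, 7)
)

deduplicated <- unique(student_result)

# Number of rows dropped by unique()
rows_removed <- nrow(student_result) - nrow(deduplicated)

# 0 means no duplicate rows remain in the result
remaining <- anyDuplicated(deduplicated)

cat("Rows removed:", rows_removed, "\n")
cat("First remaining duplicate (0 means none):", remaining, "\n")
```
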
Method 2: Using distinct()

distinct() comes from the dplyr package, so dplyr must be installed (on its own or as part of the tidyverse) and loaded with library(dplyr) before use. distinct() returns the rows of a data frame with duplicates removed.

Syntax:

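distinct(dataframe, ..., .keep_all = FALSE)

Parameter:

  • dataframe: data in use
  • ...: optional columns used to determine uniqueness; if omitted, all columns are used
  • .keep_all: if TRUE, keeps all columns in the result; if FALSE (the default), keeps only the columns used to determine uniqueness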

Example:

R




# Load dplyr, which provides distinct()
library(dplyr)

# Creating a sample data frame of students and 
# their marks in respective subjects.
student_result=data.frame(name=c("Ram","Geeta","John","Paul",
                                 "Cassie","Geeta","Paul"),
                          maths=c(7,8,8,9,10,8,9),
                          science=c(5,7,6,8,9,7,8),
                          history=c(7,7,7,7,7,7,7))
  
# Printing data
student_result

# Keep only the distinct rows
distinct(student_result)

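Output:

> distinct(student_result)
    name maths science history
1    Ram     7       5       7
2  Geeta     8       7       7
3   John     8       6       7
4   Paul     9       8       7
5 Cassie    10       9       7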

Example 2: Keeping one row for each distinct value of the maths column

R




# Load dplyr, which provides distinct()
library(dplyr)

# Creating a sample data frame of students and
# their marks in respective subjects.
student_result=data.frame(name=c("Ram","Geeta","John","Paul",
                                 "Cassie","Geeta","Paul"),
                          maths=c(7,8,8,9,10,8,9),
                          science=c(5,7,6,8,9,7,8),
                          history=c(7,7,7,7,7,7,7))
  
# Printing data
student_result

# Keep the first row for each distinct value of maths;
# .keep_all = TRUE retains the other columns as well
distinct(student_result,maths,.keep_all = TRUE)

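Output:

> distinct(student_result,maths,.keep_all = TRUE)
    name maths science history
1    Ram     7       5       7
2  Geeta     8       7       7
3   Paul     9       8       7
4 Cassie    10       9       7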
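
For comparison, a similar one-row-per-maths-value selection can be sketched in base R by applying duplicated() to a single column. This is an alternative technique and not part of the dplyr approach above; the name first_per_maths is illustrative. Base R subsetting keeps the original row numbers, so the sketch resets them at the end.

```r
student_result <- data.frame(
  name    = c("Ram", "Geeta", "John", "Paul", "Cassie", "Geeta", "Paul"),
  maths   = c(7, 8, 8, 9, 10, 8, 9),
  science = c(5, 7, 6, 8, 9, 7, 8),
  history = c(7, 7, 7, 7, 7, 7, 7)
)

# Keep the first row seen for each distinct maths value
first_per_maths <- student_result[!duplicated(student_result$maths), ]

# Original row numbers are preserved by the subsetting
original_rows <- rownames(first_per_maths)

# Optionally renumber the rows 1..n
rownames(first_per_maths) <- NULL
print(first_per_maths)
```

Here "John" is dropped because his maths mark (8) was already seen in the first "Geeta" row, even though his other marks differ. Deduplicating on a single column can therefore discard rows that are not full duplicates.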
