Remove duplicate rows based on multiple columns using dplyr in R
Last Updated: 28 Jul, 2021
In this article, we will learn how to remove duplicate rows based on multiple columns using the dplyr package in the R programming language.
Dataframe in use:
lang value usage
1 Java 21 21
2 C 21 21
3 Python 3 0
4 GO 5 99
5 RUST 180 44
6 Javascript 9 48
7 Cpp 12 53
8 Java 21 21
9 Julia 6 6
10 Typescript 0 8
11 Python 3 0
12 GO 6 6
Removing duplicate rows based on a Single Column
The distinct() function from dplyr removes duplicate rows. We pass the dataframe and the column name as arguments to distinct().
Note: We set .keep_all = TRUE because it is FALSE by default, in which case distinct() returns only the distinct values of the specified column. Setting it to TRUE keeps all the other columns as well, with their values taken from the first row of each set of duplicates.
Syntax: distinct(df, column_name, .keep_all= TRUE)
Parameters:
df: dataframe object
column_name: column name based on which duplicate rows will be removed
Example: R program to remove duplicate rows based on a single column
R
library(dplyr)

df <- data.frame(
  lang  = c('Java', 'C', 'Python', 'GO', 'RUST', 'Javascript',
            'Cpp', 'Java', 'Julia', 'Typescript', 'Python', 'GO'),
  value = c(21, 21, 3, 5, 180, 9, 12, 21, 6, 0, 3, 6),
  usage = c(21, 21, 0, 99, 44, 48, 53, 21, 6, 8, 0, 6)
)

# Keep the first row for each distinct value of lang
distinct(df, lang, .keep_all = TRUE)
Output:
lang value usage
1 Java 21 21
2 C 21 21
3 Python 3 0
4 GO 5 99
5 RUST 180 44
6 Javascript 9 48
7 Cpp 12 53
8 Julia 6 6
9 Typescript 0 8
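To make the effect of .keep_all concrete, the following minimal sketch runs distinct() on the same dataframe with and without the argument. The names only_lang and all_cols are illustrative and are not part of the original example.

```r
library(dplyr)

df <- data.frame(
  lang  = c('Java', 'C', 'Python', 'GO', 'RUST', 'Javascript',
            'Cpp', 'Java', 'Julia', 'Typescript', 'Python', 'GO'),
  value = c(21, 21, 3, 5, 180, 9, 12, 21, 6, 0, 3, 6),
  usage = c(21, 21, 0, 99, 44, 48, 53, 21, 6, 8, 0, 6)
)

# Default (.keep_all = FALSE): only the lang column is returned
only_lang <- distinct(df, lang)

# .keep_all = TRUE: every column is kept, using the first row per lang
all_cols <- distinct(df, lang, .keep_all = TRUE)

names(only_lang)  # "lang"
names(all_cols)   # "lang" "value" "usage"
nrow(all_cols)    # 9
```

Both results have nine rows. Only the set of returned columns differs.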
Removing duplicate rows based on Multiple columns
We can remove duplicate rows on the basis of the 'value' and 'usage' columns by passing those column names as arguments to the distinct() function. A row is treated as a duplicate only when it matches an earlier row in both columns.
Syntax: distinct(df, col1,col2, .keep_all= TRUE)
Parameters:
df: dataframe object
col1, col2: column names based on which duplicate rows will be removed
Example: R program to remove duplicate rows based on multiple columns
R
library(dplyr)

df <- data.frame(
  lang  = c('Java', 'C', 'Python', 'GO', 'RUST', 'Javascript',
            'Cpp', 'Java', 'Julia', 'Typescript', 'Python', 'GO'),
  value = c(21, 21, 3, 5, 180, 9, 12, 21, 6, 0, 3, 6),
  usage = c(21, 21, 0, 99, 44, 48, 53, 21, 6, 8, 0, 6)
)

# Keep the first row for each distinct (value, usage) combination
distinct(df, value, usage, .keep_all = TRUE)
Output:
lang value usage
1 Java 21 21
2 Python 3 0
3 GO 5 99
4 RUST 180 44
5 Javascript 9 48
6 Cpp 12 53
7 Julia 6 6
8 Typescript 0 8
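Before removing rows, it can help to see which (value, usage) combinations actually repeat. The sketch below uses dplyr's count() for this. The name dup_keys is illustrative and is not part of the original example.

```r
library(dplyr)

df <- data.frame(
  lang  = c('Java', 'C', 'Python', 'GO', 'RUST', 'Javascript',
            'Cpp', 'Java', 'Julia', 'Typescript', 'Python', 'GO'),
  value = c(21, 21, 3, 5, 180, 9, 12, 21, 6, 0, 3, 6),
  usage = c(21, 21, 0, 99, 44, 48, 53, 21, 6, 8, 0, 6)
)

# Count the rows that share each (value, usage) combination,
# then keep only the combinations that occur more than once
dup_keys <- df %>%
  count(value, usage) %>%
  filter(n > 1)

dup_keys

# Rows that distinct() drops: every occurrence after the first one
sum(dup_keys$n - 1)  # 4
```

Three combinations repeat: (3, 0) and (6, 6) twice each, and (21, 21) three times. That accounts for four dropped rows, which is why the 12-row dataframe shrinks to the 8 rows shown in the output.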
Remove all the duplicate rows from the dataframe
In this case, we pass only the dataframe to the distinct() function. It then compares rows across all variables/columns and removes every row that is an exact copy of an earlier row.
Syntax: distinct(df)
Parameters:
df: dataframe object
Example: R program to remove all the duplicate rows from the dataframe
R
library(dplyr)

df <- data.frame(
  lang  = c('Java', 'C', 'Python', 'GO', 'RUST', 'Javascript',
            'Cpp', 'Java', 'Julia', 'Typescript', 'Python', 'GO'),
  value = c(21, 21, 3, 5, 180, 9, 12, 21, 6, 0, 3, 6),
  usage = c(21, 21, 0, 99, 44, 48, 53, 21, 6, 8, 0, 6)
)

# Remove rows that are identical to an earlier row in every column
distinct(df)
Output:
lang value usage
1 Java 21 21
2 C 21 21
3 Python 3 0
4 GO 5 99
5 RUST 180 44
6 Javascript 9 48
7 Cpp 12 53
8 Julia 6 6
9 Typescript 0 8
10 GO 6 6
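As a quick check on the result above, the following sketch uses base R's duplicated() to list the row positions that distinct(df) drops. The name dropped is illustrative and is not part of the original example.

```r
library(dplyr)

df <- data.frame(
  lang  = c('Java', 'C', 'Python', 'GO', 'RUST', 'Javascript',
            'Cpp', 'Java', 'Julia', 'Typescript', 'Python', 'GO'),
  value = c(21, 21, 3, 5, 180, 9, 12, 21, 6, 0, 3, 6),
  usage = c(21, 21, 0, 99, 44, 48, 53, 21, 6, 8, 0, 6)
)

# Positions of rows that exactly repeat an earlier row in every column
dropped <- which(duplicated(df))
dropped                        # 8 11

# Number of rows removed by distinct()
nrow(df) - nrow(distinct(df))  # 2
```

Row 12 (GO, 6, 6) survives because it differs from row 4 (GO, 5, 99) in the value and usage columns, even though the lang value repeats.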
Using duplicated() function
In this approach, base R's duplicated() function flags every row that repeats an earlier row, and dplyr's filter() keeps only the rows that are not flagged. The columns to compare are passed to duplicated() through cbind().
Note: duplicated() returns TRUE for each row that repeats an earlier one, so we negate its result with the NOT (!) operator to keep only the first occurrence of each combination.
Syntax:
df %>%
filter(!duplicated(cbind(col1, col2,..)))
Parameters:
col1, col2: names of the columns based on which you want to remove duplicate rows
cbind(): binds the given columns together into a matrix so that multiple columns can be compared at once
duplicated(): returns a logical vector that is TRUE for every row that duplicates an earlier row
Example: R program to remove duplicate rows using duplicated()
R
library(dplyr)

df <- data.frame(
  lang  = c('Java', 'C', 'Python', 'GO', 'RUST', 'Javascript',
            'Cpp', 'Java', 'Julia', 'Typescript', 'Python', 'GO'),
  value = c(21, 21, 3, 5, 180, 9, 12, 21, 6, 0, 3, 6),
  usage = c(21, 21, 0, 99, 44, 48, 53, 21, 6, 8, 0, 6)
)

# Keep only rows whose (value, usage) pair has not appeared earlier
df %>%
  filter(!duplicated(cbind(value, usage)))
Output:
lang value usage
1 Java 21 21
2 Python 3 0
3 GO 5 99
4 RUST 180 44
5 Javascript 9 48
6 Cpp 12 53
7 Julia 6 6
8 Typescript 0 8
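To see what duplicated() produces before it is negated, the sketch below prints the logical vector and also keeps the last occurrence of each combination through the fromLast argument of duplicated(). It selects the columns by subsetting the dataframe rather than with cbind(), which is a substitution made for this illustration. The names dup_flag, first_kept and last_kept are also illustrative.

```r
library(dplyr)

df <- data.frame(
  lang  = c('Java', 'C', 'Python', 'GO', 'RUST', 'Javascript',
            'Cpp', 'Java', 'Julia', 'Typescript', 'Python', 'GO'),
  value = c(21, 21, 3, 5, 180, 9, 12, 21, 6, 0, 3, 6),
  usage = c(21, 21, 0, 99, 44, 48, 53, 21, 6, 8, 0, 6)
)

# TRUE marks a row whose (value, usage) pair already appeared above it
dup_flag <- duplicated(df[, c("value", "usage")])
which(dup_flag)  # 2 8 11 12

# Keep the first occurrence of each pair (same rows as the example above)
first_kept <- df %>% filter(!dup_flag)

# Keep the last occurrence of each pair instead
last_kept <- df %>%
  filter(!duplicated(df[, c("value", "usage")], fromLast = TRUE))

nrow(first_kept)  # 8
nrow(last_kept)   # 8
```

Subsetting the dataframe keeps each column's own type. cbind() would convert every column to character if one of the selected columns were a character column such as lang.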