R dplyr distinct() Function

Last Updated : 19 Jan, 2024

The dplyr package in R Programming Language offers a powerful tool, the distinct() function, designed to identify and eliminate duplicate rows in a data frame. This article describes the syntax and advantages of distinct().

We’ll look at some examples using an employee data frame to understand the various ways in which distinct() can be used for data analysis.

What is a distinct() Function?

The distinct() function is a data manipulation function provided by the dplyr package in R. Its primary purpose is to return the unique rows of a data frame, or the distinct combinations of values in the columns you specify. This makes it particularly useful for data cleaning, exploratory data analysis, and obtaining unique records from a dataset.

  1. Duplicate rows can arise from data-entry errors, merging data from different sources, inconsistent naming conventions, or data scraping issues, so removing them is an important cleaning step.
  2. Duplicate data in a dataset can lead to biased results.

For example, summary statistics such as the mean, median, and mode can be skewed, leading to misleading interpretations of the data. Duplicate customer records can also result in the same customer being contacted several times for the same information.
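
To make the skew concrete, here is a minimal sketch using a small, hypothetical orders table (the column names and values are made up for illustration). It uses only base R, so no packages are needed yet. One order was recorded three times, which pulls the mean from roughly 266.67 up to 360.

```r
# Hypothetical order amounts; order 3 was accidentally recorded three times
orders <- data.frame(order_id = c(1, 2, 3, 3, 3),
                     amount   = c(100, 200, 500, 500, 500))

# Mean computed on the raw data, duplicates included
mean_with_duplicates <- mean(orders$amount)

# Mean computed after dropping the repeated rows with base R unique()
mean_without_duplicates <- mean(unique(orders)$amount)

print(mean_with_duplicates)
print(mean_without_duplicates)
```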

Syntax

distinct(.data, ..., .keep_all = FALSE)
.data : A data frame.
... : Optional variables used to determine uniqueness, specified by their column names. If omitted, all columns are used.
.keep_all : If set to TRUE, keep all the variables in .data. When a combination of values is repeated, the first matching row is kept.

To use the dplyr package for our analysis, we first need to install it with the following command:

#installing dplyr package
install.packages("dplyr")

Removing Duplicate Rows from the Dataframe

In this example, we remove duplicate rows from the dataframe employee and save the result in a new dataframe, employee_unique. Because only the dataframe is passed to distinct(), all columns are used to decide whether two rows are duplicates, and every unique row is returned.

In our original dataframe, row 2 and row 6 are duplicates. distinct() retains the first of these rows and discards the second.

R




#loading the package
library(dplyr)
 
#creating a dataframe with duplicate rows
employee= data.frame(first_name=c("Ram","Sara","John","Fred",
                                  "Kat","Sara","Ram","Riya"),
                     last_name=c("Singh","Gupta","Adams","Roy",
                                 "Amit","Gupta","Misra","Gupta"),
                     role=c("Manager","Analyst","Analyst","CEO","Intern",
                            "Analyst","Intern","Intern"))
 
#using the distinct() function to remove duplicate rows
employee_unique= distinct(employee)
print(employee_unique)


Output:

  first_name last_name    role
1        Ram     Singh Manager
2       Sara     Gupta Analyst
3       John     Adams Analyst
4       Fred       Roy     CEO
5        Kat      Amit  Intern
6        Ram     Misra  Intern
7       Riya     Gupta  Intern
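
Before dropping duplicates, it can help to check how many rows will be removed and which ones they are. The following is a small sketch of one way to do that, assuming the same employee dataframe as above. The names n_removed and duplicate_rows are introduced here for illustration, and duplicated() is a base R function rather than part of dplyr.

```r
library(dplyr)

# Same employee data as in the example above
employee <- data.frame(first_name = c("Ram", "Sara", "John", "Fred",
                                      "Kat", "Sara", "Ram", "Riya"),
                       last_name  = c("Singh", "Gupta", "Adams", "Roy",
                                      "Amit", "Gupta", "Misra", "Gupta"),
                       role       = c("Manager", "Analyst", "Analyst", "CEO",
                                      "Intern", "Analyst", "Intern", "Intern"))

# Number of rows that distinct() discards
n_removed <- nrow(employee) - nrow(distinct(employee))

# Base R duplicated() flags every row that repeats an earlier row
duplicate_rows <- employee[duplicated(employee), ]

print(n_removed)
print(duplicate_rows)
```

Here duplicate_rows contains the second "Sara Gupta" record (original row 6), which is the row distinct() drops.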

Removing Duplicate Values from a Column

We can also remove duplicate values by specifying one or more column names. In this case, the output dataframe contains only the columns named in distinct(). Since we pass first_name along with the dataframe, only the unique values of the first_name variable are returned.

R




#loading the package
library(dplyr)
 
#creating a dataframe with duplicate rows
employee= data.frame(first_name=c("Ram","Sara","John","Fred",
                                  "Kat","Sara","Ram","Riya"),
                     last_name=c("Singh","Gupta","Adams","Roy",
                                 "Amit","Gupta","Misra","Gupta"),
                     role=c("Manager","Analyst","Analyst","CEO","Intern",
                            "Analyst","Intern","Intern"))
 
#using the distinct() function to remove duplicate values from a column
employee_unique= distinct(employee, first_name)
print(employee_unique)


Output:

  first_name
1        Ram
2       Sara
3       John
4       Fred
5        Kat
6       Riya
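
The example above passes a single column, but the same call accepts several. As a brief sketch on the same employee data, the code below first lists the unique roles using the %>% pipe that dplyr makes available, then the unique combinations of last_name and role. The variable names roles and name_role are chosen here for illustration.

```r
library(dplyr)

# Same employee data as in the example above
employee <- data.frame(first_name = c("Ram", "Sara", "John", "Fred",
                                      "Kat", "Sara", "Ram", "Riya"),
                       last_name  = c("Singh", "Gupta", "Adams", "Roy",
                                      "Amit", "Gupta", "Misra", "Gupta"),
                       role       = c("Manager", "Analyst", "Analyst", "CEO",
                                      "Intern", "Analyst", "Intern", "Intern"))

# Unique values of a single column, written with the pipe
roles <- employee %>% distinct(role)

# Unique combinations of two columns
name_role <- employee %>% distinct(last_name, role)

print(roles)
print(name_role)
```

With two columns, a row is dropped only when both values repeat, so "Gupta / Analyst" and "Gupta / Intern" are both kept.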

Removing Duplicate Values from a Column and Displaying all Variables

In this example, distinct() keeps one row for each unique value of the first_name variable. To retain all the other columns of those rows, we set .keep_all = TRUE. When a first_name value is repeated, the first matching row is kept. The default value of .keep_all is FALSE.

R




#loading the package
library(dplyr)
 
#creating a dataframe with duplicate rows
employee= data.frame(first_name=c("Ram","Sara","John","Fred",
                                  "Kat","Sara","Ram","Riya"),
                     last_name=c("Singh","Gupta","Adams","Roy",
                                 "Amit","Gupta","Misra","Gupta"),
                     role=c("Manager","Analyst","Analyst","CEO","Intern",
                            "Analyst","Intern","Intern"))
 
#using the distinct() function to remove duplicate values from a
#column and displaying all variables
employee_unique= distinct(employee,first_name,.keep_all = TRUE)
print(employee_unique)


Output:

  first_name last_name    role
1        Ram     Singh Manager
2       Sara     Gupta Analyst
3       John     Adams Analyst
4       Fred       Roy     CEO
5        Kat      Amit  Intern
6       Riya     Gupta  Intern
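
Because the first matching row is the one that survives, the row order decides which record is kept for a repeated value. One possible way to control this, sketched below on the same employee data, is to sort with dplyr's arrange() before calling distinct(). Sorting by last_name is just an illustrative choice, and kept is a name introduced here.

```r
library(dplyr)

# Same employee data as in the example above
employee <- data.frame(first_name = c("Ram", "Sara", "John", "Fred",
                                      "Kat", "Sara", "Ram", "Riya"),
                       last_name  = c("Singh", "Gupta", "Adams", "Roy",
                                      "Amit", "Gupta", "Misra", "Gupta"),
                       role       = c("Manager", "Analyst", "Analyst", "CEO",
                                      "Intern", "Analyst", "Intern", "Intern"))

# Sort by last_name first, so the alphabetically earliest last_name
# is the row retained for each repeated first_name
kept <- employee %>%
  arrange(last_name) %>%
  distinct(first_name, .keep_all = TRUE)

print(kept)
```

In the unsorted data, "Ram Singh" is kept for the name Ram. After sorting, "Ram Misra" comes first and is kept instead.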

distinct() vs unique()

distinct() is similar to applying unique(), a base R function, to a data frame, but it is generally faster. distinct() can return the unique combinations of a specific set of columns directly when the column names are passed as arguments. To do the same thing with unique(), we need to create a temporary dataframe containing only those columns and then find its unique rows. This process can be slow for very large dataframes.

For example, to get the unique combinations of first_name and last_name, we have to create a dataframe containing only these two variables and then find the unique rows of that new dataframe.

unique() preserves the original row numbers. In the output, row number 6 is missing, which means that it was a duplicate row.

R




#creating a dataframe with duplicate rows
employee= data.frame(first_name=c("Ram","Sara","John","Fred",
                                  "Kat","Sara","Ram","Riya"),
                     last_name=c("Singh","Gupta","Adams","Roy",
                                 "Amit","Gupta","Misra","Gupta"),
                     role=c("Manager","Analyst","Analyst","CEO","Intern",
                            "Analyst","Intern","Intern"))
 
#printing unique rows for a combination of columns using unique()
employee_unique=unique(employee[c("first_name", "last_name")])
print(employee_unique)


Output:

  first_name last_name
1        Ram     Singh
2       Sara     Gupta
3       John     Adams
4       Fred       Roy
5        Kat      Amit
7        Ram     Misra
8       Riya     Gupta
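
For comparison, the dplyr counterpart of the unique() call above can be sketched as follows. It assumes the same employee dataframe, and the names via_unique, via_distinct, and same_values are introduced here for illustration. The two results hold the same values and differ only in their row numbers, since distinct() renumbers the rows from 1.

```r
library(dplyr)

# Same employee data as in the example above
employee <- data.frame(first_name = c("Ram", "Sara", "John", "Fred",
                                      "Kat", "Sara", "Ram", "Riya"),
                       last_name  = c("Singh", "Gupta", "Adams", "Roy",
                                      "Amit", "Gupta", "Misra", "Gupta"),
                       role       = c("Manager", "Analyst", "Analyst", "CEO",
                                      "Intern", "Analyst", "Intern", "Intern"))

# Base R: subset the columns first, then take the unique rows
via_unique <- unique(employee[c("first_name", "last_name")])

# dplyr: name the columns directly in distinct()
via_distinct <- distinct(employee, first_name, last_name)

# Compare the column values of the two results
same_values <- identical(via_unique$first_name, via_distinct$first_name) &&
  identical(via_unique$last_name, via_distinct$last_name)

print(same_values)
print(rownames(via_unique))
print(rownames(via_distinct))
```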

Conclusion

In conclusion, the distinct() function from the dplyr package is a valuable tool for data analysis. It efficiently removes duplicate rows and values, which helps keep later analysis from being biased by repeated records.


