dplyr Package in R Programming

Last Updated : 20 Dec, 2023

In this article, we will discuss aggregating and analyzing data with the dplyr package in the R programming language.

dplyr Package in R

The dplyr package for the R programming language is a grammar of data manipulation: it provides a consistent set of verbs that solve the most common data manipulation problems.

By limiting the available choices, dplyr lets you focus on the data manipulation problem itself.
It provides simple “verbs”, functions that each handle one common data manipulation task, so thoughts can be translated into code faster.
It uses efficient backends, so less time is spent waiting for the computer.

Here are some key functions and concepts within the dplyr package in R.

Data Frame and Tibble

A data frame in R is an organized table in which each column stores a specific type of information, such as names, ages, or scores. Creating a data frame involves specifying the column names and their respective values.

R

df <- data.frame(
  Name = c("vipul", "jayesh", "anurag"),
  Age = c(25, 23, 22),
  Score = c(95, 89, 78)
)
df

Output:

    Name Age Score
1  vipul  25    95
2 jayesh  23    89
3 anurag  22    78

Tibbles, introduced by the tibble package, work like data frames but are more user-friendly: for example, printing a tibble shows only the first rows together with the type of each column. The syntax for creating a tibble is comparable to that of a data frame.
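As a minimal sketch, the same table can be built as a tibble. This assumes the tibble package is installed; the variable name tb is only illustrative.

```r
# Assumes the tibble package is installed
library(tibble)

# Same columns as the data frame above, built with tibble()
tb <- tibble(
  Name = c("vipul", "jayesh", "anurag"),
  Age = c(25, 23, 22),
  Score = c(95, 89, 78)
)

# Printing a tibble also shows its dimensions and each column's type
print(tb)

# A tibble is still a data frame, so data frame code keeps working
is.data.frame(tb)
```

Because a tibble inherits from data.frame, it can be passed to the dplyr verbs described in the rest of this article.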

Pipes (`%>%`)

The pipe operator (%>%), which dplyr makes available, passes the result of one operation as the first argument of the next. This allows us to chain multiple operations together, improving code readability.

R

# Load necessary libraries
library(dplyr)
 
# Example: Chain operations using the pipe operator
result <- mtcars %>%
  filter(mpg > 20) %>%        # Filter rows where mpg is greater than 20
  select(mpg, cyl, hp) %>%    # Select specific columns
  group_by(cyl) %>%           # Group the data by the 'cyl' variable
  summarise(mean_hp = mean(hp))  # Calculate the mean horsepower for each group
 
# Display the result
print(result)

Output:

    cyl mean_hp
  <dbl>   <dbl>
1     4    82.6
2     6   110

Verb Functions

dplyr provides several important functions (verbs) for data manipulation. These are:

filter():

Selects cases (rows) based on their values.

R

# Create a data frame with missing data
d <- data.frame(name = c("Abhi", "Bhavesh", "Chaman", "Dimri"),
                age = c(7, 5, 9, 16),
                ht = c(46, NA, NA, 69),
                school = c("yes", "yes", "no", "no"))
 
# Display the data frame
print(d)
 
# Finding rows with NA value
rows_with_na <- d %>% filter(is.na(ht))
print(rows_with_na)
 
# Finding rows with no NA value
rows_without_na <- d %>% filter(!is.na(ht))
print(rows_without_na)

Output:

     name age ht school
1    Abhi   7 46    yes
2 Bhavesh   5 NA    yes
3  Chaman   9 NA     no
4   Dimri  16 69     no
Finding rows with NA value
     name age ht school
1 Bhavesh   5 NA    yes
2  Chaman   9 NA     no
Finding rows with no NA value
   name age ht school
1  Abhi   7 46    yes
2 Dimri  16 69     no
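filter() also accepts several conditions at once. The following is a small sketch on the same data frame; the result names (older_in_school, youngest_or_oldest, tall) are made up for illustration.

```r
library(dplyr)

d <- data.frame(name = c("Abhi", "Bhavesh", "Chaman", "Dimri"),
                age = c(7, 5, 9, 16),
                ht = c(46, NA, NA, 69),
                school = c("yes", "yes", "no", "no"))

# Conditions separated by commas are combined with AND
older_in_school <- d %>% filter(age > 6, school == "yes")
print(older_in_school)

# Use | for OR: rows where age is below 6 or above 15
youngest_or_oldest <- d %>% filter(age < 6 | age > 15)
print(youngest_or_oldest)

# Rows where the condition evaluates to NA are dropped,
# so the two rows with a missing ht do not appear here
tall <- d %>% filter(ht > 50)
print(tall)
```

The last call is worth noting when working with missing data: filter() keeps only rows where the condition is TRUE, so NA results are silently excluded.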

arrange():

Reorders the cases (rows).

R

# Create a data frame with missing data 
d <- data.frame( name = c("Abhi", "Bhavesh", "Chaman", "Dimri"), 
                 age = c(7, 5, 9, 16), 
                 ht = c(46, NA, NA, 69),
                 school = c("yes", "yes", "no", "no") )
d
 
# Arranging rows in ascending order of age
d.name <- arrange(d, age)
print(d.name)

Output:

     name age ht school
1    Abhi   7 46    yes
2 Bhavesh   5 NA    yes
3  Chaman   9 NA     no
4   Dimri  16 69     no
 
Arranging name according to the age
     name age ht school
1 Bhavesh   5 NA    yes
2    Abhi   7 46    yes
3  Chaman   9 NA     no
4   Dimri  16 69     no
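arrange() sorts in ascending order by default. As a short sketch, desc() reverses the order and extra columns act as tie-breakers; the result names oldest_first and by_school_age are illustrative.

```r
library(dplyr)

d <- data.frame(name = c("Abhi", "Bhavesh", "Chaman", "Dimri"),
                age = c(7, 5, 9, 16),
                ht = c(46, NA, NA, 69),
                school = c("yes", "yes", "no", "no"))

# desc() sorts a column in descending order
oldest_first <- arrange(d, desc(age))
print(oldest_first)

# Sort by school first, then by age within each school value
by_school_age <- arrange(d, school, age)
print(by_school_age)
```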

select() and rename():

select() chooses variables (columns) based on their names. rename() changes column names while keeping all columns.

R

# Create a data frame with missing data
d <- data.frame(name=c("Abhi", "Bhavesh",
                        "Chaman", "Dimri"),
                 age=c(7, 5, 9, 16),
                 ht=c(46, NA, NA, 69),
                 school=c("yes", "yes", "no", "no"))
 
# starts_with() selects only the columns
# whose names start with "ht"
select(d, starts_with("ht"))

# -starts_with() selects everything
# except the ht column
select(d, -starts_with("ht"))

# Printing columns 1 to 2
select(d, 1:2)

# Printing columns whose
# heading contains 'a'
select(d, contains("a"))

# Printing columns whose heading
# matches the regular expression 'na'
select(d, matches("na"))

Output:

Printing only ht data
  ht
1 46
2 NA
3 NA
4 69
Printing everything except ht data
     name age school
1    Abhi   7    yes
2 Bhavesh   5    yes
3  Chaman   9     no
4   Dimri  16     no
Printing column 1 to 2
     name age
1    Abhi   7
2 Bhavesh   5
3  Chaman   9
4   Dimri  16
Printing columns with heading containing 'a'
     name age
1    Abhi   7
2 Bhavesh   5
3  Chaman   9
4   Dimri  16
Printing columns with heading which matches 'na'
     name
1    Abhi
2 Bhavesh
3  Chaman
4   Dimri
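The examples above only exercise select(). As a minimal sketch of rename(), the snippet below changes a column name; the new names height and student are chosen purely for illustration.

```r
library(dplyr)

d <- data.frame(name = c("Abhi", "Bhavesh", "Chaman", "Dimri"),
                age = c(7, 5, 9, 16),
                ht = c(46, NA, NA, 69),
                school = c("yes", "yes", "no", "no"))

# rename() takes new_name = old_name and keeps every column
renamed <- rename(d, height = ht)
names(renamed)

# select() can also rename while selecting,
# but it drops the columns that are not listed
picked <- select(d, student = name, height = ht)
names(picked)
```

In short, use rename() when all columns should stay and select() when the table should be narrowed at the same time.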

mutate() and transmute():

Adds new variables that are functions of existing variables. mutate() keeps the existing columns, whereas transmute() keeps only the new ones.

R

# Create a data frame with missing data
d <- data.frame( name = c("Abhi", "Bhavesh", 
                          "Chaman", "Dimri"), 
                 age = c(7, 5, 9, 16), 
                 ht = c(46, NA, NA, 69),
                 school = c("yes", "yes", "no", "no") )

# Display the data frame
d

# Calculating a variable x3 which is sum of height
# and age, printed along with the existing columns
mutate(d, x3 = ht + age)

# Calculating a variable x3 which is sum of height
# and age, printed without the existing columns
transmute(d, x3 = ht + age)

Output:

     name age ht school
1    Abhi   7 46    yes
2 Bhavesh   5 NA    yes
3  Chaman   9 NA     no
4   Dimri  16 69     no
Calculating a variable x3 which is sum of height
 
     name age ht school x3
1    Abhi   7 46    yes 53
2 Bhavesh   5 NA    yes NA
3  Chaman   9 NA     no NA
4   Dimri  16 69     no 85
Calculating a variable x3 which is sum of height 
  x3
1 53
2 NA
3 NA
4 85
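Two further details of mutate() are sketched below. The column name x3_doubled and the result names are illustrative, and the .keep argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)

d <- data.frame(name = c("Abhi", "Bhavesh", "Chaman", "Dimri"),
                age = c(7, 5, 9, 16),
                ht = c(46, NA, NA, 69),
                school = c("yes", "yes", "no", "no"))

# A column created in mutate() can be used by
# later expressions in the same call
extended <- mutate(d, x3 = ht + age, x3_doubled = x3 * 2)
print(extended)

# Assumes dplyr >= 1.0.0: .keep = "none" keeps only the
# newly created columns, much like transmute()
only_new <- mutate(d, x3 = ht + age, .keep = "none")
print(only_new)
```

Recent dplyr documentation describes transmute() as superseded by this .keep = "none" form, although transmute() continues to work.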

summarise():

Reduces multiple values to a single summary value.

R

# Create a data frame with missing data
d <- data.frame( name = c("Abhi", "Bhavesh",
                          "Chaman", "Dimri"), 
                 age = c(7, 5, 9, 16), 
                 ht = c(46, NA, NA, 69),
                 school = c("yes", "yes", "no", "no") )

# Calculating mean of age
summarise(d, mean = mean(age))

# Calculating min of age
summarise(d, min = min(age))

# Calculating max of age
summarise(d, max = max(age))

# Calculating median of age
summarise(d, med = median(age))

Output:

Calculating mean of age
  mean
1 9.25
Calculating min of age
  min
1   5
Calculating max of age
  max
1  16
Calculating median of age
  med
1   8
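summarise() is most useful for aggregation when combined with group_by(), which yields one summary row per group. The sketch below groups the same data by school; the output column names n, mean_age and mean_ht are illustrative.

```r
library(dplyr)

d <- data.frame(name = c("Abhi", "Bhavesh", "Chaman", "Dimri"),
                age = c(7, 5, 9, 16),
                ht = c(46, NA, NA, 69),
                school = c("yes", "yes", "no", "no"))

by_school <- d %>%
  group_by(school) %>%
  summarise(
    n = n(),                          # number of rows in each group
    mean_age = mean(age),             # average age per group
    mean_ht = mean(ht, na.rm = TRUE)  # ignore missing heights
  )

print(by_school)
```

Without na.rm = TRUE, mean(ht) would be NA for both groups, because each group contains one missing height.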

sample_n() and sample_frac():

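Takes random samples of rows: sample_n() draws a fixed number of rows and sample_frac() draws a fraction of them. Because the rows are chosen at random, the output will differ from run to run.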

R

# Create a data frame with missing data 
d <- data.frame( name = c("Abhi", "Bhavesh",
                          "Chaman", "Dimri"), 
                 age = c(7, 5, 9, 16), 
                 ht = c(46, NA, NA, 69),
                 school = c("yes", "yes", "no", "no") )
 
# Printing three rows
sample_n(d, 3)
 
# Printing 50 % of the rows
sample_frac(d, 0.50)

Output:

    name age ht school
1 Chaman   9 NA     no
2  Dimri  16 69     no
3   Abhi   7 46    yes
 Printing 50 % of the rows
   name age ht school
1  Abhi   7 46    yes
2 Dimri  16 69     no
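The dplyr documentation lists sample_n() and sample_frac() as superseded by slice_sample(). The sketch below shows the equivalent calls, assuming dplyr 1.0.0 or later; the seed value and the result names are arbitrary choices made for illustration.

```r
library(dplyr)

d <- data.frame(name = c("Abhi", "Bhavesh", "Chaman", "Dimri"),
                age = c(7, 5, 9, 16),
                ht = c(46, NA, NA, 69),
                school = c("yes", "yes", "no", "no"))

# Fixing the seed makes the random draw repeatable (42 is arbitrary)
set.seed(42)

# Equivalent of sample_n(d, 3): three randomly chosen rows
three_rows <- slice_sample(d, n = 3)
print(three_rows)

# Equivalent of sample_frac(d, 0.50): half of the rows
half_rows <- slice_sample(d, prop = 0.5)
print(half_rows)
```

Calling set.seed() before sampling is a common way to make examples and analyses reproducible.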
