dplyr Package in R Programming

Last Updated : 20 Dec, 2023

In this article, we will discuss aggregating and analyzing data with the dplyr package in the R Programming Language.

dplyr Package in R

The dplyr package in the R Programming Language is a grammar of data manipulation that provides a consistent set of verbs, helping to solve the most common data manipulation challenges.

  • By limiting the choices available, it lets you focus on the data manipulation problem itself.
  • It provides simple “verbs”, functions that each handle one common data manipulation task, so thoughts can be translated into code faster.
  • It uses efficient backends, so less time is spent waiting for the computer.

Here are some key functions and concepts within the dplyr package in R.

Data Frame and Tibble

Data frames in R are organized tables where each column stores a specific type of information, like names, ages, or scores. Creating a data frame involves specifying the column names and their respective values.

R




df <- data.frame(
  Name = c("vipul", "jayesh", "anurag"),
  Age = c(25, 23, 22),
  Score = c(95, 89, 78)
)
df


Output:

    Name Age Score
1  vipul  25    95
2 jayesh  23    89
3 anurag  22    78

Tibbles, introduced through the tibble package, share the same functionality but are friendlier to work with: they print only the first rows together with each column's type, and they never convert strings to factors. The syntax for creating a tibble is comparable to that of a data frame.
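As a minimal sketch of the tibble syntax, the example below rebuilds the same table with tibble() and converts an existing data frame with as_tibble(). It assumes the tibble package is installed (it is installed along with dplyr); the object names tb, df and tb2 are only illustrative.

```r
# Assumes the tibble package is installed (it comes with dplyr)
library(tibble)

# Build a tibble column by column, just like data.frame()
tb <- tibble(
  Name  = c("vipul", "jayesh", "anurag"),
  Age   = c(25, 23, 22),
  Score = c(95, 89, 78)
)
tb

# Convert an existing data frame into a tibble
df <- data.frame(Name = c("vipul", "jayesh", "anurag"),
                 Age  = c(25, 23, 22))
tb2 <- as_tibble(df)
class(tb2)
```

Printing tb shows the column types under the column names, which a plain data frame does not do.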

Pipes (%>%)

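The pipe operator (%>%), which dplyr re-exports from the magrittr package, passes the result of the expression on its left as the first argument of the function on its right. This allows us to chain multiple operations together, improving code readability.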

R




# Load necessary libraries
library(dplyr)
 
# Example: Chain operations using the pipe operator
result <- mtcars %>%
  filter(mpg > 20) %>%        # Filter rows where mpg is greater than 20
  select(mpg, cyl, hp) %>%    # Select specific columns
  group_by(cyl) %>%           # Group the data by the 'cyl' variable
  summarise(mean_hp = mean(hp))  # Calculate the mean horsepower for each group
 
# Display the result
print(result)


Output:

    cyl mean_hp
  <dbl>   <dbl>
1     4    82.6
2     6   110
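To make the role of the pipe concrete, here is a small sketch that writes one computation twice: once with %>% and once as nested function calls. The variable names piped and nested are illustrative only.

```r
library(dplyr)

# Pipe version: reads top to bottom, one step per line
piped <- mtcars %>%
  filter(mpg > 20) %>%
  summarise(mean_hp = mean(hp))

# Nested version: the same steps, read from the inside out
nested <- summarise(filter(mtcars, mpg > 20), mean_hp = mean(hp))

# Both forms give the same result
all.equal(piped, nested)
```

The pipe does not change what is computed; it only changes the order in which the steps are written.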

Verb Functions

dplyr provides several important functions (verbs) for data manipulation. These are:

filter():

Selects rows (cases) based on conditions applied to their values.

R




# Create a data frame with missing data
d <- data.frame(name = c("Abhi", "Bhavesh", "Chaman", "Dimri"),
                age = c(7, 5, 9, 16),
                ht = c(46, NA, NA, 69),
                school = c("yes", "yes", "no", "no"))
 
# Display the data frame
print(d)
 
# Finding rows with NA value
rows_with_na <- d %>% filter(is.na(ht))
print(rows_with_na)
 
# Finding rows with no NA value
rows_without_na <- d %>% filter(!is.na(ht))
print(rows_without_na)


Output: 

     name age ht school
1    Abhi   7 46    yes
2 Bhavesh   5 NA    yes
3  Chaman   9 NA     no
4   Dimri  16 69     no

Finding rows with NA value
     name age ht school
1 Bhavesh   5 NA    yes
2  Chaman   9 NA     no

Finding rows with no NA value
   name age ht school
1  Abhi   7 46    yes
2 Dimri  16 69     no
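filter() also accepts several conditions at once. The sketch below reuses the same sample data to show AND and OR conditions, and how rows whose condition evaluates to NA are dropped. The result names are illustrative.

```r
library(dplyr)

d <- data.frame(name = c("Abhi", "Bhavesh", "Chaman", "Dimri"),
                age = c(7, 5, 9, 16),
                ht = c(46, NA, NA, 69),
                school = c("yes", "yes", "no", "no"))

# Comma-separated conditions are combined with AND
older_no_school <- filter(d, age > 6, school == "no")

# Use | to combine conditions with OR
young_or_tall <- filter(d, age < 6 | ht > 60)

# A comparison with NA yields NA, and filter() drops those rows
taller_than_40 <- filter(d, ht > 40)

older_no_school
young_or_tall
taller_than_40
```

Because rows with an NA condition are dropped silently, add an explicit is.na() test when missing values should be kept.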

arrange():

Reorders the rows (cases) by the values of one or more columns.

R




# Create a data frame with missing data
d <- data.frame( name = c("Abhi", "Bhavesh", "Chaman", "Dimri"),
                 age = c(7, 5, 9, 16),
                 ht = c(46, NA, NA, 69),
                 school = c("yes", "yes", "no", "no") )
d
 
# Sort the rows by age in ascending order
d.name <- arrange(d, age)
print(d.name)


Output: 

     name age ht school
1    Abhi   7 46    yes
2 Bhavesh   5 NA    yes
3  Chaman   9 NA     no
4   Dimri  16 69     no

Arranging name according to the age
     name age ht school
1 Bhavesh   5 NA    yes
2    Abhi   7 46    yes
3  Chaman   9 NA     no
4   Dimri  16 69     no
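arrange() sorts in ascending order by default. As a short sketch on the same data, the example below uses desc() for descending order, sorts by two columns, and shows that missing values are placed last. The result names are illustrative.

```r
library(dplyr)

d <- data.frame(name = c("Abhi", "Bhavesh", "Chaman", "Dimri"),
                age = c(7, 5, 9, 16),
                ht = c(46, NA, NA, 69),
                school = c("yes", "yes", "no", "no"))

# Descending order with desc()
by_age_desc <- arrange(d, desc(age))

# Sort by school first, then by descending age within each school value
by_school_age <- arrange(d, school, desc(age))

# Rows with NA in the sort column are placed at the end
by_ht <- arrange(d, ht)

by_age_desc
by_school_age
by_ht
```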

select() and rename():

select() chooses variables (columns) by name or position, and rename() changes column names while keeping every column.

R




# Create a data frame with missing data
d <- data.frame(name=c("Abhi", "Bhavesh",
                        "Chaman", "Dimri"),
                 age=c(7, 5, 9, 16),
                 ht=c(46, NA, NA, 69),
                 school=c("yes", "yes", "no", "no"))
 
# starts_with() selects the columns
# whose names start with "ht"
select(d, starts_with("ht"))

# -starts_with() selects every column
# except those whose names start with "ht"
select(d, -starts_with("ht"))

# Selecting columns 1 to 2 by position
select(d, 1:2)

# Selecting columns whose
# names contain 'a'
select(d, contains("a"))

# Selecting columns whose names
# match the regular expression 'na'
select(d, matches("na"))


Output: 


  ht
1 46
2 NA
3 NA
4 69

everything except ht data
     name age school
1    Abhi   7    yes
2 Bhavesh   5    yes
3  Chaman   9     no
4   Dimri  16     no

Printing column 1 to 2
     name age
1    Abhi   7
2 Bhavesh   5
3  Chaman   9
4   Dimri  16

heading containing 'a'
     name age
1    Abhi   7
2 Bhavesh   5
3  Chaman   9
4   Dimri  16

heading which matches 'na'
     name
1    Abhi
2 Bhavesh
3  Chaman
4   Dimri
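The example above only exercises select(). As a brief sketch of rename() on the same data, the new column names height and student below are made up for illustration.

```r
library(dplyr)

d <- data.frame(name = c("Abhi", "Bhavesh", "Chaman", "Dimri"),
                age = c(7, 5, 9, 16),
                ht = c(46, NA, NA, 69),
                school = c("yes", "yes", "no", "no"))

# rename() takes new_name = old_name and keeps all columns
renamed <- rename(d, height = ht)
names(renamed)

# select() can also rename, but keeps only the columns listed
picked <- select(d, student = name, age)
names(picked)
```

Use rename() when every column should stay, and select() when the table should be narrowed at the same time.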

mutate() and transmute():

mutate() adds new variables that are functions of existing variables. transmute() does the same but keeps only the new variables.

R




# Create a data frame with missing data
d <- data.frame( name = c("Abhi", "Bhavesh",
                          "Chaman", "Dimri"),
                 age = c(7, 5, 9, 16),
                 ht = c(46, NA, NA, 69),
                 school = c("yes", "yes", "no", "no") )
 
# Add a variable x3, the sum of ht and age,
# keeping all existing columns
mutate(d, x3 = ht + age)

# Compute the same variable x3,
# dropping all existing columns
transmute(d, x3 = ht + age)


Output: 

Calculating a variable x3 which is sum of height and age, printing with ht and age
     name age ht school x3
1    Abhi   7 46    yes 53
2 Bhavesh   5 NA    yes NA
3  Chaman   9 NA     no NA
4   Dimri  16 69     no 85

Calculating a variable x3 which is sum of height and age, printing without ht and age
  x3
1 53
2 NA
3 NA
4 85
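Two further points about mutate() can be sketched on the same data: a new column can be used by later expressions in the same call, and the .keep argument (assuming dplyr 1.0.0 or later) can reproduce the behaviour of transmute(). The column name x3_missing is illustrative.

```r
library(dplyr)

d <- data.frame(name = c("Abhi", "Bhavesh", "Chaman", "Dimri"),
                age = c(7, 5, 9, 16),
                ht = c(46, NA, NA, 69),
                school = c("yes", "yes", "no", "no"))

# x3_missing is computed from x3, which was created in the same call
d2 <- mutate(d, x3 = ht + age, x3_missing = is.na(x3))
d2

# Keep only the newly created column, like transmute()
# (the .keep argument assumes dplyr >= 1.0.0)
only_x3 <- mutate(d, x3 = ht + age, .keep = "none")
only_x3
```

Any arithmetic involving NA returns NA, which is why x3 is missing for the two rows without a height.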

summarise():

Reduces multiple values to a single summary value.

R




# Create a data frame with missing data
d <- data.frame( name = c("Abhi", "Bhavesh",
                          "Chaman", "Dimri"),
                 age = c(7, 5, 9, 16),
                 ht = c(46, NA, NA, 69),
                 school = c("yes", "yes", "no", "no") )
 
# Calculating mean of age
summarise(d, mean = mean(age))

# Calculating min of age
summarise(d, min = min(age))

# Calculating max of age
summarise(d, max = max(age))

# Calculating median of age
summarise(d, med = median(age))


Output: 

Calculating mean of age
  mean
1 9.25
Calculating min of age
  min
1   5
Calculating max of age
  max
1  16
Calculating median of age
  med
1   8
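summarise() can compute several summaries in one call, and combined with group_by() it returns one row per group. The following is a minimal sketch on the same data; the result names overall and by_school are illustrative.

```r
library(dplyr)

d <- data.frame(name = c("Abhi", "Bhavesh", "Chaman", "Dimri"),
                age = c(7, 5, 9, 16),
                ht = c(46, NA, NA, 69),
                school = c("yes", "yes", "no", "no"))

# Several summaries in one call; na.rm = TRUE skips the missing heights
overall <- summarise(d,
                     mean_age = mean(age),
                     mean_ht  = mean(ht, na.rm = TRUE),
                     n        = n())
overall

# One summary row per value of school
by_school <- d %>%
  group_by(school) %>%
  summarise(mean_age = mean(age), n = n())
by_school
```

Without na.rm = TRUE, mean(ht) would return NA because two heights are missing.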

sample_n() and sample_frac():

Take a random sample of rows, either a fixed number of rows (sample_n()) or a fraction of the rows (sample_frac()).

R




# Create a data frame with missing data
d <- data.frame( name = c("Abhi", "Bhavesh",
                          "Chaman", "Dimri"),
                 age = c(7, 5, 9, 16),
                 ht = c(46, NA, NA, 69),
                 school = c("yes", "yes", "no", "no") )
 
# Printing three rows
sample_n(d, 3)
 
# Printing 50 % of the rows
sample_frac(d, 0.50)


Output (the rows are drawn at random, so the result varies between runs):

    name age ht school
1 Chaman   9 NA     no
2  Dimri  16 69     no
3   Abhi   7 46    yes

Printing 50 % of the rows
   name age ht school
1  Abhi   7 46    yes
2 Dimri  16 69     no
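Since dplyr 1.0.0, sample_n() and sample_frac() are superseded by slice_sample(), which covers both cases through its n and prop arguments. The sketch below assumes dplyr 1.0.0 or later; the seed value is arbitrary and only makes repeated runs return the same rows.

```r
library(dplyr)

d <- data.frame(name = c("Abhi", "Bhavesh", "Chaman", "Dimri"),
                age = c(7, 5, 9, 16),
                ht = c(46, NA, NA, 69),
                school = c("yes", "yes", "no", "no"))

# Arbitrary seed so that repeated runs draw the same rows
set.seed(42)

# Three random rows, like sample_n(d, 3)
three_rows <- slice_sample(d, n = 3)
three_rows

# 50 % of the rows, like sample_frac(d, 0.50)
half_rows <- slice_sample(d, prop = 0.5)
half_rows
```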

