Calculate difference between dataframe rows by group in R

Last Updated : 16 Dec, 2021

In this article, we will see how to calculate the difference between consecutive rows within each group of a data frame in the R programming language.

Method 1: Using dplyr package

The group_by() function divides the data into groups based on the values in one or more columns. The columns to group by are passed as arguments to the function.

Syntax:

group_by(col1, col2, …)
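As a small illustration of the grouping step, the sketch below groups a toy data frame and inspects the result. The data frame and its column names (team, score) are made up for this example.

```r
library(dplyr)

# Toy data (hypothetical column names, used only for illustration)
df <- data.frame(team  = c("a", "b", "a", "b"),
                 score = c(1, 2, 3, 4))

# group_by() does not change the rows; it only records the grouping
grouped <- group_by(df, team)

group_count <- n_groups(grouped)    # number of distinct groups
group_cols  <- group_vars(grouped)  # names of the grouping columns

print(group_count)
print(group_cols)
```

Grouping alone leaves the data unchanged; it only affects how later verbs such as mutate() are applied.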

This is followed by mutate(), which adds new columns or modifies existing ones while keeping all rows. The name of the new column is given on the left-hand side of the assignment inside mutate(). The difference from the previous row can be calculated using the lag() function of this package, which returns the previous values in a vector.

Syntax:

lag(x, n = 1L, default = NA)

Parameters:

  • x – A vector of values
  • n – Number of positions to lag by
  • default – the value used for non-existent rows (a missing value unless specified)
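To make the parameters concrete, here is a minimal sketch of lag() on a plain vector. The vector x is arbitrary sample data chosen for this illustration.

```r
library(dplyr)

x <- c(10, 12, 15, 11)

# Shift by one position; the first element has no predecessor, so it is NA
lag_one <- dplyr::lag(x)

# Shift by two positions and fill the gaps with 0 instead of NA
lag_two <- dplyr::lag(x, n = 2, default = 0)

# Row-to-row difference: current value minus previous value
row_diff <- x - dplyr::lag(x)

print(lag_one)
print(lag_two)
print(row_diff)
```

The call is written as dplyr::lag() here because base R also ships a stats::lag() function that behaves differently.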

The new column is created by subtracting the lagged value of the column from the value in the current row. In the example below, default = first(col3) makes lag() return the group's first value for the first row of each group, so the difference in that row is 0 rather than NA (unless the first value is itself NA).

Example:

R




# loading the required library
library("dplyr")
 
# creating a data frame; col1 is drawn at random,
# so the groups differ between runs
data_frame <- data.frame(col1 = sample(6:9, 9, replace = TRUE),
                         col2 = letters[1:3],
                         col3 = c(1, 4, 5, 1, NA, NA, 2, NA, 2))
 
print("Original DataFrame")
print(data_frame)
 
print("Modified DataFrame")
 
# computing the difference from the previous row within each group;
# the first row of each group is compared with itself
data_frame %>%
  group_by(col1) %>%
  mutate(diff = col3 - lag(col3, default = first(col3)))


 
Output 

[1] "Original DataFrame" 
  col1 col2 col3 
1    6    a    1 
2    9    b    4 
3    7    c    5 
4    6    a    1 
5    6    b   NA 
6    9    c   NA 
7    6    a    2 
8    8    b   NA 
9    7    c    2 
[1] "Modified DataFrame" 
# A tibble: 9 x 4
# Groups:   col1 [4]
   col1 col2   col3  diff
  <int> <chr> <dbl> <dbl>
1     6 a         1     0 
2     9 b         4     0 
3     7 c         5     0 
4     6 a         1     0 
5     6 b        NA    NA 
6     9 c        NA    NA 
7     6 a         2    NA 
8     8 b        NA    NA
9     7 c         2    -3
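The example above uses default = first(col3), which gives 0 in the first row of each group. The sketch below contrasts that with leaving default unset, which gives NA instead. The group labels are fixed by hand here (an assumption made for reproducibility, since the article draws them with sample()), and the column names grp and val are illustrative.

```r
library(dplyr)

# Fixed group labels so the result is the same on every run
df <- data.frame(grp = c(6, 9, 7, 6, 9, 7),
                 val = c(1, 4, 5, 3, 10, 2))

result <- df %>%
  group_by(grp) %>%
  mutate(diff_zero = val - lag(val, default = first(val)),  # first row of a group: 0
         diff_na   = val - lag(val)) %>%                    # first row of a group: NA
  ungroup()

print(result)
```

Which variant is preferable depends on whether a missing predecessor should count as "no change" or as "unknown".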

Method 2: Using data.table package

The data.table indexing syntax can be used to calculate the difference between rows by group in R. The ‘by’ argument specifies the column to group the data by. All rows are retained, and a new column holding the per-group row differences is added with the := operator. The difference is calculated by subtracting from each value of the specified column the previous value, obtained with the shift() function. shift() is used to lag (or lead) vectors and lists.

Syntax: 

data_frame[ , new-col-name := reqd-col - shift(reqd-col), by = grouping-col]

The first row of each group has no previous value, so it gets NA in the new column.
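A minimal sketch of shift() on a plain vector shows where the NA comes from. The vector x is arbitrary sample data for this illustration.

```r
library(data.table)

x <- c(1, 4, 11, 2)

# Default: lag by one position, padding the start with NA
lagged <- shift(x)

# Pad with 0 instead of NA
lagged_zero <- shift(x, n = 1, fill = 0)

# type = "lead" shifts in the other direction (next value instead of previous)
led <- shift(x, type = "lead")

# Difference from the previous element
row_diff <- x - shift(x)

print(lagged)
print(lagged_zero)
print(led)
print(row_diff)
```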

Example: 

R




# loading the required library
library("data.table")
 
# creating a data frame
data_frame <- data.table(col1 = sample(6:9, 9 , replace = TRUE),
                         col2 = letters[1:3],
                         col3 = c(1,4,5,1,9,11,2,7,2))
 
print ("Original DataFrame")
print (data_frame)
 
# computing difference of each group
data_frame[ , diff := col3 - shift(col3), by = col1]
print ("Modified DataFrame")
print (data_frame)


 
Output 

[1] "Original DataFrame" 
   col1 col2 col3
1:    8    a    1 
2:    8    b    4 
3:    7    c    5 
4:    6    a    1 
5:    6    b    9 
6:    8    c   11 
7:    8    a    2 
8:    9    b    7 
9:    7    c    2 
[1] "Modified DataFrame" 
   col1 col2 col3 diff 
1:    8    a    1   NA 
2:    8    b    4    3 
3:    7    c    5   NA 
4:    6    a    1   NA 
5:    6    b    9    8 
6:    8    c   11    7 
7:    8    a    2   -9 
8:    9    b    7   NA 
9:    7    c    2   -3
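Because col1 is drawn with sample(), the groups (and therefore the differences) change from run to run. The sketch below fixes the random seed so the result is repeatable, and checks that each group contributes exactly one NA. The seed value 42 is an arbitrary choice for illustration.

```r
library(data.table)

set.seed(42)  # arbitrary seed, only to make the random groups repeatable

dt <- data.table(col1 = sample(6:9, 9, replace = TRUE),
                 col3 = c(1, 4, 5, 1, 9, 11, 2, 7, 2))

# := adds the column by reference; no reassignment is needed
dt[, diff := col3 - shift(col3), by = col1]

# col3 has no missing values, so the only NAs are the first rows of groups
na_count    <- sum(is.na(dt$diff))
group_count <- uniqueN(dt$col1)

print(dt)
print(na_count == group_count)
```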

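Method 3: Using ave() method

The ave() function, available in a standard R installation (stats package), applies a function to the subsets of a vector defined by the level combinations of one or more grouping factors and returns a vector of the same length as the input. By default, it computes group averages.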

Syntax:

ave(x, group, FUN = mean)

Parameters:

  • x – the required data frame column
  • group – the grouping variables
  • FUN – The function to apply for each factor level combination.
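The following sketch shows ave() first with its default FUN (the group mean) and then with a custom function that computes within-group differences. The vectors scores and team are made-up sample data.

```r
scores <- c(1, 4, 5, 1, 9, 11)
team   <- c("a", "a", "b", "b", "a", "b")

# Default FUN = mean: every element is replaced by its group's mean
group_mean <- ave(scores, team)

# Custom FUN: difference from the previous element of the same group,
# with NA for the first element of each group
group_diff <- ave(scores, team, FUN = function(x) c(NA, diff(x)))

print(group_mean)
print(group_diff)
```

Note that the result is returned in the original row order, not sorted by group, which is why it can be assigned directly to a new column.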

The function used here computes, for each row, the difference between the column value in that row and the value in the previous row of the same group. The first row of each group has no previous value, so it is set to NA in the new column.

Example: 

R




# creating a data frame
data_frame <- data.frame(col1 = sample(6:9, 9 , replace = TRUE),
                         col2 = letters[1:3],
                         col3 = c(1,4,5,1,9,11,2,7,2))
 
print ("Original DataFrame")
print (data_frame)
 
# computing difference of each group
data_frame$diff <- ave(data_frame$col3, factor(data_frame$col1),
                       FUN=function(x) c(NA,diff(x)))
                        
print ("Modified DataFrame")
print (data_frame)


 
Output 

[1] "Original DataFrame" 
  col1 col2 col3
1    9    a    1 
2    9    b    4 
3    6    c    5 
4    7    a    1 
5    6    b    9 
6    7    c   11
7    9    a    2 
8    9    b    7 
9    9    c    2
[1] "Modified DataFrame" 
  col1 col2 col3 diff
1    9    a    1   NA 
2    9    b    4    3 
3    6    c    5   NA 
4    7    a    1   NA 
5    6    b    9    4 
6    7    c   11   10 
7    9    a    2   -2 
8    9    b    7    5 
9    9    c    2   -5

 


