Create Lagged Variable by Group in R DataFrame

Last Updated : 29 Jul, 2022

A lagged variable stores, for each row, the value that another variable had in the previous row. The first row has no previous value, so its lagged value is missing (NA). In R, data can be divided into groups and the lag computed separately within each group, so that the first row of every group gets NA rather than a value carried over from the preceding group.

Method 1 : Using dplyr package

The “dplyr” package provides functions for data manipulation. It is installed once with install.packages("dplyr") and then loaded into the working space with library().

The group_by() function groups the rows of a data frame by one or more columns. With several columns, each unique combination of their values that occurs in the data forms a separate group.

Syntax:

group_by(.data, ...)

where .data is the data frame and ... is one or more columns to group the data by
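As a quick illustration of how the grouping columns determine the groups, the sketch below groups a small data frame by one column and then by two. The data frame and its column names (team, year, score) are hypothetical and are not part of the examples in this article.

```r
library(dplyr)

# Hypothetical data frame used only for illustration
df <- data.frame(
  team  = c("x", "x", "y", "y", "y"),
  year  = c(2020, 2021, 2020, 2020, 2021),
  score = c(5, 7, 3, 4, 6)
)

# Group by a single column: one group per distinct team
by_team <- group_by(df, team)

# Group by two columns: one group per distinct (team, year) pair
by_team_year <- group_by(df, team, year)

# n_groups() reports how many groups each grouping produced
print(n_groups(by_team))
print(n_groups(by_team_year))
```

Only combinations that actually occur in the data become groups, so the second call yields four groups here even though team y has three rows.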

Next, mutate() is applied to the grouped data frame. mutate() adds new columns or modifies existing ones and keeps all other columns unchanged. Inside mutate(), the lag() function computes the lagged values of the specified column. Because the data frame is grouped, the lag is computed separately within each group.

Syntax:

lag(col, n = 1L, default = NA)

Parameters :

col – The column (vector) whose values are to be lagged.

n – (Default : 1) The number of positions to lag by.

default – (Default : NA) The value used for positions that have no earlier row to take a value from.
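To see the effect of n and default in isolation, lag() can be called on a plain vector outside of any data frame. This is a minimal sketch; the vector x is made up for illustration.

```r
library(dplyr)

# Hypothetical numeric vector
x <- c(10, 20, 30, 40)

# Shift by one position; the first element has no predecessor
lag_one <- dplyr::lag(x)

# Shift by two positions; the first two elements have no predecessor
lag_two <- dplyr::lag(x, n = 2)

# Fill the positions without a predecessor with 0 instead of NA
lag_zero <- dplyr::lag(x, n = 1, default = 0)

print(lag_one)   # NA 10 20 30
print(lag_two)   # NA NA 10 20
print(lag_zero)  #  0 10 20 30
```

Writing dplyr::lag() explicitly avoids accidentally calling stats::lag(), which has the same name but behaves differently.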

Within each group, the first row has no previous value, so its lagged value is NA. Every subsequent row receives the value from the previous row of the same group.

The result is a grouped tibble, a data-frame-like structure whose printed form also shows the number of groups and the class of each column.

Example 1:

R

library("dplyr")

# creating a data frame
data_frame <- data.frame(col1 = rep(1:3, each = 3),
                         col2 = letters[1:3])

print("Original DataFrame")
print(data_frame)

# group by col1, then lag col2 within each group
data_mod <- data_frame %>%
  group_by(col1) %>%
  mutate(laggedval = dplyr::lag(col2, n = 1, default = NA))

print("Modified Data")
print(data_mod)

Output

[1] "Original DataFrame"
  col1 col2
1    1    a
2    1    b
3    1    c
4    2    a
5    2    b
6    2    c
7    3    a
8    3    b
9    3    c
[1] "Modified Data"
# A tibble: 9 x 3
# Groups:   col1 [3]
   col1 col2  laggedval
  <int> <fct> <fct>
1     1 a     NA
2     1 b     a
3     1 c     b
4     2 a     NA
5     2 b     a
6     2 c     b
7     3 a     NA
8     3 b     a
9     3 c     b

Grouping can also be based on multiple columns, in which case each distinct combination of values in those columns that appears in the data forms its own group.

Example 2:

R

library("dplyr")

# creating a data frame
data_frame <- data.frame(col1 = rep(1:3, each = 3),
                         col2 = letters[1:3],
                         col3 = c(1, 4, 1, 2, 2, 2, 1, 2, 2))

print("Original DataFrame")
print(data_frame)

# group by the combination of col1 and col3, then lag col2 within each group
data_mod <- data_frame %>%
  group_by(col1, col3) %>%
  mutate(laggedval = dplyr::lag(col2, n = 1, default = NA))

print("Modified Data")
print(data_mod)

Output

[1] "Original DataFrame"
  col1 col2 col3
1    1    a    1
2    1    b    4
3    1    c    1
4    2    a    2
5    2    b    2
6    2    c    2
7    3    a    1
8    3    b    2
9    3    c    2
[1] "Modified Data"
# A tibble: 9 x 4
# Groups:   col1, col3 [5]
   col1 col2   col3 laggedval
  <int> <fct> <dbl> <fct>
1     1 a         1 NA
2     1 b         4 NA
3     1 c         1 a
4     2 a         2 NA
5     2 b         2 a
6     2 c         2 b
7     3 a         1 NA
8     3 b         2 NA
9     3 c         2 b
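Note that lag() takes the "previous" value from the current row order within each group. If the rows are not already in the intended order, one option is to sort them first. The sketch below assumes a hypothetical data frame with columns store, day and units, none of which come from the examples above.

```r
library(dplyr)

# Hypothetical data: rows are deliberately not in day order
sales <- data.frame(
  store = c("A", "A", "A", "B", "B"),
  day   = c(2, 1, 3, 2, 1),
  units = c(20, 10, 30, 200, 100)
)

sales_lagged <- sales %>%
  arrange(store, day) %>%                     # put each store's rows in day order
  group_by(store) %>%
  mutate(prev_units = dplyr::lag(units)) %>%  # previous day's units per store
  ungroup()                                   # drop the grouping once done

print(sales_lagged)
```

After sorting, the first day of each store gets NA and every later day gets the units of the day before it for the same store.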

Method 2 : Using duplicated()

First, the number of rows of the data frame is obtained with nrow(). Indexing the column with the negative of that number returns all of its values except the last. Prepending NA to this vector shifts every value down by one row: the first row gets NA and each later row gets the value of the row before it.

This shift ignores group boundaries, so the first row of each group initially holds the last value of the preceding group. duplicated() returns FALSE for the first occurrence of each group value, so !duplicated() marks those rows and their lagged values are overwritten with NA. The result is stored in a new column of the data frame. This approach assumes that the rows of each group are adjacent.

Example:

R

# creating a data frame
data_frame <- data.frame(col1 = rep(1:3, each = 3),
                         col2 = letters[1:3])

print("Original DataFrame")
print(data_frame)

# a negative index drops that row, so this selects all but the last value
last_row <- -nrow(data_frame)
excl_last_row <- as.character(data_frame$col2[last_row])

# prepend NA to shift the values of col2 down by one row
data_frame$lag_value <- c(NA, excl_last_row)

# set the first row of each group to NA
data_frame$lag_value[!duplicated(data_frame$col1)] <- NA

print("Modified Data")
print(data_frame)

Output

[1] "Original DataFrame"
  col1 col2
1    1    a
2    1    b
3    1    c
4    2    a
5    2    b
6    2    c
7    3    a
8    3    b
9    3    c
[1] "Modified Data"
  col1 col2 lag_value
1    1    a      <NA>
2    1    b         a
3    1    c         b
4    2    a      <NA>
5    2    b         a
6    2    c         b
7    3    a      <NA>
8    3    b         a
9    3    c         b
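The shift in Method 2 works on the whole column at once, so it gives the expected result only when the rows of each group sit next to each other. When they are interleaved, a different base R technique can be used instead: ave(), which applies a function within each group and returns the results in the original row order. The following is a sketch under that assumption; the data frame and the helper shift_one() are hypothetical.

```r
# Hypothetical data: the rows of groups 1 and 2 are interleaved
df <- data.frame(
  grp = c(1, 2, 1, 2, 1),
  val = c(10, 100, 20, 200, 30)
)

# Hypothetical helper: shift a vector down by one position,
# padding the front with NA and dropping the last value
shift_one <- function(v) c(NA, head(v, -1))

# ave() calls shift_one() on the values of each group separately
# and places the results back in the original row positions
df$lag_value <- ave(df$val, df$grp, FUN = shift_one)

print(df)
```

Here the first row of each group gets NA, and every other row gets the previous value from its own group even though the groups alternate.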
