Skip to content
Related Articles

Related Articles

Improve Article

How to calculate time difference with previous row of a dataframe by group in R

  • Last Updated : 29 Jun, 2021

A dataframe may consist of different values belonging to groups. The columns may have values belonging to different data types or time frames as POSIXct objects. These objects allow the application of mathematical operations easily, which can be performed in the following ways : 

Method 1: Using dplyr package

The group_by method is used to divide and segregate date based on groups contained within the specific columns. The required column to group by is specified as an argument of this function. It may contain multiple column names.

Syntax:

group_by(col1, col2, …)

This is followed by the application of mutate() method which is used to shift orientations and perform manipulations in the data. The new column name can be specified using the new column name. The difference from the previous row can be calculated using the lag() method of this library. This method finds the previous values in a vector.



Syntax:

lag(x, n = 1L, default = NA)

Parameter :

  • x  – A vector of values
  • n – Number of positions to lag by
  • default (Default : NA)- the value used for non-existent rows. 

The first occurrences of any group are replaced by NA values. 

Example:

R




library(dplyr)
  
# creating a dataframe
data_frame <- data.frame(col1 = sample(6:9, 5 , replace = TRUE),
                         col3 =  c(as.POSIXct("2021-05-08 08:32:07"),
                                   as.POSIXct("2021-07-18 00:21:07"),
                                   as.POSIXct("2020-11-28 23:32:09"),
                                   as.POSIXct("2021-05-11 18:32:07"),
                                   as.POSIXct("2021-05-08 08:32:07"))
                         )
print ("Original DataFrame")
print (data_frame)
  
# comouting difference of each group
data_frame %>%
  arrange(col1, col3) %>%
  group_by(col1) %>%
  mutate(diff = col3 - lag(col3))

Output

[1] "Original DataFrame" 
   col1                col3
1    8 2021-05-08 08:32:07 
2    8 2021-07-18 00:21:07 
3    7 2020-11-28 23:32:09 
4    6 2021-05-11 18:32:07 
5    7 2021-05-08 08:32:07 
# A tibble: 5 x 3 
# Groups:   col1 [3]    
           col1 col3                diff            
         <int> <dttm>              <drtn>        
1     6 2021-05-11 18:32:07       NA secs 
2     7 2020-11-28 23:32:09       NA secs 
3     7 2021-05-08 08:32:07 13856398 secs 
4     8 2021-05-08 08:32:07       NA secs 
5     8 2021-07-18 00:21:07  6104940 secs

Method 2: Using tapply method

The tapply() method is used to apply a function on the list or dataframe object. The specified function, which may be user-defined or pre-defined is applied to each cell of the dataframe object.



Syntax: 

tapply(X, INDEX, FUN )

Parameter : 

  • X – an R object, a dataframe. Typically vector-like, allowing sub-setting with [.
  • INDEX – a list of one or more factors, each of same length as X. The elements are coerced to factors by as.factor.
  • FUN – a function to be applied

In this scenario, the function is to compute the difference in time frames, with the units as seconds. All the first instances of the values encountered in a group are replaced by zero. 

Example:

R




# creating a dataframe
data_frame <- data.frame(col1 = sample(6:9, 5 , replace = TRUE),
                         col3 =  c(as.POSIXct("2021-05-08 08:32:07"),
                                   as.POSIXct("2021-07-18 00:21:07"),
                                   as.POSIXct("2020-11-28 23:32:09"),
                                   as.POSIXct("2021-05-11 18:32:07"),
                                   as.POSIXct("2021-05-08 08:32:07"))
                         )
print ("Original DataFrame")
print (data_frame)
  
# comouting difference of each group
data_frame$diff <- unlist(tapply(data_frame$col3, INDEX = data_frame$col1,
                          FUN = function(x) c(0, `units<-`(diff(x), "secs"))))
                                   
print ("Modified DataFrame")
print (data_frame)

Output

[1] "Original DataFrame" 
col1                col3 
1    7 2021-05-08 08:32:07 
2    6 2021-07-18 00:21:07 
3    8 2020-11-28 23:32:09 
4    7 2021-05-11 18:32:07 
5    6 2021-05-08 08:32:07 
[1] "Modified DataFrame" 
col1                col3     diff 
1    7 2021-05-08 08:32:07        0 
2    6 2021-07-18 00:21:07 -6104940 
3    8 2020-11-28 23:32:09        0 
4    7 2021-05-11 18:32:07   295200 
5    6 2021-05-08 08:32:07        0

Method 3: Using data.table

A new column can be added to calculate the time difference between rows of the data.table. The difftime() method can be used to calculate this difference. It is used to calculate the time intervals or differences. 

Syntax:

difftime (t1 , t2 , units)



Parameter : 

  • t1, t2 – date-time or date objects.
  • units – units in the form of character string to return the result

In order to find the next time zone value, that is, t2 to be applied in difftime(), the shift() method is used to induce lead or lag in the specified input vector or list. 

Syntax:

shift (x , fill )

Parameter : 

  • x – A vector, list, data.frame or data.table.
  • fill – indicator of the padding value to introduce

The by attribute is added to the group the data by the specified column name. 

Example:

R




library("data.table")
  
# creating a dataframe
dt <- data.table(col1 = sample(6:9, 5 , replace = TRUE),
                         col3 =  c(as.POSIXct("2021-05-08 08:32:07"),
                                   as.POSIXct("2021-07-18 00:21:07"),
                                   as.POSIXct("2020-11-28 23:32:09"),
                                   as.POSIXct("2021-05-11 18:32:07"),
                                   as.POSIXct("2021-05-08 08:32:07"))
                         )
print ("Original DataFrame")
print (dt)
  
# comouting difference of each group
dt[, diff := difftime(col3, shift(col3, fill=col3[1L]),
                      units="secs"), by=col1]
  
print ("Modified DataFrame")
print (dt)

Output

[1] "Original DataFrame" 
col1                col3 
1:    7 2021-05-08 08:32:07 
2:    7 2021-07-18 00:21:07 
3:    8 2020-11-28 23:32:09 
4:    8 2021-05-11 18:32:07 
5:    8 2021-05-08 08:32:07 
[1] "Modified DataFrame" 
col1                col3         diff 
1:    7 2021-05-08 08:32:07        0 secs 
2:    7 2021-07-18 00:21:07  6104940 secs 
3:    8 2020-11-28 23:32:09        0 secs 
4:    8 2021-05-11 18:32:07 14151598 secs 
5:    8 2021-05-08 08:32:07  -295200 secs



My Personal Notes arrow_drop_up
Recommended Articles
Page :