How to Aggregate multiple columns in Data.table in R ?

Last Updated : 17 Jun, 2021

In this article, we will discuss how to aggregate multiple columns in Data.table in R Programming Language.

A data.table contains elements that may be either duplicate or unique. As a result of this, the variables are divided into categories depending on the sets in which they can be segregated. The column values can be summed over such that the columns contains summation of frequency counts of variables.

Method 1 : Using list() method

The standard data table indexing methods can be used to segregate and aggregate data contained in a data frame. The “by” attribute is used to divide the data based on the specific column names, provided inside the list() method. A new variable can be added containing the sum of values obtained using the sum() method containing the columns to be summed over.

Syntax:

df[ , new-col-name:=sum(reqd-col-name), by = list(grouping columns)]

Example:

R

# importing required libraries 
library(data.table) 
  
# creating a data frame 
data_frame <- data.table(col1 = sample(6:8, 12 , replace = TRUE), 
                         col2 = letters[1:2], 
                         col3 = 1) 
  
print ("Original DataFrame") 
print (data_frame) 
  
# compute the sum of col3 using grouping  
# of col1 and col2 
data_frame[ , count:=sum(col3), by = list(col1, col2)]  
  
print ("Modified DataFrame") 
print (data_frame) 

Output

[1] "Original DataFrame" 
   col1 col2 col3  
1:    7    a    1  
2:    6    b    1  
3:    8    a    1  
4:    7    b    1  
5:    6    a    1  
6:    8    b    1  
7:    7    a    1  
8:    8    b    1  
9:    8    a    1 
10:   7    b    1 
11:   6    a    1 
12:   7    b    1 
[1] "Modified DataFrame" 
  col1 col2 col3 count  
1:    7    a    1   2  
2:    6    b    1   1  
3:    8    a    1   2  
4:    7    b    1   3  
5:    6    a    1   2  
6:    8    b    1   2  
7:    7    a    1   2  
8:    8    b    1   2  
9:    8    a    1   2 
10:   7    b    1   3 
11:   6    a    1   2 
12:   7    b    1   3

Method 2 : Using lapply()

The data.table library can be installed and loaded into the working space. The lapply() method can then be applied over this data.table object, to aggregate multiple columns using a group. The lapply() method is used to return an object of the same length as that of the input list. Each element returned is the result of the application of function, FUN.

Syntax:

lapply(obj, FUN, …)

Parameters :

obj – a vector (atomic or list) or an expression object.

FUN – the function to be applied over elements.

The .SD attribute is used to calculate summary statistics for a larger list of variables. The sum function is applied as the function to compute the sum of the elements categorically falling within each group variable. The ‘by’ attribute is equivalent to the ‘group by’ in SQL while performing aggregation.

Syntax:

lapply (.SD , sum )

Example:

R

# importing required libraries 
library(data.table) 
  
# creating a data frame 
data_frame <- data.table(col1 = sample(6:8, 12 , replace = TRUE), 
                         col2 = 1:2, 
                         col3 = 1) 
  
print ("Original DataFrame") 
print (data_frame) 
  
print ("Modified DataFrame") 
data_frame[, lapply(.SD, sum), by= col1]

Output

[1] "Original DataFrame"
col1 col2 col3  
1:    7    1    1  
2:    7    2    1  
3:    8    1    1  
4:    8    2    1  
5:    8    1    1  
6:    6    2    1  
7:    7    1    1  
8:    7    2    1  
9:    8    1    1 
10:    6    2    1 
11:    7    1    1 
12:    7    2    1 
[1] "Modified DataFrame" 
   col1 col2 col3 
1:    7    9    6 
2:    8    5    4 
3:    6    4    2

Method 3: Using aggregate method

The aggregate() function in R is used to produce summary statistics for one or more variables in a data frame or a data.table respectively.

Syntax:

aggregate(sum_var ~ group_var, data = df, FUN = sum)

Parameters :

sum_var – The columns to compute sums for

group_var – The columns to group data by

data – The data frame to take

FUN – The function to apply

In case, the grouped variable are a combination of columns, the cbind() method is used to combine columns to be retrieved. The FUN to be applied is equivalent to sum, where each column’s summation over particular categorical group is returned.

Example:

R

# importing required libraries 
library(data.table) 
  
# creating a data frame 
data_frame <- data.table(col1 = sample(6:8, 12 , replace = TRUE), 
                         col2 = 1:2, 
                         col3 = 1) 
  
print ("Original DataFrame") 
print (data_frame) 
  
print ("Modified DataFrame") 
aggregate(cbind(col2,col3) ~ col1, data = data_frame, FUN = sum)

Output

[1] "Original DataFrame"
col1 col2 col3 
1:    6    1    1  
2:    6    2    1  
3:    7    1    1  
4:    6    2    1  
5:    6    1    1  
6:    7    2    1  
7:    8    1    1  
8:    6    2    1  
9:    8    1    1 
10:    6    2    1 
11:    7    1    1 
12:    7    2   
[1] "Modified DataFrame" 
col1 col2 col3 
1    6   10    6 
2    7    6    4 
3    8    2    2

Suggest improvement

How to edit CSV files in R

Control Point Border Thickness in ggplot2 in R

Share your thoughts in the comments

How to Aggregate multiple columns in Data.table in R ?

Method 1 : Using list() method

R

Method 2 : Using lapply()

R

Method 3: Using aggregate method

R

Please Login to comment...

Similar Reads

What kind of Experience do you want to share?