How to get summary statistics by group in R

Last Updated: 23 Aug, 2021

In this article, we will learn how to get summary statistics by group in the R programming language.

Sample dataframe in use:

   grpBy num
1      A  20
2      A  30
3      A  40
4      B  50
5      B  50
6      C  70
7      C  80
8      C  25
9      C  35
10     D  45
11     E  55
12     E  65
13     E  75
14     E  85
15     E  95
16     E 105

Method 1: Using tapply()

The tapply() function in R applies a function to subsets of a vector, where the subsets are defined by the levels of one or more factors. It takes three arguments here. The first is the data column. The second is the column by which the data is grouped; in this example, the data is grouped by the letters in grpBy. The third is the function applied to each group; here we pass summary() because we want summary statistics for each group.

Syntax: tapply(df$data, df$groupBy, summary)

Parameters:

  • df$data: data on which summary function is to be applied
  • df$groupBy: column according to which the data should be grouped by
  • summary: summary function is applied to each group
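The third argument is not limited to summary(); any function that accepts a vector can be passed. A minimal sketch, assuming the same sample data as in this article (the names group_means and group_ranges are illustrative, not part of the original):

```r
# Sample data matching the dataframe used in this article
num <- c(20, 30, 40, 50, 50, 70, 80, 25, 35, 45, 55, 65, 75, 85, 95, 105)
char <- factor(rep(LETTERS[1:5], c(3, 2, 4, 1, 6)))
df <- data.frame(grpBy = char, num = num)

# A built-in function: one mean per group, returned as a named array
group_means <- tapply(df$num, df$grpBy, mean)
print(group_means)

# An anonymous function: spread (max - min) within each group
group_ranges <- tapply(df$num, df$grpBy, function(x) max(x) - min(x))
print(group_ranges)
```

When the function returns a single value per group, tapply() returns a named array rather than a list, which is often easier to use in further calculations.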

Example: R program to get summary statistics by group

R

# Create the sample dataframe
num <- c(20, 30, 40, 50, 50, 70, 80, 25, 35, 45, 55, 65, 75, 85, 95, 105)
char <- factor(rep(LETTERS[1:5], c(3, 2, 4, 1, 6)))
df <- data.frame(grpBy = char, num = num)

# Apply summary() to num within each level of grpBy
tapply(df$num, df$grpBy, summary)

Output:

$A
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
     20      25      30      30      35      40 
$B
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
     50      50      50      50      50      50 
$C
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   25.0    32.5    52.5    52.5    72.5    80.0 
$D
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
     45      45      45      45      45      45 
$E
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   55.0    67.5    80.0    80.0    92.5   105.0 
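The result above is a list with one summary per group. If a single table is more convenient, the pieces can be stacked row by row. A minimal sketch using base R's do.call() and rbind(); the names by_group and summary_matrix are illustrative assumptions:

```r
# Sample data matching the dataframe used in this article
num <- c(20, 30, 40, 50, 50, 70, 80, 25, 35, 45, 55, 65, 75, 85, 95, 105)
grp <- factor(rep(LETTERS[1:5], c(3, 2, 4, 1, 6)))

# One summary per group, as in the example above
by_group <- tapply(num, grp, summary)

# Stack the five summaries into a matrix: one row per group,
# one column per statistic
summary_matrix <- do.call(rbind, by_group)
print(summary_matrix)
```

This yields a numeric matrix, so individual statistics can be selected by column name, for example summary_matrix[, "Median"].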

Method 2: Using data.table

In this approach, we first load the data.table package using the library() function. We then convert the data.frame to a data.table. A data.table is an enhanced version of a data.frame, popular in R for its execution speed and concise syntax. Finally, we follow the syntax below to compute the summary statistics for each group. Wrapping summary() in as.list() turns each statistic into its own column.

Syntax:

setDT(df)

df[, as.list(summary(num)), by = grpBy]

Parameters:

  • df: dataframe object
  • num: data column
  • grpBy: column according to which grouping is to be done
  • summary(): function applied on each group

Example: R program to get summary statistics by group

R

library(data.table)

# Create the sample dataframe
num <- c(20, 30, 40, 50, 50, 70, 80, 25, 35, 45, 55, 65, 75, 85, 95, 105)
char <- factor(rep(LETTERS[1:5], c(3, 2, 4, 1, 6)))
df <- data.frame(grpBy = char, num = num)

# Convert to a data.table by reference, then summarise num per group
setDT(df)
df[, as.list(summary(num)), by = grpBy]

Output:

   grpBy Min. 1st Qu. Median Mean 3rd Qu. Max.
1:     A   20    25.0   30.0 30.0    35.0   40
2:     B   50    50.0   50.0 50.0    50.0   50
3:     C   25    32.5   52.5 52.5    72.5   80
4:     D   45    45.0   45.0 45.0    45.0   45
5:     E   55    67.5   80.0 80.0    92.5  105
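The same by = grouping also works with a hand-picked set of statistics instead of summary(). A minimal sketch, assuming the data.table package is installed; the output column names n, mean and median and the object name stats are illustrative choices:

```r
library(data.table)

# Sample data matching the dataframe used in this article
num <- c(20, 30, 40, 50, 50, 70, 80, 25, 35, 45, 55, 65, 75, 85, 95, 105)
char <- factor(rep(LETTERS[1:5], c(3, 2, 4, 1, 6)))
df <- data.frame(grpBy = char, num = num)
setDT(df)

# .() builds a list of named columns; .N is the row count of each group
stats <- df[, .(n = .N, mean = mean(num), median = median(num)), by = grpBy]
print(stats)
```

Naming the columns explicitly avoids names such as "1st Qu.", which need backticks when referenced later.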

Method 3: Using split() function and purrr package

The split() function in R divides data into groups defined by a factor. We load the purrr package using the library() function. purrr is a functional programming toolkit that provides many useful functions, such as map(). The map() function applies a function to every group and returns the results as a list. It replaces an explicit for loop and makes the code easier to read. Because split() is applied to the whole dataframe here, summary() describes both columns of each piece, including the grpBy counts.

Syntax: df %>% split(.$grpBy) %>% map(summary)

Parameters:

  • df: dataframe object
  • grpBy: dataframe column according to which it should be grouped

Example: R program to get summary statistics by group

R

library(purrr)

# Create the sample dataframe
num <- c(20, 30, 40, 50, 50, 70, 80, 25, 35, 45, 55, 65, 75, 85, 95, 105)
char <- factor(rep(LETTERS[1:5], c(3, 2, 4, 1, 6)))
df <- data.frame(grpBy = char, num = num)

# Split the dataframe by group, then summarise each piece
df %>% split(.$grpBy) %>% map(summary)

Output:

$A
 grpBy      num    
 A:3   Min.   :20  
 B:0   1st Qu.:25  
 C:0   Median :30  
 D:0   Mean   :30  
 E:0   3rd Qu.:35  
       Max.   :40  
$B
 grpBy      num    
 A:0   Min.   :50  
 B:2   1st Qu.:50  
 C:0   Median :50  
 D:0   Mean   :50  
 E:0   3rd Qu.:50  
       Max.   :50  
$C
 grpBy      num      
 A:0   Min.   :25.0  
 B:0   1st Qu.:32.5  
 C:4   Median :52.5  
 D:0   Mean   :52.5  
 E:0   3rd Qu.:72.5  
       Max.   :80.0  
$D
 grpBy      num    
 A:0   Min.   :45  
 B:0   1st Qu.:45  
 C:0   Median :45  
 D:1   Mean   :45  
 E:0   3rd Qu.:45  
       Max.   :45  
$E
 grpBy      num       
 A:0   Min.   : 55.0  
 B:0   1st Qu.: 67.5  
 C:0   Median : 80.0  
 D:0   Mean   : 80.0  
 E:6   3rd Qu.: 92.5  
       Max.   :105.0  
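The output above also counts the grpBy levels, because summary() receives a whole dataframe for each group. If only the numeric statistics are wanted, one option is to split just the num column. A minimal sketch, assuming the purrr package is installed; the names groups, per_group and group_means are illustrative:

```r
library(purrr)

# Sample data matching the dataframe used in this article
num <- c(20, 30, 40, 50, 50, 70, 80, 25, 35, 45, 55, 65, 75, 85, 95, 105)
char <- factor(rep(LETTERS[1:5], c(3, 2, 4, 1, 6)))
df <- data.frame(grpBy = char, num = num)

# Split only the numeric vector: a named list of five numeric vectors
groups <- split(df$num, df$grpBy)

# map() returns a list with one summary per group
per_group <- map(groups, summary)
print(per_group$C)

# map_dbl() returns a plain named numeric vector instead of a list
group_means <- map_dbl(groups, mean)
print(group_means)
```

The typed variant map_dbl() is handy when each group reduces to a single number.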

Method 4: Using dplyr

The group_by() function groups the data by the variable provided. The summarize() function then computes the min, q1, median, mean, q3 and max of the grouped data. These are the same values produced by the summary() function. The only difference is that here we have to call each function explicitly on the grouped data inside summarize(), which reduces each group of a column to a single value according to the function specified.

Syntax: 

df %>%
  group_by(grpBy) %>%
  summarize(min = min(num), q1 = quantile(num, 0.25), median = median(num),
            mean = mean(num), q3 = quantile(num, 0.75), max = max(num))

Parameters: 

  • df: dataframe object
  • grpBy: column according to which grouping is to be done

Example: R program to get summary statistics by group

R

library(dplyr)

# Create the sample dataframe
num <- c(20, 30, 40, 50, 50, 70, 80, 25, 35, 45, 55, 65, 75, 85, 95, 105)
char <- factor(rep(LETTERS[1:5], c(3, 2, 4, 1, 6)))
df <- data.frame(grpBy = char, num = num)

# Group by grpBy, then compute each statistic explicitly
df %>%
  group_by(grpBy) %>%
  summarize(min = min(num),
            q1 = quantile(num, 0.25),
            median = median(num),
            mean = mean(num),
            q3 = quantile(num, 0.75),
            max = max(num))

Output:

  grpBy   min    q1 median  mean    q3   max
  <fct> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>
1 A        20  25     30    30    35      40
2 B        50  50     50    50    50      50
3 C        25  32.5   52.5  52.5  72.5    80
4 D        45  45     45    45    45      45
5 E        55  67.5   80    80    92.5   105
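That these six explicit calls reproduce summary() can be checked in base R without dplyr. A minimal sketch for group C only; the names c_vals, manual and from_summary are illustrative and not from the original:

```r
# Sample data matching the dataframe used in this article
num <- c(20, 30, 40, 50, 50, 70, 80, 25, 35, 45, 55, 65, 75, 85, 95, 105)
grp <- factor(rep(LETTERS[1:5], c(3, 2, 4, 1, 6)))

# Values belonging to group C
c_vals <- num[grp == "C"]

# The six statistics computed one by one, as in the summarize() call
manual <- c(min(c_vals), quantile(c_vals, 0.25), median(c_vals),
            mean(c_vals), quantile(c_vals, 0.75), max(c_vals))

# The same statistics from summary()
from_summary <- summary(c_vals)

# TRUE when both approaches agree
print(isTRUE(all.equal(unname(manual), as.numeric(from_summary))))
```

Both routes rely on quantile() with its default settings, which is why the quartiles match.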
