How to get summary statistics by group in R
In this article, we will learn how to compute summary statistics by group in the R programming language.
Sample dataframe in use:
   grpBy num
1      A  20
2      A  30
3      A  40
4      B  50
5      B  50
6      C  70
7      C  80
8      C  25
9      C  35
10     D  45
11     E  55
12     E  65
13     E  75
14     E  85
15     E  95
16     E 105
Method 1: Using tapply()
The tapply() function applies a function to subsets of a vector, where the subsets are defined by one or more factors. It takes three arguments here. The first is the data column. The second is the column that defines the groups; in this example, the data is grouped by the letters A to E. The third is the function applied to each group. We pass summary() because we want summary statistics for each group.
Syntax: tapply(df$data, df$groupBy, summary)
Parameters:
- df$data: data on which summary function is to be applied
- df$groupBy: column that defines the groups
- summary: summary function is applied to each group
Example: R program to get summary statistics by group
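R
# Build the sample data frame
num <- c(20, 30, 40, 50, 50, 70, 80, 25,
         35, 45, 55, 65, 75, 85, 95, 105)
char <- factor(rep(LETTERS[1:5], c(3, 2, 4, 1, 6)))
df <- data.frame(grpBy = char, num = num)

# Summary statistics of num for each level of grpBy
tapply(df$num, df$grpBy, summary)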
Output:
$A
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
     20      25      30      30      35      40 

$B
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
     50      50      50      50      50      50 

$C
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   25.0    32.5    52.5    52.5    72.5    80.0 

$D
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
     45      45      45      45      45      45 

$E
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   55.0    67.5    80.0    80.0    92.5   105.0 
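The result of tapply() with summary() holds one summary per group, which can be awkward to compare across groups. As a minimal sketch, the per-group summaries can be stacked into a single matrix with do.call(rbind, ...). The names grp, stats and stats_mat are only illustrative and are not part of the original example.

```r
# Rebuild the sample data so the sketch is self-contained
num <- c(20, 30, 40, 50, 50, 70, 80, 25,
         35, 45, 55, 65, 75, 85, 95, 105)
grp <- factor(rep(LETTERS[1:5], c(3, 2, 4, 1, 6)))
df <- data.frame(grpBy = grp, num = num)

# One summary() result per group, as in Method 1
stats <- tapply(df$num, df$grpBy, summary)

# Stack the summaries: one row per group, one column per statistic
stats_mat <- do.call(rbind, stats)
print(stats_mat)

# Look up a single statistic by group name and statistic name
stats_mat["C", "Median"]
```

A matrix makes it easy to pick out one statistic for all groups, for example stats_mat[, "Mean"].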
Method 2: Using data.table approach
This approach uses the data.table package, which we load with library(). A data.table is an enhanced version of a data.frame that is popular for its fast execution and concise syntax. We first convert the data.frame to a data.table with setDT(), which performs the conversion in place, and then compute the summary statistics for each group using the syntax below. Wrapping summary() in as.list() turns each statistic into its own column.
Syntax:
setDT(df)
df[, as.list(summary(num)), by = grpBy]
Parameters:
- df: dataframe object
- num: data column
- grpBy: column according to which grouping is to be done
- summary(): function applied on each group
Example: R program to get summary statistics by group
R
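library(data.table)

# Build the sample data frame
num <- c(20, 30, 40, 50, 50, 70, 80, 25,
         35, 45, 55, 65, 75, 85, 95, 105)
char <- factor(rep(LETTERS[1:5], c(3, 2, 4, 1, 6)))
df <- data.frame(grpBy = char, num = num)

# Convert to a data.table in place, then summarise num within each group
setDT(df)
df[, as.list(summary(num)), by = grpBy]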
Output:
   grpBy Min. 1st Qu. Median Mean 3rd Qu. Max.
1:     A   20    25.0   30.0 30.0    35.0   40
2:     B   50    50.0   50.0 50.0    50.0   50
3:     C   25    32.5   52.5 52.5    72.5   80
4:     D   45    45.0   45.0 45.0    45.0   45
5:     E   55    67.5   80.0 80.0    92.5  105
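When only a few statistics are needed, the same by-group pattern can take a hand-picked list instead of summary(). The sketch below assumes the data.table package is installed; the column names n, avg and sd and the object names dt and res are illustrative choices, not part of the original example.

```r
library(data.table)

# Build the sample data directly as a data.table
num <- c(20, 30, 40, 50, 50, 70, 80, 25,
         35, 45, 55, 65, 75, 85, 95, 105)
dt <- data.table(grpBy = factor(rep(LETTERS[1:5], c(3, 2, 4, 1, 6))),
                 num = num)

# .() is shorthand for list(); .N is the number of rows in the current group
res <- dt[, .(n = .N, avg = mean(num), sd = sd(num)), by = grpBy]
print(res)
```

Group D has a single observation, so its standard deviation is NA.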
Method 3: Using split() function and purrr package
The split() function divides data into groups defined by a factor. Here it splits the data frame into a list with one data frame per group. We load the purrr package, a functional programming toolkit, with library(). Its map() function applies a function to every element of that list and returns the results as a list, which replaces an explicit for loop and makes the code easier to read. Because summary() is applied to each whole data frame, the output covers both the grpBy and num columns.
Syntax: df %>% split(.$grpBy) %>% map(summary)
Parameters:
- df: dataframe object
- grpBy: dataframe column according to which it should be grouped
Example: R program to get summary statistics by group
R
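library(purrr)

# Build the sample data frame
num <- c(20, 30, 40, 50, 50, 70, 80, 25,
         35, 45, 55, 65, 75, 85, 95, 105)
char <- factor(rep(LETTERS[1:5], c(3, 2, 4, 1, 6)))
df <- data.frame(grpBy = char, num = num)

# Split into one data frame per group, then summarise each piece
df %>%
  split(.$grpBy) %>%
  map(summary)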
Output:
$A
 grpBy      num    
 A:3   Min.   :20  
 B:0   1st Qu.:25  
 C:0   Median :30  
 D:0   Mean   :30  
 E:0   3rd Qu.:35  
       Max.   :40  

$B
 grpBy      num    
 A:0   Min.   :50  
 B:2   1st Qu.:50  
 C:0   Median :50  
 D:0   Mean   :50  
 E:0   3rd Qu.:50  
       Max.   :50  

$C
 grpBy      num      
 A:0   Min.   :25.0  
 B:0   1st Qu.:32.5  
 C:4   Median :52.5  
 D:0   Mean   :52.5  
 E:0   3rd Qu.:72.5  
       Max.   :80.0  

$D
 grpBy      num    
 A:0   Min.   :45  
 B:0   1st Qu.:45  
 C:0   Median :45  
 D:1   Mean   :45  
 E:0   3rd Qu.:45  
       Max.   :45  

$E
 grpBy      num       
 A:0   Min.   : 55.0  
 B:0   1st Qu.: 67.5  
 C:0   Median : 80.0  
 D:0   Mean   : 80.0  
 E:6   3rd Qu.: 92.5  
       Max.   :105.0  
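The output above also summarises the grouping column, because summary() receives each whole data frame. If only the numeric column is of interest, one possible variation is to pass map() a small function that picks out num. This is a sketch, assuming purrr is installed; by_group, num_summaries and group_means are illustrative names.

```r
library(purrr)

# Rebuild the sample data so the sketch is self-contained
num <- c(20, 30, 40, 50, 50, 70, 80, 25,
         35, 45, 55, 65, 75, 85, 95, 105)
df <- data.frame(grpBy = factor(rep(LETTERS[1:5], c(3, 2, 4, 1, 6))),
                 num = num)

# One data frame per group
by_group <- split(df, df$grpBy)

# Summarise only the num column of each piece; the result is a named list
num_summaries <- map(by_group, ~ summary(.x$num))
print(num_summaries$C)

# map_dbl() returns a named numeric vector instead of a list
group_means <- map_dbl(by_group, ~ mean(.x$num))
print(group_means)
```

map_dbl() suits the case where each group reduces to exactly one number, while map() keeps the full summary for each group.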
Method 4: Using dplyr
The group_by() function groups the data by the given variable. The summarize() function then computes the minimum, first quartile, median, mean, third quartile, and maximum of each group. These are the same values that the summary() function produces. The only difference is that each statistic must be requested explicitly inside summarize(), which reduces each group to a single value per statistic specified.
Syntax:
df %>%
group_by(grpBy) %>%
summarize(min = min(num), q1 = quantile(num, 0.25), median = median(num), mean = mean(num), q3 = quantile(num, 0.75), max = max(num))
Parameters:
- df: dataframe object
- grpBy: column according to which grouping is to be done
Example: R program to get summary statistics by group
R
library(dplyr)

# Build the sample data frame
num <- c(20, 30, 40, 50, 50, 70, 80, 25,
         35, 45, 55, 65, 75, 85, 95, 105)
char <- factor(
    rep(LETTERS[1:5], c(3, 2, 4, 1, 6)))
df <- data.frame(grpBy = char, num = num)

# Group by grpBy, then compute each statistic explicitly
df %>%
    group_by(grpBy) %>%
    summarize(min = min(num),
              q1 = quantile(num, 0.25),
              median = median(num),
              mean = mean(num),
              q3 = quantile(num, 0.75),
              max = max(num))
Output:
  grpBy   min    q1 median  mean    q3   max
  <fct> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>
1 A        20  25     30    30    35      40
2 B        50  50     50    50    50      50
3 C        25  32.5   52.5  52.5  72.5    80
4 D        45  45     45    45    45      45
5 E        55  67.5   80    80    92.5   105
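The claim that these six explicit calls reproduce summary() can be checked in base R on a single group. The sketch below uses the values of group C from the sample data; the names x, manual, builtin and same are illustrative.

```r
# Values of group C from the sample data
x <- c(70, 80, 25, 35)

# The six statistics computed one by one, as in the summarize() call
manual <- c(min(x), quantile(x, 0.25), median(x),
            mean(x), quantile(x, 0.75), max(x))

# The same statistics from summary()
builtin <- summary(x)

# Compare the values only, ignoring names and classes
same <- all.equal(unname(manual), as.numeric(builtin))
print(same)
```

The two agree because summary() and quantile() use the same default quantile method.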