# How to get summary statistics by group in R

In this article, we will learn how to compute summary statistics by group in the R programming language.

**Sample dataframe in use:**

       grpBy num
    1      A  20
    2      A  30
    3      A  40
    4      B  50
    5      B  50
    6      C  70
    7      C  80
    8      C  25
    9      C  35
    10     D  45
    11     E  55
    12     E  65
    13     E  75
    14     E  85
    15     E  95
    16     E 105

## Method 1: Using tapply()

The **tapply()** function in R applies a function to each group of values in a vector, where the groups are defined by one or more factors. In the form used here it takes three arguments. The first argument is the data column. The second argument is the column by which the data will be grouped; in this example the data is grouped according to the letters. The third argument is the function applied to each group; in this example we pass the **summary()** function, since we want to compute summary statistics by group.

Syntax: tapply(df$data, df$groupBy, summary)

Parameters:

**df$data**: data on which the summary function is to be applied

**df$groupBy**: column according to which the data should be grouped

**summary**: function applied to each group

**Example:** R program to get summary statistics by group
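
A minimal sketch of this example, assuming the same sample data that the other methods in this article build (the variable names `num`, `char`, `df` and `result` are taken from or modelled on those examples):

```r
# Sample data: 16 values in five groups (A to E), matching the table above
num <- c(20, 30, 40, 50, 50, 70, 80, 25,
         35, 45, 55, 65, 75, 85, 95, 105)
char <- factor(rep(LETTERS[1:5], c(3, 2, 4, 1, 6)))
df <- data.frame(grpBy = char, num = num)

# Apply summary() to the num values within each level of grpBy
result <- tapply(df$num, df$grpBy, summary)
result
```

The result holds one summary per group, so a single group can be pulled out with, for example, `result[["C"]]`.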

**Output:**

    $A
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
         20      25      30      30      35      40 

    $B
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
         50      50      50      50      50      50 

    $C
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
       25.0    32.5    52.5    52.5    72.5    80.0 

    $D
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
         45      45      45      45      45      45 

    $E
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
       55.0    67.5    80.0    80.0    92.5   105.0 
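
The per-group output of tapply() can be awkward to compare across groups. One possible follow-up, shown here as a sketch rather than as part of the method itself, is to stack the summaries into a matrix with `do.call(rbind, ...)`. The names `stats_list` and `stats_mat` are illustrative.

```r
# Same sample data as in the article
num <- c(20, 30, 40, 50, 50, 70, 80, 25,
         35, 45, 55, 65, 75, 85, 95, 105)
grp <- factor(rep(LETTERS[1:5], c(3, 2, 4, 1, 6)))

# One summary per group, as in Method 1
stats_list <- tapply(num, grp, summary)

# Bind the summaries row-wise: one row per group, one column per statistic
stats_mat <- do.call(rbind, stats_list)
stats_mat
```

Individual values can then be looked up by group and statistic, for example `stats_mat["E", "Max."]`.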

## Method 2: Using data.table approach

In this approach, we first load the **data.table** package using the library() function. Then we convert the data.frame to a data.table with setDT(), which performs the conversion by reference. A data.table is an enhanced version of a data.frame that has become popular in R for its execution speed and concise syntax. Finally, and most importantly, we use the syntax below to compute the summary statistics for each group.

Syntax:

    setDT(df)
    df[, as.list(summary(num)), by = grpBy]

Parameters:

**df**: data frame object

**num**: data column

**grpBy**: column according to which grouping is to be done

**summary()**: function applied to each group
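
The `as.list()` wrapper in the syntax matters because summary() returns a single named numeric vector, while data.table spreads a list result across columns, one per element. The base R sketch below only illustrates that conversion, using the values of group A; the names `x`, `s` and `s_list` are illustrative.

```r
x <- c(20, 30, 40)     # values of group A in the sample data

s <- summary(x)        # one named numeric vector holding six statistics
s_list <- as.list(s)   # a named list with six length-one elements

names(s_list)          # statistic names, which become the column names
lengths(s_list)        # each element holds a single value
```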

**Example:** R program to get summary statistics by group

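    # Load data.table, which provides setDT() and the [, j, by] syntax
    library(data.table)

    # Sample data: 16 values in five groups (A to E)
    num <- c(20, 30, 40, 50, 50, 70, 80, 25,
             35, 45, 55, 65, 75, 85, 95, 105)
    char <- factor(rep(LETTERS[1:5], c(3, 2, 4, 1, 6)))
    df <- data.frame(grpBy = char, num = num)

    # Convert df to a data.table by reference
    setDT(df)

    # Compute summary statistics of num for each level of grpBy
    df[, as.list(summary(num)), by = grpBy]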

**Output:**

       grpBy Min. 1st Qu. Median Mean 3rd Qu. Max.
    1:     A   20    25.0   30.0 30.0    35.0   40
    2:     B   50    50.0   50.0 50.0    50.0   50
    3:     C   25    32.5   52.5 52.5    72.5   80
    4:     D   45    45.0   45.0 45.0    45.0   45
    5:     E   55    67.5   80.0 80.0    92.5  105

## Method 3: Using split() function and purrr package

The split() function in R divides a vector or data frame into groups defined by the factor provided. We load the **purrr** package using the **library()** function. purrr is a functional programming toolkit that comes with many useful functions, such as map(). The map() function iterates over all groups, applies the given function to each one, and returns the results as a list. It lets us replace an explicit for loop and makes the code easier to read. Because summary() is applied here to each sub-data frame as a whole, the output summarizes both the grpBy and num columns.

Syntax: df %>% split(.$grpBy) %>% map(summary)

Parameters:

**df**: data frame object

**grpBy**: data frame column according to which the data should be grouped

**Example:** R program to get summary statistics by group

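    # Load purrr, which provides map() and re-exports the %>% pipe
    library(purrr)

    # Sample data: 16 values in five groups (A to E)
    num <- c(20, 30, 40, 50, 50, 70, 80, 25,
             35, 45, 55, 65, 75, 85, 95, 105)
    char <- factor(rep(LETTERS[1:5], c(3, 2, 4, 1, 6)))
    df <- data.frame(grpBy = char, num = num)

    # Split df into one data frame per group, then summarize each one
    df %>% split(.$grpBy) %>% map(summary)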

**Output:**

    $A
     grpBy      num    
     A:3   Min.   :20  
     B:0   1st Qu.:25  
     C:0   Median :30  
     D:0   Mean   :30  
     E:0   3rd Qu.:35  
           Max.   :40  

    $B
     grpBy      num    
     A:0   Min.   :50  
     B:2   1st Qu.:50  
     C:0   Median :50  
     D:0   Mean   :50  
     E:0   3rd Qu.:50  
           Max.   :50  

    $C
     grpBy      num      
     A:0   Min.   :25.0  
     B:0   1st Qu.:32.5  
     C:4   Median :52.5  
     D:0   Mean   :52.5  
     E:0   3rd Qu.:72.5  
           Max.   :80.0  

    $D
     grpBy      num    
     A:0   Min.   :45  
     B:0   1st Qu.:45  
     C:0   Median :45  
     D:1   Mean   :45  
     E:0   3rd Qu.:45  
           Max.   :45  

    $E
     grpBy      num       
     A:0   Min.   : 55.0  
     B:0   1st Qu.: 67.5  
     C:0   Median : 80.0  
     D:0   Mean   : 80.0  
     E:6   3rd Qu.: 92.5  
           Max.   :105.0  
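
The output above also reports level counts for grpBy, because each piece passed to summary() is a whole data frame. If only the statistics for num are wanted, one option is to split just that column. The sketch below uses base R's lapply() as a stand-in for purrr's map(), which is a substitution made here so the example runs without extra packages; the names `by_group` and `stats` are illustrative.

```r
# Same sample data as in the article
num <- c(20, 30, 40, 50, 50, 70, 80, 25,
         35, 45, 55, 65, 75, 85, 95, 105)
char <- factor(rep(LETTERS[1:5], c(3, 2, 4, 1, 6)))
df <- data.frame(grpBy = char, num = num)

# Split only the num column into one numeric vector per group
by_group <- split(df$num, df$grpBy)

# Summarize each vector; the result is a named list with one entry per group
stats <- lapply(by_group, summary)
stats
```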

## Method 4: Using dplyr

The **group_by()** function groups the data by the variable provided. The **summarize()** function is then used to compute the minimum, first quartile, median, mean, third quartile and maximum on the grouped data. These are the same statistics that the **summary()** function produces. The only difference is that here we have to call each function explicitly inside **summarize()**, which reduces each group of a column to a single value using the function specified.
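
As a quick check of the claim that these six functions reproduce summary(), the base R sketch below computes them by hand for group C of the sample data and compares the result with summary(). The names `x` and `manual` are illustrative, and the check is shown for this sample only.

```r
x <- c(70, 80, 25, 35)   # values of group C in the sample data

# The six statistics computed explicitly, in the order summary() reports them
manual <- c(min(x), quantile(x, 0.25), median(x),
            mean(x), quantile(x, 0.75), max(x))

# Compare against summary(), ignoring names and class
all.equal(unname(manual), as.numeric(summary(x)))
```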

Syntax:

    df %>%
      group_by(grpBy) %>%
      summarize(min = min(num), q1 = quantile(num, 0.25), median = median(num),
                mean = mean(num), q3 = quantile(num, 0.75), max = max(num))

Parameters:

**df**: data frame object

**grpBy**: column according to which grouping is to be done

**Example:** R program to get summary statistics by group

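    # Load dplyr, which provides group_by(), summarize() and the %>% pipe
    library(dplyr)

    # Sample data: 16 values in five groups (A to E)
    num <- c(20, 30, 40, 50, 50, 70, 80, 25,
             35, 45, 55, 65, 75, 85, 95, 105)
    char <- factor(rep(LETTERS[1:5], c(3, 2, 4, 1, 6)))
    df <- data.frame(grpBy = char, num = num)

    # Group by grpBy, then compute each statistic explicitly
    df %>%
      group_by(grpBy) %>%
      summarize(min = min(num),
                q1 = quantile(num, 0.25),
                median = median(num),
                mean = mean(num),
                q3 = quantile(num, 0.75),
                max = max(num))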

**Output:**

      grpBy   min    q1 median  mean    q3   max
      <fct> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>
    1 A        20  25     30    30    35      40
    2 B        50  50     50    50    50      50
    3 C        25  32.5   52.5  52.5  72.5    80
    4 D        45  45     45    45    45      45
    5 E        55  67.5   80    80    92.5   105