How to Calculate Summary Statistics by Group in R?

In this article, we will discuss how to calculate summary statistics by group in the R programming language.

What are summary statistics in R?

Summary statistics are numerical values that provide a concise and informative overview of a dataset. They help you understand the central tendency, dispersion, and shape of your data. R offers various functions and tools to compute summary statistics, either for a whole dataset or separately for each group within it.

For a numeric column, the summary() function returns the following values:

Min – Minimum value in the given data
1st Quartile – first quartile in the data
Median – Median of the data
Mean – Mean of the data
3rd Quartile – third quartile in the data
Max – Maximum value in the given data
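As a quick illustration of the six values listed above, the sketch below calls summary() on a plain numeric vector. The vector holds the ages used in the example dataframe later in this article; the variable names ages and s are only illustrative.

```r
# Ages used in this article's example data (variable name is illustrative)
ages <- c(21, 23, 21, 20, 19)

# summary() on a numeric vector returns the six statistics listed above
s <- summary(ages)
print(s)
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#    19.0    20.0    21.0    20.8    21.0    23.0 
```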

Let’s create the dataframe

# create dataframe with 4 columns
data <- data.frame(name = c("ojaswi", "bobby", "rohith", "gnanesh", "sireesha"),
                   subjects = c("java", "java", "python", "cpp", "python"),
                   age = c(21, 23, 21, 20, 19),
                   id = c(1, 2, 3, 4, 5))

# display
data

Output:

      name subjects age id
1   ojaswi     java  21  1
2    bobby     java  23  2
3   rohith   python  21  3
4  gnanesh      cpp  20  4
5 sireesha   python  19  5

Method 1: Using tapply() Function

This method uses the built-in tapply() function. Pass the column to be summarized as the first argument, the grouping column as the second argument, and summary as the third argument, which is the function applied to each group.

Syntax:

tapply(data$column_name, data$group_column, summary)

Parameters:

data is the input dataframe
column_name is the column to be summarized
group_column is the column to group by
summary is the function applied to each group
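The third argument of tapply() is not limited to summary; any function that accepts a vector can be used. A minimal sketch, assuming the same example dataframe as this article and using mean() in place of summary (the name mean_age is illustrative):

```r
# Example dataframe from this article
data <- data.frame(name = c("ojaswi", "bobby", "rohith", "gnanesh", "sireesha"),
                   subjects = c("java", "java", "python", "cpp", "python"),
                   age = c(21, 23, 21, 20, 19),
                   id = c(1, 2, 3, 4, 5))

# Apply mean() to age within each subject instead of summary()
mean_age <- tapply(data$age, data$subjects, mean)
print(mean_age)
#    cpp   java python 
#     20     22     20 
```

Because mean() returns a single number per group, the result is a named vector rather than a list.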

The following example summarizes the age column for each value of subjects using tapply() with summary as the function.

# create dataframe with 4 columns
data <- data.frame(name = c("ojaswi", "bobby", "rohith", "gnanesh", "sireesha"),
                   subjects = c("java", "java", "python", "cpp", "python"),
                   age = c(21, 23, 21, 20, 19),
                   id = c(1, 2, 3, 4, 5))

# display summary of age grouped by subjects
tapply(data$age, data$subjects, summary)

Output:

$cpp
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
     20      20      20      20      20      20 

$java
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   21.0    21.5    22.0    22.0    22.5    23.0 

$python
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   19.0    19.5    20.0    20.0    20.5    21.0
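The list printed above can be awkward to compare across groups. One possible way to get a single table, sketched here in base R as an alternative rather than as part of the tapply() method itself, is to split the column by group, apply summary() with sapply(), and transpose the result so that each row is a group. The name stats is illustrative.

```r
# Example dataframe from this article
data <- data.frame(name = c("ojaswi", "bobby", "rohith", "gnanesh", "sireesha"),
                   subjects = c("java", "java", "python", "cpp", "python"),
                   age = c(21, 23, 21, 20, 19),
                   id = c(1, 2, 3, 4, 5))

# split() makes one age vector per subject; sapply() collects the six
# statistics for each group as columns of a matrix; t() turns groups into rows
stats <- t(sapply(split(data$age, data$subjects), summary))
print(stats)
```

Individual values can then be looked up by row and column name, for example stats["java", "Mean"].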

Method 2: Using purrr Package

This method requires the purrr package. Install and load it first, then use the syntax below to split the dataframe by group and apply summary() to each piece.

install.packages('purrr')
library('purrr')

Syntax:

data %>% split(.$group_column) %>% map(summary)

where,

data is the input dataframe
group_column is the column to group by
summary is the function to get summary

The following example displays a summary for each value of subjects with the help of the purrr package. Because summary() is applied to each sub-dataframe, every column is summarized, not only age.

# load the library
library("purrr")

# create dataframe with 4 columns
data <- data.frame(name = c("ojaswi", "bobby", "rohith", "gnanesh", "sireesha"),
                   subjects = c("java", "java", "python", "cpp", "python"),
                   age = c(21, 23, 21, 20, 19),
                   id = c(1, 2, 3, 4, 5))

# display summary by grouping subjects
data %>% split(.$subjects) %>% map(summary)

Output:

$cpp
     name             subjects              age           id   
 Length:1           Length:1           Min.   :20   Min.   :4  
 Class :character   Class :character   1st Qu.:20   1st Qu.:4  
 Mode  :character   Mode  :character   Median :20   Median :4  
                                       Mean   :20   Mean   :4  
                                       3rd Qu.:20   3rd Qu.:4  
                                       Max.   :20   Max.   :4  

$java
     name             subjects              age             id      
 Length:2           Length:2           Min.   :21.0   Min.   :1.00  
 Class :character   Class :character   1st Qu.:21.5   1st Qu.:1.25  
 Mode  :character   Mode  :character   Median :22.0   Median :1.50  
                                       Mean   :22.0   Mean   :1.50  
                                       3rd Qu.:22.5   3rd Qu.:1.75  
                                       Max.   :23.0   Max.   :2.00  

$python
     name             subjects              age             id     
 Length:2           Length:2           Min.   :19.0   Min.   :3.0  
 Class :character   Class :character   1st Qu.:19.5   1st Qu.:3.5  
 Mode  :character   Mode  :character   Median :20.0   Median :4.0  
                                       Mean   :20.0   Mean   :4.0  
                                       3rd Qu.:20.5   3rd Qu.:4.5  
                                       Max.   :21.0   Max.   :5.0
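If only one column is of interest, the same split-then-apply idea can be sketched in base R with split() and lapply(). This is an alternative shown for illustration and does not use purrr; the names groups and age_by_group are hypothetical.

```r
# Example dataframe from this article
data <- data.frame(name = c("ojaswi", "bobby", "rohith", "gnanesh", "sireesha"),
                   subjects = c("java", "java", "python", "cpp", "python"),
                   age = c(21, 23, 21, 20, 19),
                   id = c(1, 2, 3, 4, 5))

# One sub-dataframe per subject
groups <- split(data, data$subjects)

# Summarize only the age column of each sub-dataframe
age_by_group <- lapply(groups, function(g) summary(g$age))
print(age_by_group)
```

The result is a named list with one summary per subject, which avoids the Length/Class/Mode entries that summary() reports for character columns.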

Method 3: Using dplyr Package

This approach requires the dplyr package. Install and load it first, then combine the group_by() and summarize() functions as shown in the syntax below to get a summary by group.

install.packages('dplyr')
library('dplyr')

Syntax:

data %>% group_by(group_column) %>% summarize(min = min(column),
            q1 = quantile(column, 0.25),
            median = median(column),
            mean = mean(column),
            q3 = quantile(column, 0.75),
            max = max(column))

Parameters:

min(column) – to get the minimum of the column
max(column) – to get the maximum of the column
median(column) – to get the median of the column
mean(column) – to get the mean of the column
quantile(column, 0.25) – to get the first quartile of the column
quantile(column, 0.75) – to get the third quartile of the column
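To see where quartile values such as 21.5 come from, the sketch below calls quantile() directly on the two ages in the java group of this article's example data. By default quantile() interpolates linearly between neighbouring values (its type 7 method), so the quartiles of a two-element group fall between the two observations. The variable names are illustrative.

```r
# Ages of the two rows whose subject is "java" in the example data
java_ages <- c(21, 23)

# First and third quartiles using quantile()'s default interpolation
q1 <- quantile(java_ages, 0.25)
q3 <- quantile(java_ages, 0.75)

print(q1)  # 21.5, a quarter of the way from 21 to 23
print(q3)  # 22.5, three quarters of the way from 21 to 23
```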

The following example summarizes the age column for each value of subjects using the dplyr package.

# load the library
library("dplyr")

# create dataframe with 4 columns
data <- data.frame(name = c("ojaswi", "bobby", "rohith", "gnanesh", "sireesha"),
                   subjects = c("java", "java", "python", "cpp", "python"),
                   age = c(21, 23, 21, 20, 19),
                   id = c(1, 2, 3, 4, 5))

# display summary of the age column grouped by subjects
data %>% group_by(subjects) %>% summarize(min = min(age),
            q1 = quantile(age, 0.25),
            median = median(age),
            mean = mean(age),
            q3 = quantile(age, 0.75),
            max = max(age))

Output:

# A tibble: 3 × 7
  subjects   min    q1 median  mean    q3   max
  <chr>    <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>
1 cpp         20  20       20    20  20      20
2 java        21  21.5     22    22  22.5    23
3 python      19  19.5     20    20  20.5    21
