Skip to content
Related Articles

Related Articles

Improve Article

Count non-NA values by group in DataFrame in R

  • Last Updated : 30 Jun, 2021
Geek Week

In this article, we will discuss how to count non-NA values by the group in dataframe in R Programming Language.

Method 1 : Using group_by() and summarise() methods

The dplyr package is used to perform simulations in the data by performing manipulations and transformations. The group_by() method in R programming language is used to group the specified dataframe in R. It can be used to categorize data depending on various aggregate functions like count, minimum, maximum, or sum. 

Syntax:

group_by(col-name)

On application of group_by() method, the summarize method is applied to compute a tally of the total values obtained according to each group. The summation of the non-null values is calculated using the designated column name and the aggregate method sum() supplied with the is.na() method as its argument. 



Syntax:

summarise ( new-col-name = sum(is.na (col-name))

Both the methods are applied in order to the input dataframe using the pipe operator. The output is returned in the form of a tibble, with the first column consisting of the input arguments of the group_by method and the second column being assigned the new column name specified and containing a summation of the values of each column. 

Example:

R




# creating a dataframe
data_frame <- data.frame(col1 = sample(6:9, 9 , replace = TRUE),
                        col2 = letters[1:3],
                        col3 = c(1,4,NA,1,NA,NA,2,NA,2))
  
print ("Original DataFrame")
print (data_frame)
  
# grouping data by col1 and giving a total of
# non na values in col3
data_frame %>% group_by(col1) %>% summarise(
  non_na = sum(!is.na(col3)))

Output

[1] "Original DataFrame"
col1 col2 col3
1    6    a    1
2    8    b    4
3    6    c   NA
4    8    a    1
5    8    b   NA
6    9    c   NA
7    8    a    2
8    7    b   NA
9    6    c    2
# A tibble: 4 x 2    
col1 non_na
  <int>        <int>
1     6            2
2     7            0
3     8            3
4     9            0

Method 2: Using data.table

The library data.table in R is used to make statistical computations and deliberations based on the organization of data into well-defined tabular structures. The setDT method in R is used to convert lists (both named and unnamed) and dataframes to datatables by reference. The similar sum() and is.na() methods are applied over the columns of the dataframe in sequence to obtain the final output. The output returned is in the form of a data.table with row numbers followed by row identifiers followed by colon. 

Syntax:



setDT(df)[, .(new-col-name = sum(!is.na(new-col-name))), col-name]

Example:

R




# importing required libraries
library(data.table)
  
# creating a dataframe
data_frame <- data.frame(col1 = sample(6:9, 9 , replace = TRUE),
                        col2 = letters[1:3],
                        col3 = c(1,4,NA,1,NA,NA,2,NA,2))
  
print ("Original DataFrame")
print (data_frame)
  
# grouping data by col1 and giving a total
# of non na values in col3
mod_df <- setDT(data_frame)[, .(non_na = sum(!is.na(col3))), col1]
print ("Modified DataFrame")
print (mod_df)

Output

[1] "Original DataFrame"
col1 col2 col3
1    7    a    1
2    6    b    4
3    6    c   NA
4    7    a    1
5    9    b   NA
6    8    c   NA
7    6    a    2
8    8    b   NA
9    8    c    2
[1] "Modified DataFrame"
   col1 non_na
1:    7      2
2:    6      2
3:    9      0
4:    8      1

Method 3: Using aggregate method

The aggregate method in R is used to create the subsets produced from the result of dataframe splitting and then computes the summary statistics for each of the returned group. 

Syntax:

aggregate (x , data , FUN)

Parameter : 

x – the R storage object.

data – the dataframe or list to apply the aggregate method to. 



FUN – the function to apply to each of the groups of the dataframe.

The cbind() method in R programming language is used to produce a concatenation of the columns produced as the output. The FUN applied is the sum operation to compute the sum of the non-null values segregated based on groups. The data is the input dataframe over which the FUN is applied. 

Example:

R




# importing required libraries
library(data.table)
  
# creating a dataframe
data_frame <- data.frame(col1 = sample(6:9, 9 , replace = TRUE),
                        col2 = letters[1:3],
                        col3 = c(1,4,NA,1,NA,NA,2,NA,2))
  
print ("Original DataFrame")
print (data_frame)
  
# grouping data by col1 and giving a total 
# of non na values in col3
mod_df <- aggregate(cbind(
  non_na = !is.na(col3))~col1, data_frame, sum)
print ("Modified DataFrame")
print (mod_df)

Output

[1] "Original DataFrame"
col1 col2 col3
1    7    a    1
2    6    b    4
3    6    c   NA
4    7    a    1
5    9    b   NA
6    8    c   NA
7    6    a    2
8    8    b   NA
9    8    c    2
[1] "Modified DataFrame"
   col1 non_na
1    7      2
2    6      2
3    9      0
4    8      1

Method 4 : Using table() method

The library data.table in R is used to make statistical computations and deliberations based on the organization of data into well-defined tabular structures. The table() method is used to generate a contingency table of the counts after computing the combination of each of the factor levels. Therefore, it is used to perform categorical tabulation of the data. Initially, the required column to check for NA values is specified under the constraint using the is.na() function. The non-null values are then extracted and a tally of them is produced using the data.table indexing methods. 

Syntax:

is.na (df$col-name))

Example:

R




# importing required libraries
library(data.table)
  
# creating a dataframe
data_frame <- data.frame(col1 = sample(6:9, 9 , replace = TRUE),
                        col2 = letters[1:3],
                        col3 = c(1,4,NA,1,NA,NA,2,NA,2))
print ("Original DataFrame")
print (data_frame)
  
# grouping data by col1 and giving a
# total of non na values in col3
mod_df <- table(data_frame$col1[!is.na(data_frame$col3)])
print ("Modified DataFrame")
print (mod_df)

Output

[1] "Original DataFrame"
  col1 col2 col3
1    7    a    1
2    9    b    4
3    8    c   NA
4    6    a    1
5    6    b   NA
6    8    c   NA
7    9    a    2
8    9    b   NA
9    8    c    2
[1] "Modified DataFrame"
6 7 8 9  
1 1 1 2 



My Personal Notes arrow_drop_up
Recommended Articles
Page :