Count non-NA values by group in DataFrame in R

Last Updated : 30 Jun, 2021

In this article, we will discuss how to count the non-NA values in each group of a dataframe in the R Programming Language.

Method 1: Using group_by() and summarise() methods

The dplyr package is used to manipulate and transform data held in dataframes. Its group_by() method groups the rows of the specified dataframe by the values of one or more columns, so that aggregate functions such as count, minimum, maximum, or sum can then be computed separately for each group.

Syntax:

group_by(col-name)

After group_by() is applied, the summarise() method (also spelled summarize()) computes one summary value for each group. To count the non-NA values, is.na() is applied to the designated column, the result is negated with the ! operator, and the resulting logical vector is passed to the aggregate method sum(), which counts each TRUE as 1.

Syntax:

summarise(new-col-name = sum(!is.na(col-name)))

Both methods are applied in order to the input dataframe using the pipe operator (%>%). The output is returned in the form of a tibble, with the first column holding the values of the grouping column passed to group_by() and the second column, named as specified in summarise(), holding the number of non-NA values in each group.
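The counting idiom can be checked on a plain vector before it is used inside summarise(). The following is a minimal base R sketch; the names values, not_missing and non_na_count are only illustrative, and the vector reuses the col3 data from the example.

```r
# illustrative vector with four missing entries
values <- c(1, 4, NA, 1, NA, NA, 2, NA, 2)

# is.na() flags the missing entries; ! flips the flags
not_missing <- !is.na(values)
print(not_missing)

# sum() treats TRUE as 1 and FALSE as 0,
# so the result is the number of non-NA values
non_na_count <- sum(not_missing)
print(non_na_count)
```

Inside summarise(), the same expression is evaluated once per group instead of once for the whole column.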

Example:

R




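# importing required libraries
library(dplyr)
  
# creating a dataframe
# (col1 is sampled at random, so the rows differ between runs)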
data_frame <- data.frame(col1 = sample(6:9, 9 , replace = TRUE),
                        col2 = letters[1:3],
                        col3 = c(1,4,NA,1,NA,NA,2,NA,2))
  
print ("Original DataFrame")
print (data_frame)
  
# grouping data by col1 and giving a total of
# non na values in col3
data_frame %>% group_by(col1) %>% summarise(
  non_na = sum(!is.na(col3)))


Output

[1] "Original DataFrame"
  col1 col2 col3
1    6    a    1
2    8    b    4
3    6    c   NA
4    8    a    1
5    8    b   NA
6    9    c   NA
7    8    a    2
8    7    b   NA
9    6    c    2
# A tibble: 4 x 2
   col1 non_na
  <int>  <int>
1     6      2
2     7      0
3     8      3
4     9      0
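When several columns need the same tally, the pattern can be extended with across(). This is only a sketch under assumptions: it needs dplyr 1.0.0 or later, and the dataframe scores with its columns grp, x and y is made up for illustration.

```r
library(dplyr)

# hypothetical data with fixed groups so the result is reproducible
scores <- data.frame(grp = c("a", "a", "b", "b", "c"),
                     x = c(1, NA, 3, NA, NA),
                     y = c(NA, NA, "u", "v", "w"))

# count the non-NA values of x and y within each group
non_na_counts <- scores %>%
  group_by(grp) %>%
  summarise(across(c(x, y), ~ sum(!is.na(.x))))

print(non_na_counts)
```

Each selected column keeps its own name in the result, so the tibble has one row per group and one count column per input column.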

Method 2: Using data.table

The data.table package provides a fast, tabular extension of the dataframe with a compact syntax for grouping and aggregation. The setDT() method converts lists (both named and unnamed) and dataframes to data.tables by reference, which means the input object itself is modified rather than copied. The same sum() and !is.na() combination is then evaluated on the target column within each group of the grouping column. The output is returned in the form of a data.table, printed with row numbers followed by a colon.

Syntax:

setDT(df)[, .(new-col-name = sum(!is.na(col-name))), by = group-col-name]

Example:

R




# importing required libraries
library(data.table)
  
# creating a dataframe
data_frame <- data.frame(col1 = sample(6:9, 9 , replace = TRUE),
                        col2 = letters[1:3],
                        col3 = c(1,4,NA,1,NA,NA,2,NA,2))
  
print ("Original DataFrame")
print (data_frame)
  
# grouping data by col1 and giving a total
# of non na values in col3
mod_df <- setDT(data_frame)[, .(non_na = sum(!is.na(col3))), col1]
print ("Modified DataFrame")
print (mod_df)


Output

[1] "Original DataFrame"
  col1 col2 col3
1    7    a    1
2    6    b    4
3    6    c   NA
4    7    a    1
5    9    b   NA
6    8    c   NA
7    6    a    2
8    8    b   NA
9    8    c    2
[1] "Modified DataFrame"
   col1 non_na
1:    7      2
2:    6      2
3:    9      0
4:    8      1
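Because setDT() converts by reference, the original dataframe becomes a data.table as a side effect. Where that is not wanted, a possible variation is sketched below; it uses as.data.table(), which works on a copy, and keyby, which also sorts the result by group. The names df and counts and the fixed col1 values are assumptions made for reproducibility.

```r
library(data.table)

# hypothetical data with fixed groups so the result is reproducible
df <- data.frame(col1 = c(7, 6, 6, 7, 9, 8, 6, 8, 8),
                 col3 = c(1, 4, NA, 1, NA, NA, 2, NA, 2))

# as.data.table() works on a copy, so df stays a plain dataframe;
# keyby groups like by and also sorts the result by col1
counts <- as.data.table(df)[, .(non_na = sum(!is.na(col3))), keyby = col1]

print(counts)
print(class(df))
```

With by instead of keyby, the groups appear in order of first occurrence, as in the output above.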

Method 3: Using aggregate method

The aggregate() method in R splits a dataframe into subsets according to one or more grouping variables and then computes a summary statistic for each of the resulting groups.

Syntax:

aggregate (x , data , FUN)

Parameter : 

x – the R object to aggregate. In the formula interface used here, it is a formula of the form values ~ groups.

data – the dataframe or list to apply the aggregate method to. 

FUN – the function to apply to each of the groups of the dataframe.

Wrapping the expression in cbind() on the left-hand side of the formula allows the result column to be given a name (non_na in the example). The expression !is.na(col3) yields TRUE for each non-NA value, and the FUN applied is sum, which adds up these TRUE values within each group of col1. The data is the input dataframe over which the FUN is applied.

Example:

R




# aggregate() is part of base R, so no extra library is needed
  
# creating a dataframe
data_frame <- data.frame(col1 = sample(6:9, 9 , replace = TRUE),
                        col2 = letters[1:3],
                        col3 = c(1,4,NA,1,NA,NA,2,NA,2))
  
print ("Original DataFrame")
print (data_frame)
  
# grouping data by col1 and giving a total 
# of non na values in col3
mod_df <- aggregate(cbind(
  non_na = !is.na(col3))~col1, data_frame, sum)
print ("Modified DataFrame")
print (mod_df)


Output

[1] "Original DataFrame"
  col1 col2 col3
1    7    a    1
2    6    b    4
3    6    c   NA
4    7    a    1
5    9    b   NA
6    8    c   NA
7    6    a    2
8    8    b   NA
9    8    c    2
[1] "Modified DataFrame"
  col1 non_na
1    6      2
2    7      2
3    8      1
4    9      0
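It may be tempting to put col3 itself on the left-hand side of the formula and do the NA check inside FUN. The sketch below, with the hypothetical names df, count_non_na, dropped and kept, shows why the example applies !is.na() in the formula instead: the formula interface of aggregate() uses na.action = na.omit by default, which removes rows containing NA before the groups are formed.

```r
# hypothetical data; group 9 contains only NA values in col3
df <- data.frame(col1 = c(7, 6, 6, 7, 9, 8, 6, 8, 8),
                 col3 = c(1, 4, NA, 1, NA, NA, 2, NA, 2))

count_non_na <- function(v) sum(!is.na(v))

# default na.action = na.omit removes NA rows before grouping,
# so group 9 is missing from this result
dropped <- aggregate(col3 ~ col1, df, count_non_na)
print(dropped)

# na.pass keeps the NA rows, so group 9 is reported with a count of 0
kept <- aggregate(col3 ~ col1, df, count_non_na, na.action = na.pass)
print(kept)
```

The cbind(non_na = !is.na(col3)) form avoids the issue because the left-hand side is already a logical column without NA values, so no rows are dropped.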

Method 4: Using table() method

The table() method in base R generates a contingency table of counts for each combination of factor levels, so it is used for categorical tabulation of data. First, is.na() is applied to the column to check, and its negation is used as a logical index to keep only the entries of the grouping column whose corresponding value is not NA. The table() method then tallies how often each group appears among the remaining entries. Note that a group with no non-NA values does not appear in the result at all, rather than appearing with a count of 0.

Syntax:

table(df$group-col-name[!is.na(df$col-name)])

Example:

R




# table() is part of base R, so no extra library is needed
  
# creating a dataframe
data_frame <- data.frame(col1 = sample(6:9, 9 , replace = TRUE),
                        col2 = letters[1:3],
                        col3 = c(1,4,NA,1,NA,NA,2,NA,2))
print ("Original DataFrame")
print (data_frame)
  
# grouping data by col1 and giving a
# total of non na values in col3
mod_df <- table(data_frame$col1[!is.na(data_frame$col3)])
print ("Modified DataFrame")
print (mod_df)


Output

[1] "Original DataFrame"
  col1 col2 col3
1    7    a    1
2    9    b    4
3    8    c   NA
4    6    a    1
5    6    b   NA
6    8    c   NA
7    9    a    2
8    9    b   NA
9    8    c    2
[1] "Modified DataFrame"
6 7 8 9  
1 1 1 2 
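If groups without any non-NA values should still be listed, one possible approach is to convert the grouping column to a factor first. The following is a sketch with made-up fixed data; the names df, groups, counts and counts_df are illustrative.

```r
# hypothetical data; group 9 contains only NA values in col3
df <- data.frame(col1 = c(7, 6, 6, 7, 9, 8, 6, 8, 8),
                 col3 = c(1, 4, NA, 1, NA, NA, 2, NA, 2))

# converting to a factor records all group levels up front;
# subsetting a factor keeps its levels, so empty groups survive
groups <- factor(df$col1)
counts <- table(groups[!is.na(df$col3)])
print(counts)

# optional: turn the table into a dataframe with a named count column
counts_df <- as.data.frame(counts, responseName = "non_na")
print(counts_df)
```

Here table() reports a count for every factor level, so group 9 is shown with 0 instead of being left out.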

