
How to find the percentage of missing values in a dataframe in R?


This article discusses how to find the percentage of missing values (NAs) in a data frame in the R programming language. The percentage of NAs is the share of data cells that hold no defined value. It can be calculated using the following formula:

Percentage of NAs = (Number of cells with NA) * 100 /(Total number of cells)

Method 1: The total number of cells is the product of the two values returned by the built-in dim() function, which are the number of rows and the number of columns, in that order. prod(dim(data_frame)) computes this product.

The number of cells with NA values can be computed by combining the is.na() and sum() functions. is.na() evaluates each data cell and returns a logical value: TRUE if the value is missing and FALSE otherwise. sum() then adds up these logical values, counting each TRUE as 1, which gives the number of missing cells.

sum(is.na(data_frame))
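
To make the counting step concrete, here is a minimal sketch on a tiny data frame. The name `df` and its values are made up for illustration only; the sketch shows the logical matrix that is.na() returns and how sum() turns it into a count.

```r
# Tiny illustrative data frame (values chosen only for this sketch)
df <- data.frame(C1 = c(1, NA),
                 C2 = c(NA, "b"))

# is.na() on a data frame returns a logical matrix of the same shape:
# TRUE where the cell is missing, FALSE otherwise
na_mask <- is.na(df)
print(na_mask)

# sum() treats TRUE as 1 and FALSE as 0, so this counts the missing cells
na_count <- sum(na_mask)
print(na_count)
```

Each column here has exactly one missing value, so the count is 2 out of 4 cells.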

R

# declaring a data frame in R
data_frame = data.frame(C1= c(1, 2, NA, 0),
                        C2= c( NA, NA, 3, 8),
                        C3= c("A", "V", "j", "y"))
  
print("Original data frame")
print(data_frame)
  
# calculating the product of dimensions of dataframe 
totalcells = prod(dim(data_frame))
print("Total number of cells ")
print(totalcells)
  
# calculating the number of cells with na
missingcells = sum(is.na(data_frame))
print("Missing value cells")
print(missingcells)
  
# calculating percentage of missing values
percentage = (missingcells * 100 )/(totalcells)
print("Percentage of missing values' cells")
print (percentage)


Output

[1] "Original data frame"
  C1 C2 C3
1  1 NA  A
2  2 NA  V
3 NA  3  j
4  0  8  y
[1] "Total number of cells "
[1] 12
[1] "Missing value cells"
[1] 3
[1] "Percentage of missing values' cells"
[1] 25

Method 2: We can use the mean() function to divide the number of missing cells by the total number of cells in a single step. is.na() first marks each data cell as TRUE (missing) or FALSE (present), and mean() then returns the fraction of TRUE values, because TRUE counts as 1 and FALSE as 0. Multiplying this fraction by 100 gives the percentage. The running time is linear in the number of cells, since each data cell is evaluated once.

R

# declaring a data frame in R
data_frame = data.frame(C1= c(1, 2, NA, 0),
                        C2= c( NA, NA, 3, 8), 
                        C3= c("A", "V", "j", "y"),
                        C4=c(NA,NA,NA,NA))
  
print("Original data frame")
print(data_frame)
  
# calculating percentage of missing values
percentage = mean(is.na(data_frame)) * 100
print ("percentage of missing values")
print (percentage)


Output

[1] "Original data frame"
  C1 C2 C3 C4
1  1 NA  A NA
2  2 NA  V NA
3 NA  3  j NA
4  0  8  y NA
[1] "percentage of missing values"
[1] 43.75
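
As a rough check that both methods agree, the sketch below applies them to the three-column data frame from Method 1. The helper name `na_percentage` is hypothetical (it is not a built-in R function), and the per-column breakdown with colMeans() is an optional variant that goes beyond the two methods described here.

```r
# Same data as in Method 1
data_frame <- data.frame(C1 = c(1, 2, NA, 0),
                         C2 = c(NA, NA, 3, 8),
                         C3 = c("A", "V", "j", "y"))

# Hypothetical helper wrapping Method 2: fraction of TRUE values times 100
na_percentage <- function(x) mean(is.na(x)) * 100

# Method 2 via the helper
pct_mean <- na_percentage(data_frame)

# Method 1: explicit count divided by the total number of cells
pct_formula <- sum(is.na(data_frame)) * 100 / prod(dim(data_frame))

print(pct_mean)
print(pct_formula)

# Optional variant: percentage of missing values in each column
col_pct <- colMeans(is.na(data_frame)) * 100
print(col_pct)
```

Both overall values come out to 25 for this data, matching the output of Method 1. The per-column figures show where the missing values are concentrated, which the single overall percentage does not.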


Last Updated : 07 Apr, 2021