
Remove Rows with NA Using dplyr Package in R

Last Updated : 02 Feb, 2024

NA stands for Not Available and marks missing values in a dataset. Missing values are a common problem in machine learning and, if not treated properly, can cause serious issues during data analysis. NA is different from NaN (Not a Number), which marks an undefined numeric result such as 0/0, although is.na() returns TRUE for both. To understand NA values, think of an admission form with several fields, including Blood Group. Some students forget to fill in this field, and others leave it blank because they do not know their blood group. In both cases the database stores NA, because the data is not available. In this article, we will learn how to remove rows containing NA values in the R programming language with the dplyr package.

dplyr package

dplyr is a popular R package for data manipulation. It provides a set of functions for filtering, summarizing, arranging, and transforming data frames, and it makes the pipe operator (%>%) available for chaining multiple operations together. It is a common choice for data preparation in machine learning and data analytics, and it is generally fast on in-memory data frames. To learn more, see the article on the dplyr package in R programming.

Different ways to remove rows with NA using dplyr package in R

Let’s create a data frame in R with some valid values and some NA values.

R




# Load the dplyr package
library("dplyr")

# Sample data frame with a mix of valid and missing (NA) values
df <- data.frame(id = c(1, 2, 3, 4, NA),
                 name = c('geek', NA, 'geeky', 'geeks', NA),
                 gender = c(NA, 'M', NA, 'F', NA))
print(df)


Output:

  id  name gender
1  1  geek   <NA>
2  2  <NA>      M
3  3 geeky   <NA>
4  4 geeks      F
5 NA  <NA>   <NA>

Check for NA values

To check whether our data frame contains NA values, we use the is.na() function, which returns TRUE or FALSE for each value. Wrapping it in colSums() counts the TRUE values in each column. Note that is.na() also returns TRUE for NaN values, but the reverse does not hold: is.nan() returns FALSE for NA.

R




colSums(is.na(df))


Output:


    id   name gender 
     1      2      3 

This shows the number of missing values in each column of the dataset.

In total, there are 6 missing values in the data frame.
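The relationship between NA and NaN described above can be checked directly. The following is a minimal sketch; the vector `x` is made up for illustration and is not part of the example data frame.

```r
# Hypothetical vector holding a regular value, a missing value, and NaN
x <- c(1, NA, NaN)

na_flags  <- is.na(x)   # TRUE for both NA and NaN
nan_flags <- is.nan(x)  # TRUE only for NaN

print(na_flags)
print(nan_flags)
```

Because is.na() flags both kinds of value, the removal techniques in this article treat NaN entries as missing as well.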

Remove all rows with NA values

The functions we are going to use in this example all ship with R, so no extra package is needed for them:

na.omit()

complete.cases()

rowSums()

1. Using na.omit()

The na.omit() function removes every row that has at least one NA value in the given data frame. Here we save the result in a new data frame, so the original data frame is not affected.

R




df_1 <- na.omit(df)
print(df_1)


Output:

  id  name gender
4  4 geeks      F

The row with id 4 is the only row without any NA values, so all other rows have been dropped.
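It can be useful to see which rows na.omit() dropped. As a small sketch of one way to check this, the result of na.omit() carries an "na.action" attribute holding the positions of the removed rows; the variable name `dropped` is chosen here for illustration.

```r
# Same sample data as in the article
df <- data.frame(id = c(1, 2, 3, 4, NA),
                 name = c("geek", NA, "geeky", "geeks", NA),
                 gender = c(NA, "M", NA, "F", NA))

df_1 <- na.omit(df)

# Positions of the rows that na.omit() removed
dropped <- as.integer(attr(df_1, "na.action"))
print(dropped)

# The original data frame still has all of its rows
print(nrow(df))
```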

2. Using complete.cases()

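The complete.cases() function returns a logical vector that is TRUE for each row with no NA values. Using that vector as a row index gives a data frame without the rows that have at least one NA value.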

R




print(df[complete.cases(df), ] )


Output:

  id  name gender
4  4 geeks      F

We got only one row as output, as other rows in the data frame had at least one NA value.
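To make the indexing step more concrete, the sketch below prints the logical vector that complete.cases() produces. It also shows, as an illustrative variation that the example above does not use, how the check can be limited to a subset of columns.

```r
# Same sample data as in the article
df <- data.frame(id = c(1, 2, 3, 4, NA),
                 name = c("geek", NA, "geeky", "geeks", NA),
                 gender = c(NA, "M", NA, "F", NA))

# One TRUE/FALSE per row: TRUE means the row has no NA at all
keep <- complete.cases(df)
print(keep)

# Variation: only require id and name to be present
keep_id_name <- complete.cases(df[, c("id", "name")])
print(df[keep_id_name, ])
```

Limiting the check to the columns that matter for an analysis keeps more rows than requiring every column to be complete.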

3. Using rowSums()

Here we keep the rows that have no NA values by checking the number of NA values in each individual row.

R




print(df[rowSums(is.na(df)) == 0, ]  )


Output:

  id  name gender
4  4 geeks      F

This keeps the rows whose number of NA values equals 0 and prints them.
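The three approaches above rely on functions that ship with R. For readers who prefer a dplyr pipeline, the following is a minimal sketch of an equivalent filter. It assumes dplyr 1.0.4 or later, where the if_all() helper is available; the name `df_complete` is chosen for illustration.

```r
library(dplyr)

# Same sample data as in the article
df <- data.frame(id = c(1, 2, 3, 4, NA),
                 name = c("geek", NA, "geeky", "geeks", NA),
                 gender = c(NA, "M", NA, "F", NA))

# Keep a row only if every column is non-missing
df_complete <- df %>%
  filter(if_all(everything(), ~ !is.na(.x)))

print(df_complete)
```

Replacing everything() with a selection such as c(id, name) would restrict the check to those columns.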

Remove rows with all NA values

So far we have removed every row that contains any NA value. This is not always beneficial, because it can lead to the loss of crucial data. Here we delete only the rows in which every value is NA, as such rows contribute nothing to our data frame.

Here we will be using:

rowSums() with ncol

rowSums() with filter()

1. Using rowSums() with ncol

Here we count the NA values in each row and compare that count with the number of columns in the data frame. If the two values match, every value in that row is NA, and we remove the row.

R




print(df[rowSums(is.na(df)) != ncol(df), ])


Output:

  id  name gender
1  1  geek   <NA>
2  2  <NA>      M
3  3 geeky   <NA>
4  4 geeks      F

The output still contains rows that have some NA values, because not all of their values are NA. The fifth row has been deleted, as all of its values, including the id, were NA.
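The comparison behind this filter can be broken into its intermediate steps. The sketch below uses the illustrative names `na_per_row` and `all_missing` for the two intermediate results.

```r
# Same sample data as in the article
df <- data.frame(id = c(1, 2, 3, 4, NA),
                 name = c("geek", NA, "geeky", "geeks", NA),
                 gender = c(NA, "M", NA, "F", NA))

# Number of NA values in each row
na_per_row <- rowSums(is.na(df))
print(na_per_row)

# TRUE where the NA count equals the number of columns (3 here)
all_missing <- na_per_row == ncol(df)
print(all_missing)

# Keep every row that is not entirely NA
print(df[!all_missing, ])
```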

2. Using rowSums() with filter()

The filter() function of the dplyr package filters a data frame by a condition. Here we apply the same logic as in the ncol() approach: we keep the rows whose number of NA values differs from the number of columns.

R




print(filter(df, rowSums(is.na(df)) != ncol(df)))


Output:

  id  name gender
1  1  geek   <NA>
2  2  <NA>      M
3  3 geeky   <NA>
4  4 geeks      F

As we can see the row with all values as NA is removed from the data frame.
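The same filter can also be written with dplyr helpers alone. This is a sketch of one possible alternative, assuming dplyr 1.0.4 or later for if_any(); the name `df_not_empty` is chosen for illustration.

```r
library(dplyr)

# Same sample data as in the article
df <- data.frame(id = c(1, 2, 3, 4, NA),
                 name = c("geek", NA, "geeky", "geeks", NA),
                 gender = c(NA, "M", NA, "F", NA))

# Keep a row if at least one of its columns is non-missing
df_not_empty <- df %>%
  filter(if_any(everything(), ~ !is.na(.x)))

print(df_not_empty)
```

Here if_any() expresses "at least one value is present", which is the same condition as "the NA count is not equal to the column count".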

Conclusion

NA values end up in a dataset because of technical or human error, and handling them during data preparation is important before the dataset is used for analysis or model training. While it is good to remove unnecessary data from a data frame, removing rows that have only a few NA values can lead to the loss of crucial data. A better approach is often to fill NA values using an approximation (imputation) method and to remove only the rows where all values are NA.
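As a rough sketch of that combined approach, the example below drops the all-NA rows and then fills the remaining gaps in a numeric column with the column mean. The `students` data frame and its `marks` column are hypothetical, and mean imputation is only one simple approximation method among many; whether it is appropriate depends on the data.

```r
library(dplyr)

# Hypothetical data: 'marks' is a made-up numeric column for illustration
students <- data.frame(name  = c("geek", "geeky", "geeks", NA),
                       marks = c(80, NA, 70, NA))

students_filled <- students %>%
  # Drop rows in which every column is NA (needs dplyr >= 1.0.4)
  filter(if_any(everything(), ~ !is.na(.x))) %>%
  # Replace the remaining missing marks with the mean of the observed marks
  mutate(marks = ifelse(is.na(marks), mean(marks, na.rm = TRUE), marks))

print(students_filled)
```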


