How to remove a subset from a DataFrame in R?

Last Updated : 23 May, 2021

A subset of a data frame is a smaller data frame formed from some of the rows and columns of the original. Removing a subset means dropping those rows or columns from the original data frame, usually based on a condition that they satisfy. In this article, we will see how to remove a subset from a DataFrame in the R programming language.

Method 1: Using anti_join() method.

The anti_join() function from the dplyr package returns all rows of the first data frame (x) that have no matching values in the second data frame (y), keeping only the columns of x. It is a filtering join: it selects rows and never adds columns from y. The row numbers of the original data frame are not retained in the result.

Syntax: anti_join ( x , y , by = c(..))

Arguments :

x : The first data frame

y : The second data frame

by (Optional) : The column name(s) to use as the key for matching rows. If omitted, all columns common to x and y are used.

Returns : The rows of the first data frame that have no match in the second data frame.

Code:

R

# loading the library 
library("dplyr") 
  
# declaring data frame 
data_frame <- data.frame(col1 = c(2, 4, 6, 10), 
                         col2 = c(4, 6, 8, 5), 
                         col3 = c(8, 10, 12, 20), 
                         col4 = letters[1 : 4]) 
  
print ("Original Dataframe") 
print (data_frame) 
  
# creating subset dataframe 
subset <- data.frame(col1 = c(2 , 6), 
                     col2 = c(4 , 8)) 
  
# removing subset data frame 
data_frame_mod <- anti_join(data_frame,subset) 
print ("Modified Dataframe") 
print (data_frame_mod) 

Output:

[1] "Original Dataframe"
  col1 col2 col3 col4
1    2    4    8    a
2    4    6   10    b
3    6    8   12    c
4   10    5   20    d
[1] "Modified Dataframe"
  col1 col2 col3 col4
1    4    6   10    b
2   10    5   20    d
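When by is omitted, anti_join() matches on every column the two data frames have in common and prints a message naming those columns. A minimal sketch of the same removal with the key columns stated explicitly is given below; the name to_remove is only an illustrative choice, not part of the original example.

```r
library(dplyr)

data_frame <- data.frame(col1 = c(2, 4, 6, 10),
                         col2 = c(4, 6, 8, 5),
                         col3 = c(8, 10, 12, 20),
                         col4 = letters[1:4])

# rows to remove, identified by the pair (col1, col2)
# (to_remove is a hypothetical name used for illustration)
to_remove <- data.frame(col1 = c(2, 6),
                        col2 = c(4, 8))

# naming the key columns explicitly documents the intent
# and suppresses the "Joining ..." message
result <- anti_join(data_frame, to_remove, by = c("col1", "col2"))
print(result)
```

Stating the keys also protects the code if a shared column is later added to either data frame, since the match no longer depends on which names happen to overlap.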

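If the rows should be matched on only some of the shared columns, name the key column(s) with the "by" argument of anti_join(). In the example below, both data frames contain col1 and col4, but only col4 is used for matching, so the rows with col4 equal to "a" or "d" are removed regardless of their col1 values.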

R

library("dplyr") 
  
# declaring data frame 
data_frame <- data.frame(col1 = c(2, 4, 6, 10), 
                         col2 = c(4, 6, 8, 5), 
                         col3 = c(8, 10, 12, 20), 
                         col4 = letters[1 : 4]) 
print ("Original Dataframe") 
print (data_frame) 
subset <- data.frame(col1 = c(2 , 4), 
                     col4 = c("a" , "d") ) 
data_frame_mod <- anti_join(data_frame, 
                            subset, by = "col4") 
print ("Modified Dataframe") 
print (data_frame_mod) 

Output:

[1] "Original Dataframe"
  col1 col2 col3 col4
1    2    4    8    a
2    4    6   10    b
3    6    8   12    c
4   10    5   20    d
[1] "Modified Dataframe"
  col1 col2 col3 col4
1    4    6   10    b
2    6    8   12    c
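The key columns do not need to share a name in both data frames. As a hedged sketch, dplyr accepts a named character vector for by, where the name refers to the column in x and the value to the column in y. The data frame lookup and its column key are hypothetical names introduced here for illustration.

```r
library(dplyr)

data_frame <- data.frame(col1 = c(2, 4, 6, 10),
                         col2 = c(4, 6, 8, 5),
                         col3 = c(8, 10, 12, 20),
                         col4 = letters[1:4])

# the values to remove live in a column with a different name
# (lookup and key are hypothetical names)
lookup <- data.frame(key = c("a", "d"))

# match data_frame$col4 against lookup$key
result <- anti_join(data_frame, lookup, by = c("col4" = "key"))
print(result)
```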

Method 2: Using %in% operator

The %in% operator checks whether each value on its left-hand side is present in the vector on its right-hand side. It returns a logical vector that is TRUE where the value is present and FALSE otherwise.

val %in% vec
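A small sketch of the operator on plain vectors may make the behaviour concrete. It uses base R only; the vector names values and lookup are illustrative.

```r
# values to test and the set to test them against (illustrative names)
values <- c("a", "b", "c", "d")
lookup <- c("a", "d")

# one logical result per element of values
is_present <- values %in% lookup
print(is_present)    # TRUE FALSE FALSE TRUE

# negating the result marks the elements that are NOT in lookup
print(!is_present)   # FALSE TRUE TRUE FALSE

# the logical vector can be used directly for indexing
print(values[!is_present])
```

The negated form is the one needed for removal, because it selects the elements that have no match.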

A column of the first data frame is checked against the values in a column of the second data frame, and only the rows whose values are not present in the second data frame are kept. Unlike anti_join(), this approach retains the row numbers of the original data frame.

R

# no packages are needed: %in% is part of base R
  
# declaring data frame 
data_frame <- data.frame(col1 = c(2, 4, 6, 10), 
                         col2 = c(4, 6, 8, 5), 
                         col3 = c(8, 10, 12, 20), 
                         col4 = letters[1 : 4]) 
  
print ("Original Dataframe") 
print (data_frame) 
  
# creating second data frame 
subset <- data.frame(col1 = c(2 , 4), 
                     col2 = c("a" , "d")) 
# keep only the rows whose col4 value is NOT in subset$col2
data_frame_mod <- data_frame[!(data_frame$col4 %in% subset$col2), ] 
print ("Modified Dataframe") 
print (data_frame_mod) 

Output:

[1] "Original Dataframe"
  col1 col2 col3 col4
1    2    4    8    a
2    4    6   10    b
3    6    8   12    c
4   10    5   20    d
[1] "Modified Dataframe"
  col1 col2 col3 col4
2    4    6   10    b
3    6    8   12    c
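Because this method keeps the original row numbers (2 and 3 above), the result may need renumbering before it is used further. One possible sketch in base R is shown below; keys is a hypothetical vector standing in for the column of the second data frame.

```r
data_frame <- data.frame(col1 = c(2, 4, 6, 10),
                         col2 = c(4, 6, 8, 5),
                         col3 = c(8, 10, 12, 20),
                         col4 = letters[1:4])

# values identifying the rows to remove (hypothetical name)
keys <- c("a", "d")

# drop the matching rows; the original row names are kept
data_frame_mod <- data_frame[!(data_frame$col4 %in% keys), ]
print(rownames(data_frame_mod))   # "2" "3"

# reset the row names to a fresh 1..n sequence
rownames(data_frame_mod) <- NULL
print(data_frame_mod)
```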
