Filter DataFrame columns in R by given condition

  • Last Updated : 24 Jun, 2021

In this article, we look at how to select columns of a data frame in the R programming language based on a given condition. Conditions can be applied to the columns of a data frame to produce a smaller subset. When columns are selected this way, the following properties hold:

  • Rows of the data frame remain unmodified.
  • Data frame attributes are preserved.
  • Output columns are a subset of input columns.

Method 1: Using indexing methods

Aggregate functions can be applied over the columns of a data frame, and the resulting logical vector can be used as a column index, so that only the columns satisfying the condition are returned. The result is a subset of the data frame in which all rows are retained for the selected columns. Indexing does not modify the original data frame, so the result has to be stored in a new variable to be kept. For instance, colSums() computes the sum of the elements of each column.

Example 1: The following program returns the columns whose elements sum to more than 10:

R






# declaring a data frame
data_frame = data.frame(col1 = c(0 : 4) ,
                        col2 = c(0, 2, -1, 4, 8),
                        col3 = c(9 : 13))
 
print ("Original dataframe")
print (data_frame)
 
# where column sum is greater than 10
data_frame_mod <- data_frame[colSums(data_frame)>10]
print ("Modified dataframe")
print (data_frame_mod)

Output:

[1] "Original dataframe"
  col1 col2 col3
1    0    0    9
2    1    2   10
3    2   -1   11
4    3    4   12
5    4    8   13
[1] "Modified dataframe"
  col2 col3
1    0    9
2    2   10
3   -1   11
4    4   12
5    8   13
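
The same indexing idea works with any per-column condition, not only colSums(). The following is a minimal sketch, assuming a condition on the column mean (the threshold of 2 and the variable name keep are illustrative, not part of the original example). sapply() applies the condition to each column and returns a logical vector, which is then used as the column index.

```r
# Example data (same values as the example above)
data_frame <- data.frame(col1 = 0:4,
                         col2 = c(0, 2, -1, 4, 8),
                         col3 = 9:13)

# Logical vector with one entry per column:
# TRUE where the column mean is greater than 2 (illustrative threshold)
keep <- sapply(data_frame, function(col) mean(col) > 2)

# drop = FALSE keeps the result a data frame even if only one column matches
data_frame_mod <- data_frame[, keep, drop = FALSE]
print(data_frame_mod)
```

Here col1 has a mean of exactly 2, so it is excluded, while col2 and col3 are kept.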

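Example 2: The program below keeps the rows where the value of col1 modulo 2 is not equal to 0, that is, where col1 is odd. Because the condition is placed before the comma in the index, it filters rows rather than columns, and all columns are retained.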

R




# declaring a data frame
data_frame = data.frame(col1 = c(0 : 4) ,
                        col2 = c(0, 2, -1, 4, 8),
                        col3 = c(9 : 13))
 
print ("Original dataframe")
print (data_frame)
 
# keep the rows where col1 is odd
data_frame_mod <- data_frame[data_frame$col1 %% 2 != 0, ]
print ("Modified dataframe")
print (data_frame_mod)

Output:

[1] "Original dataframe"
  col1 col2 col3
1    0    0    9
2    1    2   10
3    2   -1   11
4    3    4   12
5    4    8   13
[1] "Modified dataframe"
  col1 col2 col3
2    1    2   10
4    3    4   12
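
For contrast with the row filter above, a condition on cell values can also be used to pick columns. The following is a hedged sketch using base R's Filter(), which applies a predicate to each column of the data frame and keeps the columns for which it returns TRUE. The "no negative values" condition and the name non_negative_cols are assumptions chosen for illustration.

```r
# Example data (same values as the example above)
data_frame <- data.frame(col1 = 0:4,
                         col2 = c(0, 2, -1, 4, 8),
                         col3 = 9:13)

# Keep only the columns that contain no negative values;
# the predicate is evaluated once per column
non_negative_cols <- Filter(function(col) all(col >= 0), data_frame)
print(non_negative_cols)
```

col2 contains -1, so only col1 and col3 are returned, with all five rows intact.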

Method 2: Using dplyr package

The dplyr package, which provides functions for data manipulation, can be installed and then loaded into the working space with library().

install.packages("dplyr")

The select_if() function produces a subset of the data frame containing the columns that satisfy the specified condition; all rows are retained. It can be applied to both grouped and ungrouped data. The condition is either a predicate function that is applied to each column and returns TRUE or FALSE (such as is.numeric), or a logical vector with one element per column. In dplyr 1.0.0 and later, select_if() is superseded by select() combined with where(), although it remains available. The resulting data frame has to be stored in a separate variable to be kept.

df %>% select_if(condition)

Example 1: The following program uses select_if() to return the numeric columns of the data frame:



R




library ("dplyr")
 
# declaring a data frame
data_frame = data.frame(col1 = c("b", "b", "d", "e", "e") ,
                        col2 = c(0, 2, 1, 4, 5),
                        col3 = c(TRUE, FALSE, FALSE, TRUE, TRUE),
                        col4 = c(1 : 5))
 
print ("Original dataframe")
print (data_frame)
print ("Modified dataframe")
 
# selecting numeric columns
data_frame %>% select_if(is.numeric)

Output:

[1] "Original dataframe"
  col1 col2  col3 col4
1    b    0  TRUE    1
2    b    2 FALSE    2
3    d    1 FALSE    3
4    e    4  TRUE    4
5    e    5  TRUE    5
[1] "Modified dataframe"
  col2 col4
1    0    1
2    2    2
3    1    3
4    4    4
5    5    5
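
The same selection can be written with select() and where(), which is the form recommended in place of select_if() from dplyr 1.0.0 onward. This is a minimal sketch, assuming dplyr 1.0.0 or later is installed; the variable name numeric_cols is illustrative.

```r
library(dplyr)

# Example data (same values as the example above)
data_frame <- data.frame(col1 = c("b", "b", "d", "e", "e"),
                         col2 = c(0, 2, 1, 4, 5),
                         col3 = c(TRUE, FALSE, FALSE, TRUE, TRUE),
                         col4 = 1:5)

# where() wraps a predicate that is evaluated once per column;
# is.numeric() is FALSE for character and logical columns
numeric_cols <- data_frame %>% select(where(is.numeric))
print(numeric_cols)
```

As with select_if(is.numeric), only col2 and col4 are kept.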

Example 2: The following program returns the columns whose elements sum to less than 10:

R




library ("dplyr")
 
# declaring a data frame
data_frame = data.frame(col1 = c(-1, -2, -2, 0, 0) ,
                        col2 = c(0, 2, 1, 4, 5),
                        col3 = c(1 : 5))
print ("Original dataframe")
print (data_frame)
print ("Modified dataframe")
 
# select columns where column sum is less than 10
data_frame %>% select_if(colSums(data_frame) < 10)

Output:

[1] "Original dataframe" 
  col1 col2 col3 
1   -1    0    1 
2   -2    2    2 
3   -2    1    3 
4    0    4    4 
5    0    5    5  
[1] "Modified dataframe" 
  col1
1   -1 
2   -2 
3   -2 
4    0 
5    0
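
Passing a precomputed logical vector, as above, requires every column to be numeric, since colSums() fails on other types. A possible alternative is to pass a predicate function that is evaluated per column and guards against non-numeric columns. The sketch below assumes dplyr 1.0.0 or later; the extra label column and the name small_sum_cols are hypothetical additions for illustration.

```r
library(dplyr)

# Example data with an extra character column (hypothetical addition)
data_frame <- data.frame(col1 = c(-1, -2, -2, 0, 0),
                         col2 = c(0, 2, 1, 4, 5),
                         col3 = 1:5,
                         label = c("a", "b", "c", "d", "e"))

# The predicate runs once per column; && short-circuits,
# so sum() is never called on the character column
small_sum_cols <- data_frame %>%
  select(where(function(col) is.numeric(col) && sum(col) < 10))
print(small_sum_cols)
```

Only col1 (sum -5) satisfies the condition; col2 and col3 sum to 12 and 15, and label is skipped by the is.numeric() guard.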

Method 3: Using subset() method

The subset() function returns the rows of a data frame that satisfy the specified condition, preserving their original order. Columns can additionally be chosen through its select argument.

Syntax: subset ( df , condition)

Arguments : 

  • df – The dataframe
  • condition – The constraints to be applied

The %in% operator checks whether each value on its left-hand side occurs in a vector of values. It returns a logical value (TRUE or FALSE) for each element tested.

val %in% vector
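
As a quick illustration of how %in% behaves on a vector, here is a small sketch; the names values and is_member are illustrative.

```r
# Test each element of `values` for membership in the lookup vector
values <- c(9, 11, 13)
is_member <- values %in% c(9, 10, 13)

# One logical result per element of `values`
print(is_member)
```

The result has the same length as the left-hand vector, which is what allows it to be used as a row condition in subset().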

R




# declaring a data frame
data_frame = data.frame(col1 = c(0 : 4) ,
                        col2 = c(0, 2, -1, 4, 8),
                        col3 = c(9 : 13))
print ("Original dataframe")
print (data_frame)
 
# keep the rows where col3 is 9, 10 or 13
data_frame_mod <- subset(data_frame,
                         col3 %in% c(9, 10, 13))
print ("Modified dataframe")
print (data_frame_mod)

Output:

[1] "Original dataframe"
  col1 col2 col3
1    0    0    9
2    1    2   10
3    2   -1   11
4    3    4   12
5    4    8   13
[1] "Modified dataframe"
  col1 col2 col3
1    0    0    9
2    1    2   10
5    4    8   13
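
Since the example above filters rows only, the following sketch shows how subset() can also restrict the columns through its select argument. The choice of col1 and col3 is an assumption made for illustration.

```r
# Example data (same values as the example above)
data_frame <- data.frame(col1 = 0:4,
                         col2 = c(0, 2, -1, 4, 8),
                         col3 = 9:13)

# Row condition: col3 is 9, 10 or 13
# select: keep only col1 and col3 in the result
data_frame_mod <- subset(data_frame,
                         col3 %in% c(9, 10, 13),
                         select = c(col1, col3))
print(data_frame_mod)
```

The result contains rows 1, 2 and 5 with the columns col1 and col3; the original row names are preserved.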


