How to filter R DataFrame by values in a column?
In R Programming Language, dataframe columns can be subjected to constraints, and produce smaller subsets. However, while the conditions are applied, the following properties are maintained :
- Rows are considered to be a subset of the input.
- Rows in the subset appear in the same order as the original dataframe.
- Columns remain unmodified.
- The number of groups may be reduced, based on conditions.
- dataframe attributes are preserved during data filter.
Method 1 : Using dataframe indexing
Any dataframe column in the R programming language can be referenced either through its name df$col-name or using its index position in the dataframe df[col-index]. The cell values of this column can then be subjected to constraints, logical or comparative conditions, and then a dataframe subset can be obtained. These conditions are applied to the row index of the dataframe so that the satisfied rows are returned.
- Selection based on a check of missing values or NA
Cells in dataframe can contain missing values or NA as its elements, and they can be verified using is.na() method in R language.
Example:
R
# declaring a dataframe data_frame = data.frame (col1 = c ( NA , "b" , NA , "e" , "e" ) , col2 = c (0,2,1,4,5), col3= c ( TRUE , FALSE , FALSE , TRUE , TRUE )) print ( "Original dataframe" ) print (data_frame) # checking which values are not NA data_frame_mod <- data_frame[! is.na (data_frame$col1),] print ( "Modified dataframe" ) print (data_frame_mod) |
Output
[1] “Original dataframe”
col1 col2 col3
1 <NA> 0 TRUE
2 b 2 FALSE
3 <NA> 1 FALSE
4 e 4 TRUE
5 e 5 TRUE
[1] “Modified dataframe”
col1 col2 col3
2 b 2 FALSE
4 e 4 TRUE
5 e 5 TRUE
- Selection based on a single comparative condition on a column
Column values can be subjected to constraints to filter and subset the data. The values can be mapped to specific occurrences or within a range.
Example:
R
# declaring a dataframe data_frame = data.frame (col1 = c ( "b" , "b" , "e" , "e" , "e" ) , col2 = c (0,2,1,4,5), col3= c ( TRUE , FALSE , FALSE , TRUE , TRUE )) print ( "Original dataframe" ) print (data_frame) # checking which columns have col3 # value equivalent to true data_frame_mod <- data_frame[data_frame$col3== TRUE ,] print ( "Modified dataframe" ) print (data_frame_mod) |
Output
[1] “Original dataframe”
col1 col2 col3
1 b 0 TRUE
2 b 2 FALSE
3 e 1 FALSE
4 e 4 TRUE
5 e 5 TRUE
[1] “Modified dataframe”
col1 col2 col3
1 b 0 TRUE
4 e 4 TRUE
5 e 5 TRUE
- Selection based on multiple comparative conditions on a column
Column values can be subjected to constraints to filter and subset the data. The conditions can be combined by logical & or | operators. The %in% operator is used here, in order to check values that match to any of the values within a specified vector.
Example:
R
# declaring a dataframe data_frame = data.frame (col1 = c ( "b" , "b" , "d" , "e" , "e" ) , col2 = c (0,2,1,4,5), col3= c ( TRUE , FALSE , FALSE , TRUE , TRUE )) print ( "Original dataframe" ) print (data_frame) # checking which values of col1 # are equivalent to b or e data_frame_mod <- data_frame[data_frame$col1 % in % c ( "b" , "e" ),] print ( "Modified dataframe" ) print (data_frame_mod) |
Output
[1] “Original dataframe”
col1 col2 col3
1 b 0 TRUE
2 b 2 FALSE
3 d 1 FALSE
4 e 4 TRUE
5 e 5 TRUE
[1] “Modified dataframe”
col1 col2 col3
1 b 0 TRUE
2 b 2 FALSE
4 e 4 TRUE
5 e 5 TRUE
Method 2 : Using dplyr library
The dplyr library can be installed and loaded into the working space which is used to perform data manipulation.
The filter() function is used to produce a subset of the dataframe, retaining all rows that satisfy the specified conditions. The filter() method in R can be applied to both grouped and ungrouped data. The expressions include comparison operators (==, >, >= ) , logical operators (&, |, !, xor()) , range operators (between(), near()) as well as NA value check against the column values. The subset dataframe has to be retained in a separate variable.
Syntax:
filter(df , cond)
Parameter :
df – The dataframe object
cond – The condition to filter the data upon
Example:
R
library ( "dplyr" ) # declaring a dataframe data_frame = data.frame (col1 = c ( "b" , "b" , "d" , "e" , "e" ) , col2 = c (0,2,1,4,5), col3= c ( TRUE , FALSE , FALSE , TRUE , TRUE )) print ( "Original dataframe" ) print (data_frame) # checking which values of col1 # are equivalent to b or e data_frame_mod <- filter (data_frame,col2>1) print ( "Modified dataframe" ) print (data_frame_mod) |
Output
[1] “Original dataframe”
col1 col2 col3
1 b 0 TRUE
2 b 2 FALSE
3 d 1 FALSE
4 e 4 TRUE
5 e 5 TRUE
[1] “Modified dataframe”
col1 col2 col3
1 b 2 FALSE
2 e 4 TRUE
3 e 5 TRUE
Also, the values can be checked using the %in% operator to match the column cell values with the elements contained in the input specified vector.
Example:
R
library ( "dplyr" ) # declaring a dataframe data_frame = data.frame (col1 = c ( "b" , "b" , "d" , "e" , "e" ) , col2 = c (0,2,1,4,5), col3= c ( TRUE , FALSE , FALSE , TRUE , TRUE )) print ( "Original dataframe" ) print (data_frame) # checking which values of col1 # are equivalent to b or e data_frame_mod <- filter (data_frame,col1 % in % c ( "b" , "e" )) print ( "Modified dataframe" ) print (data_frame_mod) |
Output
[1] “Original dataframe”
col1 col2 col3
1 b 0 TRUE
2 b 2 FALSE
3 d 1 FALSE
4 e 4 TRUE
5 e 5 TRUE
[1] “Modified dataframe”
col1 col2 col3
1 b 0 TRUE
2 b 2 FALSE
4 e 4 TRUE
5 e 5 TRUE
Please Login to comment...