How to Remove Outliers from Multiple Columns in R DataFrame?

In this article, we will discuss how to remove outliers from multiple columns of a data frame in the R programming language.

To remove outliers from a data frame, we use the interquartile range (IQR) method. This method uses the first and third quartiles to decide whether an observation is an outlier or not.

An observation is considered an outlier if it lies more than 1.5 times the interquartile range above the third quartile, or more than 1.5 times the interquartile range below the first quartile.
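As a rough worked example of this rule, the sketch below computes the two cut-offs (often called fences) for a small vector. The vector and the variable names (q1, q3, iqr, lower_fence, upper_fence) are illustrative assumptions, and quantile() is used with its default settings.

```r
# illustrative values, chosen only for this sketch
x <- c(10, 8, 120, 14, 11, 90, 13, 15, 200, 25, 5)

# first and third quartiles (default quantile type)
q1 <- quantile(x, probs = 0.25, names = FALSE)
q3 <- quantile(x, probs = 0.75, names = FALSE)

# interquartile range
iqr <- q3 - q1

# cut-offs beyond which a value counts as an outlier
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr
print(c(lower_fence, upper_fence))

# values that fall outside the fences
outliers <- x[x < lower_fence | x > upper_fence]
print(outliers)
```

With these values the quartiles are 10.5 and 57.5, so the IQR is 47 and the fences are -60 and 128. Only 200 falls outside them.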

Remove Outliers from Multiple Columns in R

To detect outliers in R, we use the following function. It first calculates the first and third quartiles of the observations with the quantile() function, then takes their difference as the interquartile range.

For each observation, the function returns TRUE if the value lies more than 1.5 times the interquartile range above the third quartile or below the first quartile, and FALSE otherwise.

Syntax:

detect_outlier <- function(x) {

 Quantile1 <- quantile(x, probs=.25)

 Quantile3 <- quantile(x, probs=.75)

 IQR <- Quantile3 - Quantile1

 x > Quantile3 + (IQR*1.5) | x < Quantile1 - (IQR*1.5)
}

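Once the outliers are identified, we remove them by keeping only the rows for which the above function returns FALSE. In the examples below this is done one column at a time, so the quartiles for each later column are computed on the rows that survived the earlier columns.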
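Before moving to several columns, the removal step can be sketched on a single column. The data frame toy and its columns id and score are hypothetical names made up for this illustration; detect_outlier() is the function described above.

```r
# outlier test from the Syntax section
detect_outlier <- function(x) {
  Quantile1 <- quantile(x, probs = .25)
  Quantile3 <- quantile(x, probs = .75)
  IQR <- Quantile3 - Quantile1
  x > Quantile3 + (IQR * 1.5) | x < Quantile1 - (IQR * 1.5)
}

# hypothetical toy data frame for illustration
toy <- data.frame(id = 1:6, score = c(12, 14, 13, 15, 14, 90))

# logical vector: TRUE where score is an outlier
flags <- detect_outlier(toy$score)

# keep only the rows that are NOT flagged
cleaned <- toy[!flags, ]
print(cleaned)
```

Here the fences work out to 11 and 17, so only the row with score 90 is dropped.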

Example 1:

Here is an example where we remove outliers from three columns of the data frame.

R




# create sample data frame
sample_data <- data.frame(x=c(10, 8, 120, 14, 11, 90, 13, 15, 200, 25, 5),
                          y=c(400, 35, 50, 704, 80, 55, 900, 75, 60, 500, 10),
                          z=c(10, 300, 20, 90, 800, 70, 5, 850, 75, 20, 30))
print("Display original dataframe")
print(sample_data)
 
# create detect outlier function
detect_outlier <- function(x) {
   
  # calculate first quartile
  Quantile1 <- quantile(x, probs=.25)
   
  # calculate third quartile
  Quantile3 <- quantile(x, probs=.75)
   
  # calculate interquartile range
  IQR <- Quantile3 - Quantile1
   
  # return true or false
  x > Quantile3 + (IQR * 1.5) | x < Quantile1 - (IQR * 1.5)
}
 
# create remove outlier function
remove_outlier <- function(dataframe, columns = names(dataframe)) {
   
  # for loop to traverse in columns vector
  for (col in columns) {
     
    # remove observation if it satisfies outlier function
    dataframe <- dataframe[!detect_outlier(dataframe[[col]]), ]
  }
   
  # print the filtered data frame (print() also returns it invisibly)
  print("Remove outliers")
  print(dataframe)
}
 
remove_outlier(sample_data, c('x', 'y', 'z'))


Output:

[1] "Display original dataframe"
     x   y   z
1   10 400  10
2    8  35 300
3  120  50  20
4   14 704  90
5   11  80 800
6   90  55  70
7   13 900   5
8   15  75 850
9  200  60  75
10  25 500  20
11   5  10  30

[1] "Remove outliers"
     x   y   z
1   10 400  10
2    8  35 300
3  120  50  20
4   14 704  90
6   90  55  70
7   13 900   5
10  25 500  20
11   5  10  30
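The loop above filters column by column. As an alternative sketch (not the approach used in this article's examples), all columns can be tested against quartiles computed on the original, unfiltered data, and a row dropped if any of its columns is flagged. The names flags, keep and cleaned are assumptions introduced for this illustration.

```r
# same sample data as Example 1
sample_data <- data.frame(x=c(10, 8, 120, 14, 11, 90, 13, 15, 200, 25, 5),
                          y=c(400, 35, 50, 704, 80, 55, 900, 75, 60, 500, 10),
                          z=c(10, 300, 20, 90, 800, 70, 5, 850, 75, 20, 30))

# same outlier test as Example 1
detect_outlier <- function(x) {
  Quantile1 <- quantile(x, probs = .25)
  Quantile3 <- quantile(x, probs = .75)
  IQR <- Quantile3 - Quantile1
  x > Quantile3 + (IQR * 1.5) | x < Quantile1 - (IQR * 1.5)
}

# logical matrix: one column of TRUE/FALSE flags per tested column,
# each computed from the quartiles of the full, unfiltered column
flags <- sapply(sample_data[c("x", "y", "z")], detect_outlier)

# keep a row only if none of its columns is flagged
keep <- rowSums(flags) == 0
cleaned <- sample_data[keep, ]
print(cleaned)
```

For this particular data the result keeps the same rows as the loop-based version (rows 5, 8 and 9 are dropped), but the two approaches can differ on other data because the quartiles are computed on different sets of rows.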

Example 2:

Here is an example where we remove outliers from four columns of the data frame.

R




# create sample data frame
sample_data <- data.frame(x=c(-1, 2, 3, 4, 3, 2, 3, 4, 4, 5, 10),
                          y=c(-4, 3, 5, 7, 8, 5, 9, 7, 6, 5, 10),
                          z=c(-1, 3, 2, 9, 8, 7, 0, 8, 7, 2, 13),
                          w=c(10, 0, 1, 0, 1, 0, 1, 0, 2, 2, 10))
print("Display original dataframe")
print(sample_data)
 
# create detect outlier function
detect_outlier <- function(x) {
   
  # calculate first quartile
  Quantile1 <- quantile(x, probs=.25)
   
  # calculate third quartile
  Quantile3 <- quantile(x, probs=.75)
   
  # calculate interquartile range
  IQR <- Quantile3 - Quantile1
   
  # return true or false
  x > Quantile3 + (IQR * 1.5) | x < Quantile1 - (IQR * 1.5)
}
 
# create remove outlier function
remove_outlier <- function(dataframe, columns = names(dataframe)) {
   
  # for loop to traverse in columns vector
  for (col in columns) {
     
    # remove observation if it satisfies outlier function
    dataframe <- dataframe[!detect_outlier(dataframe[[col]]), ]
  }
   
  # print the filtered data frame (print() also returns it invisibly)
  print("Remove outliers")
  print(dataframe)
}
 
remove_outlier(sample_data, c('x', 'y', 'z', 'w'))


Output:

[1] "Display original dataframe"
    x  y  z  w
1  -1 -4 -1 10
2   2  3  3  0
3   3  5  2  1
4   4  7  9  0
5   3  8  8  1
6   2  5  7  0
7   3  9  0  1
8   4  7  8  0
9   4  6  7  2
10  5  5  2  2
11 10 10 13 10

[1] "Remove outliers"
   x y z w
2  2 3 3 0
3  3 5 2 1
4  4 7 9 0
5  3 8 8 1
6  2 5 7 0
7  3 9 0 1
8  4 7 8 0
9  4 6 7 2
10 5 5 2 2
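The examples above assume columns without missing values. With its default na.rm = FALSE, quantile() stops with an error when the input contains NA. A minimal sketch of an NA-tolerant variant follows; the name detect_outlier_na, the vector x_with_na, and the choice to treat missing values as "not an outlier" are all assumptions made for this illustration.

```r
# NA-tolerant variant of the outlier test (hypothetical helper)
detect_outlier_na <- function(x) {

  # quartiles computed on the non-missing values only
  q <- quantile(x, probs = c(.25, .75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]

  # comparisons involving NA yield NA
  flagged <- x > q[2] + 1.5 * iqr | x < q[1] - 1.5 * iqr

  # assumption: a missing value is not treated as an outlier
  flagged[is.na(flagged)] <- FALSE
  flagged
}

# illustrative vector with one missing value
x_with_na <- c(-1, 2, 3, NA, 3, 2, 3, 4, 4, 5, 10)

# positions of the flagged values
print(which(detect_outlier_na(x_with_na)))
```

In this sketch the first and last values (-1 and 10) are flagged, and the NA is kept. Whether rows with missing values should be kept or dropped depends on the analysis.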


Last Updated : 15 Dec, 2023