How to Remove Outliers from Multiple Columns in R DataFrame?

Last Updated : 15 Dec, 2023

In this article, we will discuss how to remove outliers from multiple columns of a data frame in the R Programming Language.

To remove outliers from a data frame, we use the interquartile range (IQR) method. This method uses the first and third quartiles to determine whether an observation is an outlier or not.

An observation is considered an outlier if it lies more than 1.5 times the interquartile range above the third quartile or more than 1.5 times the interquartile range below the first quartile.
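As a rough illustration of this rule, the sketch below computes the two cutoffs for a small vector. The vector `values` and the variable names `q1`, `q3`, `iqr`, `lower_fence` and `upper_fence` are made up for this illustration; `quantile()` is used with its default interpolation.

```r
# hypothetical sample vector, chosen only for illustration
values <- c(5, 8, 10, 11, 13, 14, 15, 25, 90, 120, 200)

# first and third quartiles (names = FALSE drops the "25%" / "75%" labels)
q1 <- quantile(values, probs = 0.25, names = FALSE)
q3 <- quantile(values, probs = 0.75, names = FALSE)

# interquartile range
iqr <- q3 - q1

# cutoffs: 1.5 * IQR below the first quartile and above the third quartile
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

print(c(q1 = q1, q3 = q3, iqr = iqr, lower = lower_fence, upper = upper_fence))

# observations outside the cutoffs; for this vector only 200 qualifies
outliers <- values[values < lower_fence | values > upper_fence]
print(outliers)
```

Here the quartiles are 10.5 and 57.5, so the cutoffs are -60 and 128, and 200 is the only value beyond them.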

Remove Outliers from Multiple Columns in R

To find outliers in R, we use the following function. It first calculates the first and third quartiles of the observations using the quantile() function, then takes their difference as the interquartile range.

The function returns TRUE for each observation that is more than 1.5 times the interquartile range above the third quartile or below the first quartile, and FALSE otherwise.

Syntax:

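detect_outlier <- function(x) {
  # first and third quartiles
  Quantile1 <- quantile(x, probs = .25)
  Quantile3 <- quantile(x, probs = .75)
  # interquartile range
  IQR <- Quantile3 - Quantile1
  # TRUE where the observation is an outlier
  x > Quantile3 + (IQR * 1.5) | x < Quantile1 - (IQR * 1.5)
}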

Once the outliers are identified, we remove them by keeping only the rows for which the above function returns FALSE.
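Before moving to data frames, it may help to see the detection function applied to a single vector. The sketch below repeats the function so that it runs on its own; the vector `scores` and the variable `flags` are hypothetical names used only here.

```r
# the detection function described above, repeated so this snippet is self-contained
detect_outlier <- function(x) {
  Quantile1 <- quantile(x, probs = .25)
  Quantile3 <- quantile(x, probs = .75)
  IQR <- Quantile3 - Quantile1
  x > Quantile3 + (IQR * 1.5) | x < Quantile1 - (IQR * 1.5)
}

# hypothetical vector for illustration
scores <- c(2, 3, 3, 4, 4, 5, 40)

# logical vector: TRUE marks an outlier (only the last element here)
flags <- detect_outlier(scores)
print(flags)

# keep only the observations that are not flagged
kept <- scores[!flags]
print(kept)
```

Negating the logical vector with `!` and using it as an index is the same idea the data frame examples apply to rows.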

Example 1:

Here is an example where we remove outliers from three columns of the data frame.

R




# create sample data frame
sample_data <- data.frame(x=c(10, 8, 120, 14, 11, 90, 13, 15, 200, 25, 5),
                          y=c(400, 35, 50, 704, 80, 55, 900, 75, 60, 500, 10),
                          z=c(10, 300, 20, 90, 800, 70, 5, 850, 75, 20, 30))
print("Display original dataframe")
print(sample_data)
 
# create detect outlier function
detect_outlier <- function(x) {
   
  # calculate first quartile
  Quantile1 <- quantile(x, probs=.25)

  # calculate third quartile
  Quantile3 <- quantile(x, probs=.75)

  # calculate interquartile range
  IQR <- Quantile3 - Quantile1
   
  # return true or false
  x > Quantile3 + (IQR * 1.5) | x < Quantile1 - (IQR * 1.5)
}
 
# create remove outlier function
remove_outlier <- function(dataframe, columns = names(dataframe)) {
   
  # for loop to traverse in columns vector
  for (col in columns) {
     
    # remove observation if it satisfies outlier function
    dataframe <- dataframe[!detect_outlier(dataframe[[col]]), ]
  }
   
  # print the filtered data frame (print() also returns it invisibly)
  print("Remove outliers")
  print(dataframe)
}
 
remove_outlier(sample_data, c('x', 'y', 'z'))


Output:

[1] "Display original dataframe"
     x   y   z
1   10 400  10
2    8  35 300
3  120  50  20
4   14 704  90
5   11  80 800
6   90  55  70
7   13 900   5
8   15  75 850
9  200  60  75
10  25 500  20
11   5  10  30

[1] "Remove outliers"
     x   y   z
1   10 400  10
2    8  35 300
3  120  50  20
4   14 704  90
6   90  55  70
7   13 900   5
10  25 500  20
11   5  10  30
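One detail worth noting: the loop in remove_outlier() filters one column at a time, so the quartiles for each later column are computed on the rows that survived the earlier columns. A possible variant, sketched below, flags outliers in every column against the original data and drops a row if any column flags it. The helper name `remove_outlier_joint` is hypothetical and not part of the example above.

```r
# same sample data as in Example 1
sample_data <- data.frame(x = c(10, 8, 120, 14, 11, 90, 13, 15, 200, 25, 5),
                          y = c(400, 35, 50, 704, 80, 55, 900, 75, 60, 500, 10),
                          z = c(10, 300, 20, 90, 800, 70, 5, 850, 75, 20, 30))

# same detection function as in Example 1
detect_outlier <- function(x) {
  Quantile1 <- quantile(x, probs = .25)
  Quantile3 <- quantile(x, probs = .75)
  IQR <- Quantile3 - Quantile1
  x > Quantile3 + (IQR * 1.5) | x < Quantile1 - (IQR * 1.5)
}

# hypothetical helper: flag every column on the original rows, then filter once
remove_outlier_joint <- function(dataframe, columns = names(dataframe)) {
  # one logical vector per column, combined with element-wise OR
  flagged <- Reduce(`|`, lapply(dataframe[columns], detect_outlier))
  # keep rows that no column flagged
  dataframe[!flagged, , drop = FALSE]
}

cleaned <- remove_outlier_joint(sample_data, c("x", "y", "z"))
print(cleaned)
```

On this particular data both approaches remove the same rows (5, 8 and 9), but they can differ on other data, because removing rows changes the quartiles of the columns checked afterwards.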

Example 2:

Here is an example where we remove outliers from four columns of the data frame.

R




# create sample data frame
sample_data <- data.frame(x=c(-1, 2, 3, 4, 3, 2, 3, 4, 4, 5, 10),
                          y=c(-4, 3, 5, 7, 8, 5, 9, 7, 6, 5, 10),
                          z=c(-1, 3, 2, 9, 8, 7, 0, 8, 7, 2, 13),
                          w=c(10, 0, 1, 0, 1, 0, 1, 0, 2, 2, 10))
print("Display original dataframe")
print(sample_data)
 
# create detect outlier function
detect_outlier <- function(x) {
   
  # calculate first quartile
  Quantile1 <- quantile(x, probs=.25)

  # calculate third quartile
  Quantile3 <- quantile(x, probs=.75)

  # calculate interquartile range
  IQR <- Quantile3 - Quantile1
   
  # return true or false
  x > Quantile3 + (IQR * 1.5) | x < Quantile1 - (IQR * 1.5)
}
 
# create remove outlier function
remove_outlier <- function(dataframe, columns = names(dataframe)) {
   
  # for loop to traverse in columns vector
  for (col in columns) {
     
    # remove observation if it satisfies outlier function
    dataframe <- dataframe[!detect_outlier(dataframe[[col]]), ]
  }
   
  # print the filtered data frame (print() also returns it invisibly)
  print("Remove outliers")
  print(dataframe)
}
 
remove_outlier(sample_data, c('x', 'y', 'z', 'w'))


Output:

[1] "Display original dataframe"
    x  y  z  w
1  -1 -4 -1 10
2   2  3  3  0
3   3  5  2  1
4   4  7  9  0
5   3  8  8  1
6   2  5  7  0
7   3  9  0  1
8   4  7  8  0
9   4  6  7  2
10  5  5  2  2
11 10 10 13 10

[1] "Remove outliers"
   x y z w
2  2 3 3 0
3  3 5 2 1
4  4 7 9 0
5  3 8 8 1
6  2 5 7 0
7  3 9 0 1
8  4 7 8 0
9  4 6 7 2
10 5 5 2 2
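The sample data in both examples contains no missing values. By default, quantile() stops with an error when its input contains NA, so columns with gaps need a small adjustment. The sketch below is one possible NA-tolerant variant; the name `detect_outlier_na`, the vector `readings`, and the choice to treat a missing value as "not an outlier" are assumptions made for this illustration.

```r
# hypothetical NA-tolerant variant of the detection function
detect_outlier_na <- function(x) {
  # na.rm = TRUE makes quantile() ignore missing values
  Quantile1 <- quantile(x, probs = .25, na.rm = TRUE)
  Quantile3 <- quantile(x, probs = .75, na.rm = TRUE)
  IQR <- Quantile3 - Quantile1

  flags <- x > Quantile3 + (IQR * 1.5) | x < Quantile1 - (IQR * 1.5)

  # comparisons involving NA yield NA; treat those as not flagged (an assumption)
  flags[is.na(flags)] <- FALSE
  flags
}

# hypothetical vector with one missing value
readings <- c(2, 3, NA, 3, 4, 4, 5, 40)

na_flags <- detect_outlier_na(readings)
print(na_flags)

# the NA is kept, only 40 is dropped
print(readings[!na_flags])
```

Whether rows with missing values should be kept or dropped depends on the analysis; this sketch keeps them so that only the IQR rule decides which rows are removed.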

