
Detect and Remove the Outliers using Python


Outliers are data points that deviate significantly from the rest of the data. They can distort measures of central tendency and affect statistical analyses. This article covers the common causes of outliers, from measurement errors to intentional introduction, and then shows how to detect and remove them in Python.

What is an Outlier?

An Outlier is a data item/object that deviates significantly from the rest of the (so-called normal) objects. Identifying outliers is important in statistics and data analysis because they can have a significant impact on the results of statistical analyses. The analysis for outlier detection is referred to as outlier mining.

Outliers can skew the mean (average) and affect measures of central tendency, as well as influence the results of tests of statistical significance.
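The effect on the mean can be seen with a small worked example. This is only a sketch: the numbers are made up for illustration and are not taken from the dataset used later in the article.

```python
import numpy as np

# Hypothetical sample: nine ordinary readings plus one extreme value.
values = np.array([10, 11, 12, 12, 13, 13, 14, 15, 16])
with_outlier = np.append(values, 120)

mean_clean = values.mean()
mean_outlier = with_outlier.mean()
median_clean = np.median(values)
median_outlier = np.median(with_outlier)

print("Mean without / with outlier:  ", mean_clean, mean_outlier)
print("Median without / with outlier:", median_clean, median_outlier)
```

A single extreme value pulls the mean from about 12.9 to 23.6, while the median stays at 13.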

How Are Outliers Caused?

Outliers can be caused by a variety of factors, and they often result from genuine variability in the data or from errors in data collection, measurement, or recording. Some common causes of outliers are:

  • Measurement errors: Errors in data collection or measurement processes can lead to outliers.
  • Sampling errors: In some cases, outliers can arise due to issues with the sampling process.
  • Natural variability: Inherent variability in certain phenomena can also lead to outliers. Some systems may exhibit extreme values due to the nature of the process being studied.
  • Data entry errors: Human errors during data entry can introduce outliers.
  • Experimental errors: In experimental settings, anomalies may occur due to uncontrolled factors, equipment malfunctions, or unexpected events.
  • Sampling from multiple populations: Data is inadvertently combined from multiple populations with different characteristics.
  • Intentional outliers: Outliers are introduced intentionally to test the robustness of statistical methods.

Outlier Detection And Removal

Here a pandas DataFrame is used for a more realistic approach, since real-world projects need to detect the outliers that arise during the data analysis step. The same approach can be used on lists and Series objects.

Dataset Used For Outlier Detection

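The dataset used in this article is the Diabetes dataset, which ships with the scikit-learn library. By default its feature columns are mean-centered and scaled, which is why the values shown below are small and centered around zero.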

Python3




# Importing
import sklearn
from sklearn.datasets import load_diabetes
import pandas as pd
import matplotlib.pyplot as plt
 
# Load the dataset
diabetics = load_diabetes()
 
# Create the dataframe
column_name = diabetics.feature_names
df_diabetics = pd.DataFrame(diabetics.data)
df_diabetics.columns = column_name
print(df_diabetics.head())


 Output:

        age       sex       bmi        bp        s1        s2        s3  \
0  0.038076  0.050680  0.061696  0.021872 -0.044223 -0.034821 -0.043401   
1 -0.001882 -0.044642 -0.051474 -0.026328 -0.008449 -0.019163  0.074412   
2  0.085299  0.050680  0.044451 -0.005670 -0.045599 -0.034194 -0.032356   
3 -0.089063 -0.044642 -0.011595 -0.036656  0.012191  0.024991 -0.036038   
4  0.005383 -0.044642 -0.036385  0.021872  0.003935  0.015596  0.008142   

         s4        s5        s6  
0 -0.002592  0.019907 -0.017646  
1 -0.039493 -0.068332 -0.092204  
2 -0.002592  0.002861 -0.025930  
3  0.034309  0.022688 -0.009362  
4 -0.002592 -0.031988 -0.046641 

Outliers can be detected using visualization, implementing mathematical formulas on the dataset, or using the statistical approach. All of these are discussed below. 

Visualizing and Removing Outliers Using Box Plot

A box plot captures the summary of the data effectively and efficiently with only a simple box and whiskers. It summarizes sample data using the 25th, 50th, and 75th percentiles, so one can get insights (quartiles, median, and outliers) into the dataset just by looking at its box plot.

Python3




# Box Plot
import seaborn as sns
sns.boxplot(df_diabetics['bmi'])


Output:

Outliers present in the bmi column

In the above graph, we can clearly see that the points beyond the upper whisker, at bmi values above roughly 0.12, are acting as outliers.
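The cut-off does not have to be read off the plot by eye. As a minimal sketch, the whisker and outlier values that a box plot draws can be computed with `boxplot_stats` from `matplotlib.cbook`, using the default whisker rule of 1.5 times the IQR. The `sample` array here is hypothetical; with the article's data it would be `df_diabetics['bmi']`.

```python
import numpy as np
from matplotlib.cbook import boxplot_stats

# Hypothetical sample; with the article's data this would be df_diabetics['bmi'].
sample = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

# boxplot_stats returns one dict per dataset; whis=1.5 is the default whisker rule.
stats = boxplot_stats(sample, whis=1.5)[0]

# The upper fence is Q3 + 1.5 * IQR; the whisker ends at the largest value below it.
upper_fence = stats["q3"] + 1.5 * stats["iqr"]

print("Upper whisker ends at:", stats["whishi"])
print("Upper fence:", upper_fence)
print("Points drawn as outliers:", stats["fliers"])
```

Any value above the upper fence is drawn as an individual point, so the fence is a natural candidate for the threshold.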

Python




import seaborn as sns
import matplotlib.pyplot as plt
 
 
def removal_box_plot(df, column, threshold):
    sns.boxplot(df[column])
    plt.title(f'Original Box Plot of {column}')
    plt.show()
 
    removed_outliers = df[df[column] <= threshold]
 
    sns.boxplot(removed_outliers[column])
    plt.title(f'Box Plot without Outliers of {column}')
    plt.show()
    return removed_outliers
 
 
threshold_value = 0.12
 
no_outliers = removal_box_plot(df_diabetics, 'bmi', threshold_value)


Output:

Box plots of bmi before and after outlier removal

Visualizing and Removing Outliers Using Scatterplot

A scatter plot is used when you have paired numerical data, when the dependent variable has multiple values for each value of the independent variable, or when trying to determine the relationship between two variables. A scatter plot can also be used for outlier detection.

Plotting a scatter plot requires two variables that are related to each other in some way. Here, body mass index and average blood pressure are used, whose column names are “bmi” and “bp” respectively.

Python3




fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(df_diabetics['bmi'], df_diabetics['bp'])
ax.set_xlabel('(body mass index of people)')
ax.set_ylabel('(bp of the people )')
plt.show()


Output:

Scatter plot of bp and bmi 

In the graph, most of the data points form a single cloud, while a few points lie apart from it at the highest bmi values. Those points can be regarded as outliers.

As an approximation, all data points with bmi > 0.12 can be treated as outliers. The following code finds the positions of the points that satisfy this condition and removes them.

Removal of Outliers in BMI and BP Column Combined 

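Here, NumPy’s np.where() function is used to find the positions (indices) where the condition (df_diabetics['bmi'] > 0.12) & (df_diabetics['bp'] < 0.8) is true in the DataFrame df_diabetics. The condition checks for outliers where ‘bmi’ is greater than 0.12 and ‘bp’ is less than 0.8. Because every scaled ‘bp’ value in this dataset is far below 0.8, the second condition is true for all rows, so in practice the filter depends only on ‘bmi’. For a one-dimensional condition, np.where() returns a tuple containing a single array of row positions, which is why outlier_indices[0] is passed to drop().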

Python3




import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
 
outlier_indices = np.where((df_diabetics['bmi'] > 0.12) & (df_diabetics['bp'] < 0.8))
 
no_outliers = df_diabetics.drop(outlier_indices[0])
 
# Scatter plot without outliers
fig, ax_no_outliers = plt.subplots(figsize=(6, 4))
ax_no_outliers.scatter(no_outliers['bmi'], no_outliers['bp'])
ax_no_outliers.set_xlabel('(body mass index of people)')
ax_no_outliers.set_ylabel('(bp of the people )')
plt.show()


Output:

Scatter plot of bp and bmi after removing outliers

The outliers have been removed successfully.
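Passing positions from np.where() to drop() works above because the DataFrame has the default 0, 1, 2, ... index, so positions and labels coincide. As a hedged alternative, a Boolean mask filters the rows directly and does not depend on the index. The small frame below is hypothetical and uses a non-default index on purpose.

```python
import pandas as pd

# Hypothetical frame with a non-default index, to show that the mask
# approach does not rely on positions matching index labels.
df = pd.DataFrame(
    {"bmi": [0.01, 0.15, -0.02, 0.13], "bp": [0.02, 0.05, -0.01, 0.03]},
    index=[10, 20, 30, 40],
)

# True for rows that should be treated as outliers.
is_outlier = df["bmi"] > 0.12

# Keep only the rows that are not outliers.
filtered = df[~is_outlier]
print(filtered)
```

The rows labelled 20 and 40 are removed, and the original `df` is left unchanged.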

Z-score

The Z-score is also called the standard score. It tells how far a data point is from the mean, measured in standard deviations. After setting a threshold value, the Z-scores of the data points can be used to define the outliers.

Z-score = (data_point - mean) / standard deviation

In this example, we are calculating the Z scores for the ‘age’ column in the DataFrame df_diabetics using the zscore function from the SciPy stats module. The resulting array z contains the absolute Z scores for each data point in the ‘age’ column, indicating how many standard deviations each value is from the mean.

Python3




from scipy import stats
import numpy as np
z = np.abs(stats.zscore(df_diabetics['age']))
print(z)


Output:

0      0.800500
1      0.039567
2      1.793307
3      1.872441
4      0.113172
         ...   
437    0.876870
438    0.115937
439    0.876870
440    0.956004
441    0.956004
Name: age, Length: 442, dtype: float64
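To make the formula concrete, the same quantity can be computed by hand with NumPy. This is a sketch on a hypothetical array, not on the article's data. It uses the population standard deviation (ddof=0), which is also the default used by scipy.stats.zscore.

```python
import numpy as np

# Hypothetical ages; the last value is far from the others.
ages = np.array([25.0, 30.0, 35.0, 40.0, 95.0])

mean = ages.mean()
std = ages.std()  # population standard deviation (ddof=0)

# Absolute Z-score: distance from the mean in units of standard deviation.
z_manual = np.abs((ages - mean) / std)
print(z_manual)
```

The last value gets the largest absolute Z-score because it lies farthest from the mean.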

To flag outliers, a threshold on the absolute Z-score is chosen, generally 3.0, because about 99.7% of the data points in a Gaussian distribution lie within +/- 3 standard deviations of the mean.
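The coverage figure can be checked with the standard library alone. For a Gaussian distribution, the fraction of data within k standard deviations of the mean is erf(k / sqrt(2)). The helper name `coverage_within` below is an illustrative choice, not part of any library.

```python
import math


def coverage_within(k):
    """Fraction of a Gaussian distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} std: {coverage_within(k):.4f}")
```

This gives about 68.27%, 95.45%, and 99.73% for 1, 2, and 3 standard deviations, so a lower threshold flags more points as outliers.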

Removal of Outliers with Z-Score

Let’s remove rows where the absolute Z-score is greater than 2. This is a stricter cut-off than the usual 3.0, so it flags more points.

In this example, we set a threshold value of 2 and then use NumPy’s np.where() to find the positions (indices) in the Z-score array z where the absolute Z-score exceeds the threshold. The rows at those positions are dropped, and the shapes of the DataFrame before and after removal are printed.

Python3




import numpy as np
 
threshold_z = 2
 
outlier_indices = np.where(z > threshold_z)[0]
no_outliers = df_diabetics.drop(outlier_indices)
print("Original DataFrame Shape:", df_diabetics.shape)
print("DataFrame Shape after Removing Outliers:", no_outliers.shape)


Output:

Original DataFrame Shape: (442, 10)
DataFrame Shape after Removing Outliers: (426, 10)

IQR (Inter Quartile Range) 

The Inter Quartile Range (IQR) approach is one of the most commonly used ways of finding outliers. It relies on quartiles rather than on the mean and standard deviation, so the bounds it produces are less affected by the extreme values themselves.

IQR = Quartile3 – Quartile1

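Syntax: numpy.percentile(a, q, axis=None, out=None, ...) 
Parameters : 

  • a : input array.
  • q : percentile (or sequence of percentiles) to compute, between 0 and 100.
  • method : how the percentile is estimated when it falls between two data points (default 'linear'). This keyword is available from NumPy 1.22; older versions call it interpolation.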

In this example, we are calculating the interquartile range (IQR) for the ‘bmi’ column in the DataFrame df_diabetics. It first computes the first quartile (Q1) and third quartile (Q3) using the midpoint method, then calculates the IQR as the difference between Q3 and Q1, providing a measure of the spread of the middle 50% of the data in the ‘bmi’ column.

Python3




# IQR
Q1 = np.percentile(df_diabetics['bmi'], 25, method='midpoint')
Q3 = np.percentile(df_diabetics['bmi'], 75, method='midpoint')
IQR = Q3 - Q1
print(IQR)


Output

0.06520763046978838
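The choice of method matters when a quartile falls between two data points. The sketch below compares the default linear interpolation with the midpoint method on a tiny hypothetical list, and also shows that pandas' Series.quantile() uses linear interpolation by default. It assumes NumPy 1.22 or later for the `method` keyword.

```python
import numpy as np
import pandas as pd

data = [1, 2, 3, 4]  # hypothetical values

# Default: linear interpolation between the two neighbouring points.
q1_linear = np.percentile(data, 25)

# Midpoint: average of the two neighbouring points.
q1_midpoint = np.percentile(data, 25, method="midpoint")

# pandas also defaults to linear interpolation.
q1_pandas = pd.Series(data).quantile(0.25)

print(q1_linear, q1_midpoint, q1_pandas)
```

Here the linear method gives 1.75 and the midpoint method gives 1.5, so IQR bounds computed with different methods can differ slightly.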

To flag outliers, upper and lower bounds are placed above and below the dataset’s normal range, at a distance of 1.5*IQR from the quartiles:

upper = Q3 + 1.5*IQR

lower = Q1 - 1.5*IQR

The factor 1.5 is a convention: for a Gaussian distribution, these bounds fall at about 2.7 standard deviations on either side of the mean, so roughly 99.3% of the data lies inside them.
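The link between the 1.5*IQR rule and 2.7 standard deviations can be worked out for a standard Gaussian distribution. This is a small sketch using only the standard library's `statistics.NormalDist`.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

# Quartiles of the standard normal distribution (symmetric around 0).
q3 = std_normal.inv_cdf(0.75)
q1 = -q3
iqr = q3 - q1

# Upper bound of the 1.5*IQR rule, expressed in standard deviations.
upper = q3 + 1.5 * iqr

# Fraction of the distribution that lies between the lower and upper bounds.
coverage = std_normal.cdf(upper) - std_normal.cdf(-upper)

print(f"Q3 = {q3:.4f}, IQR = {iqr:.4f}")
print(f"upper bound = {upper:.3f} standard deviations")
print(f"coverage = {coverage:.4f}")
```

The upper bound comes out at about 2.698 standard deviations, with about 99.3% of the distribution between the two bounds.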

Python3




# Above Upper bound
upper = Q3+1.5*IQR
upper_array = np.array(df_diabetics['bmi'] >= upper)
print("Upper Bound:", upper)
print(upper_array.sum())
 
# Below Lower bound
lower = Q1-1.5*IQR
lower_array = np.array(df_diabetics['bmi'] <= lower)
print("Lower Bound:", lower)
print(lower_array.sum())


Output:
Upper Bound: 0.12879000811776306
3
Lower Bound: -0.13204051376139045
0

Outlier Removal in Dataset using IQR

In this example, we are using the interquartile range (IQR) method to detect and remove outliers in the ‘bmi’ column of the diabetes dataset. It calculates the upper and lower limits based on the IQR, finds the row positions of the outliers using np.where(), and then removes the corresponding rows from the DataFrame. The before and after shapes of the DataFrame are printed for comparison.

Python3




# Importing
import numpy as np
import pandas as pd
from sklearn.datasets import load_diabetes
 
# Load the dataset
diabetes = load_diabetes()
 
# Create the dataframe
column_name = diabetes.feature_names
df_diabetes = pd.DataFrame(diabetes.data)
df_diabetes.columns = column_name
print("Old Shape: ", df_diabetes.shape)
 
''' Detection '''
# IQR
# Calculate the upper and lower limits
Q1 = df_diabetes['bmi'].quantile(0.25)
Q3 = df_diabetes['bmi'].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5*IQR
upper = Q3 + 1.5*IQR
 
# Find the row positions of the outliers; these match the index labels
# because the DataFrame uses the default 0, 1, 2, ... index
upper_array = np.where(df_diabetes['bmi'] >= upper)[0]
lower_array = np.where(df_diabetes['bmi'] <= lower)[0]
 
# Removing the outliers
df_diabetes.drop(index=upper_array, inplace=True)
df_diabetes.drop(index=lower_array, inplace=True)
 
# Print the new shape of the DataFrame
print("New Shape: ", df_diabetes.shape)


Output:

Old Shape:  (442, 10)
New Shape:  (439, 10)
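The same steps can be wrapped in a small reusable function. This is a sketch rather than a definitive implementation: the function name `remove_outliers_iqr` and its parameter `k` are illustrative choices, and the demo frame is made-up data. It returns a filtered copy instead of modifying the DataFrame in place.

```python
import pandas as pd


def remove_outliers_iqr(df, column, k=1.5):
    """Return the rows of df whose value in `column` lies strictly inside
    the bounds Q1 - k*IQR and Q3 + k*IQR."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    keep = (df[column] > lower) & (df[column] < upper)
    return df[keep]


# Hypothetical data: nine ordinary values and one extreme value.
demo = pd.DataFrame({"bmi": [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]})
cleaned = remove_outliers_iqr(demo, "bmi")

print("Old Shape: ", demo.shape)
print("New Shape: ", cleaned.shape)
```

With the article's data, the call would be `remove_outliers_iqr(df_diabetes, 'bmi')`.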

Conclusion

In conclusion, visualization tools like box plots and scatter plots help identify outliers, while mathematical methods such as Z-scores and the Inter Quartile Range (IQR) provide systematic, threshold-based ways to detect and remove them.

Frequently Asked Questions on Outlier Removal

Q. What is removing outliers in machine learning?

Removing outliers involves excluding data points significantly deviating from the norm to enhance model accuracy and generalization on new data.

Q. What are the techniques to remove outliers?

Common techniques include visualization tools (box plots, scatter plots), mathematical methods (Z-scores, IQR), and threshold-based filtering.

Q. What is the mean if the outlier is removed?

Removing an outlier shifts the mean toward the bulk of the remaining data, making it a more representative measure of central tendency.

Q. Why remove outliers from data?

Outliers can distort statistical analyses, affecting the mean, variance, and other measures. Removing them can improve model performance, provided the removed points are errors rather than genuine observations.

Q. What are different types of outliers in machine learning?

Outliers include global outliers (deviate from entire dataset) and local outliers (anomalous within specific subgroups), influencing data integrity.



Last Updated : 24 Jan, 2024