Outlier Detection in Logistic Regression

Outliers, data points that deviate markedly from the rest, can significantly affect the performance of logistic regression models. In this article, we explore techniques for detecting and handling outliers in logistic regression.

What are Outliers?

An outlier is an observation that falls far outside the typical range of other data points in a dataset. These anomalies can arise from errors in data collection, human mistakes, equipment malfunctions, or data transmission issues. Outliers can lead to:

Erroneous parameter estimation: Outliers can skew the model's understanding of the relationship between variables, leading to inaccurate parameter estimates.
Misclassification of outcomes: Outliers can cause the model to misclassify data points, reducing its overall accuracy and reliability.
Unreliable results and decisions: Models built with outliers can produce misleading results and, consequently, lead to poor decisions.

Outliers vs. Leverage Points vs. Influential Observations

It's important to distinguish between outliers and two related concepts:

Leverage points: These data points have extreme values on the independent variables (X-axis). While they may not be outliers themselves, they can have a significant influence on the model's fit due to their position in the feature space.
Influential observations: These data points, whether or not they are outliers or leverage points, can significantly affect the estimated coefficients, standard errors, and test statistics (such as Wald z-values) of the model.

In logistic regression, outliers with low leverage can still exert a substantial influence due to the non-linear relationship between the independent variables and the predicted probabilities.
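The distinction between leverage and outlyingness can be made concrete with two standard diagnostics: the diagonal of the hat matrix (leverage) and the Pearson residual. The following is a minimal sketch, assuming a binary outcome coded 0/1; the function name logistic_diagnostics and the synthetic data are illustrative assumptions, not part of the article's implementation. Because scikit-learn's LogisticRegression applies an L2 penalty by default, the fit is not the plain maximum-likelihood estimate and the values should be treated as approximate.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def logistic_diagnostics(X, y):
    """Return (leverage, pearson_residuals) for a binary logistic fit."""
    model = LogisticRegression(max_iter=1000).fit(X, y)
    # Fitted P(y = 1), clipped to avoid division by zero below
    p = np.clip(model.predict_proba(X)[:, 1], 1e-9, 1 - 1e-9)
    w = p * (1.0 - p)                            # per-row variance weights
    Xd = np.column_stack([np.ones(len(X)), X])   # design matrix with intercept
    # Hat-matrix diagonal: h_i = w_i * x_i^T (X^T W X)^{-1} x_i
    xtwx_inv = np.linalg.inv(Xd.T @ (Xd * w[:, None]))
    leverage = w * np.einsum("ij,jk,ik->i", Xd, xtwx_inv, Xd)
    pearson = (y - p) / np.sqrt(w)
    return leverage, pearson

# Synthetic data (assumption): the label follows the first feature
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
y_demo = (X_demo[:, 0] + rng.normal(scale=0.5, size=100) > 0).astype(int)
X_demo[0] = [8.0, 8.0]   # extreme predictor values
y_demo[0] = 0            # label that disagrees with the trend

leverage, pearson = logistic_diagnostics(X_demo, y_demo)
print("largest |Pearson residual| at row:", int(np.argmax(np.abs(pearson))))
print("largest leverage at row:", int(np.argmax(leverage)))
```

The leverage depends on the weight p(1 - p) as well as on the position in feature space, so a point with extreme predictor values does not necessarily have the largest leverage. This is one reason the two diagnostics are worth inspecting separately.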

Outlier Detection Techniques in Logistic Regression

Detecting and appropriately managing outliers is crucial for ensuring the accuracy and reliability of logistic regression analyses. Two common approaches for detecting outliers in logistic regression are:

Single-case deletion approach

The single-case deletion approach examines observations one at a time: each observation is removed in turn, and the resulting change in the fitted model indicates how unusual or influential that observation is. However, it suffers from two limitations in the presence of multiple outliers:

Masking: Masking occurs when an observation's influence is hidden by the presence of other outliers or extreme values and becomes evident only after one or more of those observations are removed. It can arise in methods that identify and remove outliers sequentially, such as the single-case deletion approach.
Swamping: Swamping occurs when observations that are not outliers are incorrectly identified as outliers because of the effect that other unusual observations have on the model. It can arise when the detection method is overly sensitive to extreme values, or when removing genuine outliers causes other data points to be flagged.
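The single-case idea can be sketched as a leave-one-out loop: refit the model without each observation and record how far the coefficients move. This is a minimal sketch under stated assumptions; the function name single_case_deletion and the synthetic data are hypothetical, and the Euclidean norm of the coefficient change is a simple stand-in for formal influence measures such as Cook's distance or DFBETAS.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def single_case_deletion(X, y):
    """For each row i, refit without row i and return the coefficient shift."""
    def fit_coefs(X_, y_):
        m = LogisticRegression(max_iter=1000).fit(X_, y_)
        return np.concatenate([m.intercept_, m.coef_.ravel()])

    full = fit_coefs(X, y)                 # coefficients using every row
    shifts = np.empty(len(X))
    for i in range(len(X)):
        keep = np.arange(len(X)) != i      # boolean mask that drops row i only
        shifts[i] = np.linalg.norm(fit_coefs(X[keep], y[keep]) - full)
    return shifts

# Synthetic data (assumption): one row is planted far from the rest
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 2))
y_demo = (X_demo[:, 0] > 0).astype(int)
X_demo[0] = [5.0, 5.0]
y_demo[0] = 0

shifts = single_case_deletion(X_demo, y_demo)
print("row with the largest coefficient shift:", int(np.argmax(shifts)))
```

Because each refit still contains every other unusual row, a group of similar outliers can keep one another's shifts small, which is the masking effect described above.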

Multiple-case Deletion approach

Detecting outliers one by one, as in the single-case deletion approach, can fall into the trap of masking and swamping. The multiple-case deletion approach addresses this by aiming to identify several influential observations at once, even in the presence of masking.

There are two stages involved in this deletion approach:

A clean subset of data: We first obtain an approximately clean subset of the data that is presumed to be free of influential observations. Doing so sets aside multiple influential observations at once rather than one by one, as in the single-case deletion technique.
Enhancing efficiency: We then refine the detection rule to improve its efficiency, which helps identify the influential observations more accurately.

When several outliers are present, the multiple-case deletion approach can identify them more accurately than the single-case approach.
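One hypothetical way to make the two stages concrete is sketched below. It assumes a binary outcome coded 0/1 and picks the "clean" subset as the rows with the smallest absolute Pearson residuals under an initial fit; that choice, the function names, and the keep_fraction and cutoff values are all illustrative assumptions rather than a prescribed procedure.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def pearson_residuals(model, X, y):
    # Clip probabilities so the denominator never reaches zero
    p = np.clip(model.predict_proba(X)[:, 1], 1e-9, 1 - 1e-9)
    return (y - p) / np.sqrt(p * (1 - p))

def two_stage_detection(X, y, keep_fraction=0.7, cutoff=3.0):
    # Stage 1: approximate clean subset = rows the initial fit explains best
    initial = LogisticRegression(max_iter=1000).fit(X, y)
    order = np.argsort(np.abs(pearson_residuals(initial, X, y)))
    clean = order[: int(keep_fraction * len(X))]
    # Stage 2: refit on the clean subset, then score every row against that fit
    refit = LogisticRegression(max_iter=1000).fit(X[clean], y[clean])
    flagged = np.abs(pearson_residuals(refit, X, y)) > cutoff
    return clean, flagged

# Synthetic data (assumption): a small group of rows is planted together
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 2))
y_demo = (X_demo[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)
X_demo[:5] = [6.0, 6.0]
y_demo[:5] = 0

clean, flagged = two_stage_detection(X_demo, y_demo)
print("rows flagged:", np.flatnonzero(flagged))
```

Scoring every row against a model fitted only on the clean subset means a group of outliers cannot pull the reference fit toward itself, which is the motivation for the first stage.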

Handling Outliers

Once outliers are detected, several techniques can be used to address them:

Removing outliers: The simplest option is to remove the outliers from the dataset. However, removal can discard valuable data. When the outliers are valid data points rather than errors, it may be appropriate to leave them unchanged.
Transformation: Transforming the variables reduces the effect of extreme values by bringing them closer to the rest of the data. Common choices include scaling, cube-root transformation, log transformation, and the Box-Cox transformation.
Imputation: Imputation replaces missing values or outliers with an estimated value, such as the mean, the median, or zero.
Robust estimators: Robust estimators are insensitive to outliers and so mitigate their impact on statistical analyses. Examples include robust regression and M-estimators, which fit the regression model in a way that limits the influence of outlying observations.
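The first three options can be illustrated on a single feature. This is a minimal sketch on made-up values; the 1.5 * IQR rule used to decide which values count as outliers is an assumption added for the example and is not one of the detection approaches discussed above.

```python
import numpy as np

# Made-up feature values; the last one is extreme
values = np.array([1.2, 0.9, 1.1, 1.0, 1.3, 0.8, 25.0])

# Flag values outside the 1.5 * IQR fences (assumed rule for this example)
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
is_outlier = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)

# Removal: keep only the rows that were not flagged
removed = values[~is_outlier]

# Transformation: log(1 + x) compresses large values toward the rest
log_transformed = np.log1p(values)

# Imputation: replace flagged values with the median of the remaining ones
imputed = np.where(is_outlier, np.median(values[~is_outlier]), values)

print("flagged:", is_outlier)
print("after removal:", removed)
print("after imputation:", imputed)
```

Removal changes the number of rows, so the same mask has to be applied to the labels as well; transformation and imputation keep the dataset size unchanged.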

Detecting and Handling Outliers: Implementation

Step 1: Import the necessary libraries and load the dataset

Python3

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load the Iris dataset (150 samples, 4 features, 3 classes)
iris = load_iris()
X, y = iris.data, iris.target

Step 2: Introducing outliers randomly to the dataset

In this step, we introduce outliers into the dataset at random, a common way to test outlier detection techniques. Ten row indices are chosen at random and random noise is added to those rows, which simulates the presence of outliers in real-world datasets.

Python3

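np.random.seed(0)  # fix the seed so the run is reproducible
# Pick 10 distinct rows and shift each feature by a random amount in [0, 50)
outlier_indices = np.random.choice(range(len(X)), size=10, replace=False)
X[outlier_indices] += 50 * np.random.rand(10, 4)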

Step 3: Creating datasets with specified outlier treatment

Single-case deletion approach: In this approach, we remove the outliers one by one from the dataset. The function below copies the original X and y arrays and iteratively deletes the rows specified by outlier_indices. Because each deletion shortens the array, later indices refer to shifted rows, so the rows removed may differ from the ones that were perturbed.
Multiple-case deletion approach: In this approach, all the outliers and the two rows that follow each of them in the array are removed at once, so that rows adjacent to the outliers are not used when training the model. The aim is a cleaner dataset that may improve the performance of the model.

Python3

def create_dataset(X, y, outlier_treatment, outlier_indices):
  if outlier_treatment == "single":
    # Single-case: delete the listed rows one at a time.
    # Note: every deletion shortens the array, so rows after a deleted index
    # shift down and later deletions can hit a different row than intended.
    X_no_outliers = np.copy(X)
    y_no_outliers = np.copy(y)
    for idx in outlier_indices:
      X_no_outliers = np.delete(X_no_outliers, idx, axis=0)
      y_no_outliers = np.delete(y_no_outliers, idx)
  elif outlier_treatment == "multiple":
    # Multiple-case: delete each listed row and the two rows after it in one call
    drop = np.concatenate((outlier_indices, outlier_indices + 1, outlier_indices + 2))
    X_no_outliers = np.delete(X, drop, axis=0)
    y_no_outliers = np.delete(y, drop)
  else:
    raise ValueError("outlier_treatment must be 'single' or 'multiple'")
  # Split into training and testing sets: X_train, X_test, y_train, y_test
  return train_test_split(X_no_outliers, y_no_outliers, test_size=0.2, random_state=42)
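The single-case branch above deletes by index from an array that shrinks on every pass, and the + 1 and + 2 offsets in the multiple-case branch can point past the end of the array for indices near the last row. A minimal sketch of an index-safe alternative follows; delete_rows is a hypothetical helper and is not used to produce the results reported in this article.

```python
import numpy as np

def delete_rows(X, y, indices):
    """Remove the given rows from X and y in a single call."""
    idx = np.unique(indices)                    # sort and drop duplicates
    idx = idx[(idx >= 0) & (idx < len(X))]      # ignore out-of-range indices
    return np.delete(X, idx, axis=0), np.delete(y, idx)

# Toy data where each label equals its original row number
X_demo = np.arange(20, dtype=float).reshape(10, 2)
y_demo = np.arange(10)
to_drop = np.array([7, 2, 5])

# One-shot deletion: indices always refer to the original array
X_clean, y_clean = delete_rows(X_demo, y_demo, to_drop)

# Sequential deletion for comparison: indices refer to a shrinking array
y_sequential = y_demo.copy()
for i in to_drop:
    y_sequential = np.delete(y_sequential, i)

print("one-shot:  ", y_clean)
print("sequential:", y_sequential)
```

In the toy data, the one-shot call removes rows 2, 5, and 7, while the sequential loop removes rows 2, 6, and 7 and leaves row 5 in place.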

Step 4: Training logistic regression

Python3

def train_logistic_regression(X_train, X_test, y_train, y_test):
  # Fit on the training split and return accuracy on the held-out test split
  lr = LogisticRegression(max_iter=1000)
  lr.fit(X_train, y_train)
  y_pred = lr.predict(X_test)
  return accuracy_score(y_test, y_pred)

Step 5: Evaluation

We call the create_dataset function once with the "single" treatment and once with the "multiple" treatment. Each call returns a train/test split in which the outliers have been handled according to the chosen approach, and a logistic regression model is then trained and scored on that split.

Python3

X_train_single, X_test_single, y_train_single, y_test_single = create_dataset(X, y, "single", outlier_indices)
acc_single = train_logistic_regression(X_train_single, X_test_single, y_train_single, y_test_single)
X_train_multiple, X_test_multiple, y_train_multiple, y_test_multiple = create_dataset(X, y, "multiple", outlier_indices)
acc_multiple = train_logistic_regression(X_train_multiple, X_test_multiple, y_train_multiple, y_test_multiple)
print("Accuracy using single case deletion approach:", acc_single)
print("Accuracy using multiple case deletion approach:", acc_multiple)

Output:

Accuracy using single case deletion approach: 0.7857142857142857
Accuracy using multiple case deletion approach: 0.9583333333333334

Single-case deletion approach: This approach removes the listed rows one by one, and the model is then trained on the modified dataset. In this run, it yields an accuracy of 0.7857.
Multiple-case deletion approach: This approach removes the outliers as a group. In this run, it yields an accuracy of 0.9583.

In this experiment, the multiple-case deletion approach leads to a higher accuracy than the single-case deletion approach. The comparison should be read with care: the two models are evaluated on different test sets drawn from a single split, and because the single-case loop deletes by index from an array that shrinks on each pass, some of the perturbed rows may remain in its dataset.

Challenges of Outlier Detection

Some challenges in outlier detection:

Visualization limitations: When dealing with more than two predictor variables, simple visualization tools like scatter plots become less effective in identifying outliers.
Indirect methods: Unlike linear regression, logistic regression lacks direct outlier detection techniques. It relies on goodness-of-fit and residual analysis, which are primarily used for model assessment.
Data loss vs. model bias: Removing or downweighting outliers can lead to data loss, potentially discarding valuable information. On the other hand, keeping outliers can introduce bias into the model.

Conclusion

Outlier detection is a crucial aspect of logistic regression for ensuring accurate model predictions. In this tutorial, we covered the single-case and multiple-case deletion approaches, which play an important role in detecting potential outliers in logistic regression.
