How to Handle Missing Data in Logistic Regression?

Logistic regression is a widely used statistical method for modeling the probability of a binary outcome. Real-world datasets, however, frequently contain missing values, which complicates fitting a logistic regression model. Handling missing data properly is essential to avoid biased estimates and preserve the model's accuracy. This article discusses how to handle missing data in logistic regression.

Table of Contents

How to Handle Missing Data in Logistic Regression?
1. Handling Missing Data in Logistic Regression by Deletion
2. Handling Missing Data in Logistic Regression by Imputation
3. Handling Missing Data in Logistic Regression using Missingness Indicator

How to Handle Missing Data in Logistic Regression?

Handling missing data in logistic regression is important to ensure the accuracy of the model. Some of the strategies for handling missing data are discussed below:

Deletion: Remove observations that contain missing values.
Imputation: Replace missing values with estimated values. Common imputation techniques include:
- Mean or median imputation
- Mode imputation
- Predictive imputation
Missingness indicator: Add binary variables that flag which values were missing.

Handling Missing Data in Logistic Regression by Deletion

In this method, we simply remove observations with missing values from the dataset. This approach is straightforward but may lead to loss of valuable information.
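As a minimal sketch of listwise deletion on tabular data, the snippet below uses pandas `dropna`. The column names and values are hypothetical and serve only to illustrate the idea.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two features and a binary target
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "income": [50.0, 62.0, np.nan, 48.0],
    "target": [0, 1, 1, 0],
})

# Listwise deletion: drop every row that has at least one missing value
df_complete = df.dropna()

X = df_complete[["age", "income"]]
y = df_complete["target"]
print(len(df), "rows before,", len(df_complete), "rows after")
```

Here half of the rows are discarded even though each of them is missing only a single value.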

Pros of Handling Missing Data in Logistic Regression by Deletion

Simplicity: It's simple to apply and comprehend the deletion process. No further modeling procedures or sophisticated imputation techniques are needed.
Preservation of Data Structure: There is no need to change or manipulate the data because missing values are eliminated, maintaining the dataset's structure.

Cons of Handling Missing Data in Logistic Regression by Deletion

Loss of Important Information: The primary disadvantage of deletion is the information that is lost. Dropping observations with missing values can remove potentially significant patterns or relationships from the data.
Reduced Statistical Power: Deleting observations results in a smaller sample size and, thus, lower statistical power. Fewer observations lead to less precise estimates and less trustworthy outcomes.
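To get a feel for how quickly deletion shrinks a dataset, the following sketch computes the expected share of complete rows when each cell is missing independently with probability 0.2 across 5 features (the setting used in the implementation below), and checks it by simulation. The independence of missing cells is an assumption of this sketch.

```python
import numpy as np

p_missing = 0.2   # per-cell missing probability
n_features = 5

# Probability that a row has no missing cell, assuming independent missingness
expected_complete = (1 - p_missing) ** n_features  # 0.8 ** 5 = 0.32768

# Simulate a large missingness mask and count rows without any missing cell
rng = np.random.default_rng(0)
mask = rng.random((100_000, n_features)) < p_missing
observed_complete = (~mask.any(axis=1)).mean()

print(f"expected: {expected_complete:.3f}, simulated: {observed_complete:.3f}")
```

Under these assumptions only about a third of the rows survive deletion, even though just 20% of the individual values are missing.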

Implementation

A synthetic dataset with missing values is generated using NumPy's random functions.
The dataset includes 1000 samples and 5 features, with 20% missing values randomly inserted.
The dataset is split into training and testing sets using an 80-20 split ratio.
Observations with missing values are removed from the training set using boolean indexing.
A logistic regression model is trained on the modified training set without missing values.
The trained model's accuracy is evaluated on the testing set, excluding observations with missing values.
The output reports the accuracy of the logistic regression model trained using the deletion method for handling missing data.
In this specific run, the accuracy obtained is approximately 51.56%.
Because the target in this synthetic dataset is generated at random, independently of the features, an accuracy close to 50% is expected with any missing-data strategy. The number illustrates the workflow rather than the quality of the deletion method.

Python

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Step 1: Generate Synthetic Dataset with Missing Values
np.random.seed(0)
n_samples = 1000
n_features = 5
X = np.random.rand(n_samples, n_features)
y = np.random.randint(0, 2, n_samples)  # binary target variable
missing_mask = np.random.rand(n_samples, n_features) < 0.2  # 20% missing values
X[missing_mask] = np.nan

# Step 2: Split Data into Training and Testing Sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Deletion Method:
# Keep only training rows without any missing value (mask computed once)
train_complete = ~np.isnan(X_train).any(axis=1)
X_train_deleted = X_train[train_complete]
y_train_deleted = y_train[train_complete]

# Train logistic regression model
model_deleted = LogisticRegression()
model_deleted.fit(X_train_deleted, y_train_deleted)

# Evaluate model on the test rows that have no missing value
test_complete = ~np.isnan(X_test).any(axis=1)
accuracy_deleted = model_deleted.score(X_test[test_complete], y_test[test_complete])
print("Accuracy with Deletion Method:", accuracy_deleted)

Output:

Accuracy with Deletion Method: 0.515625

The output reflects the accuracy of a logistic regression model trained and evaluated only on rows without missing values (the deletion method). The accuracy of 51.56% is close to chance, which is expected here: the target is drawn at random and carries no relationship to the features, so no model can do systematically better than 50%. On real data, deletion reduces the amount of training data, which can hinder the model's ability to generalize to unseen data.

Handling Missing Data in Logistic Regression by Imputation

Imputation involves replacing missing values with estimated values. Common imputation techniques include mean imputation, median imputation, and K-nearest neighbors (KNN) imputation.
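The three techniques named above can be compared side by side. This is a small sketch on a hypothetical array containing one outlier (100.0), using scikit-learn's `SimpleImputer` and `KNNImputer`; the data and the choice of `n_neighbors=2` are illustrative assumptions.

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

# Hypothetical data: column 0 contains an outlier (100.0)
X = np.array([
    [1.0, 10.0],
    [2.0, np.nan],
    [np.nan, 30.0],
    [4.0, 40.0],
    [100.0, 50.0],
])

X_mean = SimpleImputer(strategy="mean").fit_transform(X)
X_median = SimpleImputer(strategy="median").fit_transform(X)
# KNN imputation fills a gap using the rows closest on the observed features
X_knn = KNNImputer(n_neighbors=2).fit_transform(X)

print("mean fill for column 0:  ", X_mean[2, 0])    # mean of 1, 2, 4, 100
print("median fill for column 0:", X_median[2, 0])  # median of 1, 2, 4, 100
print("KNN fill for column 0:   ", X_knn[2, 0])
```

The mean is pulled toward the outlier while the median is not, which is one reason median imputation is often preferred for skewed features.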

Pros of Handling Missing Data in Logistic Regression by Imputation

Preservation of Data Integrity: Imputation retains all available data points, preventing the loss of valuable information compared to deletion methods.
Maintenance of Sample Size: By replacing missing values with estimates, imputation ensures the dataset's original sample size is maintained, crucial for statistical power and enhancing predictive performance.
Bias Reduction: Imputation methods help mitigate bias in parameter estimates and standard errors by including incomplete cases, leading to more accurate and dependable model outcomes.

Cons of Handling Missing Data in Logistic Regression by Imputation

Bias Introduction: Imputation relies on assumptions about missing data, and inaccurate assumptions may introduce bias, potentially distorting results.
Variability Distortion: Single-value imputation such as mean imputation artificially shrinks a feature's variance, because every missing entry receives the same value. This can weaken relationships between features and understate uncertainty.
Complexity of Methods: Certain imputation techniques, like multiple imputation, can be computationally intensive and require careful selection and tuning, increasing the modeling process's complexity.
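The variance shrinkage mentioned above can be shown with a short simulation. This sketch assumes a single normally distributed feature with 30% of its values removed completely at random; both choices are illustrative.

```python
import numpy as np
from sklearn.impute import SimpleImputer

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=(10_000, 1))

# Remove roughly 30% of the values completely at random
x_missing = x.copy()
x_missing[rng.random(x.shape) < 0.3] = np.nan

x_imputed = SimpleImputer(strategy="mean").fit_transform(x_missing)

var_observed = np.nanvar(x_missing)  # variance of the observed values only
var_imputed = x_imputed.var()        # variance after mean imputation

print(f"observed: {var_observed:.3f}, after mean imputation: {var_imputed:.3f}")
```

Every imputed entry sits exactly at the mean and contributes nothing to the spread, so the variance after imputation drops roughly in proportion to the share of observed values.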

Implementation

Missing values in the dataset are imputed using the mean value of each feature.
The SimpleImputer class from scikit-learn is used with the strategy set to 'mean' for imputation.
A logistic regression model is trained on the training set with imputed missing values.
The LogisticRegression class from scikit-learn is used for model training.
The accuracy of the trained logistic regression model is evaluated on the testing set.
The output displays the accuracy achieved by the logistic regression model trained using the imputation method for handling missing data.
In this specific run, the accuracy obtained on the testing set is approximately 59%.

Python

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.impute import SimpleImputer

# Step 1: Generate Synthetic Dataset with Missing Values
np.random.seed(1)
n_samples = 1000
n_features = 5
X = np.random.rand(n_samples, n_features)
y = np.random.randint(0, 2, n_samples)  # binary target variable
missing_mask = np.random.rand(n_samples, n_features) < 0.2  # 20% missing values
X[missing_mask] = np.nan

# Step 2: Split Data into Training and Testing Sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Imputation Method:
# Impute missing values with mean
imputer = SimpleImputer(strategy='mean')
X_train_imputed = imputer.fit_transform(X_train)
X_test_imputed = imputer.transform(X_test)

# Train logistic regression model
model_imputed = LogisticRegression()
model_imputed.fit(X_train_imputed, y_train)

# Evaluate model on test set
accuracy_imputed = model_imputed.score(X_test_imputed, y_test)
print("Accuracy with Imputation Method:", accuracy_imputed)

Output:

Accuracy with Imputation Method: 0.59

The output indicates that a logistic regression model trained using the imputation method achieved an accuracy of approximately 59%. The imputer is fitted on the training set only, and the resulting feature means are then used to fill missing values in both the training and testing sets. This figure should not be read as evidence that imputation outperforms deletion: the target is random and independent of the features, so any deviation from 50% is due to chance, and this run uses a different random seed (1) than the deletion example (0). What imputation does guarantee is that all 800 training rows and all 200 testing rows are retained.
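One way to keep imputation and model fitting together is a scikit-learn pipeline evaluated with cross-validation. The sketch below is an illustration rather than the article's original setup: it assumes a target that actually depends on the first two features, so that the accuracy is meaningful.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.random((1000, 5))
# Assumption for illustration: the target depends on the first two features
y = (X[:, 0] + X[:, 1] > 1).astype(int)
X[rng.random(X.shape) < 0.2] = np.nan  # 20% missing values

# The imputer is re-fitted on the training part of every fold,
# so statistics from held-out rows never leak into the model
pipeline = make_pipeline(SimpleImputer(strategy="mean"), LogisticRegression())
scores = cross_val_score(pipeline, X, y, cv=5, scoring="accuracy")

print("Mean cross-validated accuracy:", scores.mean())
```

Averaging over five folds gives a steadier estimate than a single train-test split, which matters when comparing missing-data strategies.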

Handling Missing Data in Logistic Regression using Missingness Indicator

In this approach, we incorporate the missingness pattern into the analysis by adding binary variables that indicate whether each value was missing. This allows the model to learn from the missingness pattern, which can improve predictions when the fact that a value is missing is itself related to the outcome.
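scikit-learn can generate such indicator columns directly through the `add_indicator` parameter of `SimpleImputer`. The following sketch uses a tiny hypothetical array to show the resulting layout: imputed features first, followed by one 0/1 column per feature that had missing values during fitting.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data with one missing value in each column
X = np.array([
    [1.0, np.nan],
    [np.nan, 4.0],
    [3.0, 6.0],
])

imputer = SimpleImputer(strategy="mean", add_indicator=True)
X_out = imputer.fit_transform(X)

# Columns 0-1: mean-imputed features, columns 2-3: missingness flags
print(X_out)
```

Because the indicator columns are produced by the fitted imputer, calling `transform` on new data yields the same column layout.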

Pros of Handling Missing Data in Logistic Regression using Missingness Indicator

Information Preservation: The Missingness Indicator method maintains information regarding missing data, enabling the model to address potential patterns or biases linked with missing values.
Ease of Implementation: Implementing the Missingness Indicator is relatively simple, involving the addition of a binary variable to denote missingness, seamlessly integrating into logistic regression models.
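Reduced Reliance on Imputation Assumptions: The indicator lets the model treat missing entries as a separate case, so the results depend less on how well the filled-in value approximates the true one. A placeholder value (the feature mean in the implementation below) is still required, because scikit-learn's LogisticRegression does not accept NaN inputs.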

Cons of Handling Missing Data in Logistic Regression using Missingness Indicator

Dimensionality Increase: Incorporating Missingness Indicators raises dataset dimensionality, potentially leading to computational complexities, especially with large datasets or numerous missing values.
Efficiency Reduction: The inclusion of Missingness Indicators may reduce model efficiency by introducing noise, particularly if missingness patterns lack informativeness or if many values are missing.
Interpretation Complexity: Interpreting coefficients associated with Missingness Indicators can be more intricate compared to imputed values, as they represent missingness impact on outcomes rather than the missing values themselves, necessitating careful analysis and explanation of results.

Implementation

Generate Synthetic Dataset with Missing Values:
- Generate a synthetic dataset (X) with 1000 samples and 5 features.
- Create a binary target variable (y).
- Introduce missing values (20% missing) into the dataset.
Split Data into Training and Testing Sets:
- Split the dataset into training and testing sets (80-20 split).
Modeling Method:
- Create indicator variables for missing values in the training set (X_train_modeled) using pandas DataFrame.
- Impute missing values in the training set with the mean of each feature.
- Train a logistic regression model (model_modeled) on the training set (X_train_modeled).
Evaluate Model on Test Set:
- Create indicator variables for missing values in the test set (X_test_modeled) using pandas DataFrame.
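- Impute missing values in the test set with the mean of each feature. Note that the code uses the test set's own means; filling with the training-set means is generally preferable, since it avoids leaking test-set statistics into the evaluation.
- Evaluate the trained model (model_modeled) on the test set (X_test_modeled) and calculate the accuracy (accuracy_modeled).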

Python

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import pandas as pd

# Step 1: Generate Synthetic Dataset with Missing Values
np.random.seed(2)
n_samples = 1000
n_features = 5
X = np.random.rand(n_samples, n_features)
y = np.random.randint(0, 2, n_samples)  # binary target variable
missing_mask = np.random.rand(n_samples, n_features) < 0.2  # 20% missing values
X[missing_mask] = np.nan

# Step 2: Split Data into Training and Testing Sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Modeling Method:
# Create indicator variables for missing values
X_train_modeled = pd.DataFrame(X_train).copy()
X_train_modeled.columns = [f"Feature_{i}" for i in range(n_features)]
for col in X_train_modeled.columns:
    X_train_modeled[str(col) + '_missing'] = X_train_modeled[col].isnull().astype(int)
X_train_modeled = X_train_modeled.fillna(X_train_modeled.mean())  # Impute missing values with mean

# Train logistic regression model
model_modeled = LogisticRegression()
model_modeled.fit(X_train_modeled, y_train)

# Evaluate model on test set
X_test_modeled = pd.DataFrame(X_test).copy()
X_test_modeled.columns = [f"Feature_{i}" for i in range(n_features)]  # Preserve feature names
for col in X_test_modeled.columns:
    X_test_modeled[str(col) + '_missing'] = X_test_modeled[col].isnull().astype(int)
# Impute missing values with the test-set means.
# In practice, the training-set means are preferable here to avoid leakage.
X_test_modeled = X_test_modeled.fillna(X_test_modeled.mean())

accuracy_modeled = model_modeled.score(X_test_modeled, y_test)
print("Accuracy with Modeling Method:", accuracy_modeled)

Output:

Accuracy with Modeling Method: 0.46

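The output "Accuracy with Modeling Method: 0.46" indicates that the logistic regression model trained with missingness indicators correctly predicted the binary target for about 46% of the instances in the testing set. This is slightly below chance, which is unsurprising: both the target and the missingness pattern are generated at random, so neither the features nor the indicators carry any information about the outcome, and the result reflects sampling noise rather than a weakness of the method.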
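Indicators pay off when missingness itself is related to the target. The sketch below assumes a hypothetical mechanism in which feature 0 is missing far more often for positive cases, and compares mean imputation with and without indicators. Both pipelines are fitted on the training set only, so the test set is filled with training means.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
n_samples = 2000
X = rng.random((n_samples, 3))
y = rng.integers(0, 2, n_samples)

# Hypothetical mechanism: feature 0 is missing for 60% of positives, 10% of negatives
p_missing = np.where(y == 1, 0.6, 0.1)
X[rng.random(n_samples) < p_missing, 0] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

plain = make_pipeline(SimpleImputer(strategy="mean"), LogisticRegression())
flagged = make_pipeline(
    SimpleImputer(strategy="mean", add_indicator=True), LogisticRegression()
)

acc_plain = plain.fit(X_train, y_train).score(X_test, y_test)
acc_flagged = flagged.fit(X_train, y_train).score(X_test, y_test)

print("Mean imputation only:     ", acc_plain)
print("Mean imputation + flags:  ", acc_flagged)
```

In this setup the feature values say nothing about the target, so the plain model has nothing to learn from, while the flagged model can exploit the fact that a missing value makes a positive label more likely.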

Conclusion

Handling missing data is crucial for building reliable logistic regression models. Deletion, imputation, and missingness indicators each trade simplicity against information loss and potential bias, and choosing among them with care helps mitigate bias and improve predictions. With careful consideration and implementation, logistic regression can provide valuable insights into binary outcomes in various fields.
