
Handling Missing Data in Decision Tree Models

Last Updated : 19 Mar, 2024

Decision trees, a popular and powerful tool in data science and machine learning, are adept at handling both regression and classification tasks. However, their performance can suffer due to missing or incomplete data, which is a frequent challenge in real-world datasets. This article delves into the intricacies of handling missing data in decision tree models and explores strategies to mitigate its impact.

Handling Missing Data in Decision Trees

Decision trees handle missing data by either ignoring instances with missing values, imputing them using statistical measures, or creating separate branches. During prediction, the tree follows the training strategy, applying imputation or navigating a dedicated branch for instances with missing data.

Types of Missing Data

Before tackling strategies, it’s crucial to understand the various types of missing data:

  • Missing Completely at Random (MCAR): The occurrence of missing data is entirely random and unrelated to any observed or unobserved variables in the dataset. The missing values result from a purely random process, and there is no systematic reason for their absence.
  • Missing at Random (MAR): The probability of missing data depends on the observed variables in the dataset, but once those variables are accounted for, the missingness is random. In other words, the missingness can be predicted or explained by other observed variables.
  • Missing Not at Random (MNAR): The probability of missing data depends on the unobserved values themselves, so there is a systematic relationship between the missingness and the values that are missing.
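
To make these mechanisms concrete, the sketch below simulates each one on a small synthetic dataset (the column names "distance" and "delay" are invented for illustration): the probability that a "delay" value is missing is constant under MCAR, depends on the observed "distance" under MAR, and depends on the unobserved "delay" value itself under MNAR.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1000

# Toy data: flight distance and departure delay (assumed example columns)
df = pd.DataFrame({
    "distance": rng.uniform(100, 3000, n),
    "delay": rng.normal(15, 10, n),
})

# MCAR: every "delay" value goes missing with the same probability,
# independently of anything in the data.
mcar = df.copy()
mcar.loc[rng.random(n) < 0.1, "delay"] = np.nan

# MAR: the chance that "delay" is missing depends on an observed column
# ("distance"), but not on the delay value itself.
mar = df.copy()
p = np.where(mar["distance"] > 2000, 0.3, 0.05)
mar.loc[rng.random(n) < p, "delay"] = np.nan

# MNAR: the chance that "delay" is missing depends on the (unobserved)
# delay value itself -- e.g., large delays are under-reported.
mnar = df.copy()
p = np.where(mnar["delay"] > 25, 0.5, 0.05)
mnar.loc[rng.random(n) < p, "delay"] = np.nan
```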

How Decision Trees Handle Missing Values

Decision trees employ a systematic approach to handle missing data during both training and prediction stages. Here’s a breakdown of these steps:

Attribute Splitting

The algorithm begins by selecting the most suitable feature (based on measures like Gini impurity) to separate the data. If a data point has a missing value in the chosen feature, the tree will utilize the available data to decide which branch to send it down.

  • When a feature with missing values is chosen for splitting, decision trees consider the available data to decide the appropriate branch.
  • The decision is based on the available non-missing values in the chosen feature.

Decision trees seamlessly incorporate instances with missing values into their decision-making process during both training and prediction.
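
As a rough, hypothetical illustration of this routing (not scikit-learn's internal code), the helper below sends instances whose split feature is missing down the branch taken by the majority of the non-missing instances, which is one common simple strategy:

```python
import numpy as np

def route_samples(feature_values, threshold):
    """Assign each sample to the left or right child of a split.

    Samples with a missing value follow the branch chosen by the
    majority of the non-missing samples (one simple strategy; real
    implementations may instead pick the branch that improves the
    split criterion).
    """
    feature_values = np.asarray(feature_values, dtype=float)
    missing = np.isnan(feature_values)

    goes_left = np.zeros(len(feature_values), dtype=bool)
    goes_left[~missing] = feature_values[~missing] <= threshold

    # Majority direction among the non-missing samples.
    left_fraction = goes_left[~missing].mean() if (~missing).any() else 0.5
    goes_left[missing] = left_fraction >= 0.5
    return goes_left

# Example: the third sample has a missing value and follows the majority.
print(route_samples([1.0, 2.5, np.nan, 4.0], threshold=3.0))
# -> [ True  True  True False]
```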

Weighted Impurity Calculation

When building the tree, the algorithm chooses the feature that offers the best split at each node.

  • When a feature with missing values is considered, the algorithm calculates the impurity of both resulting branches, letting the instances with missing values contribute to the calculation, in order to assess the overall impurity of the split.
  • The impurity calculation is weighted based on the proportion of instances in each branch.
  • This ensures that the decision tree incorporates the impact of missing values when assessing the quality of a split.

The algorithm doesn’t disregard missing values but rather weighs their impact when evaluating impurity and making decisions.
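
A minimal sketch of this idea, loosely following C4.5-style fractional weighting (the exact bookkeeping differs between implementations): instances with a missing split value contribute to both branches with weights proportional to the branch sizes, and the split is scored by the weighted average impurity of the branches.

```python
import numpy as np

def gini(labels, weights):
    """Weighted Gini impurity of a set of class labels."""
    total = weights.sum()
    if total == 0:
        return 0.0
    impurity = 1.0
    for c in np.unique(labels):
        p = weights[labels == c].sum() / total
        impurity -= p ** 2
    return impurity

def weighted_split_impurity(x, y, threshold):
    """Impurity of a split on feature x, distributing instances with a
    missing value fractionally to both branches, in proportion to the
    number of known values that fall on each side."""
    x, y = np.asarray(x, dtype=float), np.asarray(y)
    missing = np.isnan(x)

    left = np.zeros(len(x), dtype=bool)
    left[~missing] = x[~missing] <= threshold
    right = (~missing) & ~left

    n_left, n_right = left.sum(), right.sum()
    if n_left + n_right == 0:           # all values missing: no real split
        return gini(y, np.ones(len(y)))
    f_left = n_left / (n_left + n_right)

    w = np.ones(len(x))
    left_labels = np.concatenate([y[left], y[missing]])
    left_weights = np.concatenate([w[left], w[missing] * f_left])
    right_labels = np.concatenate([y[right], y[missing]])
    right_weights = np.concatenate([w[right], w[missing] * (1 - f_left)])

    total = left_weights.sum() + right_weights.sum()
    return (left_weights.sum() * gini(left_labels, left_weights)
            + right_weights.sum() * gini(right_labels, right_weights)) / total

# Example: the third value is missing and is shared between both branches.
print(weighted_split_impurity([1.0, 2.0, np.nan, 5.0], [0, 0, 1, 1], threshold=3.0))
# -> 0.25
```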

Surrogate Splits

Decision trees anticipate and account for missing values during prediction by using surrogate splits. Surrogate splits are backup rules or branches that can be used when the value of the primary splitting feature is missing.

  • Decision trees calculate surrogate splits during training, considering the next best options for splitting when the primary feature has missing values.
  • When making predictions for instances with missing values, the tree follows the surrogate splits to determine the appropriate branch.

The anticipation of missing values in decision tree training allows for the creation of surrogate splits, enhancing the model’s robustness during prediction.
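
Note that scikit-learn's trees do not build surrogate splits (they appear in CART implementations such as R's rpart), but the idea can be sketched by hand: after fixing a primary split, search the remaining features for the backup split that most often sends the same samples in the same direction, and use it whenever the primary feature is missing. A hypothetical illustration:

```python
import numpy as np

def find_surrogate(X, primary_col, primary_thr):
    """Find the (column, threshold) whose split best agrees with the
    primary split, judged on rows where both features are observed."""
    primary = X[:, primary_col]
    best = (None, None, 0.0)  # (surrogate column, threshold, agreement rate)

    for col in range(X.shape[1]):
        if col == primary_col:
            continue
        both = ~np.isnan(primary) & ~np.isnan(X[:, col])
        if not both.any():
            continue
        primary_left = primary[both] <= primary_thr
        for thr in np.unique(X[both, col]):
            candidate_left = X[both, col] <= thr
            agreement = (candidate_left == primary_left).mean()
            if agreement > best[2]:
                best = (col, thr, agreement)
    return best

# During prediction, a sample missing the primary feature would be routed
# using the surrogate column and threshold instead.
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [np.nan, 30.0],
              [4.0, 40.0]])
print(find_surrogate(X, primary_col=0, primary_thr=2.5))
# -> (1, 20.0, 1.0)
```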

Handling Missing Data in Decision Tree Models: Example

Let’s walk through an illustrative example. Imagine building a decision tree to predict flight delays. Some flights might have missing data for the “weather” attribute. The approach involves:

  1. Optimal Feature Selection: The decision tree starts by selecting the most informative feature for the initial split, such as “time of day.” The goal is to create subsets that best distinguish between delayed and non-delayed flights.
  2. Impurity Calculation with Missing Data Weights: As the tree grows, instances with missing “weather” data are encountered.
    • Impurity measures (like entropy or Gini impurity) are calculated, taking into account the weights of instances where “weather” data is missing.
    • This ensures that the decision tree considers the impact of missing values on the overall impurity calculation.
  3. Surrogate Splits Implementation: To handle missing “weather” data in subsequent nodes, the decision tree incorporates surrogate splits.
    • Surrogate splits act as backup rules or alternative features, like “airline,” in case the primary feature (“time of day”) has missing values.
    • This adaptive strategy ensures that the model can still make informed decisions even when certain features, such as “weather,” are unavailable.

By following these steps, decision trees adapt to missing data and maintain their ability to accurately predict outcomes, even when certain features are unavailable. This adaptability strengthens the model’s resilience in real-world applications.
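
As a small, self-contained sketch of this scenario (the flight data below is synthetic and the feature names are assumptions), scikit-learn 1.3 and later lets a DecisionTreeClassifier train and predict directly on features that contain NaN, such as a missing weather score:

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 500

# Synthetic flights: departure hour and a numeric weather-severity score.
hour = rng.integers(0, 24, n).astype(float)
weather = rng.uniform(0, 10, n)
delayed = ((hour > 17) | (weather > 7)).astype(int)

# About 20% of flights have no recorded weather.
weather[rng.random(n) < 0.2] = np.nan

X = pd.DataFrame({"hour": hour, "weather": weather})
y = delayed

# scikit-learn >= 1.3: the tree accepts NaN in X during fitting and prediction.
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(clf.predict(pd.DataFrame({"hour": [20.0], "weather": [np.nan]})))
```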

Handling Missing Data in Decision Trees in Python

Decision tree algorithms in Python, particularly scikit-learn’s tree estimators (version 1.3 and later), come with built-in support for missing values (NaN) during tree construction. Below is the step-by-step approach to handling missing data in Python; an end-to-end sketch follows the list.

  1. Import Libraries: Import necessary libraries from scikit-learn like DecisionTreeClassifier.
  2. Load and Split Data: Load your dataset using tools like pandas and split it into features (X) and target variable (y). Further divide the data into training and testing sets using train_test_split.
  3. Missing Value Handling: scikit-learn 1.3 and later handles NaN values natively during tree construction; if you are on an older version, or if problematic missing values remain, address them using techniques like mean or median imputation.
  4. Build the Decision Tree: Create the model (e.g., DecisionTreeClassifier) and train it on the training dataset. With native NaN support there is no need for manual pre-processing of missing values, as the algorithm handles them automatically during tree construction.
  5. Make Predictions: Once trained, use the model to make predictions on new data, including those with missing values.
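
Putting the steps together, here is a minimal end-to-end sketch. It uses the built-in Iris dataset with artificially injected NaN values so it runs as-is; with scikit-learn 1.3 or later the classifier accepts the NaN values directly, while the imputation pipeline shows the fallback for older versions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# 1-2. Load the data and split it into training and test sets.
X, y = load_iris(return_X_y=True)

# Inject some missing values to mimic an incomplete real-world dataset.
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.1] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# 3-4. Build the tree. scikit-learn >= 1.3 accepts NaN directly...
native_tree = DecisionTreeClassifier(max_depth=4, random_state=0)
native_tree.fit(X_train, y_train)

# ...while on older versions (or by preference) you can impute first.
imputed_tree = make_pipeline(SimpleImputer(strategy="mean"),
                             DecisionTreeClassifier(max_depth=4, random_state=0))
imputed_tree.fit(X_train, y_train)

# 5. Predict on unseen data, which may itself contain missing values.
for name, model in [("native NaN handling", native_tree),
                    ("mean imputation", imputed_tree)]:
    print(name, accuracy_score(y_test, model.predict(X_test)))
```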

Conclusion

In conclusion, decision trees effectively handle missing data through attribute splitting, weighted impurity calculation, and surrogate splits. Python’s scikit-learn library simplifies the process, enhancing model adaptability and predictive accuracy in real-world scenarios.


