Handling Missing Values with Random Forest

Missing values are a common problem in machine learning: they shrink the usable data and can bias statistical models. Random Forest, an ensemble learning method that handles both classification and regression, offers a more flexible way to fill these gaps than simple strategies such as replacing every missing entry with the column mean. A model trained on the complete rows can predict the missing entries from the other features, an approach often applied in fields such as healthcare. In this article, we will see how to handle missing values explicitly using Random Forest.

What are structurally missing data?

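Structurally missing data are values that are logically undefined rather than randomly absent, usually because the field does not apply to the record (for example, the salary of a respondent who is not employed). The value is not missing through error or chance; under certain conditions it cannot exist.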

Handling Structurally Missing Data:

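Recoding and Filtering: Address structurally missing data by recoding the undefined values explicitly or by filtering out the instances where the field does not apply.
Modeling Considerations: Include a variable with structurally missing data only as an interaction term with the condition that makes it applicable, without a main effect of its own.
Population Considerations: Recognize that records with and without the value represent different populations, and let that inform the decision to drop or keep them.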

Understanding and handling structurally missing data is crucial for accurate analysis and modeling, allowing researchers to make informed decisions without bias or inaccuracies.
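
The filtering and recoding options above can be sketched on a small made-up table. The survey DataFrame, its column names, and the choice of 0.0 as a placeholder are all assumptions for illustration, not part of the Titanic walkthrough.

```python
import numpy as np
import pandas as pd

# Hypothetical survey: 'salary' is structurally missing for unemployed respondents.
survey = pd.DataFrame({
    "employed": [True, False, True, False],
    "salary": [52000.0, np.nan, 61000.0, np.nan],
})

# Filtering: analyse salary only for the population where it is defined.
employed_only = survey[survey["employed"]]

# Recoding: keep every row, store a placeholder, and keep an explicit
# indicator so the placeholder is never mistaken for a real salary.
recoded = survey.assign(
    salary_applicable=survey["employed"],
    salary=survey["salary"].fillna(0.0),
)

# Interaction term: salary contributes only where it is applicable.
recoded["salary_x_applicable"] = recoded["salary"] * recoded["salary_applicable"]

print(employed_only)
print(recoded)
```

Which option fits depends on the question being asked: filtering changes the population under study, while recoding keeps all rows but relies on the indicator column being used in the model.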

Missing data that is not structural is usually classified by the mechanism behind it:

MCAR (Missing Completely At Random): The chance that a value is missing is the same for every observation and unrelated to any data, observed or not. This reduces the analyzable population and statistical power but does not introduce bias.
MAR (Missing At Random): The chance that a value is missing depends on other observed data but not on the missing value itself. Methods such as Multiple Imputation and Maximum Likelihood are needed for accurate handling.
NMAR (Not Missing At Random): The chance that a value is missing depends on the unobserved value itself. This is the hardest case: standard imputation techniques can give biased results, and specialized methods are required.
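
To make the three mechanisms concrete, here is a small simulation on synthetic data. The column names, the 20% / 35% / 5% missingness rates, and the linear link between income and age are arbitrary assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Synthetic data: 'income' is always observed, 'age' will lose values.
income = rng.normal(50, 10, n)
age = 40 + 0.6 * (income - 50) + rng.normal(0, 8, n)
df = pd.DataFrame({"income": income, "age": age})

# MCAR: every row has the same 20% chance of losing 'age'.
mcar_mask = rng.random(n) < 0.2

# MAR: the chance of losing 'age' depends only on the observed 'income'.
mar_mask = rng.random(n) < np.where(df["income"] > 50, 0.35, 0.05)

# NMAR: the chance of losing 'age' depends on the unobserved 'age' itself.
nmar_mask = rng.random(n) < np.where(df["age"] > 40, 0.35, 0.05)

true_mean = df["age"].mean()
for name, mask in [("MCAR", mcar_mask), ("MAR", mar_mask), ("NMAR", nmar_mask)]:
    observed_mean = df.loc[~mask, "age"].mean()
    print(f"{name}: observed mean age {observed_mean:.2f}, true mean {true_mean:.2f}")
```

Under MCAR the mean of the remaining ages stays close to the true mean. Under MAR and NMAR it drifts, but only in the MAR case can the drift be corrected using the observed income column.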

Imputation Techniques for Handling Missing Values with Random Forest

Random Forest Imputation: Uses Random Forest to handle missing data, with techniques such as proximity imputation and on-the-fly imputation for complex datasets. It requires careful parameter tuning but can capture complex, non-linear relationships in the data.
Miss Forest: An iterative imputation algorithm built on Random Forest. It handles mixed numerical and categorical data with little pre-processing, and the forests implicitly favor informative features. It has been reported to outperform KNN-Impute and to work well for imputing missing laboratory data in medical predictive models.
MICE Forest: Integrates Random Forest models into MICE (Multiple Imputation by Chained Equations). It starts with a preliminary imputation and refines it with Random Forests over several iterations. It has been reported to give efficient hazard ratio estimates and suits complex datasets with missing data.
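
The iterative idea behind these methods can be approximated with scikit-learn's IterativeImputer using a RandomForestRegressor as its estimator. This is a rough sketch on synthetic data, not the reference MissForest or MICE Forest implementation; the data, the 40 masked rows, and the hyperparameters are assumptions.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the next import)
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic data: the third column depends on the first two.
X_full = rng.normal(size=(200, 3))
X_full[:, 2] = X_full[:, 0] + 0.5 * X_full[:, 1] + rng.normal(scale=0.1, size=200)

# Hide 40 values of the third column.
X_missing = X_full.copy()
missing_rows = rng.choice(200, size=40, replace=False)
X_missing[missing_rows, 2] = np.nan

# Start from a simple fill, then refine it with Random Forest predictions.
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=50, random_state=0),
    max_iter=5,
    random_state=0,
)
X_imputed = imputer.fit_transform(X_missing)

# Compare against filling with the column mean.
rf_error = np.abs(X_imputed[missing_rows, 2] - X_full[missing_rows, 2]).mean()
mean_error = np.abs(np.nanmean(X_missing[:, 2]) - X_full[missing_rows, 2]).mean()
print(f"Random Forest imputation error: {rf_error:.3f}")
print(f"Mean imputation error: {mean_error:.3f}")
```

Because the true values are known here, the two errors can be compared directly; on real data a similar check requires masking values that were actually observed.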

Handling Missing Values with Random Forest using Python

In this section, we will walk through the process of handling missing values in a dataset using Random Forest as a predictive model. Specifically, we'll focus on predicting missing 'Age' values in the Titanic dataset, a classic dataset used in machine learning and data analysis.

Step 1: Importing Necessary Libraries

Python

# Import Libraries
import pandas as pd
import numpy as np

Step 2: Loading Datasets

Here, we use the Titanic dataset, saved locally as Data.csv.

Python

# Importing dataset and setting 'PassengerId' as index
Data = pd.read_csv('Data.csv', index_col='PassengerId')
Data.head()

Output:

    Survived    Pclass    Name    Sex    Age    SibSp    Parch    Ticket    Fare    Cabin    Embarked
PassengerId                                            
1    0    3    Braund, Mr. Owen Harris    male    22.0    1    0    A/5 21171    7.2500    NaN    S
2    1    1    Cumings, Mrs. John Bradley (Florence Briggs Th...    female    38.0    1    0    PC 17599    71.2833    C85    C
3    1    3    Heikkinen, Miss. Laina    female    26.0    0    0    STON/O2. 3101282    7.9250    NaN    S
4    1    1    Futrelle, Mrs. Jacques Heath (Lily May Peel)    female    35.0    1    0    113803    53.1000    C123    S
5    0    3    Allen, Mr. William Henry    male    35.0    0    0    373450    8.0500    NaN    S

Step 3: Data Preprocessing

Handling Missing Values:
The code Data.isnull().sum() is used to check for missing values in a DataFrame called Data.

Python

# Missing Values
Data.isnull().sum()

Output:

Survived      0
Pclass        0
Name          0
Sex           0
Age         177
SibSp         0
Parch         0
Ticket        0
Fare          0
Cabin       687
Embarked      2
dtype: int64

Because 687 of the 891 rows have no 'Cabin' value, the next step removes the 'Cabin' column from the DataFrame Data. The columns=['Cabin'] argument already specifies that a column (not a row) is being dropped, so the additional axis=1 argument is redundant but harmless.

Python

# Dropping 'Cabin' column due to missing values
Data = Data.drop(columns=['Cabin'], axis=1)
Data.head()

Output:

    Survived    Pclass    Name    Sex    Age    SibSp    Parch    Ticket    Fare    Embarked
PassengerId                                        
1    0    3    Braund, Mr. Owen Harris    male    22.0    1    0    A/5 21171    7.2500    S
2    1    1    Cumings, Mrs. John Bradley (Florence Briggs Th...    female    38.0    1    0    PC 17599    71.2833    C
3    1    3    Heikkinen, Miss. Laina    female    26.0    0    0    STON/O2. 3101282    7.9250    S
4    1    1    Futrelle, Mrs. Jacques Heath (Lily May Peel)    female    35.0    1    0    113803    53.1000    S
5    0    3    Allen, Mr. William Henry    male    35.0    0    0    373450    8.0500    S

The code Data.Embarked.value_counts() is used to count the number of occurrences of each unique value in the 'Embarked' column of the DataFrame Data.

Python

Data.Embarked.value_counts()

Output:

S    644
C    168
Q     77
Name: Embarked, dtype: int64

The code calculates the most frequent category in the 'Embarked' column using value_counts(). The index[0] part retrieves the first (i.e., most frequent) category from the resulting Series.

Python

# 'S' is the most frequent category, so the null values are replaced with it (the mode of the column)
Data['Embarked'] = Data['Embarked'].fillna(Data['Embarked'].value_counts().index[0])

Next, we split the data on the 'Age' column:
- DataWithAge = Data[Data['Age'].notnull()] keeps only the rows where 'Age' is present.
- DataWithoutAge = Data[Data['Age'].isnull()] keeps only the rows where 'Age' is missing.
- print(DataWithAge.shape, DataWithoutAge.shape) prints the dimensions of each DataFrame as a (number of rows, number of columns) tuple, which confirms the size of the split.

Python

# Splitting data into sets with and without missing 'Age' values
DataWithAge = Data[Data['Age'].notnull()]
DataWithoutAge = Data[Data['Age'].isnull()]
# Check the size of each subset
print(DataWithAge.shape, DataWithoutAge.shape)

Output:

 (714, 10) (177, 10)

Features is a list containing the names of the selected features. These features are:
- 'Survived': Whether the passenger survived or not (1 = Yes, 0 = No)
- 'Pclass': Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
- 'Age': Age of the passenger
- 'SibSp': Number of siblings/spouses aboard
- 'Parch': Number of parents/children aboard
- 'Fare': Passenger fare

Python

# Since the focus is on filling missing values, only the numeric features relevant to that task are selected.
Features = ['Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare']

One-Hot Encoding

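The code below one-hot encodes the categorical variables ('Embarked' and 'Sex') in the DataWithAge and DataWithoutAge DataFrames, creating a binary column for each category except the first, which drop_first=True omits.
It then selects the subset of features (Features) from both DataFrames: 'Survived', 'Pclass', 'Age', 'SibSp', 'Parch', and 'Fare'.
Finally, it concatenates the selected features and the one-hot encoded columns to create the training set (TrainSet) and the test set (TestSet). Because the two subsets are encoded separately, this relies on both of them containing every category; otherwise the two sets would end up with different columns.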

Python

# Additionally, categorical variables must be encoded as numeric values. This task can be done using one-hot encoding
one_hot_embarked = pd.get_dummies(DataWithAge['Embarked'], drop_first=True)
one_hot_sex = pd.get_dummies(DataWithAge['Sex'], drop_first=True)
DataWithAge = DataWithAge[Features]
TrainSet = pd.concat([DataWithAge, one_hot_sex, one_hot_embarked], axis=1)

one_hot_embarked = pd.get_dummies(DataWithoutAge['Embarked'], drop_first=True)
one_hot_sex = pd.get_dummies(DataWithoutAge['Sex'], drop_first=True)
DataWithoutAge = DataWithoutAge[Features]
TestSet = pd.concat([DataWithoutAge, one_hot_sex, one_hot_embarked], axis=1)
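
One way to guarantee that the training and test sets share the same columns is to encode once, before splitting. The sketch below uses a small made-up DataFrame (toy) rather than the Titanic data; note that passing columns= to get_dummies prefixes the new column names (for example Sex_male), unlike the unprefixed names produced above.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the Titanic columns (values are made up).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan, 35.0],
    "Sex": ["male", "female", "female", "male", "male"],
    "Embarked": ["S", "C", "Q", "S", "S"],
})

# Encode the whole table once so every category is seen.
encoded = pd.get_dummies(toy, columns=["Sex", "Embarked"], drop_first=True)

# Split afterwards: both parts inherit identical columns.
train_part = encoded[encoded["Age"].notnull()]
test_part = encoded[encoded["Age"].isnull()]

print(list(encoded.columns))
print(list(train_part.columns) == list(test_part.columns))
```

In this toy example the rows without an age never contain the category 'Q', so encoding the two subsets separately would silently drop different reference categories in each.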

Step 4: Model Building

Importing the Random Forest Regressor:
- from sklearn.ensemble import RandomForestRegressor: This line imports the RandomForestRegressor class from the sklearn.ensemble module, which is used to train a random forest regression model.
Selecting the Input Features:
- Independent_Features = [col for col in TrainSet.columns if col != 'Age']: This line collects every column of TrainSet except the target 'Age', so the model is trained on 'Survived', 'Pclass', 'SibSp', 'Parch', 'Fare', and the one-hot encoded columns.
Creating the Random Forest Regressor:
- rf_age = RandomForestRegressor(): This line creates an instance of the RandomForestRegressor class with default settings.
Training the Model:
- rf_age.fit(TrainSet[Independent_Features], TrainSet['Age']): This line trains the model (rf_age) using the features in Independent_Features as input and the 'Age' column of TrainSet as the target, so that it learns the relationship between the features and age.

Python

# Now the crucial part: train the Random Forest regressor that will predict the values of the 'Age' column
from sklearn.ensemble import RandomForestRegressor
# Every column except the target 'Age' is used as an input feature
Independent_Features = [col for col in TrainSet.columns if col != 'Age']
rf_age = RandomForestRegressor()
# Training
rf_age.fit(TrainSet[Independent_Features], TrainSet['Age'])
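
Before relying on imputed values, it can help to check how well a model of this kind predicts ages that are actually known. The following is a minimal sketch on synthetic data; the known DataFrame, the formula used to generate 'Age', and the hyperparameters are assumptions for illustration. On the Titanic data the same check would hold out part of the rows that have an age.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 500

# Synthetic stand-in for rows where 'Age' is known.
known = pd.DataFrame({
    "Pclass": rng.integers(1, 4, n),
    "SibSp": rng.integers(0, 4, n),
    "Fare": rng.gamma(2.0, 15.0, n),
})
known["Age"] = 45 - 6 * known["Pclass"] - 3 * known["SibSp"] + rng.normal(0, 3, n)

# Hold out a quarter of the known ages to measure imputation quality.
X_train, X_val, y_train, y_val = train_test_split(
    known.drop(columns="Age"), known["Age"], test_size=0.25, random_state=0
)

model = RandomForestRegressor(n_estimators=100, min_samples_leaf=5, random_state=0)
model.fit(X_train, y_train)

# Compare with the naive baseline of always predicting the mean age.
mae_model = mean_absolute_error(y_val, model.predict(X_val))
mae_baseline = mean_absolute_error(y_val, np.full(len(y_val), y_train.mean()))
print(f"Random Forest MAE: {mae_model:.2f}")
print(f"Mean baseline MAE: {mae_baseline:.2f}")
```

If the model does not clearly beat the mean baseline on held-out rows, the extra complexity of model-based imputation is unlikely to pay off.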

Step 5: Prediction

Predicted_Ages = rf_age.predict(TestSet[Independent_Features]): This line uses the trained random forest regressor (rf_age) to predict the 'Age' values in the test set (TestSet). The predict method takes the independent features (Independent_Features) from the test set as input and returns an array of predicted 'Age' values.

Python

# Predicting missing 'Age' values in the test set
Predicted_Ages = rf_age.predict(TestSet[Independent_Features])
Predicted_Ages

Output:

array([42.85055556, 35.97916667, 14.9       , 33.98904762, 18.7       ,
       27.4787528 , 36.16666667, 19.15      , 22.46666667, 33.444     ,
       31.494228  , 41.00333333, 19.15      , 24.48333333, 33.6       ,
       41.1       , 11.009     , 27.4787528 , 31.494228  , 19.15      ,
       31.494228  , 31.494228  , 27.4787528 , 26.44335664, 18.9       ,
       31.494228  , 50.64722222, 16.56666667, 29.35      , 29.97451441,
       25.18416667, 10.69333333, 35.        , 58.9       ,  4.23      ,
       ...
       50.64722222, 13.25      , 49.1       , 38.81666667, 25.        ,
       34.2       , 34.645     , 26.60555556, 31.494228  , 38.55      ,
       10.69333333, 27.325     , 26.60555556, 13.25      , 24.63087302,
       27.4787528 , 26.3       ])

Casting Predicted Ages to Integers:
- TestSet['Age'] = Predicted_Ages.astype(int): This line casts the predicted 'Age' values (Predicted_Ages) to integers and assigns them to the 'Age' column of the test set (TestSet). Most ages in the original dataset are whole numbers, so the imputed values follow the same convention. Note that astype(int) truncates the fractional part rather than rounding.
Concatenating Training and Test Datasets:
- Titanic_set = pd.concat([TrainSet, TestSet]): This line stacks the training set (TrainSet) and the modified test set (TestSet, with missing 'Age' values replaced by predictions) row-wise to create a final dataset (Titanic_set) with no missing 'Age' values. pd.concat is used because DataFrame.append was removed in pandas 2.0.

Python

# Most ages in the original dataset are whole numbers, so the predicted values
# are cast to int (truncating the fraction) and used to fill the missing ages.
TestSet['Age'] = Predicted_Ages.astype(int)
# Concatenate the training and test sets into a final dataset with no missing 'Age' values.
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0.
Titanic_set = pd.concat([TrainSet, TestSet])
# Final dataset with no null values in 'Age'.
Titanic_set.head()
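
The difference between truncating and rounding is easy to overlook, so here is a small illustration with made-up predictions. Rounding with np.round before the cast is an alternative, not what the walkthrough above does.

```python
import numpy as np

# Made-up predicted ages for illustration.
predicted = np.array([27.9, 18.7, 4.23, 35.5])

# astype(int) simply drops the fractional part.
truncated = predicted.astype(int)

# np.round goes to the nearest integer (exact ties go to the even integer).
rounded = np.round(predicted).astype(int)

print(truncated)  # [27 18  4 35]
print(rounded)    # [28 19  4 36]
```

Truncation systematically biases the imputed ages downward by up to one year, which rounding avoids.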

Output:

    Survived    Pclass    Age    SibSp    Parch    Fare    male    Q    S
PassengerId
1    0    3    22.0    1    0    7.2500    True    False    True
2    1    1    38.0    1    0    71.2833    False    False    False
3    1    3    26.0    0    0    7.9250    False    False    True
4    1    1    35.0    1    0    53.1000    False    False    True
5    0    3    35.0    0    0    8.0500    True    False    True

Titanic_set.shape returns the dimensions (number of rows, number of columns) of the combined dataset, which contains the original training rows plus the test rows with their 'Age' values filled in. All 891 passengers should be present.

Python

Titanic_set.shape

Output:

(891, 9)

The code Titanic_set.isnull().sum() is used to check for missing values in the Titanic_set DataFrame after replacing missing 'Age' values with predicted values.

Python

# Final check for missing values
Titanic_set.isnull().sum()

Output:

Survived    0
Pclass      0
Age         0
SibSp       0
Parch       0
Fare        0
male        0
Q           0
S           0
dtype: int64

Each number in the output is the count of missing values in the corresponding column. Since every count is 0, the Titanic_set DataFrame has no missing values left after the missing 'Age' values were replaced with predictions and the categorical variables were one-hot encoded.
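
The whole procedure can also be wrapped in a small reusable function. The sketch below is one possible design, not the article's original code: impute_with_random_forest is a hypothetical helper, the toy DataFrame is made up, and the function fills values in place (on a copy) instead of splitting and concatenating, which keeps the original row order.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


def impute_with_random_forest(df, target, features, random_state=0):
    """Return a copy of df with missing values in `target` predicted from `features`."""
    result = df.copy()
    missing = result[target].isnull()
    if not missing.any():
        return result  # nothing to impute
    model = RandomForestRegressor(n_estimators=100, random_state=random_state)
    # Train on the rows where the target is known...
    model.fit(result.loc[~missing, features], result.loc[~missing, target])
    # ...and fill only the rows where it is missing.
    result.loc[missing, target] = model.predict(result.loc[missing, features])
    return result


# Toy data with two missing ages (values are made up).
toy = pd.DataFrame({
    "Pclass": [1, 3, 2, 3, 1, 2, 3, 1],
    "Fare": [71.3, 7.9, 13.0, 8.1, 53.1, 26.0, 7.3, 80.0],
    "Age": [38.0, 26.0, np.nan, 35.0, 54.0, np.nan, 22.0, 45.0],
})

filled = impute_with_random_forest(toy, target="Age", features=["Pclass", "Fare"])
print(filled)
```

The function assumes the feature columns are numeric and complete, so categorical encoding and any other imputation must happen before it is called.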
