
Handling Missing Values with Random Forest

Missing values are a common obstacle in machine learning, because many statistical models cannot be fitted to incomplete data. Random Forest, an ensemble learning method built from decision trees, handles both classification and regression problems and is widely used in fields such as healthcare. For imputation it offers a more nuanced alternative to traditional methods such as filling with the mean or median: a model trained on the rows where a value is known predicts that value for the rows where it is NaN. In this article, we will see how we can handle missing values explicitly using Random Forest.

What are structurally missing data?

Structurally missing data is data that is absent because it logically cannot exist for a given record, not because of an error or random chance. A typical case is a field that does not apply to some rows, such as the age of a spouse for an unmarried respondent.

Handling Structurally Missing Data:

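  1. Recoding and Filtering: Address structurally missing data by recoding it (for example, with an explicit "not applicable" flag) or by filtering out the affected instances.
  2. Modeling Considerations: Include a structurally missing variable only through an interaction with an indicator of whether it applies, without a separate main effect, so that it influences the model only where it is defined.
  3. Population Considerations: Recognize that rows with structurally missing data may represent a different population, and let that inform the decision to keep or drop them.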

Understanding and handling structurally missing data is crucial for accurate analysis and modeling, allowing researchers to make informed decisions without bias or inaccuracies.
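
As a minimal sketch of the recoding and filtering options, the snippet below uses a small made-up survey table in which spouse_age is undefined for unmarried respondents. The column names and values are illustrative assumptions and are not part of the Titanic data used later.

```python
import numpy as np
import pandas as pd

# Hypothetical survey: 'spouse_age' cannot exist for unmarried respondents,
# so those NaN values are structural rather than random.
survey = pd.DataFrame({
    "married": [True, False, True, False],
    "spouse_age": [41.0, np.nan, 35.0, np.nan],
})

# Recoding: keep an explicit applicability flag instead of inventing a value.
survey["spouse_age_applicable"] = survey["married"]

# Filtering: analyse the field only in the population where it is defined.
married_only = survey[survey["spouse_age_applicable"]]
mean_spouse_age = married_only["spouse_age"].mean()
print(mean_spouse_age)
```

Imputing spouse_age for the unmarried rows would invent values that cannot exist, which is why this kind of gap is handled differently from the missing 'Age' values below.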

Imputation Techniques for Handling Missing Values with Random Forest

Handling Missing Values with Random Forest using Python

In this section, we will walk through the process of handling missing values in a dataset using Random Forest as a predictive model. Specifically, we'll focus on predicting missing 'Age' values in the Titanic dataset, which is a classic dataset used in machine learning and data analysis.

Step 1: Importing Necessary Libraries

# Import Libraries
import pandas as pd
import numpy as np

Step 2: Loading Datasets

Here, we are using the Titanic dataset, saved locally as Data.csv.

# Importing dataset and setting 'PassengerId' as index
Data = pd.read_csv('Data.csv', index_col='PassengerId')
Data.head()


Output:

             Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
PassengerId
1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S
5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S

Step 3: Data Preprocessing

Handling Missing Values:
Data.isnull().sum() counts the missing values in each column of the DataFrame Data.

# Missing Values
Data.isnull().sum()

Output:

Survived      0
Pclass        0
Name          0
Sex           0
Age         177
SibSp         0
Parch         0
Ticket        0
Fare          0
Cabin       687
Embarked      2
dtype: int64

# Dropping the 'Cabin' column because most of its values (687 of 891) are missing
Data = Data.drop(columns=['Cabin'])
Data.head()


Output:

             Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Embarked
PassengerId
1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   S
2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C
3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   S
4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  S
5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   S
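
The decision to drop 'Cabin' can also be made from the share of missing values rather than by eye: 687 of its 891 entries (about 77%) are missing. The sketch below assumes a 50% cut-off, which is an arbitrary illustrative threshold, and uses a tiny stand-in frame with made-up values instead of Data.csv.

```python
import numpy as np
import pandas as pd

# Tiny stand-in frame (made-up values) with one sparse column.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.93, 53.10],
})

# Fraction of missing entries per column.
missing_share = toy.isnull().mean()

# Assumed cut-off: drop columns with more than half of their values missing.
threshold = 0.5
sparse_columns = missing_share[missing_share > threshold].index.tolist()
toy_reduced = toy.drop(columns=sparse_columns)
print(sparse_columns)
```

Columns below the cut-off, such as 'Age' here, stay in the frame and remain candidates for imputation.
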
Data.Embarked.value_counts()


Output:

S    644
C    168
Q     77
Name: Embarked, dtype: int64

# 'S' is the most frequent category, so the two missing 'Embarked' values are replaced with it, i.e. the mode
Data['Embarked'] = Data['Embarked'].fillna(Data['Embarked'].mode()[0])



# Splitting data into sets with and without missing 'Age' values
DataWithAge = Data[Data['Age'].notnull()]
DataWithoutAge = Data[Data['Age'].isnull()]
print(DataWithAge.shape, DataWithoutAge.shape)

Output:

(714, 10) (177, 10)

# Since the goal is to fill the missing 'Age' values, keep only the numeric columns used for modelling ('Age' is the target).
Features = ['Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare']


One-Hot Encoding

# Categorical variables must also be encoded as numeric values; one-hot encoding does this (drop_first=True omits one redundant dummy column per variable).
one_hot_embarked = pd.get_dummies(DataWithAge['Embarked'], drop_first=True)
one_hot_sex = pd.get_dummies(DataWithAge['Sex'], drop_first=True)
DataWithAge = DataWithAge[Features]
TrainSet = pd.concat([DataWithAge, one_hot_sex, one_hot_embarked], axis=1)

one_hot_embarked = pd.get_dummies(DataWithoutAge['Embarked'], drop_first=True)
one_hot_sex = pd.get_dummies(DataWithoutAge['Sex'], drop_first=True)
DataWithoutAge = DataWithoutAge[Features]
TestSet = pd.concat([DataWithoutAge, one_hot_sex, one_hot_embarked], axis=1)
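
Encoding the two subsets separately works here because both contain every category of 'Sex' and 'Embarked'. In general, a category that is absent from one subset would give the two frames different columns. One hedged alternative is to encode the full frame once and split afterwards; the toy rows below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows: the only 'Q' passenger has a known age.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "Q", "S"],
})

# Encode once on the full frame so every category maps to the same column.
encoded = pd.get_dummies(toy, columns=["Sex", "Embarked"], drop_first=True, dtype=int)

# Split afterwards: both parts share an identical column layout.
train_part = encoded[encoded["Age"].notnull()]
test_part = encoded[encoded["Age"].isnull()]
print(encoded.columns.tolist())
```

With this ordering, test_part still has an Embarked_Q column even though none of its rows embarked at 'Q'.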

Step 4: Model Building

  1. Importing the Random Forest Regressor:
    • from sklearn.ensemble import RandomForestRegressor: This line imports the RandomForestRegressor class from the sklearn.ensemble module, which is used to train a random forest regression model.
  2. Creating the Random Forest Regressor:
    • rf_age = RandomForestRegressor(): This line creates an instance of the RandomForestRegressor class with its default settings, which will be used to train the model.
  3. Selecting the Input Features:
    • Independent_Features: The list of predictor columns, taken here to be every column of TrainSet except the target 'Age'.
  4. Training the Model:
    • rf_age.fit(TrainSet[Independent_Features], TrainSet['Age']): This line trains the random forest regressor (rf_age) using the predictor columns (Independent_Features) as input and the 'Age' column from TrainSet as the target variable, so that the model learns the relationship between the features and the target.

# Now the crucial part: train the Random Forest regressor that will predict the values of the 'Age' column
from sklearn.ensemble import RandomForestRegressor
rf_age = RandomForestRegressor()
# Predictor columns: assumed here to be every column of TrainSet except the target 'Age'
Independent_Features = [col for col in TrainSet.columns if col != 'Age']
# Training
rf_age.fit(TrainSet[Independent_Features], TrainSet['Age'])
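
Before trusting the imputed ages, it can help to check how well such a model predicts ages that are actually known. The sketch below is an illustration under stated assumptions: it uses synthetic data in place of TrainSet, and the split size, n_estimators and random_state values are arbitrary choices.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the rows with known 'Age' (not the Titanic data).
rng = np.random.default_rng(0)
n_rows = 300
known = pd.DataFrame({
    "Pclass": rng.integers(1, 4, n_rows),
    "SibSp": rng.integers(0, 4, n_rows),
    "Fare": rng.uniform(5, 100, n_rows),
})
known["Age"] = 50 - 8 * known["Pclass"] - 3 * known["SibSp"] + rng.normal(0, 2, n_rows)

features = ["Pclass", "SibSp", "Fare"]
X_fit, X_val, y_fit, y_val = train_test_split(
    known[features], known["Age"], test_size=0.25, random_state=0
)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_fit, y_fit)

# Compare against a naive baseline that always predicts the mean age.
mae_model = mean_absolute_error(y_val, model.predict(X_val))
mae_baseline = mean_absolute_error(y_val, np.full(len(y_val), y_fit.mean()))
print(f"model MAE: {mae_model:.2f}, baseline MAE: {mae_baseline:.2f}")
```

If the model does not clearly beat the mean baseline on held-out rows, a simpler fill such as the median may be just as good for that column.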


Step 5: Prediction

Predicted_Ages = rf_age.predict(TestSet[Independent_Features]): This line uses the trained random forest regressor (rf_age) to predict the 'Age' values in the test set (TestSet). The predict method takes the independent features (Independent_Features) from the test set as input and returns an array of predicted 'Age' values.

# Predicting missing 'Age' values in the test set
Predicted_Ages = rf_age.predict(TestSet[Independent_Features])
Predicted_Ages

Output:

array([42.85055556, 35.97916667, 14.9       , 33.98904762, 18.7       ,
27.4787528 , 36.16666667, 19.15 , 22.46666667, 33.444 ,
31.494228 , 41.00333333, 19.15 , 24.48333333, 33.6 ,
41.1 , 11.009 , 27.4787528 , 31.494228 , 19.15 ,
31.494228 , 31.494228 , 27.4787528 , 26.44335664, 18.9 ,
31.494228 , 50.64722222, 16.56666667, 29.35 , 29.97451441,
25.18416667, 10.69333333, 35. , 58.9 , 4.23 ,
...
50.64722222, 13.25 , 49.1 , 38.81666667, 25. ,
34.2 , 34.645 , 26.60555556, 31.494228 , 38.55 ,
10.69333333, 27.325 , 26.60555556, 13.25 , 24.63087302,
27.4787528 , 26.3 ])

  1. Casting Predicted Ages to Integers:
    • TestSet['Age'] = Predicted_Ages.astype(int): This line casts the predicted 'Age' values (Predicted_Ages) to integers using the astype(int) method and assigns them to the 'Age' column in the test set (TestSet). The cast truncates the fractional part, so the imputed ages are whole years, like most of the known ages in the dataset.
  2. Concatenating Training and Test Datasets:
    • Titanic_set = pd.concat([TrainSet, TestSet]): This line concatenates the training set (TrainSet) and the modified test set (TestSet with missing 'Age' values replaced by predicted values) to create a final dataset (Titanic_set) with no missing 'Age' values. pd.concat stacks the two datasets along the rows.

# Most known ages in the dataset are whole numbers, so the predictions are cast to int (which truncates the fractional part)
# and used to replace the missing age values.
TestSet['Age'] = Predicted_Ages.astype(int)
# Concatenate the training and test sets to create a final dataset with no missing 'Age' values.
Titanic_set = pd.concat([TrainSet, TestSet])
# Final dataset with no null values in 'Age'.
Titanic_set.head()


Output:

   Survived  Pclass  Age   SibSp  Parch  Fare     male   Q      S
0  0         3       22.0  1      0      7.2500   True   False  True
1  1         1       38.0  1      0      71.2833  False  False  False
2  1         3       26.0  0      0      7.9250   False  False  True
3  1         1       35.0  1      0      53.1000  False  False  True
4  0         3       35.0  0      0      8.0500   True   False  True

Titanic_set.shape

Output:

(891, 9)
# Final check for missing values
Titanic_set.isnull().sum()

Output:

Survived    0
Pclass      0
Age         0
SibSp       0
Parch       0
Fare        0
male        0
Q           0
S           0
dtype: int64

Each number in the output is the count of missing values for the corresponding column. All counts are 0, so the Titanic_set DataFrame has no missing values in any column after the missing 'Age' values were replaced with predicted values and the categorical variables were one-hot encoded.
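
The whole workflow can be condensed into one helper. The function below, impute_with_random_forest, is a hypothetical name introduced here for illustration. It assumes the feature columns are already numeric and complete, and it is demonstrated on synthetic data rather than Data.csv.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


def impute_with_random_forest(df, target, features, random_state=0):
    """Return a copy of df with missing `target` values predicted from `features`.

    Hypothetical helper: assumes `features` are numeric with no missing values.
    """
    result = df.copy()
    missing = result[target].isnull()
    if not missing.any():
        return result
    # Train on rows where the target is known, then predict the unknown rows.
    model = RandomForestRegressor(random_state=random_state)
    model.fit(result.loc[~missing, features], result.loc[~missing, target])
    result.loc[missing, target] = model.predict(result.loc[missing, features])
    return result


# Synthetic demonstration data (not the Titanic dataset).
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "Pclass": rng.integers(1, 4, 200),
    "Fare": rng.uniform(5, 100, 200),
})
demo["Age"] = 45.0 - 6 * demo["Pclass"] + rng.normal(0, 3, 200)
demo.loc[demo.index[:40], "Age"] = np.nan  # hide 40 ages

filled = impute_with_random_forest(demo, target="Age", features=["Pclass", "Fare"])
print(filled["Age"].isnull().sum())
```

Unlike the walkthrough above, this sketch leaves the predictions as floats and keeps the original row order, because it writes the predicted values back into a copy of the input frame instead of concatenating two subsets.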
