
How to fit categorical data types for random forest classification?

Categorical variables are an essential component of many datasets, representing qualitative characteristics rather than numerical values. While random forest classification is a powerful machine-learning technique, it typically requires numerical input data. Therefore, encoding categorical variables into a suitable format is a crucial step in preparing data for random forest classification. In this article, we'll explore different encoding methods and their applications in fitting categorical data types for random forest classification.

Types of Encoding for Random Forest Classification

Handling categorical data in machine learning involves converting discrete category values into numerical representations that models such as random forests can consume. Common techniques include Ordinal (Label) Encoding, One-Hot Encoding, and Target Encoding, each with its own advantages and trade-offs depending on the nature of the categorical variable and the requirements of the model. The choice of encoding affects model performance and should be made based on the data characteristics and the modeling goal; the toy sketch below illustrates what each encoder produces.
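To make the differences concrete, here is a minimal, self-contained sketch (using pandas and the category_encoders package on an invented toy column, not the car evaluation data) that shows what each encoder produces for the same feature:

import pandas as pd
import category_encoders as ce

# Toy data (hypothetical): one categorical feature and a binary target
toy = pd.DataFrame({'colour': ['red', 'blue', 'red', 'green', 'blue'],
                    'target': [1, 0, 1, 0, 0]})

# Ordinal/label encoding: each category becomes an integer code
print(ce.OrdinalEncoder(cols=['colour']).fit_transform(toy[['colour']]))

# One-hot encoding: one 0/1 indicator column per category
print(ce.OneHotEncoder(cols=['colour']).fit_transform(toy[['colour']]))

# Target encoding: each category becomes a (smoothed) mean of the target for that category
print(ce.TargetEncoder(cols=['colour']).fit_transform(toy[['colour']], toy['target']))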

Implementation of fitting categorical data types for random forest classification

Loading the dataset

import pandas as pd

# Load the car evaluation dataset (the CSV has no header row)
data = 'car-evaluation-data-set/car_evaluation.csv'
df = pd.read_csv(data, header=None)
df.head()

Output:

    0    1    2    3    4    5    6
0    vhigh    vhigh    2    2    small    low    unacc
1    vhigh    vhigh    2    2    small    med    unacc
2    vhigh    vhigh    2    2    small    high    unacc
3    vhigh    vhigh    2    2    med    low    unacc
4    vhigh    vhigh    2    2    med    med    unacc

Renaming Columns

col_names = ['Cost', 'Maintenance', 'Doors', 'Persons', 'Luggage boot', 'Safety', 'Class']
df.columns = col_names
df.head()

Output:


Cost    Maintenance    Doors    Persons    Luggage boot    Safety    Class
0    vhigh    vhigh    2    2    small    low    unacc
1    vhigh    vhigh    2    2    small    med    unacc
2    vhigh    vhigh    2    2    small    high    unacc
3    vhigh    vhigh    2    2    med    low    unacc
4    vhigh    vhigh    2    2    med    med    unacc
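Before encoding, it is worth confirming that every feature column really is categorical (pandas object dtype) and looking at how the target classes are distributed; a quick check on the DataFrame loaded above:

# All feature columns should be of object (string) dtype
print(df.dtypes)

# Inspect how the target classes are distributed
print(df['Class'].value_counts())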


Declaring Feature and Target vector

X = df.drop(['Class'], axis=1)
y = df['Class']


Splitting data into Train and Test set

from sklearn.model_selection import train_test_split

# Hold out 20% of the rows as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
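Since the class distribution of this dataset is skewed, a stratified split keeps the class proportions similar in the train and test sets. This is a sketch of an alternative call (the rest of the article uses the unstratified split above):

# Same split, but stratified on the target so every class keeps its proportion
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)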


Ordinal Encoding

# import category encoders
import category_encoders as ce

# Encode every feature column (all columns except 'Class') as integer codes
ordinal_encoder = ce.OrdinalEncoder(cols=col_names[:-1])
X_train_oe = X_train.copy()
X_test_oe = X_test.copy()

# Fit on the training data only, then apply the same mapping to the test data
X_train_oe = ordinal_encoder.fit_transform(X_train_oe)
X_test_oe = ordinal_encoder.transform(X_test_oe)
X_train_oe.head()


Output:

    Cost    Maintenance    Doors    Persons    Luggage boot    Safety
107    1    1    1    1    1    1
901    2    1    2    2    2    2
1709    3    2    1    3    1    1
706    4    3    3    3    3    2
678    4    3    2    3    3    3
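By default, OrdinalEncoder assigns integer codes in the order in which categories are first encountered, which does not necessarily respect their natural order (for example low < med < high). If the codes should follow that order, category_encoders accepts an explicit mapping; the sketch below is an illustration for the Safety column only, with an assumed ordering rather than anything learned from the data:

# Explicitly map the ordered Safety categories to increasing integers
safety_encoder = ce.OrdinalEncoder(
    cols=['Safety'],
    mapping=[{'col': 'Safety', 'mapping': {'low': 1, 'med': 2, 'high': 3}}])
X_train_safety = safety_encoder.fit_transform(X_train.copy())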

Random Forest Classification on Ordinal encoded data

# import Random Forest classifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
# Initialize and fit RandomForestClassifier
rf_classifier = RandomForestClassifier(random_state=42)
rf_classifier.fit(X_train_oe, y_train)
y_pred_oe = rf_classifier.predict(X_test_oe)

# Calculate accuracy
accuracy_oe = accuracy_score(y_test, y_pred_oe)
print("Ordinal Encoder Accuracy: ", accuracy_oe)

Output:

Ordinal Encoder Accuracy:  0.9566473988439307
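Once the forest is fitted, its feature_importances_ attribute gives a quick impression of which encoded features drive the predictions; a short sketch using the classifier trained above:

import pandas as pd

# Rank the ordinal-encoded features by their importance in the fitted forest
importances = pd.Series(rf_classifier.feature_importances_,
                        index=X_train_oe.columns).sort_values(ascending=False)
print(importances)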

One-Hot Encoding

# One 0/1 indicator column per category of each feature
one_hot = ce.OneHotEncoder(cols=col_names[:-1])
X_train_oh = X_train.copy()
X_test_oh = X_test.copy()

# Fit on the training data only, then apply the same columns to the test data
X_train_oh = one_hot.fit_transform(X_train_oh)
X_test_oh = one_hot.transform(X_test_oh)
X_train_oh.head()

Output:

    Cost_1    Cost_2    Cost_3    Cost_4    Maintenance_1    Maintenance_2    Maintenance_3    Maintenance_4    Doors_1    Doors_2    ...    Doors_4    Persons_1    Persons_2    Persons_3    Luggage boot_1    Luggage boot_2    Luggage boot_3    Safety_1    Safety_2    Safety_3
107    1    0    0    0    1    0    0    0    1    0    ...    0    1    0    0    1    0    0    1    0    0
901    0    1    0    0    1    0    0    0    0    1    ...    0    0    1    0    0    1    0    0    1    0
1709    0    0    1    0    0    1    0    0    1    0    ...    0    0    0    1    1    0    0    1    0    0
706    0    0    0    1    0    0    1    0    0    0    ...    0    0    0    1    0    0    1    0    1    0
678    0    0    0    1    0    0    1    0    0    1    ...    0    0    0    1    0    0    1    0    0    1
5 rows × 21 columns
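If you prefer to stay within pandas, the same one-hot expansion can be done with get_dummies; the sketch below is an alternative to the category_encoders call above and reindexes the test frame so that it has exactly the training columns (important when a category is missing from the test split):

import pandas as pd

# Alternative one-hot encoding with pandas
X_train_dum = pd.get_dummies(X_train)
X_test_dum = pd.get_dummies(X_test)

# Align the test columns with the training columns, filling missing ones with 0
X_test_dum = X_test_dum.reindex(columns=X_train_dum.columns, fill_value=0)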


Random Forest Classification on One-hot encoded data

rf_classifier.fit(X_train_oh, y_train)
y_pred_oh = rf_classifier.predict(X_test_oh)

# Calculate accuracy
accuracy_oh = accuracy_score(y_test, y_pred_oh)
print("One-Hot Encoder Accuracy: ", accuracy_oh)

Output:

One-Hot Encoder Accuracy:  0.9595375722543352

Target Encoding

# Target encoding needs a numeric target, so first encode the Class labels as integers
target_encoder = ce.TargetEncoder(cols=col_names[:-1])
oe = ce.OrdinalEncoder(cols=["Class"])
y_train_oe = oe.fit_transform(y_train)

X_train_te = X_train.copy()
X_test_te = X_test.copy()

# Fit on the training data and its target only; the test target must not be used here
X_train_te = target_encoder.fit_transform(X_train_te, y_train_oe)
X_test_te = target_encoder.transform(X_test_te)
X_train_te.head()


Output:


Cost    Maintenance    Doors    Persons    Luggage boot    Safety
107    1.168639    1.159292    1.466667    1.596950    1.522777    1.738462
901    1.521127    1.159292    1.397661    1.627907    1.295896    1.513100
1709    1.684814    1.688623    1.466667    1.000000    1.522777    1.738462
706    1.264706    1.517045    1.450867    1.000000    1.421397    1.513100
678    1.264706    1.517045    1.397661    1.000000    1.421397    1.000000
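TargetEncoder also exposes a smoothing parameter that blends each category's target mean with the global mean; larger values shrink rare categories toward the overall average, which can help against overfitting. A sketch of the same encoding with heavier smoothing (the value 10 is an arbitrary illustration, not a tuned setting):

# Heavier smoothing: estimates for rare categories are pulled toward the global target mean
smooth_encoder = ce.TargetEncoder(cols=col_names[:-1], smoothing=10)
X_train_sm = smooth_encoder.fit_transform(X_train.copy(), y_train_oe)
X_test_sm = smooth_encoder.transform(X_test.copy())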

Random Forest Classification on Target encoded data

rf_classifier.fit(X_train_te, y_train)
y_pred_te = rf_classifier.predict(X_test_te)

# Calculate accuracy
accuracy_te = accuracy_score(y_test, y_pred_te)
print("Target Encoder Accuracy: ", accuracy_te)

Output:

Target Encoder Accuracy:  0.9739884393063584
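With all three models evaluated, the accuracies computed above can be collected for a side-by-side comparison:

# Summarise the accuracy obtained with each encoding strategy
results = {'Ordinal': accuracy_oe, 'One-Hot': accuracy_oh, 'Target': accuracy_te}
for name, acc in results.items():
    print(f"{name} encoding accuracy: {acc:.4f}")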

In conclusion, the choice of encoding technique for categorical variables significantly influences random forest performance. Ordinal Encoding gives a compact single-column representation and can preserve a genuine ordering when an explicit mapping is supplied, One-Hot Encoding handles unordered categories without implying an order, and Target Encoding injects predictive information from the target (and must therefore be fitted on the training data only to avoid leakage). On this dataset, target encoding gave the highest test accuracy, followed by one-hot and then ordinal encoding. Understanding these techniques lets data scientists preprocess categorical data effectively, improving both model accuracy and interpretability.

