
How to fit categorical data types for random forest classification?

Categorical variables are an essential component of many datasets, representing qualitative characteristics rather than numerical values. While random forest classification is a powerful machine-learning technique, it typically requires numerical input data. Therefore, encoding categorical variables into a suitable format is a crucial step in preparing data for random forest classification. In this article, we'll explore different encoding methods and their applications in fitting categorical data types for random forest classification.

Types of Encoding for Random Forest Classification

Handling categorical data in machine learning involves converting discrete category values into numerical representations that models such as random forests can consume. Common techniques include Ordinal (Label) Encoding, One-Hot Encoding, and Target Encoding, each with its own advantages and trade-offs depending on the nature of the categorical variable and the requirements of the model. The choice of encoding affects model performance and should be made based on the data characteristics and the modeling goal; the toy sketch below illustrates what each encoder produces.
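To make the differences concrete, here is a minimal, self-contained sketch (using pandas and the category_encoders package on an invented toy column, not the car evaluation data) that shows what each encoder produces for the same feature:

import pandas as pd
import category_encoders as ce

# Toy data (hypothetical): one categorical feature and a binary target
toy = pd.DataFrame({'colour': ['red', 'blue', 'red', 'green', 'blue'],
                    'target': [1, 0, 1, 0, 0]})

# Ordinal/label encoding: each category becomes an integer code
print(ce.OrdinalEncoder(cols=['colour']).fit_transform(toy[['colour']]))

# One-hot encoding: one 0/1 indicator column per category
print(ce.OneHotEncoder(cols=['colour']).fit_transform(toy[['colour']]))

# Target encoding: each category becomes a (smoothed) mean of the target for that category
print(ce.TargetEncoder(cols=['colour']).fit_transform(toy[['colour']], toy['target']))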

Implementation of fitting categorical data types for random forest classification

Loading the dataset

import pandas as pd

# Load the car evaluation dataset (the CSV has no header row)
data = 'car-evaluation-data-set/car_evaluation.csv'
df = pd.read_csv(data, header=None)
df.head()

Output:

    0    1    2    3    4    5    6
0    vhigh    vhigh    2    2    small    low    unacc
1    vhigh    vhigh    2    2    small    med    unacc
2    vhigh    vhigh    2    2    small    high    unacc
3    vhigh    vhigh    2    2    med    low    unacc
4    vhigh    vhigh    2    2    med    med    unacc

Renaming Columns

col_names = ['Cost', 'Maintenance', 'Doors', 'Persons', 'Luggage boot', 'Safety', 'Class']
df.columns = col_names
df.head()

Output:


Cost    Maintenance    Doors    Persons    Luggage boot    Safety    Class
0    vhigh    vhigh    2    2    small    low    unacc
1    vhigh    vhigh    2    2    small    med    unacc
2    vhigh    vhigh    2    2    small    high    unacc
3    vhigh    vhigh    2    2    med    low    unacc
4    vhigh    vhigh    2    2    med    med    unacc
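Before encoding, it is worth confirming that every feature column really is categorical (pandas object dtype) and looking at how the target classes are distributed; a quick check on the DataFrame loaded above:

# All feature columns should be of object (string) dtype
print(df.dtypes)

# Inspect how the target classes are distributed
print(df['Class'].value_counts())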


Declaring Feature and Target vector

X = df.drop(['Class'], axis=1)
y = df['Class']


Splitting data into Train and Test set

from sklearn.model_selection import train_test_split

# Hold out 20% of the rows as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
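Since the class distribution of this dataset is skewed, a stratified split keeps the class proportions similar in the train and test sets. This is a sketch of an alternative call (the rest of the article uses the unstratified split above):

# Same split, but stratified on the target so every class keeps its proportion
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)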


Ordinal Encoding

# import category encoders
import category_encoders as ce

# Encode every feature column (all columns except 'Class') as integer codes
ordinal_encoder = ce.OrdinalEncoder(cols=col_names[:-1])
X_train_oe = X_train.copy()
X_test_oe = X_test.copy()

# Fit on the training data only, then apply the same mapping to the test data
X_train_oe = ordinal_encoder.fit_transform(X_train_oe)
X_test_oe = ordinal_encoder.transform(X_test_oe)
X_train_oe.head()


Output:

    Cost    Maintenance    Doors    Persons    Luggage boot    Safety
107    1    1    1    1    1    1
901    2    1    2    2    2    2
1709    3    2    1    3    1    1
706    4    3    3    3    3    2
678    4    3    2    3    3    3
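By default, OrdinalEncoder assigns integer codes in the order in which categories are first encountered, which does not necessarily respect their natural order (for example low < med < high). If the codes should follow that order, category_encoders accepts an explicit mapping; the sketch below is an illustration for the Safety column only, with an assumed ordering rather than anything learned from the data:

# Explicitly map the ordered Safety categories to increasing integers
safety_encoder = ce.OrdinalEncoder(
    cols=['Safety'],
    mapping=[{'col': 'Safety', 'mapping': {'low': 1, 'med': 2, 'high': 3}}])
X_train_safety = safety_encoder.fit_transform(X_train.copy())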

Random Forest Classification on Ordinal encoded data

# import Random Forest classifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
# Initialize and fit RandomForestClassifier
rf_classifier = RandomForestClassifier(random_state=42)
rf_classifier.fit(X_train_oe, y_train)
y_pred_oe = rf_classifier.predict(X_test_oe)

# Calculate accuracy
accuracy_oe = accuracy_score(y_test, y_pred_oe)
print("Ordinal Encoder Accuracy: ", accuracy_oe)

Output:

Ordinal Encoder Accuracy:  0.9566473988439307
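Once the forest is fitted, its feature_importances_ attribute gives a quick impression of which encoded features drive the predictions; a short sketch using the classifier trained above:

import pandas as pd

# Rank the ordinal-encoded features by their importance in the fitted forest
importances = pd.Series(rf_classifier.feature_importances_,
                        index=X_train_oe.columns).sort_values(ascending=False)
print(importances)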

One-Hot Encoding

# One 0/1 indicator column per category of each feature
one_hot = ce.OneHotEncoder(cols=col_names[:-1])
X_train_oh = X_train.copy()
X_test_oh = X_test.copy()

# Fit on the training data only, then apply the same columns to the test data
X_train_oh = one_hot.fit_transform(X_train_oh)
X_test_oh = one_hot.transform(X_test_oh)
X_train_oh.head()

Output:

    Cost_1    Cost_2    Cost_3    Cost_4    Maintenance_1    Maintenance_2    Maintenance_3    Maintenance_4    Doors_1    Doors_2    ...    Doors_4    Persons_1    Persons_2    Persons_3    Luggage boot_1    Luggage boot_2    Luggage boot_3    Safety_1    Safety_2    Safety_3
107    1    0    0    0    1    0    0    0    1    0    ...    0    1    0    0    1    0    0    1    0    0
901    0    1    0    0    1    0    0    0    0    1    ...    0    0    1    0    0    1    0    0    1    0
1709    0    0    1    0    0    1    0    0    1    0    ...    0    0    0    1    1    0    0    1    0    0
706    0    0    0    1    0    0    1    0    0    0    ...    0    0    0    1    0    0    1    0    1    0
678    0    0    0    1    0    0    1    0    0    1    ...    0    0    0    1    0    0    1    0    0    1
5 rows × 21 columns
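If you prefer to stay within pandas, the same one-hot expansion can be done with get_dummies; the sketch below is an alternative to the category_encoders call above and reindexes the test frame so that it has exactly the training columns (important when a category is missing from the test split):

import pandas as pd

# Alternative one-hot encoding with pandas
X_train_dum = pd.get_dummies(X_train)
X_test_dum = pd.get_dummies(X_test)

# Align the test columns with the training columns, filling missing ones with 0
X_test_dum = X_test_dum.reindex(columns=X_train_dum.columns, fill_value=0)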


Random Forest Classification on One-hot encoded data

rf_classifier.fit(X_train_oh, y_train)
y_pred_oh = rf_classifier.predict(X_test_oh)

# Calculate accuracy
accuracy_oh = accuracy_score(y_test, y_pred_oh)
print("One-Hot Encoder Accuracy: ", accuracy_oh)

Output:

One-Hot Encoder Accuracy:  0.9595375722543352

Target Encoding

# Target encoding needs a numeric target, so first encode the Class labels as integers
target_encoder = ce.TargetEncoder(cols=col_names[:-1])
oe = ce.OrdinalEncoder(cols=["Class"])
y_train_oe = oe.fit_transform(y_train)

X_train_te = X_train.copy()
X_test_te = X_test.copy()

# Fit on the training data and its target only; the test target must not be used here
X_train_te = target_encoder.fit_transform(X_train_te, y_train_oe)
X_test_te = target_encoder.transform(X_test_te)
X_train_te.head()


Output:


Cost    Maintenance    Doors    Persons    Luggage boot    Safety
107    1.168639    1.159292    1.466667    1.596950    1.522777    1.738462
901    1.521127    1.159292    1.397661    1.627907    1.295896    1.513100
1709    1.684814    1.688623    1.466667    1.000000    1.522777    1.738462
706    1.264706    1.517045    1.450867    1.000000    1.421397    1.513100
678    1.264706    1.517045    1.397661    1.000000    1.421397    1.000000
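TargetEncoder also exposes a smoothing parameter that blends each category's target mean with the global mean; larger values shrink rare categories toward the overall average, which can help against overfitting. A sketch of the same encoding with heavier smoothing (the value 10 is an arbitrary illustration, not a tuned setting):

# Heavier smoothing: estimates for rare categories are pulled toward the global target mean
smooth_encoder = ce.TargetEncoder(cols=col_names[:-1], smoothing=10)
X_train_sm = smooth_encoder.fit_transform(X_train.copy(), y_train_oe)
X_test_sm = smooth_encoder.transform(X_test.copy())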

Random Forest Classification on Target encoded data

rf_classifier.fit(X_train_te, y_train)
y_pred_te = rf_classifier.predict(X_test_te)

# Calculate accuracy
accuracy_te = accuracy_score(y_test, y_pred_te)
print("Target Encoder Accuracy: ", accuracy_te)

Output:

Target Encoder Accuracy:  0.9739884393063584
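With all three models evaluated, the accuracies computed above can be collected for a side-by-side comparison:

# Summarise the accuracy obtained with each encoding strategy
results = {'Ordinal': accuracy_oe, 'One-Hot': accuracy_oh, 'Target': accuracy_te}
for name, acc in results.items():
    print(f"{name} encoding accuracy: {acc:.4f}")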

In conclusion, the choice of encoding technique for categorical variables significantly influences random forest performance. Ordinal Encoding gives a compact single-column representation and can preserve a genuine ordering when an explicit mapping is supplied, One-Hot Encoding handles unordered categories without implying an order, and Target Encoding injects predictive information from the target (and must therefore be fitted on the training data only to avoid leakage). On this dataset, target encoding gave the highest test accuracy, followed by one-hot and then ordinal encoding. Understanding these techniques lets data scientists preprocess categorical data effectively, improving both model accuracy and interpretability.

