
Passing categorical data to Sklearn Decision Tree

Last Updated : 04 Mar, 2024

Theoretically, decision trees can handle numerical as well as categorical data, but scikit-learn's implementation accepts only numerical input, so we need to prepare categorical features before training. There are two common methods for handling categorical data: one-hot encoding and label encoding. In this article, we look at how each method converts categorical data and how the two differ.

Role of Categorical Data on Decision Tree Performance

The role of categorical data in decision tree performance is significant and has implications for how the tree structures are formed and how well the model generalizes to new data. Decision trees, being a non-linear model, can handle both numerical and categorical features. The treatment of categorical data becomes crucial during the tree-building process.

When using decision trees with categorical data, the algorithm needs to determine the best splits at each node. For numerical features, this involves finding thresholds that optimize a certain criterion (e.g., Gini impurity or information gain). However, handling categorical variables requires different strategies.

  • Label Encoding: If categorical data is label encoded, the decision tree can naturally interpret the encoded values as ordinal, assuming there is an inherent order among the categories.
  • One-Hot Encoding: Allows the decision tree to make binary decisions based on the presence or absence of a specific category, avoiding assumptions of ordinal relationships.
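The contrast above can be seen directly in how a tree splits on label-encoded data. The sketch below uses hypothetical toy data (city names and labels invented for illustration, assuming scikit-learn is installed): it fits a tree on a single label-encoded column and prints its structure. Every split is a threshold test on the integer codes, which implicitly treats the nominal categories as ordered.

```python
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical nominal feature with binary labels
cities = ['london', 'paris', 'tokyo', 'paris', 'london', 'tokyo']
y = [0, 1, 0, 1, 0, 0]

le = LabelEncoder()
X = le.fit_transform(cities).reshape(-1, 1)  # integer codes 0..2, one column

tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# The printed rules are threshold comparisons such as "city_code <= 0.50",
# so the tree is effectively ordering the categories by their codes.
rules = export_text(tree, feature_names=['city_code'])
print(rules)
```

With one-hot encoding instead, each split would test a single 0/1 column, so no such ordering would be imposed.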

Handling Categorical Data using Label Encoding

Categorical data, such as colors or categories, is converted into numerical values. Each category is assigned a unique code, enabling the computer to understand and process the information. For example, instead of using red or blue, we represent them with numbers, such as 1 for red and 2 for blue.

Label encoding involves converting categorical data into numerical format by assigning a distinct integer label to each category or class. In this encoding scheme, each unique category is mapped to an integer, making it easier for machine learning models to process and analyze the data.

Example: Code Implementation

Python3
from sklearn.preprocessing import LabelEncoder
colors = ['red', 'blue', 'green', 'yellow', 'blue', 'green'] # sample
label_encoder = LabelEncoder()
 
encoded_colors = label_encoder.fit_transform(colors) # Fit and transform the data
print("Original Colors:", colors)
print("Encoded Colors:", encoded_colors)


Output:

Original Colors: ['red', 'blue', 'green', 'yellow', 'blue', 'green']
Encoded Colors: [2 0 1 3 0 1]
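The codes in the output are not arbitrary: LabelEncoder sorts the unique categories alphabetically and uses each category's index as its code, which is why blue becomes 0 and red becomes 2. A small sketch (same color list as above) showing how to inspect the mapping and reverse it:

```python
from sklearn.preprocessing import LabelEncoder

colors = ['red', 'blue', 'green', 'yellow', 'blue', 'green']
label_encoder = LabelEncoder()
encoded = label_encoder.fit_transform(colors)

# classes_ holds the sorted unique categories; the index is the integer code
print(label_encoder.classes_)                      # ['blue' 'green' 'red' 'yellow']

# inverse_transform maps integer codes back to the original categories
print(label_encoder.inverse_transform([2, 0, 3]))  # ['red' 'blue' 'yellow']
```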

Handling Categorical Data using One-Hot Encoding

One-hot encoding is a method for representing categorical variables as binary vectors. In this encoding scheme, each category is transformed into a binary vector in which all elements are zero except the one corresponding to the category's index.

Imagine you have a list of fruits: apples, bananas, and oranges. Now, One-Hot Encoding is like making a checklist for each fruit. If an apple is on the list, you put a check in the “apple” column and leave the others blank. If it’s a banana, you check the “banana” column, and so on. So, instead of using numbers, we create separate columns for each fruit. If the fruit is there, the column gets a check (1); if not, it stays blank (0). This way, the computer knows exactly which fruits are present without getting confused about which one is “bigger” or “smaller.” Each fruit gets its own space on the checklist.

In the code snippet, the categorical data is placed in a DataFrame (a 2D structure) because OneHotEncoder in scikit-learn expects a 2D array or sparse matrix as input.

After fitting, the original categorical data is transformed into an array of one-hot encoded values (dense here, since sparse output is disabled).

Python3
from sklearn.preprocessing import OneHotEncoder
import pandas as pd
colors = ['red', 'blue', 'green', 'yellow', 'blue', 'green']
 
# Reshape the data to a 2D array (required by OneHotEncoder)
colors_reshaped = pd.DataFrame(colors, columns=['Color'])
onehot_encoder = OneHotEncoder(sparse_output=False, drop='first')  # drop the first category to avoid multicollinearity
onehot_encoded = onehot_encoder.fit_transform(colors_reshaped) # Fit and transform the data
 
onehot_encoded_df = pd.DataFrame(onehot_encoded, columns=onehot_encoder.get_feature_names_out(['Color']))
print("Original Colors:")
print(colors_reshaped)
print("\nOne-Hot Encoded Colors:")
print(onehot_encoded_df)


Output:

Original Colors:
    Color
0     red
1    blue
2   green
3  yellow
4    blue
5   green

One-Hot Encoded Colors:
   Color_green  Color_red  Color_yellow
0          0.0        1.0           0.0
1          0.0        0.0           0.0
2          1.0        0.0           0.0
3          0.0        0.0           1.0
4          0.0        0.0           0.0
5          1.0        0.0           0.0

Label Encoding vs. One-Hot Encoding for Decision Trees

Label encoding and one-hot encoding are two common techniques used to handle categorical data, and each has its considerations when applied to decision trees.

Label encoding involves assigning a unique integer to each category. This encoding can be suitable when there is an inherent ordinal relationship among the categories. For decision trees, label encoding may work well if the tree can naturally interpret the encoded values as representing an order or ranking. However, caution should be exercised when using label encoding with decision trees, as these algorithms might incorrectly assume ordinal relationships that don’t actually exist in the data.

On the other hand, one-hot encoding creates binary columns for each category, representing the presence or absence of a category. This approach is valuable when there is no inherent order among the categories. Decision trees can effectively handle one-hot encoded data because they make binary decisions at each node, considering the presence or absence of a particular feature. One-hot encoding prevents the model from assuming any ordinal relationship between the categories, making it a safer choice when the categorical variables are nominal.

Therefore, the choice between label encoding and one-hot encoding for decision trees depends on the nature of the categorical data: label encoding suits ordinal categories, while one-hot encoding is the safer choice for nominal ones.
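In practice, the encoder and the tree are usually wired together so that raw columns go in one end and predictions come out the other. A minimal sketch using ColumnTransformer and Pipeline on hypothetical data (the column names and labels are invented for illustration):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical dataset with one numeric and one nominal column
df = pd.DataFrame({
    'age':   [22, 35, 47, 29, 52, 41],
    'color': ['red', 'blue', 'green', 'red', 'blue', 'green'],
})
y = [0, 1, 1, 0, 1, 1]

# One-hot encode only the nominal column; numeric columns pass through
pre = ColumnTransformer(
    [('onehot', OneHotEncoder(handle_unknown='ignore'), ['color'])],
    remainder='passthrough',
)

model = Pipeline([('prep', pre), ('tree', DecisionTreeClassifier(random_state=0))])
model.fit(df, y)
print(model.predict(df))
```

`handle_unknown='ignore'` makes the encoder map unseen categories at prediction time to all-zero columns instead of raising an error, which keeps the pipeline robust to new data.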

Conclusion

In conclusion, both label encoding and one-hot encoding can be used to handle categorical data for a Decision Tree Classifier in Python: label encoding when the categories carry a natural order, and one-hot encoding when they do not.
