Concept |
Represents each category as a binary column, where only one column is ‘1’ (hot) and the rest are ‘0’. |
Represents a variable with k categories using k - 1 binary columns; the omitted (reference) category is encoded as all ‘0’. |
Number of Columns |
Number of columns equals the number of unique categories in the variable. |
Number of columns equals the number of unique categories minus one. |
Sparsity |
Generates a sparse matrix with mostly ‘0’ values, as only one column is ‘1’ for each observation. |
Still mostly ‘0’ values: it has one fewer column than one-hot encoding, and observations in the omitted category are all ‘0’. |
Collinearity |
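Causes perfect multicollinearity in models with an intercept (the dummy variable trap), since the columns always sum to 1 and any one column can be predicted exactly from the others. |
Avoids the perfect collinearity of one-hot encoding: with one category omitted, the remaining columns are linearly independent of each other and of the intercept. |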
Interpretability |
Each category has a distinct column, making interpretation straightforward. |
Each column's effect is interpreted relative to the omitted (reference) category, which takes more care when reading model output. |
Computational Complexity |
Can be computationally expensive when dealing with a large number of unique categories. |
Marginally cheaper, since it has one fewer column per encoded variable; with many unique categories the cost is close to that of one-hot encoding. |
Use Cases |
Suitable for scenarios where interpretability and the individual impact of each category are essential. |
Useful when multicollinearity must be avoided, for example in regression models that include an intercept. |
Example |
Consider a variable “Color” with categories: Red, Green, Blue. Encoded as: Red: [1, 0, 0], Green: [0, 1, 0], Blue: [0, 0, 1]. |
If ‘Green’ is left out as the reference category, the encoding for “Color” is: Red: [1, 0], Blue: [0, 1], Green: [0, 0]. |
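
The “Color” example from the table can be reproduced with a minimal Python sketch. The helper names `one_hot` and `dummy`, the `reference` parameter, and the sample observations are assumptions made for illustration; they do not come from any particular library. The sketch also computes the row sums, which shows why one-hot columns are collinear with an intercept while dummy columns are not.

```python
def one_hot(value, categories):
    """Return one binary indicator per category; exactly one entry is 1."""
    return [1 if value == c else 0 for c in categories]


def dummy(value, categories, reference):
    """Return indicators for every category except `reference`.

    An observation in the reference category is encoded as all zeros.
    """
    kept = [c for c in categories if c != reference]
    return [1 if value == c else 0 for c in kept]


# Hypothetical sample data, using the categories from the table.
categories = ["Red", "Green", "Blue"]
observations = ["Red", "Green", "Blue", "Green"]

one_hot_rows = [one_hot(v, categories) for v in observations]
dummy_rows = [dummy(v, categories, reference="Green") for v in observations]

for value, oh, dm in zip(observations, one_hot_rows, dummy_rows):
    print(f"{value:<5} one-hot={oh} dummy={dm}")

# Every one-hot row sums to 1, so the columns add up to a constant
# (intercept) column. Dummy rows sum to 0 for the reference category,
# so that exact linear dependence disappears.
one_hot_row_sums = [sum(row) for row in one_hot_rows]
dummy_row_sums = [sum(row) for row in dummy_rows]
print("one-hot row sums:", one_hot_row_sums)
print("dummy row sums:  ", dummy_row_sums)
```

In practice a library routine would usually be used instead. For instance, pandas provides `get_dummies` with a `drop_first` option, which drops the first category level rather than a category of your choosing.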