Before learning about the dummy variable trap, let’s first understand what a dummy variable actually is.
Dummy Variable in Regression Models:
In statistics, especially in regression models, we deal with various kinds of data. The data may be quantitative (numerical) or qualitative (categorical). Numerical data can be used in regression models directly, but categorical data cannot: it must first be transformed into numbers.
To transform a categorical attribute into a numerical one, we can use label encoding, which assigns a unique integer to each category. Label encoding alone is not well suited to regression, because the integers imply an order and spacing between categories that usually does not exist. Hence, one hot encoding is applied after label encoding. It creates one new attribute per class of the categorical attribute, i.e. if the categorical attribute has n categories, n new attributes are created. These new attributes are called dummy variables. Hence, dummy variables are “proxy” variables for categorical data in regression models.
Each dummy variable created by one hot encoding takes the value 0 or 1, where 1 indicates that the row belongs to that category and 0 indicates that it does not.
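As a minimal sketch of the two steps described above, the following plain Python example label-encodes and then one hot encodes a small attribute. The `colors` data and all variable names are hypothetical and chosen only for illustration.

```python
# Hypothetical sample data: one categorical attribute with three categories.
colors = ["red", "green", "blue", "green", "red"]

# Label encoding: assign a unique integer to each category.
# Sorting makes the assignment deterministic: blue=0, green=1, red=2.
categories = sorted(set(colors))
label_of = {cat: i for i, cat in enumerate(categories)}
labels = [label_of[c] for c in colors]

# One hot encoding: one 0/1 dummy variable per category (n categories -> n dummies).
dummies = {
    cat: [1 if c == cat else 0 for c in colors]
    for cat in categories
}

print("labels:", labels)
for cat in categories:
    print(cat, dummies[cat])
```

Each row has exactly one dummy variable set to 1, which is the property that the next section builds on.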
Dummy Variable Trap:
The dummy variable trap is a scenario in which attributes are highly correlated (multicollinear), so that one variable predicts the value of the others. When we use one hot encoding to handle categorical data, every row has exactly one dummy variable equal to 1, so any one dummy variable can be predicted exactly from the remaining ones. In a regression model with an intercept, the dummy variables therefore carry redundant information. Using all the dummy variables in a regression model leads to the dummy variable trap. So, regression models should be designed with one dummy variable excluded.
For example:
Let’s consider gender with two categories, male and female, each represented by a dummy variable that is 1 when the person belongs to that category and 0 otherwise. Including both dummy variables causes redundancy: in this encoding, if a person is not male then that person is female, so the second variable adds no information. Hence, we need only one of the two variables in the regression model. Dropping the other protects us from the dummy variable trap.
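The redundancy can be checked numerically. The sketch below assumes NumPy is available and uses hypothetical dummy columns for a three-category attribute; it compares the rank of the design matrix (intercept plus dummies) with and without one dummy variable.

```python
import numpy as np

# Hypothetical dummy variables for a three-category attribute (five rows).
blue = np.array([0, 0, 1, 0, 0])
green = np.array([0, 1, 0, 1, 0])
red = np.array([1, 0, 0, 0, 1])
intercept = np.ones(5, dtype=int)

# Every row belongs to exactly one category, so the dummies add up to
# the intercept column of ones.
dummy_sum = blue + green + red

# Design matrix with all dummies versus one dummy ("blue") excluded.
X_all = np.column_stack([intercept, blue, green, red])
X_reduced = np.column_stack([intercept, green, red])

rank_all = np.linalg.matrix_rank(X_all)
rank_reduced = np.linalg.matrix_rank(X_reduced)

print("columns:", X_all.shape[1], "rank:", rank_all)
print("columns:", X_reduced.shape[1], "rank:", rank_reduced)
```

When the rank is lower than the number of columns, the columns are linearly dependent and ordinary least squares has no unique solution for the coefficients. Excluding one dummy variable restores full column rank.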
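One way to exclude a dummy variable in practice, assuming pandas is used for preprocessing, is the `drop_first` option of `pandas.get_dummies`. The `gender` data below is hypothetical, and this is only one of several possible approaches.

```python
import pandas as pd

# Hypothetical data with a two-category attribute.
df = pd.DataFrame({"gender": ["male", "female", "female", "male"]})

# All dummy variables: one column per category (falls into the trap
# when used together with an intercept).
all_dummies = pd.get_dummies(df, columns=["gender"], dtype=int)

# drop_first=True removes the first category's column, keeping n - 1 dummies.
encoded = pd.get_dummies(df, columns=["gender"], drop_first=True, dtype=int)

print(all_dummies)
print(encoded)
```

The dropped category becomes the baseline: a row with 0 in the remaining column belongs to the dropped category.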