Logistic regression in R Programming is a classification algorithm used to estimate the probability of event success and event failure. It is used when the dependent variable is binary (0/1, True/False, Yes/No). The logit function is used as the link function for the binomial distribution.
Logistic regression is also known as binomial logistic regression. It is based on the sigmoid function, whose input can range from -infinity to +infinity and whose output is a probability between 0 and 1.
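The behaviour of the sigmoid can be checked with a minimal sketch. The helper name `sigmoid` and the sample inputs are illustrative assumptions; `plogis()` is the equivalent function that ships with base R.

```r
# Sigmoid (logistic) function: maps any real number into the interval (0, 1).
# `sigmoid` is an illustrative helper name, not part of base R.
sigmoid <- function(z) 1 / (1 + exp(-z))

z <- c(-10, -1, 0, 1, 10)   # arbitrary sample inputs
p <- sigmoid(z)
print(round(p, 4))

# Base R provides the same function as plogis()
print(all.equal(p, plogis(z)))
```

Large negative inputs give probabilities near 0, large positive inputs give probabilities near 1, and an input of 0 gives exactly 0.5.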
Logistic regression is a type of generalized linear model. Because it is used as a classification technique to predict a qualitative response, the predicted value ranges from 0 to 1 and can be represented by the following equation:

$$p = \frac{1}{1 + e^{-(b_0 + b_1 x_1 + b_2 x_2 + \dots + b_k x_k)}}$$
Here p is the probability of the characteristic of interest. The odds are defined as the probability of success divided by the probability of failure, p / (1 - p), and can take values between 0 and infinity. Odds are central to interpreting logistic regression coefficients: exponentiating a coefficient gives the odds ratio associated with a one-unit increase in that predictor. Odds of 1 mean the probability of success is equal to the probability of failure. Odds of 2 mean the probability of success is twice the probability of failure. Odds of 0.5 mean the probability of failure is twice the probability of success.
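The three cases above can be reproduced with a short sketch. The helper name `odds` is an assumption made for illustration.

```r
# Odds: probability of success divided by probability of failure.
# `odds` is an illustrative helper name.
odds <- function(p) p / (1 - p)

print(odds(1/2))  # success and failure equally likely
print(odds(2/3))  # success twice as likely as failure
print(odds(1/3))  # failure twice as likely as success
```

Up to floating-point rounding, these evaluate to 1, 2 and 0.5 respectively.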
Since the dependent variable follows a binomial distribution, we need a link function suited to that distribution: the logit function.

$$\text{logit}(p) = \ln\left(\frac{p}{1-p}\right) = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_k x_k \qquad \text{(A)}$$

In equation A, the coefficients are chosen to maximize the likelihood of observing the sample values rather than to minimize the sum of squared errors (as in ordinary regression). The logit is also known as the log of odds. The logit must be linearly related to the independent variables: in equation A, the right-hand side is a linear combination of the x variables. This is similar to the OLS assumption that y be linearly related to x.
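As a quick check that the logit is the log of odds and the inverse of the sigmoid, here is a small sketch. The helper name `logit` and the sample probabilities are assumptions; `qlogis()` and `plogis()` are the base R equivalents.

```r
# Logit: the natural log of the odds. `logit` is an illustrative helper name.
logit <- function(p) log(p / (1 - p))

p <- c(0.1, 0.5, 0.9)   # arbitrary sample probabilities
print(logit(p))

# qlogis() is base R's logit; plogis() (the sigmoid) undoes it
print(all.equal(logit(p), qlogis(p)))
print(all.equal(plogis(logit(p)), p))
```

A probability of 0.5 corresponds to a logit of 0, probabilities below 0.5 give negative logits, and probabilities above 0.5 give positive ones.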
The coefficients b0, b1, b2, … are unknown and must be estimated from the available training data. In a logistic regression model, increasing x1 by one unit changes the logit by b1. The resulting change in P depends on the starting value of the predictors, because the sigmoid is nonlinear. If b1 is positive, P increases; if b1 is negative, P decreases.
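The difference between the constant change in the logit and the varying change in P can be illustrated with a sketch. The coefficient values `b0 = -1` and `b1 = 0.8` and the starting points in `x` are hypothetical numbers chosen only for the illustration.

```r
# Hypothetical coefficients, chosen only for illustration
b0 <- -1
b1 <- 0.8

x <- c(-2, 0, 2, 5)                  # arbitrary starting values of the predictor
logit_before <- b0 + b1 * x
logit_after  <- b0 + b1 * (x + 1)    # same points after a one-unit increase

# The logit always moves by exactly b1 ...
delta_logit <- logit_after - logit_before
print(delta_logit)

# ... but the change in probability depends on where x started
delta_p <- plogis(logit_after) - plogis(logit_before)
print(round(delta_p, 4))
```

Because `b1` is positive here, every entry of `delta_p` is positive, but the entries differ from one starting point to the next.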
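mtcars (Motor Trend Car Road Tests) comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles. It ships with the datasets package, which is part of a standard R installation, so it is available without installing dplyr or any other package.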
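A minimal sketch for loading and inspecting the data with base R only. The focus on the `vs` column is an assumption based on the model discussed later in this article.

```r
# mtcars is part of the datasets package that ships with R
data(mtcars)

print(dim(mtcars))     # 32 rows (cars) and 11 columns (variables)
str(mtcars)            # column names and types
print(head(mtcars))    # first few rows

# vs is the binary column used as the response later in this article
print(table(mtcars$vs))
```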
Performing Logistic regression on dataset
Logistic regression is implemented in R using glm() with family = binomial, training the model on features (variables) in the dataset.
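A minimal sketch of such a fit, using `vs` as the response and `wt` and `disp` as predictors. The 80/20 split, the seed value and the variable names `train`, `test` and `model` are assumptions made for this example; the exact split behind the figures quoted in this article is not shown, so the output of this sketch will not necessarily match them.

```r
data(mtcars)

# Assumed 80/20 train/test split with an arbitrary seed
set.seed(123)
n <- nrow(mtcars)
train_idx <- sample(n, size = floor(0.8 * n))
train <- mtcars[train_idx, ]
test  <- mtcars[-train_idx, ]

# family = binomial selects logistic regression (logit link by default)
model <- glm(vs ~ wt + disp, data = train, family = binomial)

# Coefficients, deviances and AIC are reported in the summary
print(summary(model))
```

On a dataset this small, glm() may warn about fitted probabilities of 0 or 1 for some splits; the model object is still returned in that case.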
wt has a positive coefficient: a one-unit increase in wt increases the log of odds for vs = 1 by 1.44, holding disp fixed. disp has a negative coefficient: a one-unit increase in disp decreases the log of odds for vs = 1 by 0.0344, holding wt fixed. Null deviance is 31.755 (model fitted with the intercept only) and residual deviance is 14.457 (model fitted with all independent variables). The AIC (Akaike Information Criterion) value is 20.457; when comparing models fitted to the same data, a lower AIC is better. Accuracy comes out to 0.75, i.e. 75%.
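The quoted figures can be cross-checked with a little arithmetic. This sketch only reuses the numbers stated above and assumes the model has three estimated parameters (intercept, wt, disp). For a 0/1 response, AIC equals the residual deviance plus twice the number of parameters.

```r
# Coefficients as quoted in the text (log-odds scale)
coefs <- c(wt = 1.44, disp = -0.0344)

# Exponentiating converts log-odds changes into odds ratios
odds_ratios <- exp(coefs)
print(round(odds_ratios, 3))

# AIC check: residual deviance + 2 * number of estimated parameters
residual_deviance <- 14.457
k <- 3   # assumed: intercept, wt, disp
aic <- residual_deviance + 2 * k
print(aic)
```

The odds ratio for wt is above 1 (roughly 4.2), so each additional unit of wt multiplies the odds of vs = 1 by about that factor; the odds ratio for disp is slightly below 1 (about 0.97). The AIC computed this way reproduces the quoted 20.457.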
The model is evaluated using the confusion matrix, the ROC (receiver operating characteristic) curve, and the AUC (area under the curve). In the confusion matrix, accuracy alone is not enough; sensitivity and specificity should also be examined. The ROC curve is plotted and the AUC is computed from it.
- Evaluating model accuracy using confusion matrix:
There are 0 Type II errors (false negatives, i.e. failing to reject the null hypothesis when it is false). There are also 3 Type I errors (false positives, i.e. rejecting the null hypothesis when it is true).
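A confusion matrix of this kind can be built with base R's table(). The sketch below refits the model so that it is self-contained; the split, the seed and the 0.5 classification threshold are assumptions, so the counts it prints will not necessarily match the ones quoted above.

```r
data(mtcars)

# Assumed 80/20 split and seed, as a stand-in for the article's unshown split
set.seed(123)
n <- nrow(mtcars)
train_idx <- sample(n, size = floor(0.8 * n))
train <- mtcars[train_idx, ]
test  <- mtcars[-train_idx, ]
model <- glm(vs ~ wt + disp, data = train, family = binomial)

# Predicted probabilities on the held-out rows, cut at an assumed 0.5 threshold
prob   <- predict(model, newdata = test, type = "response")
pred   <- factor(ifelse(prob > 0.5, 1, 0), levels = c(0, 1))
actual <- factor(test$vs, levels = c(0, 1))

# Fixing the factor levels guarantees a 2 x 2 table even if a class is absent
cm <- table(Actual = actual, Predicted = pred)
print(cm)

accuracy        <- sum(diag(cm)) / sum(cm)
false_positives <- cm["0", "1"]   # Type I errors
false_negatives <- cm["1", "0"]   # Type II errors
sensitivity     <- cm["1", "1"] / sum(cm["1", ])  # NaN if no actual positives
specificity     <- cm["0", "0"] / sum(cm["0", ])  # NaN if no actual negatives
print(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity))
```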
- ROC curve:
In the ROC curve, the larger the area under the curve, the better the model separates the two classes.
- ROC-AUC Curve:
AUC is 0.7333. The closer the AUC is to 1, the better the model performs; an AUC of 0.5 corresponds to random guessing.
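To show how ROC points and the AUC are obtained, here is a base R sketch on a small hand-made example. The `labels` and `scores` vectors are hypothetical and unrelated to the mtcars results; packages such as ROCR or pROC offer ready-made versions of these steps.

```r
# Hypothetical true classes and predicted probabilities
labels <- c(0, 0, 1, 0, 1, 1, 0, 1)
scores <- c(0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9)

# ROC points: sweep the threshold from high to low and record
# the true positive rate and false positive rate at each step
thresholds <- sort(unique(c(Inf, scores, -Inf)), decreasing = TRUE)
tpr <- sapply(thresholds, function(t) mean(scores[labels == 1] >= t))
fpr <- sapply(thresholds, function(t) mean(scores[labels == 0] >= t))

# AUC as the area under the ROC points (trapezoidal rule)
auc_trap <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)

# Equivalent rank-based AUC: the probability that a randomly chosen
# positive receives a higher score than a randomly chosen negative
n1 <- sum(labels == 1)
n0 <- sum(labels == 0)
auc_rank <- (sum(rank(scores)[labels == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)

print(c(trapezoid = auc_trap, rank = auc_rank))
```

Both calculations give 0.75 for this toy data. Calling `plot(fpr, tpr, type = "l")` draws the curve itself.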