
Logistic Regression in R Programming

Logistic regression in R programming is a classification algorithm used to model the probability that an event succeeds or fails. It is used when the dependent variable is binary (0/1, True/False, Yes/No) in nature. The logit function is used as the link function for the binomial distribution.

A binary outcome variable’s probability can be predicted using the statistical modeling technique known as logistic regression. It is widely employed in many different industries, including marketing, finance, social sciences, and medical research.



The logistic function, commonly referred to as the sigmoid function, is the basic idea underpinning logistic regression. The sigmoid function is used to model the relationship between the predictor variables and the probability of the binary outcome.
[Figure: the sigmoid (logistic) curve, which maps any real input to a value between 0 and 1]

Logistic regression is also known as binomial logistic regression. It is based on the sigmoid function, whose output is a probability between 0 and 1 for any input from -infinity to +infinity.
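As a quick check of this mapping, here is a minimal sketch in R (the function name sigmoid is our own, not from any package):

```r
# Sigmoid maps any real number into the interval (0, 1)
sigmoid <- function(x) 1 / (1 + exp(-x))

sigmoid(0)     # 0.5: the midpoint
sigmoid(10)    # very close to 1
sigmoid(-10)   # very close to 0
```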



Theory

Logistic regression is a generalized linear model. Since it is used as a classification technique to predict a qualitative response, the value of y ranges from 0 to 1 and can be represented by the following equation: 

p = e^(b0 + b1x) / (1 + e^(b0 + b1x))

p is the probability of the characteristic of interest. The odds are defined as the probability of success divided by the probability of failure. They are a key representation of logistic regression coefficients and can take values between 0 and infinity. Odds of 1 mean the probability of success equals the probability of failure. Odds of 2 mean the probability of success is twice the probability of failure. Odds of 0.5 mean the probability of failure is twice the probability of success. 
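These three cases can be verified directly in R; the snippet below is a simple illustration, not part of the original example:

```r
# Success probabilities for the three cases described above
p <- c(0.5, 2/3, 1/3)

# Odds = probability of success / probability of failure
odds <- p / (1 - p)
odds   # 1, 2, 0.5
```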

odds = p / (1 - p)

Since the dependent variable follows a binomial distribution, we need to choose a link function that is best suited for this distribution. 

log(p / (1 - p)) = b0 + b1x1 + b2x2 + ...    (equation A)

This is the logit function, also known as the log of odds. The coefficients in the equation above are chosen to maximize the likelihood of observing the sample values, rather than to minimize the sum of squared errors as in ordinary regression. The logit must be linearly related to the independent variables: in equation A, the right-hand side is a linear combination of the x variables, similar to the OLS assumption that y is linearly related to x. The parameters b0, b1, b2, etc. are unknown and must be estimated from the available training data. In a logistic regression model, increasing x1 by one unit changes the logit by b1. The resulting change in p depends on the current values of the predictors: if b1 is positive then p increases, and if b1 is negative then p decreases.
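On the odds scale, a one-unit increase in x1 multiplies the odds by e^b1. A small illustration (the coefficient value 0.7 is made up for demonstration):

```r
# Hypothetical coefficient for x1
b1 <- 0.7

# Odds multiplier for a one-unit increase in x1
exp(b1)   # about 2.01: the odds roughly double
```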

The Dataset

mtcars (Motor Trend Car Road Tests) comprises fuel consumption, performance, and 10 aspects of automobile design for 32 automobiles. It ships with base R in the built-in datasets package, so no installation is needed. 




# mtcars ships with base R (datasets package),
# so no package installation is needed

# Summary of the dataset
summary(mtcars)

Performing Logistic regression on a dataset

Logistic regression is implemented in R using glm() with family = binomial, training the model on features or variables in the dataset. 




# Installing the packages

# For sample.split(), used to split the data
install.packages("caTools")

# For the ROC curve to evaluate the model
install.packages("ROCR")

# Loading packages
library(caTools)
library(ROCR)

Splitting the Data




# Splitting the dataset
# (sample.split expects the outcome vector, so vs is passed)
split <- sample.split(mtcars$vs, SplitRatio = 0.8)
split

train_reg <- subset(mtcars, split == TRUE)
test_reg <- subset(mtcars, split == FALSE)
 
# Training model
logistic_model <- glm(vs ~ wt + disp,
                    data = train_reg,
                    family = "binomial")
logistic_model
 
# Summary
summary(logistic_model)

Output:

Call:
glm(formula = vs ~ wt + disp, family = "binomial", data = train_reg)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.6552  -0.4051   0.4446   0.6180   1.9191  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)  1.58781    2.60087   0.610   0.5415  
wt           1.36958    1.60524   0.853   0.3936  
disp        -0.02969    0.01577  -1.882   0.0598 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 34.617  on 24  degrees of freedom
Residual deviance: 20.212  on 22  degrees of freedom
AIC: 26.212

Number of Fisher Scoring iterations: 6
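A common follow-up, not shown in the original run, is to put the fitted coefficients on the odds scale; assuming logistic_model from above:

```r
# Exponentiate coefficients to get odds multipliers
exp(coef(logistic_model))
```

For the disp estimate above, exp(-0.02969) is about 0.97, i.e. each extra unit of displacement multiplies the odds of vs = 1 by roughly 0.97.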

Predicting test data based on the model




predict_reg <- predict(logistic_model,
                       test_reg, type = "response")
predict_reg

Output:

Hornet Sportabout         Merc 280C        Merc 450SE Chrysler Imperial 
       0.01226166        0.78972164        0.26380531        0.01544309 
      AMC Javelin        Camaro Z28    Ford Pantera L 
       0.06104267        0.02807992        0.01107943 




# Converting probabilities to class labels
predict_class <- ifelse(predict_reg > 0.5, 1, 0)
 
# Evaluating model accuracy
# using the confusion matrix
table(test_reg$vs, predict_class)
 
missing_classerr <- mean(predict_class != test_reg$vs)
print(paste('Accuracy =', 1 - missing_classerr))
 
# ROC-AUC curve
# (built from the predicted probabilities, not the 0/1 labels)
ROCPred <- prediction(predict_reg, test_reg$vs)
ROCPer <- performance(ROCPred, measure = "tpr",
                      x.measure = "fpr")
 
auc <- performance(ROCPred, measure = "auc")
auc <- auc@y.values[[1]]
auc
 
# Plotting the curve
plot(ROCPer, colorize = TRUE,
     print.cutoffs.at = seq(0.1, 0.9, by = 0.1),
     main = "ROC CURVE")
abline(a = 0, b = 1)
 
auc <- round(auc, 4)
legend(.6, .4, auc, title = "AUC", cex = 1)

Output:

ROC Curve

Example 2:

We can also fit a logistic regression model to the Titanic dataset in R.




# Load the dataset
data(Titanic)
 
# Convert the contingency table to a data frame
# (one row per Class x Sex x Age x Survived cell, with a Freq column)
data <- as.data.frame(Titanic)
 
# Fit the logistic regression model
# (Freq is not used here, so every cell counts once)
model <- glm(Survived ~ Class + Sex + Age, family = binomial, data = data)
 
# View the summary of the model
summary(model)

Output:

Call:
glm(formula = Survived ~ Class + Sex + Age, family = binomial, 
    data = data)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-1.177  -1.177   0.000   1.177   1.177  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept)  4.022e-16  8.660e-01       0        1
Class2nd    -9.762e-16  1.000e+00       0        1
Class3rd    -4.699e-16  1.000e+00       0        1
ClassCrew   -5.551e-16  1.000e+00       0        1
SexFemale   -3.140e-16  7.071e-01       0        1
AgeAdult     5.103e-16  7.071e-01       0        1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 44.361  on 31  degrees of freedom
Residual deviance: 44.361  on 26  degrees of freedom
AIC: 56.361

Number of Fisher Scoring iterations: 2
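All estimates above are essentially zero because each cell of the contingency table is counted once, ignoring the Freq column of passenger counts. A sketch of the weighted fit, using the same formula but adding the weights argument, would account for the actual counts:

```r
# Weighted fit: each row of the table counts Freq times
model_weighted <- glm(Survived ~ Class + Sex + Age,
                      family = binomial, data = data,
                      weights = Freq)
summary(model_weighted)
```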

Plotting the ROC curve for the Titanic dataset




# Install and load the required package
install.packages("ROCR")
library(ROCR)
 
# Fit the logistic regression model
model <- glm(Survived ~ Class + Sex + Age, family = binomial, data = data)
 
# Make predictions on the dataset
predictions <- predict(model, type = "response")
 
# Create a prediction object for ROCR
prediction_object <- prediction(predictions, data$Survived)
 
# Create an ROC curve object
roc_object <- performance(prediction_object, measure = "tpr", x.measure = "fpr")
 
# Plot the ROC curve
plot(roc_object, main = "ROC Curve", col = "blue", lwd = 2)
 
# Add an AUC legend to the plot
auc_value <- performance(prediction_object, measure = "auc")@y.values[[1]]
legend("bottomright", legend = paste("AUC =", round(auc_value, 2)),
       col = "blue", lwd = 2)

Output:

ROC curve
