
How to interpret odds ratios in logistic regression

Logistic regression is a statistical method used to model the relationship between a binary outcome and predictor variables. This article provides an overview of logistic regression, including its assumptions and how to interpret regression coefficients.

Assumptions of logistic regression

Logistic regression assumes a binary (or binomial) outcome variable, independent observations, a linear relationship between each continuous predictor and the log-odds of the outcome, little or no multicollinearity among the predictors, and a reasonably large sample size.

Why is interpreting regression coefficients difficult?

Interpreting regression coefficients in logistic regression can be complex due to several factors: the coefficients are expressed on the log-odds scale rather than the probability scale; the relationship between the predictors and the predicted probability is non-linear, so a one-unit change in a predictor does not correspond to a constant change in probability; and each coefficient is conditional on the other variables in the model. Exponentiating the coefficients to obtain odds ratios makes them much easier to interpret.

Logistic Regression Model

The logistic regression model allows us to estimate the probability of a binary outcome from one or more predictor variables. Two ingredients are central: the logit transformation of the probability and maximum likelihood estimation of the coefficients.

Modeling the Logit-Transformed Probability:

The model equation is:

logit(P(Y=1)) = β₀ + β₁X₁ + β₂X₂ + ... + βₖXₖ

where P(Y=1) is the probability of the outcome being 1 (success), and X₁, X₂, ..., Xₖ are the predictor variables.

Maximum Likelihood Estimation:

The coefficients β₀, β₁, ..., βₖ are estimated by maximum likelihood, that is, by choosing the values that make the observed outcomes most probable under the model. In R, glm() with family = binomial performs this estimation.

Formula for the Probability:

The formula for the probability of the outcome being 1 (success) is:

P(Y=1) = 1 / (1 + e^(-z))

where z = β₀ + β₁X₁ + ... + βₖXₖ is the linear combination of the predictor variables and their coefficients.
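To make the link between the linear predictor and the probability concrete, here is a minimal R sketch; the coefficient and predictor values are purely hypothetical, chosen only for illustration:

# Hypothetical coefficients and predictor values (for illustration only)
b0 <- -3
b1 <- 0.05
b2 <- 0.8
x1 <- 50
x2 <- 1

# Linear predictor (the logit), probability, and odds
z <- b0 + b1 * x1 + b2 * x2
p <- 1 / (1 + exp(-z))       # same as plogis(z)
odds <- p / (1 - p)          # equals exp(z)

z
p
odds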

What is an Odds Ratio (OR)?

The odds ratio (OR) is a statistical measure used in logistic regression to quantify the strength and direction of the association between a predictor variable and an outcome variable. It represents the ratio of the odds of the outcome occurring in one group compared to the odds of the outcome occurring in another group, or for a one-unit increase in the predictor variable.

Formula:

Odds Ratio = Odds of outcome in group 1 / Odds of outcome in group 2

For example, suppose 200 individuals exercise regularly, among whom 20 develop heart disease, and 150 individuals do not exercise regularly, among whom 30 develop heart disease.

To calculate the odds ratio (OR) for developing heart disease between non-exercisers and exercisers:

For individuals who exercise regularly:

Odds of heart disease for exercisers =

Number of exercisers with heart disease / Number of exercisers without heart disease

= 20 / 180

= 0.1111

For individuals who do not exercise regularly:

Odds of heart disease for non-exercisers =

Number of non-exercisers with heart disease / Number of non-exercisers without heart disease

= 30/120

= 0.25

Calculate the odds ratio:

OR = Odds of heart disease for non-exercisers / Odds of heart disease for exercisers

= 0.25 / 0.1111

≈ 2.25

The odds of developing heart disease among individuals who do not regularly exercise are approximately 2.25 times higher than the odds among those who exercise regularly.
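The same arithmetic takes only a few lines of R; the counts below are those from the example above:

# Counts from the example above
exercisers_with_disease        <- 20
exercisers_without_disease     <- 180
non_exercisers_with_disease    <- 30
non_exercisers_without_disease <- 120

odds_exercisers     <- exercisers_with_disease / exercisers_without_disease          # 0.1111
odds_non_exercisers <- non_exercisers_with_disease / non_exercisers_without_disease  # 0.25

odds_ratio <- odds_non_exercisers / odds_exercisers
odds_ratio   # approximately 2.25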

Interpreting odds ratios in logistic regression

Interpreting odds ratios in logistic regression involves understanding how changes in predictor variables affect the odds of the outcome variable occurring.

Step 1: Understand the Odds Ratio

The odds ratio (OR) represents the ratio of the odds of the event occurring in one group compared to the odds of it occurring in another group. In logistic regression, it's calculated for each predictor variable.

Step 2: Examine the Significance

Before interpreting the odds ratio, check if it's statistically significant. This is usually indicated by the p-value associated with the odds ratio. A low p-value (typically < 0.05) suggests that the odds ratio is significant.
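As a minimal sketch (assuming a fitted glm object named model, like the ones in the examples later in this article), the p-values, odds ratios, and Wald 95% confidence intervals can be extracted as follows; a confidence interval that excludes 1 points to a significant odds ratio at the 5% level:

# p-values for the coefficients (tested on the log-odds scale)
coef(summary(model))[, "Pr(>|z|)"]

# Odds ratios with Wald 95% confidence intervals
exp(cbind(OR = coef(model), confint.default(model)))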

Step 3: Interpretation of Odds Ratio

In logistic regression, the odds ratio for a predictor is obtained by exponentiating its coefficient. An odds ratio greater than 1 means the odds of the outcome increase as the predictor increases (or for the group coded 1 relative to the group coded 0); an odds ratio less than 1 means the odds decrease; an odds ratio of exactly 1 means the predictor is not associated with the outcome.

Step 4: Magnitude of the Odds Ratio

The magnitude of the odds ratio indicates the strength of the association between the predictor and the outcome. Values further from 1 in either direction suggest a stronger association; note that an odds ratio of 0.5 represents the same strength of association as its reciprocal, 2, just in the opposite direction.

Step 5: Direction of Association

Pay attention to whether the odds ratio is greater than 1 or less than 1. This indicates the direction of the association between the predictor and the outcome.
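For instance, the sign of the underlying regression coefficient is what places the odds ratio above or below 1; the values below are arbitrary and only illustrate the conversion:

exp(0.7)    # about 2.01: a positive coefficient gives an odds ratio above 1
exp(-0.7)   # about 0.50: a negative coefficient gives an odds ratio below 1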

Step 6: Consider the Context

Interpretation should also consider the context of the study and the variables involved. Sometimes, associations may be influenced by confounding variables or other factors not captured in the model.

Example 1:

In this example, we simulate data on age, smoking status, and lung cancer, fit a logistic regression with glm(), and exponentiate the coefficients to obtain odds ratios.

# Load necessary libraries
library(dplyr)
library(ggplot2)

# Simulated data
set.seed(123)
n <- 1000
age <- rnorm(n, mean = 50, sd = 10)
smoking <- rbinom(n, size = 1, prob = 0.3)
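# Note: plogis(0.1 * age + 0.5 * smoking) is close to 1 for typical ages here,
# so almost every simulated outcome is 1; this near-separation leads to the
# unstable smoking estimate seen in the output below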
lung_cancer <- rbinom(n, size = 1, prob = plogis(0.1 * age + 0.5 * smoking))

# Create a dataframe
data <- data.frame(age = age, smoking = smoking, lung_cancer = lung_cancer)

# Fit logistic regression model
model <- glm(lung_cancer ~ age + smoking, data = data, family = binomial)

# Display summary of the model
summary(model)

# Extract odds ratios
odds_ratios <- exp(coef(model))

# Print the odds ratios
print(odds_ratios)

Output:

Call:
glm(formula = lung_cancer ~ age + smoking, family = binomial,
data = data)

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.470e+00 1.841e+00 1.885 0.0595 .
age 1.999e-02 3.753e-02 0.533 0.5944
smoking 1.710e+01 1.664e+03 0.010 0.9918
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 93.189 on 999 degrees of freedom
Residual deviance: 87.007 on 997 degrees of freedom
AIC: 93.007

Number of Fisher Scoring iterations: 20

> print(odds_ratios)
(Intercept) age smoking
3.214450e+01 1.020189e+00 2.666466e+07

Odds Ratio Interpretation:

The odds ratios in logistic regression represent the multiplicative change in the odds of the dependent variable for a one-unit change in the predictor variable, holding all other variables constant. Here's a more detailed explanation:

  1. Intercept:
    • The intercept odds ratio is 32.14. This is the baseline odds of lung cancer for a person with all predictors at 0 (age = 0 and smoking = 0, which may not be practically meaningful): at that baseline, having lung cancer is estimated to be about 32 times as likely as not having it.
  2. Age:
    • The odds ratio for age is 1.0202. This means that for each additional year of age, the odds of having lung cancer increase by approximately 2.02%, holding all other variables constant (a quick numeric check follows this list).
  3. Smoking:
    • The odds ratio for smoking is 26,664,660. This value is extremely large and would indicate a massive increase in the odds of lung cancer for smokers compared to non-smokers. However, the huge standard error and lack of statistical significance show that this estimate is not reliable; it reflects the near-separation in the simulated data rather than a real effect size.
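As a quick numeric check of the percentage-change reading above (using the odds ratio for age printed earlier), and to show how one-unit effects compound over larger changes:

or_age <- 1.020189            # odds ratio for age from the output above
(or_age - 1) * 100            # about 2.02% higher odds per additional year
or_age^10                     # about 1.22: odds multiplier for a 10-year increase

Because odds ratios combine multiplicatively, a 10-year difference in age corresponds to roughly a 22% increase in the odds, not 10 × 2.02%.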

Example 2:

In this example, we use a real dataset called mtcars in R. We then create a logistic regression model to predict whether a car has high or low mileage per gallon (mpg) based on its weight and whether it has an automatic or manual transmission.

# Load necessary libraries
library(dplyr)

# Load the mtcars dataset
data(mtcars)

# Convert mpg to a binary variable (high/low) using median as threshold
mtcars$mpg_high <- ifelse(mtcars$mpg > median(mtcars$mpg), 1, 0)

# Fit logistic regression model
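# In mtcars, am is coded 0 = automatic and 1 = manual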
model <- glm(mpg_high ~ wt + am, data = mtcars, family = binomial)

# Display summary of the model
summary(model)

# Extract odds ratios
odds_ratios <- exp(coef(model))

# Print the odds ratios
print(odds_ratios)

# Predict probabilities for automatic and manual transmissions
pred_automatic <- predict(model, newdata = data.frame(wt = seq(min(mtcars$wt), max(mtcars$wt), length.out = 100), am = 0), type = "response")
pred_manual <- predict(model, newdata = data.frame(wt = seq(min(mtcars$wt), max(mtcars$wt), length.out = 100), am = 1), type = "response")

# Plot predicted probabilities
plot(mtcars$wt, mtcars$mpg_high, xlab = "Weight (wt)", ylab = "Probability of high MPG", pch = 16, col = ifelse(mtcars$am == 0, "blue", "red"))
lines(seq(min(mtcars$wt), max(mtcars$wt), length.out = 100), pred_automatic, col = "blue", lty = 1)
lines(seq(min(mtcars$wt), max(mtcars$wt), length.out = 100), pred_manual, col = "red", lty = 1)
legend("topright", legend = c("Automatic", "Manual"), col = c("blue", "red"), lty = 1, pch = 16)

Output:

Call:
glm(formula = mpg_high ~ wt + am, family = binomial, data = mtcars)

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 641.47 431780.56 0.001 0.999
wt -193.01 129407.16 -0.001 0.999
am -63.35 90144.23 -0.001 0.999

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 4.4236e+01 on 31 degrees of freedom
Residual deviance: 3.7212e-09 on 29 degrees of freedom
AIC: 6

Number of Fisher Scoring iterations: 25

> print(odds_ratios)
(Intercept) wt am
3.858022e+278 1.496441e-84 3.061869e-28


Plot: predicted probability of high MPG versus weight, with separate curves for automatic (blue) and manual (red) transmissions.

Odds Ratio Interpretation:

  1. Intercept:
    • The odds ratio for the intercept is 3.858022e+278. This astronomically large value would be the baseline odds of mpg_high when all predictors are zero (which is not practically meaningful here). Such an extreme odds ratio signals a problem with the model, in this case complete (or quasi-complete) separation, where the predictors perfectly separate the outcome (a quick check is sketched after this list).
  2. wt (Weight):
    • The odds ratio for wt is 1.496441e-84. This value is essentially zero, indicating that for each unit increase in weight the estimated odds of mpg_high collapse toward zero, holding the other variable constant. In other words, heavier cars are estimated to be far less likely to have high miles per gallon, but the extreme magnitude again reflects the separation rather than a trustworthy effect size.
  3. am (Transmission Type):
    • The odds ratio for am is 3.061869e-28. Because am is coded 0 = automatic and 1 = manual, this estimate says that, holding weight constant, a manual transmission is associated with drastically lower odds of mpg_high than an automatic one. Like the other estimates, it comes with an enormous standard error and should not be taken at face value.
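One simple way to check for the separation suspected above is to compare each predictor across the two outcome groups; if, for example, the ranges of wt barely overlap between the mpg_high groups, that explains the near-zero residual deviance and the enormous standard errors in the output. A minimal sketch:

# Range of weight within each mpg_high group; (nearly) non-overlapping ranges
# indicate that wt separates the outcome
aggregate(wt ~ mpg_high, data = mtcars, FUN = range)

# Cross-tabulation of transmission type against the outcome
table(am = mtcars$am, mpg_high = mtcars$mpg_high)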

Conclusion

Interpreting odds ratios in logistic regression entails understanding how predictor variables influence the odds of the outcome. A ratio greater than 1 indicates a positive association, implying higher odds of the outcome, while a ratio less than 1 signifies a negative association, suggesting lower odds. Significance, direction, and magnitude are all crucial considerations: statistical significance tells you whether an association can be distinguished from chance, while the distance of the odds ratio from 1 reflects its strength.
