
How to interpret odds ratios in logistic regression

Logistic regression is a statistical method used to model the relationship between a binary outcome and predictor variables. This article provides an overview of logistic regression, including its assumptions and how to interpret regression coefficients.

Assumptions of logistic regression

Logistic regression assumes a binary (or binomial) outcome variable, independent observations, a linear relationship between each continuous predictor and the log-odds of the outcome, little or no multicollinearity among the predictors, and a reasonably large sample size.

Why is interpreting regression coefficients difficult?

Interpreting regression coefficients in logistic regression can be complex due to several factors: the coefficients are expressed on the log-odds scale rather than the probability scale; the relationship between the predictors and the predicted probability is non-linear, so a one-unit change in a predictor does not correspond to a constant change in probability; and each coefficient is conditional on the other variables in the model. Exponentiating the coefficients to obtain odds ratios makes them much easier to interpret.

Logistic Regression Model

The logistic regression model allows us to estimate the probability of a binary outcome from one or more predictor variables. Two ingredients are central: the logit transformation of the probability and maximum likelihood estimation of the coefficients.

Modeling the Logit-Transformed Probability:

The model equation is:

logit(P(Y=1)) = β₀ + β₁X₁ + β₂X₂ + ... + βₖXₖ

where P(Y=1) is the probability of the outcome being 1 (success), and X₁, X₂, ..., Xₖ are the predictor variables.

Maximum Likelihood Estimation:

The coefficients β₀, β₁, ..., βₖ are estimated by maximum likelihood, that is, by choosing the values that make the observed outcomes most probable under the model. In R, glm() with family = binomial performs this estimation.

Formula for the Probability:

The formula for the probability of the outcome being 1 (success) is:

P(Y=1) = 1 / (1 + e^(-z))

where z = β₀ + β₁X₁ + ... + βₖXₖ is the linear combination of the predictor variables and their coefficients.
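To make the link between the linear predictor and the probability concrete, here is a minimal R sketch; the coefficient and predictor values are purely hypothetical, chosen only for illustration:

# Hypothetical coefficients and predictor values (for illustration only)
b0 <- -3
b1 <- 0.05
b2 <- 0.8
x1 <- 50
x2 <- 1

# Linear predictor (the logit), probability, and odds
z <- b0 + b1 * x1 + b2 * x2
p <- 1 / (1 + exp(-z))       # same as plogis(z)
odds <- p / (1 - p)          # equals exp(z)

z
p
odds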

What is an Odds Ratio (OR)?

The odds ratio (OR) is a statistical measure used in logistic regression to quantify the strength and direction of the association between a predictor variable and an outcome variable. It represents the ratio of the odds of the outcome occurring in one group compared to the odds of the outcome occurring in another group, or for a one-unit increase in the predictor variable.

Formula:

Odds Ratio = Odds of outcome in group 1 / Odds of outcome in group 2

For example, suppose 200 individuals exercise regularly, among whom 20 develop heart disease, and 150 individuals do not exercise regularly, among whom 30 develop heart disease.

To calculate the odds ratio (OR) for developing heart disease between non-exercisers and exercisers:

For individuals who exercise regularly:

Odds of heart disease for exercisers =

Number of exercisers with heart disease / Number of exercisers without heart disease

= 20 / 180

= 0.1111

For individuals who do not exercise regularly:

Odds of heart disease for non-exercisers =

Number of non-exercisers with heart disease / Number of non-exercisers without heart disease

= 30/120

= 0.25

Calculate the odds ratio:

OR = Odds of heart disease for non-exercisers / Odds of heart disease for exercisers

= 0.25 / 0.1111

≈ 2.25

The odds of developing heart disease among individuals who do not regularly exercise are approximately 2.25 times higher than the odds among those who exercise regularly.
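The same arithmetic takes only a few lines of R; the counts below are those from the example above:

# Counts from the example above
exercisers_with_disease        <- 20
exercisers_without_disease     <- 180
non_exercisers_with_disease    <- 30
non_exercisers_without_disease <- 120

odds_exercisers     <- exercisers_with_disease / exercisers_without_disease          # 0.1111
odds_non_exercisers <- non_exercisers_with_disease / non_exercisers_without_disease  # 0.25

odds_ratio <- odds_non_exercisers / odds_exercisers
odds_ratio   # approximately 2.25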

Interpreting odds ratios in logistic regression

Interpreting odds ratios in logistic regression involves understanding how changes in predictor variables affect the odds of the outcome variable occurring.

Step 1: Understand the Odds Ratio

The odds ratio (OR) represents the ratio of the odds of the event occurring in one group compared to the odds of it occurring in another group. In logistic regression, it's calculated for each predictor variable.

Step 2: Examine the Significance

Before interpreting the odds ratio, check if it's statistically significant. This is usually indicated by the p-value associated with the odds ratio. A low p-value (typically < 0.05) suggests that the odds ratio is significant.
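As a minimal sketch (assuming a fitted glm object named model, like the ones in the examples later in this article), the p-values, odds ratios, and Wald 95% confidence intervals can be extracted as follows; a confidence interval that excludes 1 points to a significant odds ratio at the 5% level:

# p-values for the coefficients (tested on the log-odds scale)
coef(summary(model))[, "Pr(>|z|)"]

# Odds ratios with Wald 95% confidence intervals
exp(cbind(OR = coef(model), confint.default(model)))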

Step 3: Interpretation of Odds Ratio

In logistic regression, the odds ratio for a predictor is obtained by exponentiating its coefficient. An odds ratio greater than 1 means the odds of the outcome increase as the predictor increases (or for the group coded 1 relative to the group coded 0); an odds ratio less than 1 means the odds decrease; an odds ratio of exactly 1 means the predictor is not associated with the outcome.

Step 4: Magnitude of the Odds Ratio

The magnitude of the odds ratio indicates the strength of the association between the predictor and the outcome. Values further from 1 in either direction suggest a stronger association; note that an odds ratio of 0.5 represents the same strength of association as its reciprocal, 2, just in the opposite direction.

Step 5: Direction of Association

Pay attention to whether the odds ratio is greater than 1 or less than 1. This indicates the direction of the association between the predictor and the outcome.
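For instance, the sign of the underlying regression coefficient is what places the odds ratio above or below 1; the values below are arbitrary and only illustrate the conversion:

exp(0.7)    # about 2.01: a positive coefficient gives an odds ratio above 1
exp(-0.7)   # about 0.50: a negative coefficient gives an odds ratio below 1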

Step 6: Consider the Context

Interpretation should also consider the context of the study and the variables involved. Sometimes, associations may be influenced by confounding variables or other factors not captured in the model.

Example 1:

In this example, we simulate data on age, smoking status, and lung cancer, fit a logistic regression with glm(), and exponentiate the coefficients to obtain odds ratios.

# Load necessary libraries
library(dplyr)
library(ggplot2)

# Simulated data
set.seed(123)
n <- 1000
age <- rnorm(n, mean = 50, sd = 10)
smoking <- rbinom(n, size = 1, prob = 0.3)
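# Note: plogis(0.1 * age + 0.5 * smoking) is close to 1 for typical ages here,
# so almost every simulated outcome is 1; this near-separation leads to the
# unstable smoking estimate seen in the output below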
lung_cancer <- rbinom(n, size = 1, prob = plogis(0.1 * age + 0.5 * smoking))

# Create a dataframe
data <- data.frame(age = age, smoking = smoking, lung_cancer = lung_cancer)

# Fit logistic regression model
model <- glm(lung_cancer ~ age + smoking, data = data, family = binomial)

# Display summary of the model
summary(model)

# Extract odds ratios
odds_ratios <- exp(coef(model))

# Print the odds ratios
print(odds_ratios)

Output:

Call:
glm(formula = lung_cancer ~ age + smoking, family = binomial,
data = data)

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.470e+00 1.841e+00 1.885 0.0595 .
age 1.999e-02 3.753e-02 0.533 0.5944
smoking 1.710e+01 1.664e+03 0.010 0.9918
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 93.189 on 999 degrees of freedom
Residual deviance: 87.007 on 997 degrees of freedom
AIC: 93.007

Number of Fisher Scoring iterations: 20

> print(odds_ratios)
(Intercept) age smoking
3.214450e+01 1.020189e+00 2.666466e+07

Odds Ratio Interpretation:

The odds ratios in logistic regression represent the multiplicative change in the odds of the dependent variable for a one-unit change in the predictor variable, holding all other variables constant. Here's a more detailed explanation:

  1. Intercept:
    • The intercept odds ratio is 32.14. This is the baseline odds of lung cancer for a person with all predictors at 0 (age = 0 and smoking = 0, which may not be practically meaningful): at that baseline, having lung cancer is estimated to be about 32 times as likely as not having it.
  2. Age:
    • The odds ratio for age is 1.0202. This means that for each additional year of age, the odds of having lung cancer increase by approximately 2.02%, holding all other variables constant (a quick numeric check follows this list).
  3. Smoking:
    • The odds ratio for smoking is 26,664,660. This value is extremely large and would indicate a massive increase in the odds of lung cancer for smokers compared to non-smokers. However, the huge standard error and lack of statistical significance show that this estimate is not reliable; it reflects the near-separation in the simulated data rather than a real effect size.
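As a quick numeric check of the percentage-change reading above (using the odds ratio for age printed earlier), and to show how one-unit effects compound over larger changes:

or_age <- 1.020189            # odds ratio for age from the output above
(or_age - 1) * 100            # about 2.02% higher odds per additional year
or_age^10                     # about 1.22: odds multiplier for a 10-year increase

Because odds ratios combine multiplicatively, a 10-year difference in age corresponds to roughly a 22% increase in the odds, not 10 × 2.02%.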

Example 2:

In this example, we use a real dataset called mtcars in R. We then create a logistic regression model to predict whether a car has high or low mileage per gallon (mpg) based on its weight and whether it has an automatic or manual transmission.

# Load necessary libraries
library(dplyr)

# Load the mtcars dataset
data(mtcars)

# Convert mpg to a binary variable (high/low) using median as threshold
mtcars$mpg_high <- ifelse(mtcars$mpg > median(mtcars$mpg), 1, 0)

# Fit logistic regression model
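# In mtcars, am is coded 0 = automatic and 1 = manual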
model <- glm(mpg_high ~ wt + am, data = mtcars, family = binomial)

# Display summary of the model
summary(model)

# Extract odds ratios
odds_ratios <- exp(coef(model))

# Print the odds ratios
print(odds_ratios)

# Predict probabilities for automatic and manual transmissions
pred_automatic <- predict(model, newdata = data.frame(wt = seq(min(mtcars$wt), max(mtcars$wt), length.out = 100), am = 0), type = "response")
pred_manual <- predict(model, newdata = data.frame(wt = seq(min(mtcars$wt), max(mtcars$wt), length.out = 100), am = 1), type = "response")

# Plot predicted probabilities
plot(mtcars$wt, mtcars$mpg_high, xlab = "Weight (wt)", ylab = "Probability of high MPG", pch = 16, col = ifelse(mtcars$am == 0, "blue", "red"))
lines(seq(min(mtcars$wt), max(mtcars$wt), length.out = 100), pred_automatic, col = "blue", lty = 1)
lines(seq(min(mtcars$wt), max(mtcars$wt), length.out = 100), pred_manual, col = "red", lty = 1)
legend("topright", legend = c("Automatic", "Manual"), col = c("blue", "red"), lty = 1, pch = 16)

Output:

Call:
glm(formula = mpg_high ~ wt + am, family = binomial, data = mtcars)

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 641.47 431780.56 0.001 0.999
wt -193.01 129407.16 -0.001 0.999
am -63.35 90144.23 -0.001 0.999

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 4.4236e+01 on 31 degrees of freedom
Residual deviance: 3.7212e-09 on 29 degrees of freedom
AIC: 6

Number of Fisher Scoring iterations: 25

> print(odds_ratios)
(Intercept) wt am
3.858022e+278 1.496441e-84 3.061869e-28


Plot: predicted probability of high MPG versus weight, with separate curves for automatic (blue) and manual (red) transmissions.

Odds Ratio Interpretation:

  1. Intercept:
    • The odds ratio for the intercept is 3.858022e+278. This astronomically large value would be the baseline odds of mpg_high when all predictors are zero (which is not practically meaningful here). Such an extreme odds ratio signals a problem with the model, in this case complete (or quasi-complete) separation, where the predictors perfectly separate the outcome (a quick check is sketched after this list).
  2. wt (Weight):
    • The odds ratio for wt is 1.496441e-84. This value is essentially zero, indicating that for each unit increase in weight the estimated odds of mpg_high collapse toward zero, holding the other variable constant. In other words, heavier cars are estimated to be far less likely to have high miles per gallon, but the extreme magnitude again reflects the separation rather than a trustworthy effect size.
  3. am (Transmission Type):
    • The odds ratio for am is 3.061869e-28. Because am is coded 0 = automatic and 1 = manual, this estimate says that, holding weight constant, a manual transmission is associated with drastically lower odds of mpg_high than an automatic one. Like the other estimates, it comes with an enormous standard error and should not be taken at face value.
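One simple way to check for the separation suspected above is to compare each predictor across the two outcome groups; if, for example, the ranges of wt barely overlap between the mpg_high groups, that explains the near-zero residual deviance and the enormous standard errors in the output. A minimal sketch:

# Range of weight within each mpg_high group; (nearly) non-overlapping ranges
# indicate that wt separates the outcome
aggregate(wt ~ mpg_high, data = mtcars, FUN = range)

# Cross-tabulation of transmission type against the outcome
table(am = mtcars$am, mpg_high = mtcars$mpg_high)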

Conclusion

Interpreting odds ratios in logistic regression entails understanding how predictor variables influence the odds of the outcome. A ratio greater than 1 indicates a positive association, implying higher odds of the outcome, while a ratio less than 1 signifies a negative association, suggesting lower odds. Significance, direction, and magnitude are all crucial considerations: statistical significance tells you whether an association can be distinguished from chance, while the distance of the odds ratio from 1 reflects its strength.
