
Generalized Linear Models Using R

Last Updated : 20 Dec, 2023

GLM stands for Generalized Linear Model. In the R programming language, GLMs provide a flexible framework that covers many statistical models, including linear regression, logistic regression, Poisson regression, and others.

GLMs (Generalized linear models) are a type of statistical model that is extensively used in the analysis of non-normal data, such as count data or binary data. They enable us to describe the connection between one or more predictor variables and a response variable in a flexible manner. This tutorial will go over how to create generalized linear models in the R Programming Language.

Major components of GLMs

  • a probability distribution for the response variable, 
  • a linear predictor function of the predictor variables, and 
  • a link function that connects the linear predictor to the response variable’s mean. 

The choice of probability distribution and link function is determined by the type of response variable and the research question at hand. R includes functions for fitting GLMs, most notably glm(). Using this function, the user specifies the model formula, which contains the response variable and one or more predictor variables, as well as the probability distribution family and link function to be used.
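For instance (a sketch only; the data frame my_data and the column names are hypothetical placeholders), the same binomial family can be combined with a non-default link function, such as the probit link:

R

# Same binomial family, but with a probit link instead of the default logit
model <- glm(binary_response ~ predictor1 + predictor2,
             family = binomial(link = "probit"), data = my_data)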

Mathematical Analysis of GLM

GLM (Generalized Linear Models) is a statistical framework that extends the linear regression model to handle non-normal response variables. The mathematical analysis of a GLM consists of specifying a probability distribution for the response variable and modelling the relationship between the predictor variables and the response variable’s expected value.
Let X be a matrix of predictor variables, and let Y be the response variable. In a linear regression model, we assume that Y has a normal distribution with mean μ and variance σ². The relationship between Y and X is then modelled as:

                                   Y = Xβ + ε

where β is a vector of regression coefficients and ε is a vector of errors that are normally distributed with mean 0 and variance σ².
In a GLM, we relax the normality assumption. The mean of Y is related to the linear predictor through a link function g, and the variance is allowed to depend on the mean:

                                  g(μ) = Xβ
                                  Var(Y) = φV(μ)

where g(·) is the link function, μ is the expected value of Y, Var(Y) is the variance of Y, V(·) is the variance function of the exponential-family distribution, and φ is a scale (dispersion) parameter.
The parameters β and φ are estimated by maximum likelihood. The maximum likelihood estimate of β is obtained by maximising the likelihood function:

                                    L(β, φ) = ∏_{i=1}^{n} f(yᵢ | μᵢ, φ)

where μᵢ is the expected value of Yᵢ for the i-th observation and f(·) is the probability density (or mass) function of the exponential-family distribution.
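As a small illustration (a sketch using the built-in mtcars data, which is also used later in this article), the maximised log-likelihood of a fitted GLM can be inspected with logLik():

R

# glm() estimates the coefficients by maximising this likelihood
fit <- glm(mpg ~ wt, data = mtcars, family = gaussian)

coef(fit)     # maximum likelihood estimates of beta
logLik(fit)   # value of the maximised log-likelihood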

GLM model families

There are several GLM model families, depending on the nature of the response variable. Four well-known families are described below:

  • Binomial: The binomial family is used for binary response variables (i.e., two categories) and assumes a binomial distribution.

R




# Fit a logistic regression model using the binomial family
model <- glm(binary_response_variable ~ predictor_variable1 + predictor_variable2,
            family = binomial(link = "logit"), data = data)


  • Gaussian: This family is used for continuous response variables and assumes a normal distribution. The link function for this family is typically the identity function.

R




# Fit a linear regression model using the gaussian family
model <- glm(response_variable ~ predictor_variable1 + predictor_variable2,
            family = gaussian(link = "identity"), data = data)


  • Gamma: The gamma family is used for continuous response variables that are strictly positive and have a skewed distribution.

R




# Fit a gamma regression model using the Gamma family
model <- glm(positive_response_variable ~ predictor_variable1 + predictor_variable2,
            family = Gamma(link = "inverse"), data = data)


  • Quasibinomial: The quasibinomial family is used when the response variable is binary but shows more variance than a binomial distribution would predict (overdispersion), for example when there is additional variation that the model does not account for. 

R




# Fit a quasibinomial model
model <- glm(response_variable ~ predictor_variable1 + predictor_variable2,
            family = quasibinomial(), data = data)


Building a Generalized Linear Model

GLM in R programming

We will use the “mtcars” dataset in R to illustrate the use of generalized linear models. This dataset contains information on different car models, including miles per gallon (mpg), horsepower (hp), and weight (wt). The response variable will be “mpg”, and the predictor variables will be “hp” and “wt”.

R




# Load the mtcars dataset
data(mtcars)
# print Dataset
head(mtcars)


Output:

                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

To build a generalized linear model (GLM) in R, we first load the mtcars dataset and print its first six records with head().

  • To create a generalized linear model in R, we must first select a suitable probability distribution for the response variable. 
  • If the response variable is binary (e.g., 0 or 1), we could use the binomial (Bernoulli) distribution. 
  • If the response variable is a count (for example, the number of vehicles sold), the Poisson distribution may be used, as in the short sketch below.
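A hedged sketch of a Poisson GLM for a count response (the data frame sales_data and its column names are hypothetical placeholders):

R

# Fit a Poisson regression model for count data using a log link
model_pois <- glm(num_cars_sold ~ price + advertising,
                  family = poisson(link = "log"), data = sales_data)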

To create a generalized linear model in R, we use the glm() function. We must specify the model formula (i.e., the response variable and the predictor variables) as well as the probability distribution family.

R




# Load the mtcars dataset
data(mtcars)
  
# Fit a generalized linear model
model <- glm(mpg ~ hp + wt, data = mtcars, family = gaussian)


The Gaussian family is used in this example, which implies that the response variable has a normal distribution. The glm() function yields an object of class “glm” containing model information such as coefficients and deviance.

Why Gaussian family?

One benefit of the Gaussian family is that the model can be interpreted directly in terms of the mean and variance of the response variable. In addition, the model can be fitted with maximum likelihood estimation, a well-known and widely used statistical technique.
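As a small check (a sketch, not part of the original workflow): with the gaussian family and identity link, glm() reproduces an ordinary least-squares fit from lm().

R

# With family = gaussian (identity link), glm() and lm() give the same fit
lm_fit  <- lm(mpg ~ hp + wt, data = mtcars)
glm_fit <- glm(mpg ~ hp + wt, data = mtcars, family = gaussian)

coef(lm_fit)    # OLS coefficient estimates
coef(glm_fit)   # identical coefficient estimates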

Calculate summary of the model

R




#Calculate summary of the model
summary(model)


Output:

Call:
glm(formula = mpg ~ hp + wt, family = gaussian, data = mtcars)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 37.22727    1.59879  23.285  < 2e-16 ***
hp          -0.03177    0.00903  -3.519  0.00145 ** 
wt          -3.87783    0.63273  -6.129 1.12e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 6.725785)

Null deviance: 1126.05 on 31 degrees of freedom
Residual deviance: 195.05 on 29 degrees of freedom
AIC: 156.65

Number of Fisher Scoring iterations: 2

Holding wt constant, a one-unit increase in hp is associated with a 0.03177 decrease in mpg, and a one-unit increase in wt (1000 lbs) is associated with a 3.87783 decrease in mpg. All coefficients (intercept, hp, wt) are statistically significant.

  • Fit: The low residual deviance (195.05) compared with the null deviance (1126.05), together with the AIC (156.65), indicates a well-fitting model.
  • Dispersion: The dispersion parameter (6.725785) estimates the residual variance of mpg, which matters for assessing prediction precision. In practical terms, the model helps us understand and predict fuel efficiency, which is useful for automotive design and environmental considerations (see the prediction sketch below).
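As a quick follow-up (a sketch; the new data values are made up for illustration), the fitted model can be used to predict mpg for new cars with predict():

R

# Predict mpg for two hypothetical cars using the fitted model
new_cars <- data.frame(hp = c(110, 200), wt = c(2.5, 3.5))
predict(model, newdata = new_cars)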

Visualize the model

R




# Plot the residual vs fitted values
plot(model, which = 1)
  
# Plot the Q-Q plot of residuals
plot(model, which = 2)


Output:

[Plot: Residuals vs Fitted values]

[Plot: Normal Q-Q plot of the residuals]

After fitting a generalized linear model, we must evaluate how well it fits the data. This can be done with diagnostic plots such as the residual plot and the Q-Q plot, shown above.

  • The residual plot displays the residuals (differences between observed and predicted values) plotted against the fitted values (i.e., the predicted values). We want to see a random scatter of residuals around zero, which indicates that the model is capturing the trends in the data.
  • The Q-Q plot compares the quantiles of the residuals with the quantiles of a normal distribution. The points should follow a straight line, indicating that the residuals are approximately normally distributed. The same quantities can also be extracted directly, as in the sketch below.
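A minimal sketch (using base R functions, not part of the original workflow) showing how to pull out the residuals and fitted values and rebuild these plots by hand:

R

# Extract residuals and fitted values from the fitted glm object
res      <- residuals(model)   # deviance residuals by default
fit_vals <- fitted(model)      # fitted (predicted) values

# Residuals vs fitted values
plot(fit_vals, res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Normal Q-Q plot of the residuals
qqnorm(res)
qqline(res)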

Conclusion

We covered how to create generalized linear models in R in this article. We showed the process using the “mtcars” dataset and the variables “mpg”, “hp”, and “wt”. We also discussed the significance of selecting a suitable probability distribution for the response variable and how to evaluate model fit using diagnostic plots. Generalized linear models are a versatile instrument for modeling non-normal data and can be used in a variety of settings.


