Regression Analysis in R Programming
In statistics, Logistic Regression is model that takes response variables (dependent variable) and features (independent variables) to determine estimated probability of an event. Logistic model is used when response variable has categorical values such as 0 or 1. For example, a student will pass/fail, a mail is spam or not, determining the images, etc. In this article, we’ll discuss about regression analysis, types of regression and implementation of logistic regression in R programming.
Regression analysis is a group of statistical processes used in R programming and statistics to determine the relationship between dataset variables. Generally, regression analysis is used to determine the relationship between the dependent and independent variables of the dataset. Regression analysis helps to understand how dependent variables changes when one of the independent variable is changes and other independent variables are kept constant. This helps in building a regression model and further, helps in forecasting the values with respect to a change in one of the independent variables. On the basis of types of dependent variables, number of independent variables and the shape of the regression line, there are 4 types of regression analysis techniques i.e., Linear Regression, Logistic Regression, Multinomial Logistic Regression and Ordinal Logistic Regression.
Types of Regression Analysis
Linear Regression is one of the most widely used regression techniques to model the relationship between two variables. It uses a linear relationship to model the regression line. There are 2 variables used in the linear relationship equation i.e., predictor variable and response variable.
y = ax + b
y is the response variable
x is the predictor variable
a and b are the coefficients
The regression line created using this technique is a straight line. The response variable is derived from predictor variables. Predictor variables are estimated using some statistical experiments. Linear regression is widely used but these techniques is not capable of predicting the probability.
On the other hand, logistic regression has an advantage over linear regression as it is capable of predicting the values within the range. Logistic regression is used to predict the values within the categorical range. For example, male or female, winner or loser, etc.
Logistic regression uses the following sigmoidal function-
y represents response variable
z represents equation of independent variables or features
Multinomial Logistic Regression
Multinomial logistic regression is an advanced technique of logistic regression which takes more than 2 categorical variables unlike, in logistic regression which takes 2 categorical variables. For example, a biology researcher found a new type of species and type of species can be determined on many factors such as size, shape, eye color, the environmental factor of its living, etc.
Ordinal Logistic Regression
Ordinal logistic regression is also an extension to logistic regression. It is used to predict the values as different levels of category (ordered). In simple words, it predicts the rank. For example, a survey of taste quality of food is created by a restaurant and using ordinal logistic regression, a survey response variable can be created on a scale of any interval such 1-10 which helps in determining the customer’s response to their food items.
Implementation of Logistic Regression in R programming
In R language, logistic regression model is created using
Syntax:glm(formula, family = binomial)
formula: represents an equation on the basis of which model has to be fitted.
family: represents the type of function to be used i.e., binomial for logistic regression
To know about more optional parameters of glm() function, use below command in R:
Let us assume a vector of IQ level of students in a class. Another vector contains the result of the corresponding student i.e., fail or pass (0 or 1) in an exam.
Above code produces the following result
IQ result 1 25.46872 0 2 26.72004 0 3 27.16163 0 4 27.55291 1 5 27.72577 0 6 28.00731 0 7 28.18095 0 8 28.28053 0 9 28.29086 0 10 28.34474 1 11 28.35581 1 12 28.40969 0 13 28.72583 0 14 28.81105 0 15 28.87337 1 16 29.00383 1 17 29.01762 0 18 29.03629 0 19 29.18109 1 20 29.39251 0 21 29.40852 0 22 29.78844 0 23 29.80456 1 24 29.81815 0 25 29.86478 0 26 29.91535 1 27 30.04204 1 28 30.09565 0 29 30.28495 1 30 30.39359 1 31 30.78886 1 32 30.79307 1 33 30.98601 1 34 31.14602 0 35 31.48225 1 36 31.74983 1 37 31.94705 1 38 31.94772 1 39 33.63058 0 40 35.35096 1 Call: glm(formula = result ~ IQ, family = binomial, data = df) Deviance Residuals: Min 1Q Median 3Q Max -2.1451 -0.9742 -0.4950 1.0326 1.7283 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -16.8093 7.3368 -2.291 0.0220 * IQ 0.5651 0.2482 2.276 0.0228 * --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 55.352 on 39 degrees of freedom Residual deviance: 48.157 on 38 degrees of freedom AIC: 52.157 Number of Fisher Scoring iterations: 4