Regression analysis is a set of statistical methods for estimating the relationship between a dependent variable and one or more independent variables, also known as predictors or covariates. It is mainly used for two conceptually distinct purposes: prediction and forecasting, where it overlaps substantially with the field of machine learning, and inference about the relationships between the independent and dependent variables.
Regression with Categorical Variables
Categorical variables are variables that can take one of a limited, fixed number of possible values, assigning each individual or other unit of observation to a particular group or nominal category on the basis of some qualitative property. They are also known as factors or qualitative variables. When the dependent variable is categorical with two possible outcomes, the usual choice of regression model is logistic regression. Logistic regression estimates its parameters by maximum likelihood and models the relationship between a set of independent variables and a categorical dependent variable. R makes such a model straightforward to fit through its built-in modelling functions. Now, let's set up a logistic regression model with categorical variables for better understanding.
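Before the worked example, here is a minimal sketch of how logistic regression turns a linear predictor into a probability. It applies the logistic (inverse logit) function by hand and checks the result against base R's plogis(). The values in eta are made up purely for illustration.

```r
# Hypothetical linear-predictor values (illustrative only)
eta <- c(-2, 0, 2)

# Logistic function: maps any real number to a probability in (0, 1)
p_manual <- 1 / (1 + exp(-eta))

# plogis() in base R computes the same quantity
p_builtin <- plogis(eta)

print(p_manual)
print(all.equal(p_manual, p_builtin))
```

A linear predictor of 0 corresponds to a probability of 0.5, which is why 0.5 is a natural default cutoff when turning probabilities into class labels.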
Example: The objective is to predict whether a candidate will be admitted to a university using the variables gre, gpa, and rank. The R code for each step is shown together with its output. The data is in .csv format. We get the working directory with the getwd() function and place our dataset binary.csv inside it before proceeding. Please download the CSV file here.
R
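# Print the working directory; binary.csv must be placed in it
getwd()
# Read the admissions data and inspect its structure
data <- read.csv("binary.csv")
str(data)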
Output:
'data.frame': 400 obs. of 4 variables:
$ admit: int 0 1 1 1 0 1 1 0 1 0 ...
$ gre : int 380 660 800 640 520 760 560 400 540 700 ...
$ gpa : num 3.61 3.67 4 3.19 2.93 3 2.98 3.08 3.39 3.92 ...
$ rank : int 3 3 1 4 4 2 1 2 3 2 ...
Looking at the structure of the dataset, we can see that it has 4 variables. admit indicates whether a candidate was admitted (1) or not admitted (0); gre, gpa, and rank give the candidate's GRE score, GPA at the previous college, and the rank of that college, respectively. We use admit as the dependent variable and gre, gpa, and rank as the independent variables. Note that admit and rank are categorical variables but are stored as integers. To use them as categorical variables in our model, we convert them to factors with the as.factor() function.
R
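# Convert the numerically coded categorical variables to factors
data$admit <- as.factor(data$admit)
data$rank <- as.factor(data$rank)
# Cross-tabulate admission outcome by rank
xtabs(~ admit + rank, data = data)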
Output:
     rank
admit  1  2  3  4
    0 28 97 93 55
    1 33 54 28 12
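To see why the conversion to a factor matters, the sketch below compares the design matrix R builds for a numeric rank with the one it builds for a factor rank. The toy data frame is made up for illustration; only model.matrix() and as.factor() from base R are used.

```r
# Toy data: rank stored as numbers, as in the raw CSV (values are made up)
toy <- data.frame(rank = c(1, 2, 3, 4, 2))

# As a number, rank contributes a single column (one slope)
X_numeric <- model.matrix(~ rank, data = toy)

# As a factor, rank expands into indicator columns; level 1 is the baseline
toy$rank <- as.factor(toy$rank)
X_factor <- model.matrix(~ rank, data = toy)

print(colnames(X_numeric))
print(colnames(X_factor))
```

With the factor version, a model estimates a separate coefficient for ranks 2, 3, and 4, each measured relative to rank 1, instead of assuming that every one-step change in rank has the same effect.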
Now divide the data into a training set and a test set. The training set is used to estimate the relationship between the dependent and independent variables, while the test set is used to assess the performance of the model. We assign roughly 60% of the dataset to the training set. The assignment of rows to the training and test sets is done by random sampling with the sample() function: each row is assigned to group 1 with probability 0.6 and to group 2 with probability 0.4. Use set.seed() to generate the same random sample every time and keep the results reproducible.
R
set.seed(1234)
# Assign each row to group 1 (training) or group 2 (test)
data1 <- sample(2, nrow(data),
                replace = TRUE,
                prob = c(0.6, 0.4))
train <- data[data1 == 1, ]
test <- data[data1 == 2, ]
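Because sample() with prob assigns each row independently, the training share is only approximately 60%. If an exact split is preferred, one alternative (not the approach used in this article) is to draw a fixed number of row indices without replacement. The sketch below uses a made-up stand-in data frame, df, since binary.csv is not bundled with R.

```r
set.seed(1234)
# Stand-in data frame with 400 rows (hypothetical; replace with the real data)
df <- data.frame(id = 1:400, x = rnorm(400))

# Draw exactly 60% of the row indices without replacement
n_train <- round(0.6 * nrow(df))
train_idx <- sample(nrow(df), size = n_train)

train_df <- df[train_idx, ]
test_df <- df[-train_idx, ]

print(c(train = nrow(train_df), test = nrow(test_df)))
```

Either approach gives a valid random split; the index-based version simply fixes the set sizes in advance.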
Now build a logistic regression model for our data. The glm() function fits a generalized linear model; with family = 'binomial' it fits a logistic regression. The glm() function we are using here has the following syntax.
Syntax:
glm(formula, family = gaussian, data, weights, subset, na.action, start = NULL, etastart, mustart, offset,
control = list(...), model = TRUE, method = "glm.fit", x = FALSE, y = TRUE, singular.ok = TRUE, contrasts = NULL, ...)
Parameter | Description |
formula | a symbolic description of the model to be fitted. |
family | a description of the error distribution and link function to be used in the model. |
data | an optional data frame. |
weights | an optional vector of ‘prior weights’ to be used in the fitting process. Should be NULL or a numeric vector. |
subset | an optional vector specifying a subset of observations to be used in the fitting process. |
na.action | a function which indicates what should happen when the data contain NAs. |
start | starting values for the parameters in the linear predictor. |
etastart | starting values for the linear predictor. |
mustart | starting values for the vector of means. |
offset | this can be used to specify an a priori known component to be included in the linear predictor during fitting. |
control | a list of parameters for controlling the fitting process. |
model | a logical value indicating whether model frame should be included as a component of the returned value. |
method | the method to be used in fitting the model. |
x,y | logical values indicating whether the response vector and model matrix used in the fitting process should be returned as components of the returned value. |
singular.ok | logical; if FALSE a singular fit is an error. |
contrasts | an optional list. |
... | arguments to be used to form the default control argument if it is not supplied directly. |
R
# Fit a logistic regression (binomial family) on the training set
mymodel <- glm(admit ~ gre + gpa + rank,
               data = train,
               family = 'binomial')
summary(mymodel)
Output:
Call:
glm(formula = admit ~ gre + gpa + rank, family = "binomial",
data = train)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.6576 -0.8724 -0.6184 1.0683 2.1035
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.972329 1.518865 -3.274 0.00106 **
gre 0.001449 0.001405 1.031 0.30270
gpa 1.233117 0.450550 2.737 0.00620 **
rank2 -0.784080 0.406376 -1.929 0.05368 .
rank3 -1.203013 0.426614 -2.820 0.00480 **
rank4 -1.699652 0.536974 -3.165 0.00155 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 312.66 on 248 degrees of freedom
Residual deviance: 283.38 on 243 degrees of freedom
AIC: 295.38
Number of Fisher Scoring iterations: 4
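The estimates in the summary are on the log-odds scale, which is hard to read directly. As a minimal sketch, the snippet below exponentiates the coefficients to obtain odds ratios. The numbers are copied by hand from the summary output above; with the fitted model available, exp(coef(mymodel)) would give the same quantities.

```r
# Coefficient estimates copied from the summary output above
est <- c(gre = 0.001449, gpa = 1.233117,
         rank2 = -0.784080, rank3 = -1.203013, rank4 = -1.699652)

# Exponentiating a log-odds coefficient gives an odds ratio
odds_ratio <- exp(est)
print(round(odds_ratio, 3))
```

Read this way, a one-unit increase in gpa multiplies the estimated odds of admission by about 3.4 with the other variables held fixed, while the odds ratios for rank2, rank3, and rank4 are all below 1, meaning lower estimated odds than the baseline rank 1.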
From the summary of the model, the coefficient for gre is not statistically significant at the 5% level (p = 0.30), so we drop it from the model and refit as follows:
R
# Refit the model without gre
mymodel <- glm(admit ~ gpa + rank,
               data = train,
               family = 'binomial')
Now, let’s try to analyze our regression model by making some predictions.
R
# Predicted probabilities of admission for the training data
p1 <- predict(mymodel, train,
              type = 'response')
head(p1)
Output:
1 7 8 10 12 13
0.3013327 0.3784012 0.2414806 0.5116852 0.4610888 0.7211702
For comparison, the corresponding first rows of the training set, as printed by head(train), are:
   admit gre  gpa rank
1      0 380 3.61    3
7      1 560 2.98    1
8      0 400 3.08    2
10     0 700 3.92    2
12     0 440 3.22    1
13     1 760 4.00    1
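The type = 'response' argument is what puts the predictions on the probability scale. As a sketch of what it does, the snippet below fits a stand-in logistic model on R's built-in mtcars data (not the admissions data) and checks that the response-scale predictions are the inverse logit of the default link-scale predictions.

```r
# Stand-in model on the built-in mtcars data (not the admissions data)
fit <- glm(am ~ wt, data = mtcars, family = "binomial")

# Without type = "response", predictions are on the log-odds (link) scale
log_odds <- predict(fit, mtcars, type = "link")

# type = "response" returns probabilities: the inverse logit of the log-odds
prob <- predict(fit, mtcars, type = "response")

print(all.equal(prob, plogis(log_odds)))
```

Forgetting type = 'response' is a common source of confusion, since log-odds can be negative or greater than 1 and should not be compared against a 0.5 probability cutoff.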
Next, we convert the predicted probabilities into class labels using a cutoff of 0.5 and build a confusion matrix to count the true/false positives and negatives. We form the confusion matrix with the training data.
R
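# Classify as admitted (1) when the predicted probability exceeds 0.5
pre1 <- ifelse(p1 > 0.5, 1, 0)
# Cross-tabulate predictions against the actual outcomes
table <- table(Prediction = pre1,
               Actual = train$admit)
table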
Output:
          Actual
Prediction   0   1
         0 158  55
         1  11  25
The model generates 158 true negatives (0's) and 25 true positives (1's), while there are 55 false negatives (admitted candidates predicted as 0) and 11 false positives (rejected candidates predicted as 1). Now, let's calculate the misclassification error for the training data, which is 1 - accuracy.
R
1 - sum(diag(table)) / sum(table)
Output:
[1] 0.2650602
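The misclassification error is only one summary of a confusion matrix. As a sketch, the snippet below rebuilds the matrix from the counts printed above (entered by hand, rows as predictions and columns as actual outcomes) and derives accuracy, sensitivity, and specificity alongside it. The object name cm is chosen here for illustration.

```r
# Counts copied from the confusion matrix above (rows = prediction, columns = actual)
cm <- matrix(c(158, 11, 55, 25), nrow = 2,
             dimnames = list(Prediction = c("0", "1"), Actual = c("0", "1")))

accuracy <- sum(diag(cm)) / sum(cm)
misclassification <- 1 - accuracy
# Share of actually admitted candidates that the model predicts as admitted
sensitivity <- cm["1", "1"] / sum(cm[, "1"])
# Share of actually rejected candidates that the model predicts as rejected
specificity <- cm["0", "0"] / sum(cm[, "0"])

print(round(c(accuracy = accuracy, misclassification = misclassification,
              sensitivity = sensitivity, specificity = specificity), 4))
```

The sensitivity of 25/80, about 31%, shows that at the 0.5 cutoff the model misses most admitted candidates even though its overall error looks moderate.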
The misclassification error comes out to be about 26.5%. In this way, we can apply regression techniques with categorical variables to various other datasets.
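The error above is measured on the training data, so it is likely to be somewhat optimistic; the held-out test set gives a fairer estimate. The sketch below shows one way to compute both. It runs on simulated stand-in data (all values made up, since binary.csv is not bundled with R), and misclass_rate() is a hypothetical helper written for this illustration, not a base R function.

```r
set.seed(1234)
# Simulated stand-in for the admissions data (all values are made up)
n <- 400
sim <- data.frame(gpa = runif(n, 2, 4),
                  rank = factor(sample(1:4, n, replace = TRUE)))
lin_pred <- -4 + 1.2 * sim$gpa - 0.5 * (as.integer(sim$rank) - 1)
sim$admit <- factor(rbinom(n, 1, plogis(lin_pred)))

# Same style of random split as in the article
grp <- sample(2, n, replace = TRUE, prob = c(0.6, 0.4))
sim_train <- sim[grp == 1, ]
sim_test <- sim[grp == 2, ]

fit <- glm(admit ~ gpa + rank, data = sim_train, family = "binomial")

# Hypothetical helper: misclassification rate of a fitted model on a data set
misclass_rate <- function(model, newdata, cutoff = 0.5) {
  p <- predict(model, newdata, type = "response")
  pred <- ifelse(p > cutoff, 1, 0)
  mean(pred != as.integer(as.character(newdata$admit)))
}

rates <- c(train = misclass_rate(fit, sim_train),
           test = misclass_rate(fit, sim_test))
print(rates)
```

With the real data, the same helper could be called as misclass_rate(mymodel, test) to obtain the test-set error for the model fitted in this article.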
Regression analysis is a versatile method, and there are numerous types of regression models to choose from. The choice usually depends on the kind of data you have for the dependent variable and on which model provides the best fit; for example, logistic regression is well suited to a categorical dependent variable.