Regression with Categorical Variables in R Programming

Regression is a multi-step process for estimating the relationships between a dependent variable and one or more independent variables, also known as predictors or covariates. Regression analysis is mainly used for two conceptually distinct purposes: first, for prediction and forecasting, where its use overlaps substantially with the field of machine learning; and second, to infer relationships between the independent and dependent variables.

Regression with Categorical Variables

Categorical variables are variables that can take on one of a limited, fixed number of possible values, assigning each individual or other unit of observation to a particular group or nominal category on the basis of some qualitative property. They are also known as factor or qualitative variables. The type of regression analysis that fits best with categorical variables is logistic regression. Logistic regression uses Maximum Likelihood Estimation to estimate its parameters, and it models the relationship between a set of independent variables and a categorical dependent variable. Implementing a regression model is straightforward in the R language because of its excellent built-in libraries. Now, let's set up a logistic regression model with categorical variables for better understanding.

Example: The objective is to predict whether a candidate will get admitted to a university using variables such as gre, gpa, and rank. The R script is provided step by step and is commented for better understanding. The data is in .csv format. We will get the working directory with the getwd() function and place our dataset binary.csv inside it to proceed further.

R

# preparing the dataset
getwd()
data <- read.csv("binary.csv")
str(data)


Output:



'data.frame':    400 obs. of  4 variables:
 $ admit: int  0 1 1 1 0 1 1 0 1 0 ...
 $ gre  : int  380 660 800 640 520 760 560 400 540 700 ...
 $ gpa  : num  3.61 3.67 4 3.19 2.93 3 2.98 3.08 3.39 3.92 ...
 $ rank : int  3 3 1 4 4 2 1 2 3 2 ...

Looking at the structure of the dataset, we can observe that it has 4 variables: admit tells whether a candidate was admitted (1 if admitted, 0 if not), while gre, gpa, and rank give the candidate's GRE score, GPA at the previous college, and the previous college's rank, respectively. We use admit as the dependent variable and gre, gpa, and rank as the independent variables. Note that admit and rank are categorical variables but are stored as numeric types. To use them as categorical variables in our model, we convert them into factor variables with the as.factor() function.

R

# converting admit and rank
# columns into factor variables
data$admit = as.factor(data$admit)
data$rank = as.factor(data$rank)
  
# two-way table of factor
# variable
xtabs(~admit + rank, data = data)


Output:

     rank
admit  1  2  3  4
    0 28 97 93 55
    1 33 54 28 12

Now divide the data into a training set and a test set. The training set is used to find the relationship between the dependent and independent variables, while the test set assesses the performance of the model. We use 60% of the dataset as the training set. The assignment of observations to the training and test sets is done by random sampling, performed in R with the sample() function. Use set.seed() to generate the same random sample every time and maintain consistency.

R

# Partitioning of data
set.seed(1234)
data1 <- sample(2, nrow(data),
                replace = TRUE,
                prob = c(0.6, 0.4))
train <- data[data1 == 1, ]
test  <- data[data1 == 2, ]


Now build a logistic regression model for our data. The glm() function fits a generalized linear model; with family = 'binomial' it performs logistic regression. The glm() function has the following syntax.

Syntax:

glm(formula, family = gaussian, data, weights, subset, na.action, start = NULL,
    etastart, mustart, offset, control = list(...), model = TRUE, method = "glm.fit",
    x = FALSE, y = TRUE, singular.ok = TRUE, contrasts = NULL, ...)

Parameter    Description
-----------  -----------------------------------------------------------------
formula      a symbolic description of the model to be fitted.
family       a description of the error distribution and link function to be used in the model.
data         an optional data frame.
weights      an optional vector of 'prior weights' to be used in the fitting process; should be NULL or a numeric vector.
subset       an optional vector specifying a subset of observations to be used in the fitting process.
na.action    a function which indicates what should happen when the data contain NAs.
start        starting values for the parameters in the linear predictor.
etastart     starting values for the linear predictor.
mustart      starting values for the vector of means.
offset       can be used to specify an a priori known component to be included in the linear predictor during fitting.
control      a list of parameters for controlling the fitting process.
model        a logical value indicating whether the model frame should be included as a component of the returned value.
method       the method to be used in fitting the model.
x, y         logical values indicating whether the response vector and model matrix used in the fitting process should be returned as components of the returned value.
singular.ok  logical; if FALSE, a singular fit is an error.
contrasts    an optional list.
...          arguments to be used to form the default control argument if it is not supplied directly.

R

mymodel <- glm(admit ~ gre + gpa + rank,
               data = train,
               family = 'binomial')
summary(mymodel)


Output:

Call:
glm(formula = admit ~ gre + gpa + rank, family = "binomial", 
    data = train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.6576  -0.8724  -0.6184   1.0683   2.1035  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)   
(Intercept) -4.972329   1.518865  -3.274  0.00106 **
gre          0.001449   0.001405   1.031  0.30270   
gpa          1.233117   0.450550   2.737  0.00620 **
rank2       -0.784080   0.406376  -1.929  0.05368 . 
rank3       -1.203013   0.426614  -2.820  0.00480 **
rank4       -1.699652   0.536974  -3.165  0.00155 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 312.66  on 248  degrees of freedom
Residual deviance: 283.38  on 243  degrees of freedom
AIC: 295.38

Number of Fisher Scoring iterations: 4
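
The coefficients in the summary above are on the log-odds scale, which makes them hard to read directly. A quick, hedged sketch of converting them to odds ratios (reusing the mymodel object fitted above):

```r
# Exponentiating logistic regression coefficients turns
# log-odds into odds ratios, which are easier to interpret.
exp(coef(mymodel))

# For example, rank2 has a coefficient of about -0.78, so
# exp(-0.78) is roughly 0.46: holding gre and gpa fixed, coming
# from a rank-2 college roughly halves the odds of admission
# relative to a rank-1 college.
```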

From the summary of the model, it is evident that gre does not play a significant role in the predictions, so we can remove it from our model and refit it as follows:

R

mymodel <- glm(admit ~ gpa + rank,
               data = train,
               family = 'binomial')


Now, let's analyze our regression model by making some predictions.

R

# Prediction
p1<-predict(mymodel, train, 
            type = 'response')
head(p1)


Output:



        1         7         8        10        12        13 
0.3013327 0.3784012 0.2414806 0.5116852 0.4610888 0.7211702 

R

head(train)


Output:

   admit gre  gpa rank
1      0 380 3.61    3
7      1 560 2.98    1
8      0 400 3.08    2
10     0 700 3.92    2
12     0 440 3.22    1
13     1 760 4.00    1

Next, we summarize the results by creating a confusion matrix, which compares the numbers of true/false positives and negatives. We form the confusion matrix using the training data.

R

# confusion matrix and misclassification
# error for the training data
pre1 <- ifelse(p1 > 0.5, 1, 0)
table <- table(Prediction = pre1,
               Actual = train$admit)
table


Output:

          Actual
Prediction   0   1
         0 158  55
         1  11  25

The model generates 158 true negatives (0's) and 25 true positives (1's), while there are 55 false negatives and 11 false positives. Now, let's calculate the misclassification error for the training data, which is 1 − accuracy, where accuracy is the fraction of correct predictions (the diagonal of the table divided by its total).

R

1 - sum(diag(table)) / sum(table)


Output:

[1] 0.2650602

The misclassification error comes out to be about 26.5%. In this way, we can apply regression techniques with categorical variables to various other data.
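The test set created earlier has not yet been used; a minimal sketch of evaluating the model on it, reusing mymodel, test, and the 0.5 cutoff from above:

```r
# predicted admission probabilities on the held-out test set
p2 <- predict(mymodel, test, type = 'response')
pre2 <- ifelse(p2 > 0.5, 1, 0)

# confusion matrix and misclassification error for the test data
tab2 <- table(Prediction = pre2, Actual = test$admit)
tab2
1 - sum(diag(tab2)) / sum(tab2)
```

A test-set error noticeably higher than the training error would suggest the model is overfitting.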

Regression analysis is a very efficient method, and there are numerous types of regression models one can use. The choice often depends on the kind of data you have for the dependent variable and on which model provides the best fit; for example, logistic regression is best suited for a categorical dependent variable.



