
Regression with Categorical Variables in R Programming


Regression is a multi-step process for estimating the relationships between a dependent variable and one or more independent variables, also known as predictors or covariates. Regression analysis is mainly used for two conceptually distinct purposes: first, for prediction and forecasting, where its use overlaps substantially with the field of machine learning; and second, to infer relationships between the independent and dependent variables.

Regression with Categorical Variables

Categorical variables are variables that can take on one of a limited, fixed number of possible values, assigning each observation to a particular group or nominal category on the basis of some qualitative property. They are also known as factor or qualitative variables. The type of regression analysis that fits best with categorical variables is logistic regression. Logistic regression uses Maximum Likelihood Estimation to estimate its parameters and models the relationship between a set of independent variables and a categorical dependent variable. Implementing a regression model is straightforward in R because of its excellent built-in functions and libraries. Now, let's set up a logistic regression model with categorical variables for a better understanding.
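As a minimal sketch of the idea (using R's built-in mtcars data rather than the admission data introduced below), glm() with family = binomial fits a logistic regression by maximum likelihood, and plogis() is the logistic (inverse-logit) function that maps the linear predictor to a probability:

```r
# The logistic (inverse-logit) function maps any real value into (0, 1)
plogis(0)              # 0.5
plogis(c(-4, 0, 4))    # values near 0, exactly 0.5, values near 1

# A minimal logistic regression fit by maximum likelihood:
# model the binary engine-type variable vs as a function of mpg
fit <- glm(vs ~ mpg, data = mtcars, family = binomial)
coef(fit)              # intercept and slope, on the log-odds scale
head(fitted(fit))      # fitted probabilities, all strictly between 0 and 1
```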

Example: The objective is to predict whether a candidate will get admitted to a university using the variables gre, gpa, and rank. The R script is provided step by step and is commented for better understanding. The data is in .csv format. We get the working directory with the getwd() function and place our dataset binary.csv inside it to proceed further.


# preparing the dataset
data <- read.csv("binary.csv")

# inspecting the structure of the dataset
str(data)


'data.frame':    400 obs. of  4 variables:
 $ admit: int  0 1 1 1 0 1 1 0 1 0 ...
 $ gre  : int  380 660 800 640 520 760 560 400 540 700 ...
 $ gpa  : num  3.61 3.67 4 3.19 2.93 3 2.98 3.08 3.39 3.92 ...
 $ rank : int  3 3 1 4 4 2 1 2 3 2 ...

Looking at the structure of the dataset, we can observe that it has 4 variables: admit tells whether a candidate was admitted (1 if admitted, 0 if not), while gre, gpa, and rank give the candidate's GRE score, GPA at the previous college, and the previous college's rank, respectively. We use admit as the dependent variable and gre, gpa, and rank as the independent variables. Note that admit and rank are categorical variables but are stored as numeric types. To use them as categorical variables in our model, we convert them into factor variables with the as.factor() function.


# converting admit and rank
# columns into factor variables
data$admit <- as.factor(data$admit)
data$rank <- as.factor(data$rank)

# two-way table of the factor
# variables
xtabs(~admit + rank, data = data)


     rank
admit  1  2  3  4
    0 28 97 93 55
    1 33 54 28 12
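Internally, glm() expands a factor such as rank into dummy (indicator) columns, one per level except the reference level; this is why the model summary later reports rank2, rank3, and rank4 rather than a single rank coefficient. A small self-contained sketch with a toy data frame (not the binary.csv data) makes the coding visible:

```r
# toy data frame with rank as a four-level factor
d <- data.frame(rank = factor(c(1, 2, 3, 4, 2)))

# model.matrix() shows the dummy coding glm() will use:
# level 1 is the reference; rank2, rank3, rank4 are 0/1 indicators
model.matrix(~ rank, data = d)
```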

Now divide the data into a training set and a test set. The training set is used to find the relationship between the dependent and independent variables, while the test set assesses the performance of the model. We use 60% of the dataset as the training set. The assignment of observations to the training and test sets is done by random sampling, performed in R with the sample() function. Use set.seed() to generate the same random sample every time and maintain consistency.


# Partitioning of data
# (call set.seed() first to make the split reproducible)
data1 <- sample(2, nrow(data),
                replace = TRUE,
                prob = c(0.6, 0.4))
train <- data[data1 == 1, ]
test <- data[data1 == 2, ]

Now build a logistic regression model for our data. The glm() function fits generalized linear models, of which logistic regression is one case. The glm() function we are using here has the following syntax.


glm(formula, family = gaussian, data, weights, subset, na.action, start = NULL,
    etastart, mustart, offset, control = list(...), model = TRUE, method = "glm.fit",
    x = FALSE, y = TRUE, singular.ok = TRUE, contrasts = NULL, ...)

formula: a symbolic description of the model to be fitted.
family: a description of the error distribution and link function to be used in the model.
data: an optional data frame.
weights: an optional vector of 'prior weights' to be used in the fitting process. Should be NULL or a numeric vector.
subset: an optional vector specifying a subset of observations to be used in the fitting process.
na.action: a function which indicates what should happen when the data contain NAs.
start: starting values for the parameters in the linear predictor.
etastart: starting values for the linear predictor.
mustart: starting values for the vector of means.
offset: this can be used to specify an a priori known component to be included in the linear predictor during fitting.
control: a list of parameters for controlling the fitting process.
model: a logical value indicating whether the model frame should be included as a component of the returned value.
method: the method to be used in fitting the model.
x, y: logical values indicating whether the response vector and model matrix used in the fitting process should be returned as components of the returned value.
singular.ok: logical; if FALSE a singular fit is an error.
contrasts: an optional list.
...: arguments to be used to form the default control argument if it is not supplied directly.


# fitting the logistic regression model
mymodel <- glm(admit ~ gre + gpa + rank,
               data = train,
               family = 'binomial')
summary(mymodel)


Call:
glm(formula = admit ~ gre + gpa + rank, family = "binomial", 
    data = train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.6576  -0.8724  -0.6184   1.0683   2.1035  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)   
(Intercept) -4.972329   1.518865  -3.274  0.00106 **
gre          0.001449   0.001405   1.031  0.30270   
gpa          1.233117   0.450550   2.737  0.00620 **
rank2       -0.784080   0.406376  -1.929  0.05368 . 
rank3       -1.203013   0.426614  -2.820  0.00480 **
rank4       -1.699652   0.536974  -3.165  0.00155 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 312.66  on 248  degrees of freedom
Residual deviance: 283.38  on 243  degrees of freedom
AIC: 295.38

Number of Fisher Scoring iterations: 4
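The estimates above are on the log-odds scale; exponentiating them gives odds ratios, which are often easier to interpret. The coefficient values below are simply copied from the summary output above:

```r
# coefficient estimates from the summary above (log-odds scale)
coefs <- c(gpa = 1.233117, rank2 = -0.784080,
           rank3 = -1.203013, rank4 = -1.699652)

# odds ratios: a one-point increase in gpa multiplies the odds of
# admission by about 3.4; coming from a rank-4 college multiplies
# them by about 0.18, relative to the rank-1 reference level
round(exp(coefs), 3)
```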

From the summary of the model, it is evident that gre plays no significant role in the predictions (p-value 0.303), so we can remove it from the model and rewrite it as follows:


mymodel <- glm(admit ~ gpa + rank,
               data = train,
               family = 'binomial')

 Now, let’s try to analyze our regression model by making some predictions.


# Prediction
p1 <- predict(mymodel, train,
              type = 'response')

# predicted probabilities and the
# corresponding rows of the training data
head(p1)
head(train)


        1         7         8        10        12        13 
0.3013327 0.3784012 0.2414806 0.5116852 0.4610888 0.7211702 




   admit gre  gpa rank
1      0 380 3.61    3
7      1 560 2.98    1
8      0 400 3.08    2
10     0 700 3.92    2
12     0 440 3.22    1
13     1 760 4.00    1
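The same predict() call also works for unseen candidates. The sketch below is self-contained, fitting on simulated stand-in data since binary.csv may not be available here; the new candidate is a hypothetical illustration, not part of the original data:

```r
# simulated stand-in for the admission data: admit, gpa
# and rank are generated, not the real binary.csv values
set.seed(42)
sim <- data.frame(
  admit = factor(rbinom(200, 1, 0.35)),
  gpa   = round(runif(200, 2.5, 4.0), 2),
  rank  = factor(sample(1:4, 200, replace = TRUE))
)
m <- glm(admit ~ gpa + rank, data = sim, family = binomial)

# predicted admission probability for one hypothetical new candidate
newcand <- data.frame(gpa = 3.8, rank = factor(2, levels = 1:4))
predict(m, newcand, type = "response")
```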

Then we validate our results by creating a confusion matrix to compare the numbers of true/false positives and negatives. We will form a confusion matrix with the training data.


# confusion matrix for the training data
# (used below for the misclassification error)
pre1 <- ifelse(p1 > 0.5, 1, 0)
table <- table(Prediction = pre1,
               Actual = train$admit)
table


          Actual
Prediction   0   1
         0 158  55
         1  11  25

The model generates 158 true negatives (0's) and 25 true positives (1's), while there are 55 false negatives and 11 false positives. Now, let's calculate the misclassification error for the training data, which is 1 minus the classification accuracy:


1 - sum(diag(table)) / sum(table)


[1] 0.2650602

The misclassification error comes out to be about 26.5%. In this way, we can apply regression techniques with categorical variables to various other datasets.
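The same check should also be run on the held-out test set, which the model has not seen during fitting. A self-contained sketch on simulated stand-in data (since binary.csv may not be available here) mirrors the steps above:

```r
# simulated stand-in for the admission data
set.seed(123)
sim <- data.frame(
  admit = factor(rbinom(400, 1, 0.32)),
  gpa   = round(runif(400, 2.5, 4.0), 2),
  rank  = factor(sample(1:4, 400, replace = TRUE))
)

# 60/40 split, as above
idx   <- sample(2, nrow(sim), replace = TRUE, prob = c(0.6, 0.4))
train <- sim[idx == 1, ]
test  <- sim[idx == 2, ]

m    <- glm(admit ~ gpa + rank, data = train, family = binomial)
p2   <- predict(m, test, type = "response")
pre2 <- ifelse(p2 > 0.5, 1, 0)
tab2 <- table(Prediction = pre2, Actual = test$admit)

# test-set misclassification error
1 - sum(diag(tab2)) / sum(tab2)
```

A test-set error noticeably higher than the training error would suggest overfitting; here the model is small, so the two are usually close.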

Regression analysis is a very efficient method, and there are numerous types of regression models that one can use. The choice often depends on the kind of data you have for the dependent variable and the type of model that provides the best fit; logistic regression, for instance, is best suited for categorical dependent variables.

Last Updated : 12 Oct, 2020