# Regression with Categorical Variables in R Programming

• Last Updated : 12 Oct, 2020

Regression is a multi-step process for estimating the relationship between a dependent variable and one or more independent variables, also known as predictors or covariates. Regression analysis is mainly used for two conceptually distinct purposes: first, for prediction and forecasting, where its use overlaps substantially with the field of machine learning; and second, to infer relationships between the independent and dependent variables.

#### Regression with Categorical Variables

Categorical variables are variables that can take on one of a limited, fixed number of possible values, assigning each individual or other unit of observation to a particular group or nominal category on the basis of some qualitative property. They are also known as factor or qualitative variables. The type of regression analysis that fits best with a categorical dependent variable is logistic regression. Logistic regression uses maximum likelihood estimation to estimate its parameters, and it models the relationship between a set of independent variables and a categorical dependent variable. Implementing a regression model is straightforward in the R language because of its excellent built-in libraries. Now, let's set up a logistic regression model with categorical variables for better understanding.
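Before fitting a full model, it helps to see how R actually encodes a categorical variable in a regression: each factor is expanded into 0/1 dummy columns, with the first level absorbed into the intercept. A minimal sketch, using a small hypothetical factor (not part of the article's dataset):

```r
# How R dummy-codes a factor in a regression design matrix.
# 'rank' here is a small made-up factor with levels 1-4.
rank <- factor(c(1, 2, 3, 4, 2))
mm <- model.matrix(~ rank)
print(mm)
# The first level (rank1) is the reference and is absorbed into
# the intercept; rank2, rank3, rank4 become 0/1 indicator columns.
```

This is why, later in the article, the model summary reports separate coefficients rank2, rank3, and rank4 rather than a single "rank" coefficient.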

Example: The objective is to predict whether a candidate will get admitted to a university using variables such as gre, gpa, and rank. The R script is provided step by step and is commented for better understanding. The data is in .csv format: find the working directory with the getwd() function and place the dataset binary.csv inside it to proceed further.

## R

```r
# preparing the dataset
getwd()
data <- read.csv("binary.csv")
str(data)
```

Output:

```'data.frame':    400 obs. of  4 variables:
 $ admit: int  0 1 1 1 0 1 1 0 1 0 ...
 $ gre  : int  380 660 800 640 520 760 560 400 540 700 ...
 $ gpa  : num  3.61 3.67 4 3.19 2.93 3 2.98 3.08 3.39 3.92 ...
 $ rank : int  3 3 1 4 4 2 1 2 3 2 ...
```

Looking at the structure of the dataset, we can observe that it has 4 variables: admit tells whether a candidate was admitted (1 if admitted, 0 if not), while gre, gpa, and rank give the candidate's GRE score, his/her GPA at the previous college, and the previous college's rank, respectively. We use admit as the dependent variable and gre, gpa, and rank as the independent variables. Note that admit and rank are categorical variables but are stored as numeric types; to use them as categorical variables in our model, we convert them into factor variables with the as.factor() function.

## R

```r
# converting admit and rank
# columns into factor variables
data$admit <- as.factor(data$admit)
data$rank  <- as.factor(data$rank)

# two-way table of the factor variables
xtabs(~ admit + rank, data = data)
```

Output:

```     rank
admit  1  2  3  4
    0 28 97 93 55
    1 33 54 28 12
```

Now divide the data into a training set and a test set. The training set is used to fit the relationship between the dependent and independent variables, while the test set assesses the performance of the model. We use 60% of the dataset as the training set. The assignment of observations to the training and test sets is done by random sampling, performed in R with the sample() function. Use set.seed() to generate the same random sample every time and maintain consistency.

## R

```r
# Partitioning of data
set.seed(1234)
data1 <- sample(2, nrow(data),
                replace = TRUE,
                prob = c(0.6, 0.4))
train <- data[data1 == 1, ]
test  <- data[data1 == 2, ]
```

Now build a logistic regression model for our data. The glm() function fits a generalized linear model; with family = 'binomial' it fits a logistic regression. The glm() function has the following syntax.

Syntax:

```r
glm(formula, family = gaussian, data, weights, subset, na.action,
    start = NULL, etastart, mustart, offset, control = list(...),
    model = TRUE, method = "glm.fit", x = FALSE, y = TRUE,
    singular.ok = TRUE, contrasts = NULL, ...)
```

## R

```r
mymodel <- glm(admit ~ gre + gpa + rank,
               data = train,
               family = "binomial")
summary(mymodel)
```

Output:

```Call:
glm(formula = admit ~ gre + gpa + rank, family = "binomial",
data = train)

Deviance Residuals:
Min       1Q   Median       3Q      Max
-1.6576  -0.8724  -0.6184   1.0683   2.1035

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.972329   1.518865  -3.274  0.00106 **
gre          0.001449   0.001405   1.031  0.30270
gpa          1.233117   0.450550   2.737  0.00620 **
rank2       -0.784080   0.406376  -1.929  0.05368 .
rank3       -1.203013   0.426614  -2.820  0.00480 **
rank4       -1.699652   0.536974  -3.165  0.00155 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 312.66  on 248  degrees of freedom
Residual deviance: 283.38  on 243  degrees of freedom
AIC: 295.38

Number of Fisher Scoring iterations: 4
```
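The estimates in the summary are on the log-odds scale; a common way to interpret them is to exponentiate them into odds ratios. A quick sketch using the estimates printed above:

```r
# Coefficients of a logistic model are log-odds; exponentiating
# them gives odds ratios. Using the estimates printed above:
exp(1.233117)   # gpa:   ~3.43, each extra GPA point multiplies the odds of admission by ~3.4
exp(-1.699652)  # rank4: ~0.18, rank-4 colleges have ~0.18 times the odds of the rank-1 baseline
```

Since rank1 is the reference level, each rankN coefficient compares that rank against rank-1 colleges.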

From the summary of the model, it is evident that gre plays no significant role in the predictions (its p-value is well above 0.05), so we can remove it from our model and refit as follows:

## R

```r
mymodel <- glm(admit ~ gpa + rank,
               data = train,
               family = "binomial")
```

Now, let’s try to analyze our regression model by making some predictions.

## R

```r
# Prediction
p1 <- predict(mymodel, train,
              type = "response")
head(p1)
```

Output:

```        1         7         8        10        12        13
0.3013327 0.3784012 0.2414806 0.5116852 0.4610888 0.7211702
```

## R

```r
head(train)
```

Output:

```   admit gre  gpa rank
1      0 380 3.61    3
7      1 560 2.98    1
8      0 400 3.08    2
10     0 700 3.92    2
12     0 440 3.22    1
13     1 760 4.00    1
```

Then, we evaluate our results by creating a confusion matrix, which compares the numbers of true/false positives and negatives. We form the confusion matrix with the training data, classifying a predicted probability above 0.5 as admitted.

## R

```r
# confusion matrix and
# misclassification error - training data
pre1 <- ifelse(p1 > 0.5, 1, 0)
table <- table(Prediction = pre1,
               Actual = train$admit)
table
```

Output:

```          Actual
Prediction   0   1
         0 158  55
         1  11  25
```

The model generates 158 true negatives (0's) and 25 true positives (1's), while there are 55 false negatives and 11 false positives. Now, let's calculate the misclassification error for the training data, which is 1 minus the classification accuracy (the proportion of correct predictions on the diagonal of the table).

## R

```r
1 - sum(diag(table)) / sum(table)
```

Output:

``` 0.2650602
```

The misclassification error comes out to be about 26.5%. In this way, we can apply regression techniques with categorical variables to various other datasets.
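The natural next step is to measure the misclassification error on the held-out test set rather than the training data, since training error tends to be optimistic. A hedged sketch of that step; to keep it self-contained it uses a small simulated dataset with assumed effect sizes in place of binary.csv, but with the real data you would simply reuse the mymodel and test objects created above:

```r
# Simulate a small admissions-like dataset (assumed, illustrative effects)
set.seed(1234)
n     <- 200
gpa   <- runif(n, 2, 4)
rank  <- factor(sample(1:4, n, replace = TRUE))
lp    <- -4 + 1.5 * gpa - 0.5 * as.numeric(rank)   # hypothetical true log-odds
admit <- factor(rbinom(n, 1, plogis(lp)))
d     <- data.frame(admit, gpa, rank)

# 60/40 train/test split, as in the article
idx   <- sample(2, n, replace = TRUE, prob = c(0.6, 0.4))
train <- d[idx == 1, ]
test  <- d[idx == 2, ]

fit <- glm(admit ~ gpa + rank, data = train, family = "binomial")

# Predict on the *test* set and compute its misclassification error
p2   <- predict(fit, test, type = "response")
pre2 <- factor(ifelse(p2 > 0.5, 1, 0), levels = c(0, 1))
tab2 <- table(Prediction = pre2, Actual = test$admit)
1 - sum(diag(tab2)) / sum(tab2)
```

Wrapping pre2 in factor(..., levels = c(0, 1)) keeps the table 2x2 even if one predicted class happens to be absent, so the diag() arithmetic stays valid.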

Regression analysis is a very efficient method, and there are numerous types of regression models one can use. The choice often depends on the kind of data you have for the dependent variable and the type of model that provides the best fit; logistic regression, for example, is best suited for a categorical dependent variable.
