Categorical variables (also known as factors or qualitative variables) classify observations into groups. They can be stored as strings or numbers, and in statistical modeling they are called factor variables. Storing string variables as factors can save memory, because each distinct value is kept once as a level and the observations are stored as integer codes. Factors can also carry labels for their levels. A categorical variable has a limited number of distinct values, called levels. For example, the gender of individuals is a categorical variable that can take two levels: **Male or Female**. Regression requires numeric variables, so when a researcher wants to include a categorical variable in a regression model, extra steps are needed to make the results interpretable. Let’s see all this with code examples in the R language.

### Implementation in R

**Storing strings or numbers as factors**

First of all, let’s create a sample data set.

```r
# creating sample
samp <- sample(0:1, 20, replace = TRUE)
samp
```

**Output:**

[1] 1 1 0 1 1 1 0 0 1 0 0 1 1 1 1 1 1 1 0 1
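Because `sample()` draws at random, the values you get will differ from the output shown here. A minimal sketch, assuming you want reproducible draws, is to fix the seed with `set.seed()` first. The seed value 42 and the names `samp_a` and `samp_b` are arbitrary choices for illustration.

```r
# Fix the random seed so the same sample is drawn every time.
# The seed value (42) is an arbitrary choice for illustration.
set.seed(42)
samp_a <- sample(0:1, 20, replace = TRUE)

# Resetting the same seed reproduces the same draw
set.seed(42)
samp_b <- sample(0:1, 20, replace = TRUE)

# TRUE: both draws start from the same seed
identical(samp_a, samp_b)
```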

Convert the numeric sample to a factor.

```r
# converting the sample created above to a factor
# (samp is reused rather than re-drawn, so the values stay the same)
samp1 <- factor(samp)

# is.factor() checks whether an object is a factor
is.factor(samp1)
```

**Output:**

[1] TRUE

Now do the same for strings.

```r
# creating string sample
str1 <- c("small", "big", "medium", "small", "small",
          "big", "medium", "medium", "big")

# print the character vector
str1
f.str1 <- factor(str1)

# check if f.str1 is a factor or not
is.factor(f.str1)

# check if str1 is a factor or not
is.factor(str1)
```

**Output:**

```
[1] "small"  "big"    "medium" "small"  "small"  "big"    "medium" "medium" "big"
[1] TRUE
[1] FALSE
```
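To see what a factor actually stores, it can help to look at its levels and its integer codes. The sketch below rebuilds the same `str1` vector so that it runs on its own. It relies only on base R functions (`levels()`, `as.integer()`, `table()`).

```r
# Same size labels as in the example above
str1 <- c("small", "big", "medium", "small", "small",
          "big", "medium", "medium", "big")
f.str1 <- factor(str1)

# Levels are the distinct values, sorted alphabetically by default
levels(f.str1)

# Internally, each element is an integer index into the levels
as.integer(f.str1)

# Count the observations in each level
table(f.str1)
```

Here the levels come out as `big`, `medium`, `small`, which is alphabetical order rather than size order. The section on ordered factors shows how to set a meaningful order.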

**Factors with labels**

```r
# creating sample with labels
lab <- factor(samp1, labels = c("sweet", "bitter"))
lab
```

**Output:**

```
 [1] bitter bitter sweet  bitter bitter bitter sweet  sweet  bitter sweet
[11] sweet  bitter bitter bitter bitter bitter bitter bitter sweet  bitter
Levels: sweet bitter
```
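The labels are matched to the levels by position, so `"sweet"` goes to the first level (0) and `"bitter"` to the second (1). A slightly more explicit sketch passes `levels` and `labels` together, so the mapping does not depend on the default level order. The vector below is copied from the first output above, and the name `lab2` is introduced only for this illustration.

```r
# The 20 values shown in the first output above
samp <- c(1, 1, 0, 1, 1, 1, 0, 0, 1, 0,
          0, 1, 1, 1, 1, 1, 1, 1, 0, 1)

# Spell out which value gets which label: 0 -> "sweet", 1 -> "bitter"
lab2 <- factor(samp, levels = c(0, 1), labels = c("sweet", "bitter"))

# Count how often each label occurs
table(lab2)
```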

**Ordered factors**

```r
str1 <- c("small", "big", "medium", "small", "small",
          "big", "medium", "medium", "big")

# ordering the factor with respect to its levels
order <- ordered(str1,
                 levels = c("small", "medium", "big"))
order
f.order <- factor(order)
```

**Output:**

```
[1] small  big    medium small  small  big    medium medium big
Levels: small < medium < big
```

Another way to make a factor ordered is:

```r
# another way to create an ordered factor
f.order <- factor(str1,
                  levels = c("small", "medium", "big"),
                  ordered = TRUE)
```
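One practical benefit of an ordered factor is that comparison operators respect the level order. The following sketch rebuilds `f.order` so it runs on its own and uses only base R.

```r
str1 <- c("small", "big", "medium", "small", "small",
          "big", "medium", "medium", "big")
f.order <- factor(str1,
                  levels = c("small", "medium", "big"),
                  ordered = TRUE)

# Confirm the factor is ordered
is.ordered(f.order)

# Comparisons follow the level order small < medium < big
f.order >= "medium"

# Number of observations that are "medium" or larger
sum(f.order >= "medium")

# Smallest and largest categories present in the data
min(f.order)
max(f.order)
```

These comparisons are not meaningful for an unordered factor, where R returns `NA` with a warning instead.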

**Finding the mean**

```r
# mean() on a factor returns NA (with a warning)
mean(samp1)

# convert the factor back to its original numbers first
mean(as.numeric(levels(samp1)[samp1]))
```

**Output:**

```
[1] NA
[1] 0.7
```
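The indirect conversion through `levels()` is needed because calling `as.numeric()` directly on a factor returns the internal integer codes, not the original numbers. The sketch below illustrates the difference, using the 20 values from the first output above. The names `codes` and `values` are introduced only for this example.

```r
# The 20 values shown in the first output above
samp <- c(1, 1, 0, 1, 1, 1, 0, 0, 1, 0,
          0, 1, 1, 1, 1, 1, 1, 1, 0, 1)
samp1 <- factor(samp)

# as.numeric() on a factor gives the internal codes (1 and 2), not 0 and 1
codes <- as.numeric(samp1)
mean(codes)   # 1.7, which is not the mean of the data

# Going through the character labels recovers the original values
values <- as.numeric(as.character(samp1))
mean(values)  # 0.7
```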

**Removing levels**

```r
# keep only the observations that are not "small"
f.new <- f.order[f.order != "small"]
f.new
```

**Output:**

```
[1] big    medium big    medium medium big
Levels: small < medium < big
```
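Note that subsetting removes the `small` observations but the `small` level is still listed. A minimal sketch of one way to drop the unused level, assuming base R's `droplevels()` is acceptable for your use case (the name `f.dropped` is hypothetical):

```r
str1 <- c("small", "big", "medium", "small", "small",
          "big", "medium", "medium", "big")
f.order <- factor(str1,
                  levels = c("small", "medium", "big"),
                  ordered = TRUE)

# Subsetting keeps every level, even ones with no observations left
f.new <- f.order[f.order != "small"]
levels(f.new)

# droplevels() removes levels that no longer occur in the data
f.dropped <- droplevels(f.new)
levels(f.dropped)
```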

### Implementation with Regression

Consider an experiment that records the number of hours students stay at school during fests.

```r
# consider a data frame student
student <- data.frame(
  # id of students
  id = c(1:5),
  # name of students
  name = c("Payal", "Dan", "Misty", "Ryan", "Gargi"),
  # gender of students
  gender = c("F", "M", "F", "M", "F"),
  # gender represented in numbers: F = 1, M = 0
  gender_num = c(1, 0, 1, 0, 1),
  # the hours students stay at fests
  hours = c(2.5, 4, 5.3, 3, 2.2)
)
student
```

**Output:**

```
  id  name gender gender_num hours
1  1 Payal      F          1   2.5
2  2   Dan      M          0   4.0
3  3 Misty      F          1   5.3
4  4  Ryan      M          0   3.0
5  5 Gargi      F          1   2.2
```

The regression equation is

$$y = b_0 + b_1 x$$

where

- $y$ is the output variable, predicted on the basis of a predictor variable $x$,
- $b_0$ and $b_1$ are the beta coefficients, representing the intercept and the slope, respectively.

For the gender variable, the predicted value is:

- $b_0 + b_1$ if a student is male,
- $b_0$ if a student is female.

The coefficients can be interpreted as follows:

- $b_0$ is the average hours female students stayed at fests,
- $b_0 + b_1$ is the average hours male students stayed at fests, and
- $b_1$ is the average difference in hours between male and female students.

R creates dummy variables automatically with the following code:

```r
# making the regression model
model <- lm(hours ~ gender, data = student)
summary(model)$coef
```

**Output:**

```
             Estimate Std. Error   t value   Pr(>|t|)
(Intercept) 3.3333333  0.8397531 3.9694209 0.02857616
genderM     0.1666667  1.3277662 0.1255241 0.90804814
```
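The row name `genderM` comes from the dummy variable that R builds for the `M` level. One way to inspect that coding is `model.matrix()`, as in the sketch below. It recreates only the two columns of `student` that the model uses, and the name `mm` is introduced just for this example.

```r
# Only the columns needed for the model
student <- data.frame(
  gender = c("F", "M", "F", "M", "F"),
  hours  = c(2.5, 4, 5.3, 3, 2.2)
)

# Design matrix that lm() builds for hours ~ gender:
# a column of 1s for the intercept and a 0/1 dummy for gender "M"
mm <- model.matrix(hours ~ gender, data = student)
mm
```

The level `F` comes first alphabetically, so it becomes the reference level and is absorbed into the intercept.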

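The intercept, 3.33, is the estimated average hours for F students. The `genderM` coefficient, 0.17, is the estimated difference between M and F students, so the estimated average for M students is 3.33 + 0.17 = 3.5 hours. The p-value for `genderM` is about 0.91, far above the usual significance thresholds, i.e., there’s no real evidence that M students stay more hours than F students.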
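As a check on this interpretation, the coefficients can be compared with the plain group means. The sketch below also refits the model with `M` as the reference level using `relevel()`. The names `group_means`, `gender_ref_m`, and `b_ref_m` are hypothetical and introduced only for illustration.

```r
student <- data.frame(
  gender = c("F", "M", "F", "M", "F"),
  hours  = c(2.5, 4, 5.3, 3, 2.2)
)
model <- lm(hours ~ gender, data = student)
b <- coef(model)

# Average hours per gender, computed directly from the data
group_means <- tapply(student$hours, student$gender, mean)
group_means

# The intercept matches the F mean; intercept + genderM matches the M mean
unname(b["(Intercept)"])
unname(b["(Intercept)"] + b["genderM"])

# Make "M" the reference level instead: the intercept becomes the M mean
# and the sign of the difference flips
student$gender_ref_m <- relevel(factor(student$gender), ref = "M")
b_ref_m <- coef(lm(hours ~ gender_ref_m, data = student))
b_ref_m
```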