How to Include Factors in Regression using R Programming?
  • Last Updated : 10 Nov, 2020

Categorical variables (also known as factors or qualitative variables) classify observations into groups. They can hold either string or numeric values, and in statistical modeling they are called factor variables. Storing a string variable as a factor can save memory when the same values repeat many times, because R stores each distinct value once and represents the observations as integer codes. A factor has a limited number of distinct values, called levels, and each level can carry a label. For example, the gender of individuals is a categorical variable that can take two levels: Male or Female. Regression requires numeric predictors, so when a researcher wants to include a categorical variable in a regression model, it has to be encoded numerically in a way that keeps the results interpretable. Let’s see all this with code examples in R.

Implementation in R

Storing strings or numbers as factors

First of all, let’s create a sample data set.  

R

# creating sample
samp <- sample(0:1, 20, replace = TRUE)
samp



Output:

[1] 1 1 0 1 1 1 0 0 1 0 0 1 1 1 1 1 1 1 0 1
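Because sample() draws random values, each run generally produces a different vector, so the output above will usually not be reproduced exactly. A minimal sketch of making the draw repeatable, assuming a fixed seed is acceptable (the seed value 123 and the names samp_a and samp_b are arbitrary choices for illustration, not part of the original example):

```r
# fix the state of the random number generator so the draw is repeatable
set.seed(123)  # 123 is an arbitrary, assumed seed
samp_a <- sample(0:1, 20, replace = TRUE)

# resetting the same seed reproduces the same draw
set.seed(123)
samp_b <- sample(0:1, 20, replace = TRUE)

# TRUE: both draws are identical
identical(samp_a, samp_b)
```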


Convert the numeric sample to a factor.



R

# convert the sample created above to a factor
samp1 <- factor(samp)

# is.factor() checks whether an object is a factor
is.factor(samp1)



Output:

[1] TRUE


Now do the same for strings.

R

# creating a character vector
str1 <- c("small", "big", "medium", "small", "small",
          "big", "medium", "medium", "big")

# print the character vector
str1

# convert the character vector to a factor
f.str1 <- factor(str1)

# check whether f.str1 is a factor
is.factor(f.str1)

# check whether str1 is a factor
is.factor(str1)



Output:

[1] "small"  "big"    "medium" "small"  "small"  "big"    "medium" "medium" "big"   
[1] TRUE
[1] FALSE
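To see what factor() did with the strings, the levels and the underlying integer codes can be inspected. A short sketch using the same str1 values as above; by default factor() sorts the distinct values to form the levels, and each observation is stored as the position of its level:

```r
str1 <- c("small", "big", "medium", "small", "small",
          "big", "medium", "medium", "big")
f.str1 <- factor(str1)

# distinct values, sorted alphabetically by default:
# "big" "medium" "small"
levels(f.str1)

# integer code of each observation (the position of its level):
# 3 1 2 3 3 1 2 2 1
as.integer(f.str1)

# number of observations per level (3 for each level here)
table(f.str1)
```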


Factors with labels

R

# relabel the levels of samp1: "0" becomes "sweet", "1" becomes "bitter"
lab <- factor(samp1, labels = c("sweet", "bitter"))
lab



Output:

 [1] bitter bitter sweet  bitter bitter bitter sweet  sweet  bitter sweet 
[11] sweet  bitter bitter bitter bitter bitter bitter bitter sweet  bitter
Levels: sweet bitter
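The labels are matched to the levels by position, so it can be safer to spell out the levels as well as the labels. A small sketch with a hypothetical vector taste_code (not from the original), assuming 0 stands for sweet and 1 for bitter:

```r
# hypothetical numeric codes: 0 = sweet, 1 = bitter (an assumption for illustration)
taste_code <- c(1, 1, 0, 1, 0)

# levels and labels are paired by position: 0 -> "sweet", 1 -> "bitter"
taste <- factor(taste_code,
                levels = c(0, 1),
                labels = c("sweet", "bitter"))
taste

# the labels become the levels of the new factor
levels(taste)
```

Stating the levels explicitly keeps the mapping fixed even if one of the codes happens to be absent from the data.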


Ordered factors

R



str1 <- c("small", "big", "medium", "small", "small",
          "big", "medium", "medium", "big")
  
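# create an ordered factor with the levels in increasing order
# (named size.order rather than order, which is also a base R function)
size.order <- ordered(str1,
                      levels = c("small", "medium", "big"))
size.order
f.order <- factor(size.order)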



Output:

[1] small  big    medium small  small  big    medium medium big   
Levels: small < medium < big


Another way to make a factor ordered is:

R

# another way to create an ordered factor
f.order <- factor(str1,
                  levels = c("small", "medium", "big"),
                  ordered = TRUE)

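Once a factor is ordered, its values can be compared with relational operators, which is not meaningful for an unordered factor. A brief sketch, reusing the str1 values and the level order from above:

```r
str1 <- c("small", "big", "medium", "small", "small",
          "big", "medium", "medium", "big")
f.order <- factor(str1,
                  levels = c("small", "medium", "big"),
                  ordered = TRUE)

# TRUE: the factor carries an ordering
is.ordered(f.order)

# element-wise comparison against a level: TRUE where the value is below "big"
f.order < "big"

# largest level that occurs in the data: big
max(f.order)
```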


Finding the mean

R

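# mean() of a factor returns NA (with a warning), because a factor is not numeric
mean(samp1)

# convert the levels back to numbers first, then take the mean
mean(as.numeric(levels(samp1)[samp1]))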



Output:

[1] NA
[1] 0.7
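The detour through levels() matters because as.numeric() applied directly to a factor returns the internal integer codes rather than the original numbers. A sketch with a small hypothetical vector x (not from the original) that shows the difference:

```r
# hypothetical factor built from the numbers 0 and 1
x <- factor(c(0, 1, 1, 0, 1))

# internal codes, not the original values: 1 2 2 1 2
as.numeric(x)

# original values recovered through the levels: 0 1 1 0 1
as.numeric(levels(x))[x]

# an equivalent conversion through character strings: 0 1 1 0 1
as.numeric(as.character(x))

# mean of the original values: 0.6
mean(as.numeric(levels(x))[x])
```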


 Removing levels

R

f.new <- f.order[f.order != "small"]
f.new



Output:

[1] big    medium big    medium medium big   
Levels: small < medium < big
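Note that the output still lists small among the levels: subsetting removes the observations but keeps the unused level. A minimal sketch of dropping it with droplevels(), reusing the str1 values from above (the name f.dropped is introduced here for illustration):

```r
str1 <- c("small", "big", "medium", "small", "small",
          "big", "medium", "medium", "big")
f.order <- factor(str1,
                  levels = c("small", "medium", "big"),
                  ordered = TRUE)

# subsetting keeps all three levels, even though "small" no longer occurs
f.new <- f.order[f.order != "small"]
levels(f.new)

# droplevels() removes levels that have no observations: "medium" "big"
f.dropped <- droplevels(f.new)
levels(f.dropped)

# the same result in one step, using the drop argument of factor subsetting
levels(f.order[f.order != "small", drop = TRUE])
```

Dropping unused levels before modeling avoids empty groups showing up in tables and summaries.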


Implement With Regression

As an example, consider the number of hours students stay at school during fests.

R

# consider a dataframe student
student <- data.frame(
    # id of students
    id = 1:5,
    
    # name of students  
    name = c("Payal", "Dan", "Misty", "Ryan", "Gargi"),
    
    # gender of students
    gender = c("F", "M", "F", "M", "F"),
    
    # gender represented in numbers F-1,M-0
    gender_num = c(1, 0, 1, 0, 1),
    
    # the hours students stay at fests
    hours = c(2.5, 4, 5.3, 3, 2.2)
)
student



Output:

  id  name   gender gender_num hours
1  1 Payal      F          1   2.5
2  2   Dan      M          0   4.0
3  3 Misty      F          1   5.3
4  4  Ryan      M          0   3.0
5  5 Gargi      F          1   2.2


The regression equation is

 y = b0 + b1*x

Where

y: the outcome variable (hours), predicted on the basis of the predictor variable x,

x: a dummy variable for gender, equal to 1 if a student is male and 0 if a student is female,

b0 and b1: beta coefficients, representing the intercept and the slope, respectively.

The model predicts b0 + b1 if a student is male and b0 if a student is female. The coefficients can be interpreted as follows:

  • b0 is the average number of hours female students stayed at fests,
  • b0 + b1 is the average number of hours male students stayed at fests, and
  • b1 is the average difference in hours between male and female students.
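The dummy variable x can be inspected before fitting anything. A minimal sketch, assuming the same gender and hours values as in the student data frame above; model.matrix() returns the design matrix that a model with this formula would use:

```r
# same gender and hours values as the student data frame
student <- data.frame(
    gender = c("F", "M", "F", "M", "F"),
    hours = c(2.5, 4, 5.3, 3, 2.2)
)

# design matrix for hours ~ gender: an intercept column and a 0/1 column genderM
X <- model.matrix(hours ~ gender, data = student)
X

# the dummy column is 1 for male students and 0 for female students
X[, "genderM"]
```

Here F is the reference level because it comes first alphabetically, which is the reverse of the gender_num column, where F is coded as 1.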

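R creates the dummy variable automatically when a categorical predictor is passed to lm(). By default the first level in alphabetical order (here F) is the reference level, so the dummy variable genderM is 1 for male students: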

R

# making the regression model
model <- lm(hours ~ gender, data = student) 
summary(model)$coef



Output:

            Estimate Std. Error   t value   Pr(>|t|)
(Intercept) 3.3333333  0.8397531 3.9694209 0.02857616
genderM     0.1666667  1.3277662 0.1255241 0.90804814


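The intercept, 3.3333333, is the estimated average for F students, the reference level. The genderM coefficient, 0.1666667, is not the average for M students but the estimated difference between M and F students, so the estimated average for M students is 3.3333333 + 0.1666667 = 3.5 hours. The p-value of the genderM coefficient is 0.908, far above the usual 0.05 threshold, i.e., there is no evidence that M students stay more hours than F students. The p-value of the intercept, 0.029, only tests whether the average for F students differs from zero.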
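The coefficients can be checked against the group means, and the reference level can be changed with relevel(). A sketch under the same data as above; the names means, gender_ref and model_ref are introduced here for illustration only:

```r
student <- data.frame(
    gender = c("F", "M", "F", "M", "F"),
    hours = c(2.5, 4, 5.3, 3, 2.2)
)
model <- lm(hours ~ gender, data = student)

# average hours per gender: about 3.33 for F and 3.5 for M
means <- tapply(student$hours, student$gender, mean)
means

# the slope equals the difference between the two group means
means["M"] - means["F"]
coef(model)

# make M the reference level: the intercept becomes the M average
# and the slope changes sign
student$gender_ref <- relevel(factor(student$gender), ref = "M")
model_ref <- lm(hours ~ gender_ref, data = student)
coef(model_ref)
```

Changing the reference level does not change the fitted values, only which group the intercept describes.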
