Encoding Categorical Data in R

Last Updated : 01 Aug, 2023

Categorical variables appear very often in data analysis and machine learning (ML). Data that can be classified into categories or groups, such as colors or job titles, is called categorical data. Most statistical methods and ML models expect numeric input, so categorical variables usually have to be encoded as numerical values before they can be used.

What is Categorical Data?

Categorical data is data that can be sorted into categories or groups. Colors and job titles are examples of categorical variables: each takes one of a small, finite set of values, called levels. Survey data, demographic data, and marketing data frequently contain categorical variables.

To be used in statistical analysis or machine learning models, these variables are converted into numerical values. This process is known as encoding.

Encoding Methods in R

Categorical data can be encoded in R using a variety of techniques. We’ll go over three of the most popular approaches: one-hot encoding, label encoding, and frequency encoding.

One-Hot Encoding

One-hot encoding is a method of encoding categorical information into a binary matrix. Each distinct value of the categorical variable gets its own column. In each row, the column corresponding to that row's value is set to 1, and all other columns are set to 0.

Consider the following data frame as an illustration:

R

gender <-  c("male", "female", "male", "male", "female")
age    <-  c(23, 34, 52, 21, 19)
income <-  c(50000, 70000, 80000, 45000, 55000)
df     <-  data.frame(gender, age, income)

We can use the model.matrix() function to one-hot encode the gender column.

R

encoded_gender <- model.matrix(~gender-1, data=df)

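The result is a matrix with two indicator columns, genderfemale and gendermale. Each row holds a 1 in the column for its gender and a 0 in the other.

The -1 in the formula argument of model.matrix() instructs R not to include an intercept column. Without it, the first level (female) would be absorbed into the intercept and would not get an indicator column of its own.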
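As a rough sketch of how the encoded matrix might be used, the snippet below rebuilds the example data frame, binds the indicator columns back to the remaining columns, and compares the result with a call that keeps the intercept. The names df_encoded and with_intercept are introduced here purely for illustration.

```r
# Rebuild the example data frame so the snippet is self-contained
df <- data.frame(
  gender = c("male", "female", "male", "male", "female"),
  age    = c(23, 34, 52, 21, 19),
  income = c(50000, 70000, 80000, 45000, 55000)
)

# One indicator column per gender level; -1 drops the intercept
encoded_gender <- model.matrix(~ gender - 1, data = df)
colnames(encoded_gender)   # "genderfemale" "gendermale"

# Replace the original gender column with the indicator columns
# (df_encoded is a hypothetical name used only for this sketch)
df_encoded <- cbind(df[, c("age", "income")], encoded_gender)

# For comparison: with the intercept kept, the first level is absorbed
# into "(Intercept)" and only gendermale remains as an indicator
with_intercept <- model.matrix(~ gender, data = df)
colnames(with_intercept)   # "(Intercept)" "gendermale"
```

Whether to keep or drop the intercept depends on the model that consumes the matrix; regression functions in R usually build this matrix internally from a formula.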

Label Encoding

Label encoding is a method of encoding categorical variables that assigns a distinct integer to each distinct value. For instance, a categorical variable with the three unique values “red,” “green,” and “blue” might be encoded as 1, 2, and 3, respectively.

The factor() function in R can be used to turn a categorical variable into a factor, which can subsequently be converted to integers using the as.integer() function.

Consider the following data frame as an illustration:

R

color <-  c("red", "green", "blue", "blue", "red")
 
df    <-  data.frame(color)

We can use the following code to label encode the color column:

R

df$color <- as.integer(factor(df$color))

The color column now contains 3, 2, 1, 1, 3. By default, factor() orders the levels alphabetically, so blue becomes 1, green becomes 2, and red becomes 3. These numbers are arbitrary labels and have no inherent significance.
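To see which integer stands for which category, one can inspect the factor levels before converting. The sketch below assumes the same color vector; color_factor, codes, mapping, and decoded are illustrative names that do not appear in the original code.

```r
color <- c("red", "green", "blue", "blue", "red")

# factor() sorts the distinct values, so the levels are alphabetical
color_factor <- factor(color)
levels(color_factor)             # "blue" "green" "red"

# Integer codes are positions in the level vector
codes <- as.integer(color_factor)
codes                            # 3 2 1 1 3

# Lookup table from label to code (illustrative helper)
mapping <- setNames(seq_along(levels(color_factor)), levels(color_factor))

# Decode the integers back into the original labels
decoded <- levels(color_factor)[codes]
```

Keeping the level vector around makes the encoding reversible, which is handy when results need to be reported with the original labels.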

Frequency Encoding

Frequency encoding replaces each distinct value of a categorical variable with the frequency with which it occurs in the data. For example, if a categorical variable has three distinct values (red, green, and blue) that appear three, four, and two times, respectively, they are encoded as 3, 4, and 2.

Consider the following data frame as an illustration:

R

color <-  c("red", "green", "blue", "blue", "red")
 
df    <-  data.frame(color)

To frequency encode the color column, we can use the following code:

R

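# Count how often each color occurs
freq_count  <-  table(df$color)
# Replace each color with its count, looked up by name
df$color    <-  as.integer(freq_count[as.character(df$color)])

The color column now contains 2, 1, 2, 2, 2: red and blue each occur twice, and green occurs once. Each value has been replaced by its frequency count in the data. Categories that occur equally often, such as red and blue here, receive the same encoded value.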
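A variation that keeps the original labels is to store the counts in a separate column. The following is only a sketch; the column name color_freq is an assumption made for this example.

```r
df <- data.frame(color = c("red", "green", "blue", "blue", "red"))

# Count occurrences of each distinct color
freq_count <- table(df$color)

# Look up each row's count by name and store it in a new column
# (color_freq is a hypothetical column name)
df$color_freq <- as.integer(freq_count[as.character(df$color)])

df$color_freq    # 2 1 2 2 2
```

Keeping the color column alongside color_freq preserves the distinction between categories that happen to share a count.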

Choosing an Encoding Method

The choice of encoding technique depends on the kind of analysis or model being used, as well as on the characteristics of the data. One-hot encoding is typically used for categorical variables with a small number of unique values, because it adds one column per value. Label encoding and frequency encoding produce a single column and are therefore often preferred when the number of unique values is large.

It is important to remember that frequency encoding and label encoding can both introduce an unintended ordering or hierarchy into the data, which could affect the validity of the analysis or model. In such cases, one-hot encoding may be a better choice.
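One practical consideration is the width of the result: one-hot encoding produces one column per distinct value, while label encoding always produces a single column. The sketch below illustrates this with a made-up variable that has 26 categories; the data frame d and its category column are hypothetical.

```r
# Hypothetical variable with 26 distinct categories (the letters a to z)
d <- data.frame(category = rep(letters, length.out = 200))

# One-hot encoding: one column per distinct value
one_hot <- model.matrix(~ category - 1, data = d)
dim(one_hot)        # 200 rows, 26 columns

# Label encoding: a single integer vector, whatever the number of categories
label <- as.integer(factor(d$category))
length(label)       # 200
```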

Differences Between the Encoding Methods

R offers several encoding methods for converting categorical variables into numerical representations that machine learning systems can work with. Here are the distinctions between the three widely used methods covered above: one-hot, frequency, and label encoding.

One-hot encoding is a method for converting categorical variables into binary vectors. Each category is represented by a binary vector in which exactly one element is 1 and the others are 0. A categorical variable with the three categories A, B, and C, for instance, can be represented as [1, 0, 0], [0, 1, 0], and [0, 0, 1], respectively. This method is helpful when there are only a few categories, because every category adds a column.

Frequency encoding represents each category by how frequently it appears in the data, either as a raw count or as a relative frequency. For instance, if a categorical variable has three categories A, B, and C that occur 10, 20, and 30 times, their relative frequencies are 10/60 ≈ 0.17, 20/60 ≈ 0.33, and 30/60 = 0.5. This method can be helpful when there are many categories and we want to record how often each one occurs.
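The relative-frequency arithmetic can be checked directly: for counts of 10, 20, and 30, each count is divided by the total of 60. This is a small sketch with assumed names (counts, rel_freq, x, encoded), not code from the original article.

```r
# Hypothetical counts for three categories A, B, and C
counts <- c(A = 10, B = 20, C = 30)

# Relative frequency = count / total (the total is 60 here)
rel_freq <- counts / sum(counts)
round(rel_freq, 3)   # A 0.167, B 0.333, C 0.500

# Encode a vector of labels by looking up each label's relative frequency
x <- c("A", "C", "B", "C")
encoded <- unname(rel_freq[x])
```

With real data, the counts would come from table() and the relative frequencies from prop.table() applied to that table.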

Label encoding represents categorical variables by giving each category a different numerical value, based on its order or position among the categories. For instance, the three categories A, B, and C can be expressed as 1, 2, and 3, respectively. This technique is helpful when there is an ordinal relationship between a limited number of categories.
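When the categories do have a natural order, the level order can be stated explicitly so that the integer codes follow it. The example below is a sketch using a made-up size variable; the names size, size_factor, and size_codes are assumptions for illustration.

```r
# Hypothetical ordinal variable with a natural order: low < medium < high
size <- c("medium", "low", "high", "low")

# Spell out the level order; otherwise factor() would sort alphabetically
size_factor <- factor(size, levels = c("low", "medium", "high"), ordered = TRUE)

# Codes now reflect the intended order
size_codes <- as.integer(size_factor)
size_codes   # 2 1 3 1
```

Without the levels argument, the alphabetical default would put high before low, and the codes would no longer reflect the intended ranking.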

Conclusion

This article covered categorical data and the importance of numerically encoding categorical variables for use in statistical analysis and machine learning models. We looked at three commonly used approaches to encoding categorical data in R: one-hot encoding, label encoding, and frequency encoding. The appropriate technique depends on the nature of the data and on the analysis or model being used.
