ML | One Hot Encoding to treat Categorical data parameters

  • Difficulty Level : Easy
  • Last Updated : 05 Jul, 2021

Datasets often contain columns with categorical features stored as string values. For example, a Gender parameter may hold the labels Male and Female. These labels have no inherent order, and because they are strings, most machine learning models cannot work on them directly.

One approach to this problem is label encoding, where each label is assigned a numerical value, for example Male and Female mapped to 0 and 1. But this can add bias to the model: it may give higher preference to the Female label because 1 > 0, even though both labels are equally important in the dataset. To deal with this issue we use the One Hot Encoding technique.
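As a minimal sketch of the label encoding described above, the snippet below maps the two labels to integers with pandas. The toy Series and the name `label_map` are assumptions made for illustration; they are not part of the article's dataset.

```python
import pandas as pd

# Hypothetical toy column, not the article's employee_data.csv
gender = pd.Series(["Male", "Female", "Female", "Male"], name="Gender")

# Label encoding: each label is replaced by an arbitrary integer code
label_map = {"Male": 0, "Female": 1}
gender_encoded = gender.map(label_map)

print(gender_encoded.tolist())
```

The codes 0 and 1 are arbitrary, yet a model that compares magnitudes may still read 1 as "more" than 0, which is the bias the paragraph warns about.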

One Hot Encoding:

In this technique, a separate column is prepared for each label of the categorical parameter, here one for Male and one for Female. So whenever Gender is Male, the Male column holds 1 and the Female column holds 0, and vice versa.
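The idea can be sketched by hand, without any encoding helper, by comparing the column against each label. The toy DataFrame `df` below is an assumption for illustration only.

```python
import pandas as pd

# Hypothetical toy data with a single categorical column
df = pd.DataFrame({"Gender": ["Male", "Female", "Female", "Male"]})

# Build one indicator column per label:
# 1 where the row carries that label, 0 otherwise
for label in ["Male", "Female"]:
    df[label] = (df["Gender"] == label).astype(int)

print(df)
```

Each row ends up with exactly one 1 across the Male and Female columns, which is where the name "one hot" comes from.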

Let’s understand with an example:

Consider the data where fruits and their corresponding categorical value and prices are given.



Fruit     Categorical value of fruit     Price
apple     1                              5
mango     2                              10
apple     1                              15
orange    3                              20


The output after one hot encoding the data is given as follows:

apple     mango     orange     price
1         0         0          5
0         1         0          10
1         0         0          15
0         0         1          20
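The fruit example can be reproduced with a short sketch using pandas' get_dummies(). The variable names `fruit_data`, `dummies` and `encoded` are assumptions chosen for this illustration, and `dtype=int` is passed so the indicator columns hold 0/1 rather than booleans.

```python
import pandas as pd

# The fruit and price values from the example table
fruit_data = pd.DataFrame({
    "Fruit": ["apple", "mango", "apple", "orange"],
    "Price": [5, 10, 15, 20],
})

# One indicator column per distinct fruit (columns come out in sorted order)
dummies = pd.get_dummies(fruit_data["Fruit"], dtype=int)

# Attach the untouched numeric column next to the indicators
encoded = dummies.join(fruit_data["Price"])

print(encoded)
```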


Code: Python code implementation of One-Hot Encoding Technique

Loading the data




# Program for demonstration of one hot encoding
  
# import libraries
import numpy as np
import pandas as pd
  
# import the data required
data = pd.read_csv("employee_data.csv")
print(data.head())

Output:

Checking for the labels in the categorical parameters




print(data['Gender'].unique())
print(data['Remarks'].unique())

Output:

['Male' 'Female']
['Nice' 'Good' 'Great']

Checking for the label counts in the categorical parameters




# print() is needed so that both results are shown when run as a script
print(data['Gender'].value_counts())
print(data['Remarks'].value_counts())

Output:

Female    7
Male      5
Name: Gender, dtype: int64

Nice     5
Great    4
Good     3
Name: Remarks, dtype: int64

One-Hot encoding the categorical parameters using get_dummies()




one_hot_encoded_data = pd.get_dummies(data, columns = ['Remarks', 'Gender'])
print(one_hot_encoded_data)

Output:

We can observe that the encoded data has 3 Remarks columns and 2 Gender columns. However, a parameter with n unique labels needs only n-1 columns. For example, if we keep only the Gender_Female column and drop the Gender_Male column, the entire information is still conveyed: 1 means female and 0 means male. This way we can encode the categorical data and reduce the number of parameters as well.
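One way to get the n-1 columns is the drop_first parameter of get_dummies(). The sketch below uses a small hypothetical DataFrame named `toy` in place of the article's CSV. Note that drop_first removes the first label in sorted order, so it keeps Gender_Male rather than Gender_Female; the information conveyed is the same.

```python
import pandas as pd

# Hypothetical stand-in for the employee data used in the article
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Female"],
    "Remarks": ["Nice", "Good", "Great"],
})

# drop_first=True drops the first label (in sorted order) of every encoded
# column, leaving n-1 indicator columns for a parameter with n labels
reduced = pd.get_dummies(toy, columns=["Remarks", "Gender"],
                         drop_first=True, dtype=int)

print(reduced.columns.tolist())
```

A row with 0 in both Remarks_Great and Remarks_Nice corresponds to the dropped label Good.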
