ML | One Hot Encoding to treat Categorical data parameters

  • Last Updated : 21 Jun, 2022

Most machine learning algorithms cannot work with categorical data directly, so it needs to be converted into numerical data. Datasets often contain columns with categorical features (string values); for example, a Gender column may hold the labels Male and Female. These labels have no inherent order, so the encoding chosen for them should not lead the model to infer a hierarchy that does not exist.

 


One approach to this problem is label encoding, where we assign a numerical value to each label, for example mapping Male and Female to 0 and 1. But this can add bias to our model: because 1 > 0, a model may treat Female as ranking higher than Male, when ideally both labels are equally important in the dataset. To deal with this issue we use the One Hot Encoding technique.
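As a minimal sketch of the label encoding described above, the snippet below maps Male to 0 and Female to 1 with pandas. The sample values and the names `gender` and `mapping` are illustrative assumptions, not part of the article's dataset.

```python
import pandas as pd

# Illustrative column; these values are made up for the sketch.
gender = pd.Series(["Male", "Female", "Female", "Male"], name="Gender")

# Label encoding: replace each label with an integer (Male -> 0, Female -> 1).
mapping = {"Male": 0, "Female": 1}
gender_encoded = gender.map(mapping)

print(gender_encoded.tolist())  # [0, 1, 1, 0]
```

The integers 0 and 1 are arbitrary identifiers here, but a model that compares magnitudes may still read them as an ordering.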

One Hot Encoding:

In this technique, each label of a categorical parameter gets its own column. For Gender, this gives separate Male and Female columns: wherever the value is Male, the Male column is 1 and the Female column is 0, and vice versa. Let’s understand this with an example. Consider the following data, which lists fruits, their corresponding categorical values, and their prices.

Fruit     Categorical value of fruit    Price
apple     1                             5
mango     2                             10
apple     1                             15
orange    3                             20

The output after one-hot encoding of the data is given as follows,

apple    mango    orange    price
1        0        0         5
0        1        0         10
1        0        0         15
0        0        1         20
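The fruit table above can be one-hot encoded by hand, which may help show what the technique does before using a library helper. This is a sketch only: the DataFrame is rebuilt inline from the table, and the names `fruits`, `labels`, and `encoded` are assumptions made for illustration.

```python
import pandas as pd

# The fruit table from the example, rebuilt inline.
fruits = pd.DataFrame({
    "fruit": ["apple", "mango", "apple", "orange"],
    "price": [5, 10, 15, 20],
})

# One column per unique label: 1 where the row has that label, 0 otherwise.
labels = sorted(fruits["fruit"].unique())
encoded = pd.DataFrame(
    {label: (fruits["fruit"] == label).astype(int) for label in labels}
)

# Carry the numeric column over unchanged.
encoded["price"] = fruits["price"]

print(encoded)
```

Each row ends up with exactly one 1 among the fruit columns, which is where the name "one-hot" comes from.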

Code: Python implementation of the manual one-hot encoding technique

Loading the data

Python3




# Program for demonstration of one hot encoding
 
# import libraries
import numpy as np
import pandas as pd
 
# import the data required
data = pd.read_csv("employee_data.csv")
print(data.head())

Output:  

Checking for the labels in the categorical parameters 

Python3




print(data['Gender'].unique())
print(data['Remarks'].unique())

Output:

['Male' 'Female']
['Nice' 'Good' 'Great']

Checking for the label counts in the categorical parameters 

Python3




# print() is needed in a script; without it nothing is displayed
print(data['Gender'].value_counts())
print(data['Remarks'].value_counts())

Output:

Female    7
Male      5
Name: Gender, dtype: int64

Nice     5
Great    4
Good     3
Name: Remarks, dtype: int64

One-Hot encoding the categorical parameters using get_dummies() 

Python3




# dtype=int keeps the new columns as 0/1 (newer pandas versions default to True/False)
one_hot_encoded_data = pd.get_dummies(data, columns=['Remarks', 'Gender'], dtype=int)
print(one_hot_encoded_data)

Output:  

We can observe that the data now has 3 Remarks columns and 2 Gender columns. However, a parameter with n unique labels can be fully described with just n-1 columns. For example, if we keep only the Gender_Female column and drop the Gender_Male column, we still convey the entire information: a value of 1 means female and a value of 0 means male. This way we can encode the categorical data and reduce the number of parameters as well.
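One way to get the n-1 columns is the `drop_first` parameter of `pd.get_dummies()`. The sketch below uses the first four rows of the employee data printed later in this article, rebuilt inline so the snippet runs on its own; the name `reduced` is an assumption. Note that `drop_first=True` drops the first label in sorted order, so it keeps Gender_Male and drops Gender_Female, the mirror image of the choice described above.

```python
import pandas as pd

# First four rows of the employee data, rebuilt inline for this sketch.
data = pd.DataFrame({
    "Employee_Id": [45, 78, 56, 12],
    "Gender": ["Male", "Female", "Female", "Male"],
    "Remarks": ["Nice", "Good", "Great", "Great"],
})

# drop_first=True removes the first (alphabetically smallest) label of each column.
reduced = pd.get_dummies(
    data, columns=["Remarks", "Gender"], drop_first=True, dtype=int
)

print(reduced.columns.tolist())
# ['Employee_Id', 'Remarks_Great', 'Remarks_Nice', 'Gender_Male']
```

A row with 0 in both Remarks_Great and Remarks_Nice is then understood to be Good.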

One Hot Encoding using the scikit-learn library:

OneHotEncoder is an encoder provided by the scikit-learn library (in sklearn.preprocessing). It converts categorical variables into binary vectors. Older versions of scikit-learn accepted only numerical categorical values, so the categories had to be label encoded first; recent versions also accept string categories directly. The example below label encodes the columns before passing them to the encoder.

Python3




# Importing libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Retrieving data
data = pd.read_csv('Employee_data.csv')

# Converting type of columns to category
data['Gender'] = data['Gender'].astype('category')
data['Remarks'] = data['Remarks'].astype('category')

# Assigning numerical values and storing them in new columns
data['Gen_new'] = data['Gender'].cat.codes
data['Rem_new'] = data['Remarks'].cat.codes

# Create an instance of OneHotEncoder
enc = OneHotEncoder()

# Passing encoded columns.
# NOTE: fit_transform() of OneHotEncoder returns a SciPy sparse matrix, which
# saves space when there is a huge number of categories. We call toarray() to
# convert it to a dense array before building the DataFrame.
enc_data = pd.DataFrame(enc.fit_transform(data[['Gen_new', 'Rem_new']]).toarray())

# Merge with the main DataFrame
New_df = data.join(enc_data)

print(New_df)

Output:

    Employee_Id  Gender Remarks  Gen_new  Rem_new    0    1    2    3    4
0            45    Male    Nice        1        2  0.0  1.0  0.0  0.0  1.0
1            78  Female    Good        0        0  1.0  0.0  1.0  0.0  0.0
2            56  Female   Great        0        1  1.0  0.0  0.0  1.0  0.0
3            12    Male   Great        1        1  0.0  1.0  0.0  1.0  0.0
4             7  Female    Nice        0        2  1.0  0.0  0.0  0.0  1.0
5            68  Female   Great        0        1  1.0  0.0  0.0  1.0  0.0
6            23    Male    Good        1        0  0.0  1.0  1.0  0.0  0.0
7            45  Female    Nice        0        2  1.0  0.0  0.0  0.0  1.0
8            89    Male   Great        1        1  0.0  1.0  0.0  1.0  0.0
9            75  Female    Nice        0        2  1.0  0.0  0.0  0.0  1.0
10           47  Female    Good        0        0  1.0  0.0  1.0  0.0  0.0
11           62    Male    Nice        1        2  0.0  1.0  0.0  0.0  1.0

Using get_dummies approach:

Python3




# dtype=int keeps the new columns as 0/1 (newer pandas versions default to True/False)
one_hot_encoded_data = pd.get_dummies(data, columns=['Gender', 'Remarks'], dtype=int)
print(one_hot_encoded_data)

Output:

    Employee_Id  Gen_new  Rem_new  Gender_Female  Gender_Male  Remarks_Good  Remarks_Great  Remarks_Nice
0            45        1        2              0            1             0              0             1
1            78        0        0              1            0             1              0             0
2            56        0        1              1            0             0              1             0
3            12        1        1              0            1             0              1             0
4             7        0        2              1            0             0              0             1
5            68        0        1              1            0             0              1             0
6            23        1        0              0            1             1              0             0
7            45        0        2              1            0             0              0             1
8            89        1        1              0            1             0              1             0
9            75        0        2              1            0             0              0             1
10           47        0        0              1            0             1              0             0
11           62        1        2              0            1             0              0             1
   
