# Feature Encoding Techniques – Machine Learning

Better encoding generally leads to a better model, and most algorithms cannot handle categorical variables unless they are converted into numerical values.

Categorical features are generally divided into 3 types:

1. Binary: either/or.
Examples:

• Yes, No
• True, False

2. Ordinal: groups with a specific order.
Examples:

• low, medium, high
• cold, hot, lava hot

3. Nominal: unordered groups.
Examples:

• cat, dog, tiger
• pizza, burger, coke

Code:

```python
# data preprocessing
import pandas as pd
# numerical computations
import numpy as np
# plotting graphs
import seaborn as sns

df = pd.read_csv("Encoding Data.csv")
# display the first 10 rows
df.head(10)
```

Output: Dataset

Let's examine the columns of the dataset with different types of encoding techniques.
Code: Mapping the binary features present in the dataset.

```python
# Simple mapping always works for binary features;
# any value other than the two expected ones becomes None.
df['bin_1'] = df['bin_1'].apply(lambda x: 1 if x == 'T' else (0 if x == 'F' else None))
df['bin_2'] = df['bin_2'].apply(lambda x: 1 if x == 'Y' else (0 if x == 'N' else None))
# pass the column as x explicitly; recent seaborn versions expect keyword arguments
sns.countplot(x=df['bin_1'])
sns.countplot(x=df['bin_2'])
```

Output: bin_1 after applying mapping; bin_2 after applying mapping

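Label Encoding: Label encoding is simple: each category is replaced by an integer. Because integers carry an implied order, it can be used for encoding ordinal data. Note that scikit-learn's LabelEncoder assigns the integers in sorted (alphabetical) order of the labels, which may not match the intended ranking.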
Code:

```python
# LabelEncoder is provided by the scikit-learn library
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df['ord_2'] = le.fit_transform(df['ord_2'])
sns.set(style="darkgrid")
sns.countplot(x=df['ord_2'])
```

Output: Plot of ord_2 after label encoding
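To make the ordering caveat concrete, here is a minimal pure-Python sketch of what a sorted-order label encoder does. The function name `label_encode` and the sample values are hypothetical, chosen only for illustration; this is not the library's implementation.

```python
def label_encode(values):
    """Map each distinct label to an integer, assigned in sorted label order."""
    classes = sorted(set(values))
    mapping = {label: code for code, label in enumerate(classes)}
    return [mapping[v] for v in values], mapping

# Hypothetical sample column
temps = ["Hot", "Cold", "Warm", "Cold", "Hot"]
codes, mapping = label_encode(temps)
print(mapping)  # {'Cold': 0, 'Hot': 1, 'Warm': 2}
print(codes)    # [1, 0, 2, 0, 1]
```

Here "Warm" receives a larger code than "Hot" purely because of alphabetical order, which is why an explicit ranking is often safer for ordinal features.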

One-Hot Encoding: Label encoding implies a hierarchy among the categories, which can be misleading for nominal features. To overcome this disadvantage, we can use the one-hot encoding strategy.
One-hot encoding is processed in 2 steps:

1. Split the categories into separate columns.
2. Put '1' as an indicator in the appropriate column and '0' in the others.
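The two steps above can be sketched in plain Python as follows. This is only an illustration of the idea; the helper name `one_hot` and the sample data are assumptions, not part of any library.

```python
def one_hot(values):
    """Return the category list and one 0/1 row per input value."""
    # Step 1: one column per distinct category
    categories = sorted(set(values))
    # Step 2: 1 in the matching column, 0 everywhere else
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

# Hypothetical sample column
animals = ["cat", "dog", "tiger", "dog"]
categories, rows = one_hot(animals)
print(categories)  # ['cat', 'dog', 'tiger']
for row in rows:
    print(row)
# [1, 0, 0]
# [0, 1, 0]
# [0, 0, 1]
# [0, 1, 0]
```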

Code: One-hot encoding with the scikit-learn library

```python
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder()
# fit the encoder and transform the column into a dense array
# (stored under a new name so the encoder object is not overwritten)
encoded = enc.fit_transform(df[['nom_0']]).toarray()
# convert the array to a dataframe
encoded_colm = pd.DataFrame(encoded)
# concatenate the dataframes
df = pd.concat([df, encoded_colm], axis=1)
# remove the original column
df = df.drop(['nom_0'], axis=1)
df.head(10)
```

Output: Output

Code: One-Hot encoding with pandas

```python
df = pd.get_dummies(df, prefix=['nom_0'], columns=['nom_0'])
df.head(10)
```

Output: output

This method is often preferable because it produces descriptive column labels (the prefix followed by the category name).
Note: One-hot encoding eliminates the implied order, but it can cause the number of columns to grow considerably. For columns with many unique values, try other techniques.

Frequency Encoding: We can also encode a feature by its frequency distribution, replacing each category with the fraction of rows in which it appears. This method can be effective at times for nominal features.

Code:

```python
# relative frequency of each category
fq = df.groupby('nom_0').size() / len(df)
# map the frequencies onto a new column
df.loc[:, "{}_freq_encode".format('nom_0')] = df['nom_0'].map(fq)
# drop the original column
df = df.drop(['nom_0'], axis=1)
fq.plot.bar(stacked=True)
df.head(10)
```

Output: Frequency distribution (fq) Output
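The same computation can be sketched without pandas to show what frequency encoding does under the hood. The function name `frequency_encode` and the sample values are assumptions made for this illustration.

```python
from collections import Counter

def frequency_encode(values):
    """Replace each category by the fraction of rows in which it occurs."""
    counts = Counter(values)
    total = len(values)
    return [counts[v] / total for v in values]

# Hypothetical sample column
foods = ["pizza", "burger", "pizza", "coke"]
print(frequency_encode(foods))  # [0.5, 0.25, 0.5, 0.25]
```

Note that two categories with the same frequency ("burger" and "coke" here) receive the same encoded value and become indistinguishable.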

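Ordinal Encoding: We can use the OrdinalEncoder class provided by scikit-learn to encode ordinal features. By default it orders the categories by sorted value, so to make sure the ordinal nature of the variable is preserved, pass the intended order through its categories parameter.
Code: Using scikit-learn.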

```python
from sklearn.preprocessing import OrdinalEncoder

ord1 = OrdinalEncoder()
# fit the encoder and transform the column in one step
# (the encoder expects 2-D input, hence the double brackets)
df["ord_2"] = ord1.fit_transform(df[["ord_2"]]).ravel()
df.head(10)
```

Output: Output

Code: Manually assigning a ranking by using a dictionary

```python
# create a dictionary that ranks the categories
temp_dict = {'Cold': 1, 'Warm': 2, 'Hot': 3}
# map the values in the column through the dictionary
df['Ord_2_encod'] = df.ord_2.map(temp_dict)
df = df.drop(['ord_2'], axis=1)
```

Output: Output

Binary Encoding:
Categories are first encoded as integers, each integer is then converted into binary code, and the digits of that binary string are placed into separate columns.
For example, 7 becomes 1 1 1.
This method is preferable when there are many categories. Imagine you have 100 different categories: one-hot encoding will create 100 columns, but binary encoding needs only 7 (since 2^7 = 128 ≥ 100).
Code:

```python
from category_encoders import BinaryEncoder

encoder = BinaryEncoder(cols=['ord_2'])
# fit the encoder and transform the column (passed as a one-column dataframe)
newdata = encoder.fit_transform(df[['ord_2']])
# concatenate the dataframes
df = pd.concat([df, newdata], axis=1)
# drop the old column
df = df.drop(['ord_2'], axis=1)
df.head(10)
```

Output: Output
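The integer-to-binary step can be illustrated with a short pure-Python sketch. The helper `binary_encode` is hypothetical, and it assumes integer codes start at 1 (consistent with the example where 7 becomes 1 1 1); a library implementation may assign codes differently.

```python
def binary_encode(values):
    """Encode categories as integers, then spread their binary digits over columns."""
    categories = sorted(set(values))
    # Assumption for this sketch: integer codes start at 1
    codes = {c: i for i, c in enumerate(categories, start=1)}
    # Number of columns = bits needed for the largest code
    width = max(codes.values()).bit_length()
    rows = [[int(bit) for bit in format(codes[v], f"0{width}b")] for v in values]
    return codes, rows

# Seven hypothetical categories: 'a' .. 'g'
values = list("abcdefg")
codes, rows = binary_encode(values)
print(rows[0])   # [0, 0, 1]  -> category 'a', code 1
print(rows[6])   # [1, 1, 1]  -> category 'g', code 7

# Columns needed for 100 categories
print((100).bit_length())  # 7
```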

Hash Encoding: Hashing is the process of converting a string of characters into a fixed-size hash value by applying a hash function. It is useful because it can handle a large number of categories with low memory usage, at the cost that different categories may collide on the same hash value.
Code:

```python
from sklearn.feature_extraction import FeatureHasher

# n_features is the number of output columns (hash buckets)
h = FeatureHasher(n_features=3, input_type='string')
# each sample must be an iterable of strings, so wrap every value in a list
hashed_Feature = h.fit_transform([[value] for value in df['nom_0']])
hashed_Feature = hashed_Feature.toarray()
df = pd.concat([df, pd.DataFrame(hashed_Feature)], axis=1)
df.head(10)
```

Output: Output

You can further drop the converted feature from your Dataframe.
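As a rough illustration of the hashing idea, the sketch below maps each category to one of a fixed number of columns using a hash function. It deliberately uses `hashlib.md5` from the standard library for determinism; scikit-learn's FeatureHasher uses a different hash function (MurmurHash3) with signed values, so its output will not match this sketch. The helper name `hash_encode` is an assumption.

```python
import hashlib

def hash_encode(value, n_features=8):
    """Place a 1 in the column selected by the hash of the category string."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    # Reduce the (large) hash value to a column index
    index = int(digest, 16) % n_features
    row = [0] * n_features
    row[index] = 1
    return row

# Hypothetical category values
for animal in ["cat", "dog", "tiger"]:
    print(animal, hash_encode(animal))
```

The number of columns stays fixed no matter how many categories appear, but with few columns two different categories can land in the same one.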
Mean/Target Encoding: Target encoding is useful because it picks up values that can help explain the target. It is popular among Kaggle competitors. The basic idea is to replace each categorical value with the mean of the target variable for that category.
Code:

```python
# insert a Target column into the dataset, since target encoding needs a target
# (the 10 values assume the dataframe has exactly 10 rows)
df.insert(5, "Target", [0, 1, 1, 0, 0, 1, 0, 0, 0, 1], True)

# import TargetEncoder
from category_encoders import TargetEncoder

Targetenc = TargetEncoder()
# fit the encoder and transform the column
values = Targetenc.fit_transform(X=df.nom_0, y=df.Target)
# concatenate the encoded values with the dataframe
df = pd.concat([df, values], axis=1)
df.head(10)
```

You can further drop the converted feature from your Dataframe.
Output: output
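The basic idea of replacing a category with the mean of the target can be sketched in plain Python. This is a simplified illustration with hypothetical names and data; library implementations typically add smoothing, so their numbers may differ from a plain per-category mean.

```python
from collections import defaultdict

def target_encode(categories, targets):
    """Replace each category by the mean target value observed for it."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, t in zip(categories, targets):
        sums[c] += t
        counts[c] += 1
    means = {c: sums[c] / counts[c] for c in sums}
    return [means[c] for c in categories], means

# Hypothetical sample column and binary target
colors = ["red", "blue", "red", "blue", "red", "red"]
target = [1, 0, 1, 1, 1, 0]
encoded, means = target_encode(colors, target)
print(means)    # {'red': 0.75, 'blue': 0.5}
print(encoded)  # [0.75, 0.5, 0.75, 0.5, 0.75, 0.75]
```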
