One-Hot Encoding in NLP

Last Updated : 30 Jun, 2023

Natural Language Processing (NLP) is a rapidly expanding discipline that deals with interactions between computers and human language. One of the most basic tasks in NLP is to represent text data numerically so that machine learning algorithms can process it. One common method for accomplishing this is one-hot encoding, which converts categorical variables into binary vectors. In this article, we'll look at what one-hot encoding is, why it's used in NLP, and how to implement it in Python.


One-Hot Encoding:

One-hot encoding is the process of converting categorical variables into a numerical form that machine learning algorithms can readily process. It works by representing each category in a feature as a binary vector of 1s and 0s, with the vector's length equal to the number of possible categories.

For example, if we have a feature with three categories (A, B, and C), each category can be represented as a binary vector of length three: the vector for category A is [1, 0, 0], the vector for category B is [0, 1, 0], and the vector for category C is [0, 0, 1].
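As a quick illustration, the following minimal sketch builds these three vectors with NumPy (the helper name one_hot() and the category_to_index mapping are our own choices for this example):

Python3

import numpy as np

# Minimal sketch: one-hot encode the three categories A, B and C described above
categories = ["A", "B", "C"]
category_to_index = {category: i for i, category in enumerate(categories)}

def one_hot(category):
    vector = np.zeros(len(categories))
    vector[category_to_index[category]] = 1
    return vector

print(one_hot("A"))  # [1. 0. 0.]
print(one_hot("B"))  # [0. 1. 0.]
print(one_hot("C"))  # [0. 0. 1.]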

Why One-Hot Encoding is Used in NLP:

  • One-hot encoding is used in NLP to encode categorical variables, such as words or part-of-speech tags, as binary vectors. 
  • This approach is helpful because machine learning algorithms generally operate on numerical data, so representing text data as numerical vectors is required for these algorithms to work.
  • In a sentiment analysis task, for example, we might represent each word in a sentence as a one-hot encoded vector and then use these vectors as input to a neural network to predict the sentiment of the sentence, as sketched below.
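The hedged sketch below shows one way such vectors can feed a downstream model: the toy vocabulary, the sample sentence, and the idea of summing the word vectors into a simple bag-of-words feature are our own illustrative choices, not part of the original example.

Python3

import numpy as np

# Toy vocabulary and sentence, chosen only for illustration
vocab = ["good", "bad", "movie", "great", "boring"]
word_to_index = {word: i for i, word in enumerate(vocab)}

sentence = "great movie good movie"

# One-hot encode each word, then sum the vectors to get a
# simple bag-of-words feature for the whole sentence
feature = np.zeros(len(vocab))
for word in sentence.split():
    one_hot = np.zeros(len(vocab))
    one_hot[word_to_index[word]] = 1
    feature += one_hot

print(feature)  # [1. 0. 2. 1. 0.]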

Example 1:

Suppose we have a small corpus of text that contains three sentences:

The quick brown fox jumped over the lazy dog.
She sells seashells by the seashore.
Peter Piper picked a peck of pickled peppers.
  • Each word in these sentences should be represented as a one-hot encoded vector. The first step is to identify the categorical variable, which here is the set of words in the sentences. The second step is to count the number of distinct words to determine the number of possible categories. Splitting on whitespace and lowercasing (so trailing punctuation stays attached, making tokens like “dog.” and “seashore.” distinct entries), there are 21 distinct tokens, and therefore 21 possible categories.
  • The third step is to create a binary vector for each category. Because there are 21 possible categories, each binary vector will be 21 elements long, with a single 1 at the index assigned to that category and 0s everywhere else. For example, if “quick” is assigned index 0, its vector is [1, 0, 0, …, 0].
  • Finally, we use the binary vectors generated in step 3 to represent each word in the sentences as a one-hot encoded vector. For example, the one-hot encoded vector for the word “quick” in the first sentence has a single 1 at the index assigned to “quick”, and the vector for “seashells” in the second sentence has a single 1 at the index assigned to “seashells”. The exact positions depend on the order in which the vocabulary is enumerated, as the output of the code below shows.

Python Implementation for One-Hot Encoding in NLP

Now let's implement the above example in Python. In practice this step has to be performed programmatically; otherwise it would not be feasible to use this technique when preparing data for NLP models.

Python3




import numpy as np
 
# Define the corpus of text
corpus = [
    "The quick brown fox jumped over the lazy dog.",
    "She sells seashells by the seashore.",
    "Peter Piper picked a peck of pickled peppers."
]
 
# Create a set of unique words in the corpus
# (split() keeps trailing punctuation, so tokens like "dog." include the full stop)
unique_words = set()
for sentence in corpus:
    for word in sentence.split():
        unique_words.add(word.lower())
 
# Create a dictionary to map each
# unique word to an index
word_to_index = {}
for i, word in enumerate(unique_words):
    word_to_index[word] = i
 
# Create one-hot encoded vectors for
# each word in the corpus
one_hot_vectors = []
for sentence in corpus:
    sentence_vectors = []
    for word in sentence.split():
        vector = np.zeros(len(unique_words))
        vector[word_to_index[word.lower()]] = 1
        sentence_vectors.append(vector)
    one_hot_vectors.append(sentence_vectors)
 
# Print the one-hot encoded vectors
# for the first sentence
print("One-hot encoded vectors for the first sentence:")
for vector in one_hot_vectors[0]:
    print(vector)


Output:

One-hot encoded vectors for the first sentence:
[0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
[0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
[0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]

As you can see from the output, each word in the first sentence has been represented as a one-hot encoded vector of length 21, which corresponds to the number of unique tokens in the corpus. The one-hot encoded vector for the word “quick” (the second vector printed) has a single 1 at the index that word_to_index assigns to “quick” and 0s everywhere else; in this run it is [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.].
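Continuing from the unique_words set and word_to_index mapping defined in the code above, the one-hot vector for any single word can be inspected directly:

Python3

# Look up the one-hot vector for a single word, reusing the
# unique_words set and word_to_index mapping defined above
word = "quick"
vector = np.zeros(len(unique_words))
vector[word_to_index[word]] = 1
print(vector)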

Example 2:

Assume we have a text collection that includes three sentences:

The cat sat on the mat.
The dog chased the cat.
The mat was soft and fluffy.
  • Each word in these sentences should again be represented as a one-hot encoded vector. We begin by identifying the categorical variable (the words in the sentences) and determining the number of possible categories (the number of distinct tokens). Lowercasing and splitting on whitespace, with punctuation left attached (so “cat” and “cat.” count as different tokens), gives 13 distinct tokens in this instance.
  • Next, we generate a binary vector of length 13 for each category. For example, if “cat” is assigned index 0 in the vocabulary, the binary vector for that word is [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0].
  • Finally, we use these binary vectors to represent each word in the sentences as a one-hot encoded vector. For example, in the first sentence the one-hot encoded vector for “mat.” has a single 1 at the index assigned to that token, and in the second sentence the vector for “dog” has a single 1 at its own index, as the output of the code below shows.

This code first builds a set of unique words from the corpus, then a dictionary that maps each word to an integer index. It then iterates over the corpus and, for each word in each sentence, creates a binary vector with a 1 at the position corresponding to that word's integer mapping and 0s elsewhere. The resulting one-hot encoded vectors are printed for each word in each sentence.

Python3




import numpy as np
 
# Define the sentences
sentences = [
    'The cat sat on the mat.',
    'The dog chased the cat.',
    'The mat was soft and fluffy.'
]
 
# Create a vocabulary set
# (punctuation is not stripped, so "cat" and "cat." are separate vocabulary entries)
vocab = set()
for sentence in sentences:
    words = sentence.lower().split()
    for word in words:
        vocab.add(word)
 
# Create a dictionary to map words to integers
word_to_int = {word: i for i, word in enumerate(vocab)}
 
# Create a binary vector for each word in each sentence
vectors = []
for sentence in sentences:
    words = sentence.lower().split()
    sentence_vectors = []
    for word in words:
        binary_vector = np.zeros(len(vocab))
        binary_vector[word_to_int[word]] = 1
        sentence_vectors.append(binary_vector)
    vectors.append(sentence_vectors)
 
# Print the one-hot encoded vectors for each word in each sentence
for i in range(len(sentences)):
    print(f"Sentences {i + 1}:")
    for j in range(len(vectors[i])):
        print(f"{sentences[i].split()[j]}: {vectors[i][j]}")


Output:

Sentences 1:
The: [0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
cat: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
sat: [0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
on: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
the: [0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
mat.: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
Sentences 2:
The: [0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
dog: [0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
chased: [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
the: [0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
cat.: [0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
Sentences 3:
The: [0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
mat: [0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
was: [0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
soft: [0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
and: [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
fluffy.: [0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]

As can be seen, each word is represented as a one-hot encoded vector whose length equals the number of distinct tokens in the corpus (13 in this case, since punctuation stays attached to words). Each vector has a 1 at the position corresponding to the word's integer mapping in the vocabulary, and 0s elsewhere.
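Continuing from the vectors list and vocab set created above, a quick sanity check confirms that every vector has the expected length and exactly one non-zero entry:

Python3

# Sanity check, continuing from the vectors and vocab created above:
# every vector should have length len(vocab) and exactly one entry equal to 1
for sentence_vectors in vectors:
    for vector in sentence_vectors:
        assert len(vector) == len(vocab)
        assert vector.sum() == 1
print("All vectors are valid one-hot encodings of length", len(vocab))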

Drawbacks of One-Hot Encoding in NLP

One of the major disadvantages of one-hot encoding in NLP is that it produces high-dimensional sparse vectors that can be very costly to process. This is because one-hot encoding assigns a distinct binary dimension to each unique word in the text, resulting in a very large feature space. Furthermore, because one-hot encoding does not capture the semantic relationships between words, machine learning models that use these vectors as input may perform poorly. As a result, other encoding methods, such as word embeddings, are frequently used in NLP tasks. Word embeddings map words to low-dimensional dense vectors that capture meaningful relationships between words, making them more useful for many NLP tasks; the sketch below illustrates the difference in dimensionality.
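The following is a hedged illustration of that contrast only: the vocabulary size, embedding dimension, word index, and random embedding matrix are arbitrary choices for demonstration, not a trained model.

Python3

import numpy as np

vocab_size = 50000   # e.g. every unique word in a large corpus
embedding_dim = 100  # typical size for a dense word embedding

# One-hot representation: one 50,000-dimensional sparse vector per word
one_hot = np.zeros(vocab_size)
one_hot[123] = 1  # the index of some word

# Embedding representation: a lookup into a (vocab_size x embedding_dim) matrix
# (random here; in practice learned so that similar words get similar vectors)
embedding_matrix = np.random.rand(vocab_size, embedding_dim)
dense_vector = embedding_matrix[123]

print(one_hot.shape)       # (50000,)
print(dense_vector.shape)  # (100,)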


