Prerequisite: Introduction to word2vec
Natural language processing (NLP) is a subfield of computer science and artificial intelligence concerned with the interactions between computers and human (natural) languages.
In NLP, we map words and phrases (from a vocabulary or corpus) to vectors of numbers to make them easier to process. These language modeling techniques are called word embeddings.
In 2013, Google announced word2vec, a group of related models that are used to produce word embeddings.
Let’s implement our own skip-gram model (in Python) by deriving the backpropagation equations of our neural network.
In the skip-gram architecture of word2vec, the input is the center word and the predictions are the context words. Consider an array of words W: if W(i) is the input (center word), then W(i-2), W(i-1), W(i+1), and W(i+2) are the context words when the sliding window size is 2, as the sketch below illustrates.
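To make the sliding window concrete, here is a minimal sketch of generating (center, context) training pairs; the function name `generate_training_pairs` and the toy sentence are illustrative, not part of the original model:

```python
# Minimal sketch: produce (center, context) pairs with a sliding window.
def generate_training_pairs(words, window_size=2):
    pairs = []
    for i, center in enumerate(words):
        # Every word within `window_size` positions of the center is a context word.
        lo = max(0, i - window_size)
        hi = min(len(words), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, words[j]))
    return pairs

print(generate_training_pairs("the quick brown fox jumps".split()))
# [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ...]
```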
Let's define some variables:

- V: number of unique words in our corpus of text (the vocabulary)
- x: input layer (one-hot encoding of our input word)
- N: number of neurons in the hidden layer of the neural network
- W: weights between the input layer and the hidden layer (a V×N matrix)
- W': weights between the hidden layer and the output layer (an N×V matrix)
- y: a softmax output layer holding the probabilities of every word in our vocabulary
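As an example of the input encoding, a one-hot vector x can be built like this (a sketch; the toy `word_to_index` mapping stands in for one built from the real corpus):

```python
import numpy as np

word_to_index = {"the": 0, "quick": 1, "brown": 2, "fox": 3}  # toy vocabulary
V = len(word_to_index)

def one_hot(word):
    # (V x 1) column vector with a single 1 at the word's index.
    x = np.zeros((V, 1))
    x[word_to_index[word]] = 1.0
    return x
```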
Our neural network architecture is defined, now let’s do some math to derive the equations needed for gradient descent.
Multiplying the one-hot encoding of the centre word (denoted by x) with the first weight matrix W gives us the hidden layer vector h (of size N×1):

$$h = W^T x \qquad \big[(N \times 1) = (N \times V)\,(V \times 1)\big]$$
Now we multiply the hidden layer vector h with the second weight matrix W' to get a new vector u:

$$u = W'^T h \qquad \big[(V \times 1) = (V \times N)\,(N \times 1)\big]$$
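In numpy, these two multiplications can be sketched as follows; the random initialisation and toy sizes are assumptions for illustration:

```python
import numpy as np

V, N = 4, 3  # toy sizes: 4 vocabulary words, 3 hidden neurons
rng = np.random.default_rng(0)
W = rng.standard_normal((V, N))        # input -> hidden weights  (V x N)
W_prime = rng.standard_normal((N, V))  # hidden -> output weights (N x V)

x = np.zeros((V, 1))
x[0] = 1.0                             # one-hot centre word

h = W.T @ x          # (N x 1) hidden layer: picks out one row of W
u = W_prime.T @ h    # (V x 1) one score per vocabulary word
```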
Note that we have to apply softmax to the layer u to get our output layer y.
Let $u_j$ be the jth neuron of layer u.
Let $w_j$ be the jth word in our vocabulary, where j is any index.
Let $V_{w_j}$ be the jth column of the matrix W' (the column corresponding to the word $w_j$). Then:

$$u_j = V_{w_j}^T h \qquad \big[(1 \times 1) = (1 \times N)\,(N \times 1)\big]$$
$$y = \mathrm{softmax}(u), \qquad y_j = \frac{\exp(u_j)}{\sum_{k=1}^{V} \exp(u_k)}$$
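A numerically stable softmax is a few lines of numpy (a sketch; subtracting the maximum is a standard trick that does not change the result):

```python
import numpy as np

def softmax(u):
    # softmax is invariant to adding a constant to every score,
    # so subtract the max to avoid overflow in np.exp.
    e = np.exp(u - np.max(u))
    return e / e.sum()

u = np.array([[2.0], [1.0], [0.1], [-1.0]])  # toy scores
y = softmax(u)  # (V x 1) vector of probabilities summing to 1
```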
$y_j$ denotes the probability that $w_j$ is a context word.

$P(w_j \mid w_i)$ is the probability that $w_j$ is a context word, given that $w_i$ is the input word.
Thus, our goal is to maximise $P(w_{j^*} \mid w_i)$, where $j^*$ represents the indices of the context words. Clearly we want to maximise

$$\prod_{c=1}^{C} P(w_{j^*_c} \mid w_i) = \prod_{c=1}^{C} \frac{\exp(u_{j^*_c})}{\sum_{k=1}^{V} \exp(u_k)}$$

where $j^*_c$ are the vocabulary indices of the context words, ranging over $c = 1, 2, \dots, C$.
Let's take the negative log likelihood of this function to get our loss function, which we want to minimise:

$$E = -\log \prod_{c=1}^{C} \frac{\exp(u_{j^*_c})}{\sum_{k=1}^{V} \exp(u_k)}$$
Let t be the actual output vector from our training data for a particular centre word. It has 1's at the positions of the context words and 0's at all other places, so $t_{j^*_c} = 1$ for every context word.
Since $t_{j^*_c} = 1$, we can multiply each $u_{j^*_c}$ with $t_{j^*_c}$ without changing the sum, i.e. $\sum_{c=1}^{C} u_{j^*_c} = \sum_{c=1}^{C} t_{j^*_c}\,u_{j^*_c}$. Expanding the logarithm, we get our loss function as:

$$E = -\sum_{c=1}^{C} u_{j^*_c} + C \cdot \log \sum_{k=1}^{V} \exp(u_k)$$
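In code, this loss can be computed directly from u and the positions of the context words (a sketch; `context_indices` is an assumed list holding the indices $j^*_c$):

```python
import numpy as np

def loss(u, context_indices):
    # E = -(sum of scores at the context positions)
    #     + C * log(sum of exp of all scores)
    C = len(context_indices)
    return float(-np.sum(u[context_indices]) + C * np.log(np.sum(np.exp(u))))
```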
The parameters to be adjusted are in the matrices W and W', hence we have to find the partial derivatives of our loss function with respect to W and W' in order to apply the gradient descent algorithm.
We have to find $\frac{\partial E}{\partial W}$ and $\frac{\partial E}{\partial W'}$.
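Carrying the chain rule through (a sketch of one consistent derivation under the shapes defined above, not necessarily the article's exact formulation): differentiating the loss with respect to u gives the error vector $e = C\,y - t$, from which $\frac{\partial E}{\partial W'} = h\,e^T$ and $\frac{\partial E}{\partial W} = x\,(W' e)^T$:

```python
import numpy as np

def backward(x, h, y, t, C, W, W_prime, eta=0.01):
    # x: (V,1) one-hot input, h: (N,1) hidden layer, y: (V,1) softmax output,
    # t: (V,1) target vector with 1's at context positions, C: context count,
    # eta: learning rate (an assumed hyperparameter).
    e = C * y - t                  # dE/du for the loss derived above
    dW_prime = h @ e.T             # (N x V) gradient for W'
    dW = x @ (W_prime @ e).T       # (V x N) gradient for W
    W -= eta * dW                  # gradient descent updates
    W_prime -= eta * dW_prime
    return W, W_prime
```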
Below is the implementation: