Prerequisite: Introduction to word2vec
Natural language processing (NLP) is a subfield of computer science and artificial intelligence concerned with the interactions between computers and human (natural) languages.
In NLP techniques, we map the words and phrases (from vocabulary or corpus) to vectors of numbers to make the processing easier. These types of language modeling techniques are called word embeddings.
In 2013, Google announched word2vec, a group of related models that are used to produce word embeddings.
Let’s implement our own skip-gram model (in Python) by deriving the backpropagation equations of our neural network.
In skip gram architecture of word2vec, the input is the center word and the predictions are the context words. Consider an array of words W, if W(i) is the input (center word), then W(i-2), W(i-1), W(i+1), and W(i+2) are the context words, if the sliding window size is 2.
Let's define some variables : V Number of unique words in our corpus of text ( Vocabulary ) x Input layer (One hot encoding of our input word ). N Number of neurons in the hidden layer of neural network W Weights between input layer and hidden layer W' Weights between hidden layer and output layer y A softmax output layer having probabilities of every word in our vocabulary
Our neural network architecture is defined, now let’s do some math to derive the equations needed for gradient descent.
Multiplying one hot encoding of centre word (denoted by x) with the first weight matrix W to get hidden layer matrix h (of size N x 1).
( Vx1 ) ( NxV ) ( Vx1 )
Now we multiply the hidden layer vector h with second weight matrix W’ to get a new matrix u
( Vx1 ) ( VxN ) ( Nx1 )
Note that we have to apply a softmax> to layer u to get our output layer y.
Let uj be jth neuron of layer u
Let wj be the jth word in our vocabulary where j is any index
Let Vwj be the jth column of matrix W’(column corresponding to a word wj)
( 1×1 ) ( 1xN ) ( Nx1 )
y = softmax(u)
yj = softmax(uj)
yj denotes the probability that wj is a context word
P(wj|wi) is the probability that wj is a context word, given wi is the input word.
Thus, our goal is to maximise P( wj* | wi ), where j* represents the indices of context words
Clearly we want to maximise
where j*c are the vocabulary indexes of context words . Context words range from c = 1, 2, 3..C
Let’s take a negative log likelihood of this function to get our loss function, which we want to minimise
Let t be actual output vector from our training data, for a particular centre word. It will have 1’s at the positions of context words and 0’s at all other places. tj*c are the 1’s of the context words.
We can multiply with
Solving this equation we get our loss function as –
The parameters to be adjusted are in the matrices W and W’, hence we have to find the partial derivatives of our loss function with respect to W and W’ to apply gradient descent algorithm.
We have to find
Below is the implementation :
Attention reader! Don’t stop learning now. Get hold of all the important CS Theory concepts for SDE interviews with the CS Theory Course at a student-friendly price and become industry ready.
- Write your own len() in Python
- Print with your own font using Python !!
- Create Your own Intents and Entities in Dialogflow Chatbot
- Create your own universal function in NumPy
- Change your way to put logic in your code - Python
- Program to print its own name as output
- Python | Program to implement Jumbled word game
- Python | Program to implement simple FLAMES game
- Implement IsNumber() function in Python
- Implement Canny Edge Detector in Python using OpenCV
- Python program to implement Rock Paper Scissor game
- Send mail from your Gmail account using Python
- Send mail with attachment from your Gmail account using Python
- Rename all file names in your directory using Python
- Python | Fetch your gmail emails from a particular user
- Get Your System Information - Using Python Script
- Customize your Python class with Magic or Dunder methods
- Python Script to Shutdown your PC using Voice Commands
- Creating Your First Application in Python
- How to implement Dictionary with Python3?
If you like GeeksforGeeks and would like to contribute, you can also write an article using contribute.geeksforgeeks.org or mail your article to email@example.com. See your article appearing on the GeeksforGeeks main page and help other Geeks.
Please Improve this article if you find anything incorrect by clicking on the "Improve Article" button below.