What are Word Embeddings?
Word embedding is an approach for representing words and documents. A word embedding (or word vector) is a numeric vector that represents a word in a space with far fewer dimensions than the vocabulary size. It allows words with similar meanings to have similar representations, so the vectors can approximate meaning. A word vector with 50 values can be thought of as encoding 50 features.
Features: Anything that relates words to one another, e.g. age, sports, fitness, employed, etc. Each word vector has values corresponding to these features.
Goal of Word Embeddings
- To reduce dimensionality
- To use a word to predict the words around it
- To capture inter-word semantics
How are Word Embeddings used?
- They are used as input to machine learning models.
Take the words -> Give their numeric representation -> Use in training or inference
- To represent or visualize any underlying patterns of usage in the corpus that was used to train them.
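The "words -> numeric representation -> model input" flow above can be sketched in a few lines. This is a minimal illustration, not a real pipeline: the embedding table, its 3-dimensional values, and the helper names `tokenize` and `sentence_vector` are all made up for the example, and averaging word vectors is only one simple way to turn a sentence into a fixed-size input.

```python
# Toy embedding table; the words and values are invented for illustration.
embeddings = {
    "good": [0.9, 0.1, 0.3],
    "nice": [0.8, 0.2, 0.4],
    "evening": [0.1, 0.7, 0.5],
}

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()

def sentence_vector(text, table):
    """Average the vectors of the known words to get one fixed-size feature vector."""
    vectors = [table[tok] for tok in tokenize(text) if tok in table]
    if not vectors:
        raise ValueError("no known words in input")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

# Take the words -> give their numeric representation -> use as model input.
features = sentence_vector("Good evening", embeddings)
print(features)
```

The resulting list has the same length as each word vector, so it can be passed to any model that expects fixed-size numeric input.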
Implementations of Word Embeddings:
Word embeddings are a method of extracting features from text so that we can feed those features into a machine learning model that works with text data. They try to preserve syntactic and semantic information. Methods such as Bag of Words (BOW), CountVectorizer and TF-IDF rely on word counts in a sentence but do not preserve any syntactic or semantic information. In these methods, the size of the vector is the number of words in the vocabulary, and most of its elements are zero, which gives a sparse matrix. Large input vectors mean a huge number of weights, which makes training computationally expensive. Word embeddings offer a solution to these problems.
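To make the sparsity point concrete, here is a small sketch of a count-based vector built by hand. The three-sentence corpus and the helper name `bag_of_words` are assumptions chosen for illustration; a library such as scikit-learn's CountVectorizer does the same job with more options.

```python
corpus = ["it is a nice evening", "good evening", "is it a nice evening"]

# Vocabulary: one index per distinct word, in sorted order.
vocab = sorted({word for sentence in corpus for word in sentence.split()})
index = {word: i for i, word in enumerate(vocab)}

def bag_of_words(sentence):
    """Count vector with one slot per vocabulary word."""
    counts = [0] * len(vocab)
    for word in sentence.split():
        counts[index[word]] += 1
    return counts

vector = bag_of_words("good evening")
zeros = vector.count(0)
print(vocab)   # ['a', 'evening', 'good', 'is', 'it', 'nice']
print(vector)  # [0, 1, 1, 0, 0, 0]
print(f"{zeros} of {len(vector)} entries are zero")
```

Even with a six-word vocabulary most entries are zero. With a realistic vocabulary of tens of thousands of words, the vector grows to that length while a typical sentence still fills only a handful of slots.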
Let’s take an example to understand how a word vector is generated. We take emojis that are most frequently used in certain conditions and transform each emoji into a vector; the conditions will be our features.
Happy: ???? ???? ????
Sad: ???? ???? ????
Excited: ???? ???? ????
Sick: ???? ???? ????
The emoji vectors will be, with features ordered as [happy, sad, excited, sick]:
???? = [1,0,1,0]
???? = [0,1,0,1]
???? = [0,0,1,1]
.....
In a similar way, we can create word vectors for different words on the basis of given features. Words with similar vectors are likely to have similar meanings or to be used to convey similar sentiment.
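One common way to measure how similar two vectors are is cosine similarity. The sketch below applies it to the three feature vectors from the example above. The variable names `a`, `b` and `c` are placeholders for the three emojis, and cosine similarity is just one reasonable choice of measure.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: 1 means same direction, 0 means no overlap."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

# Feature order: [happy, sad, excited, sick]
a = [1, 0, 1, 0]
b = [0, 1, 0, 1]
c = [0, 0, 1, 1]

print(cosine_similarity(a, b))  # 0.0: no shared features
print(cosine_similarity(a, c))  # about 0.5: one shared feature (excited)
print(cosine_similarity(b, c))  # about 0.5: one shared feature (sick)
```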
In this article we will be discussing two different approaches to get word embeddings: Word2Vec and GloVe.
1) Word2Vec
In Word2Vec every word is assigned a vector. We start with either a random vector or a one-hot vector.
One-hot vector: A representation where only one bit in a vector is 1. If there are 500 unique words in the corpus, then the vector length will be 500. After assigning vectors to each word, we take a window size and iterate through the entire corpus. While we do this, one of two neural embedding methods is used:
1.1) Continuous Bag of Words (CBOW)
In this model, we use the neighboring words in the window to predict the central word.
1.2) Skip Gram
In this model, we use the central word to predict its neighboring words. It is the reverse of the CBOW model. Skip-gram is often reported to produce better embeddings for infrequent words, while CBOW is faster to train.
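The difference between the two models is easiest to see in the training examples they build from the same window. The sketch below only generates those (input, target) examples and does not train a network. The function names and the example sentence are chosen for illustration.

```python
def context_windows(tokens, window):
    """Yield (center, context_words) for every position in the token list."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield center, left + right

def cbow_examples(tokens, window):
    """CBOW: the context words are the input, the center word is the target."""
    return [(context, center) for center, context in context_windows(tokens, window)]

def skipgram_examples(tokens, window):
    """Skip-gram: the center word is the input, each context word is a target."""
    return [(center, word)
            for center, context in context_windows(tokens, window)
            for word in context]

tokens = "it is a nice evening".split()
cbow = cbow_examples(tokens, window=1)
skipgram = skipgram_examples(tokens, window=1)

print(cbow[1])       # (['it', 'a'], 'is')
print(skipgram[:3])  # [('it', 'is'), ('is', 'it'), ('is', 'a')]
```

CBOW produces one example per position, while skip-gram produces one example per (center, neighbor) pair, so skip-gram sees more training examples from the same text.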
After applying one of the above neural embedding methods, we get trained vectors for each word after many iterations through the corpus. These trained vectors preserve syntactic and semantic information and have far fewer dimensions than the one-hot vectors. Vectors of words with similar meaning are placed close to each other in space.
2) GloVe
GloVe (Global Vectors) is another method for creating word embeddings. In this method, we iterate through the corpus and record the co-occurrence of each word with other words in the corpus. This gives us a co-occurrence matrix. Words which occur next to each other get a value of 1, words one word apart get 1/2, words two words apart get 1/3, and so on.
Let us take an example to understand how the matrix is created. We have a small corpus:
Corpus: It is a nice evening. Good Evening! Is it a nice evening?
Because co-occurrence is symmetric, the upper half of the matrix will be a reflection of the lower half. We can also use a window frame to calculate the co-occurrences, shifting the frame until the end of the corpus. This helps gather information about the context in which each word is used.
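The counting scheme can be sketched for the small corpus above. Several details here are assumptions made for the example rather than anything fixed by the method: text is lowercased, punctuation is dropped, pairs are only counted within a sentence, and the window is 2 words.

```python
import re
from collections import defaultdict

def cooccurrence(sentences, window):
    """Symmetric co-occurrence counts weighted by 1/distance within each sentence."""
    counts = defaultdict(float)
    for sentence in sentences:
        tokens = re.findall(r"[a-z]+", sentence.lower())
        for i, word in enumerate(tokens):
            # Look at up to `window` words to the right; add both directions.
            for j in range(i + 1, min(i + 1 + window, len(tokens))):
                weight = 1.0 / (j - i)  # adjacent -> 1, one word apart -> 1/2
                counts[(word, tokens[j])] += weight
                counts[(tokens[j], word)] += weight
    return counts

sentences = ["It is a nice evening.", "Good Evening!", "Is it a nice evening?"]
matrix = cooccurrence(sentences, window=2)

print(matrix[("nice", "evening")])  # 2.0: adjacent in two sentences
print(matrix[("good", "evening")])  # 1.0: adjacent once
print(matrix[("it", "a")])          # 1.5: one word apart once, adjacent once
```

Each pair is stored in both orders, which is the symmetry described above.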
Initially, the vectors for each word are assigned randomly. Then we take pairs of vectors and see how close they are to each other in space. If the words occur together often (that is, they have a high value in the co-occurrence matrix) but their vectors are far apart in space, the vectors are brought closer to each other. If the vectors are close to each other but the words are rarely used together, they are moved further apart in space.
After many iterations of the above process, we’ll get a vector space representation that approximates the information from the co-occurrence matrix. The GloVe authors report better performance than Word2Vec on both semantic and syntactic tasks, although results vary with the corpus and hyperparameters.
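The iterative process described above can be illustrated with a heavily simplified sketch. It is not the actual GloVe objective, which also includes bias terms and a weighting function over the counts. Here we only fit dot products of word vectors to log(1 + count) by gradient descent. The function name, the hyperparameters, and the tiny co-occurrence table are all assumptions for illustration.

```python
import math
import random

def train_vectors(cooc, dim=2, epochs=500, lr=0.05, seed=0):
    """Fit word vectors so that dot(w_a, w_b) approximates log(1 + count_ab)."""
    rng = random.Random(seed)
    words = sorted({w for pair in cooc for w in pair})
    # Start from small random vectors.
    vecs = {w: [rng.uniform(-0.5, 0.5) for _ in range(dim)] for w in words}
    losses = []
    for _ in range(epochs):
        total = 0.0
        for (a, b), count in cooc.items():
            target = math.log(1.0 + count)
            pred = sum(x * y for x, y in zip(vecs[a], vecs[b]))
            err = pred - target
            total += err * err
            # Gradients are computed from the pre-update values of both vectors.
            grad_a = [err * y for y in vecs[b]]
            grad_b = [err * x for x in vecs[a]]
            vecs[a] = [x - lr * g for x, g in zip(vecs[a], grad_a)]
            vecs[b] = [y - lr * g for y, g in zip(vecs[b], grad_b)]
        losses.append(total)
    return vecs, losses

cooc = {("nice", "evening"): 2.0, ("good", "evening"): 1.0, ("it", "is"): 2.0}
vectors, losses = train_vectors(cooc)
print(losses[0], losses[-1])  # the squared error should shrink over the epochs
```

When the dot product of a pair is lower than its target, the update pulls the two vectors toward each other; when it is higher, it pushes them apart. This matches the "bring closer / move apart" description above.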
Pre-trained Word Embedding Models:
People generally use pre-trained models for word embeddings. A few of them are:
- Flair etc.
Common Errors made:
- You need to use exactly the same preprocessing pipeline when deploying your model as was used to create the training data for the word embedding. If you use a different tokenizer or a different method of handling whitespace, punctuation, etc., you might end up with incompatible inputs.
- Words in your input that don’t have a pre-trained vector. Such words are known as out-of-vocabulary (OOV) words. You can replace those words with “UNK”, which means unknown, and then handle them separately.
- Dimension mismatch: Vectors can be of many lengths. If you train a model with vectors of length, say, 400 and then try to apply vectors of length 1000 at inference time, you will run into errors. So make sure to use the same dimensions throughout.
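The three points above can be guarded against in code. The sketch below is one possible approach under stated assumptions: the `pretrained` table and its values are hypothetical, the helper names `preprocess` and `lookup` are invented for the example, and mapping every unknown word to a single "UNK" vector is only one of several ways to handle OOV words.

```python
UNK = "UNK"

# Hypothetical pre-trained table with 3-dimensional vectors,
# including an entry for unknown words.
pretrained = {
    "nice": [0.8, 0.2, 0.4],
    "evening": [0.1, 0.7, 0.5],
    UNK: [0.0, 0.0, 0.0],
}

def preprocess(text):
    """Single preprocessing function, to be shared by training and inference code."""
    return text.lower().split()

def lookup(tokens, table, expected_dim):
    """Map tokens to vectors, sending out-of-vocabulary tokens to the UNK vector."""
    vectors = []
    for token in tokens:
        vector = table.get(token, table[UNK])
        if len(vector) != expected_dim:
            raise ValueError(f"expected dimension {expected_dim}, got {len(vector)}")
        vectors.append(vector)
    return vectors

# "rainy" is not in the table, so it falls back to the UNK vector.
vectors = lookup(preprocess("Nice rainy evening"), pretrained, expected_dim=3)
print(vectors)
```

Checking the dimension at lookup time turns a silent mismatch into an immediate, readable error instead of a failure deeper inside the model.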
Benefits of using Word Embeddings:
- They are much faster to produce than hand-built resources like WordNet (a manually curated lexical graph)
- Almost all modern NLP applications start with an embedding layer
- They store an approximation of meaning
Drawbacks of Word Embeddings:
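- They can be memory intensive
- They are corpus dependent. Any underlying bias in the corpus will have an effect on your model
- They cannot distinguish between different senses of the same spelling, since each word form gets a single vector. E.g. bank (river bank/financial institution), bat (animal/sports equipment), etc.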