
Pre-trained Word Embedding using GloVe in NLP Models

In this article, we are going to see how to use pre-trained GloVe word embeddings in NLP models using Python.

What is GloVe?

Global Vectors for Word Representation, or GloVe for short, is an unsupervised learning algorithm that generates vector representations, or embeddings, of words. Researchers Jeffrey Pennington, Richard Socher, and Christopher D. Manning at Stanford first presented it in 2014. By using the statistical co-occurrence data of words in a given corpus, GloVe is intended to capture the semantic relationships between words.



The fundamental concept underlying GloVe is the representation of words as vectors in a continuous vector space, where the distance and direction between vectors reflect the semantic relationships between the corresponding words. To do this, GloVe builds a co-occurrence matrix from word pairs and then optimizes the word vectors so that the difference between the dot product of two word vectors and the logarithm of their co-occurrence count is minimized.

Word embedding

In NLP models, we deal with texts which are human-readable and understandable. But the machine doesn’t understand texts, it only understands numbers. Thus, word embedding is the technique to convert each word into an equivalent float vector. Various techniques exist depending on the use case of the model and dataset. Some of the techniques are One Hot Encoding, TF-IDF, Word2Vec, and FastText.



Example: 

'the': [-0.123, 0.353, 0.652, -0.232]
'the' is a very frequently used word in texts of any kind, and its equivalent 4-dimensional dense vector is shown above.
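To make the contrast with sparse representations concrete, here is a minimal sketch (toy vocabulary and made-up values, chosen only for illustration) comparing a one-hot vector with a dense embedding vector:

import numpy as np

# toy vocabulary with arbitrary indices, for illustration only
vocab = {'the': 0, 'cat': 1, 'sat': 2}

# one-hot encoding: sparse and high-dimensional, with no notion of similarity
one_hot_cat = np.zeros(len(vocab))
one_hot_cat[vocab['cat']] = 1.0          # [0., 1., 0.]

# dense embedding: a low-dimensional float vector; the values here are made up,
# real values would come from a trained model such as GloVe or Word2Vec
embedding_cat = np.array([-0.123, 0.353, 0.652, -0.232])

print(one_hot_cat)
print(embedding_cat)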

GloVe Data

The pre-trained GloVe "6B" embeddings were trained on a corpus of about 6 billion tokens (Wikipedia 2014 plus Gigaword 5) and provide dense vectors for a vocabulary of roughly 400,000 words, along with many general-use tokens such as commas, braces, and semicolons. The algorithm's developers make these pre-trained GloVe embeddings freely available, so it is not necessary to train the model from scratch: the files can be downloaded and used immediately in a variety of natural language processing (NLP) applications. Users can select a pre-trained GloVe embedding in a dimension (e.g., 50d, 100d, 200d, or 300d vectors) that best fits their needs in terms of computational resources and task specificity.

Here d stands for dimension: 100d means that each word in the file is represented by a vector of size 100. GloVe files are plain text files that work like a dictionary: each line contains a word (the key) followed by the float values of its dense vector (the value).
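As a rough sketch of how such a file can be read (assuming glove.6B.100d.txt has already been downloaded and unzipped into the working directory), each line is split into the word and its vector:

import numpy as np

# build a {word: vector} dictionary from a GloVe text file;
# each line has the form "word v1 v2 ... v100"
embeddings = {}
with open('glove.6B.100d.txt', encoding='utf8') as f:
    for line in f:
        word, *values = line.split()
        embeddings[word] = np.asarray(values, dtype=np.float32)

print(embeddings['the'].shape)   # (100,)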

GloVe Embeddings Applications

GloVe embeddings are a popular option for representing words in text data and have found applications in various natural language processing (NLP) tasks. The following are some typical uses for GloVe embeddings:

Text Classification: GloVe vectors can be used as input features for models that label text by topic or sentiment.

Named Entity Recognition (NER): embeddings help models identify mentions of people, organizations, locations, and other entities.

Machine Translation: embeddings give translation models a dense representation of words in the source and target languages.

Question Answering Systems: embeddings help match questions to semantically relevant passages and answers.

Document Similarity and Clustering: documents can be compared or grouped by averaging the embeddings of their words (see the cosine-similarity sketch after this list).

Word Analogy Tasks: GloVe vectors support analogies such as "king - man + woman ≈ queen" through simple vector arithmetic.

Semantic Search: queries and documents can be matched by meaning rather than by exact keyword overlap.
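As an illustration of the document-similarity and semantic-search use cases, the sketch below averages GloVe vectors to build document vectors and compares them with cosine similarity. It assumes an embeddings dictionary mapping words to NumPy vectors, such as the one built earlier; the "documents" here are toy examples.

import numpy as np

def cosine(u, v):
    # cosine similarity between two vectors
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def doc_vector(tokens, embeddings):
    # average the GloVe vectors of the tokens found in the vocabulary
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vecs, axis=0)

# toy query and document, tokenized into lowercase words
query = doc_vector(['ringing', 'bell'], embeddings)
doc = doc_vector(['the', 'peon', 'is', 'ringing', 'the', 'bell'], embeddings)
print(cosine(query, doc))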

How does GloVe work?

The creation of a word co-occurrence matrix is the fundamental component of GloVe. This matrix captures how often words appear together within a given context window, which provides a quantitative measure of the semantic affinity between words. GloVe then optimises the word vectors so that the dot product of two word vectors (plus bias terms) approximates the logarithm of their co-occurrence count, minimising a weighted least-squares loss. This allows GloVe to produce dense vector representations that capture both syntactic and semantic relationships.
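Concretely, for each pair of words i and j with co-occurrence count X_ij, GloVe fits vectors and biases so that their dot product approximates log(X_ij), with a weighting function that damps the influence of rare and very frequent pairs. The following toy sketch shows that loss for a single word pair (illustrative only, not the library's actual training code):

import numpy as np

def glove_pair_loss(w_i, w_j, b_i, b_j, x_ij, x_max=100.0, alpha=0.75):
    # weighting function f(X_ij) from the GloVe paper: down-weights rare pairs
    # and caps the influence of very frequent ones
    weight = (x_ij / x_max) ** alpha if x_ij < x_max else 1.0
    # squared difference between the model score and the log co-occurrence count
    return weight * (np.dot(w_i, w_j) + b_i + b_j - np.log(x_ij)) ** 2

# random 50-dimensional vectors and a co-occurrence count of 20, for illustration
w_i, w_j = np.random.rand(50), np.random.rand(50)
print(glove_pair_loss(w_i, w_j, b_i=0.1, b_j=0.2, x_ij=20.0))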

Create Vocabulary Dictionary

Vocabulary is the collection of all unique words present in the training dataset. First, the dataset is tokenized into words and the frequency of each word is counted. The words are then sorted in decreasing order of frequency, so high-frequency words appear at the beginning of the dictionary.

Dataset= {The peon is ringing the bell}
Vocabulary= {'the': 2, 'peon': 1, 'is': 1, 'ringing': 1, 'bell': 1}
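A small sketch of this step using Python's collections.Counter (tokenization here is just a lowercase split, for illustration):

from collections import Counter

dataset = "The peon is ringing the bell"

# tokenize, count word frequencies, and sort in decreasing order of frequency
tokens = dataset.lower().split()
vocabulary = dict(Counter(tokens).most_common())

print(vocabulary)   # {'the': 2, 'peon': 1, 'is': 1, 'ringing': 1, 'bell': 1}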

Algorithm for word embedding

Code Implementation:

Using TensorFlow's Tokenizer, the code tokenizes a small set of words and prints the vocabulary it builds. Lines for downloading and unzipping the GloVe vectors are included but commented out. The code then defines a function that uses the pre-trained GloVe vectors to create an embedding matrix for the given vocabulary, and finally prints the dense vector for the first word in the vocabulary. Before executing the code, download and unzip the GloVe vectors (uncomment the download lines or fetch glove.6B.zip manually) and make sure the file path in the code points to the unzipped glove.6B.50d.txt file.




# code for GloVe word embedding
from tensorflow.keras.preprocessing.text import Tokenizer
import numpy as np
 
x = {'text', 'the', 'leader', 'prime',
     'natural', 'language'}
 
# create the dict.
tokenizer = Tokenizer()
tokenizer.fit_on_texts(x)
 
# number of unique words in dict.
print("Number of unique words in dictionary=",
      len(tokenizer.word_index))
print("Dictionary is = ", tokenizer.word_index)
 
# download the GloVe vectors (glove.6B.zip) and unzip them in the notebook, e.g.:
# !wget http://nlp.stanford.edu/data/glove.6B.zip
# !unzip glove*.zip
 
# vocab: 'the': 1, mapping of words with
# integers in seq. 1,2,3..
# embedding: 1->dense vector
def embedding_for_vocab(filepath, word_index,
                        embedding_dim):
    vocab_size = len(word_index) + 1
     
    # Adding again 1 because of reserved 0 index
    embedding_matrix_vocab = np.zeros((vocab_size,
                                       embedding_dim))
 
    with open(filepath, encoding="utf8") as f:
        for line in f:
            word, *vector = line.split()
            if word in word_index:
                idx = word_index[word]
                embedding_matrix_vocab[idx] = np.array(
                    vector, dtype=np.float32)[:embedding_dim]
 
    return embedding_matrix_vocab
 
 
# matrix for vocab: word_index
embedding_dim = 50
embedding_matrix_vocab = embedding_for_vocab(
    '../glove.6B.50d.txt', tokenizer.word_index,
    embedding_dim)
 
print("Dense vector for first word is => ",
      embedding_matrix_vocab[1])

Output:

Number of unique words in dictionary= 6
Dictionary is = {'leader': 1, 'the': 2, 'prime': 3, 'natural': 4, 'language': 5, 'text': 6}
Dense vector for first word is => [-0.1567 0.26117 0.78881001 0.65206999 1.20019996 0.35400999
-0.34298 0.31702 -1.15020001 -0.16099 0.15798 -0.53501999
-1.34679997 0.51783001 -0.46441001 -0.19846 0.27474999 -0.26154
0.25531 0.33388001 -1.04130006 0.52525002 -0.35442999 -0.19137
-0.08964 -2.33139992 0.12433 -0.94405001 -1.02330005 1.35070002
2.55240011 -0.16897 -1.72899997 0.32548001 -0.30914 -0.63056999
-0.22211 -0.15589 -0.43597999 0.0568 -0.090885 0.75028002
-1.31529999 -0.75358999 0.82898998 0.051397 -1.48049998 -0.11134
0.27090001 -0.48712999]
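Once the embedding matrix has been built, it is typically plugged into a Keras Embedding layer as frozen pre-trained weights so that a downstream model (for example, a text classifier) can use the GloVe vectors. A minimal sketch, reusing the variables from the code above:

from tensorflow.keras.layers import Embedding
from tensorflow.keras.initializers import Constant

# initialize the layer with the pre-trained GloVe matrix and freeze it
embedding_layer = Embedding(
    input_dim=len(tokenizer.word_index) + 1,   # +1 for the reserved 0 index
    output_dim=embedding_dim,
    embeddings_initializer=Constant(embedding_matrix_vocab),
    trainable=False,
)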

Frequently Asked Questions (FAQs)

Q. 1 What is pre-trained word embedding in NLP?

Pre-trained word embeddings are word vectors that have already been learned on large text corpora and can be reused directly in new models. They capture both the semantic and syntactic meaning of a word, and because they are trained on large datasets, they can improve the performance of a Natural Language Processing (NLP) model. These embeddings are practical for use in real-world projects and hackathons.

Q. 2 What is the word embedding GloVe model?

GloVe, short for Global Vectors, is a distributed word representation model developed at Stanford. It is an unsupervised learning algorithm that generates vector representations of words by placing them in a meaningful space in which the distance between word vectors correlates with their semantic similarity.

Q. 3 Why would you use a pre-trained model?

Pre-trained models are networks that have been saved after being trained on a sizeable dataset, usually for a large-scale task such as language modelling or image classification. You can either apply transfer learning to tailor such a model to a specific task, or use the pre-trained model exactly as it is.

Q. 4 Which model works best for word embeddings?

The most widely used model for creating word embeddings is probably Word2Vec. It was created by a group of Google researchers and has had a major influence on natural language processing. To learn word associations from a large text corpus, the model makes use of a shallow neural network.
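For comparison, here is a minimal Word2Vec training sketch using the gensim library (gensim is not used elsewhere in this article; the corpus and parameters below are purely illustrative):

from gensim.models import Word2Vec

# toy corpus: a list of tokenized sentences
sentences = [['the', 'peon', 'is', 'ringing', 'the', 'bell'],
             ['the', 'leader', 'is', 'speaking']]

# train a small skip-gram model (sg=1) on the toy corpus
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1)

print(model.wv['bell'][:5])   # first five dimensions of the learned vector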

