
Pre-trained Word Embedding using GloVe in NLP Models

In this article, we are going to see how to use pre-trained GloVe word embeddings in NLP models using Python.

What is GloVe?

Global Vectors for Word Representation, or GloVe for short, is an unsupervised learning algorithm that generates vector representations, or embeddings, of words. Researchers Jeffrey Pennington, Richard Socher, and Christopher D. Manning at Stanford first presented it in 2014. By using the statistical co-occurrence data of words in a given corpus, GloVe is intended to capture the semantic relationships between words.



The fundamental concept underlying GloVe is the representation of words as vectors in a continuous vector space, where the distance and direction between vectors reflect the semantic relationships between the corresponding words. To do this, GloVe builds a co-occurrence matrix from word pairs and then optimizes the word vectors so that the difference between the dot product of two word vectors and the logarithm of their co-occurrence count is minimized.

Word embedding

In NLP models, we deal with texts which are human-readable and understandable. But the machine doesn’t understand texts, it only understands numbers. Thus, word embedding is the technique to convert each word into an equivalent float vector. Various techniques exist depending on the use case of the model and dataset. Some of the techniques are One Hot Encoding, TF-IDF, Word2Vec, and FastText.



Example: 

'the': [-0.123, 0.353, 0.652, -0.232]
'the' is a very frequently used word in texts of any kind, and its equivalent 4-dimensional dense vector is shown above.
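To make the contrast with sparse representations concrete, here is a minimal sketch (toy vocabulary and made-up values, chosen only for illustration) comparing a one-hot vector with a dense embedding vector:

import numpy as np

# toy vocabulary with arbitrary indices, for illustration only
vocab = {'the': 0, 'cat': 1, 'sat': 2}

# one-hot encoding: sparse and high-dimensional, with no notion of similarity
one_hot_cat = np.zeros(len(vocab))
one_hot_cat[vocab['cat']] = 1.0          # [0., 1., 0.]

# dense embedding: a low-dimensional float vector; the values here are made up,
# real values would come from a trained model such as GloVe or Word2Vec
embedding_cat = np.array([-0.123, 0.353, 0.652, -0.232])

print(one_hot_cat)
print(embedding_cat)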

GloVe Data

The pre-trained GloVe "6B" embeddings were trained on a corpus of about 6 billion tokens (Wikipedia 2014 plus Gigaword 5) and provide dense vectors for a vocabulary of roughly 400,000 words, along with many general-use tokens such as commas, braces, and semicolons. The algorithm's developers make these pre-trained GloVe embeddings freely available, so it is not necessary to train the model from scratch: the files can be downloaded and used immediately in a variety of natural language processing (NLP) applications. Users can select a pre-trained GloVe embedding in a dimension (e.g., 50d, 100d, 200d, or 300d vectors) that best fits their needs in terms of computational resources and task specificity.

Here d stands for dimension: 100d means that each word in the file is represented by a vector of size 100. GloVe files are plain text files that work like a dictionary: each line contains a word (the key) followed by the float values of its dense vector (the value).
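As a rough sketch of how such a file can be read (assuming glove.6B.100d.txt has already been downloaded and unzipped into the working directory), each line is split into the word and its vector:

import numpy as np

# build a {word: vector} dictionary from a GloVe text file;
# each line has the form "word v1 v2 ... v100"
embeddings = {}
with open('glove.6B.100d.txt', encoding='utf8') as f:
    for line in f:
        word, *values = line.split()
        embeddings[word] = np.asarray(values, dtype=np.float32)

print(embeddings['the'].shape)   # (100,)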

GloVe Embeddings Applications

GloVe embeddings are a popular option for representing words in text data and have found applications in various natural language processing (NLP) tasks. The following are some typical uses for GloVe embeddings:

Text Classification: GloVe vectors can be used as input features for models that label text by topic or sentiment.

Named Entity Recognition (NER): embeddings help models identify mentions of people, organizations, locations, and other entities.

Machine Translation: embeddings give translation models a dense representation of words in the source and target languages.

Question Answering Systems: embeddings help match questions to semantically relevant passages and answers.

Document Similarity and Clustering: documents can be compared or grouped by averaging the embeddings of their words (see the cosine-similarity sketch after this list).

Word Analogy Tasks: GloVe vectors support analogies such as "king - man + woman ≈ queen" through simple vector arithmetic.

Semantic Search: queries and documents can be matched by meaning rather than by exact keyword overlap.
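As an illustration of the document-similarity and semantic-search use cases, the sketch below averages GloVe vectors to build document vectors and compares them with cosine similarity. It assumes an embeddings dictionary mapping words to NumPy vectors, such as the one built earlier; the "documents" here are toy examples.

import numpy as np

def cosine(u, v):
    # cosine similarity between two vectors
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def doc_vector(tokens, embeddings):
    # average the GloVe vectors of the tokens found in the vocabulary
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vecs, axis=0)

# toy query and document, tokenized into lowercase words
query = doc_vector(['ringing', 'bell'], embeddings)
doc = doc_vector(['the', 'peon', 'is', 'ringing', 'the', 'bell'], embeddings)
print(cosine(query, doc))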

How does GloVe work?

The creation of a word co-occurrence matrix is the fundamental component of GloVe. This matrix captures how often words appear together within a given context window, which provides a quantitative measure of the semantic affinity between words. GloVe then optimises the word vectors so that the dot product of two word vectors (plus bias terms) approximates the logarithm of their co-occurrence count, minimising a weighted least-squares loss. This allows GloVe to produce dense vector representations that capture both syntactic and semantic relationships.
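Concretely, for each pair of words i and j with co-occurrence count X_ij, GloVe fits vectors and biases so that their dot product approximates log(X_ij), with a weighting function that damps the influence of rare and very frequent pairs. The following toy sketch shows that loss for a single word pair (illustrative only, not the library's actual training code):

import numpy as np

def glove_pair_loss(w_i, w_j, b_i, b_j, x_ij, x_max=100.0, alpha=0.75):
    # weighting function f(X_ij) from the GloVe paper: down-weights rare pairs
    # and caps the influence of very frequent ones
    weight = (x_ij / x_max) ** alpha if x_ij < x_max else 1.0
    # squared difference between the model score and the log co-occurrence count
    return weight * (np.dot(w_i, w_j) + b_i + b_j - np.log(x_ij)) ** 2

# random 50-dimensional vectors and a co-occurrence count of 20, for illustration
w_i, w_j = np.random.rand(50), np.random.rand(50)
print(glove_pair_loss(w_i, w_j, b_i=0.1, b_j=0.2, x_ij=20.0))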

Create Vocabulary Dictionary

Vocabulary is the collection of all unique words present in the training dataset. First, the dataset is tokenized into words and the frequency of each word is counted. The words are then sorted in decreasing order of frequency, so high-frequency words appear at the beginning of the dictionary.

Dataset= {The peon is ringing the bell}
Vocabulary= {'the': 2, 'peon': 1, 'is': 1, 'ringing': 1, 'bell': 1}
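A small sketch of this step using Python's collections.Counter (tokenization here is just a lowercase split, for illustration):

from collections import Counter

dataset = "The peon is ringing the bell"

# tokenize, count word frequencies, and sort in decreasing order of frequency
tokens = dataset.lower().split()
vocabulary = dict(Counter(tokens).most_common())

print(vocabulary)   # {'the': 2, 'peon': 1, 'is': 1, 'ringing': 1, 'bell': 1}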

Algorithm for word embedding

Code Implementation:

Using TensorFlow's Tokenizer, the code tokenizes a small set of words and prints the vocabulary it builds. Lines for downloading and unzipping the GloVe vectors are included but commented out. The code then defines a function that uses the pre-trained GloVe vectors to create an embedding matrix for the given vocabulary, and finally prints the dense vector for the first word in the vocabulary. Before executing the code, download and unzip the GloVe vectors (uncomment the download lines or fetch glove.6B.zip manually) and make sure the file path in the code points to the unzipped glove.6B.50d.txt file.




# code for GloVe word embedding
from tensorflow.keras.preprocessing.text import Tokenizer
import numpy as np
 
x = {'text', 'the', 'leader', 'prime',
     'natural', 'language'}
 
# create the dict.
tokenizer = Tokenizer()
tokenizer.fit_on_texts(x)
 
# number of unique words in dict.
print("Number of unique words in dictionary=",
      len(tokenizer.word_index))
print("Dictionary is = ", tokenizer.word_index)
 
# download the GloVe vectors (glove.6B.zip) and unzip them in the notebook, e.g.:
# !wget http://nlp.stanford.edu/data/glove.6B.zip
# !unzip glove*.zip
 
# vocab: 'the': 1, mapping of words with
# integers in seq. 1,2,3..
# embedding: 1->dense vector
def embedding_for_vocab(filepath, word_index,
                        embedding_dim):
    vocab_size = len(word_index) + 1
     
    # Adding again 1 because of reserved 0 index
    embedding_matrix_vocab = np.zeros((vocab_size,
                                       embedding_dim))
 
    with open(filepath, encoding="utf8") as f:
        for line in f:
            word, *vector = line.split()
            if word in word_index:
                idx = word_index[word]
                embedding_matrix_vocab[idx] = np.array(
                    vector, dtype=np.float32)[:embedding_dim]
 
    return embedding_matrix_vocab
 
 
# matrix for vocab: word_index
embedding_dim = 50
embedding_matrix_vocab = embedding_for_vocab(
    '../glove.6B.50d.txt', tokenizer.word_index,
    embedding_dim)
 
print("Dense vector for first word is => ",
      embedding_matrix_vocab[1])

Output:

Number of unique words in dictionary= 6
Dictionary is = {'leader': 1, 'the': 2, 'prime': 3, 'natural': 4, 'language': 5, 'text': 6}
Dense vector for first word is => [-0.1567 0.26117 0.78881001 0.65206999 1.20019996 0.35400999
-0.34298 0.31702 -1.15020001 -0.16099 0.15798 -0.53501999
-1.34679997 0.51783001 -0.46441001 -0.19846 0.27474999 -0.26154
0.25531 0.33388001 -1.04130006 0.52525002 -0.35442999 -0.19137
-0.08964 -2.33139992 0.12433 -0.94405001 -1.02330005 1.35070002
2.55240011 -0.16897 -1.72899997 0.32548001 -0.30914 -0.63056999
-0.22211 -0.15589 -0.43597999 0.0568 -0.090885 0.75028002
-1.31529999 -0.75358999 0.82898998 0.051397 -1.48049998 -0.11134
0.27090001 -0.48712999]
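Once the embedding matrix has been built, it is typically plugged into a Keras Embedding layer as frozen pre-trained weights so that a downstream model (for example, a text classifier) can use the GloVe vectors. A minimal sketch, reusing the variables from the code above:

from tensorflow.keras.layers import Embedding
from tensorflow.keras.initializers import Constant

# initialize the layer with the pre-trained GloVe matrix and freeze it
embedding_layer = Embedding(
    input_dim=len(tokenizer.word_index) + 1,   # +1 for the reserved 0 index
    output_dim=embedding_dim,
    embeddings_initializer=Constant(embedding_matrix_vocab),
    trainable=False,
)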

Frequently Asked Questions (FAQs)

Q. 1 What is pre-trained word embedding in NLP?

Pre-trained word embeddings are word vectors that have already been learned on large text corpora and can be reused directly in new models. They capture both the semantic and syntactic meaning of a word, and because they are trained on large datasets, they can improve the performance of a Natural Language Processing (NLP) model. These embeddings are practical for use in real-world projects and hackathons.

Q. 2 What is the word embedding GloVe model?

GloVe, short for Global Vectors, is a distributed word representation model developed at Stanford. It is an unsupervised learning algorithm that generates vector representations of words by placing them in a meaningful space in which the distance between word vectors correlates with their semantic similarity.

Q. 3 Why would you use a pre-trained model?

Pre-trained models are networks that have been saved after being trained on a sizeable dataset, usually for a large-scale task such as language modelling or image classification. You can either apply transfer learning to tailor such a model to a specific task, or use the pre-trained model exactly as it is.

Q. 4 Which model works best for word embeddings?

The most widely used model for creating word embeddings is probably Word2Vec. It was created by a group of Google researchers and has had a major influence on natural language processing. To learn word associations from a large text corpus, the model makes use of a shallow neural network.
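For comparison, here is a minimal Word2Vec training sketch using the gensim library (gensim is not used elsewhere in this article; the corpus and parameters below are purely illustrative):

from gensim.models import Word2Vec

# toy corpus: a list of tokenized sentences
sentences = [['the', 'peon', 'is', 'ringing', 'the', 'bell'],
             ['the', 'leader', 'is', 'speaking']]

# train a small skip-gram model (sg=1) on the toy corpus
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1)

print(model.wv['bell'][:5])   # first five dimensions of the learned vector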

