
Word Embedding using Word2Vec

In this article, we will see how to generate and use Word2Vec word embeddings for NLP models in Python.

What is Word Embedding?

Word Embedding is a language modeling technique for mapping words to vectors of real numbers. It represents words or phrases in a vector space with several dimensions. Word embeddings can be generated using various methods such as neural networks, co-occurrence matrices, and probabilistic models. Word2Vec is one family of models for generating word embeddings. These models are shallow, two-layer neural networks with one input layer, one hidden layer, and one output layer.



What is Word2Vec?

Word2Vec is a widely used method in natural language processing (NLP) that represents words as vectors in a continuous vector space. Developed by researchers at Google, Word2Vec maps words to high-dimensional vectors that capture the semantic relationships between them. Its main principle is that words with similar meanings should have similar vector representations. Word2Vec utilizes two architectures:

1. CBOW (Continuous Bag of Words): predicts a target word from its surrounding context words.
2. Skip-Gram: predicts the surrounding context words from a target word.

The basic idea of word embedding is that words occurring in similar contexts tend to be closer to each other in vector space. To generate word vectors in Python, the modules needed are nltk and gensim. Run these commands in a terminal to install them:

pip install nltk
pip install gensim
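
gensim also provides a downloader API for pre-trained embeddings. The sketch below is a minimal example, assuming an internet connection for the first run, that loads the pre-trained word2vec-google-news-300 vectors (roughly 1.6 GB, cached locally after the first download) and queries them:

# Load pre-trained Word2Vec vectors via gensim's downloader API
import gensim.downloader as api

# KeyedVectors trained on the Google News corpus (300 dimensions)
wv = api.load("word2vec-google-news-300")

# Words with similar meanings end up close together in vector space
print(wv.similarity("king", "queen"))       # cosine similarity of two words
print(wv.most_similar("computer", topn=3))  # nearest neighbours of 'computer'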

Why do we need Word2Vec?

In natural language processing (NLP), Word2Vec is a popular and significant method for representing words as vectors in a continuous vector space. It is used in many different NLP applications because the dense vectors it learns capture semantic and syntactic relationships between words, can be trained efficiently on large corpora, and serve as useful input features for downstream models.

Word2Vec Code Implementation

Download the text file (alice.txt) used for generating word vectors from here. Below is the implementation:

This code illustrates how to train the CBOW and Skip-Gram Word2Vec models on a given text file and how to use the trained models to compute the cosine similarity between particular word pairs. The models' ability to capture the semantic relationships between words may vary depending on whether CBOW or Skip-Gram is used.




# Python program to generate word vectors using Word2Vec

# importing all necessary modules
import gensim
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
import warnings

warnings.filterwarnings(action='ignore')

# download the tokenizer data needed by sent_tokenize / word_tokenize
nltk.download('punkt', quiet=True)

# Read the 'alice.txt' file
with open("C:\\Users\\Admin\\Desktop\\alice.txt", encoding="utf-8") as sample:
    s = sample.read()

# Replace newline characters with spaces
f = s.replace("\n", " ")
 
data = []
 
# iterate through each sentence in the file
for i in sent_tokenize(f):
    temp = []
 
    # tokenize the sentence into words
    for j in word_tokenize(i):
        temp.append(j.lower())
 
    data.append(temp)
 
# Create CBOW model
model1 = gensim.models.Word2Vec(data, min_count=1,
                                vector_size=100, window=5)
 
# Print results
print("Cosine similarity between 'alice' " +
      "and 'wonderland' - CBOW : ",
      model1.wv.similarity('alice', 'wonderland'))
 
print("Cosine similarity between 'alice' " +
      "and 'machines' - CBOW : ",
      model1.wv.similarity('alice', 'machines'))
 
# Create Skip Gram model
model2 = gensim.models.Word2Vec(data, min_count=1, vector_size=100,
                                window=5, sg=1)
 
# Print results
print("Cosine similarity between 'alice' " +
      "and 'wonderland' - Skip Gram : ",
      model2.wv.similarity('alice', 'wonderland'))
 
print("Cosine similarity between 'alice' " +
      "and 'machines' - Skip Gram : ",
      model2.wv.similarity('alice', 'machines'))

Output :

Cosine similarity between 'alice' and 'wonderland' - CBOW :  0.999249298413
Cosine similarity between 'alice' and 'machines' - CBOW : 0.974911910445
Cosine similarity between 'alice' and 'wonderland' - Skip Gram : 0.885471373104
Cosine similarity between 'alice' and 'machines' - Skip Gram : 0.856892599521

The output shows the cosine similarities between the word vectors for 'alice', 'wonderland' and 'machines' under the two models. An interesting exercise is to change the values of the 'vector_size' and 'window' parameters and observe how the cosine similarities vary, as in the sketch below.
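
A minimal sketch (reusing the tokenized data list from the implementation above; the values vector_size=300 and window=10 are only illustrative, not recommendations):

# Retrain the CBOW model with different hyperparameters
model3 = gensim.models.Word2Vec(data, min_count=1,
                                vector_size=300, window=10)

print("Cosine similarity between 'alice' and 'wonderland'",
      "(vector_size=300, window=10) :",
      model3.wv.similarity('alice', 'wonderland'))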

Applications of Word Embedding

Word embeddings are used in many downstream NLP tasks, including text classification, sentiment analysis, machine translation, named-entity recognition, question answering, and document similarity or information retrieval.

Frequently Asked Questions (FAQs)

Q.1 What is Word2Vec and how does it work?

Word2Vec is a technique for representing words as vectors in a continuous vector space. It works by learning vector representations from a large corpus based on the distributional hypothesis: words with similar meanings tend to occur in similar contexts, so their vectors are pushed close together during training.

Q. 2 Describe the differences between Word2Vec’s CBOW and Skip-Gram architectures.

Skip-Gram predicts context words from a target word, while CBOW predicts a target word from its context. CBOW is faster and typically performs better on frequent words, whereas Skip-Gram often performs better for infrequent words.
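
In gensim, the sg parameter selects the architecture (sg=0, the default, for CBOW and sg=1 for Skip-Gram). A minimal sketch on a toy corpus:

from gensim.models import Word2Vec

# Tiny toy corpus: a list of tokenized sentences
sentences = [["alice", "loves", "wonderland"],
             ["alice", "follows", "the", "white", "rabbit"]]

# sg=0 (default) -> CBOW, sg=1 -> Skip-Gram
cbow_model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=0)
skipgram_model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1)

print(cbow_model.wv.similarity("alice", "wonderland"))
print(skipgram_model.wv.similarity("alice", "wonderland"))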

Q. 3 How are word embeddings trained using Word2Vec?

Word2Vec trains embeddings on a large corpus by sliding a context window over the text and adjusting each word's vector so that the model gets better at predicting a target word from its context (CBOW) or the context from a target word (Skip-Gram).
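
For example, once the CBOW model (model1) from the implementation above has been trained, each word in the vocabulary maps to a learned vector that can be inspected directly:

# Each word in the vocabulary now maps to a dense vector of length vector_size
vec = model1.wv['alice']
print(vec.shape)                        # -> (100,)

# Nearest neighbours of 'alice' in the learned vector space
print(model1.wv.most_similar('alice', topn=5))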

Q. 4 What is Cosine Similarity?

The similarity between two vectors in an inner product space is measured by cosine similarity. It finds whether two vectors are roughly pointing in the same direction by measuring the cosine of the angle between them. In text analysis, it is frequently used to gauge document similarity.
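
As a quick sketch, the value returned by model1.wv.similarity in the code above can also be reproduced by hand with NumPy (using the same illustrative word pair as earlier):

import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (||a|| * ||b||)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

v1 = model1.wv['alice']
v2 = model1.wv['wonderland']

# Should match model1.wv.similarity('alice', 'wonderland')
print(cosine_similarity(v1, v2))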

