Word embedding is a language modeling technique for mapping words to vectors of real numbers: each word or phrase is represented as a point in a vector space of several dimensions. Word embeddings can be generated with a variety of methods, such as neural networks, co-occurrence matrices, and probabilistic models.
Word2Vec is a family of models for generating word embeddings. These models are shallow, two-layer neural networks with one input layer, one hidden layer, and one output layer. Word2Vec offers two architectures:
- CBOW (Continuous Bag of Words): the CBOW model predicts the current word from the context words within a specified window. The input layer holds the context words and the output layer holds the current word; the hidden layer's width is the number of dimensions in which we want to represent the current word (see the toy sketch after this list).
- Skip-gram: the skip-gram model predicts the surrounding context words within a specified window, given the current word. The input layer holds the current word and the output layer holds the context words; the hidden layer's width is again the number of embedding dimensions.
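To make the hidden-layer description concrete, here is a toy numpy sketch of a single CBOW forward pass. All sizes and names are illustrative, not part of Word2Vec's actual training code (which also involves backpropagation and tricks such as negative sampling):

```python
import numpy as np

V, N = 10, 4                         # toy vocabulary size and embedding dimension
W_in = np.random.rand(V, N)          # input -> hidden weights (the word embeddings)
W_out = np.random.rand(N, V)         # hidden -> output weights

context_ids = [1, 3, 5, 7]           # indices of the context words in the window
h = W_in[context_ids].mean(axis=0)   # hidden layer: average of the context vectors
scores = h @ W_out                   # output layer: one score per vocabulary word
probs = np.exp(scores) / np.exp(scores).sum()  # softmax over the vocabulary
predicted = int(probs.argmax())      # index of the predicted current word
print(predicted)
```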
The basic idea of word embedding is that words occurring in similar contexts tend to lie close to each other in vector space. Generating word vectors in Python requires the nltk and gensim modules.
Run these commands in a terminal to install them:

```
pip install nltk
pip install gensim
```
Download the text file used for generating word vectors (the example below trains on the text of Alice's Adventures in Wonderland, saved locally).
Below is the implementation:
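A minimal sketch of the implementation, assuming gensim 4.x (where the older `size` parameter is named `vector_size`) and a training corpus saved locally as `alice.txt`; both the filename and the preprocessing details are assumptions:

```python
import warnings
warnings.filterwarnings(action='ignore')  # silence gensim's noisy warnings

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from gensim.models import Word2Vec

nltk.download('punkt')  # tokenizer data (newer nltk releases may also need 'punkt_tab')

# Read the training corpus ('alice.txt' is an assumed local filename).
with open("alice.txt", encoding="utf-8") as f:
    text = f.read().replace("\n", " ")

# Tokenize into sentences, then into lowercase word lists.
data = [[word.lower() for word in word_tokenize(sentence)]
        for sentence in sent_tokenize(text)]

# CBOW model (sg=0 is gensim's default architecture).
cbow = Word2Vec(data, min_count=1, vector_size=100, window=5)

# Skip-gram model (sg=1 selects the skip-gram architecture).
skipgram = Word2Vec(data, min_count=1, vector_size=100, window=5, sg=1)

print("Cosine similarity between 'alice' and 'wonderland' - CBOW :",
      cbow.wv.similarity('alice', 'wonderland'))
print("Cosine similarity between 'alice' and 'machines' - CBOW :",
      cbow.wv.similarity('alice', 'machines'))
print("Cosine similarity between 'alice' and 'wonderland' - Skip Gram :",
      skipgram.wv.similarity('alice', 'wonderland'))
print("Cosine similarity between 'alice' and 'machines' - Skip Gram :",
      skipgram.wv.similarity('alice', 'machines'))
```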
```
Cosine similarity between 'alice' and 'wonderland' - CBOW : 0.999249298413
Cosine similarity between 'alice' and 'machines' - CBOW : 0.974911910445
Cosine similarity between 'alice' and 'wonderland' - Skip Gram : 0.885471373104
Cosine similarity between 'alice' and 'machines' - Skip Gram : 0.856892599521
```
The output shows the cosine similarities between the word vectors for 'alice', 'wonderland', and 'machines' under the two models. One interesting exercise is to change the values of the `vector_size` and `window` parameters and observe how the cosine similarities vary; a small sweep is sketched below.
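A hypothetical parameter sweep, reusing the tokenized `data` list from the sketch above:

```python
from gensim.models import Word2Vec

# `data` is the tokenized corpus built in the sketch above.
for vector_size in (50, 100, 200):
    for window in (2, 5, 10):
        model = Word2Vec(data, min_count=1,
                         vector_size=vector_size, window=window)
        print(f"vector_size={vector_size:>3}  window={window:>2}  "
              f"sim(alice, wonderland)={model.wv.similarity('alice', 'wonderland'):.4f}")
```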
Applications of Word Embedding:
- Sentiment Analysis
- Speech Recognition
- Information Retrieval
- Question Answering