Finding the Odd Word amongst given words using Word2Vec embeddings

Last Updated : 27 Jan, 2021

The odd-one-out problem is one of the most interesting and common problems for testing an individual’s logical reasoning skills. It is often used in competitive exams and placement rounds because it checks analytical skills and decision-making ability. In this article, we will write Python code that finds the odd word in a given set of words.

Suppose we are given a set of words such as Apple, Mango, Orange, Party, Guava, and we have to find the odd word. A human can readily see that Party is the odd word, since all the other words are names of fruits, but this is difficult for a model to work out. Here, we will use Word2Vec with a pre-trained model named ‘GoogleNews-vectors-negative300.bin’, which Google trained on about 100 billion words from Google News. Each word in the pre-trained model is embedded in a 300-dimensional space, and words that are similar in context or meaning lie closer to each other in that space and have a high cosine similarity value.
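To make "cosine similarity" concrete, here is a minimal sketch using NumPy on made-up 3-dimensional vectors. The helper name `cosine` and the toy values are assumptions chosen for illustration only; the real embeddings have 300 dimensions.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy vectors (hypothetical values, not taken from the pre-trained model)
a = [1.0, 2.0, 0.0]
b = [2.0, 4.0, 0.0]   # same direction as a, different length
c = [0.0, 0.0, 3.0]   # orthogonal to a

print(cosine(a, b))   # very close to 1: the vectors point the same way
print(cosine(a, c))   # 0.0: the vectors share no direction
```

Because the measure depends only on direction, the length of a vector does not affect the score.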

Methodology to find out the odd word:

We first compute the average of all the given word vectors, then compare each word vector with this average vector using cosine similarity. The word with the lowest cosine similarity is the odd word.
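The methodology can be tried out before downloading the large pre-trained model. The following is a rough sketch on hand-made 3-dimensional vectors; the function name `odd_one_out` and all vector values are hypothetical and only illustrate the averaging and comparison steps.

```python
import numpy as np

def odd_one_out(words, vectors):
    """Return (odd_word, scores) given words and one row vector per word."""
    vectors = np.asarray(vectors, dtype=float)
    # Step 1: average of all word vectors
    mean_vector = vectors.mean(axis=0)
    # Step 2: cosine similarity of every row with the average vector
    scores = vectors @ mean_vector / (
        np.linalg.norm(vectors, axis=1) * np.linalg.norm(mean_vector)
    )
    # Step 3: the word with the lowest similarity is the odd one
    return words[int(np.argmin(scores))], scores

# Toy data (made-up values): three "fruit-like" vectors and one outlier
toy_words = ["apple", "mango", "party", "guava"]
toy_vectors = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.0, 0.1, 0.9],
    [0.9, 0.2, 0.0],
]

odd, scores = odd_one_out(toy_words, toy_vectors)
print(odd)  # party
```

The three similar vectors pull the average towards themselves, so the outlier ends up with the smallest score.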

Importing important libraries:

We need to install the gensim library to use the Word2Vec model. To install it, run ‘pip install gensim’ in your terminal/command prompt. The code also uses NumPy and scikit-learn.

Python3

import numpy as np
from gensim.models import KeyedVectors
from sklearn.metrics.pairwise import cosine_similarity

Loading the word vectors using the pre-trained model:

Python3

vector_word_notations = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin',binary=True)

Defining a function to predict the odd word:

Python3

def odd_word_out(input_words):
    '''Accept a list of words, print each similarity score, and return the odd word.'''

    # Look up the embedding of every word in the list
    whole_word_vectors = [vector_word_notations[i] for i in input_words]

    # Average vector of all the word vectors
    mean_vector = np.mean(whole_word_vectors, axis=0)

    # Iterate over every word and track the lowest similarity
    odd_word = None
    minimum_similarity = float('inf')  # cosine similarity never exceeds 1

    for i in input_words:
        # cosine_similarity returns a 1x1 array; [0][0] extracts the scalar
        similarity = cosine_similarity([vector_word_notations[i]], [mean_vector])[0][0]
        if similarity < minimum_similarity:
            minimum_similarity = similarity
            odd_word = i

        print("cosine similarity score between %s and mean_vector is %.3f" % (i, similarity))

    print("\nThe odd word is: " + odd_word)
    return odd_word
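Looking up a word that is missing from the model's vocabulary raises a `KeyError`, so it can help to separate known and unknown words before calling the function. Below is a minimal sketch of one way to do that. The helper name `split_known_words` is hypothetical, and a plain dictionary stands in for the loaded model so the example runs on its own; the same `in` membership test is expected to work on the loaded `KeyedVectors` object.

```python
def split_known_words(words, vectors):
    """Split words into (known, unknown) using a membership test on `vectors`."""
    known = [w for w in words if w in vectors]
    unknown = [w for w in words if w not in vectors]
    return known, unknown

# A plain dict as a stand-in for the word-vector model (made-up values)
toy_vectors = {
    "apple": [0.9, 0.1],
    "mango": [0.8, 0.2],
    "party": [0.1, 0.9],
}

known, unknown = split_known_words(["apple", "mango", "qwertyuiop", "party"], toy_vectors)
print(known)    # ['apple', 'mango', 'party']
print(unknown)  # ['qwertyuiop']
```

Only the `known` list would then be passed on, and the `unknown` list can be reported to the user. Note that the vocabulary is case-sensitive, so ‘paris’ and ‘Paris’ are different entries.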

Testing our model:

Python3

input_1 = ['apple', 'mango', 'juice', 'party', 'orange', 'guava']  # party is the odd word
odd_word_out(input_1)

Output:

cosine similarity score between apple and mean_vector is 0.765
cosine similarity score between mango and mean_vector is 0.808
cosine similarity score between juice and mean_vector is 0.688
cosine similarity score between party and mean_vector is 0.289
cosine similarity score between orange and mean_vector is 0.611
cosine similarity score between guava and mean_vector is 0.790

The odd word is: party

Similarly, for another example, let’s say:

Python3

input_2 = ['India', 'paris', 'Russia', 'France', 'Germany', 'USA']
# paris is the odd word since it is a capital city and the others are countries
odd_word_out(input_2)

Output:

cosine similarity score between India and mean_vector is 0.660
cosine similarity score between paris and mean_vector is 0.518
cosine similarity score between Russia and mean_vector is 0.691
cosine similarity score between France and mean_vector is 0.758
cosine similarity score between Germany and mean_vector is 0.763
cosine similarity score between USA and mean_vector is 0.564

The odd word is: paris
