Word Prediction using concepts of N-grams and CDF

Prerequisite – A basic understanding of CDF (cumulative distribution function) and N-grams.

Problem Statement – Given any input word and text file, predict the next n words that can occur after the input word in the text file.

Examples:



Input  :  is 
Output :  is it simply makes sure that there are never

Input  :  is
Output :  is split, all the maximum amount of objects, it

Input  :  the
Output : the exact same position. There will be some.

Note – To illustrate the example, I have assigned the variable corpus to some text. If you want to test on real-world text data, you can read the corpus from a text file instead (see the commented-out line in the code).

Solution – We can approach this problem using the concepts of probability. First, we calculate the frequency of every word that occurs immediately after the input in the text file. This is the n-gram part: each prediction looks only at the single word that follows the current context, so with a one-word input we are effectively counting bigrams, i.e. pairs of adjacent words. Then, using those frequencies, we calculate the CDF of these words and choose a random word from it. To choose this random word, we take a random number between 0 and 1 and find the smallest CDF value greater than or equal to that number. Each word occupies a slice of the interval [0, 1] whose width equals its relative frequency, so words that follow the input more often are chosen more often, while rarer words still have a chance of being picked.
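The sampling step can be sketched in isolation. The snippet below is a minimal illustration, not the article's full program: the follower counts in `freq` are made-up values, and `build_cdf` and `sample_from_cdf` are hypothetical helper names. It builds the running CDF with `itertools.accumulate` and uses `bisect.bisect_left` to locate the first CDF entry that is greater than or equal to the random number.

```python
import random
from bisect import bisect_left
from itertools import accumulate


def build_cdf(freq):
    """Return (words, cdf); cdf[i] is the cumulative probability up to words[i]."""
    words = list(freq)
    total = sum(freq.values())
    cdf = list(accumulate(freq[w] / total for w in words))
    return words, cdf


def sample_from_cdf(words, cdf, r):
    """Pick the first word whose CDF value is >= r (r is expected in [0, 1])."""
    # bisect_left returns the leftmost index i with cdf[i] >= r.
    i = bisect_left(cdf, r)
    # Guard against floating-point round-off leaving cdf[-1] slightly below 1.
    return words[min(i, len(words) - 1)]


# Made-up follower counts for some input word (assumption for illustration).
freq = {"it": 2, "unlikely": 1, "no": 1}
words, cdf = build_cdf(freq)
print(cdf)
print(sample_from_cdf(words, cdf, random.random()))
```

Here "it" owns half of the interval [0, 1], so it should be drawn about half of the time, while "unlikely" and "no" each own a quarter.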

After finding the CDF value, we can easily find the corresponding word and append it to the output string. Now, if you wish, you can append the word to the input string and send the whole string back to repeat the process, or you can just send the word that you found using the CDF. The code below uses the latter approach: only the newly predicted word is used to find the next one.
Note – You may get a different output each time you enter the same word, because the next word is drawn at random. How much the output varies depends on your data file: the more distinct words follow the input in the file, the more likely a different output becomes.
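The difference between the two ways of feeding the context back can be seen on a tiny example. The sketch below uses a hypothetical helper, `followers`, and a made-up token list; it is not part of the program that follows. A longer context matches fewer positions in the text, so its follower counts become sparser.

```python
from collections import Counter


def followers(tokens, context):
    """Count the words that appear immediately after `context` in `tokens`."""
    ctx = context.lower().split()
    n = len(ctx)
    counts = Counter()
    # Stop early enough that tokens[i + n] always exists.
    for i in range(len(tokens) - n):
        if [t.lower() for t in tokens[i:i + n]] == ctx:
            counts[tokens[i + n]] += 1
    return counts


# Made-up text (assumption for illustration).
tokens = "the same location but the same position and a different location".split()

print(followers(tokens, "same"))          # context = last word only
print(followers(tokens, "the same"))      # context = two words
print(followers(tokens, "but the same"))  # context = whole phrase so far
```

With the single word "same" as context, both "location" and "position" are candidates; with the full phrase "but the same", only "position" remains.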

Code for above algorithm

import random
from collections import Counter
  
# This function calculates the frequency of each word that
# immediately follows the given sentence (or word) in the
# corpus, ignoring case when matching.
  
def next_word_freq(array, sentence):
      
    sen_len, word_list = len(sentence.split()), []
      
    for i in range(len(array)):
  
        # If the words in the range (i, i + sen_len) match the sentence
        # and a following word exists in the corpus, append that
        # following word to word_list.

        if ' '.join(array[i : i + sen_len]).lower() == sentence.lower():

            if i + sen_len < len(array):

                word_list.append(array[i + sen_len])
  
    # Return the count of each word in word_list
      
    return dict(Counter(word_list))
  
# Calculate the CDF of each word in the
# Counter dictionary.
  
def CDF(d):
      
    prob_sum, sum_vals = 0, sum(d.values())
      
    for k, v in d.items():
  
        # Calculate the PMF of each word by dividing
        # the freq. by total of all frequencies then add
        # all the PMFs till ith word which is the CDF of
        # the ith word.
          
        pmf = v / sum_vals
        prob_sum += pmf
        d[k] = prob_sum
  
    # Return cdf dictionary
      
    return d
  
# The main function takes the input sentence/word (sent), the
# length of that input (x) and the total length wanted for the
# output (n). It builds the corpus and prints the generated text.
  
  
def main(sent, x, n):
  
    # This sample text is used here to illustrate the output.
    # To use a text file instead, uncomment the line below that
    # reads the corpus from a file and remove the sample text.
  
    # corpus = open('a.txt','r').read()
  
    corpus = '''text The chance is unlikely if not done programmatically. 
    However, imagine the game spawning multiple players at a spawn point,
    this would be the exact same location. I'm not quite sure what you 
    mean with spin,     what does the integer reflect? Why is it a 
    mismatch between data and structure? The structure does not
    assume a set amount of objects, it can be anything, that's why new
    nodes are created. It simply makes sure that there are not more than
    X leafs inside 1 node. The random is no option of course.
    My splitting algorithm always created the maximum amount of nodes 
    already, split over the current node. But I guess I have to change
    this behaviour? Actually, all the books have different authors. And
    most have a different location too. There will be some with the same
    location, but different authors, though. I think my library should be
    able to store books with the same position. There are never 
    equally-attractive leaf nodes. If a node is split, all childs will
    reflect a different part of the parent node.'''
      
    l = corpus.split()
  
    # "temp_out" will be used to store each partial sentence
    # which will later be stored into "sent". "out" is used to store
    # the final output.
      
    temp_out = ''
    out = sent + ' '
      
    for i in range(n - x):
  
        # calling the next_word_freq method that returns
        # the frequency of each word next to sent in the
        # whole word corpus.
          
        func_out = next_word_freq(l, sent)
  
        # cdf_dict stores the CDF of each word in the above map,
        # calculated using the CDF function.
          
        cdf_dict = CDF(func_out)
          
        # We use a random number in [0, 1] to predict the next word:
        # the predicted word is the first one whose CDF is greater
        # than or equal to rand.

        rand = random.uniform(0, 1)

        # If cdf_dict is empty, the word/sentence you entered has no
        # following word in the corpus. In that case, break the loop
        # and just print what has been generated so far.

        if not cdf_dict:
            break

        key, val = zip(*cdf_dict.items())
  
        # Iterate through the cdf values and find the smallest value
        # greater than or equal to the random number. That value is the
        # cdf of your predicted word. Default to the last word in case
        # floating-point rounding leaves the final cdf slightly below 1.
        # Add the chosen key to the output string and update "sent".

        pos = len(val) - 1

        for j in range(len(val)):

            if rand <= val[j]:
                pos = j
                break
                      
        temp_out = key[pos]
        out = out + temp_out + ' '
        sent = temp_out
          
    print(out, end = '\n\n')
  
if __name__ == '__main__':
  
    inp_sent = 'is'
    # The output will have up to 10 words, including the input
    # sentence/word, so pass the number of words in the input
    # rather than its character count.
    main(inp_sent, len(inp_sent.split()), 10)
  
# Code contributed by Gagan Talreja.

The concept shown above is used in fields like Natural Language Processing. This is a naive approach, meant only to illustrate the concept; many more algorithms exist for word prediction.
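As one possible refinement of the naive approach (an assumption of this sketch, not the article's method), the follower counts for every word can be collected in a single pass over the text, and the next word can be drawn with `random.choices`, which accepts relative weights directly. The names `build_table` and `generate` and the sample text are hypothetical.

```python
import random
from collections import Counter, defaultdict


def build_table(tokens):
    """Map each lower-cased word to a Counter of the words that follow it."""
    table = defaultdict(Counter)
    for current, nxt in zip(tokens, tokens[1:]):
        table[current.lower()][nxt] += 1
    return table


def generate(table, start, n_words, rng=random):
    """Return up to n_words words, beginning with `start`."""
    out = [start]
    word = start
    while len(out) < n_words:
        counts = table.get(word.lower())
        if not counts:  # no known continuation: stop early
            break
        # random.choices samples in proportion to the given weights.
        word = rng.choices(list(counts), weights=list(counts.values()), k=1)[0]
        out.append(word)
    return " ".join(out)


# Made-up text (assumption for illustration).
text = "a node is split and a node is full and a leaf is empty"
table = build_table(text.split())
print(generate(table, "is", 6, random.Random(0)))
```

Building the table once avoids rescanning the whole corpus for every predicted word, which is what the listing above does on each loop iteration.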



Know more about me at https://gagantalreja.github.io


Improved By : shubham_singh


