NLP | How to score words with Execnet and Redis

Distributed word scoring can be performed using Redis and Execnet together. In the non-distributed version, the information gain of each word in the movie_reviews corpus is calculated using a FreqDist and a ConditionalFreqDist. With Redis, the same computation can be performed using a RedisHashFreqDist and a RedisConditionalHashFreqDist, with the scores stored in a RedisOrderedDict. To get better performance out of Redis, Execnet is used to distribute the counting. After installing Redis and Execnet, an instance of redis-server must be running on localhost.

Steps:

  • For each label in the movie_reviews corpus (which has only pos and neg labels), build a list of (label, words) tuples.
  • From the dist_featx module, compute the word scores using score_words().
  • The total number of words is 39,767, and word_scores is an instance of RedisOrderedDict.
  • Get the top 1,000 words and inspect the top five using the keys() method to see what they are.
  • Delete the keys in Redis once everything required has been retrieved from word_scores, as the data is no longer needed.
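A RedisOrderedDict keeps its keys ordered by score in a Redis sorted set, which is what makes keys(end = 1000) return the top-scored words. As a rough in-memory analogue (OrderedScoreDict is a hypothetical class for illustration, not the actual rediscollections implementation, and its slicing semantics are an assumption):

```python
class OrderedScoreDict:
    """In-memory sketch of a RedisOrderedDict-like mapping:
    keys are returned ordered by descending score."""

    def __init__(self):
        self._scores = {}

    def __setitem__(self, key, score):
        self._scores[key] = score

    def __len__(self):
        return len(self._scores)

    def keys(self, start=0, end=-1):
        # order keys from highest score to lowest, similar to a
        # Redis sorted-set range query over the stored scores
        ordered = sorted(self._scores, key=self._scores.get, reverse=True)
        if end == -1:
            return ordered[start:]
        return ordered[start:end]
```

The real class issues Redis sorted-set commands instead of sorting in memory, so the ordering survives across processes.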

Code :



# importing libraries
from dist_featx import score_words
from nltk.corpus import movie_reviews
  
# getting the list of categories
category = movie_reviews.categories()
  
print ("Categories : ", category)
category_words = [
        (l, movie_reviews.words(categories = [l])) 
        for l in category]
  
# Scores
word_scores = score_words(category_words)
print ("Length : ", len(word_scores))
  
# top words
topn_words = word_scores.keys(end = 1000)
print ("Top Words : ", topn_words[0:5])
  
# Delete the keys in Redis after getting 
# all the required from word_scores
from redis import Redis
r = Redis()
print ([r.delete(key) for 
     key in ['word_fd', 'label_word_fd:neg',
             'label_word_fd:pos', 'word_scores']] )


Output :

Categories :  ['neg', 'pos']
Length : 39767
Top Words : [b'bad', b', ', b'and', b'?', b'movie']
[1, 1, 1, 1]

The score_words() function comes from the dist_featx module. Expect it to take a while to complete: the overhead of using Execnet and Redis means it runs significantly slower than a non-distributed, in-memory version of the function.

How does it work?
The dist_featx.py module contains the score_words() function, which does the following :

  • Opens gateways and channels.
  • Sends initialization data to each channel.
  • Sends each (label, words) tuple over a channel for counting.
  • Sends a done message to each channel.
  • Waits for a done reply back.
  • Closes the channels and gateways.
  • Calculates the score of each word based on the counts.
  • Stores the scores in a RedisOrderedDict.

Once the counting is finished, score_words() scores all the words and stores the results. The code is given below :
Code :


# importing libraries
import itertools, execnet, remote_word_count
from nltk.metrics import BigramAssocMeasures
from redis import Redis
from redisprob import RedisHashFreqDist, RedisConditionalHashFreqDist
from rediscollections import RedisOrderedDict
  
# Scoring the words
def score_words(category_words, 
                score_fn = BigramAssocMeasures.chi_sq, 
                host ='localhost', specs =[('popen', 2)]):
    gateways = []
    channels = []
      
    # opening gateways and channels
    for spec, count in specs:
        for i in range(count):
            gw = execnet.makegateway(spec)
            gateways.append(gw)
            channel = gw.remote_exec(remote_word_count)
            channel.send((host, 'word_fd', 'category_word_fd'))
            channels.append(channel)
              
    cyc = itertools.cycle(channels)
      
    # distributing (category, words) pairs round-robin across channels
    for category, words in category_words:
        channel = next(cyc)
        channel.send((category, list(words)))
          
    for channel in channels:
        channel.send('done')
        assert 'done' == channel.receive()
        channel.waitclose(5)
          
    for gateway in gateways:
        gateway.exit()
          
    r = Redis(host)
    # frequency distributions
    fd = RedisHashFreqDist(r, 'word_fd')
    cfd = RedisConditionalHashFreqDist(r, 'category_word_fd')
    word_scores = RedisOrderedDict(r, 'word_scores')
    n_xx = cfd.N()
      
    # the word loop must be nested inside the conditions loop
    # so that n_xi matches the current category
    for category in cfd.conditions():
        n_xi = cfd[category].N()

        for word, n_ii in cfd[category].iteritems():
            word = word.decode()
            n_ix = fd[word]

            if n_ii and n_ix and n_xi and n_xx:
                score = score_fn(n_ii, (n_ix, n_xi), n_xx)
                word_scores[word] = score
    # final word scores        
    return word_scores


The scoring method is only accurate when comparing exactly two labels; if there are more than two labels, a different scoring method should be used. How you store the word scores will depend on your requirements.
Once the remote module instance is running, two kinds of data can be received over the channel :

  1. A done message : it signals that there is no more data coming in over the channel. The module replies with another done message and finally exits the loop to close the channel.
  2. A 2-tuple of (label, words) : the module iterates over the words to increment counts in both the RedisHashFreqDist and the RedisConditionalHashFreqDist.
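The remote_word_count module itself is not shown in this article. Below is a minimal in-memory sketch of its message-handling loop, with a plain Counter and a dict of Counters standing in for the RedisHashFreqDist and RedisConditionalHashFreqDist; handle_messages is a hypothetical helper that processes a list of messages instead of an execnet channel:

```python
from collections import Counter, defaultdict

def handle_messages(messages):
    # stand-ins for the Redis-backed frequency distributions
    word_fd = Counter()                   # overall word counts
    label_word_fd = defaultdict(Counter)  # per-label word counts
    replies = []

    for data in messages:
        if data == 'done':
            # a done message: no more data is coming in,
            # so reply with done and exit the loop
            replies.append('done')
            break
        # otherwise a (label, words) 2-tuple: iterate over the
        # words and increment counts in both distributions
        label, words = data
        for word in words:
            word_fd[word] += 1
            label_word_fd[label][word] += 1

    return word_fd, label_word_fd, replies
```

In the real module the loop calls channel.receive() instead of iterating a list, and the counts go straight into Redis, so all channels share the same distributions.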

