NLP | Storing Frequency Distribution in Redis

The nltk.probability.FreqDist class is used in many classes throughout NLTK for storing and managing frequency distributions. It’s quite useful, but it’s all in-memory, and doesn’t provide a way to persist the data. A single FreqDist is also not accessible to multiple processes. All that can be changed by building a FreqDist on top of Redis.

What is Redis?

Redis is a data structure server that is one of the more popular NoSQL databases.
Among other things, it provides a network-accessible database for storing dictionaries (also known as hash maps).
Building a FreqDist interface to a Redis hash map will allow us to create a persistent FreqDist that is accessible to multiple local and remote processes at the same time.

Installation :

Install both Redis and redis-py. The Redis website is at http://redis.io/ and includes many documentation resources.
To use hash maps, install the latest version, which at the time of this writing is 2.8.9.
The Redis Python driver, redis-py, can be installed using pip install redis or easy_install redis. The latest version at this time is 2.9.1.
The redis-py home page is at http://github.com/andymccurdy/redis-py/.
Once both are installed and a redis-server process is running, you’re ready to go. Let’s assume redis-server is running on localhost on port 6379 (the default host and port).

How does it work?

The FreqDist class extends the standard library collections.Counter class, which makes a FreqDist a small wrapper with a few extra methods, such as N().
The N() method returns the number of sample outcomes, which is the sum of all the values in the frequency distribution.
An API-compatible class is created on top of Redis by extending a RedisHashMap and then implementing the N() method.
The RedisHashFreqDist (defined in redisprob.py) sums all the values in the hash map for the N() method.
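
To make the role of N() concrete before Redis enters the picture, here is a minimal in-memory sketch that needs only the standard library. The class name CounterFreqDist is hypothetical and merely stands in for FreqDist; it is not NLTK's implementation.

```python
from collections import Counter


class CounterFreqDist(Counter):
    """Hypothetical stand-in for FreqDist: a Counter plus N()."""

    def N(self):
        # Total number of sample outcomes = sum of all the counts.
        return sum(self.values())


fd = CounterFreqDist(["foo", "bar", "foo"])
print(fd["foo"])  # count for a single sample
print(fd["baz"])  # a Counter reports 0 for samples it has never seen
print(fd.N())     # total over all samples
```

The Redis-backed class has to reproduce both behaviours shown here: a count of 0 for unseen samples, and an N() that totals every stored count.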

Code : Explaining the working

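from rediscollections import RedisHashMap


class RedisHashFreqDist(RedisHashMap):
    def N(self):
        # Total number of sample outcomes: the sum of every stored count.
        return int(sum(self.values()))

    def __missing__(self, key):
        return 0

    def __getitem__(self, key):
        # A missing key yields a falsy value, so fall back to 0 before casting.
        return int(RedisHashMap.__getitem__(self, key) or 0)

    def values(self):
        # Values read back from Redis are not ints, so convert each one.
        return [int(v) for v in RedisHashMap.values(self)]

    def items(self):
        return [(k, int(v)) for (k, v) in RedisHashMap.items(self)]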

This class can be used just like a FreqDist. To instantiate it, pass a Redis connection and the name of our hash map. The name should be a unique reference to this particular FreqDist so that it doesn’t clash with any other keys in Redis.

Code:

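from redis import Redis
from redisprob import RedisHashFreqDist

r = Redis()  # connects to localhost:6379 by default
rhfd = RedisHashFreqDist(r, 'test')

print(len(rhfd))    # 0, provided the 'test' hash does not exist yet
rhfd['foo'] += 1
print(rhfd['foo'])
rhfd.items()        # returns the (key, count) pairs; nothing is printed here
print(len(rhfd))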

Output :

0
1
1
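
The line rhfd['foo'] += 1 is worth unpacking, because Python turns it into a read followed by a write. The sketch below uses a hypothetical in-memory class, TracingHash, that records each call; it is not part of redisprob.py and only illustrates why __getitem__ must return 0 for a key that does not exist yet.

```python
class TracingHash:
    """Hypothetical in-memory stand-in that records each mapping call."""

    def __init__(self):
        self._data = {}
        self.calls = []

    def __getitem__(self, key):
        self.calls.append(("get", key))
        # Same fallback as RedisHashFreqDist: a missing key counts as 0.
        return int(self._data.get(key) or 0)

    def __setitem__(self, key, value):
        self.calls.append(("set", key, value))
        self._data[key] = value

    def __len__(self):
        return len(self._data)


h = TracingHash()
h["foo"] += 1   # expands to: h["foo"] = h["foo"] + 1
print(h.calls)  # one read, then one write
print(len(h))
```

Because the read and the write are separate steps, two processes that increment the same key at the same moment could overwrite each other's update. Redis does provide an HINCRBY command (hincrby in redis-py) that increments a hash field atomically; the class shown in this article does not use it.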

Most of the work is done in the RedisHashMap class, which extends MutableMapping (collections.abc.MutableMapping in Python 3; the old collections.MutableMapping alias was removed in Python 3.10) and then overrides all methods that require Redis-specific commands. Outline of each method that uses a specific Redis command:

__len__(): This uses the hlen command to get the number of elements in the hash map
__contains__(): This uses the hexists command to check if an element exists in the hash map
__getitem__(): This uses the hget command to get a value from the hash map
__setitem__(): This uses the hset command to set a value in the hash map
__delitem__(): This uses the hdel command to remove a value from the hash map
keys(): This uses the hkeys command to get all the keys in the hash map
values(): This uses the hvals command to get all the values in the hash map
items(): This uses the hgetall command to get a dictionary containing all the keys and values in the hash map
clear(): This uses the redis-py delete() method (the Redis DEL command) to remove the entire hash map from Redis
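
The mapping from methods to commands listed above can be sketched as follows. This is an assumed shape, not the actual rediscollections.py source: SketchRedisHashMap and FakeRedis are hypothetical names, and FakeRedis is a small in-memory stand-in that mimics only the redis-py hash calls used here, so the example runs without a server. Passing a real redis.Redis connection instead should work the same way, except that values would typically come back as bytes.

```python
from collections.abc import MutableMapping


class FakeRedis:
    """Hypothetical in-memory stand-in for a redis-py client (values kept as str)."""

    def __init__(self):
        self._hashes = {}

    def hlen(self, name):
        return len(self._hashes.get(name, {}))

    def hexists(self, name, key):
        return key in self._hashes.get(name, {})

    def hget(self, name, key):
        return self._hashes.get(name, {}).get(key)  # None if missing

    def hset(self, name, key, value):
        self._hashes.setdefault(name, {})[key] = str(value)

    def hdel(self, name, key):
        return 1 if self._hashes.get(name, {}).pop(key, None) is not None else 0

    def hkeys(self, name):
        return list(self._hashes.get(name, {}))

    def hvals(self, name):
        return list(self._hashes.get(name, {}).values())

    def hgetall(self, name):
        return dict(self._hashes.get(name, {}))

    def delete(self, name):
        self._hashes.pop(name, None)


class SketchRedisHashMap(MutableMapping):
    """Assumed shape of a Redis-backed mapping; the real RedisHashMap may differ."""

    def __init__(self, r, name):
        self._r = r        # client object exposing the hash commands
        self._name = name  # key under which the hash lives in Redis

    def __len__(self):
        return self._r.hlen(self._name)

    def __contains__(self, key):
        return self._r.hexists(self._name, key)

    def __getitem__(self, key):
        return self._r.hget(self._name, key)

    def __setitem__(self, key, value):
        self._r.hset(self._name, key, value)

    def __delitem__(self, key):
        self._r.hdel(self._name, key)

    def __iter__(self):
        # Required by MutableMapping; iterate over the hash's keys.
        return iter(self.keys())

    def keys(self):
        return self._r.hkeys(self._name)

    def values(self):
        return self._r.hvals(self._name)

    def items(self):
        return self._r.hgetall(self._name).items()

    def clear(self):
        self._r.delete(self._name)


m = SketchRedisHashMap(FakeRedis(), "test")
m["foo"] = 1
m["bar"] = 2
print(len(m), "foo" in m, m["foo"])
del m["foo"]
print(m.keys())
m.clear()
print(len(m))
```

Note that values are read back as strings rather than integers, which is why RedisHashFreqDist converts them with int() in __getitem__(), values(), and items().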
