NLP | Storing Frequency Distribution in Redis
The nltk.probability.FreqDist class is used in many classes throughout NLTK for storing and managing frequency distributions. It’s quite useful, but it’s all in-memory, and doesn’t provide a way to persist the data. A single FreqDist is also not accessible to multiple processes. All that can be changed by building a FreqDist on top of Redis.
What is Redis?
- Redis is a data structure server that is one of the more popular NoSQL databases.
- Among other things, it provides a network-accessible database for storing dictionaries (also known as hash maps).
- Building a FreqDist interface to a Redis hash map will allow us to create a persistent FreqDist that is accessible to multiple local and remote processes at the same time.
- Install both Redis and redis-py. The Redis website is at http://redis.io/ and includes many documentation resources.
- To use hash maps, install the latest version of Redis, which at the time of this writing is 2.8.9.
- The Redis Python driver, redis-py, can be installed using pip install redis or easy_install redis. The latest version at this time is 2.9.1.
- The redis-py home page is at http://github.com/andymccurdy/redis-py/.
- Once both are installed and a redis-server process is running, you’re ready to go. Let’s assume redis-server is running on localhost on port 6379 (the default host and port).
How does it work?
- The FreqDist class extends the standard library collections.Counter class, which makes a FreqDist a small wrapper with a few extra methods, such as N().
- The N() method returns the number of sample outcomes, which is the sum of all the values in the frequency distribution.
- An API-compatible class is created on top of Redis by extending a RedisHashMap class and then implementing the N() method.
- The RedisHashFreqDist class (defined in redisprob.py) sums all the values in the hash map for the N() method.
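To make the role of N() concrete, here is a minimal sketch that needs neither NLTK nor Redis. MiniFreqDist is a hypothetical stand-in written for this illustration, not NLTK's FreqDist; it only assumes what the points above state, namely that the class extends collections.Counter and that N() is the sum of all stored counts.

```python
from collections import Counter


class MiniFreqDist(Counter):
    """Hypothetical stand-in for nltk.probability.FreqDist (illustration only)."""

    def N(self):
        # Total number of sample outcomes: the sum of every count stored.
        return sum(self.values())


words = "the quick brown fox jumps over the lazy dog the end".split()
fd = MiniFreqDist(words)

print(fd["the"])  # 3  (count for a single sample)
print(len(fd))    # 9  (number of distinct samples)
print(fd.N())     # 11 (total number of outcomes)
```

A Redis-backed version has to produce the same total, except that the counts live in a Redis hash map instead of in process memory.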
Code : Explaining the working
This class can be used just like a FreqDist. To instantiate it, pass a Redis connection and the name of our hash map. The name should be a unique reference to this particular FreqDist so that it doesn’t clash with any other keys in Redis.
Most of the work is done in the RedisHashMap class, which extends collections.abc.MutableMapping (available as collections.MutableMapping in older Python versions) and then overrides all methods that require Redis-specific commands. The following list outlines each method that uses a specific Redis command:
- __len__(): This uses the hlen command to get the number of elements in the hash map
- __contains__(): This uses the hexists command to check if an element exists in the hash map
- __getitem__(): This uses the hget command to get a value from the hash map
- __setitem__(): This uses the hset command to set a value in the hash map
- __delitem__(): This uses the hdel command to remove a value from the hash map
- keys(): This uses the hkeys command to get all the keys in the hash map
- values(): This uses the hvals command to get all the values in the hash map
- items(): This uses the hgetall command to get a dictionary containing all the keys and values in the hash map
- clear(): This uses the delete command to remove the entire hash map from Redis
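The outline above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual contents of redisprob.py. FakeRedis is a hypothetical in-memory stand-in so that the example runs without a server; it mimics only the hash commands listed above and returns values as strings, as a redis-py client created with decode_responses=True would. The integer conversion in RedisHashFreqDist and the default of 0 for missing keys are also assumptions of this sketch.

```python
from collections.abc import MutableMapping


class FakeRedis:
    """Hypothetical in-memory stand-in for a redis-py client (hash commands only)."""

    def __init__(self):
        self._hashes = {}

    def _h(self, name):
        return self._hashes.setdefault(name, {})

    # Dummy versions of the Redis commands used by RedisHashMap below.
    def hlen(self, name):
        return len(self._h(name))
    def hexists(self, name, key):
        return key in self._h(name)
    def hget(self, name, key):
        return self._h(name).get(key)
    def hset(self, name, key, value):
        self._h(name)[key] = str(value)  # Redis stores values as strings
    def hdel(self, name, *keys):
        for key in keys:
            self._h(name).pop(key, None)
    def hkeys(self, name):
        return list(self._h(name))
    def hvals(self, name):
        return list(self._h(name).values())
    def hgetall(self, name):
        return dict(self._h(name))
    def delete(self, *names):
        for name in names:
            self._hashes.pop(name, None)


class RedisHashMap(MutableMapping):
    """Dict-like view of one Redis hash; r is the client, name is the hash key."""

    def __init__(self, r, name):
        self._r = r
        self._name = name

    def __len__(self):
        return self._r.hlen(self._name)

    def __contains__(self, key):
        return self._r.hexists(self._name, key)

    def __getitem__(self, key):
        return self._r.hget(self._name, key)  # None if the key is missing

    def __setitem__(self, key, val):
        self._r.hset(self._name, key, val)

    def __delitem__(self, key):
        self._r.hdel(self._name, key)

    def __iter__(self):
        return iter(self.keys())

    def keys(self):
        return self._r.hkeys(self._name)

    def values(self):
        return self._r.hvals(self._name)

    def items(self):
        return self._r.hgetall(self._name).items()

    def clear(self):
        self._r.delete(self._name)


class RedisHashFreqDist(RedisHashMap):
    """Adds N() and converts the stored strings back to integer counts."""

    def N(self):
        return sum(self.values())

    def __getitem__(self, key):
        return int(RedisHashMap.__getitem__(self, key) or 0)  # missing -> 0

    def values(self):
        return [int(v) for v in RedisHashMap.values(self)]

    def items(self):
        return [(k, int(v)) for k, v in RedisHashMap.items(self)]


r = FakeRedis()  # with a running server, a redis-py client would go here
rhfd = RedisHashFreqDist(r, "test")
print(len(rhfd))     # 0
rhfd["foo"] += 1     # reads 0 for the missing key, then stores 1
print(rhfd["foo"])   # 1
print(len(rhfd))     # 1
```

Because every method delegates to the client, nothing is cached in the Python object: two processes that open the same hash name on the same server see the same counts.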