NLP | Distributed Tagging with Execnet – Part 1
What is Execnet?
- Execnet is a distributed execution library for Python.
- It lets you create gateways and channels for remote code execution.
- A gateway is a connection from the calling process to a remote environment.
- The remote environment can be a local subprocess or an SSH connection to a remote node.
- A channel is created from a gateway and handles communication between the channel creator and the remote code.
- In this way, execnet is a kind of Message Passing Interface (MPI), where the gateway creates the connection and the channel is used to send messages back and forth.
Since many NLTK processes take 100% CPU during computation, execnet is an ideal way to distribute that computation for maximum resource usage. One gateway per CPU core can be created, and it doesn’t matter whether the cores are in the local computer or spread across remote machines. In many situations, the trained objects and data only need to exist on a single machine, which can send them to the remote nodes as needed.
Installing execnet:
It should be as simple as sudo pip install execnet or sudo easy_install execnet. The current version of execnet, as of this writing, is 1.2. The execnet home page, which has API documentation and examples, is at http://codespeak.net/execnet/.
How does it work?
The pickle module is imported so that the tagger can be serialized for transmission. Execnet does not natively know how to deal with complex objects such as a part-of-speech tagger, so the tagger is dumped to a string (a bytes object in Python 3) using pickle.dumps().
The default tagger used by the nltk.tag.pos_tag() function is used here, but any pre-trained part-of-speech tagger works as long as it implements the TaggerI interface. Once the tagger is serialized, execnet is started by making a gateway with execnet.makegateway().
The default gateway creates a Python subprocess, and the gateway's remote_exec() method can be called with the remote_tag module to create a channel. With an open channel, one can send over the serialized tagger, followed by the first tokenized sentence of the treebank corpus.
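The communication process is as follows: the calling process sends the pickled tagger and then a tokenized sentence over the channel, and the remote process sends the tagged sentence back.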
Calling channel.receive() then returns a tagged sentence that is equivalent to the first tagged sentence in the treebank corpus, which confirms that the tagging worked. Finally, exiting the gateway closes the channel and kills the subprocess.