NLP | Distributed Tagging with Execnet – Part 1

What is Execnet?

  • Execnet is a distributed execution library for Python.
  • It lets you create gateways and channels for remote code execution.
  • A gateway is a connection from the calling process to a remote environment.
  • The remote environment can be a local subprocess or an SSH connection to a remote node.
  • A channel is created from a gateway and handles communication between the channel creator and the remote code.
  • In this way, execnet is a kind of Message Passing Interface (MPI), where the gateway creates the connection and the channel is used to send messages back and forth.

Since many NLTK processes use 100% of a CPU during computation, execnet is a good way to distribute that computation for maximum resource usage. One gateway can be created per CPU core, and it doesn’t matter whether the cores are in the local computer or spread across remote machines. In many situations, the trained objects and data only need to exist on a single machine, which can send them to the remote nodes as needed.
Installing execnet:
It should be as simple as running sudo pip install execnet or sudo easy_install execnet. The current version of execnet, as of this writing, is 1.2. The execnet home page, which has API documentation and examples, is at http://codespeak.net/execnet/.

How does it work?
The pickle module is imported to serialize the tagger so that it can be transmitted. Execnet does not natively know how to deal with complex objects such as a part-of-speech tagger, so the tagger is dumped to a serialized string (bytes in Python 3) using pickle.dumps().
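The serialization step can be tried without execnet or NLTK. In this small sketch, ToyTagger is a hypothetical stand-in for a trained tagger (it is not an NLTK class); it only shows that an object with a tag() method survives a pickle round trip.

```python
import pickle

class ToyTagger:
    """Hypothetical stand-in for a trained tagger; tags every token as NN."""
    def tag(self, tokens):
        return [(token, "NN") for token in tokens]

pickled = pickle.dumps(ToyTagger())  # bytes, suitable for channel.send()
restored = pickle.loads(pickled)     # what the receiving side would do

print(type(pickled).__name__)              # bytes
print(restored.tag(["execnet", "rocks"]))  # [('execnet', 'NN'), ('rocks', 'NN')]
```

Pickle stores a reference to the class rather than its code, so the receiving process must be able to import the tagger's class.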
This example uses the default tagger behind the nltk.tag.pos_tag() function, but any pre-trained part-of-speech tagger can be used as long as it implements the TaggerI interface. Once the tagger is serialized, execnet is started by making a gateway with execnet.makegateway().
The default gateway creates a Python subprocess. Calling the gateway's remote_exec() method with the remote_tag module creates a channel. With an open channel, the serialized tagger is sent over first, followed by the first tokenized sentence of the treebank corpus.
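The communication process looks like this: the caller sends the pickled tagger and then a tokenized sentence over the channel, and the remote module sends back the tagged sentence.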

Calling channel.receive() then returns a tagged sentence that is equal to the first tagged sentence in the treebank corpus, which confirms that the tagging worked. Finally, exiting the gateway closes the channel and kills the subprocess.
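The remote_tag module itself is not listed in this part. A minimal sketch of what such a module might contain is given here, assuming execnet's convention of running module source with __name__ set to '__channelexec__' and a channel object available as a global; the serve helper is a hypothetical name chosen for this example.

```python
import pickle

def serve(channel):
    """Unpickle the tagger from the first message, then tag each sentence."""
    tagger = pickle.loads(channel.receive())
    # Iterating over a channel yields messages until the channel is closed.
    for sentence in channel:
        channel.send(tagger.tag(sentence))

# When execnet executes this source remotely, it sets __name__ to
# '__channelexec__' and supplies `channel` in the global namespace.
if __name__ == '__channelexec__':
    serve(channel)
```

Keeping the loop in a function makes the protocol easy to exercise locally with any object that offers receive(), send(), and iteration.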




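import pickle

import execnet
import nltk.data
import nltk.tag
from nltk.corpus import treebank

import remote_tag  # local module whose source is executed on the remote side

# Serialize the default part-of-speech tagger so it can be sent over a channel.
# Note: nltk.tag._POS_TAGGER is a private constant and may not exist in newer NLTK releases.
pickled_tagger = pickle.dumps(nltk.data.load(nltk.tag._POS_TAGGER))

# The default gateway starts a local Python subprocess.
gw = execnet.makegateway()

# Run remote_tag in the subprocess, then send the tagger and one sentence.
channel = gw.remote_exec(remote_tag)
channel.send(pickled_tagger)
channel.send(treebank.sents()[0])

# Receive the tagged sentence and compare it with the corpus annotation.
tagged_sentence = channel.receive()
print(tagged_sentence == treebank.tagged_sents()[0])

# Exiting the gateway closes the channel and kills the subprocess.
gw.exit()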



Output:

True

