NLP | Distributed Tagging with Execnet – Part 2
The gateway’s remote_exec() method takes a single argument that can be one of the following three types:
- A string of code to execute remotely
- The name of a pure function that will be serialized and executed remotely
- The name of a pure module whose source will be executed remotely
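As a rough illustration of the function form, the sketch below contrasts a pure function with one that is not. The names count_tokens, not_pure, and GREETING are hypothetical and are not part of execnet or of remote_tag.py.

```python
def count_tokens(channel):
    # Pure: everything it needs arrives through the channel argument.
    for sentence in channel:
        channel.send(len(sentence))


GREETING = "hello"  # module-level state that exists only on the local side


def not_pure(channel):
    # Not pure: it depends on GREETING, which is defined outside the
    # function and would not travel with the function's source.
    channel.send(GREETING)
```

Assuming a gateway created with gw = execnet.makegateway(), the first function could be passed as gw.remote_exec(count_tokens), while the string form would look like gw.remote_exec("channel.send(1 + 1)"). The second function works locally but is not suitable for remote execution, because only the function itself is serialized.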
Code : Using the remote_tag.py module with three options
What is a pure module?
- A pure module is a module that is self-contained: it can only access Python modules that are available where it executes and does not have access to any variables or states that exist wherever the gateway is initially created.
- Similarly, a pure function is a self-contained function, with no external dependencies.
- To detect that the module is being executed by execnet, check the __name__ variable. If it is equal to '__channelexec__', then the module is being used to create a remote channel.
- This is similar to checking if __name__ == '__main__' to see whether a module is being executed on the command line.
- The first step is to call channel.receive() to get the serialized tagger, which is then loaded using pickle.loads().
- Notice that channel is not imported anywhere; that is because it is included in the global namespace of the module. Any module that execnet executes remotely has access to the channel variable in order to communicate with the channel creator.
- Once the tagger is loaded, iterate over the channel and tag() each tokenized sentence received from it.
- This lets the node tag as many sentences as the sender wants to send, because iteration does not stop until the channel is closed.
- So, a compute node for part-of-speech tagging is created that dedicates 100% of its resources to tagging whatever sentences it receives. As long as the channel remains open, the node is available for processing.
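The steps above can be sketched as a small module in the spirit of remote_tag.py. This is a minimal sketch, not the original listing: the helper name serve is hypothetical and was introduced only so the loop can be exercised without execnet.

```python
import pickle


def serve(channel):
    # The first message is expected to be the pickled tagger.
    tagger = pickle.loads(channel.receive())
    # Iterating over the channel waits for the next sentence and
    # ends only when the sender closes the channel.
    for sentence in channel:
        channel.send(tagger.tag(sentence))


if __name__ == '__channelexec__':
    # execnet places `channel` in the module's globals; it is never imported.
    serve(channel)
```

When the file is run directly or imported normally, __name__ is not '__channelexec__', so nothing happens; the tagging loop starts only when execnet executes the module remotely.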
Execnet can do a lot more, such as opening multiple channels to increase parallel processing, as well as opening gateways to remote hosts over SSH to do distributed processing.
Creating multiple channels
Multiple channels are created, one per gateway, to make the processing more parallel. Each gateway creates a new subprocess (or a remote interpreter if using an SSH gateway), and one channel per gateway is used for communication. Once two channels are created, they can be combined using the MultiChannel class, which allows the user to iterate over the channels and to make a receive queue that collects messages from every channel.
After creating each channel and sending the tagger, the channels are cycled through so that an equal number of sentences is sent to each channel for tagging. Then, all the responses are collected from the queue. Each call to queue.get() returns a 2-tuple of (channel, message), which is useful when it is necessary to know which channel a message came from. Once all the tagged sentences have been collected, the gateways can be exited.
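The send-and-collect loop described above might look like the following sketch. The function name round_robin_tag and its parameters are hypothetical; the code assumes only that each channel has a send() method and that the queue's get() returns (channel, message) pairs, as described in the text.

```python
import itertools


def round_robin_tag(channels, queue, sentences):
    """Send sentences to the channels in turn, then collect one reply each."""
    turn = itertools.cycle(channels)
    sent = 0
    for sentence in sentences:
        # Each sentence goes to the next channel in the cycle.
        next(turn).send(sentence)
        sent += 1
    # queue.get() yields (channel, message); only the message is kept here.
    return [queue.get()[1] for _ in range(sent)]
```

With execnet, channels would be an execnet.MultiChannel built from the per-gateway channels, and queue would be the object returned by its make_receive_queue() method. Replies arrive in whatever order the nodes finish, so the returned list is not guaranteed to match the order of the input sentences.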
Length : 4
In the example code, only four sentences are sent, but in practice one may need to send thousands. A single computer can tag four sentences very quickly, but when thousands or hundreds of thousands of sentences need to be tagged, sending sentences to multiple computers can be much faster than waiting for a single computer to do it all.