NLP | Parallel list processing with execnet
  • Last Updated : 12 Jun, 2019

This article presents a pattern for processing a list in parallel with execnet: a map() function that maps each element of a list to a new value, using execnet to perform the mappings in parallel.

In the code given below, integers are simply doubled, but any pure computation could be performed instead. Shown first is the module that execnet will execute remotely: it receives a 2-tuple (i, arg), assumes arg is a number, and sends back (i, arg * 2).

Code : remote_double.py
# executed by execnet on the remote side;
# `channel` is injected by execnet for remote code
if __name__ == '__channelexec__':
    for (i, arg) in channel:
        channel.send((i, arg * 2))

To use this module to double every element in a list, import the plists module and call plists.map() with the remote_double module and a list of integers to double.

Code : Using plists
import plists, remote_double
plists.map(remote_double, range(10))

Output :

[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]

The map() function is defined in plists.py. It takes a pure module, a list of arguments, and an optional list of 2-tuples of the form (spec, count). The default specs are [('popen', 2)], which means two local gateways and channels will be opened. Once these channels are open, they are put into an itertools.cycle, which creates an infinite iterator that cycles back to the beginning once it reaches the end.
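The round-robin behaviour of itertools.cycle can be sketched without execnet at all. In this minimal illustration, plain strings stand in for the two channels that [('popen', 2)] would open (the names 'ch0' and 'ch1' are hypothetical stand-ins):

```python
import itertools

# two stand-in "channels", as with the default specs [('popen', 2)]
channels = ['ch0', 'ch1']
cyc = itertools.cycle(channels)

# distribute six arguments round-robin, as plists.map() does
assignments = [(next(cyc), arg) for arg in range(6)]
# each channel receives an almost even share of the arguments:
# [('ch0', 0), ('ch1', 1), ('ch0', 2), ('ch1', 3), ('ch0', 4), ('ch1', 5)]
```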

Now each argument in args can be sent to a channel for processing, and since the channels are cycled, each channel gets an almost even share of the arguments. This is where i comes in: the order in which results come back is unknown, so i, the index of each arg in the list, is passed to the channel and back, so that the results can be combined in the original order. The results are then collected with a MultiChannel receive queue and inserted into a prefilled list of the same length as args. Once all the expected results have arrived, the gateways are exited and the results are returned, as shown in the code given below.
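The index-based reordering described above can be sketched without execnet. Here the received list simulates (i, result) pairs arriving out of order, and the index i is used to restore the original ordering (the values are illustrative):

```python
args = [10, 20, 30, 40]
# pretend the (i, arg * 2) results came back shuffled
received = [(2, 60), (0, 20), (3, 80), (1, 40)]

# prefill a list of the same length as args, then slot
# each result into its original position using i
results = [None] * len(args)
for i, result in received:
    results[i] = result
# results is now [20, 40, 60, 80], matching the order of args
```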

Code : plists.py
import itertools, execnet

def map(mod, args, specs=[('popen', 2)]):
    gateways = []
    channels = []

    # open the requested number of gateways and
    # a remote_exec channel on each one
    for spec, count in specs:
        for i in range(count):
            gw = execnet.makegateway(spec)
            gateways.append(gw)
            channels.append(gw.remote_exec(mod))

    # cycle the channels so the arguments are
    # distributed almost evenly among them
    cyc = itertools.cycle(channels)

    for i, arg in enumerate(args):
        channel = next(cyc)
        channel.send((i, arg))

    mch = execnet.MultiChannel(channels)
    queue = mch.make_receive_queue()
    l = len(args)
    # creates a list of length l,
    # where every element is None
    results = [None] * l

    # results may arrive in any order;
    # the index i restores the original order
    for _ in range(l):
        channel, (i, result) = queue.get()
        results[i] = result

    for gw in gateways:
        gw.exit()
    return results

Code : Increasing the parallelization by modifying the specs

plists.map(remote_double, range(10), [('popen', 4)])

Output :

[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]

However, more parallelization does not necessarily mean faster processing: it depends on the available resources, and the more gateways and channels that are opened, the more overhead is incurred. Ideally, there should be one gateway and channel per CPU core to get maximum resource utilization. plists.map() can be used with any pure module, as long as it receives and sends back 2-tuples where i is the first element. This pattern is most useful when there is a batch of numbers to crunch that needs to be processed as quickly as possible.
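The "one gateway and channel per CPU core" guideline could be expressed with a small helper that sizes the specs from the core count. This is a sketch, and specs_per_core is a hypothetical name, not part of execnet or plists:

```python
import multiprocessing

def specs_per_core():
    # one local 'popen' gateway per CPU core, matching the
    # guideline of one gateway and channel per core
    return [('popen', multiprocessing.cpu_count())]

# e.g. plists.map(remote_double, range(10), specs_per_core())
```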
