Loop Level Parallelism in Computer Architecture
Since the advent of multiprocessors, programmers have faced the challenge of taking advantage of the processing power available. Sometimes parallelism exists, but in a form too complicated for the programmer to reason about. In addition, there is a large body of sequential code that for years has enjoyed incremental performance improvements from advances in single-core execution. Automatic parallelization has long been seen as a good solution to some of these challenges, because it removes the programmer’s burden of understanding and expressing the parallelism present in the algorithm.
Loop-level parallelism in computer architecture extracts parallel tasks from loops in order to speed up execution. The opportunity for this parallelism often arises where data is stored in random-access data structures such as arrays. A sequential program iterates over the array and operates on one index at a time, whereas a program that exploits loop-level parallelism uses multiple threads or processes that operate on different indices at the same time.
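As a rough illustration of the contrast described above, the following minimal sketch runs the same per-element operation first sequentially and then across a pool of worker threads. The function `square` and the list `data` are hypothetical names chosen for this example. Note that CPython threads do not execute Python bytecode truly in parallel because of the global interpreter lock, so the sketch shows the structure of the technique rather than a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def square(value):
    # Hypothetical per-element operation; it depends only on its own input.
    return value * value


data = list(range(8))

# Sequential version: one index at a time, in order.
sequential = [square(v) for v in data]

# Loop-level parallel version: worker threads process different indices.
# pool.map returns the results in the same order as the input.
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = list(pool.map(square, data))
```

Both versions produce the same list; only the way the iterations are scheduled differs.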
Loop Level Parallelism Types:
- DO-ALL parallelism (Independent multithreading (IMT))
- DO-ACROSS parallelism (Cyclic multithreading (CMT))
- DO-PIPE parallelism (Pipelined multithreading (PMT))
1. DO-ALL parallelism (Independent multithreading (IMT)):
In DO-ALL parallelism, every iteration of the loop is executed in parallel and completely independently, with no inter-thread communication. The iterations are assigned to threads in a round-robin fashion. For example, with 4 cores, core 0 executes iterations 0, 4, 8, 12, and so on (see Figure). This type of parallelization is possible only when the loop contains no loop-carried dependencies, or when it can be transformed so that no conflicts occur between iterations executing simultaneously. Loops that can be parallelized in this way are likely to see speedups, since there is no inter-thread communication overhead. However, the lack of communication also limits the applicability of this technique, because many loops are not amenable to this form of parallelization.
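The round-robin assignment described above can be sketched as follows. This is a minimal illustration, not a definitive implementation: Python threads stand in for cores, and `NUM_CORES`, `NUM_ITERATIONS`, `worker`, and the doubling operation are all assumptions made for the example.

```python
import threading

NUM_CORES = 4        # assumed core count, matching the example in the text
NUM_ITERATIONS = 16  # assumed loop trip count

data = list(range(NUM_ITERATIONS))
result = [0] * NUM_ITERATIONS
assignment = {}      # core id -> list of iterations that core executed


def worker(core_id):
    mine = []
    # Round-robin: core k runs iterations k, k + NUM_CORES, k + 2 * NUM_CORES, ...
    for i in range(core_id, NUM_ITERATIONS, NUM_CORES):
        # Each iteration reads and writes only index i, so there is
        # no loop-carried dependency and no inter-thread communication.
        result[i] = data[i] * 2
        mine.append(i)
    assignment[core_id] = mine


threads = [threading.Thread(target=worker, args=(k,)) for k in range(NUM_CORES)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

After the threads finish, `assignment[0]` holds iterations 0, 4, 8, and 12, mirroring the core 0 example in the text.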
2. DO-ACROSS parallelism (Cyclic multithreading (CMT)):
DO-ACROSS parallelism, like independent multithreading, assigns iterations to threads in a round-robin manner. The optimization techniques used to increase parallelism in independent multithreading loops are also available in cyclic multithreading. In this technique, the compiler identifies the dependencies, and the start of each loop iteration is delayed until all dependencies on previous iterations are satisfied. In this manner, the parallel portion of one iteration is overlapped with the sequential portion of the subsequent iteration, which results in parallel execution. For example, in the figure the statement x = x->next; causes a loop-carried dependence, since it cannot be evaluated until the same statement has completed in the previous iteration. Once all cores have started their first iteration, this technique can approach linear speedup, provided the parallel part of the loop is large enough to keep the cores fully utilized.
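A minimal sketch of this scheme, using the linked-list dependence `x = x->next` mentioned above, might look like the following. Python threads stand in for cores, and the names `Node`, `build_list`, `work`, `node_for`, and `ready` are hypothetical. The sketch also assumes the trip count `n` is known in advance, which a real list traversal would not normally have.

```python
import threading


class Node:
    def __init__(self, value, next=None):
        self.value = value
        self.next = next


def build_list(values):
    head = None
    for v in reversed(values):
        head = Node(v, head)
    return head


def work(v):
    # Hypothetical parallel portion of the loop body.
    return v * v


NUM_CORES = 4
values = list(range(10))
n = len(values)  # assumed known trip count

# node_for[i] is the node iteration i works on; ready[i] signals it is available.
node_for = [None] * (n + 1)
ready = [threading.Event() for _ in range(n + 1)]
node_for[0] = build_list(values)
ready[0].set()
result = [None] * n


def worker(core_id):
    for i in range(core_id, n, NUM_CORES):
        ready[i].wait()             # delay until the previous iteration advanced x
        x = node_for[i]
        node_for[i + 1] = x.next    # sequential portion: x = x->next
        ready[i + 1].set()          # release the next iteration as early as possible
        result[i] = work(x.value)   # parallel portion overlaps later iterations


threads = [threading.Thread(target=worker, args=(k,)) for k in range(NUM_CORES)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Signalling the next iteration before calling `work` is the design choice that lets the parallel portion of one iteration overlap with the sequential portion of the next.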
3. DO-PIPE parallelism (Pipelined multithreading (PMT)):
DO-PIPE parallelism is a way to parallelize loops with cross-iteration dependencies. In this approach, the loop body is divided into a number of pipeline stages, with each stage assigned to a different core. Each iteration of the loop is then distributed across the cores, with each stage executed by the core to which that stage was assigned. An individual core executes only the code for its own stage. For instance, in the figure the loop body is divided into 4 stages: A, B, C, and D. Each iteration is distributed across all four cores, but each stage is executed by only one core.
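The four-stage arrangement described above can be sketched with one thread per stage and a queue between neighbouring stages. This is an illustrative sketch under stated assumptions: the threads stand in for cores, and the stage functions `stage_a` to `stage_d`, the helper `run_stage`, and the `DONE` sentinel are hypothetical names with made-up arithmetic.

```python
import queue
import threading

DONE = object()  # sentinel marking the end of the iteration stream


# Hypothetical stage bodies: the loop body split into stages A, B, C, and D.
def stage_a(x):
    return x + 1


def stage_b(x):
    return x * 2


def stage_c(x):
    return x - 3


def stage_d(x):
    return x * x


def run_stage(fn, inbox, outbox):
    # Each thread runs only the code of its own stage, for every iteration.
    while True:
        item = inbox.get()
        if item is DONE:
            outbox.put(DONE)  # pass the end marker down the pipeline
            return
        outbox.put(fn(item))


stages = [stage_a, stage_b, stage_c, stage_d]
queues = [queue.Queue() for _ in range(len(stages) + 1)]
threads = [
    threading.Thread(target=run_stage, args=(fn, queues[i], queues[i + 1]))
    for i, fn in enumerate(stages)
]
for t in threads:
    t.start()

for x in range(5):  # five loop iterations enter the pipeline in order
    queues[0].put(x)
queues[0].put(DONE)

for t in threads:
    t.join()

output = []
while True:
    item = queues[-1].get()
    if item is DONE:
        break
    output.append(item)
```

Because each queue is first-in, first-out and each stage has a single thread, the results leave the pipeline in iteration order while different iterations occupy different stages at the same time.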