
Basic Concepts of Optimizing for Parallelism And Locality

Last Updated : 18 Nov, 2022

In this article, we'll discuss some of the basic concepts in compiler design that can be used to exploit parallelism and locality. We'll start by looking at multiprocessor hardware, then move on to three ideas that drive these optimizations: loop-level parallelism, data locality, and affine transforms. Finally, we'll look at some techniques for exploiting these features in your own code so that it runs faster across multiple machines or on a single machine with multiple cores.

Multiprocessors:

Multiprocessors are a collection of processors that share memory. Multiprocessor systems have the advantage of being able to execute multiple operations in parallel, thus increasing performance. For example, if you want to perform several independent computations on some data at once, a single processor has to run them one after another, since each instruction must finish before the next can start, whereas a multiprocessor can assign the computations to different processors and run them at the same time.

Another setting where this kind of parallelism appears is in graphics processing units (GPUs). A GPU packs hundreds or thousands of simple cores onto a single device, so it can reach high performance on data-parallel work without needing a large number of CPU cores or extra support from the motherboard. Each GPU core is deliberately small and streamlined, which is what makes it practical to put so many of them on one chip.

Parallelism in Applications:

The parallel processing of a problem can be achieved by running different parts of the program on different processors. However, as the number of processors increases, so do the complexity and cost of building such an application. To overcome this, we need to design our algorithms so that they scale well as the level of parallelism grows. This means analyzing how many resources each task in the algorithm needs and deciding which tasks should run in parallel and which sequentially; the portion that must stay sequential limits the overall speedup no matter how many processors are added (Amdahl's Law), as the sketch below illustrates.
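To make Amdahl's Law concrete, here is a minimal sketch (the 90% parallel fraction is a made-up number, not taken from this article) that computes the best-case speedup of a program whose parallelizable fraction is p when it runs on n processors.

```c
#include <stdio.h>

/* Amdahl's Law: best-case speedup when a fraction p of the work
 * is perfectly parallelized across n processors. */
static double amdahl_speedup(double p, int n)
{
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void)
{
    double p = 0.9;   /* assume 90% of the program can run in parallel */
    for (int n = 1; n <= 64; n *= 2)
        printf("n = %2d  speedup = %.2f\n", n, amdahl_speedup(p, n));
    /* The speedup flattens out near 1 / (1 - p) = 10, no matter how
     * many processors are added. */
    return 0;
}
```

Even with 64 processors the speedup stays below 10, because the 10% sequential portion dominates once the parallel portion has been spread thin.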

In addition, there are theoretical limits on parallelism that may not be reached in practice for various reasons:

  • Poor data locality, i.e., a lack of locality in memory accesses.
  • Communication and synchronization costs between processors.

In today's world, many applications require the execution of multiple processes in parallel. Examples include web servers that serve many requests at once and multi-threaded applications that perform several tasks simultaneously (for example, a word processor that lets us keep typing while it renders the document on screen).

Loop-Level Parallelism:

At the level of the loop nest, there are several transformations for restructuring loops to take advantage of parallelism. Some of these are:

  • Loop interchange: Swapping the order of two loops in a loop nest without changing the result. A typical use is to make the loop that walks through memory contiguously the innermost one, so that each pass through the code reads data that is already close together.
  • Loop unrolling: Replicating the loop body several times and reducing the iteration count accordingly, which removes loop-control overhead and exposes independent operations to the scheduler. This can be done manually or automatically by the compile-time optimizer based on an analysis of the code structure.
  • Distribution (loop fission): Splitting one loop into several loops over the same index range, each containing part of the original body, so that the pieces can be scheduled independently, within one processor or across several (see the sketch after this section).

A parallelizing compiler (one that automatically detects parallel loops and rewrites the code for multiple processors) can also partition a loop so that each thread performs only a portion of the task. The three transformations above are sketched by hand in the example below.
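As a rough, hand-written sketch (the array names and sizes are invented for illustration), here is what interchange, unrolling, and distribution look like on simple loops; a parallelizing compiler would perform the same rewrites automatically after checking the data dependences.

```c
#define N 1024
double a[N][N], b[N][N], x[N], y[N];

/* Loop interchange: swap the nesting order so the inner loop walks
 * consecutive elements of the row-major array. */
void interchange(void)
{
    for (int i = 0; i < N; i++)        /* originally the inner loop */
        for (int j = 0; j < N; j++)    /* originally the outer loop */
            a[i][j] = b[i][j] * 2.0;
}

/* Loop unrolling: replicate the body to cut loop overhead and expose
 * independent operations to the instruction scheduler. */
void unroll(void)
{
    for (int i = 0; i < N; i += 4) {
        x[i]     = y[i]     + 1.0;
        x[i + 1] = y[i + 1] + 1.0;
        x[i + 2] = y[i + 2] + 1.0;
        x[i + 3] = y[i + 3] + 1.0;
    }
}

/* Loop distribution (fission): split one loop into two loops over the
 * same index range, so each piece can be scheduled independently. */
void distribute(void)
{
    /* Originally a single loop executed both statements per iteration. */
    for (int i = 0; i < N; i++) x[i] = x[i] * 0.5;
    for (int i = 0; i < N; i++) y[i] = y[i] + x[i];   /* still sees the halved x */
}
```

Each rewrite preserves the result only because no data dependence crosses the reordered iterations, which is exactly what the compiler has to verify first.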

Data Locality:

The data locality of a program is the extent to which its data references are clustered together in memory and in time. Data locality is important for performance because it allows the processor to serve most accesses from a cache line it has already loaded, rather than going back to main memory.

Data locality can be measured by looking at how frequently your application references a particular piece of data, function, or instruction, and how close together those references are in memory. For example, if a group of values is always used together, storing them next to each other means that one cache line (or one page read from disk) brings all of them in at once, instead of having the accesses spread out over many different areas of memory.
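As a small illustration (the matrix and its size are hypothetical), the two functions below compute the same sum over a row-major C array, but the first walks memory in the order it is laid out and the second does not, so the first has far better data locality.

```c
#include <stddef.h>

#define N 1024
static double m[N][N];

/* Good locality: the inner loop visits m[i][0], m[i][1], ... which are
 * adjacent in memory, so most accesses hit an already-loaded cache line. */
double sum_row_major(void)
{
    double s = 0.0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            s += m[i][j];
    return s;
}

/* Poor locality: consecutive accesses are a full row apart in memory,
 * so nearly every access touches a different cache line. */
double sum_column_major(void)
{
    double s = 0.0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            s += m[i][j];
    return s;
}
```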

Introduction to Affine Transform Theory:

An affine transform is a mapping of the form f(x) = Ax + b: it takes a coordinate vector x, multiplies it by a constant matrix A, and adds a constant vector b, producing another set of coordinates. Affine transforms can be used to describe loop bounds and array subscripts and to re-order iterations for parallelism and locality, so we will use them in our discussion of compiler design.

Two properties of a loop nest make this framework useful:

The data dependence property (DDP) records which iterations access the same array element, with at least one of the accesses being a write. In a simple stencil computation, for example, each element depends only on its immediate neighbors, so the iterations that touch an index and its neighbors are tied together by dependences and must keep their relative order.

The iteration space property (ISP) describes the set of index vectors that the loop nest executes. When the loop bounds are affine expressions, this set is a dense, regular region: traversing it from beginning to end visits every iteration without skipping any along the way (i.e., no "holes").
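As a concrete (and hypothetical) example of these ideas, the loop nest below has the iteration space {(i, j) : 0 ≤ i < N, 0 ≤ j < N}, and every array subscript is an affine expression of the loop indices, which is what lets the compiler reason about the dependences exactly.

```c
#define N 256
double A[N][2 * N], B[N];

/* Each iteration of the nest is identified by its index vector (i, j);
 * the set of all such vectors is the iteration space. */
void affine_accesses(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            /* Both subscripts, i and i + j, are affine expressions:
             * a linear combination of the loop indices plus a constant.
             * That regularity is what dependence analysis relies on. */
            A[i][i + j] = B[i] + 1.0;
}
```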

Together, the DDP and the ISP give the compiler enough information to issue loads and stores in parallel without losing data. The only exception is when there are dependences between memory regions that the compiler cannot break; in that case it must wait until the dependent computations have finished before issuing the later load or store instruction.

In order to make parallelization possible, the compiler must be able to determine which parts of an array can be accessed independently and which cannot. This is done with dependence analysis: the analysis tells us whether there is any dependence between two indexes, and therefore whether the corresponding iterations can run in parallel or must stay in order.
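The (hypothetical) pair of loops below shows the two outcomes dependence analysis can report: in the first loop no two iterations touch the same element, so the iterations may run in parallel; in the second, iteration i reads the value written by iteration i - 1, a loop-carried dependence that forbids naive parallelization.

```c
#define N 1000
double a[N], b[N], c[N];

void independent_iterations(void)
{
    /* No two iterations access the same element, so a parallelizing
     * compiler may split this loop across threads or processors. */
    for (int i = 0; i < N; i++)
        a[i] = b[i] + c[i];
}

void carried_dependence(void)
{
    /* Iteration i reads a[i - 1], which the previous iteration wrote:
     * a loop-carried dependence, so the loop must stay sequential
     * unless it is rewritten (for example as a parallel prefix sum). */
    for (int i = 1; i < N; i++)
        a[i] = a[i - 1] + b[i];
}
```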

Compiler Techniques for Exploiting Parallelism and Locality:

To exploit parallelism and locality in a compiler, you should consider the following:

  • Using affine transform theory to partition data among processors.
  • Using affine transform theory to partition loops so that independent iterations run on different processors.
  • Using affine transform theory to partition the iteration space, for example by tiling it into cache-sized blocks (sketched below).
  • Using affine transform theory to partition the data space so that each processor mostly works on its own portion of the arrays.
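One common instance of partitioning the iteration space is loop tiling (blocking). The hand-written sketch below, with an assumed tile size of 64, reorders a matrix transpose so that each tile of the iteration space fits in cache; the tiles themselves could also be handed out to different processors.

```c
#define N 1024
#define TILE 64            /* assumed tile size; in practice tuned to the cache */
static double A[N][N], B[N][N];

void tiled_transpose(void)
{
    /* The outer two loops enumerate tiles of the iteration space;
     * the inner two loops cover one tile, whose data fits in cache. */
    for (int ii = 0; ii < N; ii += TILE)
        for (int jj = 0; jj < N; jj += TILE)
            for (int i = ii; i < ii + TILE; i++)
                for (int j = jj; j < jj + TILE; j++)
                    A[i][j] = B[j][i];   /* the strided walk of B is what tiling helps */
}
```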

Conclusion

The optimization techniques outlined in this article will help to make code as efficient as possible. In addition, they can be combined with other compiler optimizations to achieve even better results.

