Open In App

Analytical Approach to optimize Multi Level Cache Performance

Last Updated : 21 Jan, 2021
Like Article

Prerequisite – Multilevel Cache Organization

The execution time of a program is the product of the total number of CPU cycles needed to execute a program. For a memory system with a single level of caching, the total cycle count is a function of the memory speed and the cache miss ratio.

m(C) = f(S, C, A, B) 
m(C) = Cache miss ratio
C = Cache Size
A = Associativity
S = Number of sets
B = Block Size

The portion of a particular cache that is attributable to the instruction data’s main memory is NMM and can be expressed in the form of a linear first-order function.

N_{total} = g(m(C), B, LA, TR, t_{CPU})\newline

The most significant of these is the latency (LA), between the start of memory fetch and transferring of the requested data. The number of cycles spent waiting on main memory is denoted by LA, TR is the transfer rate which is the maximum rate at which data can be transferred also called the Baud rate.

As the above image illustrates the aforementioned linear relationship between the miss rate and the total cycle count. This makes it clear that as the organizational parameters increase in value the miss rate declines asymptomatically. This, in turn, causes the total cycle count to decline.

The most difficult relationship to quantify is between the cycle time of the CPU and the parameters that govern it because it depends on the lowest levels of implementations. An equation for it may look something like.

t_{CPU} = h(C,S,A,B)[3]\\

Now the problem at hand becomes to relate system-level performance to the various cache and memory parameters.

Total execution time is a product of the cycle time and total cycle count.

T_{total} = t_{CPU}\times N_{total} = t_{L1}(C) \times N_{total} [4]\\

To obtain the minimum execution time we find the partial derivative with respect to some variable is equal to zero.

\frac{1}{t_{L1}}\times \frac{\partial t_{L1}}{\partial C} = -\frac{1}{N_{total}}\times \frac{\partial N_{total}}{\partial C}\\

For non-continuous discrete variables, the above equation has a different form which is.

\frac{1}{ t_{L1}} \times \frac{\Delta t_{L1}}{\Delta C}=-\frac{1}{N_{total}}\times \frac{\Delta N_{total}}{\Delta C}[5]\\

If for one period then the change is performance neutral. But if the left-hand side is greater than the change increases overall execution time and if the right-hand side is more positive than there is a net gain in performance.

For a single level of caching, the total number of cycles is given by the number of cycles spent not performing memory references; plus the number of cycles spent doing instruction fetches, loads, and stores; plus the time spent waiting on main memory.

N_{total} = N_{Execute}+N_{Ifetch}\times m(C)\times \bar n_{MMread} +N_{load}+N_{Load}\times m(C)\times \bar n_{MMread} +N_{Store}\times \bar n_{L1write}[6]\\

Nexecute = Number of cycles spent not performing a memory reference

NIfetch = Number of instruction fetches in the program or trace

NLoad = Number of loads

NStore = Number of Stores

nMMread = Average number of cycles spent satisfying a read cache miss

nL1write = Average number of cycles spent dealing with a write to the cache for RISC machines with single-cycle execution the number of cycles in which neither reference is active NExecute is zero.

We can now put all the operands of equation 6 which happen in parallel as 1 because they cumulatively take that much amount of time only. Then equation 6 becomes.

N_{total} = n_{Read}(1+m_{Read}(C)\times \bar n_{MMread})+N_{Store}\times \bar n_{L1write}[7]\\

For a machine with single-cycle execution and Harvard architecture capable of parallel instruction data and data reference generation loads contribute to the cycle count only if they miss in the data cache.

N_{Total} = N_{Ifetch} +N_{Ifetch}\times m(\textbf C_{L1t})\times \bar n_{MMread} +N_{Load}\times m(\textbf C_{L1D}\times \bar n_{MMread}) +N_{store}\times (n_{L1write}-1)[8]\\

Similar Reads

Factors affecting Cache Memory Performance
Computers are made of three primary blocs. A CPU, a memory, and an I/O system. The performance of a computer system is very much dependent on the speed with which the CPU can fetch instructions from the memory and write to the same memory. Computers are using cache memory to bridge the gap between the processor's ability to execute instructions and
5 min read
Difference between High Level and Low level languages
Both High level language and low level language are the programming languages's types. The main difference between high level language and low level language is that, Programmers can easily understand or interpret or compile the high level language in comparison of machine. On the other hand, Machine can easily understand the low level language in
1 min read
Cache Coherence
Prerequisite - Cache Memory Cache coherence : In a multiprocessor system, data inconsistency may occur among adjacent levels or within the same level of the memory hierarchy. In a shared memory multiprocessor with a separate cache memory for each processor, it is possible to have many copies of any one instruction operand: one copy in the main memo
4 min read
Terminologies Cache Memory Organization
Cache Memory is a small, fast memory that holds a fraction of the overall contents of the memory. Its mathematical model is defined by its size, number of sets, associativity, block size, sub-block size, fetch strategy, and write strategy. Any node in the cache hierarchy can contain a common cache or two separate caches for instruction and or data.
4 min read
Simultaneous and Hierarchical Cache Accesses
Prerequisite : Cache Organization Introduction :In this article we will try to understand about Simultaneous Cache access as well as Hierarchical Cache Access in detail and also understand how these access actually works whenever CPU (Central Processing Unit) requests for Main Memory Block which is being stored currently in cache memory. Before jum
9 min read
Basic Cache Optimization Techniques
Generally, in any device, memories that are large(in terms of capacity), fast and affordable are preferred. But all three qualities can't be achieved at the same time. The cost of the memory depends on its speed and capacity. With the Hierarchical Memory System, all three can be achieved simultaneously. [caption width="800"]Memory Hierarchy[/captio
5 min read
Difference between Cache Coherence and Memory Consistency
1. Cache coherence :Cache coherence in computer architecture refers to the consistency of shared resource data that is stored in multiple local caches. When clients in a system maintain caches of a shared memory resource, problems with incoherent data can arise, which is especially true for CPUs in a multiprocessing system.In a shared memory multip
2 min read
Single Pass vs Two-Pass (Multi-Pass) Compilers
This article explores the concept of compiler passes in the field of software development, focusing on two types: the Single Pass Compiler and the Two-Pass Compiler (Multi-Pass Compiler). It explains their differences, advantages, and use cases, providing insights into the world of compiler design. What is a Compiler Pass?A Compiler pass refers to
4 min read
Debugging a machine level program
Debugging is the process of identifying and removing bug from software or program. It refers to identification of errors in the program logic, machine codes, and execution. It gives step by step information about the execution of code to identify the fault in the program. Debugging of machine code: Translating the assembly language to machine code
3 min read
Loop Level Parallelism in Computer Architecture
Since the beginning of multiprocessors, programmers have faced the challenge of how to take advantage of the power of process available. Sometimes parallelism is available but it is present in a form that is too complicated for the programmer to think about. In addition, there exists a large sequential code that has for years has incremental perfor
3 min read