Cache Oblivious Algorithm

Cache-oblivious design is a way of obtaining algorithms that are efficient across arbitrary memory hierarchies without resorting to complicated multi-level memory models. A cache-oblivious algorithm performs an asymptotically optimal amount of work and moves data asymptotically optimally among the levels of the hierarchy, yet it never refers to hardware parameters such as the cache size or the cache-line length; it benefits from the processor cache only implicitly.

This article focuses on discussing the following topics:

What is a Cache-Oblivious Algorithm?
Cache-Oblivious Model
Justification of the Cache-Oblivious Model
Why use a Cache-Oblivious Algorithm?
Tall Cache Assumption
Examples of Cache-Oblivious Algorithms

Cache Oblivious Model

The cache-oblivious model is built so that algorithms are independent of hardware parameters, such as the size of the cache memory and the size of a cache line.

Features:

It is designed for a memory hierarchy, which separates computer storage into levels based on response time. The optimality guarantees of cache-oblivious algorithms are asymptotic, that is, they hold up to constant factors.
The aim of the cache-oblivious approach is to design cache-efficient and disk-efficient algorithms and data structures.
Cache-oblivious algorithms perform well on a multilevel memory hierarchy without knowing any of its parameters; they only rely on the fact that such a hierarchy exists. Because the analysis holds for every cache size and block size, it applies to every level of the hierarchy at once.
Cache-oblivious algorithms exist for many problems, for example array reversal, matrix transposition, searching, and sorting.

Justification of the Model

The Cache Oblivious Model can be justified based on the following points:

External Memory Model

The external memory model has a 2-level memory hierarchy consisting of a cache of size Z and a disk, with data transferred between the two in blocks of B elements. The disk is split up into blocks of B elements, and accessing one element on disk copies its entire block into the cache.
The cache is very close to the CPU but has limited space; the disk is far from the CPU and has a lot of space, but it is expensive to access. The cost of these accesses is known as the cache cost.
Access time differs greatly between the cache and the disk. A read operation fetches previously stored data, and a write operation stores a new value in memory. When the processor accesses a memory location whose block is already in the cache, this is a cache hit and the cache cost of the access is 0. If the block is not in the cache, it is a cache miss: the block is fetched from the disk and transferred into the cache (Z), and the cache cost is 1.
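The cost accounting above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the function name `count_misses` and its parameters are hypothetical, the cache is treated as fully associative, and LRU eviction stands in for the model's ideal replacement policy.

```python
from collections import OrderedDict


def count_misses(addresses, cache_size, block_size):
    """Count cache misses (cost 1 each) for a sequence of element addresses.

    Assumption: a fully associative cache holding cache_size // block_size
    blocks, with LRU eviction standing in for the ideal policy.
    """
    capacity = cache_size // block_size
    cache = OrderedDict()  # block id -> None, ordered from oldest to newest use
    misses = 0
    for addr in addresses:
        block = addr // block_size
        if block in cache:
            cache.move_to_end(block)  # cache hit: cost 0
        else:
            misses += 1  # cache miss: cost 1, the whole block is loaded
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used block
            cache[block] = None
    return misses


# A sequential scan of 1024 elements with B = 16 touches 1024 / 16 blocks.
sequential = count_misses(range(1024), cache_size=256, block_size=16)
# A scan with stride B lands in a new block on every access.
strided = count_misses(range(0, 1024 * 16, 16), cache_size=256, block_size=16)
print(sequential, strided)
```

The sequential scan pays one miss per block, while the strided scan pays one miss per element, which is the gap that cache-efficient algorithms try to close.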

Associative Cache:

A cache is generally characterized by three parameters:
- Associativity (A): the number of different frames (lines) of the cache in which a given block of main memory may reside. If a block from main memory (the disk) can reside in any frame, the cache is fully associative.
- Block (B): the minimum memory access size, that is, the amount of data transferred on a cache miss.
- Capacity (C): the total size of the cache.
A cache holds C bytes, but due to physical constraints it is divided into cache frames of size B. Together, these parameters determine how a particular cache behaves.
When a memory address is not in the cache (a cache miss), its block must be brought into the cache, and the cache must decide which frame the block is mapped to. The ideal cache model presumes full associativity: any block can be mapped to any location in the cache memory. Most real caches are instead 2-way, 4-way, 8-way, or 16-way set associative.
What is set-associative mapping?
- It is the mapping of a particular block of main memory (the disk, in our case) to a cache set. It is applied when a cache miss occurs.
- Cache lines are grouped into sets, where each set contains k lines.
- A block of main memory maps to exactly one set of the cache, and it can be placed in any free cache line of that set.
- The set to which a given main-memory block maps is given by:

Cache set number = [Main Memory block address] modulo [number of sets in the cache]
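As a worked instance of the formula above, the sketch below computes the set number for one address. The cache geometry (32 KiB, 64-byte lines, 8-way) and the address are hypothetical values chosen only for illustration.

```python
def cache_set_number(block_address, num_sets):
    """Set index = main-memory block address modulo number of sets."""
    return block_address % num_sets


# Hypothetical cache geometry, chosen only for illustration.
cache_bytes, line_bytes, ways = 32 * 1024, 64, 8
num_lines = cache_bytes // line_bytes  # total cache lines
num_sets = num_lines // ways           # each set holds `ways` lines

byte_address = 0x12345
block_address = byte_address // line_bytes  # which memory block the byte is in
set_index = cache_set_number(block_address, num_sets)
print(num_sets, block_address, set_index)
```

The block may then occupy any of the 8 lines in that one set, but no line in any other set.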

Optimal Cache Replacement Policy:

When a cache miss occurs, a new block is brought from main memory into a cache line.
If the cache is already full, an existing line must be evicted before the new block is fetched. The ideal cache model assumes an optimal replacement policy, namely Bélády's algorithm, which evicts the line whose next use lies furthest in the future. This policy cannot be implemented in reality because it requires knowing future accesses, which are unpredictable. Nonetheless, it can be approximated by practical policies such as Least Recently Used (LRU) and First In First Out (FIFO).
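To make the difference between the optimal policy and a practical one concrete, the following sketch counts misses for Bélády's policy and for LRU on a short trace of block ids. The function names and the trace are hypothetical, and the whole trace is assumed to be known in advance, which is exactly what a real cache cannot know.

```python
def belady_misses(trace, frames):
    """Misses under Belady's policy: evict the block reused furthest ahead."""
    cache, misses = set(), 0
    for i, block in enumerate(trace):
        if block in cache:
            continue
        misses += 1
        if len(cache) >= frames:
            def next_use(b):
                # Position of the next access to b, or infinity if none.
                try:
                    return trace.index(b, i + 1)
                except ValueError:
                    return float("inf")
            cache.remove(max(cache, key=next_use))
        cache.add(block)
    return misses


def lru_misses(trace, frames):
    """Misses under LRU: evict the block that was used least recently."""
    cache, misses = [], 0  # ordered from least to most recently used
    for block in trace:
        if block in cache:
            cache.remove(block)
        else:
            misses += 1
            if len(cache) >= frames:
                cache.pop(0)
        cache.append(block)
    return misses


trace = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]  # hypothetical block accesses
print(belady_misses(trace, 3), lru_misses(trace, 3))
```

On this trace with three frames, the optimal policy misses 7 times and LRU misses 10 times.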

Why use Cache-Oblivious Algorithm?

The cache-oblivious model was introduced so that algorithms and data structures can retain the benefits of cache-conscious design without being tuned to a specific cache.
The cache-oblivious model inherits properties of both the register machine, which has a small amount of fast storage, and the random-access machine (RAM) model.
But the two differ in some respects. A register is a quickly accessible storage location available to the processor of a computer system. Registers form a small, fast memory: the computer loads data from a larger memory into registers to perform arithmetic operations.
Cache-oblivious algorithms are nevertheless cache conscious in effect, because the cache-oblivious model analyses them against a hierarchy of cache memory.
The principle that ties all of this together is to design external memory algorithms without knowing the size of the cache or the size of the blocks in the cache.

Tall Cache Assumption

The ideal cache model makes an assumption called the “tall cache” assumption, which is used when calculating the cache complexity of an algorithm. It is expressed by the following relation:

Z = Ω(B²)

Here,
Z is the size of the cache.
B is the size of a cache line.
Ω denotes an asymptotic lower bound: Z is at least a constant multiple of B². Equivalently, the number of cache lines, Z/B, is at least proportional to the line size B, so the cache is “taller” than it is wide.
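A quick numeric check can make the assumption tangible. In this sketch the helper `is_tall`, the constant `c`, and the cache configuration are all hypothetical; since Ω hides a constant factor, the check uses c = 1 purely for illustration.

```python
def is_tall(cache_size, line_size, c=1.0):
    """Return True if cache_size >= c * line_size**2 (same unit for both)."""
    return cache_size >= c * line_size ** 2


# Hypothetical configuration: a 32 KiB cache with 64-byte lines.
Z, B = 32 * 1024, 64
lines = Z // B  # the "height" of the cache, to be compared with its width B
print(lines, B, is_tall(Z, B))
```

Here the cache has 512 lines of 64 bytes each, so it is comfortably taller than it is wide.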

Examples of Cache-Oblivious Algorithm

1. Array Reversal:

Array reversal reverses the elements of an array in place, without any extra storage. Bentley’s array-reversal algorithm runs two parallel scans, one from each end of the array. At each step, the two scans swap the elements at their current positions and then move one position towards the middle.

Advantages:
- The algorithm uses the same number of memory reads as a single scan, about N/B block transfers for N elements.
- It traverses the N elements one by one in the order they are stored, so every element of a fetched block is used.
Limitation:
- In practice, cache-oblivious algorithms can perform worse than RAM-based and cache-aware algorithms when the data fits in main memory.
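The two-scan reversal described above can be sketched as follows. The function name `reverse_in_place` is an illustrative choice, and a Python list stands in for a contiguous array.

```python
def reverse_in_place(a):
    """Reverse list `a` in place using two scans that move towards each other."""
    i, j = 0, len(a) - 1
    while i < j:
        a[i], a[j] = a[j], a[i]  # the two scan positions trade elements
        i += 1                   # left scan moves right
        j -= 1                   # right scan moves left
    return a


print(reverse_in_place([1, 2, 3, 4, 5]))
```

Both scans touch consecutive addresses, so nothing in the code depends on the cache size or the block size.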

2. Matrix Transpose:

Matrix transposition is defined as follows. Given an m × n matrix A stored in a row-major layout, compute A^T (A transpose) and store it in an n × m matrix B that is also stored in a row-major layout.
- The straightforward algorithm, which uses doubly nested loops, incurs Θ(mn) cache misses on one of the matrices once the matrices are too large to fit in the cache. One matrix is traversed row by row, but the other is traversed column by column, so its consecutive accesses land in different blocks.
- Each such cache miss loads one block from main memory into the cache.
- Optimal work and cache complexity can, however, be obtained with a divide-and-conquer strategy:

If n >= m, we partition A vertically and B horizontally:
A = (A₁, A₂),
B = (B₁; B₂), that is, B₁ stacked on top of B₂.

Then recursively execute TRANSPOSE(A₁, B₁) and TRANSPOSE(A₂, B₂). Similarly, if m > n, we divide matrix A horizontally and matrix B vertically and likewise perform the two transpositions recursively.
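The recursive partitioning described above can be sketched in Python as follows. This is a minimal illustration rather than a tuned implementation: the matrices are assumed to be flat row-major lists, and the `base` cutoff is a hypothetical constant that only stops the recursion early; it is not derived from any cache parameter.

```python
def transpose(A, B, m, n, r0=0, c0=0, rows=None, cols=None, base=4):
    """Write the transpose of the m x n row-major list A into the n x m list B.

    (r0, c0, rows, cols) describe the sub-block of A currently being handled.
    """
    if rows is None:
        rows, cols = m, n
    if rows * cols <= base:
        # Small block: transpose it directly.
        for i in range(r0, r0 + rows):
            for j in range(c0, c0 + cols):
                B[j * m + i] = A[i * n + j]
    elif cols >= rows:
        # Wide block: split the columns of A (and so the rows of B).
        half = cols // 2
        transpose(A, B, m, n, r0, c0, rows, half, base)
        transpose(A, B, m, n, r0, c0 + half, rows, cols - half, base)
    else:
        # Tall block: split the rows of A (and so the columns of B).
        half = rows // 2
        transpose(A, B, m, n, r0, c0, half, cols, base)
        transpose(A, B, m, n, r0 + half, c0, rows - half, cols, base)


m, n = 3, 5
A = list(range(m * n))
B = [None] * (m * n)
transpose(A, B, m, n)
print(B)
```

Unlike a tiled loop with a fixed block size, this recursion eventually produces sub-blocks that fit in a cache of any size, which is what makes the approach cache-oblivious.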

C++

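#include <iostream>
#include <vector>

#define BLOCK_SIZE 64

// In-place transpose of a square matrix (n == m), one BLOCK_SIZE x BLOCK_SIZE
// tile at a time. BLOCK_SIZE is a tuning parameter, so this tiled loop is
// cache-aware; the recursive TRANSPOSE above needs no such parameter.
void transpose(int n, int m, std::vector<std::vector<int>> &A)
{
    for (int i = 0; i < n; i += BLOCK_SIZE) {
        // Visit only the tiles on or above the diagonal
        for (int j = i; j < m; j += BLOCK_SIZE) {
            for (int k = i; k < i + BLOCK_SIZE && k < n; ++k) {
                for (int l = j; l < j + BLOCK_SIZE && l < m; ++l) {
                    // Swap each pair once; swapping it twice would undo it
                    if (l > k) {
                        int temp = A[k][l];
                        A[k][l] = A[l][k];
                        A[l][k] = temp;
                    }
                }
            }
        }
    }
}

int main()
{
    int n = 1024, m = 1024;
    std::vector<std::vector<int>> A(n, std::vector<int>(m, 0));

    // Initialize the matrix with some values
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < m; ++j) {
            A[i][j] = i * m + j;
        }
    }

    transpose(n, m, A);

    // After the transpose, A[0][1] holds the old A[1][0], i.e. 1 * m + 0
    std::cout << A[0][1] << std::endl;

    return 0;
}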

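C

// C code for implementing above approach
#include <stdio.h>
#include <stdlib.h>

#define BLOCK_SIZE 64

// In-place transpose of a square matrix (n == m), one BLOCK_SIZE x BLOCK_SIZE
// tile at a time. BLOCK_SIZE is a tuning parameter, so this loop is cache-aware.
void transpose(int n, int m, int A[n][m])
{
    for (int i = 0; i < n; i += BLOCK_SIZE) {
        // Visit only the tiles on or above the diagonal
        for (int j = i; j < m; j += BLOCK_SIZE) {
            for (int k = i; k < i + BLOCK_SIZE && k < n; ++k) {
                for (int l = j; l < j + BLOCK_SIZE && l < m; ++l) {
                    // Swap each pair once; swapping it twice would undo it
                    if (l > k) {
                        int temp = A[k][l];
                        A[k][l] = A[l][k];
                        A[l][k] = temp;
                    }
                }
            }
        }
    }
}

// Driver's code
int main(int argc, char* argv[])
{
    int n = 1024, m = 1024;

    // Allocate on the heap: a 4 MB local array could overflow the stack
    int (*A)[m] = malloc(sizeof(int[n][m]));
    if (A == NULL) {
        return 1;
    }

    // Initialize the matrix with some values
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < m; ++j) {
            A[i][j] = i * m + j;
        }
    }

    transpose(n, m, A);

    free(A);
    return 0;
}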

Java

import java.util.Arrays;

public class Main {

    // Define block size as a constant
    private static final int BLOCK_SIZE = 64;

    // In-place transpose of a square matrix (n == m) using block-wise
    // transposition. BLOCK_SIZE is a tuning parameter, so this is cache-aware.
    private static void transpose(int n, int m, int[][] A) {
        for (int i = 0; i < n; i += BLOCK_SIZE) {
            // Visit only the blocks on or above the diagonal
            for (int j = i; j < m; j += BLOCK_SIZE) {
                // Process each block within the matrix
                for (int k = i; k < i + BLOCK_SIZE && k < n; ++k) {
                    for (int l = j; l < j + BLOCK_SIZE && l < m; ++l) {
                        // Swap each pair across the diagonal exactly once;
                        // swapping it twice would undo the transpose
                        if (l > k) {
                            int temp = A[k][l];
                            A[k][l] = A[l][k];
                            A[l][k] = temp;
                        }
                    }
                }
            }
        }
    }

    public static void main(String[] args) {
        int n = 1024, m = 1024;
        int[][] A = new int[n][m];

        // Initialize the matrix with some values
        for (int i = 0; i < n; ++i) {
            for (int j = 0; j < m; ++j) {
                A[i][j] = i * m + j;
            }
        }

        // Perform matrix transposition
        transpose(n, m, A);
    }
}

Python3

def transpose(n, m, A):
    # In-place transpose of a square matrix (n == m), processed in
    # BLOCK_SIZE x BLOCK_SIZE blocks. BLOCK_SIZE is a tuning parameter,
    # so this loop is cache-aware.
    BLOCK_SIZE = 64

    for i in range(0, n, BLOCK_SIZE):
        # Visit only the blocks on or above the diagonal
        for j in range(i, m, BLOCK_SIZE):
            for k in range(i, min(i + BLOCK_SIZE, n)):
                for l in range(j, min(j + BLOCK_SIZE, m)):
                    # Swap A[k][l] and A[l][k] exactly once;
                    # swapping the pair twice would undo the transpose
                    if l > k:
                        A[k][l], A[l][k] = A[l][k], A[k][l]


def main():
    n, m = 1024, 1024
    A = [[0] * m for _ in range(n)]

    # Initialize the matrix with some values
    for i in range(n):
        for j in range(m):
            A[i][j] = i * m + j

    # Call the transpose function
    transpose(n, m, A)

    return 0


if __name__ == "__main__":
    main()

C#

using System;

public class Program
{
    // Define block size as a constant
    private const int BLOCK_SIZE = 64;

    // In-place transpose of a square matrix (n == m) using block-wise
    // transposition. BLOCK_SIZE is a tuning parameter, so this is cache-aware.
    private static void Transpose(int n, int m, int[,] A)
    {
        for (int i = 0; i < n; i += BLOCK_SIZE)
        {
            // Visit only the blocks on or above the diagonal
            for (int j = i; j < m; j += BLOCK_SIZE)
            {
                // Process each block within the matrix
                for (int k = i; k < i + BLOCK_SIZE && k < n; ++k)
                {
                    for (int l = j; l < j + BLOCK_SIZE && l < m; ++l)
                    {
                        // Swap each pair across the diagonal exactly once;
                        // swapping it twice would undo the transpose
                        if (l > k)
                        {
                            int temp = A[k, l];
                            A[k, l] = A[l, k];
                            A[l, k] = temp;
                        }
                    }
                }
            }
        }
    }

    public static void Main(string[] args)
    {
        int n = 1024, m = 1024;
        int[,] A = new int[n, m];

        // Initialize the matrix with some values
        for (int i = 0; i < n; ++i)
        {
            for (int j = 0; j < m; ++j)
            {
                A[i, j] = i * m + j;
            }
        }

        // Perform matrix transposition
        Transpose(n, m, A);
    }
}
// This code is contributed by Utkarsh

Javascript

// Define the block size constant
const BLOCK_SIZE = 64;

// In-place transpose of a square matrix (n === m), block by block.
// BLOCK_SIZE is a tuning parameter, so this loop is cache-aware.
function transpose(n, m, A) {
    // Iterate through the matrix blocks on or above the diagonal
    for (let i = 0; i < n; i += BLOCK_SIZE) {
        for (let j = i; j < m; j += BLOCK_SIZE) {
            // Iterate within each block
            for (let k = i; k < i + BLOCK_SIZE && k < n; ++k) {
                for (let l = j; l < j + BLOCK_SIZE && l < m; ++l) {
                    // Swap each pair across the diagonal exactly once;
                    // swapping it twice would undo the transpose
                    if (l > k) {
                        let temp = A[k][l];
                        A[k][l] = A[l][k];
                        A[l][k] = temp;
                    }
                }
            }
        }
    }
}

// Main function
function main() {
    // Define matrix dimensions
    const n = 1024, m = 1024;

    // Create and initialize the matrix
    const A = Array.from({ length: n }, () => Array(m).fill(0));
    for (let i = 0; i < n; ++i) {
        for (let j = 0; j < m; ++j) {
            A[i][j] = i * m + j;
        }
    }

    // Call transpose function
    transpose(n, m, A);

    return A; // Return transposed matrix
}

// Call the main function
main();

Advantages:
- The divide-and-conquer algorithm is cache-optimal. By contrast, the iterative algorithm for matrix transposition causes Ω(n²) cache misses on an n × n matrix stored in row-major or column-major order, which is a factor of Θ(B) more cache misses than the cache-optimal algorithm.
- Improved cache locality and better use of memory bandwidth.
Limitation:
- The iterative algorithm, even though it is optimal in the RAM model and oblivious to the cache, is not asymptotically optimal with respect to cache misses.

3. Binary search tree (Divide And Conquer Algorithm):

A divide-and-conquer algorithm recursively reduces the problem size. At some level of the recursion the data fits into the cache (M), and eventually it fits into a single block or cache line. The analysis considers the exact point at which the data fits into the cache and the point at which it fits into a cache line. Perhaps surprisingly, it shows that the number of memory transfers is small in these cases.
- A good example of divide and conquer is searching in a binary search tree.
- At each node the search continues into only one subtree, the left or the right, so each node of the recursion tree has a single branch. This is commonly known as a degenerate form of divide and conquer.
- In this scenario, the cost of each leaf is balanced by the cost of the root node, so every level of the recursion costs the same.
Advantages:
- With the van Emde Boas layout, in which the nodes of a binary search tree are stored at recursively determined positions, a search incurs only O(log_B N) memory transfers, fewer than the O(log(N/B)) transfers of binary search on a sorted array.
Limitation:
- Bender’s set, a cache-oblivious data structure that uses a binary tree as an index to efficiently find the successor of a given value, has been reported to perform worst in both execution time and memory usage.
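To show what a van Emde Boas layout looks like, the sketch below lists the nodes of a complete binary tree in the order they would be stored. Several choices here are assumptions made for illustration: nodes are named by their heap index (root 1, children 2i and 2i + 1), the function name `veb_order` is hypothetical, and the top part of each split gets floor(height / 2) levels, which is one of several conventions.

```python
def veb_order(root, height):
    """Heap indices of a complete binary subtree in van Emde Boas order."""
    if height == 1:
        return [root]
    top_h = height // 2          # levels in the top subtree
    bottom_h = height - top_h    # levels in each bottom subtree
    order = veb_order(root, top_h)  # lay out the top subtree first
    # Roots of the bottom subtrees are the descendants of `root`
    # exactly top_h levels below it.
    first = root << top_h
    for r in range(first, first + (1 << top_h)):
        order += veb_order(r, bottom_h)  # then each bottom subtree, contiguously
    return order


print(veb_order(1, 4))
```

For a tree of height 4 this gives 1, 2, 3, then 4, 8, 9, then 5, 10, 11, and so on: each small subtree sits in adjacent memory, so a root-to-leaf search crosses few blocks whatever the block size is.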

4. Merge Sort:

Arranging data in a certain order is called sorting. For external memory algorithms, matching lower and upper bounds are known for sorting. In their paper, Aggarwal and Vitter proved that the number of memory transfers needed to sort in the comparison model is

Θ((N/B) log_{M/B}(N/B))

(M/B)-way merge sort is the external memory algorithm that attains the Aggarwal and Vitter bound. Here M denotes the cache size, written Z earlier. To understand the cache-oblivious setting, we first need to look at the external memory algorithm. During the merge, the cache holds one block with the current B elements of each list; when a block is exhausted, the next block of that list is loaded in its place. So a merge takes Θ(N/B) memory transfers, the cost of scanning through the data once.

The total number of memory transfers for this sorting algorithm satisfies: T(N) = (M/B) T(N/(M/B)) + Θ(N/B)
The recursion tree has Θ(N/B) leaves, for a total leaf cost of Θ(N/B).
The number of levels in the recursion tree is log_{M/B}(N/B), so the total cost is Θ((N/B) log_{M/B}(N/B)).

For a 2-way merge sort, the recursion tree is simply a binary tree.
In the cache-oblivious setting, the natural algorithm to try is the classic 2-way merge sort, since it does not need to know M or B, but then the recurrence becomes

T(N) = 2T(N/2) + Θ(N/B)

Advantages:
- 2-way merge sort is cache-oblivious: unlike the (M/B)-way external memory algorithm, it needs no knowledge of M or B.
- The optimal number of memory transfers to sort in the comparison model is Θ((N/B) log_{M/B}(N/B)).
Limitation:
- Merge sort incurs Ω((n/B) lg(n/Z)) cache misses for an input of size n, which is a factor of Θ(lg Z) more cache misses than the cache-optimal algorithms. Hence, merge sort is not in itself a cache-optimal algorithm.
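The gap between the two bounds can be made concrete by plugging in numbers. This sketch evaluates the leading terms only, ignoring the constants hidden by the Θ and Ω notation; the function names and the parameter values are hypothetical, with M playing the role of the cache size Z.

```python
import math


def multiway_transfers(N, M, B):
    """Leading term of the optimal bound: (N/B) * log_{M/B}(N/B)."""
    return (N / B) * math.log(N / B, M / B)


def two_way_transfers(N, M, B):
    """Leading term for 2-way merge sort: (N/B) * log2(N/M)."""
    return (N / B) * math.log2(N / M)


# Hypothetical sizes, measured in elements.
N, M, B = 2 ** 30, 2 ** 20, 2 ** 6
optimal = multiway_transfers(N, M, B)
two_way = two_way_transfers(N, M, B)
print(round(optimal), round(two_way), round(two_way / optimal, 2))
```

With these values the 2-way merge sort needs several times more memory transfers than the optimal bound, because its logarithm has base 2 rather than base M/B.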
