Chapter 11

Cache: The CPU's Cheat Sheet

Cache: The CPU's Cheat Sheet

Before an exam, you don't haul the entire library back to your dorm. You copy the most important formulas onto a single cheat sheet. Consulting that sheet is vastly faster than running back to the library every time you need a number. The CPU does exactly the same thing with memory.

CPU computation speeds have doubled roughly every few years, but main memory access speeds have improved far more slowly. A modern 3 GHz CPU can execute an addition in about 0.3 nanoseconds. Fetching a value from main RAM takes 60–100 nanoseconds—a gap of 100 to 300×. If the CPU had to wait for RAM on every operation, it would spend 99% of its time idle. Cache exists to bridge that gap.

Core Concepts

The Principle of Locality: Why Cheat Sheets Work

Cache is only useful because of a key observation about real program behavior—the principle of locality:

These patterns hold across nearly all real programs. Cache exploits both: it keeps "recently used" and "nearby" data in fast storage, ready for the next access.

L1/L2/L3: Three Tiers of Cache

Modern CPUs have not one but three levels of cache, each trading speed for size:

┌──────────────────────────────────────────────────────────────┐
│                    CPU Core                                   │
│   ┌──────────────┐                                           │
│   │   Registers  │  Capacity: < 1 KB   Latency: ~0.3 ns (1 cycle)  │
│   └──────┬───────┘                                           │
│   ┌──────▼───────┐                                           │
│   │   L1 Cache   │  Capacity: 32-64 KB  Latency: ~1 ns (4 cycles)  │
│   │ (per-core)   │  Usually split: I-Cache (instructions) + D-Cache (data)│
│   └──────┬───────┘                                           │
│   ┌──────▼───────┐                                           │
│   │   L2 Cache   │  Capacity: 256 KB-1 MB  Latency: ~4 ns (12 cycles)│
│   │ (per-core)   │                                           │
│   └──────┬───────┘                                           │
└──────────┼────────────────────────────────────────────────────┘
    ┌──────▼───────┐
    │   L3 Cache   │  Capacity: 4-64 MB  Latency: ~30 ns (40 cycles)
    │ (shared by   │  Example: Intel Core i9-13900K has 36 MB L3
    │  all cores)  │
    └──────┬───────┘
    ┌──────▼───────┐
    │   Main RAM   │  Capacity: GB range  Latency: ~60-100 ns
    └──────────────┘
Level Typical Size Latency Bandwidth Notes
L1 Cache 32–64 KB 4 cycles ~1 TB/s Per-core, fastest
L2 Cache 256 KB–1 MB 12 cycles ~400 GB/s Per-core
L3 Cache 4–64 MB 40 cycles ~200 GB/s Shared by all cores
RAM 8–128 GB 200+ cycles ~50 GB/s Slow but large

Cache Lines: The Unit of Movement

When the CPU reads from memory, it never fetches just 1 byte. It loads the surrounding 64 bytes all at once. This 64-byte block is called a cache line.

Why? Because of spatial locality. If you access a[0], you're almost certainly about to access a[1] through a[15] (64 bytes / 4 bytes per int = 16 integers). Loading all 16 at once means the next 15 accesses come from cache instead of RAM.

Cache line (64 bytes):

Memory:  0x00                                    0x3F
         ┌──┬──┬──┬──┬──┬──┬──┬──┬── ... ──┬──┐
         │  │  │  │  │  │  │  │  │           │  │
         └──┴──┴──┴──┴──┴──┴──┴──┴── ... ──┴──┘
               ↑
         CPU accesses byte at 0x04 (say, a[1]).
         The entire 0x00–0x3F range is loaded into cache.

Cache Hit vs Cache Miss

Even a 1% miss rate can devastate performance. At 99% hit rate, average access time = 0.99 × 1 ns + 0.01 × 100 ns = 1.99 ns—nearly double the hit-only cost, just from that 1% misses.

Direct-Mapped vs Set-Associative

Cache is finite. How do memory addresses map to cache slots?

Direct-mapped: every memory address maps to exactly one cache slot (address mod cache size). Simple, but if two frequently used data blocks map to the same slot, they'll keep evicting each other—a pathological case called thrashing.

N-way set-associative: the cache is divided into sets, each with N slots. An address maps to one specific set but can occupy any of the N slots within it. Modern CPUs typically use 8-way L1 and 16-way L3. More ways means fewer conflicts; full associativity (any address can go anywhere) is too expensive to build at large sizes.

Direct-mapped (each address has exactly 1 possible home):
Address A → slot 3
Address B → slot 3   ← A and B fight over the same slot

4-way set-associative (each address has 4 possible homes):
Address A → set 1, any of 4 slots
Address B → set 1, any of 4 slots   ← A and B can coexist

Try It Yourself

This Python experiment shows the performance gap between cache-friendly and cache-hostile access patterns:

import time

N = 4096
matrix = [[0] * N for _ in range(N)]

# Row-major (spatial locality: consecutive memory access)
start = time.time()
for i in range(N):
    for j in range(N):
        matrix[i][j] += 1
print(f"Row-major:    {time.time() - start:.3f} s")

# Column-major (poor locality: jumps across memory on every step)
start = time.time()
for j in range(N):
    for i in range(N):
        matrix[i][j] += 1
print(f"Column-major: {time.time() - start:.3f} s")

# Typical result: column-major is 3–10× slower
# Reason: matrix[0][0], matrix[1][0], matrix[2][0]
# are 4096*4 = 16384 bytes apart — never in the same cache line

For a more dramatic result, try it in C:

#include <stdio.h>
#include <time.h>
#define N 4096

int matrix[N][N];

int main() {
    clock_t t;

    t = clock();
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            matrix[i][j]++;
    printf("Row-major:    %.3f s\n", (double)(clock()-t)/CLOCKS_PER_SEC);

    t = clock();
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            matrix[i][j]++;
    printf("Column-major: %.3f s\n", (double)(clock()-t)/CLOCKS_PER_SEC);
    // Column-major is typically 5–10× slower
    return 0;
}

🔬 Going Deeper

False sharing: the multicore cache trap

On a multi-core CPU, each core has its own L1 and L2 caches. But these caches operate at cache-line granularity (64 bytes). If two threads on different cores each write to different variables, but those variables happen to share a cache line—every write by one thread invalidates the other thread's cached copy of the entire line (per the MESI coherency protocol), forcing it to reload from memory. The two threads are constantly poisoning each other's cache, yet looking at the code you'd never know it. This is called false sharing, and it can reduce throughput by 10× or more. The fix is to pad frequently written variables so each occupies its own cache line. Java's @Contended annotation and Linux's ____cacheline_aligned_in_smp macro both address this exact problem.

Hardware prefetching: loading data before you ask

Modern CPUs include a hardware prefetcher that observes memory access patterns and predicts which addresses you'll need next, loading them into cache before you ask. Sequential array traversal triggers prefetching beautifully—by the time your loop reaches a[i+16], it's already in L1. Random access (pointer chasing through a linked list, hash table lookups) defeats the prefetcher completely, because it can't predict the next address. This is a major reason why linked lists underperform arrays so badly in practice, even when asymptotic complexity is the same.

Inspect your machine's cache topology

On Linux, getconf -a | grep CACHE or browsing /sys/devices/system/cpu/cpu0/cache/ shows you every level's size, type, and associativity. On macOS, sysctl -a | grep cache gives the same. Knowing your machine's exact cache parameters is the first step toward writing cache-aware code.

Where to learn more

Rate this chapter
4.8  / 5  (30 ratings)

💬 Comments