Cache: The CPU's Cheat Sheet
Cache: The CPU's Cheat Sheet
Before an exam, you don't haul the entire library back to your dorm. You copy the most important formulas onto a single cheat sheet. Consulting that sheet is vastly faster than running back to the library every time you need a number. The CPU does exactly the same thing with memory.
CPU computation speeds have doubled roughly every few years, but main memory access speeds have improved far more slowly. A modern 3 GHz CPU can execute an addition in about 0.3 nanoseconds. Fetching a value from main RAM takes 60–100 nanoseconds—a gap of 100 to 300×. If the CPU had to wait for RAM on every operation, it would spend 99% of its time idle. Cache exists to bridge that gap.
Core Concepts
The Principle of Locality: Why Cheat Sheets Work
Cache is only useful because of a key observation about real program behavior—the principle of locality:
- Temporal locality: data accessed recently is likely to be accessed again soon. (The loop variable
iis read and written on every iteration.) - Spatial locality: when one address is accessed, nearby addresses are likely to be accessed soon too. (Array elements are contiguous in memory; after reading
a[0]you'll probably reada[1]next.)
These patterns hold across nearly all real programs. Cache exploits both: it keeps "recently used" and "nearby" data in fast storage, ready for the next access.
L1/L2/L3: Three Tiers of Cache
Modern CPUs have not one but three levels of cache, each trading speed for size:
┌──────────────────────────────────────────────────────────────┐
│ CPU Core │
│ ┌──────────────┐ │
│ │ Registers │ Capacity: < 1 KB Latency: ~0.3 ns (1 cycle) │
│ └──────┬───────┘ │
│ ┌──────▼───────┐ │
│ │ L1 Cache │ Capacity: 32-64 KB Latency: ~1 ns (4 cycles) │
│ │ (per-core) │ Usually split: I-Cache (instructions) + D-Cache (data)│
│ └──────┬───────┘ │
│ ┌──────▼───────┐ │
│ │ L2 Cache │ Capacity: 256 KB-1 MB Latency: ~4 ns (12 cycles)│
│ │ (per-core) │ │
│ └──────┬───────┘ │
└──────────┼────────────────────────────────────────────────────┘
┌──────▼───────┐
│ L3 Cache │ Capacity: 4-64 MB Latency: ~30 ns (40 cycles)
│ (shared by │ Example: Intel Core i9-13900K has 36 MB L3
│ all cores) │
└──────┬───────┘
┌──────▼───────┐
│ Main RAM │ Capacity: GB range Latency: ~60-100 ns
└──────────────┘
| Level | Typical Size | Latency | Bandwidth | Notes |
|---|---|---|---|---|
| L1 Cache | 32–64 KB | 4 cycles | ~1 TB/s | Per-core, fastest |
| L2 Cache | 256 KB–1 MB | 12 cycles | ~400 GB/s | Per-core |
| L3 Cache | 4–64 MB | 40 cycles | ~200 GB/s | Shared by all cores |
| RAM | 8–128 GB | 200+ cycles | ~50 GB/s | Slow but large |
Cache Lines: The Unit of Movement
When the CPU reads from memory, it never fetches just 1 byte. It loads the surrounding 64 bytes all at once. This 64-byte block is called a cache line.
Why? Because of spatial locality. If you access a[0], you're almost certainly about to access a[1] through a[15] (64 bytes / 4 bytes per int = 16 integers). Loading all 16 at once means the next 15 accesses come from cache instead of RAM.
Cache line (64 bytes):
Memory: 0x00 0x3F
┌──┬──┬──┬──┬──┬──┬──┬──┬── ... ──┬──┐
│ │ │ │ │ │ │ │ │ │ │
└──┴──┴──┴──┴──┴──┴──┴──┴── ... ──┴──┘
↑
CPU accesses byte at 0x04 (say, a[1]).
The entire 0x00–0x3F range is loaded into cache.
Cache Hit vs Cache Miss
- Cache hit: the requested data is already in cache. The CPU reads it in nanoseconds and moves on.
- Cache miss: the data isn't in cache. The CPU must fetch it from the next level (L2 → L3 → RAM), multiplying the latency dramatically.
Even a 1% miss rate can devastate performance. At 99% hit rate, average access time = 0.99 × 1 ns + 0.01 × 100 ns = 1.99 ns—nearly double the hit-only cost, just from that 1% misses.
Direct-Mapped vs Set-Associative
Cache is finite. How do memory addresses map to cache slots?
Direct-mapped: every memory address maps to exactly one cache slot (address mod cache size). Simple, but if two frequently used data blocks map to the same slot, they'll keep evicting each other—a pathological case called thrashing.
N-way set-associative: the cache is divided into sets, each with N slots. An address maps to one specific set but can occupy any of the N slots within it. Modern CPUs typically use 8-way L1 and 16-way L3. More ways means fewer conflicts; full associativity (any address can go anywhere) is too expensive to build at large sizes.
Direct-mapped (each address has exactly 1 possible home):
Address A → slot 3
Address B → slot 3 ← A and B fight over the same slot
4-way set-associative (each address has 4 possible homes):
Address A → set 1, any of 4 slots
Address B → set 1, any of 4 slots ← A and B can coexist
Try It Yourself
This Python experiment shows the performance gap between cache-friendly and cache-hostile access patterns:
import time
N = 4096
matrix = [[0] * N for _ in range(N)]
# Row-major (spatial locality: consecutive memory access)
start = time.time()
for i in range(N):
for j in range(N):
matrix[i][j] += 1
print(f"Row-major: {time.time() - start:.3f} s")
# Column-major (poor locality: jumps across memory on every step)
start = time.time()
for j in range(N):
for i in range(N):
matrix[i][j] += 1
print(f"Column-major: {time.time() - start:.3f} s")
# Typical result: column-major is 3–10× slower
# Reason: matrix[0][0], matrix[1][0], matrix[2][0]
# are 4096*4 = 16384 bytes apart — never in the same cache line
For a more dramatic result, try it in C:
#include <stdio.h>
#include <time.h>
#define N 4096
int matrix[N][N];
int main() {
clock_t t;
t = clock();
for (int i = 0; i < N; i++)
for (int j = 0; j < N; j++)
matrix[i][j]++;
printf("Row-major: %.3f s\n", (double)(clock()-t)/CLOCKS_PER_SEC);
t = clock();
for (int j = 0; j < N; j++)
for (int i = 0; i < N; i++)
matrix[i][j]++;
printf("Column-major: %.3f s\n", (double)(clock()-t)/CLOCKS_PER_SEC);
// Column-major is typically 5–10× slower
return 0;
}
🔬 Going Deeper
False sharing: the multicore cache trap
On a multi-core CPU, each core has its own L1 and L2 caches. But these caches operate at cache-line granularity (64 bytes). If two threads on different cores each write to different variables, but those variables happen to share a cache line—every write by one thread invalidates the other thread's cached copy of the entire line (per the MESI coherency protocol), forcing it to reload from memory. The two threads are constantly poisoning each other's cache, yet looking at the code you'd never know it. This is called false sharing, and it can reduce throughput by 10× or more. The fix is to pad frequently written variables so each occupies its own cache line. Java's @Contended annotation and Linux's ____cacheline_aligned_in_smp macro both address this exact problem.
Hardware prefetching: loading data before you ask
Modern CPUs include a hardware prefetcher that observes memory access patterns and predicts which addresses you'll need next, loading them into cache before you ask. Sequential array traversal triggers prefetching beautifully—by the time your loop reaches a[i+16], it's already in L1. Random access (pointer chasing through a linked list, hash table lookups) defeats the prefetcher completely, because it can't predict the next address. This is a major reason why linked lists underperform arrays so badly in practice, even when asymptotic complexity is the same.
Inspect your machine's cache topology
On Linux, getconf -a | grep CACHE or browsing /sys/devices/system/cpu/cpu0/cache/ shows you every level's size, type, and associativity. On macOS, sysctl -a | grep cache gives the same. Knowing your machine's exact cache parameters is the first step toward writing cache-aware code.
Where to learn more
- "What Every Programmer Should Know About Memory" by Ulrich Drepper — Section 3 is entirely dedicated to CPU cache; it includes measured benchmarks and is the most thorough free resource on the subject.
- Computer Architecture: A Quantitative Approach by Hennessy & Patterson — Appendix B covers cache design from first principles: set-associativity, replacement policies, write strategies, and prefetching.
- Systems Performance by Brendan Gregg — Chapter 3 contains real-world cache performance analysis with perf and other tools, from an engineering perspective.