Cache-Friendly Code
Cache-Friendly Code
A university library holds tens of millions of books, but your desk can hold maybe ten. The smart strategy is to grab a whole armful of related books at once, work through them for hours at your desk, then go back to the shelves for the next batch. Running to the shelves every time you need a single page would be madness.
The CPU cache is your desk. Main memory is the shelves. Cache capacity is tiny compared to RAM, but it's 10–100× faster. Writing cache-friendly code means helping the CPU keep its desk stocked with data it's about to need — instead of running to the shelves for every access.
Core Concepts
The Cache Line: The Unit of Data Movement
When the CPU fetches data from memory, it doesn't retrieve a single byte — it loads an entire cache line, typically 64 bytes. Read arr[0] and you automatically get arr[1] through arr[15] in the same load.
This creates a simple rule: sequential access is essentially free (subsequent elements are already in cache), but strided or random access repeatedly triggers new cache line loads, each costing 50–200 ns.
Memory layout (64-byte cache line):
┌──────────────────────────────────────────────────┐
│ arr[0] arr[1] arr[2] ... arr[15] │ ← one cache line load
└──────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────┐
│ arr[16] arr[17] ... arr[31] │ ← loaded only if accessed
└──────────────────────────────────────────────────┘
Matrix Traversal: The Classic Cache Miss Case Study
C stores 2D arrays in row-major order: a[i][j] and a[i][j+1] are adjacent in memory, but a[i][j] and a[i+1][j] are an entire row apart.
// Pattern A: row-major traversal (cache-friendly)
for (i = 0; i < N; i++)
for (j = 0; j < N; j++)
sum += a[i][j]; // sequential access — every loaded byte gets used
// Pattern B: column-major traversal (cache-hostile)
for (j = 0; j < N; j++)
for (i = 0; i < N; i++)
sum += a[i][j]; // stride = N elements — cache lines almost entirely wasted
For a 4096×4096 matrix, the performance difference is 5–10×. Matrix transposition (b[j][i] = a[i][j]) is the most common victim: reading a is row-major (friendly), but writing b is column-major (hostile).
Struct Field Layout Affects Performance
The order of fields in a struct determines memory layout, which affects how efficiently cache lines are used:
// Poor layout: frequently-used fields spread across cache lines
struct Bad {
char flag; // 1 byte
// 7 bytes padding
double value; // 8 bytes
char active; // 1 byte
// 7 bytes padding
double score; // 8 bytes
// total: 32 bytes, but only 18 bytes of data
};
// Better layout: largest fields first, minimal padding
struct Good {
double value; // 8 bytes
double score; // 8 bytes
char flag; // 1 byte
char active; // 1 byte
// 6 bytes padding
// total: 24 bytes — saves 8 bytes per instance
};
More importantly, consider Data-Oriented Design: if a hot loop only reads the score field from an array of objects, storing all scores in a separate flat array is dramatically more cache-efficient than packing them inside structs (struct-of-arrays vs. array-of-structs). With structs, every access loads an entire cache line but uses only a few bytes of it.
Loop Tiling (Blocking)
Matrix multiplication — C[i][j] += A[i][k] * B[k][j] — accesses B in column-major order. For large N, the B matrix doesn't fit in cache, and every access to B[k][j] misses.
Tiling divides matrices into sub-blocks sized to fit in L1 or L2 cache. Each tile of B is loaded once and reused many times:
Without tiling (N=1024, float):
B matrix = 4 MB — far exceeds L1/L2
Every access to B is a cache miss
With tiling (tile size = 64):
B sub-tile = 64×64×4 = 16 KB — fits in L1 (usually 32 KB)
B tile data stays hot in cache throughout the tile computation
Result: 10–100× reduction in memory traffic for B accesses
┌──────────────────┐ ┌──────────────────┐
│ │ │ ┌────┐ │
│ A │ │ │tile│ B │
│ │ × │ └────┘ │
│ ← row-major → │ │ │
└──────────────────┘ └──────────────────┘
↑
This tile stays in cache
until fully consumed
Hands-On Verification
This program compares row-major versus column-major traversal of a large matrix:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#define N 4096
static float mat[N][N];
double now_ms() {
struct timespec t;
clock_gettime(CLOCK_MONOTONIC, &t);
return t.tv_sec * 1e3 + t.tv_nsec / 1e6;
}
int main() {
for (int i = 0; i < N; i++)
for (int j = 0; j < N; j++)
mat[i][j] = (float)(i + j);
volatile double sum;
// Row-major: cache-friendly
double t0 = now_ms();
sum = 0;
for (int i = 0; i < N; i++)
for (int j = 0; j < N; j++)
sum += mat[i][j];
double row_ms = now_ms() - t0;
// Column-major: cache-hostile
t0 = now_ms();
sum = 0;
for (int j = 0; j < N; j++)
for (int i = 0; i < N; i++)
sum += mat[i][j];
double col_ms = now_ms() - t0;
printf("Row-major: %.1f ms\n", row_ms);
printf("Column-major: %.1f ms\n", col_ms);
printf("Slowdown: %.1fx\n", col_ms / row_ms);
return 0;
}
gcc -O2 -o cache cache.c && ./cache
Typical output:
Row-major: 18.3 ms
Column-major: 112.7 ms
Slowdown: 6.2x
To count actual cache misses, use Valgrind's cachegrind tool:
valgrind --tool=cachegrind --cache-sim=yes ./cache
cg_annotate cachegrind.out.<pid>
Or with Linux perf:
perf stat -e cache-references,cache-misses ./cache
🔬 Going Deeper
Cache miss latency numbers (typical 2024 x86):
L1 cache hit: ~4 cycles (~1.3 ns at 3 GHz)
L2 cache hit: ~12 cycles (~4 ns)
L3 cache hit: ~40 cycles (~13 ns)
Main memory: ~200 cycles (~65 ns)
An L1 miss that falls through to main memory costs 50× more than an L1 hit. In a loop where every iteration misses L1, performance is limited by memory bandwidth, not CPU compute. On a modern 3 GHz core, you can execute roughly 3 billion additions per second — but only fetch about 20 GB/s from a single memory channel, enough for about 5 billion floats per second. If your loop misses on every element, you're memory-bound regardless of how fast the ALU is.
SIMD and cache line alignment combine naturally. SIMD instructions (SSE/AVX) process 128–512 bits per operation. If a SIMD load spans two cache lines — because the data isn't aligned to a 64-byte boundary — the CPU must fetch both lines, doubling memory traffic. Aligning buffers with alignas(64) or posix_memalign(..., 64, size) eliminates this. When using AVX-512 on data that isn't 64-byte aligned, you can easily lose 10–20% of peak throughput purely from alignment penalties.
Hardware and software prefetching work together to hide memory latency. The CPU's hardware prefetcher detects sequential and strided access patterns and loads future cache lines speculatively. For access patterns too complex for the hardware to predict — graph traversals, pointer-chasing linked structures, irregular sparse matrix accesses — manual prefetching with __builtin_prefetch(addr, 0, 1) can issue the load tens of cycles early, overlapping memory latency with computation. In matrix operations and image processing kernels, manual prefetching 2–4 iterations ahead often yields 30–50% speedup.
For deep reading: Ulrich Drepper's "What Every Programmer Should Know About Memory" (2007, available free online) is the single most comprehensive treatment of this topic — it covers cache hierarchy, TLBs, NUMA, and prefetching in exhaustive detail. Brendan Gregg's Systems Performance: Enterprise and the Cloud Chapter 3 ("Memory") demonstrates cache profiling with perf, pcstat, and valgrind in real production scenarios. Mike Acton's CppCon 2014 talk "Data-Oriented Design and C++" is essential viewing for the engineering mindset: data layout is more important than algorithms when working at performance limits. For the theoretical foundations of memory hierarchy design, Computer Architecture: A Quantitative Approach by Hennessy and Patterson, Appendix B, covers cache organization, associativity, replacement policies, and miss rate analysis with mathematical rigor.