The Storage Pyramid
Fast storage is expensive. Slow storage is cheap. This has been an iron law of computing since the 1950s, and nothing has changed it. If a storage technology were simultaneously fast, cheap, and large, every other kind would vanish overnight. The laws of physics say that can't happen. So instead, computers are built around a pyramid: fast and expensive at the top, slow and cheap at the bottom, with each layer compensating for the weaknesses of the one below it.
Understanding the pyramid explains why CPUs need caches, why virtual memory exists, why databases manage buffer pools so carefully, and where the bottlenecks in any system actually live.
Core Concepts
The Complete Storage Pyramid
                  ┌─────────┐
                  │Registers│                    < 1 KB           ~0.3 ns  ~$10,000+/GB equivalent
                  └────┬────┘
                ┌──────┴──────┐
                │  L1 Cache   │                  32–64 KB         ~1 ns    ~$1,000/GB equivalent
                └──────┬──────┘
            ┌──────────┴──────────┐
            │      L2 Cache       │              256 KB–1 MB      ~4 ns
            └──────────┬──────────┘
         ┌─────────────┴─────────────┐
         │         L3 Cache          │           4–64 MB          ~30 ns
         └─────────────┬─────────────┘
      ┌────────────────┴────────────────┐
      │         Main RAM (DRAM)         │        8–128 GB         ~100 ns  ~$3–5/GB
      └────────────────┬────────────────┘
    ┌──────────────────┴──────────────────┐
    │              NVMe SSD               │      500 GB–8 TB      ~0.1 ms  ~$0.10/GB
    └──────────────────┬──────────────────┘
  ┌────────────────────┴────────────────────┐
  │           HDD (spinning disk)           │    1–20 TB          ~10 ms   ~$0.02/GB
  └────────────────────┬────────────────────┘
┌──────────────────────┴──────────────────────┐
│                Tape Archive                 │  18 TB/cartridge           ~$0.002/GB
└─────────────────────────────────────────────┘
Real Numbers for Each Layer
Registers
- x86-64 has 16 general-purpose registers (RAX, RBX, …, R15), each 8 bytes
- Plus 16 SSE/AVX vector registers (up to 32 bytes each with AVX); total under 1 KB unless AVX-512 is counted
- Access: 1 CPU cycle, ~0.3 ns at 3 GHz
- Wired directly to execution units; zero addressing overhead
L1 Cache
- Size: 32–64 KB data + 32 KB instructions, per core
- Latency: 4 cycles, ~1 ns
- Intel Core i9-13900K: 48 KB D-Cache + 32 KB I-Cache per performance core (P-core)
- Technology: SRAM (static RAM) — 6 transistors per bit, ~100× more costly than DRAM per bit
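The cycle counts quoted for registers and L1 turn into nanoseconds by dividing by the clock frequency in GHz. A minimal sketch of that conversion; the helper name `cycles_to_ns` is illustrative, and 3 GHz is the clock speed assumed in the register figures above.

```python
def cycles_to_ns(cycles: int, clock_ghz: float) -> float:
    """Convert a latency in CPU cycles to nanoseconds.

    At f GHz one cycle lasts 1/f nanoseconds.
    """
    return cycles / clock_ghz


# Cycle counts taken from the register and L1 figures in this section
for name, cycles in [("register access", 1), ("L1 cache hit", 4)]:
    print(f"{name}: {cycles} cycle(s) = {cycles_to_ns(cycles, 3.0):.2f} ns at 3 GHz")
```

Four cycles at 3 GHz is about 1.3 ns, which is why the L1 figure is usually rounded to "~1 ns".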
L3 Cache
- Apple M3 Pro: 30 MB; Intel i9-13900K: 36 MB; AMD EPYC 9654: 384 MB
- Shared among cores (all cores on most desktop chips; per chiplet on AMD EPYC), so it also acts as the communication buffer between them
- Latency ~30–40 ns; a hit here avoids a DRAM access entirely
Main RAM (DRAM)
- Each cell: 1 transistor + 1 capacitor — minimal, enabling high density
- Capacitors leak; cells must be refreshed periodically — hence "dynamic" RAM
- DDR5-5600: ~44.8 GB/s single channel, 89.6 GB/s dual channel
- Price (2024): ~$3–5 per GB
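The DDR5-5600 bandwidth figures follow from simple arithmetic: transfers per second times bytes per transfer times channels. A small sketch, assuming a 64-bit (8-byte) data path per channel; the function name is illustrative, and these are theoretical peaks rather than measured throughput.

```python
def dram_peak_bandwidth_gb_s(mega_transfers_per_s: int,
                             bus_width_bits: int = 64,
                             channels: int = 1) -> float:
    """Theoretical peak DRAM bandwidth in GB/s (decimal gigabytes)."""
    bytes_per_transfer = bus_width_bits / 8  # assumed 64-bit channel = 8 bytes
    return mega_transfers_per_s * 1e6 * bytes_per_transfer * channels / 1e9


# DDR5-5600 performs 5600 million transfers per second
print(f"Single channel: {dram_peak_bandwidth_gb_s(5600):.1f} GB/s")
print(f"Dual channel:   {dram_peak_bandwidth_gb_s(5600, channels=2):.1f} GB/s")
```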
NVMe SSD
- Sequential read: 3500–7000 MB/s (PCIe 4.0)
- Random read IOPS: 500K–1M (4 KB, queue depth 32)
- Latency: 50–100 μs — 500–1000× slower than RAM, but 100× faster than HDD
- Price: ~$0.08–0.15 per GB
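The IOPS and latency figures above are linked by Little's Law: requests in flight = throughput × latency. The sketch below applies it to a queue depth of 32 and converts IOPS to a data rate. It is an idealized model (it assumes latency stays constant under load, which real drives do not guarantee), and the function names are illustrative.

```python
def iops_from_queue_depth(queue_depth: int, latency_us: float) -> float:
    """Little's Law: throughput = requests in flight / time per request."""
    return queue_depth / (latency_us / 1_000_000)


def throughput_mb_per_s(iops: float, block_bytes: int = 4096) -> float:
    """Data rate implied by an IOPS figure at a fixed block size (decimal MB)."""
    return iops * block_bytes / 1_000_000


# Latency range quoted above: 50-100 microseconds per 4 KB read
for latency_us in (50, 100):
    iops = iops_from_queue_depth(32, latency_us)
    print(f"QD32 at {latency_us} us: {iops:,.0f} IOPS, "
          f"{throughput_mb_per_s(iops):,.0f} MB/s of 4 KB reads")
```

Under these assumptions the model gives a few hundred thousand IOPS, the same order of magnitude as the 500K–1M range quoted for real drives.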
HDD (Hard Disk Drive)
- Largest single drive (2024): 30 TB (using HAMR — Heat-Assisted Magnetic Recording)
- Sequential read/write: 150–200 MB/s
- Random read: ~100 IOPS (4 KB) — only 100 random accesses per second
- Price: ~$0.02 per GB; unbeatable for cold, bulk data
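The ~100 IOPS figure comes from mechanics: each random read waits for the head to seek and for the platter to rotate to the right sector. A rough sketch of that estimate; the 7200 RPM spindle speed and 6 ms average seek time are illustrative assumptions, not figures from this section.

```python
def hdd_random_iops(avg_seek_ms: float, rpm: int) -> float:
    """Estimate random IOPS from average seek time plus rotational latency."""
    # On average the platter must turn half a revolution to reach the sector
    avg_rotational_ms = 0.5 * 60_000 / rpm
    return 1000 / (avg_seek_ms + avg_rotational_ms)


# Assumed drive: 7200 RPM, 6 ms average seek (hypothetical but typical values)
iops = hdd_random_iops(avg_seek_ms=6.0, rpm=7200)
random_mb_s = iops * 4096 / 1_000_000
print(f"~{iops:.0f} random IOPS, ~{random_mb_s:.2f} MB/s at 4 KB per read")
```

At 4 KB per read this is well under 1 MB/s, against 150–200 MB/s for sequential transfers, which is why access pattern matters so much on spinning disks.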
Tape Archive
- Latest generation (2024): LTO-9, 18 TB native / 45 TB compressed per cartridge
- Sequential read speed: 400 MB/s — surprisingly fast, once loaded
- Price: ~$0.002 per GB (1/10 of HDD)
- Weakness: no random access; must spool to position, taking seconds to minutes
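- Used by: long-term scientific data archives and, reportedly, cold cloud archive tiers (providers such as Amazon and Google do not fully disclose the media behind services like S3 Glacier)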
Why You Can't Use Just One Kind of Storage
What if you built a 256 GB computer using only SRAM (the technology of L1 cache)?
At the pyramid's figure of roughly $1,000 per GB for L1-class SRAM, the memory alone would cost about $256,000. And if you built everything from tape, the CPU would wait seconds to minutes for each instruction fetch.
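The same thought experiment can be run for every layer. The sketch below prices 256 GB using the approximate per-GB figures from the pyramid; DRAM is taken at $4/GB, the midpoint of the quoted $3–5 range, and all prices are rough estimates, so the totals shift with whichever figure is assumed.

```python
# Approximate price per GB, taken from the pyramid diagram (rough 2024 figures)
PRICE_PER_GB = {
    "SRAM (L1-class)": 1000.0,
    "DRAM": 4.0,        # assumed midpoint of the $3-5 range
    "NVMe SSD": 0.10,
    "HDD": 0.02,
    "Tape": 0.002,
}


def build_cost(capacity_gb: float, price_per_gb: float) -> float:
    """Cost of providing capacity_gb of storage from a single technology."""
    return capacity_gb * price_per_gb


for tech, price in PRICE_PER_GB.items():
    print(f"256 GB of {tech:<16} ${build_cost(256, price):>12,.2f}")
```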
Here's a well-known way to visualize the latency ratios. Stretch time so that 1 ns becomes 1 second of human time (one 3 GHz CPU cycle is then about a third of a second):
L1 Cache hit = 1 second ← glancing at a dictionary on your desk
L2 Cache hit = 4 seconds ← grabbing a book from the shelf
L3 Cache hit = 30 seconds ← walking to the hallway bookcase
RAM access = ~2 minutes ← walking to a nearby convenience store
SSD access = ~1 day ← an overnight trip to another city
HDD random access = ~4 months ← a season-long expedition
Tape (seek after load) = centuries ← seconds to minutes of real time
These ratios explain much of computer architecture: each layer exists so that the CPU rarely has to go to the next layer down.
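Human-scale figures like these are easy to recompute from the pyramid's latencies. A sketch, assuming a scale of 1 ns of machine time to 1 second of human time (the scale is a free choice; set `NS_PER_HUMAN_SECOND` to 0.3 to use one 3 GHz cycle as the unit). The names here are illustrative.

```python
NS_PER_HUMAN_SECOND = 1.0  # assumed scale: 1 ns of machine time = 1 s of human time

# Approximate latencies from the pyramid diagram, in nanoseconds
LATENCY_NS = {
    "L1 cache hit": 1,
    "L2 cache hit": 4,
    "L3 cache hit": 30,
    "RAM access": 100,
    "NVMe SSD access": 100_000,      # ~0.1 ms
    "HDD random access": 10_000_000,  # ~10 ms
}


def humanize(seconds: float) -> str:
    """Render a duration in the largest unit that fits."""
    units = [("years", 365 * 86_400), ("days", 86_400),
             ("hours", 3_600), ("minutes", 60)]
    for name, size in units:
        if seconds >= size:
            return f"{seconds / size:.1f} {name}"
    return f"{seconds:.0f} seconds"


for layer, ns in LATENCY_NS.items():
    print(f"{layer:<18} {humanize(ns / NS_PER_HUMAN_SECOND)}")
```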
Locality Runs Through the Entire Pyramid
The whole pyramid works because the principle of locality holds at every level:
- Registers → L1: the CPU reuses the same values in tight loops; L1 hit rates above 95% are common on code with good locality
- L1 → L2 → L3: each miss falls through to the next, wider cache
- L3 → RAM: only on L3 misses does the CPU wait for DRAM
- RAM → SSD/HDD: only on page faults does the OS access disk (virtual memory)
Every layer is the cache for the layer below it. The entire pyramid is a recursive caching system.
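The recursive structure can be made concrete with the standard average memory access time (AMAT) calculation: every access pays the latency of each level it reaches, and only misses continue downward. The sketch below uses the pyramid's latencies; the hit rates are hypothetical values chosen for illustration, not measurements.

```python
def effective_latency_ns(levels: list[tuple[str, float, float]]) -> float:
    """Average access time across a cache hierarchy.

    levels: (name, latency_ns, hit_rate) from fastest to slowest.
    The last level must have hit_rate 1.0 (it always has the data).
    """
    total = 0.0
    reach_probability = 1.0  # fraction of accesses that get this far down
    for _name, latency_ns, hit_rate in levels:
        total += reach_probability * latency_ns
        reach_probability *= 1.0 - hit_rate  # only misses fall through
    return total


# Latencies from the pyramid; hit rates are assumed for illustration
hierarchy = [
    ("L1", 1.0, 0.95),
    ("L2", 4.0, 0.80),
    ("L3", 30.0, 0.75),
    ("RAM", 100.0, 1.00),
]

print(f"With caches:    {effective_latency_ns(hierarchy):.2f} ns per access")
print(f"Without caches: {effective_latency_ns([('RAM', 100.0, 1.0)]):.2f} ns per access")
```

With these assumed hit rates the average works out to about 1.75 ns, close to L1 speed even though most of the capacity sits in 100 ns DRAM.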
Try It Yourself
Measure the latency difference between cache-friendly and cache-hostile access in Python:
import array
import random
import time

SIZE = 1024 * 1024  # 1M 4-byte integers (~4 MB: fits in most L3 caches, not L1/L2)
data = array.array('i', range(SIZE))
N = 100_000

# Build both index lists before timing so the two loops do identical work
seq_indices = list(range(N))
rand_indices = random.sample(range(SIZE), N)

# Sequential access: excellent spatial locality, high cache hit rate
t = time.perf_counter()
s = 0
for i in seq_indices:
    s += data[i]
print(f"Sequential {N} accesses: {(time.perf_counter()-t)*1000:.1f} ms")

# Random access: poor locality, frequent cache misses
t = time.perf_counter()
s = 0
for i in rand_indices:
    s += data[i]
print(f"Random {N} accesses: {(time.perf_counter()-t)*1000:.1f} ms")
# Random is usually slower, but CPython's interpreter overhead hides much of the gap;
# differences of 5–20× are more typical of compiled code over arrays larger than the cache
Inspect your machine's cache hierarchy:
# Linux
getconf -a | grep -i cache
# or browse:
ls /sys/devices/system/cpu/cpu0/cache/
cat /sys/devices/system/cpu/cpu0/cache/index0/size # L1 data cache size
# macOS
sysctl -a | grep cache
# Memory speed, size, and type (Linux; dmidecode requires root)
sudo dmidecode --type 17 | grep -E "Speed|Size|Type"
🔬 Going Deeper
Intel Optane: bridging the gap between RAM and SSD
Between RAM (byte-addressable, nanosecond latency, volatile) and SSD (block-addressable, microsecond latency, persistent) lies a huge gulf. Intel Optane, based on 3D XPoint technology, tried to fill it: byte-addressable, persistent, with ~300 ns latency (about 3× slower than DRAM, roughly 200–300× faster than an NVMe SSD). Databases could map their entire working set directly into Optane as persistent memory, so no cache warmup was needed after a restart. Intel announced the wind-down of Optane in 2022, but the product showed that a "persistent memory" tier between RAM and SSD is technically viable and useful for some workloads. Competing technologies (PCM, FeRAM, ReRAM) continue to be researched.
CXL: expanding memory over a network-like link
CXL (Compute Express Link) is a cache-coherent interconnect standard that runs over PCIe physical links. It allows multiple servers or accelerators to share a common pool of memory, or to attach additional DRAM or flash that the CPU addresses as if it were local memory. CXL memory latency is typically 200–400 ns (2–4× local DRAM), but it greatly expands the total memory capacity available to a machine. Hyperscalers such as Meta, Microsoft, and Google have been exploring CXL memory expansion in data centers to ease the memory-capacity wall, including for large AI model inference.
The pyramid's future: layers blurring together
The traditional boundaries between layers are becoming less sharp:
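- HBM (High Bandwidth Memory) stacks DRAM dies vertically and places the stack in the same package as the CPU or GPU, usually on a silicon interposer. Latency stays close to that of ordinary DRAM, but the very wide interface delivers > 1 TB/s of bandwidth (used in NVIDIA A100/H100 GPUs)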
- Storage Class Memory (SCM) fills the space between RAM and SSDs
- CXL makes memory a composable, network-attached resource
- Computational storage puts processing logic inside SSDs, eliminating some data movement entirely
The pyramid won't collapse into a single layer—physics prevents it—but the distinctions between levels will keep blurring as new technologies arrive.
Where to learn more
- "What Every Programmer Should Know About Memory" by Ulrich Drepper. The most comprehensive free resource spanning the entire pyramid, with measurement data and code examples at each level.
- Computer Architecture: A Quantitative Approach by Hennessy & Patterson. Chapter 2 covers the memory hierarchy rigorously: cache design, replacement policies, DRAM organization, and quantitative analysis of the storage hierarchy.
- "The Pathologies of Big Data" by Adam Jacobs (ACM Queue, 2009). A short, insightful essay on how big data access patterns challenge the locality assumptions the pyramid is built on. About a 20-minute read.