Chapter 14

The Storage Pyramid

Fast storage is expensive. Slow storage is cheap. This has been an iron law of computing since the 1950s, and nothing has changed it. If a storage technology were simultaneously fast, cheap, and large, every other kind would vanish overnight. The laws of physics say that can't happen. So instead, computers are built around a pyramid: fast and expensive at the top, slow and cheap at the bottom, with each layer compensating for the weaknesses of the one below it.

Understanding the pyramid explains why CPUs need caches, why virtual memory exists, why databases manage buffer pools so carefully, and where the bottlenecks in any system actually live.

Core Concepts

The Complete Storage Pyramid

                     ┌─────────┐
                     │Registers│  < 1 KB      ~0.3 ns     equivalent ~$10,000+/GB
                     └────┬────┘
                   ┌──────┴──────┐
                   │  L1 Cache   │  32–64 KB    ~1 ns       equivalent ~$1,000/GB
                   └──────┬──────┘
                ┌──────────┴──────────┐
                │     L2 Cache        │  256 KB–1 MB  ~4 ns
                └──────────┬──────────┘
             ┌─────────────┴─────────────┐
             │          L3 Cache         │  4–64 MB    ~30 ns
             └─────────────┬─────────────┘
          ┌────────────────┴────────────────┐
          │         Main RAM (DRAM)          │  8–128 GB  ~100 ns  ~$3–5/GB
          └────────────────┬────────────────┘
        ┌──────────────────┴──────────────────┐
        │            NVMe SSD                  │  500 GB–8 TB  ~0.1 ms  ~$0.10/GB
        └──────────────────┬──────────────────┘
      ┌────────────────────┴────────────────────┐
      │         HDD (spinning disk)              │  1–20 TB  ~10 ms  ~$0.02/GB
      └────────────────────┬────────────────────┘
    ┌──────────────────────┴──────────────────────┐
    │           Tape Archive                       │  18 TB/cartridge (LTO-9, native)  seconds–minutes  ~$0.002/GB
    └─────────────────────────────────────────────┘

Real Numbers for Each Layer

Registers

x86-64 has 16 general-purpose registers (RAX, RBX, …, R15), each 8 bytes
Plus SSE/AVX floating-point registers — total under 1 KB
Access: 1 CPU cycle, ~0.3 ns at 3 GHz
Wired directly to execution units; zero addressing overhead

L1 Cache

Size: 32–64 KB data + 32 KB instructions, per core
Latency: 4 cycles, ~1 ns
Intel Core i9-13900K: 48 KB D-Cache + 32 KB I-Cache per core
Technology: SRAM (static RAM) — 6 transistors per bit, ~100× more costly than DRAM per bit

L3 Cache

Apple M3 Pro: 30 MB; Intel i9-13900K: 36 MB; AMD EPYC 9654: 384 MB
Shared by all cores — acts as the communication buffer between cores
Latency ~30–40 ns; a hit here avoids a DRAM access entirely

Main RAM (DRAM)

Each cell: 1 transistor + 1 capacitor — minimal, enabling high density
Capacitors leak; cells must be refreshed periodically — hence "dynamic" RAM
DDR5-5600: ~44.8 GB/s single channel, 89.6 GB/s dual channel
Price (2024): ~$3–5 per GB
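The DDR5-5600 bandwidth figures above follow from simple arithmetic: transfers per second times bytes per transfer times channels. A minimal sketch, assuming a 64-bit data path per channel; the function name `ddr_peak_bandwidth_gb_s` is made up for this illustration, and the result is a theoretical peak rather than a measured number:

```python
def ddr_peak_bandwidth_gb_s(megatransfers_per_sec: int,
                            bus_width_bits: int = 64,
                            channels: int = 1) -> float:
    """Theoretical peak bandwidth in GB/s (1 GB = 10**9 bytes)."""
    bytes_per_transfer = bus_width_bits // 8  # assumed 64-bit channel -> 8 bytes
    return megatransfers_per_sec * 1e6 * bytes_per_transfer * channels / 1e9

single = ddr_peak_bandwidth_gb_s(5600)
dual = ddr_peak_bandwidth_gb_s(5600, channels=2)
print(f"DDR5-5600 single channel: {single:.1f} GB/s")
print(f"DDR5-5600 dual channel:   {dual:.1f} GB/s")
```

Sustained bandwidth in real workloads is lower, since refresh cycles and access patterns keep the bus from being busy on every transfer.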

NVMe SSD

Sequential read: 3500–7000 MB/s (PCIe 4.0)
Random read IOPS: 500K–1M (4 KB, queue depth 32)
Latency: 50–100 μs — 500–1000× slower than RAM, but 100× faster than HDD
Price: ~$0.08–0.15 per GB

HDD (Hard Disk Drive)

Largest single drive (2024): 30 TB (using HAMR — Heat-Assisted Magnetic Recording)
Sequential read/write: 150–200 MB/s
Random read: ~100 IOPS (4 KB) — only 100 random accesses per second
Price: ~$0.02 per GB; unbeatable for cold, bulk data
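The IOPS figures for NVMe and HDD are easier to compare once converted into throughput. A rough sketch using the 4 KB block size and IOPS values quoted above; the helper name is hypothetical, and the sequential figure is the low end of the HDD range given:

```python
def random_throughput_mb_s(iops: float, block_bytes: int = 4096) -> float:
    """Effective throughput (MB/s, 1 MB = 10**6 bytes) of small random reads."""
    return iops * block_bytes / 1e6

hdd_random = random_throughput_mb_s(100)       # ~100 IOPS from the HDD figures
ssd_random = random_throughput_mb_s(500_000)   # low end of the NVMe range
hdd_sequential = 150.0                         # MB/s, low end of the HDD range

print(f"HDD random 4 KB reads: {hdd_random:.2f} MB/s")
print(f"SSD random 4 KB reads: {ssd_random:.0f} MB/s")
print(f"HDD sequential vs random: {hdd_sequential / hdd_random:.0f}x faster")
```

The point of the comparison is that an HDD's random throughput falls below 1 MB/s, which is why databases on spinning disks work so hard to turn random I/O into sequential I/O.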

Tape Archive

Latest generation (2024): LTO-9, 18 TB native / 45 TB compressed per cartridge
Sequential read speed: 400 MB/s — surprisingly fast, once loaded
Price: ~$0.002 per GB (1/10 of HDD)
Weakness: no random access; must spool to position, taking seconds to minutes
Used by: Amazon S3 Glacier, Google Archive, long-term scientific data archives

Why You Can't Use Just One Kind of Storage

What if you built a 256 GB computer using only SRAM (the technology of L1 cache)?

At the ~$1,000/GB equivalent cost shown for L1 cache in the pyramid, the memory alone would cost roughly $256,000. And if you built everything from tape, the CPU would wait seconds to minutes for each instruction fetch.
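To see how steep the cost curve is, the same 256 GB can be priced at every tier. This is a back-of-the-envelope sketch using the approximate per-GB figures from the pyramid (with DRAM taken as $4, the midpoint of the quoted range); the dictionary and function names are invented for the example:

```python
# Approximate per-GB prices taken from the pyramid; treat them as rough estimates.
PRICE_PER_GB_USD = {
    "SRAM": 1000.0,    # "equivalent" cost of L1-class memory
    "DRAM": 4.0,       # midpoint of the ~$3-5 range
    "NVMe SSD": 0.10,
    "HDD": 0.02,
    "Tape": 0.002,
}

def cost_for_capacity(capacity_gb: float, tier: str) -> float:
    """Cost in USD of buying capacity_gb of storage at the given tier."""
    return capacity_gb * PRICE_PER_GB_USD[tier]

for tier in PRICE_PER_GB_USD:
    print(f"256 GB of {tier:<8}: ${cost_for_capacity(256, tier):>12,.2f}")
```

Each step down the list cuts the price by one or two orders of magnitude, which is the economic half of the argument for a layered design.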

Here's a famous way to visualize the latency ratios. If one CPU cycle (0.3 ns) is stretched to 1 second of human time, the latencies from the pyramid above become:

L1 Cache hit       ≈  3 seconds    ← looking up a dictionary
L2 Cache hit       ≈  13 seconds   ← grabbing a book from the shelf
L3 Cache hit       ≈  2 minutes    ← walking to the hallway bookcase
RAM access         ≈  5.5 minutes  ← walking to a nearby convenience store
SSD access         ≈  4 days       ← waiting for a parcel to arrive
HDD random access  ≈  1 year       ← sailing around the world
Tape (positioning) ≈  centuries    ← longer than a human lifetime
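The scaling can be reproduced directly from the latencies in the pyramid: divide each latency by the 0.3 ns cycle time and read the quotient as seconds. A small sketch, with helper names chosen for this example and L3 taken at the 30 ns end of its range:

```python
CYCLE_NS = 0.3  # one CPU cycle at roughly 3 GHz

# Latencies in nanoseconds, taken from the pyramid diagram.
LATENCY_NS = {
    "L1 cache": 1,
    "L2 cache": 4,
    "L3 cache": 30,
    "RAM": 100,
    "NVMe SSD": 100_000,         # ~0.1 ms
    "HDD (random)": 10_000_000,  # ~10 ms
}

def scaled_seconds(latency_ns: float) -> float:
    """Human-scale seconds if one CPU cycle lasted one second."""
    return latency_ns / CYCLE_NS

def humanize(seconds: float) -> str:
    """Express a duration in the largest unit that fits."""
    for unit, size in (("years", 365 * 86400), ("days", 86400),
                       ("hours", 3600), ("minutes", 60)):
        if seconds >= size:
            return f"{seconds / size:.1f} {unit}"
    return f"{seconds:.1f} seconds"

for name, ns in LATENCY_NS.items():
    print(f"{name:<13} {humanize(scaled_seconds(ns))}")
```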

These ratios explain every architectural decision in computing: each layer exists to prevent the CPU from needing to go to the next layer down.

Locality Runs Through the Entire Pyramid

The whole pyramid works because the principle of locality holds at every level:

Registers → L1: the CPU reuses the same values in tight loops; for such code, L1 hit rates typically exceed 95%
L1 → L2 → L3: each miss falls through to the next, wider cache
L3 → RAM: only on L3 misses does the CPU wait for DRAM
RAM → SSD/HDD: only on page faults does the OS access disk (virtual memory)

Every layer is the cache for the layer below it. The entire pyramid is a recursive caching system.
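One way to quantify this recursive caching is the average access time: each request pays the lookup cost of every level it reaches, and only the misses continue downward. The sketch below is a simplified model, not a measurement. The hit rates are illustrative assumptions, the latencies come from the pyramid, and lookup costs are treated as additive:

```python
def average_access_time_ns(levels, backing_latency_ns):
    """Expected latency of one access through a chain of caches.

    levels: list of (lookup_latency_ns, hit_rate), fastest level first.
    backing_latency_ns: latency of the final store that always hits.
    """
    expected = 0.0
    reach_probability = 1.0  # fraction of requests that get this far down
    for latency_ns, hit_rate in levels:
        expected += reach_probability * latency_ns  # everyone reaching here pays
        reach_probability *= 1.0 - hit_rate         # only misses go further
    return expected + reach_probability * backing_latency_ns

# (latency ns, assumed hit rate) for L1, L2, L3; DRAM at ~100 ns backs them.
caches = [(1, 0.95), (4, 0.80), (30, 0.75)]
print(f"With caches:    {average_access_time_ns(caches, 100):.2f} ns")
print(f"Without caches: {average_access_time_ns([], 100):.2f} ns")
```

Under these assumed hit rates, only a small fraction of accesses ever reach DRAM, so the average stays close to cache speed even though the backing store is about 100× slower than L1.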

Try It Yourself

Measure the latency difference between cache-friendly and cache-hostile access in Python:

import time, array, random

SIZE = 1024 * 1024   # 1M integers (~4 MB — fits in L3, not L1/L2)
data = array.array('i', range(SIZE))

N = 100_000

# Sequential access — excellent spatial locality, high cache hit rate
t = time.perf_counter()
s = 0
for i in range(N):
    s += data[i]
print(f"Sequential {N} accesses: {(time.perf_counter()-t)*1000:.1f} ms")

# Random access — poor locality, frequent cache misses
indices = random.sample(range(SIZE), N)
t = time.perf_counter()
s = 0
for i in indices:
    s += data[i]
print(f"Random     {N} accesses: {(time.perf_counter()-t)*1000:.1f} ms")
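# In CPython, interpreter overhead hides most of the gap, so expect only a modest slowdown here;
# a 5–20× difference is more typical of compiled code on arrays much larger than the caches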

Inspect your machine's cache hierarchy:

# Linux
getconf -a | grep -i cache
# or browse:
ls /sys/devices/system/cpu/cpu0/cache/
cat /sys/devices/system/cpu/cpu0/cache/index0/size   # L1 data cache size

# macOS
sysctl -a | grep cache

# Memory speed (Linux; dmidecode needs root)
sudo dmidecode --type 17 | grep -E "Speed|Size|Type"
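The same sysfs files can be read from Python, which is handy when a script needs the cache sizes. A minimal sketch, assuming the Linux layout shown above with `level`, `type`, and `size` files in each `index*` directory; the function name is hypothetical, and on systems without that directory it simply returns an empty list:

```python
from pathlib import Path

def read_cache_hierarchy(cpu: int = 0,
                         sysfs_root: str = "/sys/devices/system/cpu"):
    """Return a list of dicts describing each cache visible to one CPU."""
    cache_dir = Path(sysfs_root) / f"cpu{cpu}" / "cache"
    if not cache_dir.is_dir():
        return []  # not Linux, or sysfs not mounted
    entries = []
    for index_dir in sorted(cache_dir.glob("index*")):
        try:
            entries.append({
                "level": int((index_dir / "level").read_text().strip()),
                "type": (index_dir / "type").read_text().strip(),
                "size": (index_dir / "size").read_text().strip(),
            })
        except (OSError, ValueError):
            continue  # skip entries that are missing or unreadable
    return entries

for entry in read_cache_hierarchy():
    print(f"L{entry['level']} {entry['type']:<12} {entry['size']}")
```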

🔬 Going Deeper

Intel Optane: bridging the gap between RAM and SSD

Between RAM (byte-addressable, nanosecond latency, volatile) and SSD (block-addressable, microsecond latency, persistent) lies a huge gulf. Intel Optane, based on 3D XPoint technology, tried to fill it: byte-addressable, persistent, with ~300 ns latency (about 3× slower than DRAM, ~300× faster than SSD). Databases could map their entire working set directly into Optane as persistent memory—no warmup needed after a restart. Optane was discontinued in 2022 when Intel restructured, but it proved that the market for a "persistent memory" tier between RAM and SSD is real. Competing technologies (PCM, FeRAM, ReRAM) continue to be researched.

CXL: expanding memory over a network-like link

CXL (Compute Express Link) is a next-generation memory expansion protocol that runs over PCIe physical links. It allows multiple servers or accelerators to share a common pool of memory, or attach additional DRAM/flash that the CPU addresses as if it were local memory. CXL memory latency falls between 200–400 ns (2–4× local DRAM), but it dramatically expands the total memory capacity available to a machine. Meta, Microsoft, and Google have all been evaluating CXL memory expansion and pooling for data centers, with the aim of breaking through the memory-capacity wall for workloads such as large AI model inference.

The pyramid's future: layers blurring together

The traditional boundaries between layers are becoming less sharp:

HBM (High Bandwidth Memory) stacks DRAM dies vertically and places the stack beside the GPU or CPU die on a shared silicon interposer; latency stays in the same range as ordinary DRAM, but bandwidth exceeds 1 TB/s (used in NVIDIA A100/H100 GPUs)
Storage Class Memory (SCM) fills the space between RAM and SSDs
CXL makes memory a composable, network-attached resource
Computational storage puts processing logic inside SSDs, eliminating some data movement entirely

The pyramid won't collapse into a single layer—physics prevents it—but the distinctions between levels will keep blurring as new technologies arrive.

Where to learn more

"What Every Programmer Should Know About Memory" by Ulrich Drepper — The most comprehensive free resource spanning the entire pyramid, with measurement data and code examples at each level.
Computer Architecture: A Quantitative Approach by Hennessy & Patterson — Chapter 2 covers the memory hierarchy rigorously: cache design, replacement policies, DRAM organization, and storage hierarchy quantitative analysis.
"The Pathologies of Big Data" by Adam Jacobs (ACM Queue, 2009) — A short, insightful essay on how big data access patterns challenge the locality assumptions the pyramid is built on. Worth reading in 20 minutes.
