Chapter 14

The Storage Pyramid

Fast storage is expensive. Slow storage is cheap. This has been an iron law of computing since the 1950s, and nothing has changed it. If a storage technology were simultaneously fast, cheap, and large, every other kind would vanish overnight. The laws of physics say that can't happen. So instead, computers are built around a pyramid: fast and expensive at the top, slow and cheap at the bottom, with each layer compensating for the weaknesses of the one below it.

Understanding the pyramid explains why CPUs need caches, why virtual memory exists, why databases manage buffer pools so carefully, and where the bottlenecks in any system actually live.

Core Concepts

The Complete Storage Pyramid

                      ┌─────────┐
                      │Registers│  < 1 KB      ~0.3 ns     equivalent ~$10,000+/GB
                      └────┬────┘
                    ┌──────┴──────┐
                    │  L1 Cache   │  32–64 KB    ~1 ns       equivalent ~$1,000/GB
                    └──────┬──────┘
                ┌──────────┴──────────┐
                │      L2 Cache       │  256 KB–1 MB  ~4 ns
                └──────────┬──────────┘
             ┌─────────────┴─────────────┐
             │         L3 Cache          │  4–64 MB    ~30 ns
             └─────────────┬─────────────┘
         ┌─────────────────┴─────────────────┐
         │          Main RAM (DRAM)          │  8–128 GB  ~100 ns  ~$3–5/GB
         └─────────────────┬─────────────────┘
       ┌───────────────────┴───────────────────┐
       │               NVMe SSD                │  500 GB–8 TB  ~0.1 ms  ~$0.10/GB
       └───────────────────┬───────────────────┘
     ┌─────────────────────┴─────────────────────┐
     │            HDD (spinning disk)            │  1–20 TB  ~10 ms  ~$0.02/GB
     └─────────────────────┬─────────────────────┘
   ┌───────────────────────┴───────────────────────┐
   │                 Tape Archive                  │  100+ TB/cartridge  ~$0.002/GB
   └───────────────────────────────────────────────┘

Real Numbers for Each Layer

Registers

L1 Cache

L3 Cache

Main RAM (DRAM)

NVMe SSD

HDD (Hard Disk Drive)

Tape Archive

Why You Can't Use Just One Kind of Storage

What if you built a 256 GB computer using only SRAM (the technology of L1 cache)?

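At SRAM's cost density (the ~$1,000/GB equivalent shown for L1 cache above), the memory alone would cost roughly $256,000, and that ignores the power and chip area it would need. And if you built everything from tape, the CPU would wait minutes for each instruction fetch.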
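
The arithmetic behind that comparison can be sketched in a few lines. The per-GB prices are the rough figures from the pyramid above (the DRAM value is an assumed midpoint of the $3–5 range), so the totals are order-of-magnitude estimates only. `COST_PER_GB` and `cost_of` are illustrative names, not part of any library.

```python
# Rough cost per GB in dollars, taken from the pyramid diagram.
COST_PER_GB = {
    "SRAM (L1-class)": 1_000.0,
    "DRAM": 4.0,        # assumed midpoint of the ~$3-5/GB range
    "NVMe SSD": 0.10,
    "HDD": 0.02,
    "Tape": 0.002,
}


def cost_of(capacity_gb: float, tier: str) -> float:
    """Return the rough dollar cost of `capacity_gb` built from a single tier."""
    return capacity_gb * COST_PER_GB[tier]


if __name__ == "__main__":
    for tier in COST_PER_GB:
        print(f"256 GB of {tier:<16} ~ ${cost_of(256, tier):>12,.2f}")
```

The spread between the first and last rows is the reason no single technology can serve as the whole machine's storage.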

Here's a famous way to visualize the latency ratios. If one CPU cycle (0.3 ns) = 1 second of human time, the approximate latencies from the pyramid become:

L1 Cache hit       =  ~3 seconds     ← looking up a word in a dictionary
L2 Cache hit       =  ~13 seconds    ← grabbing a book from the shelf
L3 Cache hit       =  ~1.7 minutes   ← walking to the hallway bookcase
RAM access         =  ~5.6 minutes   ← walking to a nearby convenience store
SSD access         =  ~4 days        ← a long-weekend trip to another city
HDD random access  =  ~1 year        ← a year-long posting overseas
Tape (after load)  =  centuries      ← longer than a human lifetime

These ratios explain much of computer architecture: each layer exists so that the CPU rarely has to reach into the slower layer below it.
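
The rescaling is easy to reproduce. The sketch below divides each of the pyramid's approximate latencies by one 0.3 ns cycle and prints the result in human units. The latency values are the round figures from the diagram, and `human_seconds` and `describe` are hypothetical helper names.

```python
CYCLE_NS = 0.3  # one CPU cycle, mapped to one "human" second

# Approximate latencies in nanoseconds, from the pyramid diagram.
LATENCY_NS = {
    "L1 cache": 1,
    "L2 cache": 4,
    "L3 cache": 30,
    "RAM": 100,
    "NVMe SSD": 100_000,     # ~0.1 ms
    "HDD": 10_000_000,       # ~10 ms
}


def human_seconds(latency_ns: float) -> float:
    """Scale a real latency so that one CPU cycle lasts one second."""
    return latency_ns / CYCLE_NS


def describe(seconds: float) -> str:
    """Format a duration using the largest unit that fits."""
    units = (("years", 365 * 86400), ("days", 86400),
             ("hours", 3600), ("minutes", 60))
    for name, size in units:
        if seconds >= size:
            return f"{seconds / size:.1f} {name}"
    return f"{seconds:.1f} seconds"


if __name__ == "__main__":
    for layer, ns in LATENCY_NS.items():
        print(f"{layer:<10} {describe(human_seconds(ns))}")
```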

Locality Runs Through the Entire Pyramid

The whole pyramid works because the principle of locality holds at every level:

Every layer is the cache for the layer below it. The entire pyramid is a recursive caching system.
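
One way to make the recursion concrete is the standard average-access-time calculation: each level costs its own latency plus, on a miss, the cost of the level below it. The sketch below uses the pyramid's approximate latencies. The hit rates are invented for illustration rather than measured, and `average_access_time` is a hypothetical helper name.

```python
def average_access_time(levels, backing_latency_ns):
    """Expected latency of one access through a stack of caches.

    `levels` is a list of (latency_ns, hit_rate) pairs, fastest first.
    Each level either serves the access or forwards it downward, so:
        cost(level) = latency + (1 - hit_rate) * cost(next level)
    """
    if not levels:
        return backing_latency_ns
    latency_ns, hit_rate = levels[0]
    miss_rate = 1.0 - hit_rate
    return latency_ns + miss_rate * average_access_time(levels[1:], backing_latency_ns)


if __name__ == "__main__":
    # (latency in ns, assumed hit rate) for L1, L2, L3; RAM backs them all.
    caches = [(1, 0.95), (4, 0.80), (30, 0.50)]
    print(f"average access: {average_access_time(caches, 100):.2f} ns")
```

With these assumed hit rates the average comes to about 2 ns, even though a trip to RAM costs roughly 100 ns. Lowering any hit rate shows how quickly the slower layers start to dominate.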

Try It Yourself

Measure the latency difference between cache-friendly and cache-hostile access in Python:

import time, array, random

SIZE = 1024 * 1024   # 1M integers (~4 MB: fits in L3, not L1/L2)
data = array.array('i', range(SIZE))

N = 100_000

# Sequential access: excellent spatial locality, high cache hit rate
t = time.perf_counter()
s = 0
for i in range(N):
    s += data[i]
print(f"Sequential {N} accesses: {(time.perf_counter()-t)*1000:.1f} ms")

# Random access: poor locality, frequent cache misses
indices = random.sample(range(SIZE), N)
t = time.perf_counter()
s = 0
for i in indices:
    s += data[i]
print(f"Random     {N} accesses: {(time.perf_counter()-t)*1000:.1f} ms")
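# Random is usually slower, but the ratio varies by machine: CPython's
# interpreter overhead masks much of the gap that compiled code would show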
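
Because timing in an interpreter is noisy, it can help to count hits instead. The following is a toy model, not a description of any real CPU: a small fully associative LRU cache in which 16 neighbouring elements share one line. The cache size, line size, and the name `hit_rate` are all assumptions chosen for illustration.

```python
import random
from collections import OrderedDict


def hit_rate(addresses, num_lines=64, line_size=16):
    """Fraction of accesses served by a small fully associative LRU cache.

    Addresses are element indices; `line_size` neighbouring elements share
    one cache line, which is where spatial locality pays off.
    """
    cache = OrderedDict()   # line number -> None, least recently used first
    hits = 0
    total = 0
    for addr in addresses:
        total += 1
        line = addr // line_size
        if line in cache:
            hits += 1
            cache.move_to_end(line)         # mark as most recently used
        else:
            cache[line] = None
            if len(cache) > num_lines:
                cache.popitem(last=False)   # evict the least recently used line
    return hits / total if total else 0.0


if __name__ == "__main__":
    size, n = 1024 * 1024, 100_000
    rng = random.Random(0)
    scattered = [rng.randrange(size) for _ in range(n)]
    print(f"sequential hit rate: {hit_rate(range(n)):.3f}")
    print(f"random hit rate:     {hit_rate(scattered):.3f}")
```

In this model a sequential scan misses once per line and hits on the other 15 elements, while scattered accesses over a large array almost never find their line still cached.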

Inspect your machine's cache hierarchy:

# Linux
getconf -a | grep -i cache
# or browse:
ls /sys/devices/system/cpu/cpu0/cache/
cat /sys/devices/system/cpu/cpu0/cache/index0/size   # L1 data cache size

# macOS
sysctl -a | grep cache

# Memory speed and type (Linux; dmidecode needs root)
sudo dmidecode --type 17 | grep -E "Speed|Size|Type"
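
The same sysfs files can be read from Python. This is a minimal sketch that assumes the Linux layout shown above, where each `index*` directory holds `level`, `type`, and `size` files; on macOS or in some containers the directory is absent and the function simply returns an empty list. `read_cache_info` is a hypothetical helper name.

```python
from pathlib import Path


def read_cache_info(base="/sys/devices/system/cpu/cpu0/cache"):
    """Return a list of (level, type, size) strings, one per cache found."""
    root = Path(base)
    if not root.is_dir():
        return []   # not Linux, or sysfs is not exposed here
    entries = []
    for index_dir in sorted(root.glob("index*")):
        try:
            fields = tuple((index_dir / name).read_text().strip()
                           for name in ("level", "type", "size"))
        except OSError:
            continue   # skip entries with missing or unreadable files
        entries.append(fields)
    return entries


if __name__ == "__main__":
    for level, kind, size in read_cache_info():
        print(f"L{level} {kind:<12} {size}")
```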

🔬 Going Deeper

Intel Optane: bridging the gap between RAM and SSD

Between RAM (byte-addressable, nanosecond latency, volatile) and SSD (block-addressable, microsecond latency, persistent) lies a huge gulf. Intel Optane, based on 3D XPoint technology, tried to fill it: byte-addressable, persistent, with ~300 ns latency (about 3× slower than DRAM, ~300× faster than SSD). Databases could map their entire working set directly into Optane as persistent memory, with no warmup needed after a restart. Optane was discontinued in 2022 when Intel restructured, but it proved that the market for a "persistent memory" tier between RAM and SSD is real. Competing technologies (PCM, FeRAM, ReRAM) continue to be researched.

CXL: expanding memory over a network-like link

CXL (Compute Express Link) is a next-generation memory expansion protocol that runs over PCIe physical links. It allows multiple servers or accelerators to share a common pool of memory, or attach additional DRAM/flash that the CPU addresses as if it were local memory. CXL memory latency typically falls between 200 and 400 ns (2–4× local DRAM), but it dramatically expands the total memory capacity available to a machine. Meta, Microsoft, and Google are deploying CXL memory expansion in data centers to break through the memory-capacity wall for large AI model inference.
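
The capacity-for-latency trade can be estimated with a simple weighted average. The sketch below assumes ~100 ns for local DRAM and 300 ns for CXL memory (a point inside the range quoted above), and it ignores caching, bandwidth, and queueing effects. `blended_latency_ns` is a hypothetical helper name.

```python
def blended_latency_ns(local_fraction, local_ns=100.0, cxl_ns=300.0):
    """Average latency when `local_fraction` of accesses hit local DRAM
    and the remainder go to CXL-attached memory."""
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    return local_fraction * local_ns + (1.0 - local_fraction) * cxl_ns


if __name__ == "__main__":
    for fraction in (1.0, 0.9, 0.5):
        print(f"{fraction:.0%} local -> {blended_latency_ns(fraction):.0f} ns average")
```

If hot data stays in local DRAM, the average penalty is small, which is why placing frequently used pages locally matters for tiered memory.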

The pyramid's future: layers blurring together

The traditional boundaries between layers are becoming less sharp:

The pyramid won't collapse into a single layer, because physics prevents it, but the distinctions between levels will keep blurring as new technologies arrive.

Where to learn more
