Chapter 8

Registers: The CPU's Pockets

When you're working at your desk, where do you keep the things you use most? Your phone goes in your pocket, your keys on a hook nearby, your water bottle on the desk. The filing cabinet in the corner holds last year's documents: you only go there occasionally, and it takes a minute to find anything.

The CPU faces the same organizational challenge. It could store everything in main memory, but fetching data from RAM takes roughly 100 nanoseconds, an eternity when the CPU can execute an instruction every 0.3 nanoseconds. So chip designers gave the CPU a few "pockets" right on the chip itself: registers. They are fast enough to feed data to every instruction without making it wait.

Core Concepts

Registers vs. Memory: The Speed Chasm

┌──────────────────────────────────────────────────────────────┐
│                 Memory Hierarchy: Access Times               │
├────────────────────┬──────────────┬──────────────────────────┤
│ Storage            │ Access Time  │ Typical Capacity         │
├────────────────────┼──────────────┼──────────────────────────┤
│ CPU Registers      │ ~0.3 ns      │ 16 x 8 B = 128 bytes     │
│ L1 Cache           │ ~1 ns        │ 32-64 KB                 │
│ L2 Cache           │ ~3-5 ns      │ 256 KB – 1 MB            │
│ L3 Cache           │ ~10-30 ns    │ 8-64 MB                  │
│ Main Memory (RAM)  │ ~60-100 ns   │ 8-64 GB                  │
│ SSD                │ ~100,000 ns  │ 256 GB – 4 TB            │
└────────────────────┴──────────────┴──────────────────────────┘

Registers are roughly 300x faster than RAM. If a CPU instruction takes 1 second in your mental model, fetching a value from RAM would take 5 minutes. That's the gap that registers bridge.
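To make the analogy concrete, here is a small sketch that rescales the table's approximate figures so that one 0.3 ns instruction lasts one "human" second. The helper names are invented for this illustration, and the latencies are the rough values from the table, not measurements.

```c
#include <stdio.h>

/* Scale a real access time so that one instruction lasts one second.
   Both arguments are in nanoseconds; the result is in analogy-seconds. */
double scaled_seconds(double access_ns, double instruction_ns) {
    return access_ns / instruction_ns;
}

/* Print one row of the analogy, assuming a 0.3 ns instruction. */
void print_scaled(const char *name, double access_ns) {
    printf("%-10s %10.0f s\n", name, scaled_seconds(access_ns, 0.3));
}
```

Calling print_scaled("RAM", 100.0) gives about 333 seconds, which the text rounds to 300x, or roughly five minutes. The same helper puts an SSD access (100,000 ns) at nearly four days.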

The 16 General-Purpose Registers of x86-64

x86-64 has 16 general-purpose registers, each 64 bits (8 bytes) wide. The argument roles listed here are those of the System V calling convention used on Linux and macOS:

┌──────────────────────────────────────────────────────────────────┐
│               x86-64 General-Purpose Registers                   │
├────────┬─────────────────────────────────────────────────────────┤
│ RAX    │ Accumulator; holds function return values               │
│ RBX    │ Base register; callee-saved (preserved across calls)    │
│ RCX    │ Counter; loops and string operations; 4th argument      │
│ RDX    │ Data; multiplication/division; 3rd argument             │
│ RSI    │ Source index; string ops / 2nd function argument        │
│ RDI    │ Destination index; string ops / 1st function argument   │
│ RSP    │ Stack pointer: always points to the top of the stack    │
│ RBP    │ Base pointer: points to the bottom of the current frame │
│ R8-R15 │ General-purpose; R8/R9 are 5th/6th args; temporaries    │
└────────┴─────────────────────────────────────────────────────────┘

A quirk of history: x86 began as a 16-bit architecture whose registers could also be addressed as 8-bit halves, then grew to 32-bit and 64-bit, each generation backward-compatible with the last. So each register is actually several overlapping views of the same storage:

RAX (64-bit): ┌────────────────────────────────────────────────┐
              │                      RAX                       │
              └────────────────────────────────────────────────┘
EAX (32-bit):                          ┌───────────────────────┐
                                       │          EAX          │
                                       └───────────────────────┘
AX (16-bit):                                       ┌───────────┐
                                                   │    AX     │
                                                   └───────────┘
AH/AL (8-bit):                                     ┌────┐ ┌────┐
                                                   │ AH │ │ AL │
                                                   └────┘ └────┘

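Writing to EAX clears the upper 32 bits of RAX (to zero). Writing to AX doesn't affect the upper 48 bits of RAX, and writing to AL or AH changes only that one byte. This layered structure is the price of more than 40 years of backward compatibility. AArch64 carries far less of this weight: each 64-bit X register has a single 32-bit view (W) and no 8-bit or 16-bit sub-registers.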
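The write rules are easy to mix up, so here is a minimal C model of them. It treats RAX as a plain 64-bit integer and shows what the whole register holds after a write to each narrower alias. The function names are invented for this sketch; they are not part of any real API.

```c
#include <stdint.h>

/* mov eax, v : a 32-bit write zero-extends into the upper half. */
uint64_t write_eax(uint64_t rax, uint32_t v) {
    (void)rax;                    /* the old contents are discarded */
    return (uint64_t)v;
}

/* mov ax, v : bits 16-63 keep their old value. */
uint64_t write_ax(uint64_t rax, uint16_t v) {
    return (rax & ~UINT64_C(0xFFFF)) | v;
}

/* mov al, v : bits 8-63 keep their old value. */
uint64_t write_al(uint64_t rax, uint8_t v) {
    return (rax & ~UINT64_C(0xFF)) | v;
}

/* mov ah, v : only bits 8-15 change. */
uint64_t write_ah(uint64_t rax, uint8_t v) {
    return (rax & ~UINT64_C(0xFF00)) | ((uint64_t)v << 8);
}
```

Starting from 0x1122334455667788, write_eax with 0xAABBCCDD leaves 0x00000000AABBCCDD, while write_ax with 0xBEEF leaves 0x112233445566BEEF. Only the 32-bit write wipes the upper half.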

Special-Purpose Registers

Beyond the general-purpose set, a few registers have fixed jobs:

The Instruction Pointer (RIP): Points to the address of the next instruction to execute. This is the Program Counter we covered in Chapter 6.

The Flags Register (RFLAGS): Records the status of the most recent computation. Each bit has a specific meaning:

Key bits in RFLAGS:
┌─────┬─────────────────────────────────────────────────────────┐
│ ZF  │ Zero Flag: last result was zero                         │
│ CF  │ Carry Flag: unsigned arithmetic produced a carry/borrow │
│ SF  │ Sign Flag: result was negative                          │
│ OF  │ Overflow Flag: signed arithmetic overflowed             │
└─────┴─────────────────────────────────────────────────────────┘

Example: after  CMP RAX, 5
  if RAX == 5 → ZF = 1
  if RAX < 5  → SF ≠ OF for a signed compare (usually SF = 1, since the subtraction result is negative)
  JE, JL, JG, JNZ and friends read these bits to decide where to jump.
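As a sketch of what the hardware does, the C model below computes the four flags that CMP a, b would set (CMP evaluates a - b and discards the result) and then the conditions a few jumps test. The struct and function names are invented for this illustration.

```c
#include <stdint.h>
#include <stdbool.h>

typedef struct { bool zf, cf, sf, of; } Flags;

/* Flags produced by CMP a, b on 64-bit operands. */
Flags cmp_flags(uint64_t a, uint64_t b) {
    uint64_t r = a - b;                      /* wraps modulo 2^64 */
    Flags f;
    f.zf = (r == 0);
    f.cf = (a < b);                          /* unsigned borrow */
    f.sf = (r >> 63) != 0;                   /* top bit of the result */
    /* Signed overflow: the operands had different signs and the
       result's sign differs from a's sign. */
    f.of = (((a ^ b) & (a ^ r)) >> 63) != 0;
    return f;
}

/* Conditions read by the jump instructions. */
bool je(Flags f) { return f.zf; }            /* equal */
bool jl(Flags f) { return f.sf != f.of; }    /* signed less-than */
bool jb(Flags f) { return f.cf; }            /* unsigned below */
```

Comparing 3 with 5 sets SF and CF, so both JL and JB would jump. Comparing the most negative 64-bit value with 5 overflows: SF ends up 0 but OF is 1, which is why JL tests SF ≠ OF rather than SF alone.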

How Compilers Allocate Registers

If registers are so fast, why not keep everything in them? Because there are only 16, nowhere near enough for a real program's data. The compiler must decide which variables live in registers and which get evicted to memory. This is called register allocation, and it's one of the most important optimization problems in compiler design. In its classic formulation it is modeled as graph coloring, which is NP-hard in general, so real compilers rely on heuristics.

Here's a concrete example:

int compute(int a, int b, int c, int d) {
    int x = a + b;    // compiler assigns: x → EAX
    int y = c + d;    // compiler assigns: y → ECX
    int z = x * y;    // x's "live range" ends; EAX is reused for z
    return z;         // return value convention: result in EAX, just ret
}

The key concept is live range: a variable is "live" from the point it's assigned to the last point it's read. The shorter the live range, the easier it is to find a register for it. Optimizing compilers work this out from where a value is actually used, not from the braces around it, so tight scoping ({ int x = ...; }) is mainly a readability aid. What helps the allocator is not keeping values alive longer than they are needed.
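The idea can be sketched in a few lines of C. The model below is a deliberate simplification: it assumes straight-line code with no branches, describes each live range as a pair of instruction indices, and uses a greedy heuristic rather than whatever a production compiler does. All names (LiveRange, interferes, assign_registers) are invented for this illustration.

```c
#include <stdbool.h>

#define MAX_VARS 16   /* assumption: at most 16 variables per call */

/* A live range as instruction indices: [def, last_use]. */
typedef struct { int def; int last_use; } LiveRange;

/* Two variables interfere if their live ranges overlap. A value whose
   last use is the instruction that defines another one does not
   interfere with it, so the register can be reused right there. */
bool interferes(LiveRange a, LiveRange b) {
    return a.def < b.last_use && b.def < a.last_use;
}

/* Greedy coloring: give each variable the lowest-numbered register not
   held by an already-assigned variable it interferes with.
   Fills reg[0..n-1] and returns how many registers were used.
   Being a heuristic, it is not guaranteed to find the minimum. */
int assign_registers(const LiveRange *v, int n, int *reg) {
    int used = 0;
    for (int i = 0; i < n; i++) {
        bool taken[MAX_VARS] = { false };
        for (int j = 0; j < i; j++)
            if (interferes(v[i], v[j]))
                taken[reg[j]] = true;
        int r = 0;
        while (taken[r]) r++;
        reg[i] = r;
        if (r + 1 > used) used = r + 1;
    }
    return used;
}
```

For compute() above, number the function entry 0 and the four statements 1 to 4. Then a and b are live over [0,1], c and d over [0,2], x over [1,3], y over [2,3], and z over [3,4]. Feeding those seven ranges in that order yields four registers in total, with x and z sharing register 0: the reuse the comments in compute() describe.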

Register Spilling

What happens when there are more live variables than registers? The compiler spills some variables to the stack (memory): it writes them out, frees the register for other use, then reads them back when needed:

; Spill example: RAX needs to be temporarily saved
sub  rsp, 8          ; allocate 8 bytes on the stack
mov  [rsp], rax      ; spill: save RAX to memory
; ... RAX is now free for other computations ...
mov  rax, [rsp]      ; reload: bring the value back
add  rsp, 8          ; release stack space

Every spill and reload is an extra memory access. The stack is almost always sitting in L1 cache, so each one costs a few cycles rather than the full 300x trip to RAM, but in a hot loop those cycles add up. Code with many spills (usually large functions with many local variables) can be measurably slower. This is one reason why breaking a huge function into smaller ones sometimes makes the whole program faster: smaller functions mean fewer live variables at once, fewer spills, and better register usage.

Try It Yourself

// test.c
int sum_of_squares(int a, int b) {
    int aa = a * a;
    int bb = b * b;
    return aa + bb;
}
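
Compile it to assembly. The -masm=intel flag asks GCC for the Intel syntax used in this chapter (its default is AT&T syntax):

gcc -O2 -S -masm=intel -o test.s test.c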
cat test.s

# Likely output (x86-64, optimized):
# sum_of_squares:
#   imul  edi, edi          # edi = a * a  (RDI = 1st argument)
#   imul  esi, esi          # esi = b * b  (RSI = 2nd argument)
#   lea   eax, [rdi+rsi]    # eax = edi + esi
#   ret
#
# Notice: no memory operations at all!
# The variables 'aa' and 'bb' exist only briefly in registers
# and are never given addresses in memory.

The optimizer collapsed two local variables entirely. They were born in registers and died there.

🔬 Going Deeper

More registers in RISC designs

x86-64's 16 general-purpose registers are a historical constraint, not a design ideal. ARM AArch64 has 31 general-purpose registers (X0-X30). RISC-V has 32 (x0-x31, with x0 hardwired to zero). More registers means the compiler can keep more values live without spilling. This is a concrete, measurable advantage of cleaner ISA designs in compute-heavy workloads: fewer loads and stores means less time waiting on memory.

Vector registers: a parallel universe

Alongside the integer registers lives a separate set of vector registers. With AVX and AVX2, x86-64 has 16 256-bit YMM registers; AVX-512 widens them to 512-bit ZMM registers and doubles the count to 32. ARM has 32 128-bit V registers (NEON), which SVE extends to scalable-width Z registers. These hold multiple values packed together and support SIMD operations: doing the same arithmetic on 4, 8, or 16 numbers simultaneously. When you run a neural network locally on the CPU, the vast majority of computation flows through these registers, not the general-purpose ones.
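A quick way to see packed values from C is the vector extension that GCC and Clang provide. This is a compiler extension rather than standard C, and the names float4 and add4 are invented for this sketch.

```c
/* GCC/Clang vector extension: four packed floats, 16 bytes in total,
   the size of one 128-bit vector register. */
typedef float float4 __attribute__((vector_size(16)));

/* A single '+' on the vector type adds all four lanes. On x86-64 an
   optimized build typically turns this into one packed add. */
float4 add4(float4 a, float4 b) {
    return a + b;
}
```

Adding {1, 2, 3, 4} and {10, 20, 30, 40} gives {11, 22, 33, 44}. Compiling with gcc -O2 -S should let you check whether the four additions became a single instruction on your machine.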

Where to learn more
