Registers: The CPU's Pockets
When you're working at your desk, where do you keep the things you use most? Your phone goes in your pocket, your keys on a hook nearby, your water bottle on the desk. The filing cabinet in the corner holds last year's documents; you only go there occasionally, and it takes a minute to find anything.
The CPU faces the same organizational challenge. It could store everything in main memory, but fetching data from RAM takes roughly 100 nanoseconds, an eternity when the CPU can execute an instruction every 0.3 nanoseconds (one clock cycle at roughly 3 GHz). So chip designers gave the CPU a few "pockets" right on the chip itself: registers. Fast enough to feed data to every instruction without ever waiting.
Core Concepts
Registers vs. Memory: The Speed Chasm
┌──────────────────────────────────────────────────────────────┐
│ Memory Hierarchy: Access Times                               │
├────────────────────┬──────────────┬──────────────────────────┤
│ Storage            │ Access Time  │ Typical Capacity         │
├────────────────────┼──────────────┼──────────────────────────┤
│ CPU Registers      │ ~0.3 ns      │ 16 x 8 bytes = 128 bytes │
│ L1 Cache           │ ~1 ns        │ 32-64 KB                 │
│ L2 Cache           │ ~3-5 ns      │ 256 KB - 1 MB            │
│ L3 Cache           │ ~10-30 ns    │ 8-64 MB                  │
│ Main Memory (RAM)  │ ~60-100 ns   │ 8-64 GB                  │
│ SSD                │ ~100,000 ns  │ 256 GB - 4 TB            │
└────────────────────┴──────────────┴──────────────────────────┘
Registers are roughly 300x faster than RAM. If a CPU instruction takes 1 second in your mental model, fetching a value from RAM would take 5 minutes. That's the gap that registers bridge.
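The scaling behind that analogy is plain arithmetic. Here is a minimal sketch in C, using the approximate latencies quoted above; the names `speed_ratio` and `scaled_seconds` are made up for illustration.

```c
/* Approximate access times quoted in the text, in nanoseconds. */
#define REGISTER_NS 0.3
#define RAM_NS      100.0

/* How many register accesses fit into one RAM access (about 333). */
double speed_ratio(void) {
    return RAM_NS / REGISTER_NS;
}

/* Rescale a latency so that one register access lasts one "human" second. */
double scaled_seconds(double latency_ns) {
    return latency_ns / REGISTER_NS;
}
```

With these numbers, `scaled_seconds(RAM_NS)` comes out at roughly 333 seconds, which is where "about 5 minutes" comes from.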
The 16 General-Purpose Registers of x86-64
x86-64 has 16 general-purpose registers, each 64 bits (8 bytes) wide:
┌─────────────────────────────────────────────────────────────────────┐
│ x86-64 General-Purpose Registers (System V AMD64 ABI roles)         │
├────────┬────────────────────────────────────────────────────────────┤
│ RAX    │ Accumulator; holds function return values                  │
│ RBX    │ Base register; callee-saved (preserved across calls)       │
│ RCX    │ Counter; loops and string ops / 4th function argument      │
│ RDX    │ Data; multiply and divide / 3rd function argument          │
│ RSI    │ Source index; string ops / 2nd function argument           │
│ RDI    │ Destination index; string ops / 1st function argument      │
│ RSP    │ Stack Pointer; always points to the top of the stack       │
│ RBP    │ Base Pointer; points to the base of the current frame      │
│ R8-R15 │ General-purpose; R8/R9 hold the 5th/6th function arguments │
└────────┴────────────────────────────────────────────────────────────┘
A quirk of history: x86 evolved from 16-bit (the 8086, itself modeled on the 8-bit 8080) to 32-bit to 64-bit, each generation backward-compatible with the last. So each register is actually several overlapping views of the same storage:
RAX (64-bit):   ┌──────────────────────────────────────────────┐
                │                     RAX                      │
                └──────────────────────────────────────────────┘
EAX (32-bit):                           ┌──────────────────────┐
                                        │         EAX          │
                                        └──────────────────────┘
AX (16-bit):                                        ┌──────────┐
                                                    │    AX    │
                                                    └──────────┘
AH/AL (8-bit):                                      ┌────┐┌────┐
                                                    │ AH ││ AL │
                                                    └────┘└────┘
Writing to EAX clears the upper 32 bits of RAX (to zero). Writing to AX, AH, or AL leaves the rest of RAX untouched. This layered structure is the price of 40 years of backward compatibility. AArch64 carries far less of this legacy: each register has just a 64-bit view (Xn) and a 32-bit view (Wn).
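The write rules can be modeled in a few lines of C. This is only a sketch of the architectural behavior, treating RAX as a `uint64_t`; the helper names (`write_eax`, `write_ax`, `write_al`, `write_ah`) are hypothetical and not part of any real API.

```c
#include <stdint.h>

/* mov eax, v : a 32-bit write zero-extends into the upper 32 bits. */
uint64_t write_eax(uint64_t rax, uint32_t v) {
    (void)rax;                  /* the old contents are discarded entirely */
    return (uint64_t)v;
}

/* mov ax, v : a 16-bit write leaves the upper 48 bits untouched. */
uint64_t write_ax(uint64_t rax, uint16_t v) {
    return (rax & ~UINT64_C(0xFFFF)) | v;
}

/* mov al, v : only bits 0-7 change. */
uint64_t write_al(uint64_t rax, uint8_t v) {
    return (rax & ~UINT64_C(0xFF)) | v;
}

/* mov ah, v : only bits 8-15 change. */
uint64_t write_ah(uint64_t rax, uint8_t v) {
    return (rax & ~UINT64_C(0xFF00)) | ((uint64_t)v << 8);
}
```

Starting from RAX = 0x1122334455667788, `write_eax(rax, 0xAABBCCDD)` yields 0x00000000AABBCCDD, while `write_ax(rax, 0xBEEF)` yields 0x112233445566BEEF.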
Special-Purpose Registers
Beyond the general-purpose set, a few registers have fixed jobs:
The Instruction Pointer (RIP): Holds the address of the next instruction to execute. This is the Program Counter we covered in Chapter 6.
The Flags Register (RFLAGS): Records the status of the most recent computation. Each bit has a specific meaning:
Key bits in RFLAGS:
┌─────┬─────────────────────────────────────────────────────────┐
│ ZF  │ Zero Flag: last result was zero                         │
│ CF  │ Carry Flag: unsigned arithmetic produced a carry/borrow │
│ SF  │ Sign Flag: result was negative                          │
│ OF  │ Overflow Flag: signed arithmetic overflowed             │
└─────┴─────────────────────────────────────────────────────────┘
Example: after CMP RAX, 5
if RAX == 5 -> ZF = 1
if RAX < 5 (signed) -> SF = 1 (RAX - 5 is negative, assuming the subtraction does not overflow)
JE, JL, JG, JNZ and friends read these bits to decide where to jump.
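To make the flag rules concrete, here is a small C model of what CMP computes and how a few conditional jumps interpret the result. It is a sketch of the architectural definitions, not how hardware is built; the `Flags` struct and the function names are illustrative assumptions.

```c
#include <stdint.h>
#include <stdbool.h>

/* The four status bits from the table above, as a plain struct. */
typedef struct {
    bool zf, cf, sf, of;
} Flags;

/* Model of `cmp a, b`: compute a - b, keep only the flags. */
Flags cmp64(uint64_t a, uint64_t b) {
    uint64_t diff = a - b;                  /* wraps modulo 2^64 */
    Flags f;
    f.zf = (diff == 0);
    f.cf = (a < b);                         /* unsigned borrow */
    f.sf = (diff >> 63) & 1;                /* top bit of the result */
    /* Signed overflow: the operands have different signs and the
       result's sign differs from the first operand's sign. */
    f.of = (((a ^ b) & (a ^ diff)) >> 63) & 1;
    return f;
}

/* Conditions tested by a few conditional jumps. */
bool je(Flags f) { return f.zf; }                      /* equal         */
bool jl(Flags f) { return f.sf != f.of; }              /* signed less   */
bool jg(Flags f) { return !f.zf && f.sf == f.of; }     /* signed more   */
bool jb(Flags f) { return f.cf; }                      /* unsigned less */
```

Note that JL tests SF != OF rather than SF alone. That is why comparing the most negative 64-bit value against 5 still reports "less" even though the wrapped subtraction leaves SF clear.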
How Compilers Allocate Registers
If registers are so fast, why not keep everything in them? Because there are only 16, nowhere near enough for a real program's data. The compiler must decide which variables live in registers and which get evicted to memory. This is called register allocation, and it's one of the most important optimization problems in compiler design. In its classic formulation it is modeled as graph coloring, which is NP-hard in general, so compilers rely on heuristics rather than exact solutions.
Here's a concrete example:
int compute(int a, int b, int c, int d) {
    int x = a + b;   // a compiler might assign: x -> EAX
    int y = c + d;   // a compiler might assign: y -> ECX
    int z = x * y;   // x's "live range" ends here; EAX is reused for z
    return z;        // return value convention: result is already in EAX, just ret
}
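The key concept is live range: a variable is "live" from the point it's assigned to the last point it's read. The shorter the live range, the easier it is to find a register for it. Tight scoping ({ int x = ...; }) is mainly a readability aid, since optimizing compilers compute live ranges from actual definitions and uses rather than from braces, but it encourages the habit that does help: keeping each value's last use close to its definition.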
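One way to see how live ranges drive register assignment is a toy allocator. The sketch below is a deliberately simplified, linear-scan-style pass, not the algorithm GCC or Clang actually use; `LiveRange`, `assign_registers`, and `MAX_REGS` are names invented for this example, and instruction positions are just indices.

```c
#include <stddef.h>

#define MAX_REGS 16

/* A value is defined at instruction `start` and last read at `end`. */
typedef struct {
    int start;
    int end;
    int reg;    /* assigned register number, or -1 if spilled */
} LiveRange;

/* Give each range the lowest-numbered register that is free at its start.
   `ranges` must be sorted by `start`. A register is free again at the
   instruction that performs the last read of its current value, so a range
   ending at instruction i can hand its register to a range defined at i.
   Returns the number of ranges that had to be spilled. */
int assign_registers(LiveRange *ranges, size_t n, int num_regs) {
    int free_at[MAX_REGS];  /* first index at which each register is free */
    int spills = 0;

    if (num_regs > MAX_REGS) num_regs = MAX_REGS;
    for (int r = 0; r < num_regs; r++) free_at[r] = 0;

    for (size_t i = 0; i < n; i++) {
        ranges[i].reg = -1;
        for (int r = 0; r < num_regs; r++) {
            if (free_at[r] <= ranges[i].start) {
                ranges[i].reg = r;
                free_at[r] = ranges[i].end;
                break;
            }
        }
        if (ranges[i].reg < 0) spills++;   /* no register free: spill */
    }
    return spills;
}
```

For the `compute` function above, x is live over [0, 2], y over [1, 2], and z over [2, 3]. With two registers nothing spills and z reuses x's register; with only one register, y has to be spilled.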
Register Spilling
What happens when there are more live variables than registers? The compiler spills some variables to the stack (memory): it writes them out, frees up the register for other use, then reads them back when needed:
; Spill example: RAX needs to be temporarily saved
sub rsp, 8 ; allocate 8 bytes on the stack
mov [rsp], rax ; spill: save RAX to memory
; ... RAX is now free for other computations ...
mov rax, [rsp] ; reload: bring the value back
add rsp, 8 ; release stack space
Every spill and reload adds a store and a load. The stack is usually hot in L1 cache, so each one typically costs a few cycles rather than the full 300x RAM penalty, but inside a hot loop that overhead adds up. Code with many spills (usually large functions with many local variables) can be measurably slower. This is one reason why breaking a huge function into smaller ones sometimes makes the whole program faster: smaller functions mean fewer live variables at once, fewer spills, and better register usage.
Try It Yourself
// test.c
int sum_of_squares(int a, int b) {
    int aa = a * a;
    int bb = b * b;
    return aa + bb;
}
gcc -O2 -S -masm=intel -o test.s test.c
cat test.s
# Likely output (x86-64, optimized):
# sum_of_squares:
# imul edi, edi # edi = a * a (RDI = 1st argument)
# imul esi, esi # esi = b * b (RSI = 2nd argument)
# lea eax, [rdi+rsi] # eax = edi + esi
# ret
#
# Notice: no memory operations at all!
# The variables 'aa' and 'bb' exist only briefly in registers
# and are never given addresses in memory.
The optimizer collapsed two local variables entirely. They were born in registers and died there.
Going Deeper
More registers in RISC designs
x86-64's 16 general-purpose registers are a historical constraint, not a design ideal. ARM AArch64 has 31 general-purpose registers (X0-X30). RISC-V has 32 (x0-x31), although x0 is hardwired to zero. More registers means the compiler can keep more values live without spilling. This is a concrete, measurable advantage of cleaner ISA designs in compute-heavy workloads: fewer loads and stores means less time waiting on memory.
Vector registers: a parallel universe
Alongside the integer registers lives a completely separate set of vector registers. x86-64 has 16 256-bit YMM registers (AVX/AVX2) and 32 512-bit ZMM registers (AVX-512); AArch64 has 32 128-bit V registers (NEON), which SVE extends into scalable-length Z registers. These hold multiple values packed together and support SIMD operations: doing the same arithmetic on 4, 8, or 16 numbers simultaneously. When you run a neural network locally on a CPU, the vast majority of computation flows through these registers, not the general-purpose ones.
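You rarely have to write vector instructions by hand to use these registers. The following sketch is an ordinary C loop (the function name `add_arrays` is just for illustration) that an optimizing compiler can usually auto-vectorize. Whether it does, and which instructions it picks, depends on the compiler, flags, and target, so treat the comment as an expectation rather than a guarantee.

```c
#include <stddef.h>

/* Element-wise out[i] = a[i] + b[i] over n floats.
   `restrict` promises the arrays do not overlap, which makes it easier for
   the compiler to process several elements per instruction. With something
   like `gcc -O3 -mavx2`, the loop body is typically compiled to 256-bit
   vector adds that handle 8 floats at a time. */
void add_arrays(const float *restrict a, const float *restrict b,
                float *restrict out, size_t n) {
    for (size_t i = 0; i < n; i++) {
        out[i] = a[i] + b[i];
    }
}
```

Compiling this with `-S`, as in the Try It Yourself section, is a quick way to check whether vector register names such as ymm0 appear in the output on your machine.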
Where to learn more
- Computer Systems: A Programmer's Perspective (CSAPP): Chapter 3 is the most readable treatment of x86-64 register conventions for programmers; essential if you want to read compiler output.
- Compilers: Principles, Techniques, and Tools (the Dragon Book, Aho et al.): Chapter 8 covers register allocation via graph coloring in full theoretical detail.
- Intel 64 and IA-32 Architectures Software Developer's Manual: The authoritative reference for every register's behavior, freely available from Intel's website. Use it as a dictionary.