Chapter 6

The Life of an Instruction

Imagine ordering a steak at a restaurant. From the moment you speak to the waiter until the plate arrives at your table, a sequence of handoffs happens: the waiter writes your order, runs it to the kitchen, a chef reads the ticket, prepares ingredients, cooks the steak, plates it, and hands it off to the runner. Each station has a dedicated job. Nothing is redundant; everything is specialized.

A CPU executing an instruction works the same way. Computer scientists call this the instruction cycle: a fixed sequence of stages that every instruction passes through, from birth to completion. The classic textbook version, based on the MIPS processor, has five stages: Fetch, Decode, Execute, Memory Access, and Write Back. Every instruction lives and dies in these five workshops.

Core Concepts

The Five Stages

Let's follow a single concrete instruction through its entire life: ADD R1, R2, R3 — which means "add the values in R2 and R3, and store the result in R1."

┌─────────────────────────────────────────────────────────────────┐
│               The Five Stages of Instruction Execution          │
├───────────────┬─────────────────────────────────────────────────┤
│  Stage        │  What Happens                                   │
├───────────────┼─────────────────────────────────────────────────┤
│ IF  Fetch     │ Use PC to retrieve the instruction from memory  │
│ ID  Decode    │ Parse the binary; figure out op and operands    │
│ EX  Execute   │ The ALU adds R2 and R3 together                 │
│ MEM Memory    │ Read/write main memory if needed (ADD skips)    │
│ WB  Write Back│ Write the result into the destination reg (R1)  │
└───────────────┴─────────────────────────────────────────────────┘

Stage 1 — Instruction Fetch (IF)

The CPU contains a special register called the Program Counter (PC). Its only job is to remember the address of the next instruction in memory. During IF, the CPU uses PC as an address, reaches into memory, and pulls out the instruction bytes, loading them into the Instruction Register (IR).

Once fetched, the PC automatically advances by the instruction size: 4 bytes on MIPS and on 64-bit ARM, and anywhere from 1 to 15 bytes on x86, whose instructions vary in length. Think of PC as a bookmark that always points to the next thing to read.
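The fetch step can be sketched in a few lines. This is a minimal sketch, assuming byte-addressed memory modeled as a Python `bytes` object and fixed 4-byte instructions stored big-endian (MIPS can run in either byte order; big-endian is chosen here only because it reads naturally). The function name `fetch` is hypothetical. The first word stored is 0x00430820, the MIPS encoding of ADD R1, R2, R3.

```python
def fetch(memory: bytes, pc: int) -> tuple[int, int]:
    """Return (instruction_word, next_pc) for a fixed 4-byte encoding."""
    # IR <- mem[PC .. PC+3], assembled big-endian into one 32-bit word
    ir = int.from_bytes(memory[pc:pc + 4], "big")
    # PC advances by the instruction size, so it points at the next instruction
    return ir, pc + 4

# Two words stored back to back: 0x00430820 at address 0, then an all-zero word
program = bytes.fromhex("00430820" "00000000")
ir, pc = fetch(program, 0)
print(hex(ir), pc)   # 0x430820 4
```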

Stage 2 — Instruction Decode (ID)

The IR now holds raw binary. For example, in the MIPS encoding (hex 0x00430820):

00000000010000110000100000100000

The decoder breaks it apart:

Opcode  000000  → R-type (register-to-register) instruction
RS      00010   → first source: register R2
RT      00011   → second source: register R3
RD      00001   → destination: register R1
Shamt   00000   → shift amount: 0
Funct   100000  → specific operation: ADD

Simultaneously, the decoder reads the actual current values out of R2 and R3 from the register file, so they're ready for the next stage.
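Splitting the word into fields is just shifting and masking. The sketch below assumes the MIPS R-type layout listed above (6-5-5-5-5-6 bits); the function name `decode_r_type` and the 32-entry register list are illustrative, not taken from any particular simulator.

```python
def decode_r_type(word: int) -> dict:
    """Split a 32-bit MIPS R-type word into its six fields."""
    return {
        "opcode": (word >> 26) & 0x3F,   # bits 31..26
        "rs":     (word >> 21) & 0x1F,   # bits 25..21, first source register
        "rt":     (word >> 16) & 0x1F,   # bits 20..16, second source register
        "rd":     (word >> 11) & 0x1F,   # bits 15..11, destination register
        "shamt":  (word >> 6) & 0x1F,    # bits 10..6, shift amount
        "funct":  word & 0x3F,           # bits 5..0, selects the operation
    }

fields = decode_r_type(0b000000_00010_00011_00001_00000_100000)
print(fields)   # {'opcode': 0, 'rs': 2, 'rt': 3, 'rd': 1, 'shamt': 0, 'funct': 32}

# In the same stage, the source values are read out of the register file
registers = [0] * 32
registers[2], registers[3] = 7, 5
a, b = registers[fields["rs"]], registers[fields["rt"]]
print(a, b)   # 7 5
```

Funct 32 is 100000 in binary, the code for ADD.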

Stage 3 — Execute (EX)

The Arithmetic Logic Unit (ALU) takes the two values that were read in the decode stage and performs the actual computation:

R2 = 7
R3 = 5
            ┌─────┐
  7  ──────►│     │
            │ ALU │──► 12
  5  ──────►│     │
            └─────┘
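In software terms, the ALU is a function that picks an operation and combines two operands. This is a minimal sketch, assuming 32-bit registers and a hypothetical set of operation names. It wraps around on overflow the way MIPS ADDU does; real MIPS ADD raises an overflow exception instead.

```python
MASK32 = 0xFFFFFFFF  # registers in this sketch are 32 bits wide

def alu(op: str, a: int, b: int) -> int:
    """Toy ALU: combine two 32-bit operands according to op."""
    if op == "ADD":
        result = a + b
    elif op == "SUB":
        result = a - b
    elif op == "AND":
        result = a & b
    elif op == "OR":
        result = a | b
    else:
        raise ValueError(f"unsupported ALU op: {op}")
    # Keep only the low 32 bits, as a 32-bit register would
    return result & MASK32

print(alu("ADD", 7, 5))            # 12
print(alu("ADD", 0xFFFFFFFF, 1))   # 0, the carry out of bit 31 is dropped
print(hex(alu("SUB", 5, 7)))       # 0xfffffffe, i.e. -2 in two's complement
```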

For branch instructions, the EX stage instead compares the operands and computes the branch target address; if the branch is taken, the PC is redirected to that target. The ALU is the muscle; IF and ID just set it up.
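To make the branch case concrete, here is a minimal sketch of a MIPS-style BEQ (branch if equal). It assumes the MIPS rule that the target is the address of the following instruction plus the sign-extended 16-bit offset times 4; the function names are hypothetical. The equality test is done the way an ALU would: subtract and check for zero.

```python
def sign_extend16(value: int) -> int:
    """Interpret the low 16 bits of value as a signed two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value

def beq(pc: int, a: int, b: int, offset16: int) -> int:
    """Return the next PC for a BEQ instruction located at address pc."""
    next_pc = pc + 4                                   # address of the following instruction
    target = next_pc + (sign_extend16(offset16) << 2)  # offset counts instructions, not bytes
    # ALU-style compare: a zero difference means the operands are equal
    return target if a - b == 0 else next_pc

print(hex(beq(0x1000, 5, 5, 3)))        # 0x1010, taken: 0x1004 + 3*4
print(hex(beq(0x1000, 5, 7, 3)))        # 0x1004, not taken
print(hex(beq(0x1000, 5, 5, 0xFFFE)))   # 0xffc, offset -2 branches backward
```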

Stage 4 — Memory Access (MEM)

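If the instruction needs to interact with main memory, such as a load like LW R1, 100(R2) or a store like SW R1, 100(R2), that happens here. The ALU has already computed the effective address (R2 + 100) in the EX stage; this stage uses that address to read from or write to the memory subsystem.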

Our ADD R1, R2, R3 doesn't touch memory at all, so this stage is a pass-through: the result just flows forward to the final stage.
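For contrast, here is what loads and stores do in this stage. This is a toy sketch, assuming a hypothetical dictionary `data_memory` that maps addresses to whole words and eight illustrative registers; it stores R1 with SW and then loads the same word back into R3 with LW.

```python
data_memory = {}          # toy memory: address -> word value
registers = [0] * 8       # eight toy registers, all starting at 0
registers[2] = 0x2000     # base address held in R2
registers[1] = 42         # value to be stored

def effective_address(base_reg: int, offset: int) -> int:
    # Computed by the ALU during EX: base register + immediate offset
    return registers[base_reg] + offset

# SW R1, 100(R2): the MEM stage writes R1 to address R2 + 100
addr = effective_address(2, 100)
data_memory[addr] = registers[1]

# LW R3, 100(R2): the MEM stage reads the word; WB then places it in R3
registers[3] = data_memory[effective_address(2, 100)]

print(hex(addr), registers[3])   # 0x2064 42
```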

Stage 5 — Write Back (WB)

The result is written back to its destination register. For our instruction, that means placing the value 12 into R1.

R1 ← 12   ✓  An instruction's life is complete.

Full Flow Diagram

Memory                          CPU
┌──────────────┐   PC          ┌──────────────────────────────────────┐
│ addr 0x1000  │◄──────────────│  Program Counter  PC = 0x1000        │
│ ADD R1,R2,R3 │               │                                      │
│ addr 0x1004  │   IF          │  1. Fetch: retrieve from 0x1000      │
│ ...          │──────────────►│     IR ← mem[PC]                     │
└──────────────┘               │     PC ← PC + 4  (now 0x1004)        │
                               │                                      │
                               │  2. Decode: parse IR, read R2, R3    │
                               │     op=ADD  rd=R1  rs=R2  rt=R3      │
                               │                                      │
                               │  3. Execute: ALU computes R2 + R3    │
                               │     result = 7 + 5 = 12              │
                               │                                      │
                               │  4. Memory: no memory op (pass)      │
                               │                                      │
                               │  5. Write Back: R1 ← 12              │
                               └──────────────────────────────────────┘

Try It Yourself

Here's a toy Python simulation of the five-stage cycle:

# A minimal CPU simulator: 4 registers, only the ADD instruction
registers = [0, 7, 5, 0]   # R0=0, R1=7, R2=5, R3=0
memory = [
    ("ADD", 3, 1, 2),      # ADD R3, R1, R2  → R3 = R1 + R2
    ("ADD", 0, 3, 1),      # ADD R0, R3, R1  → R0 = R3 + R1
]
PC = 0   # index into memory; each list slot holds one whole instruction

while PC < len(memory):
    # Stage 1: Fetch, then advance PC by one slot
    # (real hardware advances by the instruction size in bytes, e.g. 4)
    instruction = memory[PC]
    PC += 1
    print(f"IF:  fetched {instruction}")

    # Stage 2: Decode, and read the source registers
    op, rd, rs, rt = instruction
    a, b = registers[rs], registers[rt]
    print(f"ID:  op={op}  src={a},{b}  dest=R{rd}")

    # Stage 3: Execute
    if op == "ADD":
        result = a + b
    else:
        raise ValueError(f"unknown opcode: {op}")
    print(f"EX:  result = {result}")

    # Stage 4: Memory (ADD never touches memory)
    print("MEM: (no memory op)")

    # Stage 5: Write Back
    registers[rd] = result
    print(f"WB:  R{rd} <- {result}\n")

print("Final register state:", registers)

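Running this shows each instruction marching through all five stages, ending with Final register state: [19, 7, 5, 12]. The stage sequence mirrors what real hardware does, only at Python speed rather than in a handful of clock cycles. One simplification matters: this loop finishes each instruction before fetching the next, while a pipelined CPU overlaps the stages of several instructions at once.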

🔬 Going Deeper

The PC is the heart of control flow

The Program Counter is the CPU's "you are here" marker, but it doesn't always step forward quietly. A JMP instruction forcibly sets PC to a new address. A CALL pushes the return address (the address of the instruction right after the CALL) onto the stack and then jumps to the function; the matching RET pops that address back into PC. Loops, branches, and function calls are, at their core, all just manipulations of the PC. Every if, while, and function you've ever written compiles down to instructions that move the PC around.
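A toy interpreter shows how little these instructions do beyond writing to the PC. This is a sketch, not real x86: the opcodes PRINT and HALT are hypothetical, addresses are list indices, and a Python list stands in for the stack.

```python
program = [
    ("CALL", 5),                    # 0: call the "function" at address 5
    ("PRINT", "back in main"),      # 1: execution resumes here after RET
    ("JMP", 4),                     # 2: skip over address 3
    ("PRINT", "skipped by JMP"),    # 3: never executed
    ("HALT", None),                 # 4: stop
    ("PRINT", "inside function"),   # 5: function body
    ("RET", None),                  # 6: return to the caller
]

def run(program):
    pc, stack, output = 0, [], []
    while True:
        op, arg = program[pc]
        pc += 1                     # default: step to the next instruction
        if op == "JMP":
            pc = arg                # overwrite PC with the target
        elif op == "CALL":
            stack.append(pc)        # return address: the instruction after CALL
            pc = arg
        elif op == "RET":
            pc = stack.pop()        # resume right after the CALL
        elif op == "PRINT":
            output.append(arg)
        elif op == "HALT":
            return output
        else:
            raise ValueError(f"unknown opcode: {op}")

print(run(program))   # ['inside function', 'back in main']
```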

How interrupts cut in line

Your program is running fine when suddenly a key is pressed. How does the CPU know? Through interrupts. When a hardware interrupt fires, the CPU finishes its current instruction, saves the PC (and other state) somewhere safe, then jumps to a special interrupt handler routine. After the handler finishes, everything is restored and the original program continues—none the wiser. Interrupts can inject themselves between any two instructions, which is what makes keyboards, timers, and network cards work.
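The "between any two instructions" rule can be sketched as a check at the bottom of the instruction loop. This is a toy model, assuming a hypothetical `pending` table that says after how many instructions an interrupt arrives and a hypothetical three-step `HANDLER`. Real CPUs save more state than this; the sketch only shows the PC being saved, reused while the handler runs, and restored.

```python
HANDLER = ["save_regs", "service_device", "restore_regs"]  # hypothetical handler body

def run_with_interrupts(program, pending):
    """Run a toy program; pending maps an instruction count to an interrupt name."""
    pc, executed, log = 0, 0, []
    while pc < len(program):
        # 1. Run one complete instruction (here it is only logged)
        log.append(program[pc])
        pc += 1
        executed += 1
        # 2. Between instructions, check whether an interrupt is pending
        irq = pending.get(executed)
        if irq is not None:
            saved_pc = pc                          # remember where to resume
            for pc, step in enumerate(HANDLER):    # PC now walks the handler's code
                log.append(f"{irq}:{step}")
            pc = saved_pc                          # restore; the program resumes
    return log

print(run_with_interrupts(["i0", "i1", "i2"], {2: "kbd"}))
# ['i0', 'i1', 'kbd:save_regs', 'kbd:service_device', 'kbd:restore_regs', 'i2']
```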

Five stages is just the beginning

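The five stages we've described form the classic MIPS textbook pipeline. In a pipelined CPU, the stages of consecutive instructions overlap: while one instruction executes, the next is being decoded and the one after that is being fetched, so ideally one instruction completes every clock cycle. Modern x86 chips like Intel's Core series or AMD's Zen architecture split the work into well over a dozen internal stages, and some designs approach 20 or more. Shorter stages each do less work, which permits a higher clock frequency and therefore higher theoretical throughput. But a deeper pipeline also holds more instructions in flight, so when something goes wrong (like a branch misprediction), the CPU has to flush more work in progress, wasting more cycles. Pipeline depth is an engineering tradeoff, not a free lunch.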
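A back-of-the-envelope model makes the tradeoff concrete. It assumes an idealized k-stage pipeline with no stalls, so n instructions take k + n - 1 cycles instead of n * k, and it assumes each mispredicted branch is discovered in the last stage, flushing the k - 1 instructions behind it. The clock periods are hypothetical round numbers chosen to illustrate the trend, not measurements of any real chip.

```python
def unpipelined_cycles(n: int, k: int) -> int:
    # Each instruction occupies all k stages before the next one starts
    return n * k

def pipelined_cycles(n: int, k: int, mispredicts: int = 0) -> int:
    # k cycles to fill the pipeline, then one instruction completes per cycle;
    # each misprediction (assumed found in the last stage) wastes k - 1 cycles
    return (k + n - 1) + mispredicts * (k - 1)

n, misses = 1000, 50
# Hypothetical clock periods in picoseconds: the deeper pipeline does less
# work per stage, so its cycle is assumed to be shorter
for k, cycle_ps in ((5, 1000), (20, 300)):
    clean = pipelined_cycles(n, k) * cycle_ps
    with_misses = pipelined_cycles(n, k, misses) * cycle_ps
    print(f"{k:2d} stages: {clean} ps clean, {with_misses} ps with {misses} mispredicts")
#  5 stages: 1004000 ps clean, 1204000 ps with 50 mispredicts
# 20 stages: 305700 ps clean, 590700 ps with 50 mispredicts
```

Under these assumptions the deeper pipeline is faster overall, but the same 50 mispredictions slow it by roughly 93% versus about 20% for the 5-stage design.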
