Chapter 7

Pipelining: The Factory Mindset of a CPU

Pipelining: The Factory Mindset

Do you have a washer and a dryer? Suppose you have three loads of laundry. Each load takes 30 minutes to wash and 30 minutes to dry.

The naive approach: wash load 1, dry load 1, wash load 2, dry load 2, wash load 3, dry load 3. Total: 180 minutes.

The smart approach: while load 1 is drying, start washing load 2. While load 2 is drying, start load 3. Total: 120 minutes.

That's pipelining in a nutshell. Instead of finishing one task completely before starting the next, you keep every workstation busy simultaneously by overlapping tasks at different stages. CPUs use the exact same idea: while one instruction is being executed, the next is being decoded, and the one after that is being fetched—all at the same time.

Core Concepts

The Five-Stage Pipeline in Time

Without pipelining, instructions execute one after another. If each stage takes one clock cycle:

No pipelining (5 instructions, 5 cycles each):
Cycle:       1    2    3    4    5    6    7  ...  25
Instruction 1:  [IF][ID][EX][ME][WB]
Instruction 2:                        [IF][ID][EX][ME][WB]
...
Total: 25 cycles to complete 5 instructions.

With pipelining, each new instruction starts one cycle after the previous one:

With pipelining (5 instructions, done by cycle 9):
Cycle:       1    2    3    4    5    6    7    8    9
Instr 1:    [IF][ID][EX][ME][WB]
Instr 2:         [IF][ID][EX][ME][WB]
Instr 3:              [IF][ID][EX][ME][WB]
Instr 4:                   [IF][ID][EX][ME][WB]
Instr 5:                        [IF][ID][EX][ME][WB]
At cycle 5, all five pipeline stages are working simultaneously!

Five instructions in 9 cycles instead of 25. Once the pipeline is "full" and running steadily, it completes one instruction per cycle—this is what CPU engineers mean when they say IPC = 1 (Instructions Per Cycle).

Throughput vs. Latency

Pipelining is a subtle tradeoff worth understanding clearly:

Latency (time for a single instruction) doesn't improve. An ADD still takes 5 cycles to complete.
Throughput (instructions completed per unit time) improves dramatically.

Like a car assembly line: each car still takes days to build, but the factory ships a new car every few minutes. Pipelining accelerates the factory's output rate, not any individual car's build time.

Three Kinds of Pipeline Hazards

Reality is messier than the ideal diagram. Instructions often depend on each other or compete for resources, causing the pipeline to stall—these situations are called pipeline hazards.

Hazard 1: Data Hazard

An instruction needs a result that the previous instruction hasn't finished computing yet:

ADD R1, R2, R3    // produces R1
ADD R4, R1, R5    // needs R1 — but R1 isn't written back yet!

Cycle:      1    2    3    4    5    6    7    8
ADD R1:    [IF][ID][EX][ME][WB]
ADD R4:         [IF][ID][??]         [EX][ME][WB]
                         ↑
              ID reads R1, but WB hasn't happened yet
              → pipeline must stall (insert "bubble")

Solution A: Stall — insert idle cycles to wait for the result.

Solution B: Forwarding (Bypassing) — wire the ALU output directly to the next instruction's ALU input, without waiting for write-back:

ADD R1:    [IF][ID][EX]────────────►[ME][WB]
ADD R4:         [IF][ID]        [EX][ME][WB]
                         forwarded ↑
                    EX result goes directly to next EX — no stall needed

Modern CPUs implement forwarding extensively; most data hazards disappear without any stall at all.

Hazard 2: Control Hazard

Branch instructions (if/else, while) create uncertainty about which instruction comes next:

CMP  R1, 0         ; compare
JE   label_done    ; jump if equal — but we don't know yet!
ADD  R2, R3, R4    ; maybe execute this...
...
label_done:
SUB  R5, R6, R7   ; ...or maybe this

When the CPU fetches JE, it doesn't yet know whether the branch will be taken—that's determined in EX. Instructions fetched during this gap may be wrong and need to be thrown away (flushed).

Solution: Branch Prediction. Modern CPUs maintain tables tracking the history of every branch: "this loop branch was taken 97 times in a row; predict taken again." Prediction accuracy regularly exceeds 95-99%. When a prediction is wrong, the incorrectly fetched instructions are flushed and the pipeline restarts from the correct address—called a branch misprediction penalty, typically 10-20 wasted cycles on modern hardware.

Hazard 3: Structural Hazard

Two instructions simultaneously need the same hardware resource—for example, both trying to access memory in the same cycle.

Solution: Divide resources. Separate the Level 1 cache into an instruction cache (I-cache) and a data cache (D-cache), so instruction fetching and data memory access can happen in parallel without conflict.

Why 4 GHz Doesn't Mean 4 Billion Operations Per Second

People often assume a 4 GHz chip does exactly 4 billion operations every second. Real performance is both lower and higher than that:

Lower because:

Stalls from data hazards waste cycles
Cache misses force the pipeline to wait 50-200+ cycles for data from RAM
Branch mispredictions flush the pipeline and restart

Higher because:

Modern CPUs are superscalar: they can issue multiple independent instructions per cycle (typically 4-6 on modern x86). So IPC can exceed 1—sometimes reaching 3-4 on well-structured code.

Clock frequency is one axis. IPC is the other. Both matter, which is why Apple's M-series chips at 3.2 GHz often outperform x86 chips at 5 GHz.

Try It Yourself

On Linux, perf stat shows actual IPC:

perf stat ./your_program

# Sample output:
#   8,234,567,890  instructions       # total instructions retired
#   3,456,789,012  cycles             # total clock cycles
#               2.38  insn per cycle  # your IPC

High IPC (2-4) means the pipeline is flowing smoothly. Matrix multiplication—regular memory access, few branches—often achieves IPC of 3+. JSON parsing—irregular branches, pointer chasing—often falls below 1.0.

🔬 Going Deeper

Branch prediction and security: Spectre and Meltdown

In 2018, researchers revealed two catastrophic CPU vulnerabilities: Spectre and Meltdown. Both exploited the fact that CPUs speculatively execute instructions down the predicted branch path before knowing whether the branch is actually taken. If the prediction is wrong, the computed results are discarded—but the cache state changes caused by the speculative execution remain. Attackers could measure those cache side-effects to infer data the program was never supposed to access. This demonstrated that pipeline optimizations aren't just a performance topic; they directly affect security.

Out-of-order execution: the next level

Modern CPUs go far beyond in-order pipelining. Out-of-order execution (OoOE) lets the CPU look ahead at dozens of upcoming instructions, find ones that don't depend on each other, and execute them in a rearranged order that keeps every execution unit busy. The results are then "retired" (committed) in original program order, so the program never knows the difference. IBM engineer Robert Tomasulo described the foundational algorithm for this in 1967; it's still the basis for how modern out-of-order processors work.

Where to learn more

Computer Organization and Design by Patterson & Hennessy — Chapters 4-5 cover pipeline hazards, forwarding, and branch prediction with detailed diagrams.
Computer Systems: A Programmer's Perspective (CSAPP) — Chapter 5 teaches you to write code that plays nicely with the CPU pipeline: how to eliminate unnecessary dependencies and help the branch predictor.
Dan Luu's blog post "Branch prediction" (danluu.com) — A thorough, engineering-focused explanation of how real branch predictors work, with concrete performance numbers.

Rate this chapter

4.6 / 5 (50 ratings)