Chapter 21

Out-of-Order Execution: The CPU Is Smarter Than You

Out-of-Order Execution: The CPU Is Smarter Than You Think

Picture yourself in a cafeteria lunch line. The person in front of you is waiting for the kitchen staff to bring out braised pork โ€” a three-minute wait. Standing there doing absolutely nothing wastes those three minutes. The smart move is to grab a soup bowl, pick up some bread, and come back for the pork when it's ready. The final tray looks exactly the same, but you finished faster.

Modern CPUs do exactly this. When an instruction is stalled waiting for data โ€” say, a value that has to be fetched from main memory โ€” the CPU doesn't sit idle. It quietly executes dozens of later instructions, then "fills in" the results once the data arrives. This is Out-of-Order Execution (OoOE).

Core Concepts

Why In-Order Execution Wastes Time

Early CPUs executed instructions strictly in program order, one at a time. The problem is that real programs are riddled with data dependencies:

int a = load(address1);   // Inst 1: fetch from memory โ€” costs 100+ ns
int b = a + 1;            // Inst 2: must wait for inst 1
int c = load(address2);   // Inst 3: completely unrelated to inst 1 & 2!

In strict order, instruction 3 cannot start until both instructions 1 and 2 finish โ€” even though it has nothing to do with them. That's pure waste.

The Three Hardware Engines That Make OoOE Work

Program-order instruction stream
            โ”‚
            โ–ผ
  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
  โ”‚  Decode / Issue โ”‚   Break instructions into micro-ops (ฮผops)
  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
           โ”‚
           โ–ผ
  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
  โ”‚       Register Renaming Unit        โ”‚
  โ”‚  Maps logical โ†’ physical registers  โ”‚
  โ”‚  Eliminates false dependencies      โ”‚
  โ”‚  (WAR / WAW hazards disappear)      โ”‚
  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                     โ”‚
         โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
         โ”‚    Reservation Station โ”‚
         โ”‚  Hold ฮผops until all   โ”‚
         โ”‚  operands are ready,   โ”‚
         โ”‚  then fire out-of-orderโ”‚
         โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                     โ”‚  OoO dispatch
         โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
         โ”‚   Execution Units      โ”‚
         โ”‚   (ALU, FPU, Load/Storeโ”‚
         โ”‚    โ€” running OoO)      โ”‚
         โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                     โ”‚
         โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
         โ”‚  Reorder Buffer (ROB)  โ”‚
         โ”‚  Commit results back   โ”‚
         โ”‚  in program order      โ”‚
         โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Register Renaming eliminates false dependencies. If two unrelated instructions both write to r1, the CPU assigns them different physical registers so they can execute simultaneously. The programmer never sees this happen.

Reservation Station (RS) acts as a waiting room. Every instruction enters the RS along with tags describing which earlier instructions it needs. When an execution unit broadcasts its result, any waiting instruction whose tag matches immediately becomes eligible to fire โ€” regardless of the original program order.

Reorder Buffer (ROB) enforces the illusion of in-order execution. Results from out-of-order execution are written into the ROB first. Only when an instruction reaches the head of the ROB โ€” meaning all prior instructions have already committed โ€” does its result get written to the architectural state (register file, memory). From software's perspective, instructions appear to execute sequentially.

The Tomasulo Algorithm: The Theoretical Foundation

In 1967, IBM engineer Robert Tomasulo designed this scheduling scheme for the IBM System/360 Model 91 floating-point unit. The key insight:

  1. When an instruction enters the RS, replace its missing operands with tags โ€” labels that say "waiting for the result of instruction X."
  2. When an execution unit finishes, it broadcasts its result alongside its tag.
  3. Every RS entry monitors the broadcast. The moment all operands arrive, the instruction fires immediately.

This idea is almost verbatim what modern CPUs implement, scaled from a handful of execution units to several dozen.

The Security Downside: Spectre

Out-of-order execution brought an unexpected security crisis. The Spectre vulnerability, disclosed in January 2018, exploits OoOE:

An attacker constructs code that tricks the CPU into speculatively executing an out-of-bounds memory access. The instruction is eventually discarded โ€” it never commits โ€” but the act of executing it loads a cache line. That cache line leaves a timing fingerprint: measuring how long it takes to access different memory addresses reveals which address was secretly loaded, leaking data the attacker should never see.

The root cause: OoOE erases the "virtual isolation" between instructions. Speculative side effects (cache state changes) outlive the instructions that caused them.

Hands-On Verification

The following C program demonstrates how dependency chains limit out-of-order parallelism. dependent_loop has a chain where each iteration depends on the previous result โ€” the CPU cannot reorder past it. independent_loop breaks the problem into four independent chains that the CPU can advance simultaneously:

#include <stdio.h>
#include <time.h>
#include <stdint.h>

#define N 1000000000LL

// Every iteration depends on the previous โ€” strict serial chain
uint64_t dependent_loop() {
    uint64_t sum = 0;
    for (long long i = 0; i < N; i++) {
        sum = sum * 3 + 1;
    }
    return sum;
}

// Four independent chains โ€” CPU can run them in parallel via OoOE
uint64_t independent_loop() {
    uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (long long i = 0; i < N / 4; i++) {
        s0 = s0 * 3 + 1;
        s1 = s1 * 5 + 2;
        s2 = s2 * 7 + 3;
        s3 = s3 * 9 + 4;
    }
    return s0 + s1 + s2 + s3;
}

double elapsed_ms(struct timespec a, struct timespec b) {
    return (b.tv_sec - a.tv_sec) * 1000.0
         + (b.tv_nsec - a.tv_nsec) / 1e6;
}

int main() {
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    volatile uint64_t r1 = dependent_loop();
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("Dependent chain: %.1f ms  (result=%lu)\n", elapsed_ms(t0, t1), r1);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    volatile uint64_t r2 = independent_loop();
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("Independent:     %.1f ms  (result=%lu)\n", elapsed_ms(t0, t1), r2);

    return 0;
}

Build and run:

gcc -O2 -o ooo ooo.c && ./ooo

Typical results (Apple M2 / Intel Core i7):

Dependent chain: 1200 ms
Independent:     350 ms

The independent version is roughly 3ร— faster โ€” the CPU sees four unrelated computation streams and drives them forward simultaneously through out-of-order execution.

๐Ÿ”ฌ Going Deeper

How big is the OoOE window? The ROB size determines how many in-flight instructions the CPU can juggle. Intel's Sapphire Rapids has an ROB of approximately 512 entries; Apple's M3 is estimated at over 600. A larger ROB lets the CPU "see" further ahead and find more independent work. But bigger ROBs consume more die area and power โ€” every entry needs comparison logic, wiring, and arbitration. ROB sizing is one of the central microarchitecture tradeoffs.

How does memory ordering stay consistent? The CPU allows loads and stores to execute out of order internally, but commits them strictly in program order. Store instructions write to a Store Buffer during execution; they only drain into the L1 cache when the store reaches the head of the ROB and retires. This is what enables x86's Total Store Order (TSO) memory model โ€” stores appear globally visible in program order, even though the core executed them out of order internally. Other ISAs (ARM, RISC-V) allow weaker ordering and require explicit memory barriers to enforce stricter sequencing.

The long shadow of Spectre turned out to be far wider than anyone expected. Mitigations required inserting LFENCE barriers into kernel and hypervisor hot paths, slowing some Linux kernel operations by 5โ€“30% on affected Intel processors. The lesson: microarchitectural performance optimizations can quietly undermine security assumptions that software has relied on for decades. For a rigorous treatment of out-of-order pipelines, read Chapter 3 of Computer Architecture: A Quantitative Approach by Hennessy and Patterson (6th edition). For a practitioner's perspective on measuring and tuning CPU execution โ€” including how to observe ROB utilization with perf stat โ€” see Chapter 6 of Brendan Gregg's Systems Performance: Enterprise and the Cloud.

Rate this chapter
4.7  / 5  (8 ratings)

๐Ÿ’ฌ Comments