Chapter 22

Branch Prediction: Betting to Save Time

Branch Prediction: Guess Right and Save Time

Imagine driving on a highway at 200 km/h. There's a fork ahead. If you wait until you see the sign at the junction to decide which way to turn, it's already too late — you need to commit hundreds of meters in advance. Guess right: you barrel through at full speed. Guess wrong: emergency braking, U-turn, fifteen seconds lost.

The if-else branch in your code is exactly that fork. Pipelining means the CPU is fetching instructions dozens of cycles ahead of the one currently executing. When it hits a conditional jump, it doesn't yet know which way the branch will go — so it has to guess. Guess wrong and the pipeline gets flushed. The penalty is steep.

Core Concepts

Control Hazards: How if-else Stalls the Pipeline

In a classic five-stage pipeline (Fetch → Decode → Execute → Memory → Write-back), the Fetch stage runs four cycles ahead of Execute. When a conditional branch instruction reaches Execute, the CPU is already 4 instructions deep into the predicted path. If the prediction is wrong, those instructions must be discarded and the pipeline must restart from the correct address.

The cost of a misprediction on a modern CPU: roughly 15–20 clock cycles — comparable to an L1 cache miss.

Clock cycle:     1     2     3     4     5     6     7
Branch instr:  Fetch Decode Exec  Mem  WB
Wrong path A:        Fetch Decode Exec  ← FLUSH! All wasted
                                ↑
               Branch resolved here — too late, fetched the wrong path

Static vs. Dynamic Prediction

Static prediction ignores history and applies fixed rules:

Accuracy: roughly 55–65%. Only marginally better than a coin flip for general code.

Dynamic prediction uses execution history to decide. The CPU watches what the branch has done in the past and uses that to predict the future.

The 2-Bit Saturating Counter

The classic dynamic predictor gives every branch a 2-bit counter (values 0–3). The high bit determines the prediction:

State machine:
  00 (Strongly NT) → 01 (Weakly NT) → 10 (Weakly T) → 11 (Strongly T)
        ↑                                                      │
        └──────────────────────────────────────────────────────┘

  Branch taken    → counter +1 (max 3)
  Branch not taken → counter -1 (min 0)
  High bit = 1 (states 10 or 11) → predict Taken
  High bit = 0 (states 00 or 01) → predict Not Taken

The 2-bit design has a useful property: a single misprediction doesn't immediately flip the prediction. Two consecutive wrong predictions are required to cross the midpoint. This makes the predictor resilient to occasional aberrations in an otherwise regular pattern.

How Accurate Are Modern Predictors?

A simple 2-bit counter only remembers the most recent outcome — it fails completely on alternating patterns (T, NT, T, NT, …). Modern CPUs add a Branch History Register (BHR) that records the last N branch outcomes as a shift register. This history is hashed with the branch address to index into a Pattern History Table (PHT) of 2-bit counters. Longer history → more complex patterns recognized.

The current state of the art is the TAGE predictor (Tagged Geometric History Length), which maintains multiple tables with geometrically increasing history lengths and takes the prediction from the longest matching entry. Intel and AMD both use TAGE-inspired designs in their high-performance cores. Typical accuracy exceeds 95%, reaching 98–99% on regular workloads.

The Real Cost of Misprediction

Misprediction penalty = pipeline depth × clock period

Intel Core i9 (≈20 stage pipeline, 3 GHz):
  Penalty ≈ 15–20 cycles
  Per misprediction ≈ 5–7 nanoseconds

If branches appear every 4 instructions, error rate 5%:
  CPI impact ≈ 0.05 × 17 / 4 ≈ 0.21 cycles/instruction (~20% throughput loss)

Hands-On Verification

The classic "sorted vs. unsorted array" benchmark makes branch prediction effects strikingly visible:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <string.h>

#define SIZE (1 << 20)   // 1M elements
#define THRESHOLD 128

int cmp(const void *a, const void *b) {
    return *(int*)a - *(int*)b;
}

long long sum_with_branch(int *arr, int n) {
    long long s = 0;
    for (int i = 0; i < n; i++) {
        if (arr[i] >= THRESHOLD)   // branch outcome is random on unsorted data
            s += arr[i];
    }
    return s;
}

int main() {
    int *data   = malloc(SIZE * sizeof(int));
    int *sorted = malloc(SIZE * sizeof(int));
    srand(42);

    for (int i = 0; i < SIZE; i++)
        data[i] = rand() & 0xFF;   // 0–255

    memcpy(sorted, data, SIZE * sizeof(int));
    qsort(sorted, SIZE, sizeof(int), cmp);

    struct timespec t0, t1;
    int RUNS = 10;

    // Unsorted: branch outcome is ~50% random → misprediction rate ~50%
    clock_gettime(CLOCK_MONOTONIC, &t0);
    volatile long long r1 = 0;
    for (int r = 0; r < RUNS; r++) r1 += sum_with_branch(data, SIZE);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("Unsorted: %.1f ms\n",
           ((t1.tv_sec-t0.tv_sec)*1e3 + (t1.tv_nsec-t0.tv_nsec)/1e6)/RUNS);

    // Sorted: all "not taken" then all "taken" → misprediction rate ~0%
    clock_gettime(CLOCK_MONOTONIC, &t0);
    volatile long long r2 = 0;
    for (int r = 0; r < RUNS; r++) r2 += sum_with_branch(sorted, SIZE);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("Sorted:   %.1f ms\n",
           ((t1.tv_sec-t0.tv_sec)*1e3 + (t1.tv_nsec-t0.tv_nsec)/1e6)/RUNS);

    free(data); free(sorted);
    return 0;
}
gcc -O2 -o bp bp.c && ./bp

Typical output:

Unsorted: 8.3 ms
Sorted:   2.1 ms

About 4× faster — not because sorting changes the algorithm's work, but because it makes the branch outcome perfectly predictable. The CPU can pipeline at full speed without ever flushing.

🔬 Going Deeper

Branchless programming eliminates the branch entirely, converting if to arithmetic. For the loop above:

// Branchless: performance is stable regardless of data order
s += arr[i] & -(arr[i] >= THRESHOLD);
// If arr[i] >= THRESHOLD: -(1) = 0xFFFF...  → AND keeps the value
// If arr[i] <  THRESHOLD: -(0) = 0          → AND zeroes it out

This trades readability for predictable performance. Use it only after profiling confirms the branch is a genuine bottleneck — the compiler often generates branchless code on its own with -O2 when it can prove it's safe.

Indirect branch prediction handles jumps whose target address isn't known at compile time: C++ virtual function calls, function pointers, switch tables with many cases. The CPU maintains an Indirect Branch Target Buffer (IBTB) that records historical targets for each call site. Spectre Variant 2 attacks this mechanism — it trains the IBTB with a malicious target address so that when victim code makes a legitimate indirect jump, the CPU speculatively jumps to an attacker-controlled gadget, leaking secrets through cache timing.

Branch prediction and machine learning is an active research area. The Perceptron predictor (Jiménez and Lin, 2001) uses a linear classifier whose weights are updated based on prediction outcomes, capturing correlations across far longer histories than table-based schemes. The TAGE predictor (André Seznec, 2006, Journal of Instruction-Level Parallelism) is the definitive modern design and is required reading if you want to understand what ships in Intel and AMD CPUs today. For a hands-on measurement approach — including how to count branch misses with perf stat -e branch-misses,branch-instructions — see Brendan Gregg's Systems Performance: Enterprise and the Cloud, Chapter 6. For the theoretical foundations of pipeline hazards and dynamic prediction, Computer Architecture: A Quantitative Approach by Hennessy and Patterson, Chapter 3, remains the authoritative reference.

Rate this chapter
4.5  / 5  (7 ratings)

💬 Comments