Chapter 24

The Cost of Multi-Core

"My server has 64 cores. Why does running 64 threads only give me a 10× speedup?"

This question frustrates engineers endlessly. Intuition says 64 workers should be 64× faster than one. But reality says otherwise: the more workers you add, the higher the coordination overhead. When everyone shares the same toolbox (memory), queues to report progress (bus communication), and waits on each other's locks, the time spent waiting can dwarf the time spent working.

Multi-core processors bring more compute — but also a cost structure you must understand to use them effectively.

Core Concepts

Amdahl's Law: The Ceiling on Parallelism

In 1967, Gene Amdahl published a deceptively simple formula:

Speedup S = 1 / ((1 - P) + P/N)

P = fraction of the program that can be parallelized
N = number of cores

Suppose 10% of your program is inherently serial (locks, initialization, log aggregation) and 90% is perfectly parallel:

N=1:   1.0×
N=4:   3.08×
N=8:   4.71×
N=16:  6.40×
N=64:  8.77×
N=∞:  10.0×   ← hard ceiling — no amount of cores can break this

With just 10% serial code, 64 cores achieve only ~87% of the theoretical maximum. The serial portion is a one-lane road that every thread must queue up and pass through single-file. Adding more lanes on either side changes nothing at the bottleneck.

NUMA: Not All Memory Is Created Equal

Modern multi-socket servers (dual Xeon, for example) don't hang all RAM off a single shared bus. Instead, they use NUMA (Non-Uniform Memory Access): each CPU socket has its own local memory that it can reach directly. Accessing the other socket's memory requires traversing a high-speed interconnect (Intel's QPI/UPI, AMD's Infinity Fabric).

Socket 0                        Socket 1
┌──────────────────┐  QPI/UPI  ┌──────────────────┐
│  CPUs 0–31       │◄─────────►│  CPUs 32–63      │
│  ┌────────────┐  │           │  ┌────────────┐  │
│  │  L3 Cache  │  │           │  │  L3 Cache  │  │
│  └────────────┘  │           │  └────────────┘  │
│  Local memory    │           │  Local memory    │
│  latency ~65 ns  │           │  latency ~65 ns  │
└──────────────────┘           └──────────────────┘
      ↑ Cross-socket access latency: ~130–200 ns (2–3× slower)

A thread running on Socket 0 whose memory was allocated on Socket 1 takes the long way around on every single memory access. That's NUMA remote access, with latency 3–5× higher than local. On Linux, numactl --hardware shows the topology and numastat tracks remote access counts.

This is one of the most insidious bugs in multi-core programming. Two threads appear to modify completely independent variables — yet they constantly stall each other.

struct Counters {
    long a;   // modified by thread 0
    long b;   // modified by thread 1
};

Both a and b are 8 bytes. Together they fit inside a single 64-byte cache line. When thread 0 writes a, the entire cache line is marked dirty, and the copy in thread 1's CPU cache is invalidated. Thread 1 must reload the whole line from memory to access b, even though b hasn't changed. Thread 1 writes b, the same thing happens in reverse. This ping-pong continues for every single increment.

Cache line (64 bytes):
┌──────────────────────────────────────────────────┐
│  a (8B)  │  b (8B)  │  ...padding (48B)...       │
│ written  │ written  │                            │
│ by T0    │ by T1    │                            │
└──────────────────────────────────────────────────┘
      ↑ Same cache line! Modifying either field invalidates the other CPU's copy

Fix: pad each variable to its own cache line:

struct Counters {
    long a;
    char _pad0[56];   // pad to 64 bytes
    long b;
    char _pad1[56];
};
// Or with C11 alignment:
struct alignas(64) Counter { long val; };

The MESI Protocol: Cache Coherence Basics

Every core has its own L1/L2 cache. If two cores each cache the same memory address, and one modifies it, what should the other core see? The answer requires a coherence protocol. MESI uses four states per cache line:

M (Modified)  — This core has modified the line; other copies are invalid;
                must write back to memory before eviction
E (Exclusive) — Only this core has the line; memory is up-to-date;
                can modify without broadcasting
S (Shared)    — Multiple cores hold copies; reads are fine;
                must transition to E/M before writing
I (Invalid)   — This copy is stale; must reload from memory/another cache

When thread 0 writes a, the cache controller broadcasts an invalidation message on the interconnect, flipping the cache line to I on every other core. When thread 1 next accesses b (on the same line), it finds I and must reload — even though b was untouched. That is false sharing: b pays the price for a's write.

Memory Barriers

Each CPU has its own Store Buffer and Load Buffer. Writes don't necessarily become visible to other cores immediately. A memory barrier is an instruction that enforces ordering — everything before the barrier is committed before anything after it begins:

// Without barrier: another core might see flag=1 before data=42
data = 42;
flag = 1;

// With barrier: guarantees any core seeing flag=1 also sees data=42
data = 42;
__sync_synchronize();   // GCC full memory barrier
flag = 1;

On x86, mfence is the full barrier; sfence and lfence apply to stores and loads respectively. ARM has a much weaker memory model and requires barriers more aggressively. This is why multi-threaded x86 code occasionally exhibits mysterious race conditions when ported to ARM servers — correct barriers that x86's strong ordering provided implicitly must now be made explicit.

Hands-On Verification

This program demonstrates the false sharing penalty:

#include <stdio.h>
#include <pthread.h>
#include <time.h>

#define ITERATIONS 500000000L

// False sharing: both counters share one cache line
struct Shared { long a, b; };

// Fixed: each counter occupies its own cache line
struct Padded { long a; char pad[56]; long b; };

typedef struct { void *counters; int which; } Args;

void *worker_shared(void *arg) {
    Args *a = arg;
    struct Shared *c = a->counters;
    if (a->which == 0)
        for (long i = 0; i < ITERATIONS; i++) c->a++;
    else
        for (long i = 0; i < ITERATIONS; i++) c->b++;
    return NULL;
}

void *worker_padded(void *arg) {
    Args *a = arg;
    struct Padded *c = a->counters;
    if (a->which == 0)
        for (long i = 0; i < ITERATIONS; i++) c->a++;
    else
        for (long i = 0; i < ITERATIONS; i++) c->b++;
    return NULL;
}

double run(void *(*fn)(void*), void *counters) {
    pthread_t t0, t1;
    Args a0 = {counters, 0}, a1 = {counters, 1};
    struct timespec s, e;
    clock_gettime(CLOCK_MONOTONIC, &s);
    pthread_create(&t0, NULL, fn, &a0);
    pthread_create(&t1, NULL, fn, &a1);
    pthread_join(t0, NULL); pthread_join(t1, NULL);
    clock_gettime(CLOCK_MONOTONIC, &e);
    return (e.tv_sec-s.tv_sec)*1e3 + (e.tv_nsec-s.tv_nsec)/1e6;
}

int main() {
    struct Shared s = {0, 0};
    struct Padded p = {0, {0}, 0};
    printf("False sharing (same cache line): %.0f ms\n", run(worker_shared, &s));
    printf("Padded (separate cache lines):   %.0f ms\n", run(worker_padded, &p));
    return 0;
}

gcc -O2 -pthread -o fs false_sharing.c && ./fs

Typical output:

False sharing (same cache line): 4120 ms
Padded (separate cache lines):    430 ms

A 10× difference — from two threads that appear to do independent work. They aren't fighting over data logically, but they're fighting over cache lines physically.

🔬 Going Deeper

NUMA-aware programming is essential for large-memory systems. Linux exposes libnuma, which allows explicit memory placement (numa_alloc_local) and thread-to-node binding (numa_run_on_node). Databases like PostgreSQL and Redis have NUMA-aware allocation modes. For data-intensive workloads, NUMA optimization alone can deliver 2–3× throughput improvement by eliminating cross-socket memory hops. The key diagnostic tool is numastat -p <pid>, which shows how much of a process's memory accesses are local vs. remote.

The hidden cost of atomic operations is frequently underestimated. On x86, atomic_fetch_add compiles to lock xadd, which acquires exclusive ownership of the cache line (equivalent to sending an MESI invalidation to every other core, then doing the write). This costs roughly 20–50 ns per operation — 10–100× slower than a plain integer add. For high-throughput counters shared across many threads, the standard optimization is per-thread local counters that are summed at the end, eliminating contention on the shared variable entirely. This is the approach used in the Linux kernel's percpu counter infrastructure and in most high-performance profiling systems.

The layered reality of Amdahl's Law: the classical formula counts only sequential code sections. Real programs have additional hidden serialization: NUMA latency spikes when the wrong thread runs on the wrong socket, false sharing silently serializes independent threads, atomic operations serialize at the hardware level. Finding these "accidental serial points" requires tools — perf c2c on Linux specifically detects false sharing by tracking cache line bouncing events. Intel VTune and AMD uProf provide NUMA and cache coherence traffic visualization. For deeper reading, Brendan Gregg's Systems Performance: Enterprise and the Cloud, Chapters 6 and 11, covers CPU and NUMA performance analysis in production systems. Paul McKenney's free book Is Parallel Programming Hard, And, If So, What Can You Do About It? is the definitive engineer's guide to memory barriers and cache coherence — written by the lead developer of the Linux RCU subsystem, it bridges theory and practice at exactly the right level.

Rate this chapter

4.6 / 5 (5 ratings)

The Cost of Multi-Core

The Cost of Multi-Core

Core Concepts

Amdahl's Law: The Ceiling on Parallelism

NUMA: Not All Memory Is Created Equal

False Sharing: The Invisible Performance Killer

The MESI Protocol: Cache Coherence Basics

Memory Barriers

Hands-On Verification

🔬 Going Deeper

💬 Comments