Chapter 17

Threads and Concurrency

Imagine a restaurant kitchen with a single chef. Every dish must wait until the previous one is finished — hopelessly slow. Now hire five chefs who share the same kitchen: same stoves, same pantry, same utensils. Each chef has their own cutting board and their own pace. That shared-kitchen arrangement is exactly what multithreading looks like.

A process is like a whole company. A thread is like an employee inside it. Everyone in the company shares the office, the filing server, and the accounting system — but each person has their own task list and working memory. The benefit of sharing is low communication overhead (pass data directly through shared memory). The risk is that one careless employee can corrupt a shared file and take everyone down with them.

Core Concepts

Process vs. Thread: What Is Shared, What Is Not

┌─────────────────────────────────────────────────────────┐
│                   Process Address Space                  │
│                                                         │
│  ┌─────────────────────────────────────────────────┐   │
│  │              Shared by All Threads               │   │
│  │   .text (code)   .data (globals)   Heap          │   │
│  │   File descriptors   Signal table   Page tables  │   │
│  └─────────────────────────────────────────────────┘   │
│                                                         │
│   Thread 1          Thread 2          Thread 3          │
│  ┌──────────┐     ┌──────────┐     ┌──────────┐        │
│  │Own Stack │     │Own Stack │     │Own Stack │        │
│  │Thread ID │     │Thread ID │     │Thread ID │        │
│  │Registers │     │Registers │     │Registers │        │
│  │Prog Ctr  │     │Prog Ctr  │     │Prog Ctr  │        │
│  └──────────┘     └──────────┘     └──────────┘        │
└─────────────────────────────────────────────────────────┘

Each thread owns privately:

Stack — local variables and call frames, invisible to other threads
Program Counter (PC) — which instruction is currently executing
Register state — the CPU snapshot saved during a context switch

All threads share:

Code segment (all threads execute the same code)
Heap (memory from malloc is accessible to any thread)
Global and static variables
Open file handles and network connections
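
The two lists above can be sketched with Python's standard threading module. This is a minimal illustration, not a pattern to copy blindly: the names shared_results, results_lock, and worker are made up for this example. Each thread keeps local_total on its own stack, while all of them append to the same heap-allocated list.

```python
import threading

shared_results = []              # heap object: visible to every thread
results_lock = threading.Lock()  # hypothetical name; guards the shared list

def worker(n):
    local_total = 0              # local variable: lives on this thread's own stack
    for i in range(n):
        local_total += i
    with results_lock:           # serialize access to the shared list
        shared_results.append((n, local_total))

threads = [threading.Thread(target=worker, args=(n,)) for n in (10, 100, 1000)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(shared_results))
# [(10, 45), (100, 4950), (1000, 499500)]
```

No thread can see another thread's local_total, but every thread's result ends up in the one shared list.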

Concurrency vs. Parallelism: Appearing Simultaneous vs. Actually Simultaneous

These two words are routinely conflated. They mean different things:

Concurrency: one CPU rapidly switches between tasks (illusion of simultaneity)
Timeline:
CPU: ─A─A─B─B─A─A─C─C─B─B─C─C─►
     Tasks A, B, C take turns. Only one runs at any instant.

Parallelism: multiple CPUs truly running at the same time
Timeline:
Core 0: ──A──A──A──A──►
Core 1: ──B──B──B──B──►
Core 2: ──C──C──C──C──►
        A, B, C execute simultaneously on different cores.

If your Python program has 4 threads but only 1 physical CPU core, you have concurrency, not parallelism. Concurrency is a design strategy (how you structure your code). Parallelism is a hardware execution reality (how many cores are actually working at once). A well-designed concurrent program can gain parallelism on a multi-core machine without being restructured, provided the runtime is able to run its threads on several cores at once. CPython's GIL, covered later in this chapter, prevents that for CPU-bound Python code.
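
Concurrency pays off even without parallelism when tasks spend most of their time waiting. The sketch below assumes time.sleep is a fair stand-in for a network or disk wait; wait_for_io is a made-up name. Four 0.2-second waits overlap, so the total is close to 0.2 seconds rather than 0.8, no matter how many cores the machine has.

```python
import threading
import time

def wait_for_io(delay):
    time.sleep(delay)   # stand-in for a blocking network or disk wait

start = time.perf_counter()
threads = [threading.Thread(target=wait_for_io, args=(0.2,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.perf_counter() - start

print(f"4 waits of 0.2s finished in {elapsed:.2f}s")
```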

The Cost of a Context Switch

Thread switching is not free. Every time the OS switches threads, it must:

Save the current thread's registers (general-purpose registers + PC + status flags) to memory
Restore the next thread's registers from memory
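If switching to a different process, also switch page tables, which flushes the TLB (translation lookaside buffer) unless the CPU tags TLB entries by address space (PCID on x86, ASID on ARM)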

One context switch costs roughly 1–10 microseconds. That sounds tiny, but with 1,000 threads context-switching constantly, the overhead itself becomes the bottleneck. This is one of the reasons coroutines became so popular.
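
To get a feel for when the overhead starts to matter, here is a back-of-the-envelope estimate. It assumes a flat 5 microseconds per switch (the middle of the range above) and counts only the direct cost, ignoring the cache misses that follow a switch. The function name is invented for this sketch.

```python
def switch_overhead_fraction(switches_per_second, cost_microseconds):
    """Fraction of one core's time spent purely on context switching."""
    return switches_per_second * cost_microseconds / 1_000_000

for rate in (1_000, 20_000, 100_000):
    fraction = switch_overhead_fraction(rate, cost_microseconds=5)
    print(f"{rate:>7} switches/s -> {fraction:.1%} of a core")
```

At 100,000 switches per second, half of a core's time would go to switching alone under these assumptions.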

User-Mode Threads vs. Kernel-Mode Threads

User-mode threads (M:1 model)
  Many user threads mapped to one kernel thread
  ┌──────────────────────────────┐
  │  Thread library (user sched) │
  │   Thread A  Thread B  Thread C│
  └──────────────┬───────────────┘
                 │ (just 1)
  ┌──────────────▼───────────────┐
  │     OS kernel thread          │
  └──────────────────────────────┘
  Pro: switching is fast (no syscall)
  Con: one blocked thread blocks all

Kernel-mode threads (1:1 model — what Linux uses)
  Thread A ──► Kernel thread 1 ──► Core 0
  Thread B ──► Kernel thread 2 ──► Core 1
  Thread C ──► Kernel thread 3 ──► Core 0/1
  Pro: true parallelism; one blocked thread doesn't affect others
  Con: creation and switching require syscalls
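
You can watch the 1:1 model from inside CPython, whose threading.Thread objects are backed by kernel threads. A small sketch, assuming Python 3.8 or later (where threading.get_native_id exists); the barrier keeps all three threads alive at once so the OS cannot reuse an ID.

```python
import threading

native_ids = []
ids_lock = threading.Lock()
all_started = threading.Barrier(3)   # hold every thread until all three have reported

def record_native_id():
    with ids_lock:
        native_ids.append(threading.get_native_id())  # ID assigned by the kernel
    all_started.wait()

threads = [threading.Thread(target=record_native_id) for _ in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(native_ids)   # three distinct kernel thread IDs
```

On Linux these are the same IDs that ps -L reports in its LWP column.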

Coroutines: The Lightweight Alternative

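A coroutine is a unit of execution that yields control voluntarily. The switch happens entirely in user space: no kernel involvement, no system call, and far less state to save than a thread context switch. Switching between coroutines typically costs nanoseconds rather than microseconds. Python's async/await and Rust's async are built directly on this idea. Go's goroutines are a close relative: they are scheduled in user space by the Go runtime, which can also preempt them.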

import asyncio

async def boil_water():
    print("Boiling water...")
    await asyncio.sleep(3)   # yield control; don't block the loop
    print("Water ready, steeping tea")

async def cut_fruit():
    print("Slicing apple...")
    await asyncio.sleep(1)
    print("Apple sliced")

async def main():
    await asyncio.gather(boil_water(), cut_fruit())

asyncio.run(main())
# Both tasks run "concurrently"; total time is 3s, not 4s
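
To see what "yielding control voluntarily" means without any asyncio machinery, here is a toy round-robin scheduler built on plain generators. It is a teaching sketch, not how asyncio is implemented; task and run_round_robin are invented names.

```python
from collections import deque

def task(name, steps, log):
    for i in range(steps):
        log.append(f"{name}{i}")
        yield                      # hand control back to the scheduler

def run_round_robin(tasks):
    ready = deque(tasks)
    while ready:
        current = ready.popleft()
        try:
            next(current)          # resume the task until its next yield
            ready.append(current)  # not finished: back of the queue
        except StopIteration:
            pass                   # finished: drop it

log = []
run_round_robin([task("A", 2, log), task("B", 3, log)])
print(log)
# ['A0', 'B0', 'A1', 'B1', 'B2']
```

The whole "switch" is a function return and a function call. Nothing here ever asks the kernel to do anything.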

Why Python Has the GIL

Python's GIL (Global Interpreter Lock) ensures that only one thread executes Python bytecode at any given moment. Why?

CPython manages memory primarily through reference counting. Every object carries a counter: it increments when something references the object, and decrements when the reference is dropped. When the count reaches zero, memory is freed. If two threads modify the same object's reference count at the same time without synchronization, an update can be lost. A count that ends up too high leaks memory (the object is never freed); a count that ends up too low crashes the program (the object is freed while still in use). The GIL is the blunt-but-effective solution: one big lock prevents concurrent reference-count updates.

The price: Python threads cannot exploit multiple cores for CPU-bound work. The workarounds are multiprocessing (separate processes, each with its own GIL) or C extensions like NumPy (which release the GIL during heavy computation). Two newer efforts attack the lock itself: Python 3.12 added a per-interpreter GIL for subinterpreters (PEP 684), and Python 3.13 introduced an experimental free-threaded build that runs without the GIL (PEP 703).
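
The reference counter that the GIL protects is easy to observe. The sketch below uses sys.getrefcount and compares two readings rather than trusting the absolute number, since the call itself can add a temporary reference. The names payload and holders are made up.

```python
import sys

payload = []                        # a fresh object to observe
before = sys.getrefcount(payload)

holders = [payload, payload]        # the list now holds two more references
after = sys.getrefcount(payload)

print(f"references added: {after - before}")
# references added: 2
```

Every assignment, argument pass, and container insert touches a counter like this one, which is why protecting each update individually would be expensive.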

Hands-On Verification

# List the threads of a running Python process (one row per thread)
ps -Lp $(pgrep -n python3)

# Or read the thread count from /proc
grep Threads /proc/$(pgrep -n python3)/status

# Demonstrate shared heap — and a hidden race condition
import threading

counter = 0   # global, shared by all threads

def increment():
    global counter
    for _ in range(100_000):
        counter += 1   # NOT atomic — potential race condition

threads = [threading.Thread(target=increment) for _ in range(5)]
for t in threads: t.start()
for t in threads: t.join()

print(f"Expected: 500000, Got: {counter}")
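# The GIL does NOT make `counter += 1` atomic: load, add, and store are
# separate bytecodes, and a thread switch between them loses an update.
# Newer CPython releases tend to switch threads only at loop boundaries and
# calls, so this loop often prints 500000 anyway. Older releases, and C or
# Java code without locks, regularly show wrong results.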
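
One way to remove the race from the counter example is to wrap the read-modify-write in a threading.Lock. A minimal sketch; LockedCounter is an invented helper, not part of the standard library.

```python
import threading

class LockedCounter:
    """Counter whose increment is safe to call from many threads."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        with self._lock:        # only one thread at a time does the read-modify-write
            self.value += 1

counter = LockedCounter()

def increment_many():
    for _ in range(100_000):
        counter.increment()

threads = [threading.Thread(target=increment_many) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(f"Expected: 500000, Got: {counter.value}")
```

With the lock in place the result no longer depends on when the interpreter happens to switch threads.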

# Watch context switches per second
vmstat 1 5      # "cs" column shows context switches per interval

# Or use perf
perf stat -e context-switches ls /tmp

🔬 Going Deeper

How Can Go Launch a Million Goroutines?

Go uses an M:N scheduling model: many goroutines are multiplexed onto a small number of kernel threads. Go's runtime ships its own user-space scheduler based on the GPM model: G for goroutine, P for logical processor, M for OS thread. The number of Ps defaults to the number of CPU cores, and it caps how many goroutines run Go code at the same instant. A new goroutine starts with only a 2 KB stack (vs. the 8 MB of address space typically reserved for an OS thread's stack on Linux) that grows dynamically as needed. Switching between goroutines happens in user space without a syscall, which makes one million concurrent goroutines a realistic workload on a machine with a few gigabytes of memory.
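
The stack-size gap is easier to appreciate as arithmetic. This sketch takes the 2 KB and 8 MB figures above at face value and treats them as binary units. Keep in mind that the 8 MB per OS thread is reserved address space, so it is an upper bound rather than memory actually in use.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB

def total_stack_bytes(count, stack_bytes):
    """Total stack space needed for `count` tasks of `stack_bytes` each."""
    return count * stack_bytes

goroutines = total_stack_bytes(1_000_000, 2 * KIB)
os_threads = total_stack_bytes(1_000_000, 8 * MIB)

print(f"goroutine stacks: {goroutines / GIB:.1f} GiB")
print(f"OS thread stacks: {os_threads / GIB:.1f} GiB")
print(f"ratio: {os_threads // goroutines}x")
```

A factor of 4096 is the difference between "fits on a laptop" and "does not fit anywhere".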

False Sharing: The Hidden Multi-Thread Performance Trap

Even perfectly correct multi-threaded code can run slower with more cores due to false sharing. Suppose two threads each update a different variable, but both variables happen to land in the same 64-byte CPU cache line. Every write by thread A invalidates the copy of that line held by thread B's core, forcing B to fetch the line again through the cache-coherence protocol, even though neither thread touched the other's variable. The fix is to pad or align each variable so that it sits on its own cache line. This is a particularly insidious performance bug because the program produces correct results yet scales worse as you add cores.
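
The padding fix is about memory layout, and the layout can be sketched with ctypes. Treat this as an illustration of the idea only: interpreter overhead in Python swamps the cache effect, so you would measure the real difference in C, Rust, Go, or Java. CACHE_LINE and PaddedCounter are names invented here, and 64 bytes is the line size assumed in the text.

```python
import ctypes

CACHE_LINE = 64   # bytes; assumed line size, check your CPU

class PaddedCounter(ctypes.Structure):
    # One 8-byte counter followed by filler so the struct spans a full cache line.
    _fields_ = [
        ("value", ctypes.c_int64),
        ("_pad", ctypes.c_char * (CACHE_LINE - ctypes.sizeof(ctypes.c_int64))),
    ]

padded = (PaddedCounter * 2)()     # two counters, one per thread

packed_gap = ctypes.sizeof(ctypes.c_int64)   # spacing in a plain int64 array
padded_gap = ctypes.addressof(padded[1]) - ctypes.addressof(padded[0])

print(f"adjacent int64 counters are {packed_gap} bytes apart")
print(f"padded counters are {padded_gap} bytes apart")
```

Two counters 8 bytes apart will usually share a cache line; two counters 64 bytes apart never can.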

Recommended Reading:

Java Concurrency in Practice by Brian Goetz et al.: Despite the Java title, the foundational concepts (happens-before ordering, visibility, atomicity) apply to every language. Chapters 2 and 3 alone are worth the cover price.
Operating Systems: Three Easy Pieces (OSTEP): The concurrency section walks through locks, condition variables, and semaphores with careful worked examples.
The blog post The Go scheduler (2013, Daniel Morsing), which explains the scheduler design proposed by Dmitry Vyukov: A few thousand words that will permanently change how you think about lightweight scheduling.
