Threads and Concurrency
Imagine a restaurant kitchen with a single chef. Every dish must wait until the previous one is finished — hopelessly slow. Now hire five chefs who share the same kitchen: same stoves, same pantry, same utensils. Each chef has their own cutting board and their own pace. That shared-kitchen arrangement is exactly what multithreading looks like.
A process is like a whole company. A thread is like an employee inside it. Everyone in the company shares the office, the filing server, and the accounting system — but each person has their own task list and working memory. The benefit of sharing is low communication overhead (pass data directly through shared memory). The risk is that one careless employee can corrupt a shared file and take everyone down with them.
Core Concepts
Process vs. Thread: What Is Shared, What Is Not
┌─────────────────────────────────────────────────────────┐
│                  Process Address Space                  │
│                                                         │
│  ┌───────────────────────────────────────────────────┐  │
│  │               Shared by All Threads               │  │
│  │  .text (code)       .data (globals)   Heap        │  │
│  │  File descriptors   Signal table      Page tables │  │
│  └───────────────────────────────────────────────────┘  │
│                                                         │
│      Thread 1          Thread 2          Thread 3       │
│    ┌──────────┐      ┌──────────┐      ┌──────────┐     │
│    │Own Stack │      │Own Stack │      │Own Stack │     │
│    │Thread ID │      │Thread ID │      │Thread ID │     │
│    │Registers │      │Registers │      │Registers │     │
│    │Prog Ctr  │      │Prog Ctr  │      │Prog Ctr  │     │
│    └──────────┘      └──────────┘      └──────────┘     │
└─────────────────────────────────────────────────────────┘
Each thread owns privately:
- Stack — local variables and call frames, invisible to other threads
- Program Counter (PC) — which instruction is currently executing
- Register state — the CPU snapshot saved during a context switch
All threads share:
- Code segment (all threads execute the same code)
- Heap (memory from malloc is accessible to any thread)
- Global and static variables
- Open file handles and network connections
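You can observe the private/shared split from Python. This is a minimal sketch, not a definitive demonstration: the names `shared_results`, `results_lock`, and `worker` are made up for illustration. Each thread accumulates into its own local variable, then publishes the result into one dictionary that every thread can see.

```python
import threading

shared_results = {}              # one dict on the heap, visible to every thread
results_lock = threading.Lock()  # guards writes to the shared dict

def worker(name, n):
    local_total = 0              # local variable: private to this thread's call frame
    for i in range(n):
        local_total += i
    with results_lock:
        shared_results[name] = local_total   # publish through shared memory

threads = [
    threading.Thread(target=worker, args=(f"t{i}", 10 * (i + 1)))
    for i in range(3)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(shared_results)
```

No thread can reach another thread's `local_total`, but all three write into the same `shared_results` object without any copying or message passing.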
Concurrency vs. Parallelism: Appearing Simultaneous vs. Actually Simultaneous
These two words are routinely conflated. They mean different things:
Concurrency: one CPU rapidly switches between tasks (illusion of simultaneity)
Timeline:
CPU: ─A─A─B─B─A─A─C─C─B─B─C─C─►
Tasks A, B, C take turns. Only one runs at any instant.
Parallelism: multiple CPUs truly running at the same time
Timeline:
Core 0: ──A──A──A──A──►
Core 1: ──B──B──B──B──►
Core 2: ──C──C──C──C──►
A, B, C execute simultaneously on different cores.
If your Python program has 4 threads but you only have 1 physical CPU core, you have concurrency, not parallelism. Concurrency is a design strategy (how you structure your code). Parallelism is a hardware execution reality (how many cores are actually working at once). A well-designed concurrent program can gain parallelism when run on a multi-core machine, provided the runtime lets its threads execute on several cores at once.
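Here is a small sketch of concurrency paying off without any parallelism. It assumes the work is wait-dominated, and simulates that with `time.sleep`; the helper name `fake_io` is hypothetical. Four waits of 0.2 s overlap, so the batch finishes in far less than the 0.8 s a sequential run would need, even on a single core.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_io(task_id, delay=0.2):
    time.sleep(delay)            # stands in for a blocking I/O call; the thread just waits
    return task_id

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    finished = list(pool.map(fake_io, range(4)))   # results come back in input order
elapsed = time.perf_counter() - start

print(f"4 tasks of 0.2s each finished in {elapsed:.2f}s")
```

The speedup here comes from overlapping waits, not from using more cores. CPU-bound work would not benefit the same way.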
The Cost of a Context Switch
Thread switching is not free. Every time the OS switches threads, it must:
- Save the current thread's registers (general-purpose registers + PC + status flags) to memory
- Restore the next thread's registers from memory
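- If switching to a different process, also switch the address space (page tables), which typically flushes TLB (translation lookaside buffer) entries unless the CPU tags them per address space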
One context switch costs roughly 1–10 microseconds. That sounds tiny, but with 1,000 threads context-switching constantly, the overhead itself becomes the bottleneck. This is one of the reasons coroutines became so popular.
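To see how quickly that adds up, here is a back-of-the-envelope estimate. The 5 microsecond figure is an assumed midpoint of the 1 to 10 microsecond range above, and the switch rates are hypothetical. It counts only the direct cost of switching and ignores secondary effects such as cold caches.

```python
def switch_overhead_fraction(switches_per_second, cost_us):
    """Fraction of one core's time spent purely on context switching."""
    return switches_per_second * cost_us / 1_000_000

# Hypothetical switch rates, each at an assumed 5 microseconds per switch
for rate in (1_000, 20_000, 100_000):
    frac = switch_overhead_fraction(rate, cost_us=5)
    print(f"{rate:>7} switches/s at 5 us each -> {frac:.1%} of one core")
```

At 100,000 switches per second, half of one core's time goes to switching rather than to useful work.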
User-Mode Threads vs. Kernel-Mode Threads
User-mode threads (M:1 model)
Many user threads mapped to one kernel thread
┌──────────────────────────────┐
│ Thread library (user sched)  │
│ Thread A  Thread B  Thread C │
└──────────────┬───────────────┘
               │ (just 1)
┌──────────────▼───────────────┐
│       OS kernel thread       │
└──────────────────────────────┘
Pro: switching is fast (no syscall)
Con: one blocked thread blocks all
Kernel-mode threads (1:1 model — what Linux uses)
Thread A ──► Kernel thread 1 ──► Core 0
Thread B ──► Kernel thread 2 ──► Core 1
Thread C ──► Kernel thread 3 ──► Core 0/1
Pro: true parallelism; one blocked thread doesn't affect others
Con: creation and switching require syscalls
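The 1:1 mapping is visible from Python, because CPython backs each `threading.Thread` with one OS thread. The sketch below records the kernel-assigned ID of each worker using `threading.get_native_id()` (Python 3.8+). The barrier is only there to keep all three threads alive at once, so the kernel cannot reuse an ID. The names `native_ids` and `record_id` are illustrative.

```python
import threading

native_ids = []
ids_lock = threading.Lock()
barrier = threading.Barrier(3)   # hold all three threads alive at the same moment

def record_id():
    with ids_lock:
        native_ids.append(threading.get_native_id())   # ID assigned by the kernel
    barrier.wait()               # do not exit until every thread has recorded its ID

threads = [threading.Thread(target=record_id) for _ in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print("main thread:", threading.get_native_id())
print("workers:    ", native_ids)
```

Each worker reports a distinct kernel thread ID. On Linux these are the same IDs that `ps -L` lists for the process.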
Coroutines: The Lightweight Alternative
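A coroutine is a unit of execution that yields control voluntarily. Switching happens in user space, with no kernel involvement and no system call, so it avoids the cost of a kernel context switch: switching between coroutines typically costs nanoseconds rather than microseconds. Python's async/await and Rust's async are built on this idea. Go's goroutines are a close relative: they are scheduled in user space by the Go runtime, which can also preempt them.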
import asyncio

async def boil_water():
    print("Boiling water...")
    await asyncio.sleep(3)  # yield control; don't block the loop
    print("Water ready, steeping tea")

async def cut_fruit():
    print("Slicing apple...")
    await asyncio.sleep(1)
    print("Apple sliced")

async def main():
    await asyncio.gather(boil_water(), cut_fruit())

asyncio.run(main())
# Both tasks run "concurrently"; total time is 3s, not 4s
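To make "yields control voluntarily" concrete, here is a toy round-robin scheduler built on plain generators. It is a teaching sketch and not how asyncio is implemented; `task` and `run_round_robin` are hypothetical names. Each task runs until it hits `yield`, and only then does the scheduler get to pick the next one.

```python
from collections import deque

def task(name, steps, log):
    for i in range(steps):
        log.append(f"{name}{i}")
        yield                    # hand control back to the scheduler voluntarily

def run_round_robin(tasks):
    queue = deque(tasks)
    while queue:
        current = queue.popleft()
        try:
            next(current)        # resume the task until its next yield
            queue.append(current)
        except StopIteration:
            pass                 # task finished; drop it from the queue

log = []
run_round_robin([task("A", 2, log), task("B", 3, log)])
print(log)  # ['A0', 'B0', 'A1', 'B1', 'B2']
```

Every switch is an ordinary function return inside one thread, so the kernel is never involved. The flip side is that a task which never yields starves all the others.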
Why Python Has the GIL
Python's GIL (Global Interpreter Lock) ensures that only one thread executes Python bytecode at any given moment. Why?
CPython uses reference counting for garbage collection. Every object carries a counter: it increments when something references the object, and decrements when the reference is dropped. When the count reaches zero, memory is freed. If two threads modify the same object's reference count at the same time, an update can be lost and the counter becomes corrupted: a count that is too high leaks the object, and one that is too low frees it while it is still in use, causing a crash. The GIL is the blunt-but-effective solution: one big lock prevents concurrent reference-count updates.
The price: Python threads cannot exploit multiple cores for CPU-bound work. The workarounds are multiprocessing (separate processes, each with its own GIL) or C extensions like NumPy (which release the GIL during heavy computation). Python 3.12 added a per-interpreter GIL for subinterpreters (PEP 684), and Python 3.13 introduced an experimental free-threaded build (PEP 703) in which the GIL can be disabled entirely.
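You can watch the reference counter that the GIL protects with `sys.getrefcount`. This is a small sketch assuming a standard CPython build. The absolute numbers vary, because the call itself temporarily holds one extra reference, so only the differences matter.

```python
import sys

obj = object()                      # a fresh object
before = sys.getrefcount(obj)       # includes a temporary reference held by the call

alias = obj                         # bind a second name: the count goes up by one
after_alias = sys.getrefcount(obj)

del alias                           # drop that reference: the count goes back down
after_del = sys.getrefcount(obj)

print(before, after_alias, after_del)
```

Every assignment and `del` touches this counter, which is why unsynchronized updates from two threads would be so dangerous.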
Hands-On Verification
# Count the threads of a running Python process
ps -Lp $(pgrep -n python3)
# Or read it from /proc
grep Threads /proc/$(pgrep -n python3)/status
# Demonstrate shared heap — and a hidden race condition
import threading

counter = 0  # global, shared by all threads

def increment():
    global counter
    for _ in range(100_000):
        counter += 1  # NOT atomic: a read-modify-write, so a potential race

threads = [threading.Thread(target=increment) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"Expected: 500000, Got: {counter}")
# The GIL often masks this race in recent CPython versions, but it does not
# make += atomic; in C or Java without locks you would regularly see wrong results
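The usual fix for the race above is to make the read-modify-write a critical section. Here is a minimal sketch using `threading.Lock`; the names `counter_lock` and `safe_increment` are illustrative. Only one thread at a time can hold the lock, so no increment can be lost.

```python
import threading

counter = 0
counter_lock = threading.Lock()

def safe_increment(n=100_000):
    global counter
    for _ in range(n):
        with counter_lock:       # load, add, and store happen as one critical section
            counter += 1

threads = [threading.Thread(target=safe_increment) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(f"Expected: 500000, Got: {counter}")
```

Taking the lock on every iteration is slow. A common refinement is to accumulate into a local variable and take the lock once per thread at the end.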
# Watch context switches per second
vmstat 1 5 # "cs" column shows context switches per second
# Or use perf
perf stat -e context-switches ls /tmp
🔬 Going Deeper
How Can Go Launch a Million Goroutines?
Go uses an M:N scheduling model: M goroutines are multiplexed onto N kernel threads (N ≈ number of CPU cores). Go's runtime ships its own user-space scheduler based on the GPM model: G for goroutine, P for logical processor, M for OS thread. A new goroutine starts with only a 2 KB stack (vs. the 8 MB that Linux reserves by default for an OS thread's stack) that grows dynamically as needed. Switching between goroutines happens in user space without a syscall, making one million concurrent goroutines a completely realistic workload.
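The stack sizes alone explain most of the difference. Here is the arithmetic as a short sketch, using the 2 KB and 8 MB figures quoted above and treating them as binary units. Note that an OS thread's stack is usually a virtual-memory reservation, not memory committed up front, so the thread figure is an upper bound on address space rather than on RAM.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB

def total_stack_bytes(count, stack_bytes):
    """Total stack space needed if every unit gets stack_bytes."""
    return count * stack_bytes

n = 1_000_000
goroutine_total = total_stack_bytes(n, 2 * KIB)   # initial goroutine stacks
thread_total = total_stack_bytes(n, 8 * MIB)      # default-sized OS thread stacks

print(f"goroutines: {goroutine_total / GIB:.1f} GiB")
print(f"OS threads: {thread_total / GIB:.1f} GiB")
print(f"ratio:      {thread_total // goroutine_total}x")
```

A million initial goroutine stacks fit in about 2 GiB, while a million default-sized thread stacks would need several terabytes of address space, a 4096-fold difference.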
False Sharing: The Hidden Multi-Thread Performance Trap
Even perfectly correct multi-threaded code can run slower with more cores due to false sharing. Suppose two threads each update a different variable, but both variables happen to land in the same 64-byte CPU cache line. Every write by thread A invalidates thread B's copy of that cache line, forcing B to fetch it again through the cache-coherence protocol (from another core's cache or from memory), even though neither thread touched the other's variable. The fix is padding each variable to a separate cache line. This is a particularly insidious performance bug because the program produces correct results yet scales worse as you add cores.
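The padding fix is easiest to see as a memory layout. The sketch below uses `ctypes` structures to show the layout only: Python threads will not reproduce the slowdown itself. It assumes a 64-byte cache line and a structure that starts on a line boundary. `Unpadded`, `Padded`, and `same_line` are hypothetical names.

```python
import ctypes

CACHE_LINE = 64   # assumed line size in bytes; typical for x86-64, not universal

class Unpadded(ctypes.Structure):
    _fields_ = [("a", ctypes.c_int64), ("b", ctypes.c_int64)]

class Padded(ctypes.Structure):
    _fields_ = [
        ("a", ctypes.c_int64),
        ("_pad", ctypes.c_char * (CACHE_LINE - 8)),   # filler so b starts a new line
        ("b", ctypes.c_int64),
    ]

def same_line(struct, first, second):
    """True if two fields share a cache line, assuming the struct starts on a line boundary."""
    off_a = getattr(struct, first).offset
    off_b = getattr(struct, second).offset
    return off_a // CACHE_LINE == off_b // CACHE_LINE

print(same_line(Unpadded, "a", "b"))   # True: offsets 0 and 8 share one line
print(same_line(Padded, "a", "b"))     # False: b now sits at offset 64
```

The cost of the fix is memory: the padded structure spends 56 bytes on filler to keep the two fields apart.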
Recommended Reading:
- Java Concurrency in Practice by Brian Goetz: Despite the Java title, the foundational concepts (happens-before ordering, visibility, atomicity) apply to every language. Chapters 2 and 3 alone are worth the cover price.
- Operating Systems: Three Easy Pieces (OSTEP) — The concurrency section walks through locks, condition variables, and semaphores with careful worked examples.
- The Go blog post The Go scheduler (2013, Dmitry Vyukov) — A few thousand words that will permanently change how you think about lightweight scheduling.