Chapter 14: The GMP Scheduler — How Goroutines Run
The moment you write go func(){}(), the Go runtime faces a fundamental question: who executes this function? On which CPU core? When? This is not a simple thread creation — Go must efficiently schedule tens of thousands of goroutines onto a limited number of operating system threads. The mechanism that solves this problem is the GMP scheduler.
The GMP scheduler is the foundation of Go's concurrency capabilities. Understanding it reveals why goroutines are lighter than threads, why Go can effortlessly handle millions of concurrent tasks, why some goroutines can "starve," and why setting GOMAXPROCS correctly matters so much.
This chapter starts from intuition and progressively descends into source-level implementation and design decisions, building a complete mental model of the scheduler.
Level 1: What You Need to Know
1.1 Why Goroutines Are Lighter Than Threads
Every Go beginner hears the mantra: "goroutines are lightweight." But lightweight in what way exactly? Let's speak in concrete numbers.
Stack Size Comparison:
| Dimension | OS Thread | Goroutine |
|---|---|---|
| Initial stack size | 1-8 MB (typically 8 MB on Linux) | 2 KB (Go 1.4+) |
| Stack growth | Fixed size, allocated at creation | Dynamic, grows on demand up to 1 GB |
| Creation cost | ~1-10 μs (involves syscall) | ~0.3 μs (pure userspace) |
| Context switch | ~1-10 μs (kernel trap) | ~0.2 μs (userspace switch) |
| Memory for 10,000 | ~80 GB of stack reservations at 8 MB each (impractical) | ~20 MB (trivial) |
Three key technical differences drive these numbers:
First, goroutines use dynamically growing stacks. An OS thread's stack size is fixed at creation because the kernel cannot safely relocate stack frames at runtime. A goroutine's stack is managed by the Go runtime — starting at just 2 KB. When function call depth increases and the runtime detects insufficient stack space, it allocates a new stack twice the size, copies the old stack contents over, and adjusts all pointers to the old stack. This is the "stack copying" mechanism introduced in Go 1.3, replacing the earlier "segmented stack" approach.
// Estimating the per-goroutine memory cost
package main

import (
	"fmt"
	"runtime"
)

func main() {
	const n = 100000

	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	before := m.Sys

	for i := 0; i < n; i++ {
		go func() {
			select {} // block forever to keep the goroutine alive
		}()
	}

	runtime.ReadMemStats(&m)
	after := m.Sys

	fmt.Printf("Created %d goroutines\n", n)
	fmt.Printf("Memory growth: %.2f MB\n", float64(after-before)/1024/1024)
	// The figure includes the 2 KB stack plus the g struct and allocator overhead
	fmt.Printf("Per goroutine: ~%.2f KB\n", float64(after-before)/float64(n)/1024)
}
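To see the dynamic stack growth itself, the following minimal sketch recurses far deeper than a 2 KB stack could hold. The function name recurse and the 64-byte padding are illustrative choices, not runtime names.

```go
package main

import "fmt"

// recurse descends n levels and returns n. Each frame carries a small
// buffer, so 100,000 frames need far more than the initial 2 KB stack.
func recurse(n int) int {
	var pad [64]byte // keeps each frame non-trivial
	pad[0] = byte(n)
	if n == 0 {
		return 0
	}
	return recurse(n-1) + 1 + int(pad[0]-byte(n)) // the last term is always 0
}

func main() {
	done := make(chan int)
	// Run in a fresh goroutine so the recursion starts on a small stack.
	go func() { done <- recurse(100_000) }()
	fmt.Println("depth reached:", <-done)
}
```

The program finishes normally because the runtime keeps copying the goroutine's stack into larger allocations as the recursion deepens.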
Second, goroutine switches don't require kernel traps. An OS thread context switch transitions from user mode to kernel mode (via syscall or interrupt), saves/restores the full CPU register set (including floating-point, SSE/AVX registers), and may flush TLB entries. A goroutine switch happens entirely in userspace, saving only a handful of registers (SP, PC, and a few callee-saved registers) — roughly 40-50 bytes of state total.
Third, goroutine creation requires no system call. Creating an OS thread requires the clone() syscall (Linux) or CreateThread() (Windows), meaning a kernel trap, allocation of kernel data structures, TLS setup, and more. Creating a goroutine only needs to grab a g struct from the free pool (or malloc one) in userspace, set up its stack and entry function, and place it on a run queue — no syscall anywhere in the path.
1.2 Intuitive Understanding of G/M/P
The GMP model can be understood through a factory analogy:
┌──────────────────────────────────────────────────────────┐
│ Factory (Go Process) │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Station P0│ │ Station P1│ │ Station P2│ ... │
│ │ │ │ │ │ │ │
│ │ Local Queue│ │ Local Queue│ │ Local Queue│ │
│ │ [G][G][G]│ │ [G][G] │ │ [G] │ │
│ │ │ │ │ │ │ │
│ │ Worker M0 │ │ Worker M1 │ │ Worker M2 │ │
│ │ (active) │ │ (active) │ │ (active) │ │
│ └──────────┘ └──────────┘ └──────────┘ │
│ │
│ Global Task Queue: [G][G][G][G][G]... │
│ │
│ Break Room (idle workers): M3, M4, M5 ... │
└──────────────────────────────────────────────────────────┘
G (Goroutine) = Work Order. Each G represents a function to be executed, containing the function entry point, arguments, stack space, and current execution state. It is the smallest schedulable unit. Corresponds to runtime.g in the runtime.
M (Machine) = Worker. Each M maps to one OS thread. M is the entity that actually executes code — the CPU only understands threads, not goroutines. M picks a G from P's queue and runs it. Corresponds to runtime.m in the runtime.
P (Processor) = Workstation. P is a logical concept representing "the resources needed to execute Go code." Each P owns a local run queue, a memory cache (mcache), and other scheduling state. An M must hold a P to execute goroutines. The number of Ps is determined by GOMAXPROCS. Corresponds to runtime.p in the runtime.
Why is P necessary? You might ask: since M is the thread and G is the task, why not have M pull G directly from a global queue? The answer is performance. If all Ms pulled from a single global queue, that queue would need a mutex, and under high concurrency the lock contention would severely degrade performance. With P, each M binds to a P and preferentially dequeues from P's local queue (a lock-free operation). Only when the local queue is empty does it go to the global queue or steal from other Ps.
// Inspecting current GMP state
package main
import (
"fmt"
"runtime"
)
func main() {
fmt.Printf("GOMAXPROCS (number of Ps): %d\n", runtime.GOMAXPROCS(0))
fmt.Printf("NumCPU (CPU cores): %d\n", runtime.NumCPU())
fmt.Printf("NumGoroutine (active Gs): %d\n", runtime.NumGoroutine())
}
1.3 GOMAXPROCS: Meaning and Configuration
GOMAXPROCS determines the maximum number of threads simultaneously executing Go code — i.e., the number of Ps. Note the keyword "simultaneously" — this is not a limit on the number of goroutines, nor on the number of threads, but a limit on parallelism.
Default value: Since Go 1.5, GOMAXPROCS defaults to the number of CPU cores. Before that, it defaulted to 1, meaning all goroutines could only be concurrent but never parallel.
Setting methods:
// Method 1: Environment variable
// GOMAXPROCS=4 go run main.go
// Method 2: In code
import "runtime"
func init() {
runtime.GOMAXPROCS(4) // returns the previous value
}
// Method 3: Query current value
current := runtime.GOMAXPROCS(0) // passing 0 queries without modifying
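Because runtime.GOMAXPROCS returns the previous value, a temporary override can be restored with a deferred call. The sketch below assumes a hypothetical helper named withProcs; it is one way to scope the setting, for example in a benchmark.

```go
package main

import (
	"fmt"
	"runtime"
)

// withProcs runs fn while GOMAXPROCS is set to n, then restores the
// previous value. Passing n = 0 leaves the setting unchanged.
func withProcs(n int, fn func(current int)) {
	prev := runtime.GOMAXPROCS(n)
	defer runtime.GOMAXPROCS(prev)
	fn(runtime.GOMAXPROCS(0))
}

func main() {
	withProcs(2, func(cur int) {
		fmt.Println("inside:", cur)
	})
	withProcs(0, func(cur int) {
		fmt.Println("restored to:", cur)
	})
}
```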
Common misconceptions:
- Misconception: Higher GOMAXPROCS is always better. For CPU-bound tasks, setting it to the number of CPU cores is optimal. Exceeding the core count only adds scheduling overhead and cache misses. For I/O-bound tasks, you can increase it slightly, but the default usually suffices.
- Misconception: GOMAXPROCS limits thread count. It doesn't. The number of Ms (OS threads) can far exceed the number of Ps. When goroutines block on syscalls, the runtime creates new Ms to keep Ps busy. The default maximum thread count is 10,000 (adjustable via runtime/debug.SetMaxThreads).
- Misconception: GOMAXPROCS auto-adapts in containers. Before Go 1.25, the runtime reads the host machine's CPU core count, not the container's CPU quota. In Kubernetes, a Pod limited to 2 cores running on a 64-core host will have GOMAXPROCS=64, causing severe scheduling overhead. The solution on those versions is the uber-go/automaxprocs library. (Go 1.25 and later take the cgroup CPU limit into account on Linux by default.)
// Recommended approach for containerized environments
import _ "go.uber.org/automaxprocs" // auto-sets based on CFS quota
func main() {
// GOMAXPROCS is now automatically set to container CPU limit
}
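The idea behind quota-based sizing can be sketched as simple arithmetic on the CFS quota and period. The helper procsFromQuota is hypothetical, and rounding down is a policy chosen here for illustration; a real library or runtime may round differently.

```go
package main

import "fmt"

// procsFromQuota turns a CFS quota/period pair (both in microseconds)
// into a GOMAXPROCS candidate. A non-positive quota means "no limit".
func procsFromQuota(quotaUs, periodUs int64, hostCPUs int) int {
	if quotaUs <= 0 || periodUs <= 0 {
		return hostCPUs
	}
	n := int(quotaUs / periodUs) // round down (assumption of this sketch)
	if n < 1 {
		n = 1 // never go below one P
	}
	if n > hostCPUs {
		n = hostCPUs
	}
	return n
}

func main() {
	fmt.Println(procsFromQuota(200_000, 100_000, 64)) // 2-core limit on a 64-core host
	fmt.Println(procsFromQuota(50_000, 100_000, 64))  // half a core still gets one P
	fmt.Println(procsFromQuota(-1, 100_000, 64))      // no limit: use the host count
}
```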
1.4 Goroutine Lifecycle
A goroutine transitions through these states from creation to termination:
┌─────────────────────────────────────┐
│ ▼
Create(_Gidle) ──→ Runnable(_Grunnable) ──→ Running(_Grunning) ──→ Dead(_Gdead)
▲ │
│ ▼
└──── Wake ◄──── Waiting(_Gwaiting)
│
▼
Syscall(_Gsyscall)
State details:
| State | Meaning | Typical trigger |
|---|---|---|
| _Gidle | Just allocated, uninitialized | runtime.newproc allocates G struct |
| _Grunnable | Ready, waiting to be scheduled | Creation complete / woken from blocking |
| _Grunning | Executing on some M/P | Selected by scheduler |
| _Gwaiting | Blocked waiting for an event | Channel ops / select / time.Sleep |
| _Gsyscall | Executing a system call | File I/O / network (non-netpoller) |
| _Gdead | Finished or unused | Function return / panic |
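The lifecycle can be modeled as a small transition table. This is a sketch of the states as this chapter describes them, with hypothetical names (gStatus, canTransition); the runtime's real state machine has more states and intermediate steps.

```go
package main

import "fmt"

type gStatus int

const (
	gIdle gStatus = iota
	gRunnable
	gRunning
	gWaiting
	gSyscall
	gDead
)

// transitions lists the moves described in this chapter.
var transitions = map[gStatus][]gStatus{
	gIdle:     {gRunnable},                            // creation completes
	gRunnable: {gRunning},                             // picked by the scheduler
	gRunning:  {gWaiting, gSyscall, gRunnable, gDead}, // block, syscall, preempt, return
	gWaiting:  {gRunnable},                            // woken up
	gSyscall:  {gRunning, gRunnable},                  // P reacquired, or queued globally
	gDead:     {gRunnable},                            // reused from the free list
}

// canTransition reports whether the model allows moving from one state to another.
func canTransition(from, to gStatus) bool {
	for _, next := range transitions[from] {
		if next == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println("running -> waiting:", canTransition(gRunning, gWaiting))
	fmt.Println("waiting -> running:", canTransition(gWaiting, gRunning))
}
```

A woken goroutine never resumes directly: it first becomes runnable and waits its turn in a run queue.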
Creation process:
When you write go f(args), the compiler transforms it into a call to runtime.newproc:
// Compiler transformation:
// go f(x, y) → runtime.newproc(f, x, y)
// Simplified creation flow:
// 1. Get an idle G from current P's gFree list (reuse), or malloc a new one
// 2. Set up G's stack, entry function (fn), and arguments
// 3. Set G's status to _Grunnable
// 4. Place G in current P's runnext slot; the G it displaces (if any) moves to the tail of the local run queue
// 5. If there's an idle P and no spinning M, wake one M
Blocking and waking:
Goroutine blocking differs fundamentally from thread blocking. When a goroutine blocks on a channel operation:
- G's state becomes _Gwaiting
- G is detached from M and placed into the channel's wait queue
- M does NOT block — it immediately picks the next G from P's queue
- When the other end of the channel is ready, the blocked G is placed back onto some P's run queue
This is the core advantage of goroutines: G blocking does not waste M (thread) resources.
// Observing goroutine state transitions
package main
import (
"fmt"
"runtime"
"time"
)
func main() {
fmt.Printf("Goroutines at start: %d\n", runtime.NumGoroutine())
ch := make(chan struct{})
go func() {
// State: _Grunnable → _Grunning
fmt.Println("goroutine running")
<-ch // State: _Grunning → _Gwaiting (blocked on channel)
fmt.Println("goroutine woken")
// After return: _Grunning → _Gdead
}()
time.Sleep(100 * time.Millisecond)
fmt.Printf("Goroutines while blocked: %d\n", runtime.NumGoroutine())
ch <- struct{}{} // Wake: G state _Gwaiting → _Grunnable → _Grunning
time.Sleep(100 * time.Millisecond)
fmt.Printf("Goroutines after completion: %d\n", runtime.NumGoroutine())
}
System call scenario:
When a goroutine enters a system call (e.g., file I/O), the situation differs:
- G's state becomes _Gsyscall
- M is blocked in the kernel (unavoidable: the kernel doesn't provide universal async file I/O)
- P is unbound from M (handoff) and bound to another idle M (or a new M is created)
- When the syscall returns, M attempts to reacquire its previous P; if that P is taken, M tries any idle P; if none is available, M puts G into the global queue and goes to sleep
This guarantees that even if goroutines are stuck in long syscalls, scheduling of other goroutines is unaffected.
1.5 Common Errors and Fixes
Error 1: Goroutine leak
// Bug: nobody sends to ch, goroutine blocks forever
func leak() {
ch := make(chan int)
go func() {
val := <-ch // blocks forever, goroutine cannot be GCed
fmt.Println(val)
}()
// Function returns, ch is unreachable, but goroutine is still waiting
}
// Fix 1: Use context to control lifecycle
func noLeak(ctx context.Context) {
ch := make(chan int)
go func() {
select {
case val := <-ch:
fmt.Println(val)
case <-ctx.Done():
return // exit on timeout or cancellation
}
}()
}
// Fix 2: When the goroutine at risk is a sender, use a buffered channel
func noLeak2() {
ch := make(chan int, 1) // sender won't block even if nobody reads
go func() {
ch <- 42
}()
}
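One way to check that a fix really removes the leak is to compare runtime.NumGoroutine before and after cancellation. The sketch below assumes hypothetical helpers (startWorker, settlesTo, leakFree) and polls briefly, since the goroutine exits asynchronously.

```go
package main

import (
	"context"
	"fmt"
	"runtime"
	"time"
)

// startWorker launches a goroutine that waits on ch until ctx is cancelled.
func startWorker(ctx context.Context, ch <-chan int) {
	go func() {
		select {
		case v := <-ch:
			fmt.Println(v)
		case <-ctx.Done():
		}
	}()
}

// settlesTo polls NumGoroutine until it drops to want or the timeout passes.
func settlesTo(want int, timeout time.Duration) bool {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if runtime.NumGoroutine() <= want {
			return true
		}
		time.Sleep(time.Millisecond)
	}
	return false
}

// leakFree reports whether the worker exits once its context is cancelled.
func leakFree() bool {
	base := runtime.NumGoroutine()
	ctx, cancel := context.WithCancel(context.Background())
	startWorker(ctx, make(chan int)) // nobody ever sends on this channel
	cancel()
	return settlesTo(base, time.Second)
}

func main() {
	fmt.Println("worker exited after cancel:", leakFree())
}
```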
Error 2: Ignoring GOMAXPROCS impact on CPU-bound tasks
// Running CPU-intensive computation on a 4-core machine
// If GOMAXPROCS=1, these goroutines are concurrent but NOT parallel
func compute() {
var wg sync.WaitGroup
for i := 0; i < 4; i++ {
wg.Add(1)
go func() {
defer wg.Done()
sum := 0
for j := 0; j < 1_000_000_000; j++ {
sum += j
}
}()
}
wg.Wait()
}
// Illustrative timings (actual numbers depend on hardware and compiler optimizations):
// GOMAXPROCS=1: ~12s
// GOMAXPROCS=4: ~3s (near-linear speedup)
Level 2: How It Works Under the Hood
2.1 The Complete Scheduling Workflow
To understand the full workflow, we need to examine runtime.schedule(). This is the scheduler's core loop — every M enters this function when it needs a new G to execute.
schedule() execution flow:
┌────────────────────────────────────────────────────────┐
│ schedule() │
│ │ │
│ ├─ 1. If current G is locked to M, handle LockOSThread│
│ │ │
│ ├─ 2. Find a runnable G (findRunnable) │
│ │ ├─ Check local run queue │
│ │ ├─ Check global run queue │
│ │ ├─ Check netpoller │
│ │ ├─ Try work stealing │
│ │ └─ Nothing found? Block and wait │
│ │ │
│ ├─ 3. execute(gp) — switch to target G's context │
│ │ │
│ └─ 4. G completes / blocks / preempted → back to │
│ schedule() │
└────────────────────────────────────────────────────────┘
The specific lookup order in findRunnable (from runtime/proc.go):
// Simplified findRunnable logic (Go 1.22)
func findRunnable() (gp *g, inheritTime, tryWakeP bool) {
pp := getg().m.p.ptr()
// 1. Every 61st schedule tick, check global queue (prevent starvation)
if pp.schedtick%61 == 0 && sched.runqsize > 0 {
lock(&sched.lock)
gp := globrunqget(pp, 1) // grab 1 from global queue
unlock(&sched.lock)
if gp != nil {
return gp, false, false
}
}
// 2. Get from local run queue
if gp, inheritTime := runqget(pp); gp != nil {
return gp, inheritTime, false
}
// 3. Get from global run queue
if sched.runqsize != 0 {
lock(&sched.lock)
gp := globrunqget(pp, 0)
unlock(&sched.lock)
if gp != nil {
return gp, false, false
}
}
// 4. Get ready network I/O goroutines from netpoller
if netpollinited() && netpollAnyWaiters() && sched.lastpoll.Load() != 0 {
if list, delta := netpoll(0); !list.empty() {
gp := list.pop()
injectglist(&list) // put remainder in local/global queue
return gp, false, false
}
}
// 5. Try stealing from other Ps (Work Stealing)
for i := 0; i < 4; i++ { // up to 4 rounds
for enum := stealOrder.start(fastrand()); !enum.done(); enum.next() {
p2 := allp[enum.position()]
if gp := runqsteal(pp, p2, stealRunNextG); gp != nil {
return gp, false, false
}
}
}
// 6. Nothing available — go to sleep
stopm() // put M in idle list, unbind P
}
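The lookup order can be tried out in a single-threaded toy model. Everything here (toyP, toySched, plain slices as queues) is a simplification for illustration: the netpoller and sleeping are left out, and victims are visited in index order rather than from a random start.

```go
package main

import "fmt"

// toyP stands in for runtime.p: a tick counter and a local queue of G ids.
type toyP struct {
	schedtick int
	local     []int // head at index 0
}

// toySched holds the global queue and all Ps.
type toySched struct {
	global []int
	ps     []*toyP
}

// pop removes and returns the head of q.
func pop(q *[]int) (int, bool) {
	if len(*q) == 0 {
		return 0, false
	}
	g := (*q)[0]
	*q = (*q)[1:]
	return g, true
}

// findRunnable returns the next G for pp and where it came from.
func (s *toySched) findRunnable(pp *toyP) (int, string) {
	g, src := s.lookup(pp)
	if src != "none" {
		pp.schedtick++ // one more G scheduled on this P
	}
	return g, src
}

func (s *toySched) lookup(pp *toyP) (int, string) {
	// 1. Every 61st tick, look at the global queue first.
	if pp.schedtick%61 == 0 {
		if g, ok := pop(&s.global); ok {
			return g, "global (fairness check)"
		}
	}
	// 2. Local queue.
	if g, ok := pop(&pp.local); ok {
		return g, "local"
	}
	// 3. Global queue.
	if g, ok := pop(&s.global); ok {
		return g, "global"
	}
	// 4. Steal half of another P's queue, oldest Gs first.
	for _, victim := range s.ps {
		if victim == pp || len(victim.local) == 0 {
			continue
		}
		n := len(victim.local) - len(victim.local)/2
		stolen := append([]int(nil), victim.local[:n]...)
		victim.local = victim.local[n:]
		pp.local = append(pp.local, stolen[1:]...)
		return stolen[0], "stolen"
	}
	return 0, "none"
}

func main() {
	p0 := &toyP{local: []int{1, 2}}
	p1 := &toyP{local: []int{10, 11, 12, 13}}
	s := &toySched{global: []int{100}, ps: []*toyP{p0, p1}}
	for i := 0; i < 5; i++ {
		g, src := s.findRunnable(p0)
		fmt.Printf("G%d from %s\n", g, src)
	}
}
```

Running it shows P0 taking G100 from the global queue on tick 0, then its own G1 and G2, and only then stealing G10 and G11 from P1.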
When does the scheduler run? Not all code runs uninterrupted. The Go scheduler gets execution opportunities at these points:
- Goroutine voluntarily yields: channel ops, mutex locks, time.Sleep, runtime.Gosched()
- Stack check on function calls: the compiler inserts morestack checks at function entries, which also serve as preemption checkpoints
- Before/after system calls: entering/exiting syscalls gives the scheduler a chance to adjust M/P bindings
- Asynchronous preemption signal: Go 1.14+ uses signals to forcibly interrupt long-running Gs
2.2 Work Stealing: When P's Local Queue Is Empty
Work Stealing is the core mechanism by which the GMP scheduler maintains load balance. When a P's local queue is empty, it doesn't sit idle — it steals work from other Ps.
Work Stealing illustration:
P0's perspective:
┌──────────┐ Local queue empty! ┌──────────┐
│ P0 │ ── pick random P ────→ │ P2 │
│ queue:[ ]│ │queue:[G G G G]│
│ │ ◄── steal half (2 Gs) ── │ │
│ queue:[G G]│ │ queue:[G G] │
└──────────┘ └──────────┘
Work Stealing rules:
- How much to steal? Half of the target P's local queue. If the target has 6 Gs, steal 3. This ensures both sides have work to do.
- From whom? A random starting P is chosen, then all Ps are iterated. The random start prevents all idle Ps from targeting the same victim.
- Which end of the queue? P's local run queue is a lock-free ring buffer (size 256). The head is the next G to execute; the tail holds the most recently enqueued G. Stealing takes Gs from the head, the same end the owner dequeues from, by advancing runqhead with an atomic compare-and-swap. A thief never touches the victim's tail index, which is what lets the owner enqueue without a CAS.
- What else can be stolen? Besides queue Gs, the target's runnext can also be stolen. This is a special "about to run" G pointer, and stealing it is controlled by the stealRunNextG flag.
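The index arithmetic of a steal can be sketched on a small ring buffer. This single-threaded model is loosely patterned on the runtime's runqgrab, which copies from the victim's head and then advances it; the type runq, the function stealHalf, and the ring size of 8 are illustrative assumptions (the runtime uses 256 slots and atomic operations).

```go
package main

import "fmt"

const ringSize = 8 // small for readability

// runq is a ring buffer of G ids; head and tail only ever increase.
type runq struct {
	head, tail uint32
	buf        [ringSize]int
}

// push appends g at the tail and reports false if the ring is full.
func (q *runq) push(g int) bool {
	if q.tail-q.head == ringSize {
		return false
	}
	q.buf[q.tail%ringSize] = g
	q.tail++
	return true
}

// size returns the number of queued Gs.
func (q *runq) size() int { return int(q.tail - q.head) }

// stealHalf moves half of victim's Gs (rounded up) to thief, oldest first,
// and returns how many were moved. The thief is assumed to be empty, as it
// is when a P goes stealing.
func stealHalf(thief, victim *runq) int {
	n := victim.tail - victim.head
	n -= n / 2
	for i := uint32(0); i < n; i++ {
		thief.push(victim.buf[(victim.head+i)%ringSize])
	}
	victim.head += n // the stolen entries now belong to the thief
	return int(n)
}

func main() {
	var victim, thief runq
	for g := 1; g <= 6; g++ {
		victim.push(g)
	}
	n := stealHalf(&thief, &victim)
	fmt.Println("stolen:", n, "victim left:", victim.size(), "thief has:", thief.size())
}
```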
// Exercising work stealing
package main

import (
	"fmt"
	"runtime"
	"sync"
)

func main() {
	runtime.GOMAXPROCS(4)
	var wg sync.WaitGroup
	// All goroutines are created from one P's context (main's);
	// the other Ps pick them up via the global queue and work stealing
	for i := 0; i < 1000; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			sum := 0
			for j := 0; j < 1_000_000; j++ {
				sum += j
			}
		}()
	}
	wg.Wait()
	fmt.Println("Work stealing ensures load balance across", runtime.GOMAXPROCS(0), "Ps")
}
2.3 M/P Separation During System Calls (Handoff)
When a goroutine executes a system call, the entire M (OS thread) is blocked by the kernel. Without intervention, the P bound to that M would also sit idle, wasting a parallelism slot.
The handoff mechanism:
Before syscall: During syscall: After syscall return:
M0 ─── P0 M0 (blocked) M0 ─── P0 (if P0 free)
│ ↓ or
G1 (syscall) P0 unbound M0 → puts G1 in global queue
↓ M0 goes idle
M2 ─── P0
│
G3 (running)
Detailed steps:
- Before entering syscall: runtime.entersyscall() is called. G's state → _Gsyscall, P's state → _Psyscall.
- sysmon detection: The sysmon thread periodically checks all Ps in _Psyscall state. If a syscall has lasted more than one sysmon tick (at least 20μs), it performs handoff: P is unbound from M and given to another idle M, or a new M is created for it.
- Syscall return: M tries to reacquire its previous P. If P is occupied by another M, M tries to acquire any idle P. If no idle P is available, M puts G into the global run queue and goes to sleep.
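// Observing thread growth from system calls (Unix-only sketch)
package main

import (
	"fmt"
	"runtime"
	"runtime/pprof"
	"sync"
	"syscall"
	"time"
)

func main() {
	runtime.GOMAXPROCS(2) // only 2 Ps
	threads := pprof.Lookup("threadcreate")
	fmt.Printf("Initial: GOMAXPROCS=%d, threads created=%d\n",
		runtime.GOMAXPROCS(0), threads.Count())

	var fds [2]int
	if err := syscall.Pipe(fds[:]); err != nil {
		panic(err)
	}
	var wg sync.WaitGroup
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			buf := make([]byte, 1)
			// A raw read on an empty pipe bypasses the netpoller,
			// so the M really blocks in the kernel
			syscall.Read(fds[0], buf)
		}()
	}
	time.Sleep(100 * time.Millisecond)
	fmt.Printf("NumGoroutine: %d, threads created=%d\n",
		runtime.NumGoroutine(), threads.Count())
	// More than 2 Ms should exist now (syscalls block the original Ms)
	syscall.Write(fds[1], make([]byte, 10)) // one byte per reader releases them all
	wg.Wait()
}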
Special handling of network I/O — the netpoller:
Go applies a special optimization for network I/O using epoll (Linux) / kqueue (macOS) / IOCP (Windows) for non-blocking I/O. When a goroutine performs network read/write:
- The underlying fd is set to non-blocking mode
- If I/O isn't ready, G is parked on the netpoller's wait list (state _Gwaiting)
- M is NOT blocked: it immediately executes other Gs
- The netpoller is checked during findRunnable, and ready Gs are placed back on run queues
This is why Go network servers can handle massive concurrent connections with few threads — goroutines blocking on network I/O consume zero OS thread resources.
2.4 Preemptive Scheduling
Cooperative preemption (Go 1.13 and earlier):
Before Go 1.14, the scheduler relied on "cooperative preemption." The compiler inserts stack growth checks (morestack) at function entries. The scheduler signals a preemption request by setting G's stackguard0 field to a special sentinel value. The next time G calls a function, the stack check fires, discovers the preemption mark, and voluntarily yields the CPU.
The problem: If a goroutine executes a tight loop without function calls, it never checks the preemption mark, and other goroutines starve:
// Before Go 1.14, this goroutine is never preempted
go func() {
for {
// Pure computation, no function calls
// No preemption check point
x++
}
}()
// Other goroutines may starve
Signal-based asynchronous preemption (Go 1.14+):
Go 1.14 introduced signal-based asynchronous preemption (proposal #24543, by Austin Clements):
- sysmon thread detects that a G has been running for more than 10ms
- sysmon sends SIGURG to the target M (SIGURG was chosen because it doesn't interfere with debuggers or standard signal handlers)
- M's signal handler sighandler receives the signal
- The signal handler checks if execution is at a safe point; if so, it modifies G's PC register to point to asyncPreempt
- After the signal handler returns, G actually jumps to asyncPreempt, saves all register state, then calls schedule() to yield
Async preemption flow:
sysmon detects: G running > 10ms
│
▼
Send SIGURG to M
│
▼
M's signal handler takes over
│
▼
At safe point?
├── No → skip, try again later
└── Yes → set G.pc = asyncPreempt
│
▼
signal return → execute asyncPreempt
│
▼
save all registers → gopreempt_m() → schedule()
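The effect of asynchronous preemption can be observed from user code. The sketch below assumes Go 1.14 or later with default settings; the function name sleepLatencyMs is hypothetical. It restricts the program to one P, starts a spin loop that never blocks, and measures how long a 10ms sleep in the caller really takes.

```go
package main

import (
	"fmt"
	"runtime"
	"sync/atomic"
	"time"
)

// sleepLatencyMs returns how many milliseconds a 10ms sleep took while a
// spinning goroutine competed for the only P.
func sleepLatencyMs() int64 {
	prev := runtime.GOMAXPROCS(1)
	defer runtime.GOMAXPROCS(prev)

	var stop atomic.Bool
	done := make(chan struct{})
	go func() {
		defer close(done)
		for !stop.Load() {
			// spin without ever blocking or yielding
		}
	}()

	start := time.Now()
	time.Sleep(10 * time.Millisecond) // caller parks; the spinner takes the only P
	elapsed := time.Since(start)

	stop.Store(true) // let the spinner finish
	<-done
	return elapsed.Milliseconds()
}

func main() {
	fmt.Printf("10ms sleep returned after %dms\n", sleepLatencyMs())
}
```

With signal-based preemption the sleep returns after a few tens of milliseconds at most. Under purely cooperative scheduling the spinner would keep the P and the caller would never wake.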
The concept of safe points:
Not every moment is safe for preempting a goroutine. For example, if G is executing within a non-preemptible region of the runtime (such as certain GC marking operations), forced preemption could lead to inconsistent state. The Go runtime checks these conditions to determine if it's at a safe point:
- Not in _Gsyscall state
- Not holding runtime-internal locks
- Stack frame information is available (can generate correct stack maps for GC)
2.5 The sysmon Monitor Thread
sysmon is a special daemon thread in the Go runtime — it doesn't bind to any P, runs independently, and serves as the runtime's "watchdog."
// sysmon's main responsibilities (simplified from runtime/proc.go)
func sysmon() {
	idle := 0 // consecutive rounds in which retake found nothing to do
	delay := uint32(0)
	for {
		// Adaptive sleep: 20μs when busy, doubling after 50 idle rounds, capped at 10ms
		if idle == 0 {
			delay = 20 // 20μs
		} else if idle > 50 {
			delay *= 2
		}
		if delay > 10*1000 {
			delay = 10 * 1000 // 10ms
		}
		usleep(delay)
		now := nanotime()

		// 1. Network polling: if >10ms since last netpoll, do a non-blocking poll
		lastpoll := sched.lastpoll.Load()
		if netpollinited() && lastpoll != 0 && lastpoll+10*1000*1000 < now {
			sched.lastpoll.Store(now)
			list, _ := netpoll(0) // non-blocking
			if !list.empty() {
				injectglist(&list) // make the ready Gs runnable
			}
		}

		// 2. Preempt long-running Gs and retake Ps stuck in syscalls
		if retake(now) != 0 {
			idle = 0
		} else {
			idle++
		}

		// 3. Force GC: if >2 minutes without GC, wake the forcegc helper goroutine
		if t := (gcTrigger{kind: gcTriggerTime, now: now}); t.test() {
			var list gList
			list.push(forcegc.g)
			injectglist(&list)
		}

		// 4. Scavenge: return long-unused heap memory to OS
	}
}
sysmon's retake function — preemption and handoff logic:
func retake(now int64) uint32 {
	n := 0
	for i := 0; i < len(allp); i++ {
		pp := allp[i]
		pd := &pp.sysmontick
		s := pp.status
		if s == _Prunning || s == _Psyscall {
			// If the same G has been running for more than
			// forcePreemptNS (10ms), request preemption
			t := int64(pp.schedtick)
			if int64(pd.schedtick) != t {
				pd.schedtick = uint32(t)
				pd.schedwhen = now
			} else if pd.schedwhen+forcePreemptNS <= now {
				preemptone(pp) // set the preempt flag and send the signal
			}
		}
		if s == _Psyscall {
			// Retake P only if the syscall has spanned at least one sysmon tick
			t := int64(pp.syscalltick)
			if int64(pd.syscalltick) != t {
				pd.syscalltick = uint32(t)
				pd.syscallwhen = now
				continue // first time sysmon sees this syscall
			}
			// Leave P alone only if it has no queued work, some M is spinning
			// or some P is idle, and the syscall is younger than 10ms
			if runqempty(pp) && sched.nmspinning.Load()+sched.npidle.Load() > 0 &&
				pd.syscallwhen+10*1000*1000 > now {
				continue
			}
			if atomic.Cas(&pp.status, s, _Pidle) {
				n++
				pp.syscalltick++
				handoffp(pp) // unbind P from the blocked M
			}
		}
	}
	return uint32(n)
}
sysmon's running frequency:
sysmon does not run at a fixed frequency. It uses an adaptive sleep strategy:
- Initially checks every 20μs
- If no events are found consecutively, gradually increases the interval
- Maximum interval is 10ms
- As soon as an event is detected, the interval drops immediately
This balances responsiveness and CPU overhead — checking frequently when the system is busy, reducing overhead when idle.
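The back-off can be written as a pure function and tabulated. The sketch follows the shape of the runtime's sysmon loop (reset to 20μs on activity, double after 50 idle rounds, cap at 10ms); the function name nextDelay is hypothetical.

```go
package main

import "fmt"

// nextDelay returns the next sleep in microseconds, given the number of
// consecutive idle rounds and the previous delay.
func nextDelay(idle int, prev uint32) uint32 {
	delay := prev
	if idle == 0 {
		delay = 20 // something happened last round: poll quickly
	} else if idle > 50 {
		delay *= 2 // quiet for a while: back off exponentially
	}
	if delay > 10_000 {
		delay = 10_000 // never sleep longer than 10ms
	}
	return delay
}

func main() {
	delay := uint32(0)
	for idle := 0; idle <= 60; idle++ {
		delay = nextDelay(idle, delay)
		if idle == 0 || idle >= 50 {
			fmt.Printf("idle=%d delay=%dμs\n", idle, delay)
		}
	}
}
```

Starting from 20μs, the delay stays flat for the first 50 idle rounds and then reaches the 10ms ceiling after nine doublings.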
2.6 Locality Optimizations in the Scheduler
The Go scheduler includes several locality optimizations to reduce cache misses and improve performance:
runnext optimization: Each P has a runnext field pointing to "the next G that should run." When a G creates a new G (go func()), the new G isn't placed at the queue tail but set as the current P's runnext. This means the new G is scheduled immediately, exploiting producer-consumer locality (the data produced is still in cache when the consumer runs).
P affinity: When a G is woken from blocking (for example by a channel send), the runtime places it in the runnext slot of the P that woke it, so waker and wakee run back to back on the same P and share warm cache lines.
Lock-free local queue: P's local run queue is a 256-element ring array implementing a lock-free single-producer multi-consumer queue (only the owning M pushes, but other Ps can steal).
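The interplay between runnext and the FIFO queue can be modeled in a few lines. This is a single-threaded sketch with hypothetical names (localQ, put, get), using 0 as the "empty slot" marker; it ignores atomics and the 256-entry limit.

```go
package main

import "fmt"

// localQ models a P's run queue: one runnext slot plus a FIFO.
type localQ struct {
	runnext int // 0 means empty; G ids start at 1
	fifo    []int
}

// put enqueues g. With next=true, g takes the runnext slot and any G it
// displaces is appended to the FIFO.
func (q *localQ) put(g int, next bool) {
	if next {
		g, q.runnext = q.runnext, g
		if g == 0 {
			return // the slot was empty, nothing to requeue
		}
	}
	q.fifo = append(q.fifo, g)
}

// get returns runnext if set, otherwise the FIFO head; 0 when empty.
func (q *localQ) get() int {
	if g := q.runnext; g != 0 {
		q.runnext = 0
		return g
	}
	if len(q.fifo) == 0 {
		return 0
	}
	g := q.fifo[0]
	q.fifo = q.fifo[1:]
	return g
}

func main() {
	var q localQ
	q.put(1, false)
	q.put(2, false)
	q.put(3, true) // newly created G: goes to runnext
	q.put(4, true) // displaces G3 to the FIFO tail
	for g := q.get(); g != 0; g = q.get() {
		fmt.Print("G", g, " ")
	}
	fmt.Println()
}
```

The most recently created G runs first, while everything else keeps its FIFO order.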
Level 3: What the Specification Defines
3.1 Evolution of the GMP Model: From GM to GMP
Go 1.0's GM model:
Go's original scheduler (Go 1.0) was very simple: only G and M, no P concept.
Go 1.0 scheduler:
┌──────────────────────────────┐
│ Global Run Queue │
│ [G] [G] [G] [G] [G] │
│ (mutex protected) │
└──────────┬───────────────────┘
│
┌──────┼──────┐
▼ ▼ ▼
M0 M1 M2
│ │ │
G G G
This model had severe problems:
- Global queue lock contention: Every M contested the global queue's mutex each time it needed a new G. Under high concurrency, this lock became a severe bottleneck.
- G migration destroying locality: When M0 creates a new G, that G enters the global queue and might be picked up by M1. If the new G needs to access data M0 just processed, cache locality is destroyed.
- Frequent M blocking and waking: Every M blocked by a syscall must contest the global lock again upon waking.
- Memory allocator contention: Go's memory allocator (mcache) was bound to M. Since M count inflates due to syscalls, many Ms holding their own mcache caused memory waste.
Dmitry Vyukov's GMP redesign (Go 1.1, 2013):
In 2012, Google's Dmitry Vyukov submitted his landmark design document "Scalable Go Scheduler Design Doc" (https://docs.google.com/document/d/1TTj4T2JO42uD5ID9e89oa0sLKhJYD0Y_kqxDv3I3XMw). This document analyzed the four core deficiencies of the GM model and proposed introducing P as the solution.
What P solves:
| GM Model Problem | How P Solves It |
|---|---|
| Global queue lock contention | Each P has a lock-free local queue; most operations need no global lock |
| Poor cache locality | G preferentially runs on the P that created it |
| M inflation wastes mcache | mcache bound to P (fixed count), not M |
| Thread blocking wastes parallelism slot | P unbinds from blocked M, immediately assigned to another M |
Key numeric choices in the design:
- P's local queue size is 256 (power of 2 for efficient modulo; large enough to reduce global queue interactions)
- Preemption threshold is 10ms (balances latency and throughput — too short means frequent switching, too long means other Gs wait too long)
- Work stealing takes half of the victim's queue (one successful steal evens out thief and victim, so steals stay infrequent)
- Global queue is checked every 61 schedule ticks (61 is prime, avoiding resonance with fixed patterns in programs)
3.2 Why P Was Needed — Eliminating Global Lock Contention
Let's quantify P's value with performance data. Dmitry Vyukov provided these benchmark results in his design document (on an 8-core machine):
benchmark old ns/op new ns/op speedup
BenchmarkCreateGoroutine 2080 480 4.3x
BenchmarkCreateGoroutineIdle 1010 66 15.3x
BenchmarkOsYield 7700 5700 1.4x
BenchmarkPing 46000 27000 1.7x
Goroutine creation speed improved 4-15x, primarily from eliminating global queue lock contention.
Lock-free local queue implementation details:
P's local run queue uses a classic single-producer multi-consumer lock-free ring buffer:
type p struct {
// ...
runqhead uint32 // atomic read/write
runqtail uint32 // only owner writes
runq [256]guintptr // ring buffer
runnext guintptr // atomic operations
}
// Enqueue (only the M owning this P calls this)
func runqput(pp *p, gp *g, next bool) {
	if next {
	retryNext:
		// Install gp as runnext (atomic CAS, since a stealer may clear the slot)
		oldnext := pp.runnext
		if !pp.runnext.cas(oldnext, guintptr(unsafe.Pointer(gp))) {
			goto retryNext
		}
		if oldnext == 0 {
			return
		}
		gp = oldnext.ptr() // old runnext goes into queue
	}
retry:
	h := atomic.LoadAcq(&pp.runqhead)
	t := pp.runqtail
	if t-h < uint32(len(pp.runq)) {
		pp.runq[t%uint32(len(pp.runq))].set(gp)
		atomic.StoreRel(&pp.runqtail, t+1) // ensure G is visible to stealers
		return
	}
	// Local queue full: move half to global queue
	if runqputslow(pp, gp, h, t) {
		return
	}
	goto retry // the queue was drained in the meantime, so try again
}
Key points:
- runqtail is only written by the owner, so no CAS is needed
- runqhead requires atomic operations because stealers modify it
- atomic.LoadAcq and atomic.StoreRel ensure memory ordering
- When local queue is full (256 entries), half the Gs move to the global queue; this is the "overflow" mechanism
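The same single-producer, multi-consumer pattern can be reproduced in ordinary Go with sync/atomic. This is a sketch under stated assumptions: the names ring and drain and the capacity of 64 are illustrative, and the slots are atomic values so the example stays clean under the race detector, whereas the runtime uses plain slots with acquire/release operations.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"sync/atomic"
)

const size = 64

// ring is a single-producer, multi-consumer queue.
type ring struct {
	head  atomic.Uint32 // advanced by any consumer via CAS
	tail  atomic.Uint32 // advanced only by the producer
	slots [size]atomic.Int64
}

// push is called by the producer only; it reports false when the ring is full.
func (r *ring) push(v int64) bool {
	h, t := r.head.Load(), r.tail.Load()
	if t-h >= size {
		return false
	}
	r.slots[t%size].Store(v)
	r.tail.Store(t + 1) // publish the slot to consumers
	return true
}

// pop may be called from any goroutine; it reports false when the ring is empty.
func (r *ring) pop() (int64, bool) {
	for {
		h, t := r.head.Load(), r.tail.Load()
		if h == t {
			return 0, false
		}
		v := r.slots[h%size].Load()
		if r.head.CompareAndSwap(h, h+1) {
			return v, true // v is ours: nobody else claimed slot h
		}
		// Another consumer won slot h; retry with the new head.
	}
}

// drain pushes 1..n from one producer while several consumers pop, and
// returns the sum of everything the consumers received.
func drain(n int64, consumers int) int64 {
	var r ring
	var sum, popped atomic.Int64
	var wg sync.WaitGroup
	for i := 0; i < consumers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for popped.Load() < n {
				if v, ok := r.pop(); ok {
					sum.Add(v)
					popped.Add(1)
				} else {
					runtime.Gosched()
				}
			}
		}()
	}
	for v := int64(1); v <= n; {
		if r.push(v) {
			v++
		} else {
			runtime.Gosched() // ring full: let consumers catch up
		}
	}
	wg.Wait()
	return sum.Load()
}

func main() {
	fmt.Println("sum of popped values:", drain(10_000, 4))
}
```

A slot can only be overwritten after head has moved past it, and in that case the consumer's compare-and-swap fails, so a successful pop always returns the value that was published for that slot.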
3.3 Theoretical Foundation of Work Stealing
Work Stealing's theoretical foundation comes from Robert D. Blumofe and Charles E. Leiserson's 1999 paper "Scheduling Multithreaded Computations by Work Stealing" (Journal of the ACM, Vol. 46, No. 5).
Core theorem: For a parallel computation with total work T₁ and critical path length T∞, a work-stealing scheduler using P processors achieves expected execution time:
E[Tp] ≤ T₁/P + O(T∞)
Where:
- T₁ is the serial execution time (sum of all tasks)
- T∞ is the critical path length (longest dependency chain)
- P is the number of processors
What does this bound mean? It shows work stealing is near-optimal: execution time consists of ideal parallel amortization (T₁/P) plus a term proportional to the unavoidable serial dependencies (T∞). No scheduler can finish faster than max(T₁/P, T∞), and T₁/P + T∞ is at most twice that maximum, so the bound is within a constant factor of the best possible.
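A quick numeric check makes the bound concrete. The sketch below takes the constant hidden in O(T∞) to be 1, which is an assumption for illustration only; the function names are hypothetical.

```go
package main

import "fmt"

// lowerBound is max(T1/P, Tinf): no schedule on p processors finishes sooner.
func lowerBound(t1, tinf float64, p int) float64 {
	ideal := t1 / float64(p)
	if tinf > ideal {
		return tinf
	}
	return ideal
}

// upperBound is T1/P + Tinf, i.e. the work-stealing bound with the hidden
// constant assumed to be 1.
func upperBound(t1, tinf float64, p int) float64 {
	return t1/float64(p) + tinf
}

func main() {
	t1, tinf := 1000.0, 10.0 // total work and critical path, in arbitrary time units
	for _, p := range []int{1, 8, 100, 1000} {
		fmt.Printf("P=%-4d lower=%.0f upper=%.0f\n",
			p, lowerBound(t1, tinf, p), upperBound(t1, tinf, p))
	}
}
```

With P=8 the two bounds are 125 and 135, so the schedule is close to ideal. Once P exceeds T₁/T∞ (100 here), adding processors no longer helps because the critical path dominates.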
Work Stealing vs Work Sharing comparison:
| Property | Work Stealing | Work Sharing |
|---|---|---|
| When tasks migrate | Idle processor actively steals | Creator actively distributes |
| Communication overhead | Only communicates when idle | Communicates on every G creation |
| Cache locality | Good (G tends to run on creator) | Poor (G immediately distributed remotely) |
| Load balance latency | May have brief imbalance | More immediate balance |
| Best for | Frequent task creation, uneven durations | Infrequent creation, uniform durations |
Go chose work stealing because goroutine creation is extremely frequent (potentially millions per second). If every creation involved distribution, communication overhead would overwhelm the system.
Another key conclusion from Blumofe-Leiserson — steal attempt count:
The theory proves that the expected total number of steal attempts throughout a computation is O(P · T∞). This means steal operations (which access remote P queues and incur communication cost) scale linearly with processor count but are independent of total work. For Go programs, if dependency chains between goroutines are short (small T∞), steals are infrequent, and most of the time each P executes locally.
3.4 Comparison with Other Runtime Schedulers
Erlang BEAM scheduler:
Erlang's BEAM virtual machine uses a model similar to Go's but predates it:
- One scheduler thread per CPU core (equivalent to P)
- Each scheduler has its own run queue
- Uses a hybrid of work stealing and work sharing
- Key difference from Go: Erlang processes share absolutely no memory — message passing is the sole communication mechanism, so the scheduler can migrate processes without considering cache coherence costs
Erlang BEAM: Go GMP:
Scheduler 1 ─── RunQueue P0 ─── LocalRunQueue
Scheduler 2 ─── RunQueue P1 ─── LocalRunQueue
Scheduler 3 ─── RunQueue P2 ─── LocalRunQueue
GlobalRunQueue
Migration queues Work Stealing
Reduction/compaction on low load Spinning M waiting
Erlang's unique approach is "reduction counting" for preemption: each process runs approximately 4000 reductions (roughly corresponding to function calls and BIF invocations) before being forcibly switched. This is more precise than Go's 10ms time slice because it doesn't depend on clock interrupts.
Java Virtual Threads (Project Loom, JDK 19+):
Java's Virtual Threads were directly inspired by Go goroutines, adopting a very similar M:N scheduling model:
- Platform Thread ≈ M (OS thread)
- Virtual Thread ≈ G (userspace fiber)
- ForkJoinPool ≈ Collection of Ps
// Java Virtual Thread example
try (var executor = Executors.newVirtualThreadPerTaskExecutor()) {
for (int i = 0; i < 100_000; i++) {
executor.submit(() -> {
Thread.sleep(Duration.ofSeconds(1));
return "done";
});
}
}
Key differences:
| Dimension | Go goroutine | Java Virtual Thread |
|---|---|---|
| Scheduler | Custom GMP | ForkJoinPool (work stealing) |
| Preemption | Signal-based async (10ms) | Cooperative only (safepoint) |
| Stack | Contiguous, copy-on-grow | Frames stored on heap |
| Pinning | LockOSThread | synchronized blocks pin |
| Structured concurrency | No native support | StructuredTaskScope |
| Maturity | Since 2012 (10+ years) | GA in 2023 (JDK 21) |
A known issue with Java Virtual Threads is "pinning": when a virtual thread enters a synchronized block (in JDK 21 through 23; JDK 24 removes this case via JEP 491) or executes native methods, it gets pinned to its carrier thread and cannot be unmounted by the scheduler. This is loosely analogous to Go's LockOSThread().
Rust tokio runtime:
Rust's async runtime tokio also uses a work-stealing scheduler, but the model is fundamentally different:
// Rust tokio example
use std::time::Duration;
#[tokio::main]
async fn main() {
let handles: Vec<_> = (0..100_000).map(|i| {
tokio::spawn(async move {
tokio::time::sleep(Duration::from_secs(1)).await;
i
})
}).collect();
for handle in handles {
handle.await.unwrap();
}
}
| Dimension | Go goroutine | Rust tokio task |
|---|---|---|
| Model | Stackful coroutine | Stackless coroutine |
| Memory per task | ~2-8 KB (stack) | ~tens of bytes (Future state machine) |
| Yield mechanism | Runtime implicitly manages | Must explicitly .await |
| Preemption | Yes (Go 1.14+) | No (fully cooperative) |
| Compile-time guarantees | No Send/Sync checking | Compile-time Send + 'static check |
| Blocking handling | Automatic handoff | Must use explicit spawn_blocking |
tokio's core difference is the stackless coroutine model — Rust's async/await is compiled into state machines by the compiler, requiring no stack allocation per task. This yields higher memory efficiency, but the cost is that every I/O point requires an explicit await, and preemption is impossible. If a task performs a blocking operation without spawn_blocking, it blocks the entire worker thread.
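For comparison, here is a minimal Go sketch of the same workload as the Java and Rust examples above. The helper name `spawnSleepers` is made up for illustration. It shows the "runtime implicitly manages" row of the table: there is no await keyword, and `time.Sleep` simply parks the goroutine so the thread can run others.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// spawnSleepers starts n goroutines that each sleep for d and returns how
// many of them completed. No explicit yield point is written anywhere:
// time.Sleep parks the goroutine and the runtime reuses the thread.
func spawnSleepers(n int, d time.Duration) int {
	var wg sync.WaitGroup
	results := make([]int, n)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			time.Sleep(d)
			results[i] = 1 // each goroutine writes only its own slot
		}(i)
	}
	wg.Wait()

	done := 0
	for _, r := range results {
		done += r
	}
	return done
}

func main() {
	fmt.Println(spawnSleepers(100_000, time.Second))
}
```

The whole run should take roughly as long as a single sleep, because sleeping goroutines do not occupy a P.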
3.5 Key Design Trade-offs in the Scheduler
Go's scheduler design is full of deliberate trade-offs:
1. Fairness vs Throughput:
The 10ms preemption threshold is a compromise. Shorter time slices provide better fairness (low latency) but increase switching overhead (lower throughput). Linux's CFS scheduler targets a scheduling period on the order of 6ms by default (sched_latency, divided among the runnable tasks and bounded below by sched_min_granularity), so an individual slice is often shorter still; Go's 10ms is relatively generous, favoring throughput.
2. Queue Size vs Overflow Frequency:
The 256-element local queue is another trade-off. A larger queue reduces overflow to the global queue but increases coherence costs when stolen from; a smaller queue makes work stealing more frequent but also faster.
3. Spinning Ms vs Response Latency:
The Go scheduler maintains a small set of "spinning" Ms: they hold a P but aren't executing any G, and keep actively looking for work instead of going to sleep. This wastes CPU but reduces wakeup latency. The number of spinning Ms is kept small (no more than half the number of busy Ps), balancing CPU utilization and responsiveness.
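The spinning throttle can be written as a one-line predicate. The sketch below is a simplified model, not the runtime's code: the function name and parameters are hypothetical, and it assumes the check compares twice the number of spinning Ms against the number of busy Ps (GOMAXPROCS minus idle Ps), which is how the condition in the runtime's work-finding loop reads.

```go
package main

import "fmt"

// shouldSpin is a simplified model of the throttle: an M may start spinning
// only while twice the number of spinning Ms is still below the number of
// busy Ps. All three inputs are plain counters supplied by the caller.
func shouldSpin(spinning, gomaxprocs, idleProcs int) bool {
	busy := gomaxprocs - idleProcs
	return 2*spinning < busy
}

func main() {
	fmt.Println(shouldSpin(0, 8, 4)) // 4 busy Ps, no spinners yet
	fmt.Println(shouldSpin(2, 8, 4)) // 2 spinners already cover 4 busy Ps
	fmt.Println(shouldSpin(0, 8, 8)) // nothing is busy, so nothing to look for
}
```

The effect of the cap is that an idle system does not burn CPU on searching, while a busy one always has a few Ms ready to pick up newly runnable goroutines.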
Level 4: Edge Cases and Pitfalls
4.1 Common Causes and Detection of Goroutine Leaks
Goroutine leaks are the most common resource leak in Go programs — forgotten goroutines cannot be garbage collected (their stacks may hold references), causing memory to grow continuously until OOM.
Common leak scenarios:
// Scenario 1: Sending to a channel nobody receives from
func leak1() {
ch := make(chan int)
go func() {
ch <- expensiveComputation() // blocks forever
}()
// Function returns, ch unreachable, but goroutine can't exit
}
// Scenario 2: Worker outlives a caller that has stopped listening
func leak2(ctx context.Context) error {
results := make(chan *Result)
go func() {
r, err := callExternalService()
if err != nil {
return // Note: returns without sending; the caller then waits for ctx
}
results <- r // blocks forever if the caller has already returned
}()
select {
case r := <-results:
return process(r)
case <-ctx.Done():
// If ctx times out, the goroutine is still running callExternalService.
// When that call finishes, the send on the unbuffered channel has no
// receiver, so the goroutine never exits.
return ctx.Err()
}
}
// Scenario 3: Forgetting to close channel causes range to block forever
func leak3() {
ch := make(chan int)
go func() {
for v := range ch { // blocks forever because ch is never closed
process(v)
}
}()
ch <- 1
ch <- 2
// forgot close(ch)
}
// Scenario 4: Mutual waiting (goroutine deadlock)
func leak4() {
ch1 := make(chan int)
ch2 := make(chan int)
go func() {
<-ch1 // wait for ch1
ch2 <- 1 // then send to ch2
}()
go func() {
<-ch2 // wait for ch2
ch1 <- 1 // then send to ch1
}()
// Both goroutines wait on each other forever
}
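One common way to repair the timeout scenario above is to give the result channel a capacity of 1, so the worker's send always succeeds even when the caller has already returned. The sketch below illustrates that fix under stated assumptions: `fetch` is a hypothetical stand-in for the external call, and `leakedAfterCancel` exists only to make the effect observable.

```go
package main

import (
	"context"
	"fmt"
	"runtime"
	"time"
)

// fetch is a hypothetical stand-in for a slow external call.
func fetch() (string, error) {
	time.Sleep(20 * time.Millisecond)
	return "payload", nil
}

// fetchWithContext returns early when ctx is done without leaking the worker:
// the channel has capacity 1, so the send succeeds even if nobody receives.
func fetchWithContext(ctx context.Context) (string, error) {
	type result struct {
		val string
		err error
	}
	results := make(chan result, 1)
	go func() {
		v, err := fetch()
		results <- result{v, err} // never blocks; errors are delivered too
	}()
	select {
	case r := <-results:
		return r.val, r.err
	case <-ctx.Done():
		return "", ctx.Err()
	}
}

// leakedAfterCancel calls fetchWithContext with an already-cancelled context
// and reports how many extra goroutines remain after the worker had time to finish.
func leakedAfterCancel() int {
	before := runtime.NumGoroutine()
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	_, _ = fetchWithContext(ctx)
	time.Sleep(200 * time.Millisecond) // give the worker time to finish
	return runtime.NumGoroutine() - before
}

func main() {
	fmt.Println("leaked goroutines:", leakedAfterCancel())
}
```

The buffered channel only prevents the stuck send. If the external call itself can hang, it also needs to accept the context so it can be cancelled.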
Detection methods:
Method 1: runtime.NumGoroutine() monitoring
// Check for goroutine leaks in tests
func TestNoLeak(t *testing.T) {
before := runtime.NumGoroutine()
// Execute code under test
doSomething()
// Wait for goroutines to exit
time.Sleep(100 * time.Millisecond)
after := runtime.NumGoroutine()
if after > before {
t.Errorf("goroutine leak: before=%d after=%d", before, after)
}
}
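The fixed 100ms sleep in the test above can make it flaky on slow machines. A minimal alternative, sketched here with hypothetical helper names, is to poll the goroutine count until it settles or a deadline passes.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// waitForGoroutines polls until the goroutine count drops to at most want
// or the timeout expires, and reports whether the target was reached.
func waitForGoroutines(want int, timeout time.Duration) bool {
	deadline := time.Now().Add(timeout)
	for {
		if runtime.NumGoroutine() <= want {
			return true
		}
		if time.Now().After(deadline) {
			return false
		}
		time.Sleep(5 * time.Millisecond)
	}
}

// settlesAfterWork starts n short-lived goroutines and checks that the count
// returns to its starting value within one second.
func settlesAfterWork(n int) bool {
	before := runtime.NumGoroutine()
	for i := 0; i < n; i++ {
		go func() { time.Sleep(10 * time.Millisecond) }()
	}
	return waitForGoroutines(before, time.Second)
}

func main() {
	fmt.Println(settlesAfterWork(100))
}
```

Polling returns as soon as the count is back to the baseline, so a passing test is fast and a failing one waits only for the timeout.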
Method 2: goleak library (open-sourced by Uber)
import "go.uber.org/goleak"
func TestMain(m *testing.M) {
goleak.VerifyTestMain(m)
}
// Or in individual tests
func TestFoo(t *testing.T) {
defer goleak.VerifyNone(t)
// ... test code ...
}
goleak works by capturing all goroutine stacks at test completion, filtering out known system goroutines (from runtime, testing packages), and failing if any other goroutines remain alive.
Method 3: pprof goroutine profile
import (
"net/http"
_ "net/http/pprof"
)
func main() {
go func() {
http.ListenAndServe(":6060", nil) // expose pprof endpoints
}()
// ...
}
// Visit http://localhost:6060/debug/pprof/goroutine?debug=1
// to see goroutine stacks grouped by identical call stack (with counts);
// use ?debug=2 for a full dump of every goroutine, including its wait state
// Or via command line:
// go tool pprof http://localhost:6060/debug/pprof/goroutine
// (pprof) top: see which functions most goroutines are currently parked in
// (pprof) traces: see all goroutine call stacks
Method 4: Continuous monitoring + alerting
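// Expose goroutine count in Prometheus
// Note: the Go collector on client_golang's default registry already exports
// go_goroutines, so a hand-written gauge like this is only needed on a custom
// registry (registering both on the same registry fails as a duplicate)
import (
"runtime"
"github.com/prometheus/client_golang/prometheus"
)
var goroutineGauge = prometheus.NewGaugeFunc(
prometheus.GaugeOpts{
Name: "go_goroutines",
Help: "Number of goroutines that currently exist.",
},
func() float64 { return float64(runtime.NumGoroutine()) },
)
// Alerting rule (evaluated by Prometheus, routed by Alertmanager):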
// alert: GoroutineLeak
// expr: go_goroutines > 10000
// for: 5m
4.2 Performance Impact of Incorrect GOMAXPROCS
Scenario 1: Not adapting to container CPU quota
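This is the most common production issue. Before Go 1.25, the runtime derives the default GOMAXPROCS from the host's CPU count and ignores the cgroup CPU limit (Go 1.25 and later take the limit into account on Linux by default). Suppose a Kubernetes Pod is configured with resources.limits.cpu: "2" (2 cores), but the host has 64 cores: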
GOMAXPROCS = 64 (wrong!)
├── Creates 64 Ps
├── 64 Ms compete for 2 cores of CPU time
├── Massive context switching (Linux CFS throttles)
├── Scheduling latency increases
└── Actual throughput WORSE than GOMAXPROCS=2
// Fix
import _ "go.uber.org/automaxprocs" // auto-detects CFS quota in init()
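// How automaxprocs works:
// 1. Reads the CFS quota (cgroup v1: /sys/fs/cgroup/cpu/cpu.cfs_quota_us; cgroup v2: cpu.max)
// 2. Reads the CFS period (cgroup v1: /sys/fs/cgroup/cpu/cpu.cfs_period_us; cgroup v2: cpu.max)
// 3. GOMAXPROCS = quota / period (rounded down by default, minimum 1)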
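The quota arithmetic is small enough to sketch directly. This is not automaxprocs' code: `procsFromQuota` is a hypothetical helper, the values are passed in rather than read from cgroup files, and the rounding direction is left as a parameter because it is a policy choice.

```go
package main

import (
	"fmt"
	"math"
)

// procsFromQuota converts a CFS quota and period (both in microseconds) into
// a GOMAXPROCS value. It returns 0 when no quota is configured (quota <= 0),
// meaning the caller should keep the runtime default.
func procsFromQuota(quotaUs, periodUs int64, roundUp bool) int {
	if quotaUs <= 0 || periodUs <= 0 {
		return 0
	}
	cores := float64(quotaUs) / float64(periodUs)
	var procs int
	if roundUp {
		procs = int(math.Ceil(cores))
	} else {
		procs = int(math.Floor(cores))
	}
	if procs < 1 {
		procs = 1 // a fractional limit below one core still needs one P
	}
	return procs
}

func main() {
	fmt.Println(procsFromQuota(200000, 100000, false)) // limit of 2 cores
	fmt.Println(procsFromQuota(150000, 100000, false)) // limit of 1.5 cores, rounded down
	fmt.Println(procsFromQuota(150000, 100000, true))  // limit of 1.5 cores, rounded up
	fmt.Println(procsFromQuota(-1, 100000, false))     // no quota configured
}
```

A caller would pass a non-zero result to runtime.GOMAXPROCS at startup. Rounding down never oversubscribes the quota, while rounding up never leaves a fractional core unused.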
Scenario 2: Different strategies for CPU-bound vs I/O-bound
// CPU-bound: GOMAXPROCS = CPU core count (default is optimal)
// Encryption, image processing, numerical simulation, etc.
func cpuBound() {
runtime.GOMAXPROCS(runtime.NumCPU()) // this IS the default
}
// I/O-bound: GOMAXPROCS = CPU core count is already sufficient
// Because goroutines blocking on I/O don't occupy P; P executes other Gs
// No need to increase GOMAXPROCS
// Mixed workloads: special cases may need tuning
// Many blocking CGO calls tie up Ms (threads), not Ps: the runtime hands the P
// to another M, so raising GOMAXPROCS rarely helps; bound the number of
// concurrent CGO calls instead
Scenario 3: Special uses of GOMAXPROCS=1
// Setting GOMAXPROCS to 1 simplifies concurrency reasoning
runtime.GOMAXPROCS(1)
// All goroutines can only be concurrent, never parallel
// Useful in certain test scenarios (reproducing race conditions)
// But note: this is NOT a substitute for the race detector
Benchmark: GOMAXPROCS impact on different workloads
package main
import (
"crypto/sha256"
"fmt"
"runtime"
"sync"
"time"
)
func benchCPU(procs int) time.Duration {
runtime.GOMAXPROCS(procs)
start := time.Now()
var wg sync.WaitGroup
for i := 0; i < 8; i++ {
wg.Add(1)
go func() {
defer wg.Done()
data := make([]byte, 1024)
for j := 0; j < 100_000; j++ {
sha256.Sum256(data)
}
}()
}
wg.Wait()
return time.Since(start)
}
func main() {
for _, p := range []int{1, 2, 4, 8, 16} {
d := benchCPU(p)
fmt.Printf("GOMAXPROCS=%2d: %v\n", p, d)
}
}
// Illustrative output (8-core machine; absolute times depend on the CPU):
// GOMAXPROCS= 1: 12.3s
// GOMAXPROCS= 2: 6.2s
// GOMAXPROCS= 4: 3.1s
// GOMAXPROCS= 8: 1.6s ← optimal
// GOMAXPROCS=16: 1.7s ← beyond core count, no improvement
4.3 High-Frequency Interview Questions
Question 1: Draw and explain the GMP model
Key points for your answer:
┌─────────────────────────────────────────────────────────┐
│ Go Process │
│ │
│ ┌───────────────── Global Run Queue ──────────────┐ │
│ │ [G] [G] [G] ... (mutex protected, low freq) │ │
│ └────────────────────────────────────────────────────┘ │
│ │
│ P0 ─────────── P1 ─────────── P2 ─────────── ... │
│ │ LRQ:[G][G] │ LRQ:[G] │ LRQ:[G][G][G] │
│ │ runnext: G │ runnext: nil │ runnext: G │
│ │ mcache │ mcache │ mcache │
│ │ │ │ │
│ │ ↕ bound │ ↕ bound │ ↕ bound │
│ │ │ │ │
│ M0 (thread) M1 (thread) M2 (thread) │
│ │ executing G │ executing G │ executing G │
│ │
│ Idle M list: M3, M4 (waiting for P) │
│ sysmon: independent M, no P binding │
└─────────────────────────────────────────────────────────┘
Data flow:
1. go func() → new G placed in current P's LRQ
2. schedule() → take G from LRQ and execute
3. LRQ empty → steal half from another P's LRQ
4. syscall → P unbinds from M (handoff)
5. G running >10ms → sysmon sends SIGURG to preempt
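Steps 1 and 3 of the data flow can be modeled with two plain slices. The sketch below is a toy, not the runtime's lock-free ring buffer: the type and function names are hypothetical, and it assumes the rules described in this chapter (a 256-slot local queue, overflow of half the queue plus the new G to the global queue, and stealing half of a victim's queue).

```go
package main

import "fmt"

const localCap = 256 // capacity of a P's local run queue

// toyP models one P: a bounded local queue of goroutine IDs.
type toyP struct{ local []int }

// toySched holds the Ps and the shared global queue.
type toySched struct {
	ps     []*toyP
	global []int
}

// put enqueues g on p. When the local queue is full, half of it plus g
// moves to the global queue.
func (s *toySched) put(p *toyP, g int) {
	if len(p.local) < localCap {
		p.local = append(p.local, g)
		return
	}
	half := len(p.local) / 2
	s.global = append(s.global, p.local[:half]...)
	s.global = append(s.global, g)
	p.local = append([]int(nil), p.local[half:]...)
}

// steal moves half of victim's queue (rounded up) to thief and returns
// how many goroutines were moved.
func steal(thief, victim *toyP) int {
	n := len(victim.local) - len(victim.local)/2
	thief.local = append(thief.local, victim.local[:n]...)
	victim.local = append([]int(nil), victim.local[n:]...)
	return n
}

func main() {
	s := &toySched{ps: []*toyP{{}, {}}}
	for g := 0; g < 257; g++ {
		s.put(s.ps[0], g) // the 257th put overflows to the global queue
	}
	fmt.Println(len(s.ps[0].local), len(s.global)) // 128 129
	fmt.Println(steal(s.ps[1], s.ps[0]))           // 64
}
```

The real implementation does the same bookkeeping with atomic head and tail indexes so that the owner P and thieves never take a lock on the local queue.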
Question 2: Differences between goroutines and threads
| Dimension | goroutine | OS Thread |
|---|---|---|
| Created by | Go runtime | OS kernel |
| Initial stack | 2 KB, dynamically grows | 1-8 MB, fixed |
| Scheduling | Go runtime M:N scheduling | Kernel 1:1 scheduling |
| Switch cost | ~200 ns (userspace) | ~1-10 μs (kernel mode) |
| ID | Has one but intentionally not exposed | pthread_t / tid |
| Communication | channel (CSP) | Shared memory + locks |
| Preemption | Runtime signal (10ms) | Kernel clock interrupt (~6ms) |
| Count limit | Millions | Thousands (limited by stack memory) |
Question 3: Why isn't goroutine ID exposed?
The Go team intentionally does not provide a public API to get goroutine IDs in the runtime package. The reasons:
- Exposing goroutine IDs would tempt developers to write "thread-local storage" style code (storing/retrieving state by ID), violating Go's philosophy of "share memory by communicating"
- Goroutines should be anonymous, interchangeable execution units without "identity"
- If truly needed (e.g., request tracing in logs), you should pass request-level identifiers via context.Context
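// While you CAN hack out the goroutine ID, this is NOT recommended
import (
"bytes"
"runtime"
"strconv"
)
func goid() int64 {
var buf [64]byte
n := runtime.Stack(buf[:], false)
// Parse "123" from "goroutine 123 [running]:"
fields := bytes.Fields(buf[:n])
if len(fields) < 2 {
return -1
}
id, _ := strconv.ParseInt(string(fields[1]), 10, 64)
return id // an extreme hack: do NOT use in production code
}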
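The recommended alternative, carrying a request-level identifier in context.Context, might look like the following sketch. The key type and helper names are hypothetical; only `context.WithValue` and `Context.Value` are standard library API.

```go
package main

import (
	"context"
	"fmt"
)

// requestIDKey is an unexported key type, so other packages cannot collide with it.
type requestIDKey struct{}

// withRequestID returns a child context that carries id.
func withRequestID(ctx context.Context, id string) context.Context {
	return context.WithValue(ctx, requestIDKey{}, id)
}

// requestID extracts the ID, or "unknown" if the context carries none.
func requestID(ctx context.Context) string {
	if id, ok := ctx.Value(requestIDKey{}).(string); ok {
		return id
	}
	return "unknown"
}

// logLine formats a message tagged with the request ID carried by ctx.
func logLine(ctx context.Context, msg string) string {
	return fmt.Sprintf("[req=%s] %s", requestID(ctx), msg)
}

// traced is a convenience wrapper that shows end-to-end use.
func traced(id, msg string) string {
	ctx := withRequestID(context.Background(), id)
	return logLine(ctx, msg)
}

func main() {
	fmt.Println(traced("abc-123", "handling request")) // [req=abc-123] handling request
}
```

Because the identifier travels with the context rather than with the goroutine, it survives hand-offs to worker goroutines, which a goroutine ID would not.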
Question 4: When does a goroutine get scheduled away?
- Channel operation blocks (send/receive/select)
- Mutex/RWMutex contention failure
- time.Sleep, or receiving from a timer channel (e.g. <-time.After(d))
- Network I/O (goes through netpoller underneath)
- System call (file I/O, etc.)
- runtime.Gosched() voluntary yield
- Running beyond 10ms triggers signal preemption
- GC STW (Stop The World)
- Stack growth needed (rarely causes scheduling, but pauses execution)
Question 5: What happens when a goroutine calls LockOSThread()?
runtime.LockOSThread()
// Effects:
// 1. Current G is bound to current M; no other G will run on this M
// 2. Current M will only execute this G
// 3. If this G creates new Gs, they run on other Ms
// 4. Binding is released when G exits or calls UnlockOSThread()
// Use cases:
// - CGO calls (some C libraries require same-thread calling)
// - GUI frameworks (main thread restriction)
// - Specific Linux namespace operations (e.g., setns)
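For the "same-thread calling" use cases, a common pattern is one goroutine that locks its thread and executes callbacks sent to it over a channel. The sketch below shows that pattern with hypothetical names; the additions in `sumOnLockedThread` merely stand in for calls into a thread-affine library.

```go
package main

import (
	"fmt"
	"runtime"
)

// startLockedWorker starts one goroutine locked to its OS thread. It returns
// run, which executes a callback on that thread and waits for it, and stop.
func startLockedWorker() (run func(f func()), stop func()) {
	jobs := make(chan func())
	go func() {
		runtime.LockOSThread()
		defer runtime.UnlockOSThread()
		for f := range jobs {
			f() // every callback runs on the same OS thread
		}
	}()
	run = func(f func()) {
		done := make(chan struct{})
		jobs <- func() { f(); close(done) }
		<-done
	}
	stop = func() { close(jobs) }
	return run, stop
}

// sumOnLockedThread performs n additions on the locked thread and returns
// the total; the channel hand-offs order the accesses to total.
func sumOnLockedThread(n int) int {
	run, stop := startLockedWorker()
	defer stop()
	total := 0
	for i := 1; i <= n; i++ {
		i := i
		run(func() { total += i })
	}
	return total
}

func main() {
	fmt.Println(sumOnLockedThread(10)) // 55
}
```

Funnelling all thread-affine calls through one locked goroutine keeps the number of pinned Ms at one, instead of locking a thread per caller.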
4.4 Practical Debugging: GODEBUG=schedtrace
The Go runtime provides powerful built-in debugging tools for observing real-time scheduler behavior without code changes.
schedtrace basic usage:
# Print scheduler state every 1000ms
GODEBUG=schedtrace=1000 ./myapp
# Example output:
# SCHED 0ms: gomaxprocs=8 idleprocs=5 threads=10 spinningthreads=2
# idlethreads=3 runqueue=0 [2 0 1 0 3 0 0 1]
Field meanings:
| Field | Meaning |
|---|---|
| gomaxprocs=8 | Number of Ps |
| idleprocs=5 | Number of idle Ps |
| threads=10 | Total M (OS thread) count |
| spinningthreads=2 | Spinning Ms (looking for work) |
| idlethreads=3 | Sleeping Ms |
| runqueue=0 | Gs in global run queue |
| [2 0 1 0 3 0 0 1] | Gs in each P's local queue |
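When schedtrace output is collected over time, it helps to turn each summary line into numbers. The parser below is a minimal sketch with hypothetical type and function names. It assumes the single-line `key=value ... [n n n]` layout shown above and tolerates extra fields that other Go versions may print.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// schedSample holds the counters from one schedtrace summary line.
type schedSample struct {
	Fields map[string]int // gomaxprocs, idleprocs, threads, runqueue, ...
	Local  []int          // per-P local run queue lengths
}

// parseSchedLine parses one line starting with "SCHED".
func parseSchedLine(line string) (schedSample, error) {
	s := schedSample{Fields: map[string]int{}}
	if !strings.HasPrefix(line, "SCHED ") {
		return s, fmt.Errorf("not a schedtrace line: %q", line)
	}
	head, tail := line, ""
	if i := strings.Index(line, "["); i >= 0 {
		head, tail = line[:i], strings.Trim(line[i:], "[] \n")
	}
	for _, tok := range strings.Fields(head) {
		k, v, ok := strings.Cut(tok, "=")
		if !ok {
			continue // skips "SCHED" and the timestamp
		}
		n, err := strconv.Atoi(v)
		if err != nil {
			return s, fmt.Errorf("field %s: %w", k, err)
		}
		s.Fields[k] = n
	}
	for _, tok := range strings.Fields(tail) {
		n, err := strconv.Atoi(tok)
		if err != nil {
			return s, fmt.Errorf("local queue: %w", err)
		}
		s.Local = append(s.Local, n)
	}
	return s, nil
}

// backlog returns the global queue length plus all local queue lengths.
func backlog(s schedSample) int {
	total := s.Fields["runqueue"]
	for _, n := range s.Local {
		total += n
	}
	return total
}

func main() {
	line := "SCHED 0ms: gomaxprocs=8 idleprocs=5 threads=10 spinningthreads=2 idlethreads=3 runqueue=0 [2 0 1 0 3 0 0 1]"
	s, err := parseSchedLine(line)
	if err != nil {
		panic(err)
	}
	fmt.Println(s.Fields["gomaxprocs"], backlog(s)) // 8 7
}
```

A backlog that keeps rising while idleprocs stays at zero suggests that the program has more runnable work than Ps to run it.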
scheddetail for more verbose output:
GODEBUG=schedtrace=1000,scheddetail=1 ./myapp
# Also outputs detailed state for each P and M:
# P0: status=1 schedtick=3423 syscalltick=88 m=0 runqsize=2 gfreecnt=5
# M0: p=0 curg=17 mallocing=0 throwing=0 preemptoff= locks=0
# G17: status=2(running) m=0 lockedm=-1
Real-world debugging case: Diagnosing scheduling latency
// Symptom: Some HTTP requests suddenly slow (P99 tail latency spikes from 5ms to 50ms)
// Using schedtrace:
// SCHED 3000ms: gomaxprocs=4 idleprocs=0 threads=12 spinningthreads=0
// idlethreads=8 runqueue=47 [128 64 0 0]
// Findings:
// 1. idleprocs=0 — all Ps are busy
// 2. runqueue=47 — 47 Gs queued in global queue
// 3. [128 64 0 0] — P0 and P1 severely backlogged, P2/P3 empty
// Diagnosis: P2 and P3 likely stuck in long syscalls (P transferred after handoff)
// But new Gs are still concentrated on P0/P1 (created from those contexts)
// Solution: Check if goroutines are doing long synchronous file I/O
Using the execution tracer for deeper analysis:
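import (
"os"
"runtime/trace"
)
func main() {
f, err := os.Create("trace.out")
if err != nil {
panic(err)
}
defer f.Close()
trace.Start(f) // returns an error if tracing is already enabled
defer trace.Stop()
// ... your program ...
}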
// Then view with go tool trace:
// go tool trace trace.out
// You can see:
// - Timeline of goroutines on each P
// - Latency from G creation to execution
// - Work stealing events
// - P handoffs due to syscalls
// - GC STW duration and impact
# Capture trace from a running program (via pprof HTTP)
curl -o trace.out 'http://localhost:6060/debug/pprof/trace?seconds=5'
go tool trace trace.out
Other GODEBUG options related to scheduling:
# Combine multiple options
GODEBUG=schedtrace=1000,gctrace=1,madvdontneed=1 ./myapp
# gctrace=1 — GC logs (shows STW duration, affects scheduling)
# asyncpreemptoff=1 — disable async preemption (for comparison testing)
# tracebackancestors=N — goroutine creation chain trace depth
4.5 Production Best Practices
Practice 1: Use context to control goroutine lifecycle
// Standard pattern: all goroutines should be cancellable
func worker(ctx context.Context, tasks <-chan Task) error {
for {
select {
case <-ctx.Done():
return ctx.Err() // graceful exit
case task, ok := <-tasks:
if !ok {
return nil // channel closed
}
if err := process(ctx, task); err != nil {
return err
}
}
}
}
// Pass a cancellable context at startup
ctx, cancel := context.WithCancel(context.Background())
defer cancel() // signals all worker goroutines to exit (it does not wait for them)
for i := 0; i < numWorkers; i++ {
go worker(ctx, tasks)
}
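Cancelling the context only tells the workers to stop. To know that they have actually exited, the pattern is usually paired with a sync.WaitGroup. A minimal sketch, with hypothetical function names and integer tasks standing in for real work:

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"sync/atomic"
)

// runWorkers starts n workers that drain tasks until the channel closes or
// ctx is cancelled, waits for all of them, and returns the processed count.
func runWorkers(ctx context.Context, n int, tasks <-chan int) int64 {
	var processed atomic.Int64
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				select {
				case <-ctx.Done():
					return
				case _, ok := <-tasks:
					if !ok {
						return // channel closed, no more work
					}
					processed.Add(1)
				}
			}
		}()
	}
	wg.Wait() // cancel only signals; Wait confirms the workers are gone
	return processed.Load()
}

// processAll feeds count tasks to n workers and returns how many were handled.
func processAll(n, count int) int64 {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	tasks := make(chan int, count)
	for i := 0; i < count; i++ {
		tasks <- i
	}
	close(tasks)
	return runWorkers(ctx, n, tasks)
}

func main() {
	fmt.Println(processAll(4, 100)) // 100
}
```

Waiting matters most during shutdown: returning from main while workers are still mid-task cuts them off without any cleanup.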
Practice 2: Limit concurrent goroutine count
// Semaphore pattern to limit concurrency
sem := make(chan struct{}, maxConcurrency)
for _, item := range items {
sem <- struct{}{} // acquire semaphore, may block
go func(item Item) {
defer func() { <-sem }() // release semaphore
process(item)
}(item)
}
// Or use errgroup
import "golang.org/x/sync/errgroup"
g, ctx := errgroup.WithContext(ctx)
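g.SetLimit(maxConcurrency) // needs a golang.org/x/sync release with SetLimit (added in 2022); not tied to a Go version
for _, item := range items {
item := item // per-iteration copy; unnecessary since Go 1.22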
g.Go(func() error {
return process(ctx, item)
})
}
if err := g.Wait(); err != nil {
// handle error
}
Practice 3: Avoid goroutine creation in hot paths
// Anti-pattern: creating a goroutine per request for timeout control
func handleRequest(w http.ResponseWriter, r *http.Request) {
done := make(chan struct{})
go func() {
result := doWork() // new goroutine per request
sendResponse(w, result)
close(done)
}()
select {
case <-done:
case <-time.After(5 * time.Second):
// The worker is not cancelled: it keeps running and may still write to w (a data race)
http.Error(w, "timeout", 504)
}
}
// Better: use context timeout (no extra goroutine needed)
func handleRequest(w http.ResponseWriter, r *http.Request) {
ctx, cancel := context.WithTimeout(r.Context(), 5*time.Second)
defer cancel()
result, err := doWork(ctx) // execute in current goroutine
if err != nil {
if errors.Is(ctx.Err(), context.DeadlineExceeded) {
http.Error(w, "timeout", http.StatusGatewayTimeout)
return
}
http.Error(w, err.Error(), http.StatusInternalServerError)
return
}
sendResponse(w, result)
}
Practice 4: Monitor goroutine trends, not absolute counts
// Don't just look at absolute numbers — look at trends
// Goroutine count slowly but continuously growing = leak
// Goroutine count fluctuating with load but bounded = normal
// Grafana alert rule pseudocode:
// IF deriv(go_goroutines[1h]) > 10
// AND go_goroutines > 5000
// THEN alert("possible goroutine leak")
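The "trend, not absolute count" rule can also be checked in-process. The sketch below fits a least-squares line through equally spaced samples of the goroutine count. The function names, thresholds, and sample values are illustrative assumptions, not recommended settings.

```go
package main

import "fmt"

// slope fits a least-squares line through equally spaced samples and returns
// the growth per sampling interval.
func slope(samples []float64) float64 {
	if len(samples) < 2 {
		return 0
	}
	n := float64(len(samples))
	var sx, sy, sxy, sxx float64
	for i, y := range samples {
		x := float64(i)
		sx += x
		sy += y
		sxy += x * y
		sxx += x * x
	}
	return (n*sxy - sx*sy) / (n*sxx - sx*sx)
}

// looksLikeLeak combines both conditions from the rule: sustained growth
// AND a count that is already above a floor.
func looksLikeLeak(samples []float64, minSlope, minCount float64) bool {
	if len(samples) == 0 {
		return false
	}
	return slope(samples) > minSlope && samples[len(samples)-1] > minCount
}

func main() {
	growing := []float64{5000, 5600, 6200, 6800} // made-up samples, one per interval
	bounded := []float64{300, 310, 295, 305}     // fluctuates with load but stays flat
	fmt.Println(slope(growing), looksLikeLeak(growing, 100, 5000)) // 600 true
	fmt.Println(slope(bounded), looksLikeLeak(bounded, 100, 5000)) // 0 false
}
```

In production the samples would come from periodic runtime.NumGoroutine() readings, and the window should be long enough to span normal load cycles.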
4.6 Edge Cases and Traps
Trap 1: "Pseudo-deadlock" with GOMAXPROCS=1
runtime.GOMAXPROCS(1)
ch := make(chan int)
go func() {
ch <- 1
}()
// If the current goroutine does heavy computation here
// and Go version < 1.14 (no async preemption)
// the new goroutine never gets a chance to run
// ch <- 1 never happens
// program hangs (not a deadlock — it's starvation)
for {
// tight loop without function calls
}
<-ch // never reached
Trap 2: CGO calls pin M
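// During a CGO call the calling M is occupied until the C function returns
// (the P can be handed off, but that thread cannot run other goroutines)
// If many goroutines simultaneously make blocking CGO calls
// M count can explode to the default 10,000-thread limit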
/*
#include <unistd.h>
void slow_c_func() {
sleep(10); // blocks for 10 seconds
}
*/
import "C"
func main() {
for i := 0; i < 10001; i++ {
go func() {
C.slow_c_func() // each goroutine pins an M
}()
}
// Exceeding the thread limit aborts the program with "thread exhaustion"
}
// Solution: use a worker pool to limit concurrent CGO calls
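One way to bound the number of threads tied up in blocking calls is a counting semaphore around the call site. The sketch below uses a sleep as a hypothetical stand-in for the C function so that it runs without cgo. `runLimited` also records the peak concurrency to show that the bound holds.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// blockingCall is a hypothetical stand-in for a slow CGO call.
func blockingCall() { time.Sleep(2 * time.Millisecond) }

// runLimited issues total calls but allows at most limit in flight at once.
// It returns the highest concurrency that was observed.
func runLimited(total, limit int) int64 {
	sem := make(chan struct{}, limit)
	var inFlight, peak atomic.Int64
	var wg sync.WaitGroup
	for i := 0; i < total; i++ {
		sem <- struct{}{} // blocks while limit calls are already running
		wg.Add(1)
		go func() {
			defer wg.Done()
			defer func() { <-sem }() // released after inFlight is decremented
			cur := inFlight.Add(1)
			for {
				old := peak.Load()
				if cur <= old || peak.CompareAndSwap(old, cur) {
					break
				}
			}
			blockingCall()
			inFlight.Add(-1)
		}()
	}
	wg.Wait()
	return peak.Load()
}

func main() {
	fmt.Println("peak concurrent calls:", runLimited(200, 8))
}
```

With the limit in place, at most that many Ms are blocked in C code at any moment, however many goroutines are queued. If a program legitimately needs more threads, the ceiling itself can be raised with runtime/debug.SetMaxThreads.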
Trap 3: GC STW impact on latency
// GC's STW (Stop The World) phase pauses all goroutines
// In Go 1.18+ STW is typically only tens of microseconds
// But with many goroutines, the STW "barrier" effect increases latency
// Observe GC STW duration:
// GODEBUG=gctrace=1 ./myapp
// gc 1 @0.012s 2%: 0.044+1.2+0.033 ms clock, 0.35+0.8/1.1/0+0.26 ms cpu ...
// The three clock values are the phases in order:
// 0.044 ms STW sweep termination, 1.2 ms concurrent mark, 0.033 ms STW mark termination
// Reducing GC pressure = fewer STWs = lower scheduling latency
Trap 4: runtime.LockOSThread inheritance behavior
func init() {
// Calling LockOSThread in init pins the main goroutine to the main OS thread,
// so main() also runs on that thread
runtime.LockOSThread()
// Note: if a locked goroutine exits without UnlockOSThread,
// its thread is terminated rather than returned to the idle pool
}
// Also, LockOSThread is nestable
runtime.LockOSThread()
runtime.LockOSThread() // count +1
runtime.UnlockOSThread() // count -1, still locked
runtime.UnlockOSThread() // count reaches zero, unlocked
Chapter Summary
The GMP scheduler is one of Go's most critical runtime components. Its design elegantly balances multiple objectives:
- Efficient creation: goroutine creation is a pure userspace operation costing ~0.3μs
- Fair scheduling: work stealing ensures load balance; signal preemption prevents starvation
- Syscalls don't block scheduling: P/M unbinding (handoff) ensures CPUs stay utilized
- Scalability: local lock-free queues eliminate global lock bottlenecks
Understanding the GMP model isn't just for interviews — it helps you:
- Diagnose goroutine leaks and scheduling latency issues
- Configure GOMAXPROCS correctly (especially in container environments)
- Understand why certain operations (channel, mutex, I/O) trigger scheduling
- Write concurrent code that is friendly to the scheduler
Recommended Further Reading:
- Dmitry Vyukov, "Scalable Go Scheduler Design Doc" (2012) — the original GMP model design document
- Blumofe & Leiserson, "Scheduling Multithreaded Computations by Work Stealing" (JACM 1999) — work stealing theoretical foundation
- Austin Clements, "Proposal: Non-cooperative goroutine preemption" (Go proposal #24543, 2018) — async preemption proposal
- Go source code runtime/proc.go: complete scheduler implementation (~6000 lines)
- go tool trace documentation: the best tool for visualizing scheduling behavior