Chapter 14

The GMP Scheduler: How Goroutines Run

The moment you write go func(){}(), the Go runtime faces a fundamental question: who executes this function? On which CPU core? When? This is not a simple thread creation — Go must efficiently schedule tens of thousands of goroutines onto a limited number of operating system threads. The mechanism that solves this problem is the GMP scheduler.

The GMP scheduler is the foundation of Go's concurrency capabilities. Understanding it reveals why goroutines are lighter than threads, why Go can effortlessly handle millions of concurrent tasks, why some goroutines can "starve," and why setting GOMAXPROCS correctly matters so much.

This chapter starts from intuition and progressively descends into source-level implementation and design decisions, building a complete mental model of the scheduler.

Level 1: What You Need to Know

1.1 Why Goroutines Are Lighter Than Threads

Every Go beginner hears the mantra: "goroutines are lightweight." But lightweight in what way exactly? Let's speak in concrete numbers.

Stack Size Comparison:

Dimension	OS Thread	Goroutine
Initial stack size	1-8 MB (typically 8 MB on Linux)	2 KB (Go 1.4+)
Stack growth	Fixed size, allocated at creation	Dynamic, grows on demand up to 1 GB
Creation cost	~1-10 μs (involves syscall)	~0.3 μs (pure userspace)
Context switch	~1-10 μs (kernel trap)	~0.2 μs (userspace switch)
Memory for 10,000	~80 GB of reserved address space at 8 MB each (physical pages are committed lazily)	~20 MB (trivial)

Three key technical differences drive these numbers:

First, goroutines use dynamically growing stacks. An OS thread's stack size is fixed at creation because the kernel cannot safely relocate stack frames at runtime. A goroutine's stack is managed by the Go runtime — starting at just 2 KB. When function call depth increases and the runtime detects insufficient stack space, it allocates a new stack twice the size, copies the old stack contents over, and adjusts all pointers to the old stack. This is the "stack copying" mechanism introduced in Go 1.3, replacing the earlier "segmented stack" approach.

// Estimating per-goroutine memory cost (the figure includes the g struct
// and allocator overhead, not just the 2 KB stack)
package main

import (
    "fmt"
    "runtime"
)

func main() {
    const n = 100000
    
    var m runtime.MemStats
    runtime.ReadMemStats(&m)
    before := m.Sys
    
    for i := 0; i < n; i++ {
        go func() {
            select {} // block to keep goroutine alive
        }()
    }
    
    runtime.ReadMemStats(&m)
    after := m.Sys
    
    fmt.Printf("Created %d goroutines\n", n)
    fmt.Printf("Memory growth: %.2f MB\n", float64(after-before)/1024/1024)
    fmt.Printf("Per goroutine: ~%.2f KB\n", float64(after-before)/float64(n)/1024)
}

Second, goroutine switches don't require kernel traps. An OS thread context switch transitions from user mode to kernel mode (via syscall or interrupt), saves/restores the full CPU register set (including floating-point, SSE/AVX registers), and, when the switch crosses address spaces, may flush TLB entries. A goroutine switch at a cooperative scheduling point happens entirely in userspace and saves only a handful of values (SP, PC, the frame pointer, and a few bookkeeping words in the goroutine's gobuf), a few dozen bytes of state in total.

Third, goroutine creation requires no system call. Creating an OS thread requires the clone() syscall (Linux) or CreateThread() (Windows), meaning a kernel trap, allocation of kernel data structures, TLS setup, and more. Creating a goroutine only needs to grab a g struct from the free pool (or malloc one) in userspace, set up its stack and entry function, and place it on a run queue — no syscall anywhere in the path.

1.2 Intuitive Understanding of G/M/P

The GMP model can be understood through a factory analogy:

┌──────────────────────────────────────────────────────────┐
│                    Factory (Go Process)                    │
│                                                          │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐              │
│  │ Station P0│  │ Station P1│  │ Station P2│   ...       │
│  │          │  │          │  │          │              │
│  │ Local Queue│ │ Local Queue│ │ Local Queue│              │
│  │ [G][G][G]│  │ [G][G]   │  │ [G]      │              │
│  │          │  │          │  │          │              │
│  │ Worker M0 │  │ Worker M1 │  │ Worker M2 │              │
│  │ (active)  │  │ (active)  │  │ (active)  │              │
│  └──────────┘  └──────────┘  └──────────┘              │
│                                                          │
│  Global Task Queue: [G][G][G][G][G]...                   │
│                                                          │
│  Break Room (idle workers): M3, M4, M5 ...              │
└──────────────────────────────────────────────────────────┘

G (Goroutine) = Work Order. Each G represents a function to be executed, containing the function entry point, arguments, stack space, and current execution state. It is the smallest schedulable unit. Corresponds to runtime.g in the runtime.

M (Machine) = Worker. Each M maps to one OS thread. M is the entity that actually executes code — the CPU only understands threads, not goroutines. M picks a G from P's queue and runs it. Corresponds to runtime.m in the runtime.

P (Processor) = Workstation. P is a logical concept representing "the resources needed to execute Go code." Each P owns a local run queue, a memory cache (mcache), and other scheduling state. An M must hold a P to execute goroutines. The number of Ps is determined by GOMAXPROCS. Corresponds to runtime.p in the runtime.

Why is P necessary? You might ask: since M is the thread and G is the task, why not have M pull G directly from a global queue? The answer is performance. If all Ms pulled from a single global queue, that queue would need a mutex, and under high concurrency the lock contention would severely degrade performance. With P, each M binds to a P and preferentially dequeues from P's local queue (a lock-free operation). Only when the local queue is empty does it go to the global queue or steal from other Ps.

// Inspecting current GMP state
package main

import (
    "fmt"
    "runtime"
)

func main() {
    fmt.Printf("GOMAXPROCS (number of Ps): %d\n", runtime.GOMAXPROCS(0))
    fmt.Printf("NumCPU (CPU cores): %d\n", runtime.NumCPU())
    fmt.Printf("NumGoroutine (active Gs): %d\n", runtime.NumGoroutine())
}

1.3 GOMAXPROCS: Meaning and Configuration

GOMAXPROCS determines the maximum number of threads simultaneously executing Go code — i.e., the number of Ps. Note the keyword "simultaneously" — this is not a limit on the number of goroutines, nor on the number of threads, but a limit on parallelism.

Default value: Since Go 1.5, GOMAXPROCS defaults to the number of CPU cores. Before that, it defaulted to 1, meaning all goroutines could only be concurrent but never parallel.

Setting methods:

// Method 1: Environment variable
// GOMAXPROCS=4 go run main.go

// Method 2: In code
package main

import (
    "fmt"
    "runtime"
)

func main() {
    prev := runtime.GOMAXPROCS(4) // sets the value and returns the previous one
    fmt.Println("previous:", prev)

    // Querying: passing 0 reads the current value without modifying it
    current := runtime.GOMAXPROCS(0)
    fmt.Println("current:", current)
}

Common misconceptions:

Misconception: Higher GOMAXPROCS is always better. For CPU-bound tasks, setting it to the number of CPU cores is optimal. Exceeding core count only adds scheduling overhead and cache misses. For I/O-bound tasks, you can increase it slightly, but the default usually suffices.
Misconception: GOMAXPROCS limits thread count. It doesn't. The number of Ms (OS threads) can far exceed the number of Ps. When goroutines block on syscalls, the runtime creates new Ms to keep Ps busy. The default maximum thread count is 10,000 (adjustable via runtime/debug.SetMaxThreads).
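Misconception: GOMAXPROCS auto-adapts in containers. Before Go 1.25, the runtime sized GOMAXPROCS from the CPUs visible to the process and ignored the container's CPU quota. In Kubernetes, a Pod limited to 2 cores running on a 64-core host would get GOMAXPROCS=64, so up to 64 threads compete for 2 cores' worth of quota, causing throttling and severe scheduling overhead. On those versions the usual solution is the uber-go/automaxprocs library; Go 1.25 and later take the cgroup CPU limit into account by default.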

// Recommended approach for containerized environments
import _ "go.uber.org/automaxprocs" // auto-sets based on CFS quota

func main() {
    // GOMAXPROCS is now automatically set to container CPU limit
}
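To make the quota-to-GOMAXPROCS step concrete, here is a rough sketch of the kind of calculation such a library has to perform. It is not the automaxprocs implementation: parseCPUMax and procsForQuota are hypothetical helpers, the path /sys/fs/cgroup/cpu.max assumes a cgroup v2 mount at the conventional location, and rounding the quota down with a minimum of 1 is a choice made for this example.

```go
package main

import (
	"fmt"
	"math"
	"os"
	"strconv"
	"strings"
)

// parseCPUMax parses the contents of a cgroup v2 cpu.max file, which holds
// "<quota> <period>" in microseconds, or "max <period>" when unlimited.
// It returns the quota expressed in CPUs and whether a limit is set.
func parseCPUMax(contents string) (cpus float64, limited bool, err error) {
	fields := strings.Fields(contents)
	if len(fields) != 2 {
		return 0, false, fmt.Errorf("unexpected cpu.max format: %q", contents)
	}
	if fields[0] == "max" {
		return 0, false, nil
	}
	quota, err := strconv.ParseFloat(fields[0], 64)
	if err != nil {
		return 0, false, err
	}
	period, err := strconv.ParseFloat(fields[1], 64)
	if err != nil {
		return 0, false, err
	}
	if period <= 0 {
		return 0, false, fmt.Errorf("invalid period in cpu.max: %q", contents)
	}
	return quota / period, true, nil
}

// procsForQuota rounds a CPU quota down to a whole number of Ps, minimum 1.
func procsForQuota(cpus float64) int {
	return int(math.Max(1, math.Floor(cpus)))
}

func main() {
	data, err := os.ReadFile("/sys/fs/cgroup/cpu.max") // assumed cgroup v2 location
	if err != nil {
		fmt.Println("no cgroup v2 cpu.max found:", err)
		return
	}
	cpus, limited, err := parseCPUMax(string(data))
	switch {
	case err != nil:
		fmt.Println("cannot parse cpu.max:", err)
	case !limited:
		fmt.Println("no CPU limit set")
	default:
		fmt.Printf("quota of %.2f CPUs suggests GOMAXPROCS=%d\n", cpus, procsForQuota(cpus))
	}
}
```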

1.4 Goroutine Lifecycle

A goroutine transitions through these states from creation to termination:

                    ┌──────── preempted / Gosched ────────┐
                    ▼                                     │
Create(_Gidle) ──→ Runnable(_Grunnable) ──→ Running(_Grunning) ──→ Dead(_Gdead)
                    ▲                         │           ▲
                    │                         ▼           ▼
                    └──── Wake ◄──── Waiting(_Gwaiting)  Syscall(_Gsyscall)

State details:

State	Meaning	Typical trigger
`_Gidle`	Just allocated, uninitialized	`runtime.newproc` allocates G struct
`_Grunnable`	Ready, waiting to be scheduled	Creation complete / woken from blocking
`_Grunning`	Executing on some M/P	Selected by scheduler
`_Gwaiting`	Blocked waiting for an event	Channel ops / select / time.Sleep
`_Gsyscall`	Executing a system call	File I/O / network (non-netpoller)
`_Gdead`	Finished or unused	Function return / panic

Creation process:

When you write go f(args), the compiler transforms it into a call to runtime.newproc:

// Conceptual transformation:
// go f(x, y)  →  runtime.newproc(f, x, y)
// (recent compilers first wrap the call in an argument-free closure
// and pass only that closure to newproc)

// Simplified creation flow:
// 1. Get an idle G from current P's gFree list (reuse), or malloc a new one
// 2. Set up G's stack, entry function (fn), and arguments
// 3. Set G's status to _Grunnable
// 4. Put G in current P's runnext slot; the G it displaces moves to the tail of the local run queue
// 5. If there's an idle P and no spinning M, wake one M

Blocking and waking:

Goroutine blocking differs fundamentally from thread blocking. When a goroutine blocks on a channel operation:

G's state becomes _Gwaiting
G is detached from M and placed into the channel's wait queue
M does NOT block — it immediately picks the next G from P's queue
When the other end of the channel is ready, the blocked G is placed back onto some P's run queue

This is the core advantage of goroutines: G blocking does not waste M (thread) resources.

// Observing goroutine state transitions
package main

import (
    "fmt"
    "runtime"
    "time"
)

func main() {
    fmt.Printf("Goroutines at start: %d\n", runtime.NumGoroutine())
    
    ch := make(chan struct{})
    
    go func() {
        // State: _Grunnable → _Grunning
        fmt.Println("goroutine running")
        <-ch // State: _Grunning → _Gwaiting (blocked on channel)
        fmt.Println("goroutine woken")
        // After return: _Grunning → _Gdead
    }()
    
    time.Sleep(100 * time.Millisecond)
    fmt.Printf("Goroutines while blocked: %d\n", runtime.NumGoroutine())
    
    ch <- struct{}{} // Wake: G state _Gwaiting → _Grunnable → _Grunning
    time.Sleep(100 * time.Millisecond)
    fmt.Printf("Goroutines after completion: %d\n", runtime.NumGoroutine())
}

System call scenario:

When a goroutine enters a system call (e.g., file I/O), the situation differs:

G's state becomes _Gsyscall
M is blocked in the kernel (unavoidable — the kernel doesn't provide universal async file I/O)
If the syscall runs long, P is unbound from M (handoff) and bound to another idle M (or a new M is created)
When the syscall returns, M attempts to reacquire its previous P; if that P is taken, M tries to acquire any idle P, and if none is available it puts G into the global queue and goes to sleep

This guarantees that even if goroutines are stuck in long syscalls, scheduling of other goroutines is unaffected.

1.5 Common Errors and Fixes

Error 1: Goroutine leak

// Bug: nobody sends to ch, goroutine blocks forever
func leak() {
    ch := make(chan int)
    go func() {
        val := <-ch // blocks forever, goroutine cannot be GCed
        fmt.Println(val)
    }()
    // Function returns, ch is unreachable, but goroutine is still waiting
}

// Fix 1: Use context to control lifecycle
func noLeak(ctx context.Context) {
    ch := make(chan int)
    go func() {
        select {
        case val := <-ch:
            fmt.Println(val)
        case <-ctx.Done():
            return // exit on timeout or cancellation
        }
    }()
}

// Fix 2 (for the mirror-image leak, a sender nobody receives from): use a buffered channel
func noLeak2() {
    ch := make(chan int, 1) // sender won't block even if nobody reads
    go func() {
        ch <- 42
    }()
}

Error 2: Ignoring GOMAXPROCS impact on CPU-bound tasks

// Running CPU-intensive computation on a 4-core machine
// If GOMAXPROCS=1, these goroutines are concurrent but NOT parallel
func compute() {
    var wg sync.WaitGroup
    for i := 0; i < 4; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            sum := 0
            for j := 0; j < 1_000_000_000; j++ {
                sum += j
            }
        }()
    }
    wg.Wait()
}
// Illustrative timings; absolute numbers depend on the hardware:
// GOMAXPROCS=1: ~12s
// GOMAXPROCS=4: ~3s (near-linear speedup when 4 cores are free)

Level 2: How It Works Under the Hood

2.1 The Complete Scheduling Workflow

To understand the full workflow, we need to examine runtime.schedule(). This is the scheduler's core loop — every M enters this function when it needs a new G to execute.

schedule() execution flow:
┌────────────────────────────────────────────────────────┐
│  schedule()                                            │
│  │                                                    │
│  ├─ 1. If current G is locked to M, handle LockOSThread│
│  │                                                    │
│  ├─ 2. Find a runnable G (findRunnable)               │
│  │     ├─ Check local run queue                       │
│  │     ├─ Check global run queue                      │
│  │     ├─ Check netpoller                             │
│  │     ├─ Try work stealing                           │
│  │     └─ Nothing found? Block and wait               │
│  │                                                    │
│  ├─ 3. execute(gp) — switch to target G's context     │
│  │                                                    │
│  └─ 4. G completes / blocks / preempted → back to     │
│        schedule()                                      │
└────────────────────────────────────────────────────────┘

The specific lookup order in findRunnable (from runtime/proc.go):

// Simplified findRunnable logic (Go 1.22)
func findRunnable() (gp *g, inheritTime, tryWakeP bool) {
    pp := getg().m.p.ptr()
    
    // 1. Every 61st schedule tick, check global queue (prevent starvation)
    if pp.schedtick%61 == 0 && sched.runqsize > 0 {
        lock(&sched.lock)
        gp := globrunqget(pp, 1) // grab 1 from global queue
        unlock(&sched.lock)
        if gp != nil {
            return gp, false, false
        }
    }
    
    // 2. Get from local run queue
    if gp, inheritTime := runqget(pp); gp != nil {
        return gp, inheritTime, false
    }
    
    // 3. Get from global run queue
    if sched.runqsize != 0 {
        lock(&sched.lock)
        gp := globrunqget(pp, 0)
        unlock(&sched.lock)
        if gp != nil {
            return gp, false, false
        }
    }
    
    // 4. Get ready network I/O goroutines from netpoller
    if netpollinited() && netpollAnyWaiters() && sched.lastpoll.Load() != 0 {
        if list, _ := netpoll(0); !list.empty() { // non-blocking poll
            gp := list.pop()
            injectglist(&list) // put remainder in local/global queue
            return gp, false, false
        }
    }
    
    // 5. Try stealing from other Ps (Work Stealing)
    for i := 0; i < 4; i++ { // up to 4 rounds
        stealRunNextG := i == 3 // only the last round may take a victim's runnext
        for enum := stealOrder.start(fastrand()); !enum.done(); enum.next() {
            p2 := allp[enum.position()]
            if gp := runqsteal(pp, p2, stealRunNextG); gp != nil {
                return gp, false, false
            }
        }
    }
    
    // 6. Nothing available: release P and park this M in the idle list.
    //    When the M is woken again, the real function retries from the top.
    stopm()
}

When does the scheduler run? Not all code runs uninterrupted. The Go scheduler gets execution opportunities at these points:

Goroutine voluntarily yields: channel ops, mutex locks, time.Sleep, runtime.Gosched()
Stack check on function calls: the compiler inserts morestack checks at function entries, which also serve as preemption checkpoints
Before/after system calls: entering/exiting syscalls gives the scheduler a chance to adjust M/P bindings
Asynchronous preemption signal: Go 1.14+ uses signals to forcibly interrupt long-running Gs

2.2 Work Stealing: When P's Local Queue Is Empty

Work Stealing is the core mechanism by which the GMP scheduler maintains load balance. When a P's local queue is empty, it doesn't sit idle — it steals work from other Ps.

Work Stealing illustration:
                                                      
P0's perspective:                                     
                                                      
┌──────────┐     Local queue empty!     ┌──────────┐  
│    P0    │  ── pick random P ────→    │    P2    │  
│ queue:[ ]│                            │queue:[G G G G]│
│          │  ◄── steal half (2 Gs) ──  │          │  
│ queue:[G G]│                           │ queue:[G G] │
└──────────┘                            └──────────┘

Work Stealing rules:

How much to steal? Half of the target P's local queue. If the target has 6 Gs, steal 3. This ensures both sides have work to do.
From whom? A random starting P is chosen, then all Ps are iterated. The random start prevents all idle Ps from targeting the same victim.
Which end of the queue? P's local run queue is a lock-free ring buffer (size 256). The owner enqueues at the tail and dequeues from the head. Stealers also take from the head: runqgrab copies the oldest half of the victim's Gs and then advances runqhead with a CAS, so only the owner ever writes runqtail. The Gs that migrate are the ones that have waited longest, while the most recently enqueued Gs, whose data is most likely still warm in the victim's cache, stay where they are.
What else can be stolen? Besides queue Gs, the target's runnext can also be stolen — this is a special "about to run" G pointer, controlled by the stealRunNextG flag.

// Exercising work stealing: all goroutines are created from one P
package main

import (
    "fmt"
    "runtime"
    "sync"
    "time"
)

func main() {
    runtime.GOMAXPROCS(4)
    
    var wg sync.WaitGroup
    start := time.Now()
    
    // Every goroutine is enqueued by the P running main; the other Ps
    // only get work by stealing it (or from the global queue on overflow)
    for i := 0; i < 1000; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            sum := 0
            for j := 0; j < 1_000_000; j++ {
                sum += j
            }
        }()
    }
    
    wg.Wait()
    // Compare the elapsed time with a run under GOMAXPROCS(1) to see the effect
    fmt.Printf("Finished on %d Ps in %v\n", runtime.GOMAXPROCS(0), time.Since(start))
}

2.3 M/P Separation During System Calls (Handoff)

When a goroutine executes a system call, the entire M (OS thread) is blocked by the kernel. Without intervention, the P bound to that M would also sit idle, wasting a parallelism slot.

The handoff mechanism:

Before syscall:        During syscall:        After syscall return:
                                         
M0 ─── P0             M0 (blocked)          M0 ─── P0 (if P0 free)
 │                         ↓                 or
 G1 (syscall)          P0 unbound            M0 → puts G1 in global queue
                         ↓                        M0 goes idle
                    M2 ─── P0              
                     │                     
                     G3 (running)

Detailed steps:

Before entering syscall: runtime.entersyscall() is called. G's state → _Gsyscall, P's state → _Psyscall.
sysmon detection: The sysmon thread periodically checks all Ps in _Psyscall state. If a syscall has lasted more than 20μs (or one sysmon tick), it performs handoff — unbinding P from M and giving it to another idle M or creating a new M.
Syscall return: M tries to reacquire its previous P. If P is occupied by another M, M tries to acquire any idle P. If no idle P is available, M puts G into the global run queue and goes to sleep.

// Observing thread growth from blocking system calls
// (Linux only: syscall.Nanosleep is used as a syscall that really blocks)
package main

import (
    "fmt"
    "runtime"
    "runtime/pprof"
    "sync"
    "syscall"
    "time"
)

func main() {
    runtime.GOMAXPROCS(2) // only 2 Ps
    
    // The threadcreate profile counts the OS threads the runtime has created
    threads := pprof.Lookup("threadcreate")
    fmt.Printf("Initial: GOMAXPROCS=%d, threads=%d\n", runtime.GOMAXPROCS(0), threads.Count())
    
    var wg sync.WaitGroup
    for i := 0; i < 10; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            ts := syscall.Timespec{Sec: 1}
            syscall.Nanosleep(&ts, nil) // M stays blocked in the kernel for 1s
        }()
    }
    
    time.Sleep(100 * time.Millisecond)
    // More than 2 Ms should exist now (syscalls block the original Ms)
    fmt.Printf("During syscalls: goroutines=%d, threads=%d\n", runtime.NumGoroutine(), threads.Count())
    
    wg.Wait()
}

Special handling of network I/O — the netpoller:

Go applies a special optimization for network I/O using epoll (Linux) / kqueue (macOS) / IOCP (Windows) for non-blocking I/O. When a goroutine performs network read/write:

The underlying fd is set to non-blocking mode
If I/O isn't ready, G is parked on the netpoller's wait list (state _Gwaiting)
M is NOT blocked — it immediately executes other Gs
The netpoller is checked during findRunnable, and ready Gs are placed back on run queues

This is why Go network servers can handle massive concurrent connections with few threads — goroutines blocking on network I/O consume zero OS thread resources.

2.4 Preemptive Scheduling

Cooperative preemption (Go 1.13 and earlier):

Before Go 1.14, the scheduler relied on "cooperative preemption." The compiler inserts stack growth checks (morestack) at function entries. The scheduler signals a preemption request by setting G's stackguard0 field to a special sentinel value. The next time G calls a function, the stack check fires, discovers the preemption mark, and voluntarily yields the CPU.

The problem: If a goroutine executes a tight loop without function calls, it never checks the preemption mark, and other goroutines starve:

// Before Go 1.14, this goroutine is never preempted
go func() {
    for {
        // Pure computation, no function calls
        // No preemption check point
        x++
    }
}()
// Other goroutines may starve

Signal-based asynchronous preemption (Go 1.14+):

Go 1.14 introduced signal-based asynchronous preemption (proposal #24543, by Austin Clements):

sysmon thread detects that a G has been running for more than 10ms
sysmon sends SIGURG to the target M (SIGURG was chosen because it doesn't interfere with debuggers or standard signal handlers)
M's signal handler sighandler receives the signal
The signal handler checks if execution is at a safe point; if so, it modifies G's PC register to point to asyncPreempt
After the signal handler returns, G actually jumps to asyncPreempt, saves all register state, then calls schedule() to yield

Async preemption flow:

sysmon detects: G running > 10ms
        │
        ▼
Send SIGURG to M
        │
        ▼
M's signal handler takes over
        │
        ▼
At safe point?
    ├── No → skip, try again later
    └── Yes → set G.pc = asyncPreempt
                │
                ▼
        signal return → execute asyncPreempt
                │
                ▼
        save all registers → gopreempt_m() → schedule()

The concept of safe points:

Not every moment is safe for preempting a goroutine. For example, if G is executing within a non-preemptible region of the runtime (such as certain GC marking operations), forced preemption could lead to inconsistent state. The Go runtime checks these conditions to determine if it's at a safe point:

Not in _Gsyscall state
Not holding runtime-internal locks
Stack frame information is available (can generate correct stack maps for GC)

2.5 The sysmon Monitor Thread

sysmon is a special daemon thread in the Go runtime — it doesn't bind to any P, runs independently, and serves as the runtime's "watchdog."

// sysmon's main responsibilities (simplified from runtime/proc.go)
func sysmon() {
    idle := 0 // consecutive cycles in which retake found nothing to do
    delay := uint32(0)
    
    for {
        // Adaptive sleep: 20μs when busy; after 50 idle cycles the delay
        // doubles on every cycle, capped at 10ms
        if idle == 0 {
            delay = 20 // 20μs
        } else if idle > 50 {
            delay *= 2
        }
        if delay > 10*1000 {
            delay = 10 * 1000 // 10ms
        }
        usleep(delay)
        now := nanotime()
        
        // 1. Network polling: if >10ms since last netpoll, do a non-blocking poll
        lastpoll := sched.lastpoll.Load()
        if netpollinited() && lastpoll != 0 && lastpoll+10*1000*1000 < now {
            sched.lastpoll.Store(now)
            list, _ := netpoll(0) // non-blocking
            if !list.empty() {
                injectglist(&list) // put ready Gs in global queue
            }
        }
        
        // 2. Preempt long-running Gs and retake Ps stuck in syscalls
        if retake(now) != 0 {
            idle = 0
        } else {
            idle++
        }
        
        // 3. Force GC: if >2 minutes without GC, force trigger
        if t := (gcTrigger{kind: gcTriggerTime, now: now}); t.test() {
            var list gList
            list.push(forcegc.g)
            injectglist(&list)
        }
        
        // 4. Scavenge: return long-unused heap memory to OS
    }
}

sysmon's retake function — preemption and handoff logic:

func retake(now int64) uint32 {
    n := 0
    for i := 0; i < len(allp); i++ {
        pp := allp[i]
        pd := &pp.sysmontick
        s := pp.status
        
        if s == _Prunning || s == _Psyscall {
            // For running Ps: if G has run > forcePreemptNS (10ms)
            // set preemption flag
            t := int64(pp.schedtick)
            if pd.schedtick != t {
                pd.schedtick = t
                pd.schedwhen = now
            } else if pd.schedwhen+forcePreemptNS <= now {
                preemptone(pp) // send preemption signal
                n++
            }
        }
        
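        if s == _Psyscall {
            // Only consider Ps that have stayed in the same syscall for at
            // least one full sysmon tick (>= 20μs)
            t := int64(pp.syscalltick)
            if int64(pd.syscalltick) != t {
                pd.syscalltick = uint32(t)
                pd.syscallwhen = now
                continue
            }
            // Leave P alone only if all three hold: its local queue is empty,
            // another M is spinning or another P is idle, and the syscall
            // has lasted less than 10ms
            if runqempty(pp) && sched.nmspinning.Load()+sched.npidle.Load() > 0 &&
                pd.syscallwhen+10*1000*1000 > now {
                continue
            }
            handoffp(pp) // unbind P from blocked M
            n++
        }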
    }
    return uint32(n)
}

sysmon's running frequency:

sysmon does not run at a fixed frequency. It uses an adaptive sleep strategy:

Initially checks every 20μs
After 50 consecutive cycles in which nothing needed attention, the interval doubles on every cycle
Maximum interval is 10ms
As soon as an event is detected, the interval drops immediately

This balances responsiveness and CPU overhead — checking frequently when the system is busy, reducing overhead when idle.
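The back-off schedule is small enough to tabulate. The sketch below isolates the delay rule in a standalone function; nextDelay is a name chosen for this example, and the constants (20μs floor, doubling after 50 idle cycles, 10ms cap) follow the runtime's sysmon loop.

```go
package main

import "fmt"

// nextDelay returns sysmon's next sleep in microseconds, given the number
// of consecutive idle cycles seen so far and the previous delay.
func nextDelay(idle int, prev uint32) uint32 {
	delay := prev
	if idle == 0 {
		delay = 20 // something happened last cycle: back to the minimum
	} else if idle > 50 {
		delay *= 2 // long quiet period: back off exponentially
	}
	if delay > 10*1000 {
		delay = 10 * 1000 // never sleep longer than 10ms
	}
	return delay
}

func main() {
	delay := uint32(0)
	for idle := 0; idle <= 60; idle++ {
		delay = nextDelay(idle, delay)
		if idle == 0 || idle >= 50 {
			fmt.Printf("idle=%2d delay=%5dμs\n", idle, delay)
		}
	}
}
```

Starting from 20μs, the delay stays flat for the first 50 idle cycles, then doubles from 40μs at idle=51 until it reaches the 10ms cap at idle=59.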

2.6 Locality Optimizations in the Scheduler

The Go scheduler includes several locality optimizations to reduce cache misses and improve performance:

runnext optimization: Each P has a runnext field pointing to "the next G that should run." When a G creates a new G (go func()), the new G isn't placed at the queue tail but set as the current P's runnext. The new G therefore runs next on this P, as soon as the current G blocks, yields, or is preempted, and it inherits the remainder of the current time slice. This exploits producer-consumer locality (the data produced is still in cache when the consumer runs).

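Waker locality: When a G is woken from blocking (for example by a channel send), the runtime places it in the runnext slot of the P that performed the wakeup, not on the P where it last ran. The woken G then runs right after the waker, while the data the waker just produced is still warm in that core's cache.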

Lock-free local queue: P's local run queue is a 256-element ring array implementing a lock-free single-producer multi-consumer queue (only the owning M pushes, but other Ps can steal).

Level 3: What the Specification Defines

3.1 Evolution of the GMP Model: From GM to GMP

Go 1.0's GM model:

Go's original scheduler (Go 1.0) was very simple: only G and M, no P concept.

Go 1.0 scheduler:
                                 
┌──────────────────────────────┐
│      Global Run Queue        │
│    [G] [G] [G] [G] [G]      │
│       (mutex protected)       │
└──────────┬───────────────────┘
           │
    ┌──────┼──────┐
    ▼      ▼      ▼
   M0     M1     M2
   │       │      │
   G       G      G

This model had severe problems:

Global queue lock contention: Every M contested the global queue's mutex each time it needed a new G. Under high concurrency, this lock became a severe bottleneck.
G migration destroying locality: When M0 creates a new G, that G enters the global queue and might be picked up by M1. If the new G needs to access data M0 just processed, cache locality is destroyed.
Frequent M blocking and waking: Every M blocked by a syscall must contest the global lock again upon waking.
Memory allocator contention: Go's memory allocator (mcache) was bound to M. Since M count inflates due to syscalls, many Ms holding their own mcache caused memory waste.

Dmitry Vyukov's GMP redesign (Go 1.1, 2013):

In May 2012, Google's Dmitry Vyukov submitted his landmark design document "Scalable Go Scheduler Design Doc" (https://docs.google.com/document/d/1TTj4T2JO42uD5ID9e89oa0sLKhJYD0Y_kqxDv3I3XMw). This document analyzed the four core deficiencies of the GM model and proposed introducing P as the solution.

What P solves:

GM Model Problem	How P Solves It
Global queue lock contention	Each P has a lock-free local queue; most operations need no global lock
Poor cache locality	G preferentially runs on the P that created it
M inflation wastes mcache	mcache bound to P (fixed count), not M
Thread blocking wastes parallelism slot	P unbinds from blocked M, immediately assigned to another M

Key numeric choices in the design:

P's local queue size is 256 (power of 2 for efficient modulo; large enough to reduce global queue interactions)
Preemption threshold is 10ms (balances latency and throughput — too short means frequent switching, too long means other Gs wait too long)
Work stealing takes half of the victim's queue (a steal-half refinement; the original Blumofe-Leiserson scheduler steals one task at a time)
Global queue is checked every 61 schedule ticks (61 is prime, avoiding resonance with fixed patterns in programs)

3.2 Why P Was Needed — Eliminating Global Lock Contention

Let's quantify P's value with performance data. Dmitry Vyukov provided these benchmark results in his design document (on an 8-core machine):

benchmark                    old ns/op    new ns/op    speedup
BenchmarkCreateGoroutine     2080         480          4.3x
BenchmarkCreateGoroutineIdle 1010         66           15.3x
BenchmarkOsYield             7700         5700         1.4x
BenchmarkPing                46000        27000        1.7x

Goroutine creation speed improved 4-15x, primarily from eliminating global queue lock contention.

Lock-free local queue implementation details:

P's local run queue uses a classic single-producer multi-consumer lock-free ring buffer:

type p struct {
    // ...
    runqhead uint32        // atomic read/write
    runqtail uint32        // only owner writes
    runq     [256]guintptr // ring buffer
    runnext  guintptr      // atomic operations
}

// Enqueue (only M owning this P calls this)
func runqput(pp *p, gp *g, next bool) {
    if next {
    retryNext:
        // Set as runnext (atomic CAS)
        oldnext := pp.runnext
        if !pp.runnext.cas(oldnext, guintptr(unsafe.Pointer(gp))) {
            goto retryNext // runnext changed underneath us (a stealer took it); try again
        }
        if oldnext == 0 {
            return
        }
        gp = oldnext.ptr() // old runnext goes into queue
    }
    
    h := atomic.LoadAcq(&pp.runqhead)
    t := pp.runqtail
    if t-h < uint32(len(pp.runq)) {
        pp.runq[t%uint32(len(pp.runq))].set(gp)
        atomic.StoreRel(&pp.runqtail, t+1) // ensure G is visible to stealers
        return
    }
    // Local queue full — move half to global queue
    runqputslow(pp, gp, h, t)
}

Key points:

runqtail is only written by the owner — no CAS needed
runqhead requires atomic operations because stealers modify it
atomic.LoadAcq and atomic.StoreRel ensure memory ordering
When local queue is full (256 entries), half the Gs move to the global queue — this is the "overflow" mechanism

3.3 Theoretical Foundation of Work Stealing

Work Stealing's theoretical foundation comes from Robert D. Blumofe and Charles E. Leiserson's 1999 paper "Scheduling Multithreaded Computations by Work Stealing" (Journal of the ACM, Vol. 46, No. 5).

Core theorem: For a parallel computation with total work T₁ and critical path length T∞, a work-stealing scheduler using P processors achieves expected execution time:

E[Tp] ≤ T₁/P + O(T∞)

Where:

T₁ is the serial execution time (sum of all tasks)
T∞ is the critical path length (longest dependency chain)
P is the number of processors

What does this bound mean? It shows that work stealing is near-optimal. Execution time consists of two parts: the ideal parallel share of the work (T₁/P) plus a term proportional to the unavoidable serial dependencies (T∞). No scheduler can finish faster than max(T₁/P, T∞), and since T₁/P + T∞ is at most twice that maximum, the work-stealing bound is within a constant factor of the best possible.
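
To make the bound concrete, here is a small sketch that evaluates both expressions for given numbers. The function names are hypothetical, and the parameter c stands in for the constant hidden inside O(T∞); choosing c = 1 below is purely an illustrative assumption.

```go
package main

// lowerBound returns max(T1/P, Tinf), the floor for any scheduler on p processors.
func lowerBound(t1, tinf float64, p int) float64 {
    perProc := t1 / float64(p)
    if perProc > tinf {
        return perProc
    }
    return tinf
}

// stealBound returns T1/P + c*Tinf, where c stands in for the constant in O(Tinf).
func stealBound(t1, tinf float64, p int, c float64) float64 {
    return t1/float64(p) + c*tinf
}
```

With T₁ = 1000 ms, T∞ = 10 ms and P = 8, lowerBound gives 125 ms and stealBound (c = 1) gives 135 ms, so under that assumption the expected time is within 8% of the floor.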

Work Stealing vs Work Sharing comparison:

Property	Work Stealing	Work Sharing
When tasks migrate	Idle processor actively steals	Creator actively distributes
Communication overhead	Only communicates when idle	Communicates on every G creation
Cache locality	Good (G tends to run on creator)	Poor (G immediately distributed remotely)
Load balance latency	May have brief imbalance	More immediate balance
Best for	Frequent task creation, uneven durations	Infrequent creation, uniform durations

Go chose work stealing because goroutine creation is extremely frequent (potentially millions per second). If every creation involved distribution, communication overhead would overwhelm the system.

Another key conclusion from Blumofe-Leiserson — steal attempt count:

The theory proves that the expected total number of steal attempts throughout a computation is O(P · T∞). This means steal operations (which access remote P queues and incur communication cost) scale linearly with processor count but are independent of total work. For Go programs, if dependency chains between goroutines are short (small T∞), steals are infrequent, and most of the time each P executes locally.

3.4 Comparison with Other Runtime Schedulers

Erlang BEAM scheduler:

Erlang's BEAM virtual machine uses a model similar to Go's but predates it:

One scheduler thread per CPU core (equivalent to P)
Each scheduler has its own run queue
Uses a hybrid of work stealing and work sharing
Key difference from Go: Erlang processes share absolutely no memory — message passing is the sole communication mechanism, so the scheduler can migrate processes without considering cache coherence costs

Erlang BEAM:                        Go GMP:
                                    
Scheduler 1 ─── RunQueue            P0 ─── LocalRunQueue
Scheduler 2 ─── RunQueue            P1 ─── LocalRunQueue
Scheduler 3 ─── RunQueue            P2 ─── LocalRunQueue
                                    GlobalRunQueue
Migration queues                    Work Stealing
Reduction/compaction on low load    Spinning M waiting

Erlang's distinctive approach is "reduction counting" for preemption: each process runs for a budget of approximately 4000 reductions (roughly corresponding to function calls and BIF invocations) before it is switched out. This is more deterministic than Go's 10ms time slice because it measures work performed rather than elapsed wall-clock time.

Java Virtual Threads (Project Loom, JDK 19+):

Java's Virtual Threads were directly inspired by Go goroutines, adopting a very similar M:N scheduling model:

Platform Thread ≈ M (OS thread)
Virtual Thread ≈ G (userspace fiber)
ForkJoinPool ≈ Collection of Ps

// Java Virtual Thread example
try (var executor = Executors.newVirtualThreadPerTaskExecutor()) {
    for (int i = 0; i < 100_000; i++) {
        executor.submit(() -> {
            Thread.sleep(Duration.ofSeconds(1));
            return "done";
        });
    }
}

Key differences:

Dimension	Go goroutine	Java Virtual Thread
Scheduler	Custom GMP	ForkJoinPool (work stealing)
Preemption	Signal-based async (10ms)	No time slicing (yields at blocking operations)
Stack	Contiguous, copy-on-grow	Frames stored on heap
Pinning	LockOSThread	synchronized blocks pin
Structured concurrency	No native support	StructuredTaskScope
Maturity	Since 2012 (10+ years)	GA in 2023 (JDK 21)

A known issue with Java Virtual Threads is "pinning": when a virtual thread blocks while inside a synchronized block or while executing a native method, it stays pinned to its carrier thread and cannot be unmounted by the scheduler. The effect is loosely analogous to a goroutine that has called LockOSThread() in Go.

Rust tokio runtime:

Rust's async runtime tokio also uses a work-stealing scheduler, but the model is fundamentally different:

// Rust tokio example
#[tokio::main]
async fn main() {
    let handles: Vec<_> = (0..100_000).map(|i| {
        tokio::spawn(async move {
            tokio::time::sleep(Duration::from_secs(1)).await;
            i
        })
    }).collect();
    
    for handle in handles {
        handle.await.unwrap();
    }
}

Dimension	Go goroutine	Rust tokio task
Model	Stackful coroutine	Stackless coroutine
Memory per task	~2-8 KB (stack)	~tens of bytes (Future state machine)
Yield mechanism	Runtime implicitly manages	Must explicitly `.await`
Preemption	Yes (Go 1.14+)	No (fully cooperative)
Compile-time guarantees	No Send/Sync checking	Compile-time Send + 'static check
Blocking handling	Automatic handoff	Must use explicit `spawn_blocking`

tokio's core difference is the stackless coroutine model — Rust's async/await is compiled into state machines by the compiler, requiring no stack allocation per task. This yields higher memory efficiency, but the cost is that every I/O point requires an explicit await, and preemption is impossible. If a task performs a blocking operation without spawn_blocking, it blocks the entire worker thread.

3.5 Key Design Trade-offs in the Scheduler

Go's scheduler design is full of deliberate trade-offs:

1. Fairness vs Throughput:

The 10ms preemption threshold is a compromise. Shorter time slices provide better fairness (lower latency) but increase switching overhead (lower throughput). For comparison, Linux's CFS scheduler has no fixed time slice: it divides a target latency (sched_latency, 6ms by default before scaling with CPU count) among the runnable tasks, with sched_min_granularity as the floor. Go's 10ms is relatively generous, favoring throughput.

2. Queue Size vs Overflow Frequency:

The 256-element local queue is another trade-off. A larger queue reduces overflow to the global queue but increases coherence costs when stolen from; a smaller queue makes work stealing more frequent but also faster.

3. Spinning Ms vs Response Latency:

The Go scheduler maintains a small set of "spinning" Ms: they hold a P but aren't executing any G, and keep looking for work in the global queue, the netpoller, and other Ps' local queues. This wastes CPU but reduces wakeup latency. The number of spinning Ms is kept small (no more than half the number of busy Ps), balancing CPU utilization and responsiveness.
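
The cap on spinning Ms is a one-line comparison. The sketch below assumes the check in runtime/proc.go still has the form 2*nmspinning < gomaxprocs - npidle, which measures spinning Ms against busy Ps; the function name and plain int parameters are hypothetical.

```go
package main

// maySpin reports whether one more M may start spinning, given the current
// number of spinning Ms, the value of GOMAXPROCS, and the number of idle Ps.
func maySpin(spinning, gomaxprocs, idle int) bool {
    busy := gomaxprocs - idle // Ps that are currently running Go code
    return 2*spinning < busy
}
```

For example, with GOMAXPROCS=8 and 4 idle Ps there are 4 busy Ps, so a third M is refused once 2 are already spinning.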

Level 4: Edge Cases and Pitfalls

4.1 Common Causes and Detection of Goroutine Leaks

Goroutine leaks are the most common resource leak in Go programs — forgotten goroutines cannot be garbage collected (their stacks may hold references), causing memory to grow continuously until OOM.

Common leak scenarios:

// Scenario 1: Sending to a channel nobody receives from
func leak1() {
    ch := make(chan int)
    go func() {
        ch <- expensiveComputation() // blocks forever
    }()
    // Function returns, ch unreachable, but goroutine can't exit
}

// Scenario 2: Sender abandoned after the receiver gives up
func leak2(ctx context.Context) error {
    results := make(chan *Result) // unbuffered: a send needs a waiting receiver
    go func() {
        r, err := callExternalService()
        if err != nil {
            return // Note: returns without sending; the caller then waits for ctx
        }
        results <- r // blocks forever if the caller has already returned
    }()
    
    select {
    case r := <-results:
        return process(r)
    case <-ctx.Done():
        // If ctx times out, the goroutine is still running callExternalService.
        // When that call finishes, the send above has no receiver, so the
        // goroutine never exits. Fix: make(chan *Result, 1).
        return ctx.Err()
    }
}

// Scenario 3: Forgetting to close channel causes range to block forever
func leak3() {
    ch := make(chan int)
    go func() {
        for v := range ch { // blocks forever because ch is never closed
            process(v)
        }
    }()
    ch <- 1
    ch <- 2
    // forgot close(ch)
}

// Scenario 4: Mutual waiting (goroutine deadlock)
func leak4() {
    ch1 := make(chan int)
    ch2 := make(chan int)
    go func() {
        <-ch1   // wait for ch1
        ch2 <- 1 // then send to ch2
    }()
    go func() {
        <-ch2   // wait for ch2
        ch1 <- 1 // then send to ch1
    }()
    // Both goroutines wait on each other forever
}
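
One common repair for the abandoned-sender case (Scenario 2) is a channel with capacity 1, so the send completes even when nobody is left to receive. This is a minimal sketch under stated assumptions: slowCall is a hypothetical stand-in for callExternalService, and outcome, fetch and timedOut are illustrative names.

```go
package main

import (
    "context"
    "errors"
    "time"
)

// slowCall is a hypothetical stand-in for callExternalService.
func slowCall(delay time.Duration) (string, error) {
    time.Sleep(delay)
    return "payload", nil
}

// outcome carries both return values through one channel.
type outcome struct {
    val string
    err error
}

// fetch gives up when ctx is done, yet never strands its goroutine:
// the channel has capacity 1, so the send completes without a receiver.
func fetch(ctx context.Context, delay time.Duration) (string, error) {
    results := make(chan outcome, 1)
    go func() {
        v, err := slowCall(delay)
        results <- outcome{v, err} // errors are sent too, so the caller learns of them at once
    }()
    select {
    case r := <-results:
        return r.val, r.err
    case <-ctx.Done():
        return "", ctx.Err()
    }
}

// timedOut reports whether fetch hit the deadline for the given timings.
func timedOut(timeout, delay time.Duration) bool {
    ctx, cancel := context.WithTimeout(context.Background(), timeout)
    defer cancel()
    _, err := fetch(ctx, delay)
    return errors.Is(err, context.DeadlineExceeded)
}
```

The goroutine still runs slowCall to completion after a timeout; the buffer only guarantees that it can exit afterwards. Stopping the work early would additionally require the call itself to accept a context.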

Detection methods:

Method 1: runtime.NumGoroutine() monitoring

// Check for goroutine leaks in tests
func TestNoLeak(t *testing.T) {
    before := runtime.NumGoroutine()
    
    // Execute code under test
    doSomething()
    
    // Wait for goroutines to exit
    time.Sleep(100 * time.Millisecond)
    
    after := runtime.NumGoroutine()
    if after > before {
        t.Errorf("goroutine leak: before=%d after=%d", before, after)
    }
}
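
A fixed 100 ms sleep makes such a check either slow or flaky. A hedged alternative is to poll until the count settles or a deadline passes; the helper names settlesTo and leaks below are hypothetical, and the approach assumes no unrelated goroutines start or stop while the check runs.

```go
package main

import (
    "runtime"
    "time"
)

// settlesTo polls runtime.NumGoroutine until it drops to at most want,
// or until timeout elapses. It reports whether the count settled.
func settlesTo(want int, timeout time.Duration) bool {
    deadline := time.Now().Add(timeout)
    for {
        if runtime.NumGoroutine() <= want {
            return true
        }
        if time.Now().After(deadline) {
            return false
        }
        time.Sleep(5 * time.Millisecond)
    }
}

// leaks runs f and reports whether goroutines started by f are still
// alive once timeout has elapsed.
func leaks(f func(), timeout time.Duration) bool {
    before := runtime.NumGoroutine()
    f()
    return !settlesTo(before, timeout)
}
```

In a test this would be used as `if leaks(doSomething, time.Second) { t.Error("goroutine leak") }`; the check returns as soon as the count is back to its starting value.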

Method 2: goleak library (open-sourced by Uber)

import "go.uber.org/goleak"

func TestMain(m *testing.M) {
    goleak.VerifyTestMain(m)
}

// Or in individual tests
func TestFoo(t *testing.T) {
    defer goleak.VerifyNone(t)
    // ... test code ...
}

goleak works by capturing all goroutine stacks at test completion, filtering out known system goroutines (from runtime, testing packages), and failing if any other goroutines remain alive.

Method 3: pprof goroutine profile

import (
    "net/http"
    _ "net/http/pprof"
)

func main() {
    go func() {
        http.ListenAndServe(":6060", nil) // expose pprof endpoints
    }()
    // ...
}

// Visit http://localhost:6060/debug/pprof/goroutine?debug=1
// to see all goroutine stacks

// Or via command line:
// go tool pprof http://localhost:6060/debug/pprof/goroutine
// (pprof) top    - see where the most goroutines are currently blocked or running
// (pprof) traces - see all goroutine call stacks

Method 4: Continuous monitoring + alerting

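// Expose goroutine count in Prometheus
// Note: the client library's default registry already exports go_goroutines
// through its built-in Go collector, so a custom gauge like this one is only
// needed (and must be registered) on a custom registry
import (
    "runtime"

    "github.com/prometheus/client_golang/prometheus"
)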

var goroutineGauge = prometheus.NewGaugeFunc(
    prometheus.GaugeOpts{
        Name: "go_goroutines",
        Help: "Number of goroutines that currently exist.",
    },
    func() float64 { return float64(runtime.NumGoroutine()) },
)

// Alert rule (Prometheus alerting rule; Alertmanager only routes the resulting alert):
// alert: GoroutineLeak
//   expr: go_goroutines > 10000
//   for: 5m

4.2 Performance Impact of Incorrect GOMAXPROCS

Scenario 1: Not adapting to container CPU quota

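This is the most common production issue. Suppose a Kubernetes Pod is configured with resources.limits.cpu: "2" (2 cores), but the host has 64 cores. Go releases before 1.25 ignore the cgroup CPU limit when choosing the default, so GOMAXPROCS follows the host: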

GOMAXPROCS = 64 (wrong!)
├── Creates 64 Ps
├── 64 Ms compete for 2 cores of CPU time
├── Massive context switching (Linux CFS throttles)
├── Scheduling latency increases
└── Actual throughput WORSE than GOMAXPROCS=2

// Fix
import _ "go.uber.org/automaxprocs" // auto-detects CFS quota in init()

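// How automaxprocs works:
// 1. Reads the CFS quota (cgroup v1: /sys/fs/cgroup/cpu/cpu.cfs_quota_us)
// 2. Reads the CFS period (cgroup v1: /sys/fs/cgroup/cpu/cpu.cfs_period_us)
//    On cgroup v2 both values come from the single file cpu.max
// 3. GOMAXPROCS = quota / period (rounded down by default, minimum 1)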
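
The quota arithmetic is easy to reproduce. The sketch below is not automaxprocs' code: the function names are hypothetical, the rounding (down, with a floor of 1) reflects that library's default as an assumption, and only the cgroup v2 cpu.max format ("<quota|max> <period>") is parsed.

```go
package main

import (
    "strconv"
    "strings"
)

// procsFromQuota converts a CFS quota and period (both in microseconds) into
// a GOMAXPROCS value, rounding down with a floor of 1. A non-positive quota
// means "no limit", in which case fallback is returned.
func procsFromQuota(quotaUs, periodUs int64, fallback int) int {
    if quotaUs <= 0 || periodUs <= 0 {
        return fallback
    }
    n := int(quotaUs / periodUs)
    if n < 1 {
        return 1
    }
    return n
}

// parseCPUMax parses the contents of a cgroup v2 cpu.max file.
// An unlimited quota ("max") is reported as -1.
func parseCPUMax(s string) (quotaUs, periodUs int64, ok bool) {
    fields := strings.Fields(s)
    if len(fields) != 2 {
        return 0, 0, false
    }
    period, err := strconv.ParseInt(fields[1], 10, 64)
    if err != nil {
        return 0, 0, false
    }
    if fields[0] == "max" {
        return -1, period, true
    }
    quota, err := strconv.ParseInt(fields[0], 10, 64)
    if err != nil {
        return 0, 0, false
    }
    return quota, period, true
}
```

A Pod limited to 2 CPUs typically shows "200000 100000" in cpu.max, which yields 2; a limit of 2.5 CPUs ("250000 100000") also yields 2 under round-down, and a limit of 0.5 CPUs yields 1.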

Scenario 2: Different strategies for CPU-bound vs I/O-bound

// CPU-bound: GOMAXPROCS = CPU core count (default is optimal)
// Encryption, image processing, numerical simulation, etc.
func cpuBound() {
    runtime.GOMAXPROCS(runtime.NumCPU()) // this IS the default
}

// I/O-bound: GOMAXPROCS = CPU core count is already sufficient
// Because goroutines blocking on I/O don't occupy P; P executes other Gs
// No need to increase GOMAXPROCS

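// Mixed workloads: special cases may need tuning
// Long CGO calls release their P much like blocking syscalls do, so their cost
// is extra OS threads rather than idle Ps; bound the number of concurrent CGO
// calls before reaching for a larger GOMAXPROCS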

Scenario 3: Special uses of GOMAXPROCS=1

// Setting GOMAXPROCS to 1 simplifies concurrency reasoning
runtime.GOMAXPROCS(1)
// All goroutines can only be concurrent, never parallel
// Useful in certain test scenarios (reproducing race conditions)
// But note: this is NOT a substitute for the race detector

Benchmark: GOMAXPROCS impact on different workloads

package main

import (
    "crypto/sha256"
    "fmt"
    "runtime"
    "sync"
    "time"
)

func benchCPU(procs int) time.Duration {
    runtime.GOMAXPROCS(procs)
    start := time.Now()
    var wg sync.WaitGroup
    for i := 0; i < 8; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            data := make([]byte, 1024)
            for j := 0; j < 100_000; j++ {
                sha256.Sum256(data)
            }
        }()
    }
    wg.Wait()
    return time.Since(start)
}

func main() {
    for _, p := range []int{1, 2, 4, 8, 16} {
        d := benchCPU(p)
        fmt.Printf("GOMAXPROCS=%2d: %v\n", p, d)
    }
}
// Illustrative output (8-core machine; absolute times depend on the CPU):
// GOMAXPROCS= 1: 12.3s
// GOMAXPROCS= 2: 6.2s
// GOMAXPROCS= 4: 3.1s
// GOMAXPROCS= 8: 1.6s   ← optimal
// GOMAXPROCS=16: 1.7s   ← beyond core count, no improvement

4.3 High-Frequency Interview Questions

Question 1: Draw and explain the GMP model

Key points for your answer:

┌─────────────────────────────────────────────────────────┐
│                     Go Process                          │
│                                                         │
│  ┌───────────────── Global Run Queue ──────────────┐    │
│  │  [G] [G] [G] ...  (mutex protected, low freq)    │    │
│  └────────────────────────────────────────────────────┘  │
│                                                         │
│  P0 ─────────── P1 ─────────── P2 ─────────── ...     │
│  │ LRQ:[G][G]  │ LRQ:[G]      │ LRQ:[G][G][G]        │
│  │ runnext: G  │ runnext: nil  │ runnext: G           │
│  │ mcache      │ mcache        │ mcache               │
│  │             │               │                       │
│  │ ↕ bound     │ ↕ bound       │ ↕ bound              │
│  │             │               │                       │
│  M0 (thread)  M1 (thread)     M2 (thread)             │
│  │ executing G │ executing G   │ executing G           │
│                                                         │
│  Idle M list: M3, M4 (waiting for P)                   │
│  sysmon: independent M, no P binding                   │
└─────────────────────────────────────────────────────────┘

Data flow:
1. go func() → new G placed in current P's runnext slot (the displaced G moves to the LRQ)
2. schedule() → take G from runnext/LRQ and execute
3. LRQ empty → check the global queue and netpoller, then steal half from another P's LRQ
4. blocking syscall → P unbinds from M (handoff)
5. G running >10ms → sysmon sends SIGURG to preempt (Go 1.14+)

Question 2: Differences between goroutines and threads

Dimension	goroutine	OS Thread
Created by	Go runtime	OS kernel
Initial stack	2 KB, dynamically grows	1-8 MB, fixed
Scheduling	Go runtime M:N scheduling	Kernel 1:1 scheduling
Switch cost	~200 ns (userspace)	~1-10 μs (kernel mode)
ID	Has one but intentionally not exposed	pthread_t / tid
Communication	channel (CSP)	Shared memory + locks
Preemption	Runtime signal (10ms)	Kernel clock interrupt (~6ms)
Count limit	Millions	Thousands (limited by stack memory)

Question 3: Why isn't goroutine ID exposed?

The Go team intentionally does not provide a public API to get goroutine IDs in the runtime package. The reasons:

Exposing goroutine IDs would tempt developers to write "thread-local storage" style code (storing/retrieving state by ID), violating Go's philosophy of "share memory by communicating"
Goroutines should be anonymous, interchangeable execution units without "identity"
If truly needed (e.g., request tracing in logs), you should pass request-level identifiers via context.Context

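// While you CAN hack out the goroutine ID, this is NOT recommended
import (
    "bytes"
    "runtime"
    "strconv"
)

func goid() int64 {
    var buf [64]byte
    n := runtime.Stack(buf[:], false)
    // The trace begins "goroutine 123 [running]:"; the second field is the ID
    fields := bytes.Fields(buf[:n])
    if len(fields) < 2 {
        return -1
    }
    id, _ := strconv.ParseInt(string(fields[1]), 10, 64)
    return id // an extreme hack: do NOT use in production code
}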

Question 4: When does a goroutine get scheduled away?

Channel operation blocks (send/receive/select)
Mutex/RWMutex contention failure
time.Sleep, or receiving from a timer channel such as time.After
Network I/O (goes through netpoller underneath)
System call (file I/O, etc.)
runtime.Gosched() voluntary yield
Running beyond 10ms triggers signal preemption
GC STW (Stop The World)
Stack growth needed (rarely causes scheduling, but pauses execution)

Question 5: What happens when a goroutine calls LockOSThread()?

runtime.LockOSThread()
// Effects:
// 1. Current G is bound to current M; no other G will run on this M
// 2. Current M will only execute this G
// 3. If this G creates new Gs, they run on other Ms
// 4. Binding is released when G exits or calls UnlockOSThread()

// Use cases:
// - CGO calls (some C libraries require same-thread calling)
// - GUI frameworks (main thread restriction)
// - Specific Linux namespace operations (e.g., setns)
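
A common way to use LockOSThread is to dedicate one goroutine to the thread-bound library and send it work over a channel. The sketch below illustrates that pattern; startLockedWorker and its returned run and stop functions are hypothetical names, not part of any library.

```go
package main

import "runtime"

// startLockedWorker starts a goroutine pinned to a single OS thread.
// run executes f on that thread and returns once f has finished;
// stop shuts the worker down and releases the thread.
func startLockedWorker() (run func(f func()), stop func()) {
    jobs := make(chan func())
    done := make(chan struct{})
    go func() {
        runtime.LockOSThread()
        defer runtime.UnlockOSThread()
        for f := range jobs {
            f() // always executes on the same OS thread
            done <- struct{}{}
        }
    }()
    run = func(f func()) {
        jobs <- f
        <-done
    }
    stop = func() { close(jobs) }
    return run, stop
}
```

Every call that must happen on the same thread goes through run, while the rest of the program keeps using ordinary goroutines that the scheduler is free to move between Ms.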

4.4 Practical Debugging: GODEBUG=schedtrace

The Go runtime provides powerful built-in debugging tools for observing real-time scheduler behavior without code changes.

schedtrace basic usage:

# Print scheduler state every 1000ms
GODEBUG=schedtrace=1000 ./myapp

# Example output:
# SCHED 0ms: gomaxprocs=8 idleprocs=5 threads=10 spinningthreads=2
#   idlethreads=3 runqueue=0 [2 0 1 0 3 0 0 1]

Field meanings:

Field	Meaning
`gomaxprocs=8`	Number of Ps
`idleprocs=5`	Number of idle Ps
`threads=10`	Total M (OS thread) count
`spinningthreads=2`	Spinning Ms (looking for work)
`idlethreads=3`	Sleeping Ms
`runqueue=0`	Gs in global run queue
`[2 0 1 0 3 0 0 1]`	Gs in each P's local queue
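
For automated monitoring, the summary line can be parsed with a few string operations. This is a minimal sketch that assumes the summary is printed on one line in the key=value format shown above; schedSample and parseSchedLine are hypothetical names, and fields not listed in the switch are ignored.

```go
package main

import (
    "strconv"
    "strings"
)

// schedSample holds selected fields from one schedtrace summary line.
type schedSample struct {
    GoMaxProcs  int
    IdleProcs   int
    Threads     int
    RunQueue    int   // global run queue length
    LocalQueues []int // per-P local queue lengths
}

// parseSchedLine extracts the fields from a line such as
// "SCHED 0ms: gomaxprocs=8 idleprocs=5 ... runqueue=0 [2 0 1 0 3 0 0 1]".
func parseSchedLine(line string) (schedSample, bool) {
    var s schedSample
    if !strings.HasPrefix(line, "SCHED ") {
        return s, false
    }
    // The per-P queue lengths sit inside the trailing brackets.
    if i := strings.Index(line, "["); i >= 0 {
        if j := strings.Index(line, "]"); j > i {
            for _, f := range strings.Fields(line[i+1 : j]) {
                n, err := strconv.Atoi(f)
                if err != nil {
                    return s, false
                }
                s.LocalQueues = append(s.LocalQueues, n)
            }
        }
        line = line[:i]
    }
    // Everything before the brackets is space-separated key=value pairs.
    for _, f := range strings.Fields(line) {
        k, v, found := strings.Cut(f, "=")
        if !found {
            continue
        }
        n, err := strconv.Atoi(v)
        if err != nil {
            continue
        }
        switch k {
        case "gomaxprocs":
            s.GoMaxProcs = n
        case "idleprocs":
            s.IdleProcs = n
        case "threads":
            s.Threads = n
        case "runqueue":
            s.RunQueue = n
        }
    }
    return s, true
}
```

Feeding successive samples into a time series makes imbalances such as one P with a long local queue and several with empty ones easy to spot.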

scheddetail for more verbose output:

GODEBUG=schedtrace=1000,scheddetail=1 ./myapp

# Also outputs detailed state for each P and M:
# P0: status=1 schedtick=3423 syscalltick=88 m=0 runqsize=2 gfreecnt=5
# M0: p=0 curg=17 mallocing=0 throwing=0 preemptoff= locks=0
# G17: status=2(running) m=0 lockedm=-1

Real-world debugging case: Diagnosing scheduling latency

// Symptom: Some HTTP requests suddenly slow (P99 tail latency spikes from 5ms to 50ms)
// Using schedtrace:

// SCHED 3000ms: gomaxprocs=4 idleprocs=0 threads=12 spinningthreads=0
//   idlethreads=8 runqueue=47 [128 64 0 0]

// Findings:
// 1. idleprocs=0 — all Ps are busy
// 2. runqueue=47 — 47 Gs queued in global queue
// 3. [128 64 0 0] — P0 and P1 severely backlogged, P2/P3 empty

// Diagnosis: P2 and P3 likely stuck in long syscalls (P transferred after handoff)
// But new Gs are still concentrated on P0/P1 (created from those contexts)
// Solution: Check if goroutines are doing long synchronous file I/O

Using the execution tracer for deeper analysis:

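import (
    "log"
    "os"
    "runtime/trace"
)

func main() {
    f, err := os.Create("trace.out")
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()
    if err := trace.Start(f); err != nil {
        log.Fatal(err)
    }
    defer trace.Stop()
    // ... your program ...
}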

// Then view with go tool trace:
// go tool trace trace.out
// You can see:
// - Timeline of goroutines on each P
// - Latency from G creation to execution
// - Work stealing events
// - P handoffs due to syscalls
// - GC STW duration and impact

# Capture trace from a running program (via pprof HTTP)
curl -o trace.out 'http://localhost:6060/debug/pprof/trace?seconds=5'
go tool trace trace.out

Other GODEBUG options related to scheduling:

# Combine multiple options
GODEBUG=schedtrace=1000,gctrace=1,madvdontneed=1 ./myapp

# gctrace=1 — GC logs (shows STW duration, affects scheduling)
# asyncpreemptoff=1 — disable async preemption (for comparison testing)
# tracebackancestors=N — goroutine creation chain trace depth

4.5 Production Best Practices

Practice 1: Use context to control goroutine lifecycle

// Standard pattern: all goroutines should be cancellable
func worker(ctx context.Context, tasks <-chan Task) error {
    for {
        select {
        case <-ctx.Done():
            return ctx.Err() // graceful exit
        case task, ok := <-tasks:
            if !ok {
                return nil // channel closed
            }
            if err := process(ctx, task); err != nil {
                return err
            }
        }
    }
}

// Pass a cancellable context at startup
ctx, cancel := context.WithCancel(context.Background())
defer cancel() // signals every worker to exit (pair with a sync.WaitGroup to wait for them)

for i := 0; i < numWorkers; i++ {
    go worker(ctx, tasks)
}
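
Cancelling the context only asks the workers to stop; to know that they have stopped, the caller also has to wait for them. The following sketch combines both under simplifying assumptions: tasks are plain ints that the workers sum, and runWorkers, sumWithWorkers and cancelledRun are hypothetical names.

```go
package main

import (
    "context"
    "sync"
)

// runWorkers starts n workers that add up values from tasks until the channel
// is closed or ctx is cancelled. It returns only after every worker has exited.
func runWorkers(ctx context.Context, n int, tasks <-chan int) int {
    var (
        wg    sync.WaitGroup
        mu    sync.Mutex
        total int
    )
    for i := 0; i < n; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for {
                select {
                case <-ctx.Done():
                    return
                case v, ok := <-tasks:
                    if !ok {
                        return
                    }
                    mu.Lock()
                    total += v
                    mu.Unlock()
                }
            }
        }()
    }
    wg.Wait() // no worker outlives this call
    return total
}

// sumWithWorkers feeds 1..count to n workers and returns their total.
func sumWithWorkers(n, count int) int {
    ctx, cancel := context.WithCancel(context.Background())
    defer cancel()
    tasks := make(chan int)
    go func() {
        defer close(tasks)
        for i := 1; i <= count; i++ {
            select {
            case tasks <- i:
            case <-ctx.Done():
                return
            }
        }
    }()
    return runWorkers(ctx, n, tasks)
}

// cancelledRun shows that workers exit on cancellation even though
// the task channel is never closed.
func cancelledRun(n int) int {
    ctx, cancel := context.WithCancel(context.Background())
    cancel()
    return runWorkers(ctx, n, make(chan int))
}
```

The producer goroutine also selects on ctx.Done(), so it cannot be stranded on a send if the workers stop early.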

Practice 2: Limit concurrent goroutine count

// Semaphore pattern to limit concurrency
sem := make(chan struct{}, maxConcurrency)

for _, item := range items {
    sem <- struct{}{} // acquire semaphore, may block
    go func(item Item) {
        defer func() { <-sem }() // release semaphore
        process(item)
    }(item)
}

// Or use errgroup
import "golang.org/x/sync/errgroup"

g, ctx := errgroup.WithContext(ctx)
g.SetLimit(maxConcurrency) // g.Go blocks while maxConcurrency tasks are in flight

for _, item := range items {
    item := item // per-iteration copy; unnecessary since Go 1.22
    g.Go(func() error {
        return process(ctx, item)
    })
}
if err := g.Wait(); err != nil {
    // handle error
}
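
The semaphore snippet above does not wait for the goroutines to finish and offers no way to confirm that the limit holds. The sketch below adds both; peakConcurrency is a hypothetical helper that runs short placeholder tasks (a 1 ms sleep) and records the highest number observed in flight.

```go
package main

import (
    "sync"
    "sync/atomic"
    "time"
)

// peakConcurrency runs jobs short tasks with at most limit in flight and
// returns the highest number of tasks observed running at the same time.
func peakConcurrency(jobs, limit int) int {
    sem := make(chan struct{}, limit)
    var wg sync.WaitGroup
    var running, peak atomic.Int32
    for i := 0; i < jobs; i++ {
        sem <- struct{}{} // acquire a slot; blocks while limit tasks are running
        wg.Add(1)
        go func() {
            defer wg.Done()
            defer func() { <-sem }() // release the slot (runs before wg.Done)
            n := running.Add(1)
            for { // raise peak to n if n is a new maximum
                p := peak.Load()
                if n <= p || peak.CompareAndSwap(p, n) {
                    break
                }
            }
            time.Sleep(time.Millisecond) // placeholder for real work
            running.Add(-1)
        }()
    }
    wg.Wait()
    return int(peak.Load())
}
```

Because a slot is acquired before the goroutine starts and released only after the running counter is decremented, the returned peak can never exceed limit.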

Practice 3: Avoid goroutine creation in hot paths

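// Anti-pattern: creating a goroutine per request for timeout control.
// On timeout the goroutine keeps running and may still write to w after the
// handler has returned, which is a data race on the ResponseWriter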
func handleRequest(w http.ResponseWriter, r *http.Request) {
    done := make(chan struct{})
    go func() {
        result := doWork()     // new goroutine per request
        sendResponse(w, result)
        close(done)
    }()
    select {
    case <-done:
    case <-time.After(5 * time.Second):
        http.Error(w, "timeout", 504)
    }
}

// Better: use context timeout (no extra goroutine needed, provided doWork honors ctx)
func handleRequest(w http.ResponseWriter, r *http.Request) {
    ctx, cancel := context.WithTimeout(r.Context(), 5*time.Second)
    defer cancel()
    
    result, err := doWork(ctx) // execute in current goroutine
    if err != nil {
        if ctx.Err() == context.DeadlineExceeded {
            http.Error(w, "timeout", 504)
            return
        }
        http.Error(w, err.Error(), 500)
        return
    }
    sendResponse(w, result)
}

Practice 4: Monitor goroutine trends, not absolute counts

// Don't just look at absolute numbers — look at trends
// Goroutine count slowly but continuously growing = leak
// Goroutine count fluctuating with load but bounded = normal

// Grafana alert rule pseudocode:
// IF deriv(go_goroutines[1h]) > 10
//    AND go_goroutines > 5000
// THEN alert("possible goroutine leak")
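
The trend test in that rule amounts to a linear regression over recent samples. Here is a small sketch of the same idea in Go, assuming samples are taken at equal intervals; slope and looksLikeLeak are hypothetical names, and the thresholds are illustrative only.

```go
package main

// slope returns the least-squares slope of samples taken at equal intervals,
// in goroutines per sampling interval.
func slope(samples []float64) float64 {
    n := float64(len(samples))
    if n < 2 {
        return 0
    }
    var sumX, sumY, sumXY, sumXX float64
    for i, y := range samples {
        x := float64(i)
        sumX += x
        sumY += y
        sumXY += x * y
        sumXX += x * x
    }
    return (n*sumXY - sumX*sumY) / (n*sumXX - sumX*sumX)
}

// looksLikeLeak flags a series that is both growing faster than minSlope
// and already above minCount at its latest sample.
func looksLikeLeak(samples []float64, minSlope, minCount float64) bool {
    if len(samples) == 0 {
        return false
    }
    return slope(samples) > minSlope && samples[len(samples)-1] > minCount
}
```

A steadily rising series such as 100, 110, 120, 130, 140 has a slope of 10 per interval and is flagged with minSlope 5 and minCount 120, while a series that fluctuates around 200 is not.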

4.6 Edge Cases and Traps

Trap 1: "Pseudo-deadlock" with GOMAXPROCS=1

runtime.GOMAXPROCS(1)
ch := make(chan int)

go func() {
    ch <- 1
}()

// If the current goroutine does heavy computation here
// and Go version < 1.14 (no async preemption)
// the new goroutine never gets a chance to run
// ch <- 1 never happens
// program hangs (not a deadlock — it's starvation)
for {
    // tight loop without function calls
}
<-ch // never reached

Trap 2: CGO calls pin M

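// During a CGO call the calling M is blocked inside C code; the runtime treats
// it like a blocking syscall and hands the P to another M
// If many goroutines simultaneously make slow CGO calls
// M count can grow until it hits the default 10,000-thread limit

/*
#include <unistd.h>
void slow_c_func() {
    sleep(10); // blocks for 10 seconds
}
*/
import "C"

func main() {
    for i := 0; i < 10001; i++ {
        go func() {
            C.slow_c_func() // each goroutine occupies an M for the whole call
        }()
    }
    select {} // keep main alive; otherwise the program exits before threads pile up
    // May trigger "fatal error: thread exhaustion" and crash
}
// Solution: use a worker pool to limit concurrent CGO calls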

Trap 3: GC STW impact on latency

// GC's STW (Stop The World) phases pause all goroutines
// In Go 1.18+ STW is typically only tens of microseconds
// But every running goroutine must reach a safe point before the pause begins,
// so a single slow-to-stop goroutine delays all the others

// Observe GC STW duration:
// GODEBUG=gctrace=1 ./myapp
// gc 1 @0.012s 2%: 0.044+1.2+0.033 ms clock, 0.35+0.8/1.1/0+0.26 ms cpu ...
//                  ^^^^^     ^^^^^
//                  |         STW mark termination
//                  STW sweep termination (the 1.2 in between is concurrent mark)

// Reducing GC pressure = fewer STWs = lower scheduling latency

Trap 4: runtime.LockOSThread inheritance behavior

func init() {
    // Calling LockOSThread in init pins the main goroutine to the startup thread,
    // so main() is also invoked on that thread
    runtime.LockOSThread()
    // Note: if any goroutine exits while still locked (no UnlockOSThread),
    // its thread is terminated rather than returned to the idle pool
}

// Also, LockOSThread is nestable
runtime.LockOSThread()
runtime.LockOSThread() // count +1
runtime.UnlockOSThread() // count -1, still locked
runtime.UnlockOSThread() // count reaches zero, unlocked

Chapter Summary

The GMP scheduler is one of Go's most critical runtime components. Its design elegantly balances multiple objectives:

Efficient creation: goroutine creation is a pure userspace operation costing ~0.3μs
Fair scheduling: work stealing ensures load balance; signal preemption prevents starvation
Syscalls don't block scheduling: P/M unbinding (handoff) ensures CPUs stay utilized
Scalability: local lock-free queues eliminate global lock bottlenecks

Understanding the GMP model isn't just for interviews — it helps you:

Diagnose goroutine leaks and scheduling latency issues
Configure GOMAXPROCS correctly (especially in container environments)
Understand why certain operations (channel, mutex, I/O) trigger scheduling
Write concurrent code that is friendly to the scheduler

Recommended Further Reading:

Dmitry Vyukov, "Scalable Go Scheduler Design Doc" (2012) — the original GMP model design document
Blumofe & Leiserson, "Scheduling Multithreaded Computations by Work Stealing" (JACM 1999) — work stealing theoretical foundation
Austin Clements, "Proposal: Non-cooperative goroutine preemption" (Go proposal #24543, 2018) — async preemption proposal
Go source code runtime/proc.go — complete scheduler implementation (~6000 lines)
go tool trace documentation — the best tool for visualizing scheduling behavior
