Chapter 16

Garbage Collection: Tri-Color Marking and Write Barriers

Every Go program silently runs a concurrent janitor — the Garbage Collector (GC). The memory you allocate with make([]byte, 1024) never needs manual freeing; the variables captured by your closures never need explicit lifetime tracking. Behind this apparent "free lunch" lies one of the most sophisticated subsystems in the Go runtime.

This chapter does not aim to inform you that "Go has a GC" — any introductory tutorial covers that fact. Instead, we answer: what exactly is the GC doing while your program runs? How does each of its decisions affect your latency, throughput, and memory footprint? When your production service shows P99 latency spikes, how do you determine if GC is the culprit, and how do you tame it?

Level 1: What You Need to Know

Why Garbage Collection Exists

In the world of C and C++, the programmer is both master and servant of memory.

// C: Manual memory management
char *buf = malloc(1024);
// ... use buf ...
free(buf);
// What if we use buf again? Use-after-free, undefined behavior
buf[0] = 'A';  // 💥 Might crash, might silently corrupt data

Manual memory management inflicts three categories of pain:

1. Use-After-Free

This is among the most dangerous bug classes in C/C++. The Chromium project has reported that roughly 70% of its high-severity security bugs are memory safety problems, and that about half of those are use-after-free. You free a block of memory, but another pointer in your code still references that location. Reading or writing through this dangling pointer might return garbage data, overwrite another object's memory, or, in the worst case, be exploited by an attacker to execute arbitrary code.

2. Memory Leaks

Memory is allocated but never freed. In long-running server processes, this means memory usage grows continuously until OOM (Out of Memory) kills the process. More insidious: you do call free, but complex ownership relationships or circular references cause certain code paths to skip the deallocation.

3. Double Free

Calling free on the same memory block twice. This corrupts the allocator's internal data structures, causing subsequent malloc calls to return already-in-use memory regions, leading to data corruption.

Go eliminates all three categories through garbage collection. The cost? CPU time and latency — the GC must scan memory and track object reference relationships. This is an engineering trade-off, not a free lunch.

Core Characteristics of Go's GC

Go's GC has four core characteristics, each a deliberate design choice:

Characteristic	Meaning	Design Rationale
Concurrent	GC runs simultaneously with user goroutines	Minimizes STW pauses for low-latency requirements
Non-generational	No young/old generation distinction	Go's escape analysis already filters most short-lived objects at compile time
Non-compacting	Does not relocate live objects	Avoids the cost of updating all pointers; simplifies concurrent marking
Mark-Sweep	First marks live objects, then reclaims unmarked ones	The most fundamental and flexible GC algorithm framework

Compared to Java's G1/ZGC, Go's GC design philosophy is "simple, predictable, low-latency." Java's generational GC pursues high throughput through complex heuristic algorithms; Go's GC pursues consistent low pauses through straightforward full-heap scanning.

GC Trigger Condition: GOGC

The GC does not run constantly — it requires a trigger condition. Go uses the GOGC environment variable (or debug.SetGCPercent()) to control when GC fires.

Basic rule: When the heap grows to (1 + GOGC/100) times the size of live objects after the last GC, the next GC cycle triggers.

// Default GOGC=100
// Suppose 100MB of live objects after last GC
// Next GC trigger: 100MB × (1 + 100/100) = 200MB
// i.e., GC triggers when heap doubles

// GOGC=50 → more frequent GC
// Trigger: 100MB × 1.5 = 150MB

// GOGC=200 → less frequent GC
// Trigger: 100MB × 3 = 300MB

// GOGC=off → disable GC (dangerous! only for special scenarios)

Intuition: GOGC is a "memory-for-CPU" knob. Increase GOGC: fewer GC cycles (lower CPU overhead), but higher peak memory. Decrease GOGC: tighter memory usage, but more frequent GC runs.
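The basic rule is easy to check with a few lines of arithmetic. The sketch below only illustrates the formula; `heapGoal` is a hypothetical helper, not a runtime API, and it ignores the GC roots (stacks and globals) that newer runtimes also factor into the goal.

```go
package main

import "fmt"

// heapGoal applies the basic GOGC rule: live * (1 + GOGC/100).
// Hypothetical helper for illustration; sizes are in MB.
func heapGoal(liveMB, gogc uint64) uint64 {
	return liveMB + liveMB*gogc/100
}

func main() {
	for _, gogc := range []uint64{50, 100, 200} {
		fmt.Printf("GOGC=%d: 100 MB live -> next GC near %d MB\n", gogc, heapGoal(100, gogc))
	}
}
```

Doubling GOGC from 100 to 200 moves the trigger from 200 MB to 300 MB for the same live heap, which is the memory-for-CPU trade described above.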

package main

import (
    "fmt"
    "runtime"
    "runtime/debug"
)

func main() {
    // Read the current GOGC setting. SetGCPercent returns the previous value,
    // so pass -1 (which disables GC) and immediately restore what was there.
    prev := debug.SetGCPercent(-1)
    debug.SetGCPercent(prev)
    fmt.Println("GOGC:", prev)

    // Read GC statistics
    var stats runtime.MemStats
    runtime.ReadMemStats(&stats)
    fmt.Printf("Completed GC cycles: %d\n", stats.NumGC)
    fmt.Printf("Heap allocated bytes: %d\n", stats.HeapAlloc)
    fmt.Printf("Cumulative GC pause: %v ns\n", stats.PauseTotalNs)
}

Go 1.19+: GOMEMLIMIT Soft Memory Limit

GOGC has a fundamental problem: it only considers growth ratio, not absolute values. If your container memory limit is 1GB, GOGC has no awareness of that ceiling.

Go 1.19 introduced GOMEMLIMIT as a soft memory limit:

// Set via environment variable
// GOMEMLIMIT=512MiB

// Or set programmatically
import "runtime/debug"
debug.SetMemoryLimit(512 * 1024 * 1024) // 512 MiB

GOMEMLIMIT behavior:

When memory usage approaches the limit, GC runs more aggressively (even before the GOGC threshold)
It is a "soft" limit — if GC cannot reclaim enough memory, the heap may exceed this value
It is not a hard limit and will not cause an OOM panic
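To confirm which limit is actually in effect, the limit can be read back at runtime. This is a minimal sketch assuming Go 1.19 or later; `setLimit` is a hypothetical wrapper, while `debug.SetMemoryLimit` is the real API, and a negative argument reports the current limit without changing it.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// setLimit installs a new soft memory limit and returns the limit that is in
// effect afterwards. A negative argument to debug.SetMemoryLimit leaves the
// limit unchanged and only reports it.
func setLimit(bytes int64) int64 {
	debug.SetMemoryLimit(bytes)
	return debug.SetMemoryLimit(-1)
}

func main() {
	fmt.Println("limit now:", setLimit(512<<20), "bytes")
}
```

A limit set programmatically overrides whatever GOMEMLIMIT supplied in the environment, so logging the value at startup is a cheap way to avoid surprises.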

Recommended configuration strategy:

# Recommended for container environments
# Container memory limit 1GB, give Go process 80% (rest for OS, sidecars, etc.)
GOMEMLIMIT=800MiB
GOGC=100  # Keep default, let GOMEMLIMIT intervene near the ceiling

Common mistake:

// Risky: GOGC=off with a GOMEMLIMIT close to the live heap size
// Problem: with GOGC=off the memory limit is the only trigger; if nearly
// everything is live, each cycle reclaims almost nothing
// Result: back-to-back GC cycles (thrashing) that starve the application
// The runtime limits GC to roughly 50% of CPU time in this situation and lets
// memory exceed the limit instead, but throughput still drops sharply

// Safer: Keep a reasonable GOGC, let GOMEMLIMIT serve only as a safety net
GOGC=100
GOMEMLIMIT=800MiB

Observing GC Behavior

Before debugging and tuning, you first need to "see" what the GC is doing:

# Method 1: GODEBUG environment variable (simplest)
GODEBUG=gctrace=1 ./your-program

# Example output:
# gc 1 @0.001s 2%: 0.003+0.35+0.003 ms clock, 0.029+0.11/0.36/0.001+0.023 ms cpu, 4->4->0 MB, 4 MB goal, 0 MB stacks, 0 MB globals, 8 P
# Interpretation:
#   gc 1          → 1st GC cycle
#   @0.001s       → 0.001 seconds after program start
#   2%            → percentage of CPU time spent in GC since program start
#   0.003+0.35+0.003 ms clock → STW sweep termination + concurrent mark + STW mark termination, wall-clock time
#   4->4->0 MB    → heap at GC start → heap at GC end → live heap
#   4 MB goal     → heap size goal for this cycle
#   8 P           → number of processors used

// Method 2: runtime/metrics package (Go 1.16+, programmatic)
import (
    "fmt"
    "runtime/metrics"
)

func printGCMetrics() {
    samples := []metrics.Sample{
        {Name: "/gc/cycles/total:gc-cycles"},
        {Name: "/gc/pauses:seconds"}, // distribution of individual pause times (histogram)
        {Name: "/memory/classes/heap/objects:bytes"},
    }
    metrics.Read(samples)
    
    fmt.Printf("Total GC cycles: %d\n", samples[0].Value.Uint64())
    // ...
}
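Metric names vary between Go releases, so it helps to check a sample's kind before reading it. The following is a sketch rather than a recommended wrapper: `readUint64` is a hypothetical helper, and the metric names are ones I believe recent releases support.

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/metrics"
)

// readUint64 reads a single uint64-valued metric. ok is false if the running
// Go version does not support the metric or it has a different kind.
func readUint64(name string) (value uint64, ok bool) {
	s := []metrics.Sample{{Name: name}}
	metrics.Read(s)
	if s[0].Value.Kind() != metrics.KindUint64 {
		return 0, false
	}
	return s[0].Value.Uint64(), true
}

func main() {
	runtime.GC() // force one cycle so the counters are non-zero
	for _, name := range []string{
		"/gc/cycles/total:gc-cycles",
		"/gc/heap/goal:bytes",
		"/memory/classes/heap/objects:bytes",
	} {
		if v, ok := readUint64(name); ok {
			fmt.Printf("%-40s %d\n", name, v)
		}
	}
}
```

An unknown name yields a sample of kind `metrics.KindBad` instead of a panic, which is why the kind check doubles as a feature probe.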

Level 2: How It Works

The Tri-Color Marking Algorithm

The core algorithm of Go's GC is tri-color marking, originally proposed by Dijkstra et al. in their 1978 paper "On-the-fly Garbage Collection: An Exercise in Cooperation."

The meaning of three colors:

Color	Meaning	Object State
White	Potentially garbage	Not yet scanned; objects still white at GC end will be reclaimed
Grey	Discovered but not fully scanned	Known live, but its references have not been examined
Black	Fully scanned	Known live, and all its references have been discovered (greyed or blacked)

Algorithm flow:

Initial state:
  - All objects marked white
  - Objects directly referenced by GC roots (globals, stack pointers, registers) marked grey

Loop (until grey set is empty):
  1. Remove an object O from the grey set
  2. Mark O black
  3. Scan all pointer fields of O
  4. For each white object P referenced by O:
     - Mark P grey (it is live)

Termination:
  - Grey set is empty
  - Now: black objects = live, white objects = garbage

A concrete example:

Suppose the reference graph is:
  Root → A → B → C
  Root → D
  E (unreferenced)

Step 0: All white
  White: {A, B, C, D, E}  Grey: {}  Black: {}

Step 1: Root scan, grey directly referenced A and D
  White: {B, C, E}  Grey: {A, D}  Black: {}

Step 2: Take A, mark black, scan A's references → grey B
  White: {C, E}  Grey: {B, D}  Black: {A}

Step 3: Take B, mark black, scan B's references → grey C
  White: {E}  Grey: {C, D}  Black: {A, B}

Step 4: Take C, mark black, C has no references
  White: {E}  Grey: {D}  Black: {A, B, C}

Step 5: Take D, mark black, D has no references
  White: {E}  Grey: {}  Black: {A, B, C, D}

Done: E is still white → reclaim E's memory
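The walkthrough above can be reproduced with a toy model. This sketch represents the heap as a map of adjacency lists and has nothing to do with the runtime's real data structures. It processes the grey set in FIFO order, so the intermediate steps differ slightly from the trace above while the final colors are the same.

```go
package main

import "fmt"

type color int

const (
	white color = iota
	grey
	black
)

// mark runs tri-color marking over a graph given as adjacency lists and
// returns the final color of every object.
func mark(graph map[string][]string, roots []string) map[string]color {
	colors := make(map[string]color, len(graph))
	for name := range graph {
		colors[name] = white
	}
	var greySet []string // worklist of grey objects
	shade := func(name string) {
		if colors[name] == white {
			colors[name] = grey
			greySet = append(greySet, name)
		}
	}
	for _, r := range roots {
		shade(r)
	}
	for len(greySet) > 0 {
		obj := greySet[0]
		greySet = greySet[1:]
		colors[obj] = black
		for _, ref := range graph[obj] {
			shade(ref)
		}
	}
	return colors
}

func main() {
	graph := map[string][]string{
		"A": {"B"}, "B": {"C"}, "C": nil, "D": nil, "E": nil,
	}
	colors := mark(graph, []string{"A", "D"})
	for _, name := range []string{"A", "B", "C", "D", "E"} {
		if colors[name] == white {
			fmt.Println(name, "is white: reclaimed")
		} else {
			fmt.Println(name, "is black: live")
		}
	}
}
```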

The Tri-Color Invariant

In STW (Stop-The-World) mode, tri-color marking is straightforward — pause all goroutines and scan in peace. But Go's GC is concurrent, meaning user goroutines continue modifying reference relationships while GC scans.

Consider this dangerous scenario:

Time T1: GC has marked A black; B is still white and reachable only through C
  A(black)        C(grey) → B(white)

Time T2: User goroutine executes:
  A.ref = B      // black A gains a reference to white B
  C.ref = nil    // C no longer references B

Time T3: GC continues scanning C, finds C has no references → B stays white

Result: B is live (A references it) but is incorrectly reclaimed!

This is the "lost object" problem. To prevent it, the tri-color invariant must be maintained:

Strong Tri-Color Invariant: A black object must never directly reference a white object.

Weak Tri-Color Invariant: A black object may reference a white object, but that white object must be reachable from some grey object via a chain of references.

Go uses the weak tri-color invariant — enforced through write barriers. As long as a path from a grey object to the white object exists, the white object will not be incorrectly reclaimed.

Write Barriers

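A write barrier is a small piece of code that the compiler emits around pointer writes to heap memory. While the GC is marking, the barrier runs before the store and tells the collector about the pointers involved; outside the mark phase it reduces to a cheap flag check. Writes to a goroutine's own stack are not instrumented.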

Go's Hybrid Write Barrier (Go 1.8+):

Go uses a combination of Dijkstra's insertion barrier and Yuasa's deletion barrier:

// Pseudocode: Hybrid write barrier
// When executing slot = ptr (i.e., *slot = ptr):
func writeBarrier(slot *unsafe.Pointer, ptr unsafe.Pointer) {
    // Yuasa part: record the old value being overwritten (deletion barrier)
    shade(*slot)  // grey the old value
    
    // Dijkstra part: record the new value (insertion barrier)
    shade(ptr)    // grey the new value
    
    // Perform the actual write
    *slot = ptr
}

// shade function: if object is white, mark it grey
func shade(ptr unsafe.Pointer) {
    if ptr != nil && isWhite(ptr) {
        markGrey(ptr)
    }
}

Why hybrid?

Pure Dijkstra write barrier (insertion barrier): Only tracks newly written pointers. Problem: stack writes have no barrier (for performance), so GC must re-scan all goroutine stacks at mark termination. With many goroutines, this re-scan STW time becomes unacceptable.
Pure Yuasa write barrier (deletion barrier): Only tracks overwritten old values. Problem: it relies on a logical snapshot of the object graph at GC start, so every stack has to be scanned or snapshotted in an STW pause before marking begins, and it keeps more floating garbage alive.
Hybrid write barrier (Go 1.8+): Combines both. No write barrier needed on stacks; GC only needs to scan each stack once at start (no re-scan). Specific rules:
1. At GC start, all objects on stacks are marked black
2. Heap pointer writes shade both old and new values
3. Newly created stack objects default to black
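The effect of shading both values can be seen by replaying the lost-object scenario in a toy model. This is an illustrative sketch only: `object`, `collector`, and `lostObjectScenario` are hypothetical names, each object has a single pointer field, and the stack-related rules are not modeled.

```go
package main

import "fmt"

type color int

const (
	white color = iota
	grey
	black
)

type object struct {
	name string
	col  color
	ref  *object // single pointer field, enough for the scenario
}

type collector struct {
	greySet []*object
	barrier bool // whether heap writes go through the hybrid barrier
}

// shade greys a white object and queues it for scanning.
func (c *collector) shade(o *object) {
	if o != nil && o.col == white {
		o.col = grey
		c.greySet = append(c.greySet, o)
	}
}

// write performs slot.ref = ptr, shading old and new values if enabled.
func (c *collector) write(slot, ptr *object) {
	if c.barrier {
		c.shade(slot.ref) // deletion half: the value being overwritten
		c.shade(ptr)      // insertion half: the value being installed
	}
	slot.ref = ptr
}

// drain blackens grey objects until none are left.
func (c *collector) drain() {
	for len(c.greySet) > 0 {
		o := c.greySet[0]
		c.greySet = c.greySet[1:]
		o.col = black
		c.shade(o.ref)
	}
}

// lostObjectScenario replays T1 to T3 and returns B's final color.
func lostObjectScenario(barrier bool) color {
	a := &object{name: "A", col: black}
	b := &object{name: "B", col: white}
	cObj := &object{name: "C", col: grey, ref: b}
	gc := &collector{greySet: []*object{cObj}, barrier: barrier}

	gc.write(a, b)      // T2: black A gains a reference to white B
	gc.write(cObj, nil) // T2: grey C drops its reference to B
	gc.drain()          // T3: collector finishes marking
	return b.col
}

func main() {
	fmt.Println("without barrier, B reclaimed:", lostObjectScenario(false) == white)
	fmt.Println("with barrier, B reclaimed:   ", lostObjectScenario(true) == white)
}
```

In this run either half of the barrier alone would have saved B. The two halves matter together once unbarriered stack writes enter the picture.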

STW Phases

Despite being "concurrent," Go's GC still has two brief STW (Stop-The-World) phases:

STW 1: Mark Setup

Stop all goroutines
Enable write barriers
Queue the root marking work (goroutine stacks, globals); the scanning itself runs concurrently once goroutines resume
Resume all goroutines
Typical duration: 10-30 microseconds

STW 2: Mark Termination

Stop all goroutines
Disable write barriers
Perform final cleanup work
Resume all goroutines
Typical duration: 10-30 microseconds (dramatically reduced after Go 1.8's hybrid write barrier)
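Pause times on your own machine can be read from `runtime.MemStats`. The sketch below forces one cycle and reports the pause recorded for it; `lastPause` is a hypothetical helper, and the indexing follows the documented layout of the `PauseNs` circular buffer.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// lastPause forces one GC cycle and returns how many cycles have completed
// along with the stop-the-world pause time recorded for the latest one.
func lastPause() (cycles uint32, pause time.Duration) {
	runtime.GC()
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	// PauseNs is a circular buffer; the most recent entry is at (NumGC+255)%256.
	return m.NumGC, time.Duration(m.PauseNs[(m.NumGC+255)%256])
}

func main() {
	cycles, pause := lastPause()
	fmt.Printf("after %d cycles, latest pause: %v\n", cycles, pause)
}
```

`runtime.ReadMemStats` itself stops the world briefly, so this suits occasional diagnostics better than a hot path.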

The Four Phases of a GC Cycle

A complete GC cycle comprises four phases:

┌─────────────────────────────────────────────────────────────────┐
│                      GC Cycle                                    │
├───────────┬─────────────┬───────────────┬───────────────────────┤
│  Sweep    │  Mark       │  Mark         │  Sweep                │
│Termination│  (STW 1)    │  Termination  │                       │
│           │  + Marking  │  (STW 2)      │                       │
├───────────┼─────────────┼───────────────┼───────────────────────┤
│ Complete  │ Concurrently│ Complete      │ Concurrently          │
│ previous  │ mark all    │ marking;      │ reclaim white         │
│ sweep     │ live objects│ disable WB    │ objects' memory       │
└───────────┴─────────────┴───────────────┴───────────────────────┘

Phase 1: Sweep Termination

Ensures the previous GC cycle's sweep work is complete
Usually already finished in the background

Phase 2: Mark

STW to enable write barriers (extremely brief)
Then concurrently scans all GC roots and reachable objects
Mark worker goroutines run concurrently with user goroutines
GC uses approximately 25% of CPU (GOMAXPROCS/4 dedicated goroutines)

Phase 3: Mark Termination

STW to disable write barriers
Completes final marking cleanup
Computes the trigger threshold for next GC

Phase 4: Sweep

Concurrently reclaims memory occupied by white objects
Sweep is partly lazy: besides the background sweeper, the allocator sweeps spans on demand before reusing them
The cost is amortized across allocations as small increments instead of being paid in one pause

GC Pacer: Dynamic Scheduling

The GC Pacer is the runtime component responsible for deciding "when to start the next GC." Its goals are:

Heap size does not exceed the target (determined by GOGC/GOMEMLIMIT)
GC CPU usage stays around 25%
Avoid triggering too early or too late

The Pacer uses a feedback controller to achieve these goals:

// Simplified Pacer logic (Go 1.18+ rewrite)
// Target heap = live objects × (1 + GOGC/100)
// Trigger point = target heap - expected allocation during marking

// If 100MB live after last GC, GOGC=100:
// Target heap = 200MB
// Expected marking-phase allocation = 20MB
// Trigger point = 200MB - 20MB = 180MB
// i.e., start GC when heap reaches 180MB, expect to finish marking at 200MB

The Pacer also adjusts predictions based on historical data:

If the last GC overshot the target (heap larger than expected), trigger earlier next time
If the last GC undershot (wasted CPU), trigger later next time

This adaptive mechanism ensures stable and predictable GC behavior.
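The trigger arithmetic and the feedback idea can be sketched as follows. This is a deliberately naive model, not the runtime's controller: `triggerPoint` and `nextEstimate` are hypothetical helpers, and the 50/50 blending of old estimate and new observation is an arbitrary choice made for the illustration.

```go
package main

import "fmt"

// triggerPoint returns the heap size at which marking should start so that it
// finishes near the goal. Hypothetical helper; all sizes in MB.
func triggerPoint(liveMB, gogc, expectedMarkAllocMB float64) float64 {
	goal := liveMB * (1 + gogc/100)
	return goal - expectedMarkAllocMB
}

// nextEstimate blends the previous estimate of mark-phase allocation with
// what was actually observed in the last cycle.
func nextEstimate(prevEstimateMB, observedMB float64) float64 {
	return (prevEstimateMB + observedMB) / 2
}

func main() {
	estimate := 20.0
	fmt.Println("trigger:", triggerPoint(100, 100, estimate), "MB")

	// The last cycle allocated 40 MB during marking and overshot the goal,
	// so the estimate grows and the next trigger moves earlier.
	estimate = nextEstimate(estimate, 40)
	fmt.Println("next trigger:", triggerPoint(100, 100, estimate), "MB")
}
```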

Level 3: What the Specification Defines

The Evolution of Go's GC

The development of Go's GC is a history of relentlessly pursuing lower latency:

Go 1.0 (2012): Full STW

The earliest Go GC was a simple mark-sweep collector that required pausing the entire program. Pause times could reach hundreds of milliseconds or even seconds — catastrophic for network services.

Go 1.1 (2013): Precise GC

Previously, the GC was conservative: it could not distinguish integers from pointers, so it treated any value that looked like a pointer as one. Version 1.1 made heap scanning largely precise: the compiler generates bitmaps for each type, marking which fields are pointers. This greatly reduced memory leaks caused by false pointer identification; stack scanning became precise in Go 1.3.

Go 1.3 (2014): Concurrent Sweep

The sweep phase became concurrent, no longer requiring STW. However, the mark phase remained STW.

Go 1.4 (2014): Runtime Rewritten in Go

The GC was rewritten from C to Go, laying the foundation for concurrent marking. Precise stack information made stack scanning more efficient.

Go 1.5 (2015): Concurrent GC

This was the milestone release. Rick Hudson announced the concurrent GC implementation at GopherCon 2015. The mark phase no longer required prolonged STW; pause times dropped from hundreds of milliseconds to sub-10ms. Used the Dijkstra write barrier, but required re-scanning all stacks at mark termination.

Go 1.6 (2016): Improved GC Scheduling

Introduced a better GC pacer algorithm, reducing mark assist interference with user goroutines.

Go 1.8 (2017): Hybrid Write Barrier

Austin Clements' hybrid write barrier eliminated stack re-scanning at mark termination. STW times dropped below 100 microseconds, essentially independent of heap size and goroutine count. This was a fundamental breakthrough in Go GC latency.

Go 1.12 (2019): Sweep Improvements

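Sweeping became faster when most of the heap stays live, and memory is returned to the OS more aggressively. On Linux the runtime began releasing memory with MADV_FREE, which is cheaper but leaves RSS (Resident Set Size) looking high until the kernel is under memory pressure; Go 1.16 switched the default back to MADV_DONTNEED.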

Go 1.18 (2022): Pacer Rewrite

The GC Pacer was redesigned: the old proportional controller gave way to a proportional-integral (PI) controller, and non-heap GC work such as stacks and globals is now taken into account, improving stability and accuracy. Based on Michael Knyszek's design document.

Go 1.19 (2022): GOMEMLIMIT

Introduced the soft memory limit, solving the problem of GOGC being unaware of container memory ceilings. This enabled the "high GOGC + memory limit" configuration pattern: reduce GC frequency when memory is abundant, automatically accelerate GC when memory is tight.

Correctness Proof of Dijkstra's Write Barrier

Dijkstra's write barrier (Dijkstra, Lamport, Martin, Scholten, Steffens, 1978, "On-the-fly Garbage Collection: An Exercise in Cooperation") is an insertion barrier:

// Dijkstra write barrier pseudocode
writePointer(slot, ptr):
    shade(ptr)       // grey the newly referenced object
    *slot = ptr

Theorem: Dijkstra's write barrier maintains the strong tri-color invariant.

Proof:

The strong tri-color invariant requires: no direct reference from a black object to a white object.

Suppose object A (black) executes A.field = B (B is white).

The write barrier triggers shade(B), marking B grey
After assignment: A (black) → B (grey), which does not violate the strong tri-color invariant ✓

Therefore, at any point, if a black object gains a new reference to some object, that object must have been greyed — it cannot remain white. The strong tri-color invariant is maintained. □

Limitation: Dijkstra's write barrier only protects heap writes. For performance reasons, Go does not enable write barriers on stacks (stack operations are extremely frequent). This means stack pointer modifications can break the invariant, so all goroutine stacks must be re-scanned at Mark Termination. When goroutine counts reach hundreds of thousands, re-scan STW time becomes unacceptable.

Correctness Proof of Yuasa's Write Barrier

Yuasa's write barrier (Yuasa, 1990, "Real-time garbage collection on general-purpose machines") is a deletion barrier (also called snapshot-at-the-beginning):

// Yuasa write barrier pseudocode
writePointer(slot, ptr):
    shade(*slot)     // grey the old value about to be overwritten
    *slot = ptr

Theorem: Yuasa's write barrier maintains the weak tri-color invariant.

Proof:

The weak tri-color invariant requires: if a black object references a white object, that white object must be reachable from some grey object.

Yuasa's barrier's core idea is "preserve the reachability snapshot at GC start." Any object reachable at GC start will not be incorrectly reclaimed at GC end.

Consider the following scenario: grey object C originally references white object B, then C.field = D (breaking C→B).

The write barrier triggers shade(B) (B is the overwritten old value), B is greyed
B will not be missed ✓

Even if black object A subsequently gains a reference to B (A.field = B), since B has already been greyed, it will not be incorrectly reclaimed.

Limitation: Yuasa's barrier may retain more "floating garbage" — objects that become garbage after GC starts cannot be reclaimed in this cycle, requiring the next GC cycle.

The Design and Correctness of the Hybrid Write Barrier

Go 1.8's hybrid write barrier (Austin Clements, 2016, proposal "Eliminate STW stack re-scanning") combines the advantages of both:

// Hybrid write barrier pseudocode
writePointer(slot, ptr):
    shade(*slot)    // Yuasa: protect old reference
    shade(ptr)      // Dijkstra: protect new reference
    *slot = ptr

Key Innovation: Special Stack Treatment

The hybrid write barrier's core insight is: if entire stacks are marked black at GC start (all stack objects considered live), then:

Stack writes need no write barrier (since stack objects are already black)
Heap writes use the hybrid barrier for protection
No stack re-scan needed at Mark Termination

Theorem: The hybrid write barrier maintains the weak tri-color invariant under the "stacks initially black" condition.

Proof sketch:

We consider two cases of pointer modification origin:

Case 1: Heap pointer modification heap_obj.field = ptr

shade(*slot) protects the old value (won't be lost due to disconnection)
shade(ptr) protects the new value (won't be lost due to black-referencing-white)
Weak tri-color invariant maintained ✓

Case 2: Stack pointer modification stack_var = ptr

No write barrier on stacks
But stacks are fully scanned and marked black at GC start
Newly allocated stack objects default to black
If a stack acquires a reference to a white object after being scanned, the pointer must have been loaded from a heap object that is still reachable
As long as that heap slot keeps pointing at the white object, the collector reaches it through the heap; if the slot is overwritten, the deletion half of the barrier shades the white object first

Key invariant: a white object referenced from a black stack always stays protected. A pointer can only reach a stack by being read from the heap, and the heap path it was read through either remains intact (so the object is still reachable from a grey object) or is cut by a heap write whose barrier shades the object. □

Why Go Does Not Use Generational GC

This is a frequently asked question: Java, .NET, and Python all use generational GC — why doesn't Go?

The Generational Hypothesis:

"Most objects die young." Based on this observation, generational GC divides the heap into a Young Generation and an Old Generation, frequently reclaiming the young generation (Minor GC) and occasionally reclaiming the old generation (Major GC), thereby improving efficiency.

Why Go doesn't need it:

1. Escape analysis already does the "generational" work

Go's compiler performs escape analysis: if an object cannot escape its function scope, it is allocated on the stack. Stack objects are automatically reclaimed when the function returns — no GC involvement needed.

func processRequest() {
    // buf doesn't escape → allocated on stack → reclaimed at function return
    buf := make([]byte, 1024)
    // ...
}

In Java, nearly all objects are heap-allocated (escape analysis capability is weaker), so massive numbers of short-lived objects accumulate in the young generation — generational collection yields enormous benefit. In Go, most short-lived objects have already been filtered to the stack by escape analysis; the remaining heap objects have a more uniform lifetime distribution, dramatically reducing the benefit of generational collection.
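The difference between a stack-allocated and an escaping buffer can be measured with `testing.AllocsPerRun`, which is callable from ordinary programs. This is a sketch: `onStack`, `onHeap`, `sink`, and `allocsPerCall` are hypothetical names, and the exact counts depend on the compiler version and its escape analysis.

```go
package main

import (
	"fmt"
	"testing"
)

var sink []byte // package-level variable: anything stored here escapes

//go:noinline
func onStack() byte {
	buf := make([]byte, 1024) // constant size, never leaves the function
	buf[0] = 1
	return buf[0]
}

//go:noinline
func onHeap() {
	sink = make([]byte, 1024) // escapes through the global
}

// allocsPerCall reports the average number of heap allocations per call.
func allocsPerCall(f func()) float64 {
	return testing.AllocsPerRun(1000, f)
}

func main() {
	fmt.Println("non-escaping:", allocsPerCall(func() { _ = onStack() }))
	fmt.Println("escaping:    ", allocsPerCall(onHeap))
}
```

With current compilers I would expect zero allocations for the first function and one per call for the second; building with `-gcflags="-m"` shows the decisions behind those numbers.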

2. Generational GC requires write barriers for cross-generation references

Generational GC must record all "old generation object references young generation object" pointers (Remembered Set). This requires checking whether each pointer write crosses generations, adding write barrier complexity and overhead.

Go's hybrid write barrier is already lightweight — enabled only during GC marking. If generations were introduced, write barriers would need to be permanently enabled (since cross-generation writes can happen at any time). For Go's workloads (numerous goroutines, frequent pointer operations), the overhead may not justify the benefit.

3. Go prioritizes simplicity and predictability

Generational GC introduces numerous tuning parameters (young generation size, promotion thresholds, Minor GC frequency, etc.) and complex behavioral modes. The Go team prefers simple, uniform behavior — "one GC for all workloads."

Comparison with Java G1/ZGC:

Feature	Go GC	Java G1	Java ZGC
Generational	No	Yes	Yes (JDK 21+)
Compacting	No	Yes	Yes
Concurrent marking	Yes	Yes	Yes
Concurrent sweep/reclaim	Yes	Partial (Mixed GC has STW)	Yes
STW pauses	<1ms (typical)	Few ms to tens of ms	<1ms
Max heap	Unlimited	TB scale	16TB
Tuning complexity	Low (GOGC + GOMEMLIMIT)	High (dozens of parameters)	Medium
Throughput	Medium	High	Medium-high

Core trade-off: Go GC sacrifices some throughput (full-heap scanning does more work than generational scanning) in exchange for simplicity and consistent low latency. For most network services, low latency matters more than high throughput.

Rick Hudson's GopherCon 2015 Talk

Rick Hudson's GopherCon 2015 talk "Go GC: Solving the Latency Problem", together with his ISMM 2018 keynote "Getting to Go: The Journey of Go's Garbage Collector", is essential material for understanding Go's GC design philosophy.

Core points:

"Don't collect 'em if you can't serve 'em." — If GC prevents your service from responding to requests, GC is doing more harm than good. Low latency is the first priority.
"The lower bound on GC latency is zero." — If all your data lives on the stack, the GC has zero work to do. Reducing heap allocations is the most effective "GC optimization."
Go 1.5's goal was "10ms STW": in 2015 the Go team reduced GC pauses from hundreds of milliseconds to under 10 ms by making marking concurrent (sweeping had been concurrent since Go 1.3). Sub-millisecond pauses followed with Go 1.8, leaving only the brief STW phases for enabling and disabling write barriers.
"We're going to trade a little throughput for latency." — The Go team explicitly accepted that GC throughput would not match Java's, because Go's target users (network services, microservices) are more latency-sensitive.

Hudson also shared Go GC's long-term vision:

Sub-millisecond pauses (achieved)
GC's impact on program behavior is predictable and observable (achieved via runtime/metrics)
Ultimate goal: make developers forget GC exists

Mathematical Model of GC: Steady-State Analysis

Understanding GC Pacer behavior requires some mathematical tools.

Symbol definitions:

$L$ = live heap size after last GC
$G$ = GOGC / 100 (default 1.0)
$T$ = target heap size = $L \times (1 + G)$
$A$ = allocation during mark phase (between trigger and mark end)
$S$ = scan work during mark phase
$R$ = CPU resources available during mark phase (GOMAXPROCS × 25%)

Trigger condition:

GC should trigger when heap reaches $T - A$, so that marking completes when heap reaches $T$.

$$\text{Trigger} = T - A = L \times (1 + G) - A$$

Pacer's feedback control:

Go 1.18+'s Pacer uses a proportional-integral (PI) controller:

$$\text{error} = \frac{\text{actual\_heap} - T}{T}$$

$$\text{trigger\_adjustment} = K_p \times \text{error} + K_i \times \sum \text{error}$$

where $K_p$ and $K_i$ are controller gain coefficients. This ensures:

If the heap exceeds the target at GC completion, trigger earlier next time (negative feedback)
Accumulated historical error ensures no steady-state bias

GOMEMLIMIT's effect:

When GOMEMLIMIT is set (denoted $M$):

$$T = \min(L \times (1 + G), M)$$

The target heap never exceeds the memory limit. When $L \times (1 + G) > M$, the effective GOGC becomes:

$$G_{\text{effective}} = \frac{M - L}{L} = \frac{M}{L} - 1$$

This is how GOMEMLIMIT automatically lowers the effective GOGC under memory pressure, making GC more aggressive.
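Plugging numbers into the two formulas above shows how quickly the effective GOGC falls as the live heap approaches the limit. The sketch follows the simplified model only, assuming $L < M$; `effectiveGOGC` is a hypothetical helper, not something the runtime exposes.

```go
package main

import "fmt"

// effectiveGOGC applies the simplified model: the heap target is
// min(L*(1+G), M), and the effective GOGC is the percentage by which that
// target exceeds the live heap. Hypothetical helper; sizes in MiB.
func effectiveGOGC(liveMiB, gogc, limitMiB float64) float64 {
	target := liveMiB * (1 + gogc/100)
	if target > limitMiB {
		target = limitMiB
	}
	return (target/liveMiB - 1) * 100
}

func main() {
	for _, live := range []float64{200, 400, 640} {
		fmt.Printf("live=%v MiB -> effective GOGC %.0f\n", live, effectiveGOGC(live, 100, 800))
	}
}
```

With GOGC=100 and an 800 MiB limit, the limit has no effect up to 400 MiB of live data; at 640 MiB the effective GOGC has already dropped to 25.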

Level 4: Edge Cases and Pitfalls

GOGC Tuning in Practice

Scenario 1: CPU-intensive applications (computation, compilation)

Characteristics: Few heap objects; CPU spent on GC is pure waste.

# Increase GOGC, reduce GC frequency
GOGC=200 ./compute-service

# Or if memory is abundant
GOGC=400 ./compute-service

Scenario 2: Memory-intensive applications (cache services)

Characteristics: Large quantities of long-lived heap objects; high GC scanning pressure.

# Use GOMEMLIMIT to cap usage, moderately increase GOGC
GOGC=200 GOMEMLIMIT=6GiB ./cache-service
# Container allocated 8GB memory, give Go process 6GB

Scenario 3: Low-latency services (trading systems)

Characteristics: Every GC pause potentially impacts P99 latency.

// Strategy: reduce heap allocations, give GC nothing to do
// 1. Pre-allocate buffers
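var bufPool = sync.Pool{
    New: func() interface{} {
        // Store a pointer: putting a bare slice into the pool would
        // allocate each time the slice header is boxed in an interface
        b := make([]byte, 4096)
        return &b
    },
}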

// 2. Reasonable GOGC (don't go too low; frequent GC also adds latency)
// GOGC=100 is usually a good default
// If memory is abundant, GOGC=200 can reduce GC frequency

// 3. Manually trigger GC during non-critical periods
func periodicMaintenance() {
    // Proactively trigger GC during traffic valleys to avoid triggering at peaks
    runtime.GC()
}

Scenario 4: Batch processing / offline tasks

Characteristics: Latency irrelevant; only total processing time matters.

# Minimize GC cycles, maximize throughput
GOGC=1000 ./batch-processor
# Or disable GC entirely (short-lived processes that exit after completion)
GOGC=off ./one-shot-task

When to Use runtime.GC()

runtime.GC() forces an immediate full GC cycle. When should you use it?

Appropriate scenarios:

// 1. Clear environment before benchmarks
func BenchmarkFoo(b *testing.B) {
    runtime.GC() // Ensure garbage from previous allocations is cleared
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        foo()
    }
}

// 2. Proactively trigger during predictable low-traffic windows
func gcScheduler(ctx context.Context) {
    ticker := time.NewTicker(1 * time.Minute)
    defer ticker.Stop()
    for {
        select {
        case <-ticker.C:
            if isLowTraffic() {
                runtime.GC()
            }
        case <-ctx.Done():
            return
        }
    }
}

// 3. Immediately reclaim after releasing large temporary objects
func processLargeDataset(data []Record) {
    results := transform(data)
    data = nil  // Drops this function's reference only; the caller must release its own too
    runtime.GC() // Reclaim immediately, free memory for subsequent stages
    save(results)
}

Inappropriate scenarios:

// Wrong: calling runtime.GC() in the request path
func handleRequest(w http.ResponseWriter, r *http.Request) {
    // ❌ Never do this!
    // GC is a global operation affecting all goroutines
    runtime.GC()
    // ...
}

// Wrong: calling runtime.GC() at high frequency
func processItem(item Item) {
    process(item)
    runtime.GC() // ❌ GC once per item? Performance disaster
}

Analyzing GC with pprof

Go's pprof tool provides multiple GC-related analysis perspectives:

1. allocs profile: Identify allocation hotspots

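import _ "net/http/pprof"
// With an HTTP server listening on localhost:6060, for example
// go http.ListenAndServe("localhost:6060", nil), visit
// http://localhost:6060/debug/pprof/allocs

// Or use command line:
// go tool pprof http://localhost:6060/debug/pprof/allocs

# Fetch the allocation profile (sampled, cumulative since program start)
go tool pprof -sample_index=alloc_space http://localhost:6060/debug/pprof/allocs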

# In the pprof interactive interface
(pprof) top 20
(pprof) web  # Generate call graph
(pprof) list functionName  # View specific function's allocations

The allocs profile tells you "where memory is being allocated" — the first step in optimizing GC pressure.
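The URLs above assume that the process exposes the pprof handlers over HTTP. One minimal way to do that is sketched below; `startPprof` and `fetchStatus` are hypothetical helpers, and the port 6060 is only a convention taken from the examples in this chapter.

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* on http.DefaultServeMux
)

// startPprof serves the default mux (and with it the pprof handlers) on addr
// in the background and returns the address actually bound.
func startPprof(addr string) (string, error) {
	ln, err := net.Listen("tcp", addr)
	if err != nil {
		return "", err
	}
	go http.Serve(ln, nil)
	return ln.Addr().String(), nil
}

// fetchStatus issues a GET request and returns the HTTP status code.
func fetchStatus(url string) (int, error) {
	resp, err := http.Get(url)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}

func main() {
	addr, err := startPprof("localhost:6060")
	if err != nil {
		fmt.Println("listen failed:", err)
		return
	}
	code, err := fetchStatus("http://" + addr + "/debug/pprof/allocs?debug=1")
	fmt.Println("allocs endpoint status:", code, err)
}
```

Binding to localhost keeps the profiling endpoints off public interfaces; in production they usually sit behind a separate admin listener.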

2. heap profile: View current heap state

go tool pprof http://localhost:6060/debug/pprof/heap

# Two perspectives (sample indexes):
# inuse_space: currently live objects on heap (default for the heap profile)
# alloc_space: cumulative allocations (including reclaimed)

(pprof) sample_index=inuse_space
(pprof) top   # Who occupies the most heap memory
(pprof) sample_index=alloc_space
(pprof) top   # Who allocated the most memory (GC pressure source)

3. trace: View GC event timeline

# Capture 5 seconds of trace
curl -o trace.out "http://localhost:6060/debug/pprof/trace?seconds=5"
go tool trace trace.out

In the trace view you can observe:

Start and end of each GC cycle
Precise duration of STW phases
Which goroutines were delayed by Mark Assist
How much CPU the GC consumed
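When no HTTP endpoint is available, a trace can also be captured from inside the program with the `runtime/trace` package. This is a sketch under that assumption; `captureTrace` and `keep` are hypothetical names, and the workload exists only to provoke some GC activity.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime"
	"runtime/trace"
)

// captureTrace records an execution trace while work runs and returns the
// raw trace bytes, which can be written to a file for go tool trace.
func captureTrace(work func()) ([]byte, error) {
	var buf bytes.Buffer
	if err := trace.Start(&buf); err != nil {
		return nil, err
	}
	work()
	trace.Stop()
	return buf.Bytes(), nil
}

var keep [][]byte // hypothetical workload state that keeps allocations alive

func main() {
	data, err := captureTrace(func() {
		for i := 0; i < 100; i++ {
			keep = append(keep, make([]byte, 64<<10))
		}
		runtime.GC() // make sure at least one GC cycle appears in the trace
	})
	if err != nil {
		fmt.Println("trace failed:", err)
		return
	}
	fmt.Printf("captured %d bytes of trace data\n", len(data))
}
```

Only one trace can be active at a time, so `trace.Start` returns an error if tracing is already running, for example through the HTTP endpoint.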

4. Runtime statistics

func printGCStats() {
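    var stats debug.GCStats
    // ReadGCStats fills PauseQuantiles only if the caller sizes it first;
    // with 5 entries it reports min, 25%, 50%, 75%, and max
    stats.PauseQuantiles = make([]time.Duration, 5)
    debug.ReadGCStats(&stats)

    fmt.Printf("GC count: %d\n", stats.NumGC)
    if len(stats.Pause) > 0 { // empty until the first GC completes
        fmt.Printf("Last GC pause: %v\n", stats.Pause[0])
    }
    fmt.Printf("Longest GC pause: %v\n", stats.PauseQuantiles[len(stats.PauseQuantiles)-1])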
    
    var mem runtime.MemStats
    runtime.ReadMemStats(&mem)
    fmt.Printf("Heap alloc: %d MB\n", mem.HeapAlloc/1024/1024)
    fmt.Printf("Heap objects: %d\n", mem.HeapObjects)
    fmt.Printf("GC CPU fraction: %.2f%%\n", mem.GCCPUFraction*100)
}

Coding Techniques to Reduce GC Pressure

Technique 1: Object Reuse with sync.Pool

// sync.Pool caches objects between GC cycles
// Prior to Go 1.13: Pool cleared every GC
// Go 1.13+: victim cache retains objects for one extra GC cycle

var bufferPool = sync.Pool{
    New: func() interface{} {
        return bytes.NewBuffer(make([]byte, 0, 4096))
    },
}

func processRequest(data []byte) string {
    buf := bufferPool.Get().(*bytes.Buffer)
    buf.Reset()
    defer bufferPool.Put(buf)
    
    // Use buf to process data...
    buf.Write(data)
    return buf.String()
}

// ⚠️ sync.Pool caveats:
// 1. Pool objects can be reclaimed by GC at any time — don't use for persistent storage
// 2. Always Reset state before Put
// 3. Don't pool objects that are too large (memory waste) or too small (pooling overhead not worth it)

Technique 2: Reduce Escapes

// Escape analysis determines stack vs. heap allocation
// Use -gcflags="-m" to inspect escape decisions:
// go build -gcflags="-m" ./...

// ❌ Escapes to heap: returning pointer to local variable
func newUser(name string) *User {
    u := User{Name: name}  // u escapes to heap
    return &u
}

// ✅ Avoid escape: let caller provide storage
func initUser(u *User, name string) {
    u.Name = name  // u controlled by caller, may remain on stack
}

// ❌ Escapes: interface{} parameters
func doSomething(v interface{}) { ... }
func caller() {
    x := 42
    doSomething(x)  // x is boxed into an interface; the box is heap-allocated when v escapes
}

// ✅ Avoid escape: use concrete types
func doSomethingInt(v int) { ... }

// ❌ Escapes: closure capture
func makeCounter() func() int {
    count := 0  // count escapes (captured by closure)
    return func() int {
        count++
        return count
    }
}

// ❌ Escapes: slice with runtime-determined size
func process(n int) {
    data := make([]byte, n)  // size unknown at compile time → generally heap-allocated
    _ = data
}

// ✅ Avoid escape: compile-time constant size
func process2() {
    data := make([]byte, 1024)  // size known at compile time → may stay on stack
    _ = data
}
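Escape decisions can also be confirmed at run time by counting allocations. The sketch below uses testing.AllocsPerRun from the standard library on the newUser and initUser functions shown earlier. The //go:noinline directives are an assumption added for the demonstration so that inlining does not change the result; allocsNewUser and allocsInitUser are hypothetical helper names.

```go
package main

import (
	"fmt"
	"testing"
)

type User struct{ Name string }

// newUser returns a pointer to a local, so u must outlive the call
// and is placed on the heap.
//
//go:noinline
func newUser(name string) *User {
	u := User{Name: name}
	return &u
}

// initUser writes into caller-provided storage and does not retain u.
//
//go:noinline
func initUser(u *User, name string) {
	u.Name = name
}

// allocsNewUser reports the average heap allocations per newUser call.
func allocsNewUser() float64 {
	return testing.AllocsPerRun(1000, func() {
		u := newUser("gopher")
		_ = u.Name
	})
}

// allocsInitUser reports the average heap allocations per initUser call
// when the caller keeps the User in a local variable.
func allocsInitUser() float64 {
	return testing.AllocsPerRun(1000, func() {
		var u User
		initUser(&u, "gopher")
		_ = u.Name
	})
}

func main() {
	fmt.Println("newUser allocs/op: ", allocsNewUser())
	fmt.Println("initUser allocs/op:", allocsInitUser())
}
```

Counting allocations complements -gcflags="-m": the compiler output explains why a value escapes, while the measurement shows what the hot path actually pays.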

Technique 3: Pre-allocate Slices and Maps

// ❌ Multiple growths → multiple allocations → GC must reclaim old backing arrays
func collectIDs(users []User) []int64 {
    var ids []int64  // initial capacity 0
    for _, u := range users {
        ids = append(ids, u.ID)  // each grow allocates new array
    }
    return ids
}

// ✅ Single allocation
func collectIDs(users []User) []int64 {
    ids := make([]int64, 0, len(users))  // pre-allocate
    for _, u := range users {
        ids = append(ids, u.ID)  // no growth triggered
    }
    return ids
}

// Same principle for maps
m := make(map[string]int, expectedSize)
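The difference between the two collectIDs variants can be measured directly. This is a sketch only: collectGrow and collectPrealloc are hypothetical names for the two versions above (renamed so both compile in one file), and measure is a hypothetical helper built on testing.AllocsPerRun.

```go
package main

import (
	"fmt"
	"testing"
)

type User struct{ ID int64 }

var sink []int64 // keeps results reachable so the allocations are not optimized away

// collectGrow starts from a nil slice and lets append grow it.
//
//go:noinline
func collectGrow(users []User) []int64 {
	var ids []int64
	for _, u := range users {
		ids = append(ids, u.ID)
	}
	return ids
}

// collectPrealloc sizes the backing array once, up front.
//
//go:noinline
func collectPrealloc(users []User) []int64 {
	ids := make([]int64, 0, len(users))
	for _, u := range users {
		ids = append(ids, u.ID)
	}
	return ids
}

// measure reports average heap allocations per call of f on n users.
func measure(f func([]User) []int64, n int) float64 {
	users := make([]User, n)
	return testing.AllocsPerRun(100, func() {
		sink = f(users)
	})
}

func main() {
	fmt.Println("grow allocs/op:    ", measure(collectGrow, 1000))
	fmt.Println("prealloc allocs/op:", measure(collectPrealloc, 1000))
}
```

Every intermediate backing array in the growing version becomes garbage as soon as the next one is allocated, which is the extra work the GC has to absorb.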

Technique 4: Struct Embedding to Avoid Extra Allocations

// ❌ Every field is a pointer → at least 3 heap allocations per Order
type Order struct {
    Customer *Customer
    Items    *[]Item
    Address  *Address
}

// ✅ Store fields by value → Customer and Address live inside the Order itself (Items still needs its backing array)
type Order struct {
    Customer Customer
    Items    []Item
    Address  Address
}
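A rough way to see the effect of the two layouts is to count allocations when constructing each one. In this sketch OrderPtr and OrderVal are hypothetical names for the two Order definitions above, and the field contents of Customer, Address, and Item are invented for illustration:

```go
package main

import (
	"fmt"
	"testing"
)

type Customer struct{ Name string }
type Address struct{ City string }
type Item struct{ SKU string }

// OrderPtr mirrors the pointer-heavy layout.
type OrderPtr struct {
	Customer *Customer
	Items    *[]Item
	Address  *Address
}

// OrderVal mirrors the by-value layout.
type OrderVal struct {
	Customer Customer
	Items    []Item
	Address  Address
}

var (
	sinkPtr *OrderPtr
	sinkVal *OrderVal
)

//go:noinline
func newOrderPtr() *OrderPtr {
	return &OrderPtr{
		Customer: &Customer{Name: "a"},
		Items:    &[]Item{},
		Address:  &Address{City: "b"},
	}
}

//go:noinline
func newOrderVal() *OrderVal {
	return &OrderVal{
		Customer: Customer{Name: "a"},
		Address:  Address{City: "b"},
	}
}

// allocsPtr and allocsVal report average heap allocations per construction.
func allocsPtr() float64 {
	return testing.AllocsPerRun(1000, func() { sinkPtr = newOrderPtr() })
}

func allocsVal() float64 {
	return testing.AllocsPerRun(1000, func() { sinkVal = newOrderVal() })
}

func main() {
	fmt.Println("pointer fields allocs/op:", allocsPtr())
	fmt.Println("value fields allocs/op:  ", allocsVal())
}
```

Fewer, larger objects also mean fewer pointers for the marker to follow, so the benefit shows up in mark time as well as in allocation count.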

Technique 5: Avoid fmt.Sprintf on Hot Paths

// ❌ fmt.Sprintf uses interface{} internally → arguments escape + reflection
func buildKey(prefix string, id int64) string {
    return fmt.Sprintf("%s:%d", prefix, id)
}

// ✅ Use strconv + string concatenation
func buildKey(prefix string, id int64) string {
    return prefix + ":" + strconv.FormatInt(id, 10)
}

// ✅ Or use strings.Builder (for complex concatenation)
func buildComplexKey(parts ...string) string {
    var b strings.Builder
    for i, p := range parts {
        if i > 0 {
            b.WriteByte(':')
        }
        b.WriteString(p)
    }
    return b.String()
}
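Before replacing fmt.Sprintf it is worth checking that the replacement produces the same key and that it actually allocates less on your Go version. A minimal sketch, with buildKeySprintf and buildKeyStrconv as hypothetical names for the two buildKey variants above:

```go
package main

import (
	"fmt"
	"strconv"
	"testing"
)

var sink string // keeps the result reachable so the work is not optimized away

func buildKeySprintf(prefix string, id int64) string {
	return fmt.Sprintf("%s:%d", prefix, id)
}

func buildKeyStrconv(prefix string, id int64) string {
	return prefix + ":" + strconv.FormatInt(id, 10)
}

func main() {
	// Both variants must agree before allocation counts are compared.
	fmt.Println(buildKeySprintf("user", 123456) == buildKeyStrconv("user", 123456))

	a := testing.AllocsPerRun(1000, func() { sink = buildKeySprintf("user", 123456) })
	b := testing.AllocsPerRun(1000, func() { sink = buildKeyStrconv("user", 123456) })
	fmt.Println("Sprintf allocs/op:", a)
	fmt.Println("strconv allocs/op:", b)
}
```

The exact counts depend on the compiler version and on the values involved, so treat the printed numbers as a measurement to repeat rather than a constant.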

Technique 6: Zero-Copy []byte/string Conversion (Use with Caution)

import "unsafe"

// Convert []byte to string without allocation
// (unsafe.String, unsafe.SliceData, and unsafe.StringData require Go 1.20+)
// ⚠️ Precondition: must not modify the original []byte after conversion!
func bytesToString(b []byte) string {
    return unsafe.String(unsafe.SliceData(b), len(b))
}

// Convert string to []byte without allocation
// ⚠️ Precondition: must not modify the returned []byte!
func stringToBytes(s string) []byte {
    return unsafe.Slice(unsafe.StringData(s), len(s))
}

// The compiler already elides the copy in certain scenarios, no unsafe needed
// e.g., map lookup: m[string(byteSlice)] does not allocate
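The map lookup case can be checked without any unsafe code. The sketch below measures a lookup keyed by a []byte converted in place; lookup and lookupAllocs are hypothetical helper names, and the key is deliberately long so that a copy, if one were made, would not fit in a small stack buffer:

```go
package main

import (
	"fmt"
	"testing"
)

var sinkInt int

// lookup converts the byte slice directly inside the index expression,
// which is the form the compiler recognizes.
func lookup(m map[string]int, key []byte) int {
	return m[string(key)]
}

// lookupAllocs reports average heap allocations per lookup.
func lookupAllocs() float64 {
	const k = "a-routing-key-that-is-longer-than-thirty-two-bytes"
	m := map[string]int{k: 7}
	key := []byte(k)
	return testing.AllocsPerRun(1000, func() {
		sinkInt = lookup(m, key)
	})
}

func main() {
	fmt.Println("lookup allocs/op:", lookupAllocs())
}
```

Assigning string(key) to a variable first and indexing with that variable may not get the same treatment, so keep the conversion inside the index expression.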

Real-World Case: GC-Induced P99 Latency Spikes

Case 1: Large Heap + High Allocation Rate

An API gateway service maintained 2GB of routing cache on the heap. With GOGC=100, the GC targeted a heap of about 4GB (2GB live plus 100% headroom). Marking 2GB of live objects required ~200ms of CPU time (distributed across concurrent marking). Under high request volume the allocation rate reached 1GB/s, so the 2GB of headroom filled roughly every two seconds and allocation frequently outpaced marking. Many goroutines were therefore pulled into Mark Assist: the GC forces goroutines that allocate during marking to help with the marking work, which pauses their request processing.

Solution:

# 1. Increase GOGC to reduce GC frequency
GOGC=200 GOMEMLIMIT=6GiB ./gateway

# 2. Move routing cache to mmap or off-heap storage (excluded from GC scanning)
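The numbers in this case follow from the GOGC pacing rule, which can be sketched as arithmetic. heapTarget below is a hypothetical helper implementing a simplified model (live heap plus GOGC percent of live heap); the real pacer also accounts for GC roots such as stacks and globals. The same settings can be applied from code with debug.SetGCPercent and debug.SetMemoryLimit (Go 1.19+):

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// heapTarget is a simplified model of the heap size at which the next
// GC cycle should finish: live heap plus gogc percent of it.
// Assumption: GC roots (stacks, globals) are ignored here.
func heapTarget(liveBytes, gogc uint64) uint64 {
	return liveBytes + liveBytes*gogc/100
}

func main() {
	const GiB = 1 << 30

	fmt.Println("GOGC=100 target (GiB):", heapTarget(2*GiB, 100)/GiB)
	fmt.Println("GOGC=200 target (GiB):", heapTarget(2*GiB, 200)/GiB)

	// Programmatic equivalents of GOGC=200 GOMEMLIMIT=6GiB.
	// Both calls return the previous setting.
	prevPercent := debug.SetGCPercent(200)
	prevLimit := debug.SetMemoryLimit(6 * GiB)
	fmt.Println("previous GOGC:", prevPercent, "previous memory limit:", prevLimit)
}
```

Under this model, doubling GOGC from 100 to 200 doubles the headroom from 2GB to 4GB, which halves the GC frequency at a fixed allocation rate.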

Case 2: Stack Scanning with Massive Goroutine Count

A push notification service maintained 1 million persistent connections, one goroutine each. Before Go 1.8, Mark Termination required re-scanning all stacks — 1 million goroutine stacks caused STW exceeding 100ms.

Solution: Upgrade to Go 1.8+ (hybrid write barrier eliminated stack re-scan). STW dropped to <1ms.

Case 3: Implicit Allocations from Timers

// ❌ Per-iteration allocations in a long-lived connection loop
// (the classic timer variant is time.After(...) inside the loop, which creates a new Timer per iteration)
func handleConn(conn net.Conn) {
    for {
        conn.SetReadDeadline(time.Now().Add(30 * time.Second))
        // time.Now() returns a value and does not allocate;
        // the fresh buffer below is the allocation repeated on every iteration
        buf := make([]byte, 1024)
        n, err := conn.Read(buf)
        // ...
    }
}

// ✅ Reuse Timer
func handleConn(conn net.Conn) {
    timer := time.NewTimer(30 * time.Second)
    defer timer.Stop()
    for {
        timer.Reset(30 * time.Second)
        // ...
    }
}
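A fuller version of the reuse pattern needs to handle a timer that has already fired. The sketch below shows one way to do that; recvWithTimeout and demo are hypothetical helpers, and the channel of ints stands in for whatever the connection loop is actually waiting on. The stop-and-drain step is required before Go 1.23 and harmless afterwards.

```go
package main

import (
	"fmt"
	"time"
)

// recvWithTimeout waits for a value on ch for at most d, reusing t
// instead of allocating a new Timer on every call.
func recvWithTimeout(t *time.Timer, ch <-chan int, d time.Duration) (int, bool) {
	// Stop the timer and drain a stale tick, if any, before re-arming it.
	if !t.Stop() {
		select {
		case <-t.C:
		default:
		}
	}
	t.Reset(d)

	select {
	case v := <-ch:
		return v, true
	case <-t.C:
		return 0, false
	}
}

// demo exercises both outcomes with a single Timer.
func demo() (value int, received bool, timedOut bool) {
	timer := time.NewTimer(time.Hour) // one Timer for the whole loop
	defer timer.Stop()

	ch := make(chan int, 1)
	ch <- 42

	value, received = recvWithTimeout(timer, ch, time.Second)
	_, ok := recvWithTimeout(timer, ch, 10*time.Millisecond)
	return value, received, !ok
}

func main() {
	v, received, timedOut := demo()
	fmt.Println("value:", v, "received:", received, "second call timed out:", timedOut)
}
```

The same helper shape works for a connection loop: create the Timer once per connection and pass it into each wait.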

GC Tuning Checklist

When you suspect GC is a performance bottleneck, follow these diagnostic steps:

1. Confirm it's a GC problem
   □ GODEBUG=gctrace=1 to check GC frequency and pauses
   □ runtime.MemStats.GCCPUFraction > 5% indicates excessive GC CPU usage
   □ go tool trace to check if Mark Assist affects critical paths

2. Locate allocation hotspots
   □ go tool pprof -alloc_space to find top allocators
   □ go build -gcflags="-m" to review escape analysis results
   □ Prioritize optimizing paths with highest (call frequency × per-call allocation)

3. Reduce allocations
   □ sync.Pool for object reuse
   □ Pre-allocate slices/maps
   □ Eliminate unnecessary escapes
   □ Avoid interface{} boxing

4. Adjust GC parameters
   □ Increase GOGC when memory is abundant
   □ Set GOMEMLIMIT in container environments
   □ Verify GOMEMLIMIT < container memory × 0.8

5. Validate improvements
   □ Benchmark comparison (-benchmem)
   □ Canary deployment monitoring P99 latency changes
   □ Monitor GC pause time (/gc/pauses:seconds; /sched/pauses/total/gc:seconds on Go 1.22+)
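For the monitoring steps in this checklist, the runtime/metrics package exposes GC counters without the stop-the-world cost of runtime.ReadMemStats. A minimal sketch, assuming three metric names that have been available since Go 1.16; readGCMetrics and gcCyclesAfterForcedGC are hypothetical helpers:

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/metrics"
)

// readGCMetrics samples a few GC counters and returns them by name.
// Metrics that the running Go version does not support are skipped.
func readGCMetrics() map[string]uint64 {
	names := []string{
		"/gc/cycles/total:gc-cycles", // completed GC cycles
		"/gc/heap/allocs:bytes",      // cumulative bytes allocated on the heap
		"/gc/heap/objects:objects",   // live plus not-yet-swept objects
	}
	samples := make([]metrics.Sample, len(names))
	for i, name := range names {
		samples[i].Name = name
	}
	metrics.Read(samples)

	out := make(map[string]uint64, len(samples))
	for _, s := range samples {
		if s.Value.Kind() == metrics.KindUint64 {
			out[s.Name] = s.Value.Uint64()
		}
	}
	return out
}

// gcCyclesAfterForcedGC forces one collection and reports the cycle count.
func gcCyclesAfterForcedGC() uint64 {
	runtime.GC()
	return readGCMetrics()["/gc/cycles/total:gc-cycles"]
}

func main() {
	fmt.Println("GC cycles after a forced GC:", gcCyclesAfterForcedGC())
	for name, v := range readGCMetrics() {
		fmt.Println(name, "=", v)
	}
}
```

Exporting the deltas of these counters on a fixed interval gives GC frequency and allocation rate, the two numbers the earlier checklist steps ask for.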

Interview FAQ

Q: Go's GC is concurrent, but does it still have STW? When?

A: There are two extremely brief STW phases: (1) enabling write barriers (Mark Setup), and (2) disabling write barriers (Mark Termination). Each typically takes <100 microseconds. Concurrent marking and concurrent sweeping do not require STW.

Q: How expensive are write barriers?

A: Write barriers are only enabled during the GC mark phase (a fraction of total time). When enabled, each heap pointer write executes 2-3 extra instructions (check + shade). Stack writes have no barrier. Measured overhead is typically <5% of total CPU time.

Q: Is GOGC=off safe?

A: Only safe in two scenarios: (1) short-lived processes (exit after completion), (2) combined with GOMEMLIMIT when you are confident sufficient garbage exists to reclaim. For long-running services, GOGC=off is extremely dangerous — if live objects grow continuously, eventual OOM is guaranteed.

Q: Why doesn't Go's GC compact memory?

A: Compaction means moving objects; moving objects means updating all pointers referencing them. In a concurrent GC, updating pointers requires extremely complex synchronization mechanisms (similar to Java ZGC's colored pointers). Go chose the simpler non-compacting design, mitigating fragmentation through a TCMalloc-style memory allocator.

Q: When are sync.Pool objects reclaimed?

A: Before Go 1.13: Pool cleared every GC cycle. Go 1.13+: uses a victim cache mechanism. At GC time, the current Pool contents are moved to the victim position; the next GC then clears the victim. This means pooled objects survive at least one GC cycle, but survival beyond that is not guaranteed.

Q: How to achieve "zero GC"?

A: Strictly speaking, impossible (any heap allocation means GC). But you can approach zero GC pressure: (1) allocate all objects on stack (escape analysis), (2) use mmap/cgo-managed memory (excluded from GC scanning), (3) pre-allocate all needed objects and reuse them. Extreme case: some high-frequency trading systems pre-allocate all memory at startup and run with zero allocations.
