Chapter 15

Memory Allocator: tcmalloc Ideas in Go

Memory Allocator: tcmalloc Ideas in Go

Every time you write make([]byte, 1024) or new(MyStruct), an elaborately designed mechanism springs into action. This mechanism determines where memory comes from, in what chunk sizes, how it is returned, and how the GC tracks which memory is still in use.

Most Go developers' understanding of this mechanism stops at "there's a GC that automatically reclaims memory." This understanding is sufficient for everyday coding, but when you start writing high-throughput services, investigating memory leaks, optimizing GC pause times, or trying to understand why your service's RSS suddenly spikes at a particular moment โ€” that surface-level understanding falls far short.

This chapter begins from first principles, explaining why Go's memory allocator is designed the way it is, how its four-tier hierarchy (mheap โ†’ mcentral โ†’ mcache โ†’ mspan) works, which allocation path different object sizes take, and what tools you can use to observe and optimize memory behavior.

Level 1: What You Need to Know

Why a Custom Allocator? The Limitations of malloc

Before discussing Go's allocator, we need to understand the problems it solves. The simplest question: why not just use the C standard library's malloc?

Problem 1: malloc has lock contention in multithreaded scenarios.

Traditional malloc maintains a global free list. Every allocation and deallocation requires acquiring a global lock and updating this list. This is fine in single-threaded programs, but in Go's context there may be thousands of goroutines running simultaneously on a small number of OS threads, generating millions of small object allocations per second. The global lock becomes a severe performance bottleneck.

Experimental data supports this: on an 8-core machine, throughput for malloc/free-intensive programs is often only 2-3x that of single-core, not the expected 8x โ€” lock contention consumes most of the parallelism benefit.

Problem 2: malloc's memory fragmentation.

Traditional malloc uses "best-fit" or "first-fit" strategies: it finds the most precisely fitting or first sufficiently large block in the free list. After long running times, the heap accumulates many small fragments that are individually too small to satisfy new large allocations but large in total. This external fragmentation causes programs to consume far more memory than they actually need.

Problem 3: malloc does not integrate with GC.

GC needs to know precisely which memory addresses store pointers (to trace the object reference graph). Memory allocated by traditional malloc carries no type information โ€” GC cannot distinguish whether a block contains a pointer or an integer. Go's allocator records a pointer map (pointer bitmap) for each object at allocation time, allowing GC to scan precisely without needing to use conservative GC.

tcmalloc's Key Insight: Thread-Local Caches

In 2001, Google's Sanjay Ghemawat and Paul Menage developed tcmalloc (Thread-Caching Malloc), specifically to solve the lock contention problem of multithreaded malloc.

tcmalloc's key insight is very simple: push the free lists for frequently used small objects from a global structure down to per-thread locals.

Traditional malloc:
  Thread1 โ”€โ”€โ”
  Thread2 โ”€โ”€โ”ผโ”€โ”€โ–ถ [global lock] โ†’ [free list] โ†’ allocation
  Thread3 โ”€โ”€โ”˜

tcmalloc:
  Thread1 โ†’ [local cache 1] โ†’ lock-free allocation (most cases)
  Thread2 โ†’ [local cache 2] โ†’ lock-free allocation
  Thread3 โ†’ [local cache 3] โ†’ lock-free allocation
               โ”‚
               โ–ผ (lock only when cache is exhausted)
          [central cache] โ†’ [heap]

Each thread has its own free list; small object allocation and deallocation complete locally with no lock at all. Only when the thread-local cache is exhausted (needs to request more from upper levels) or overflows (needs to return excess to upper levels) does interaction with the central cache occur โ€” and the central cache locks at per-size-class granularity rather than globally.

Size classes are another key design element of tcmalloc. Instead of allocating exactly the right amount for each object, requests are rounded up to one of a few dozen predefined "size classes." For example, a 17-byte allocation actually gets 24 bytes (the next size class); a 100-byte allocation gets 112 bytes.

The benefit: external fragmentation is eliminated. Each size class holds only objects of one size, so freed slots can be directly reused by new objects of the same size โ€” there is no "fragment too small to use" situation. The cost is internal fragmentation: each object wastes at most one size-class interval (typically no more than 12.5%). This trade-off is considered worthwhile in engineering practice.

Go's Allocator: A Redesign on tcmalloc's Shoulders

Go's memory allocator borrows tcmalloc's core ideas but redesigns them deeply, for three reasons:

  1. Goroutines are not threads. Go's goroutines are lightweight coroutines running on a small number of OS threads (M). If each OS thread had its own cache, cache-switching when goroutines migrate between threads would cause problems. Go's solution is to bind caches to P (logical processors) rather than OS threads. Since one P only runs one goroutine at a time, access to P-local data is inherently lock-free.

  2. Tight GC integration is required. The allocator must maintain metadata that GC needs: pointer bitmaps, object sizes, whether objects contain pointers (to reduce GC scan volume). This metadata is tightly bound to spans (the management unit for memory pages).

  3. Large objects and special types need support. Tiny objects (<16B), ordinary small objects (16Bโ€“32KB), and large objects (>32KB) take different allocation paths. noscan objects (containing no pointers) can be skipped during GC scanning, further reducing GC overhead.

Level 2: The Principle โ€” Four-Tier Memory Hierarchy

Overall Architecture

Go Memory Allocator Hierarchy

 goroutine allocation request
         โ”‚
         โ–ผ
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚  mcache (one per P)             โ”‚  โ† Lock-free, fastest path
โ”‚  tiny allocator (< 16B)         โ”‚
โ”‚  small free lists (16B ~ 32KB)  โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
             โ”‚ cache exhausted / overflow
             โ–ผ
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚  mcentral (one per size class)  โ”‚  โ† One lock per mcentral
โ”‚  list of spans with free slots  โ”‚
โ”‚  list of full spans             โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
             โ”‚ span exhausted
             โ–ผ
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚  mheap (global, single)         โ”‚  โ† Global lock (rare operations)
โ”‚  treap of free spans            โ”‚
โ”‚  arena management               โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
             โ”‚ heap memory insufficient
             โ–ผ
       OS (mmap / VirtualAlloc)

mspan: The Basic Unit of Memory Management

mspan is the basic unit of Go's memory management. It manages a range of contiguous memory pages, with each page being 8KB.

// runtime/mheap.go (simplified)
type mspan struct {
    next *mspan     // next span in linked list
    prev *mspan
    startAddr uintptr  // start address of the span
    npages    uintptr  // number of pages in the span
    spanclass spanClass  // size class (with noscan flag)
    freeindex uintptr   // index of next free slot
    nelems    uintptr   // total number of slots in the span
    allocBits  *gcBits  // bitmap: which slots are allocated
    gcmarkBits *gcBits  // bitmap: GC mark (which objects are alive)
    allocCount uint16   // number of allocated slots
}

Each mspan belongs exclusively to one size class. For example, a size-class-8 mspan holds only 8-byte objects; a size-class-24 mspan holds only 24-byte objects. This "homogeneous" design is the key to eliminating external fragmentation.

mspan lifecycle:

Free pages in mheap
       โ”‚ mheap.alloc() divides contiguous pages into a span
       โ–ผ
mcentral's empty span list
       โ”‚ mcache takes the span
       โ–ผ
Active span held by mcache
       โ”‚ object allocation (freeindex advances)
       โ–ผ
Span is full (all slots allocated)
       โ”‚ mcache returns span to mcentral
       โ–ผ
mcentral's full span list
       โ”‚ GC scans; free slots discovered
       โ–ผ
mcentral's partial span list (has free slots)
       โ”‚ span becomes completely empty
       โ–ผ
Returned to mheap's free page pool

The Size Class System

Go 1.22 has 67 small object size classes (plus size class 0, totaling 68), plus a large object path.

Sample size classes (partial):
class  bytes/obj  bytes/span  objects  tail waste  max waste
    1          8       8192     1024           0     87.50%
    2         16       8192      512           0     43.75%
    3         24       8192      341          8      29.24%
    4         32       8192      256           0     21.88%
    5         48       8192      170          32     31.52%
    6         64       8192      128           0     23.44%
    7         80       8192      102          32     19.07%
   ...
   36        512       8192       16           0      6.05%
   37        640      40960       64           0      9.99%
   ...
   66      32768      32768        1           0     12.50%

A request to allocate 18 bytes? Round up to 24 (size class 3), use a slot in mspan3.

The Tiny Allocator: Special Optimization for Very Small Objects

For objects < 16B that contain no pointers (like int8, bool, the content part of short strings, etc.), Go uses a dedicated tiny allocator.

Tiny allocator principle:

  16B tiny block
  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
  โ”‚  obj1 (2B)     โ”‚
  โ”‚  obj2 (4B)     โ”‚
  โ”‚  obj3 (3B) pad โ”‚  โ† padded to align to 4B
  โ”‚  [7B remaining]โ”‚  โ† tinyoffset points here
  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Next allocation of 5B object:
  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
  โ”‚  obj1 (2B)     โ”‚
  โ”‚  obj2 (4B)     โ”‚
  โ”‚  obj3 (3B) pad โ”‚
  โ”‚  obj4 (5B) pad โ”‚  โ† allocated and aligned to 8B
  โ”‚  [0B remaining]โ”‚  โ† block exhausted, request new tiny block
  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

The tiny allocator packs multiple small objects into a single 16B slot, significantly reducing mspan usage and the number of objects GC needs to track. Note: only pointer-free objects can use the tiny allocator, because GC scanning processes each 16B slot as a whole unit.

mcache: Each P's Local Cache

Each P (logical processor) has an mcache, whose core is an array of mspan pointers indexed by size class:

// runtime/mcache.go (simplified)
type mcache struct {
    // tiny allocator state
    tiny       uintptr
    tinyoffset uintptr
    tinyAllocs uintptr

    // current active span for each size class
    // index: spanClass (size class * 2 + noscan flag)
    alloc [numSpanClasses]*mspan
    
    // local cache statistics
    local_largefree  uintptr
    local_nlargefree uintptr
    // ...
}

The alloc array has numSpanClasses = 136 slots (67 size classes ร— 2, distinguishing pointer-containing and pointer-free). Each slot points to the currently active mspan.

Allocation flow (small objects):

// Pseudocode describing the core path of mallocgc
func mallocgc(size uintptr, typ *_type, needzero bool) unsafe.Pointer {
    // 1. Get current P's mcache (no lock needed)
    c := gomcache()
    
    // 2. Determine size class
    var sizeclass uint8
    if size <= maxSmallSize {
        if size <= maxTinySize && typ.ptrdata == 0 {
            // tiny allocation path
            return c.tinyAlloc(size)
        }
        sizeclass = size_to_class[size]
    }
    
    // 3. Get a slot from the corresponding span in mcache
    span := c.alloc[makeSpanClass(sizeclass, noscan)]
    v := span.nextFreeFast()  // Lock-free!
    if v == 0 {
        // 4. Span is full; request a new span from mcentral
        v, span = c.nextFree(sizeclass)
    }
    
    // 5. Initialize (zero) and return
    return unsafe.Pointer(v)
}

nextFreeFast() is the fastest path: it uses freeindex and allocCache (a 64-bit bitmap cache) to quickly locate the next free slot โ€” entirely lock-free, no system calls, approaching the theoretical limit of malloc call speed.

mcentral: The Central Repository for Each Size Class

When mcache's span for a particular size class is exhausted, it requests a new span from mcentral. Each size class has a corresponding mcentral, so the entire system has 136 mcentral instances (also split by pointer-containing vs pointer-free).

// runtime/mcentral.go (simplified)
type mcentral struct {
    spanclass spanClass
    // Spans with free slots (partial spans)
    partial [2]spanSet  // [0]: no sweep needed, [1]: needs sweep
    // Spans where all slots are allocated (full spans)
    full    [2]spanSet
}

mcentral operations require locking, but the lock granularity is one lock per size class (not one global lock), dramatically reducing lock contention.

mcentral span acquisition flow:

mcache requests a new span (size class N)
       โ”‚
       โ–ผ
mcentral[N].partial[already swept] non-empty?
   yes โ†’ remove a span and give it to mcache
   no
       โ”‚
       โ–ผ
mcentral[N].partial[needs sweep] non-empty?
   yes โ†’ sweep the span, give it to mcache
   no
       โ”‚
       โ–ผ
mcentral[N].full[needs sweep] non-empty?
   yes โ†’ sweep; if free slots found, give to mcache; otherwise put back to full
   no
       โ”‚
       โ–ผ
Request new contiguous pages from mheap, create a new span

mheap: The Global Heap Manager

mheap is the manager of the entire heap, responsible for:

// runtime/mheap.go (minimal version)
type mheap struct {
    lock  mutex  // global lock (but operations are infrequent)
    
    // Free page management (using treap)
    free  mTreap
    
    // Arena management
    arenas [1 << arenaL1Bits]*[1 << arenaL2Bits]*heapArena
    
    // Array of central repositories
    central [numSpanClasses]struct {
        mcentral mcentral
        pad [cpu.CacheLinePadSize - ...
    }
    
    // Span metadata pool
    spanalloc fixalloc
    // ...
}

Large objects (> 32KB) go directly to mheap:

// Large object allocation
if size > maxSmallSize {
    // Allocate directly from mheap, bypassing mcache and mcentral
    s = mheap_.alloc(npages, 0)  // allocate by pages
    // Large objects use an entire span, with size class 0
}

Large objects bypass the size class system, allocated from mheap directly in however many pages are needed. Each large object allocation requires acquiring mheap's global lock, so frequent large object allocation significantly affects performance.

Level 3: Code Practice

Observing Memory Allocation: GODEBUG Tools

GODEBUG=gcchkmark=1: Verify GC marking correctness

This option enables extra validation during the GC marking phase: after the first marking pass completes, a second full marking pass is done and the two results are compared. If there are differences, it indicates a problem with concurrent GC marking (usually a runtime bug โ€” ordinary users are unlikely to encounter this).

GODEBUG=gcchkmark=1 ./myapp

GODEBUG=gcstoptheworld=1 forces synchronous GC (stop-the-world) for comparing correctness against asynchronous GC:

# Force synchronous GC (stop-the-world), for comparing against async GC correctness
GODEBUG=gcstoptheworld=1 ./myapp

# Print GC statistics
GODEBUG=gctrace=1 ./myapp

Output format for gctrace=1:

gc 1 @0.012s 2%: 0.026+0.99+0.002 ms clock, 0.21+0.46/0.55/0+0.019 ms cpu, 4->4->2 MB, 5 MB goal, 0 MB stacks, 0 MB globals, 8 P

Reading this:

Benchmarking: Quantifying Allocation Cost

// alloc_bench_test.go
package main

import (
    "testing"
    "sync"
)

// Test 1: stack allocation vs heap allocation
func stackAlloc() int {
    x := 0  // allocated on stack
    for i := 0; i < 1000; i++ {
        x += i
    }
    return x
}

//go:noinline
func heapAlloc() *int {
    x := new(int)  // forced heap allocation
    *x = 42
    return x
}

func BenchmarkStackAlloc(b *testing.B) {
    sum := 0
    for i := 0; i < b.N; i++ {
        sum += stackAlloc()
    }
    _ = sum
}

func BenchmarkHeapAlloc(b *testing.B) {
    for i := 0; i < b.N; i++ {
        p := heapAlloc()
        _ = *p
    }
}

// Test 2: allocation throughput for different object sizes
func BenchmarkAllocSmall(b *testing.B) {
    b.ReportAllocs()
    for i := 0; i < b.N; i++ {
        _ = make([]byte, 64)    // small object, goes through mcache
    }
}

func BenchmarkAllocLarge(b *testing.B) {
    b.ReportAllocs()
    for i := 0; i < b.N; i++ {
        _ = make([]byte, 64*1024)  // large object (64KB), goes through mheap
    }
}

// Test 3: scalability of concurrent allocation
func BenchmarkConcurrentAlloc(b *testing.B) {
    b.RunParallel(func(pb *testing.PB) {
        for pb.Next() {
            buf := make([]byte, 256)
            _ = buf
        }
    })
}
$ go test -bench=. -benchmem -cpu=1,4,8 .

BenchmarkStackAlloc-1          500000000    2.1 ns/op    0 B/op    0 allocs/op
BenchmarkHeapAlloc-1           100000000   12.4 ns/op    8 B/op    1 allocs/op
BenchmarkAllocSmall-1          200000000    8.0 ns/op   64 B/op    1 allocs/op
BenchmarkAllocLarge-1            5000000  280.0 ns/op 65536 B/op   1 allocs/op
BenchmarkConcurrentAlloc-8     300000000    4.5 ns/op  256 B/op    1 allocs/op

Key observations:

  1. Stack allocation (2.1 ns) is ~6x faster than heap allocation (12.4 ns)
  2. Large object allocation (280 ns) is ~35x slower than small objects (8 ns) โ€” large objects go directly to mheap with a global lock
  3. Concurrent small object allocation scales well (the -8 variant at 4.5 ns approaches the theoretical 8ns/8 = 1ns ideal)

sync.Pool: The Key to Reducing GC Pressure

sync.Pool is the standard library's object pool, used to reuse short-lived objects and reduce heap allocations and GC pressure.

How it works:

sync.Pool maintains a local object pool (local) and a "victim" pool (available for stealing by other Ps) for each P. Get() preferentially takes objects from the local pool; Put() returns objects to the local pool. During GC, the victim pool is cleared and the local pool becomes the new victim pool โ€” this way, objects are retained for at most one GC cycle, preventing permanent leaks.

// pool_example.go
package main

import (
    "bytes"
    "fmt"
    "sync"
    "testing"
)

// Scenario: frequently creating bytes.Buffer in HTTP request handling
var bufPool = sync.Pool{
    New: func() interface{} {
        return new(bytes.Buffer)
    },
}

func processRequest(data []byte) string {
    // Get a buffer from the pool
    buf := bufPool.Get().(*bytes.Buffer)
    defer func() {
        buf.Reset()          // reset state
        bufPool.Put(buf)     // return to pool
    }()
    
    buf.Write(data)
    buf.WriteString(" processed")
    return buf.String()
}

// For comparison: without pool
func processRequestNoPool(data []byte) string {
    var buf bytes.Buffer  // new allocation every time, heap-allocated
    buf.Write(data)
    buf.WriteString(" processed")
    return buf.String()
}

func BenchmarkWithPool(b *testing.B) {
    data := []byte("hello world")
    b.ReportAllocs()
    for i := 0; i < b.N; i++ {
        _ = processRequest(data)
    }
}

func BenchmarkWithoutPool(b *testing.B) {
    data := []byte("hello world")
    b.ReportAllocs()
    for i := 0; i < b.N; i++ {
        _ = processRequestNoPool(data)
    }
}

func main() {
    data := []byte("hello")
    fmt.Println(processRequest(data))
}
$ go test -bench=Benchmark -benchmem .
BenchmarkWithPool-8       50000000    28 ns/op    16 B/op    1 allocs/op
BenchmarkWithoutPool-8    20000000    72 ns/op   112 B/op    2 allocs/op

With sync.Pool, allocations per operation drop from 112B to 16B, with ~2.5x speed improvement.

Important notes on sync.Pool usage:

// Mistake 1: forgetting to Reset after Pool.Get()
buf := pool.Get().(*bytes.Buffer)
buf.Write(newData)
pool.Put(buf)  // No Reset! Next caller gets a buffer with leftover data

// Mistake 2: storing stateful connection objects without validating liveness
conn := connPool.Get().(*net.Conn)
// conn may have already timed out! Must ping/check state before use

// Mistake 3: storing large objects causing persistently high memory
// GC only clears the victim pool; if you continuously Put large objects,
// you may accumulate significant memory

runtime.MemStats: Reading Memory State in Depth

// memstats_example.go
package main

import (
    "fmt"
    "runtime"
    "time"
)

func printMemStats(label string) {
    var m runtime.MemStats
    runtime.ReadMemStats(&m)
    
    fmt.Printf("\n=== %s ===\n", label)
    fmt.Printf("Alloc:        %7.2f MB  (live objects currently on heap)\n", float64(m.Alloc)/1e6)
    fmt.Printf("TotalAlloc:   %7.2f MB  (cumulative bytes allocated)\n", float64(m.TotalAlloc)/1e6)
    fmt.Printf("Sys:          %7.2f MB  (total memory requested from OS)\n", float64(m.Sys)/1e6)
    fmt.Printf("HeapAlloc:    %7.2f MB  (allocated on heap)\n", float64(m.HeapAlloc)/1e6)
    fmt.Printf("HeapInuse:    %7.2f MB  (spans in use on heap)\n", float64(m.HeapInuse)/1e6)
    fmt.Printf("HeapIdle:     %7.2f MB  (idle spans on heap)\n", float64(m.HeapIdle)/1e6)
    fmt.Printf("HeapReleased: %7.2f MB  (memory returned to OS)\n", float64(m.HeapReleased)/1e6)
    fmt.Printf("StackInuse:   %7.2f MB  (goroutine stack usage)\n", float64(m.StackInuse)/1e6)
    fmt.Printf("NumGC:        %7d      (total GC count)\n", m.NumGC)
    fmt.Printf("PauseTotalNs: %7.2f ms  (total GC STW pause time)\n", float64(m.PauseTotalNs)/1e6)
    
    if m.NumGC > 0 {
        fmt.Printf("LastGC pause: %7.2f ms  (most recent GC pause)\n",
            float64(m.PauseNs[(m.NumGC+255)%256])/1e6)
    }
}

func main() {
    printMemStats("Initial state")
    
    // Allocate 100MB of data
    data := make([][]byte, 1000)
    for i := range data {
        data[i] = make([]byte, 100*1024)  // 100KB each
    }
    printMemStats("After allocating 100MB")
    
    // Release references, trigger GC
    data = nil
    runtime.GC()
    printMemStats("After GC")
    
    time.Sleep(time.Second)  // wait for background memory return
    runtime.GC()
    printMemStats("After second GC")
}

Level 4: Advanced Topics and Edge Cases

Memory Fragmentation: Understanding and Mitigation

Go's allocator eliminates external fragmentation through its size class design, but internal fragmentation still exists. More importantly, when large numbers of objects are freed, memory may not be returned to the OS immediately.

How heap fragmentation forms:

Time 1: Allocate 1000 objects of 32KB (large objects, via mheap)
   Heap: [xxxxxxxxxxxxxxxxxxxxxxxxx...] 32MB in use

Time 2: Free 500 of them
   Heap: [x_x_x_x_x_x_x_x_x_x_x_...] 16MB in use, 16MB free

Time 3: Try to allocate 1 object of 64KB
   โ†’ Among the free 32KB blocks, there is no contiguous 64KB!
   โ†’ Must request new memory from OS
   โ†’ Actual RSS increases, but program "appears" to use only 16MB

This is a typical large-object fragmentation scenario. Go's mheap attempts to merge adjacent free spans (coalescing), but if two 32KB free blocks are not adjacent, they cannot be merged.

Mitigation strategies:

// Use debug.FreeOSMemory() to force memory return to OS (use with care)
import "runtime/debug"
debug.FreeOSMemory()  // triggers GC and tries to return maximum memory to OS

// Set GOGC environment variable to control GC trigger threshold
// GOGC=100 (default): GC triggers when heap grows 100%
// GOGC=50: GC triggers when heap grows 50% (more frequent GC, lower memory footprint)
// GOGC=200: GC triggers when heap grows 200% (less frequent GC, higher memory footprint)

// Go 1.19+: GOMEMLIMIT sets a soft memory ceiling
// When process memory exceeds this value, Go runtime GCs more aggressively and returns memory
GOMEMLIMIT=512MiB ./myapp

Arena Allocator (Go 1.20+)

Go 1.20 introduced an experimental arena allocator (arena package, under the goexperiment=arenas build tag), refined to arena.NewArena() in Go 1.21.

The arena's core idea: manage a batch of objects' memory collectively and release them all at once, rather than tracking them individually through GC.

//go:build goexperiment.arenas

package main

import (
    "arena"
    "fmt"
)

type Node struct {
    Value int
    Left  *Node
    Right *Node
}

func buildTree(a *arena.Arena, depth int) *Node {
    if depth == 0 {
        return nil
    }
    // Allocate from arena, not tracked by normal GC
    node := arena.New[Node](a)
    node.Value = depth
    node.Left = buildTree(a, depth-1)
    node.Right = buildTree(a, depth-1)
    return node
}

func main() {
    a := arena.NewArena()
    defer a.Free()  // batch release, no GC involvement

    root := buildTree(a, 20)  // approximately 2 million nodes
    fmt.Println("root value:", root.Value)
    // When the function returns, a.Free() releases all nodes at once
    // No waiting for GC, no pauses
}

Arena allocator advantages:

Limitations:

Arena allocators are especially valuable in this scenario: parsing large files, protobufs, or JSON while creating large numbers of short-lived temporary objects. With the traditional approach, these objects trigger heavy GC. With an arena, GC overhead approaches zero.

noscan Objects: Reducing GC Scan Volume

GC's scan phase (mark phase) must traverse all live objects, find their pointer fields, and then trace those pointers to the objects they point to. If an object contains no pointers, it can be entirely skipped during GC scanning, drastically reducing GC's workload.

Go's allocator marks such objects through the span's noscan property. The compiler, when generating code, checks type information to determine whether an object contains pointers and selects the appropriate noscan or regular span.

Engineering practice: designing pointer-free data structures

// Version A: contains pointers (GC scanning required)
type Event struct {
    ID        int64
    Name      string    // string internally has a pointer (to underlying byte array)
    Tags      []string  // slice internally has pointers
    Timestamp int64
}

// Version B: pointer-free (GC can skip)
type EventNoscan struct {
    ID        int64
    NameBuf   [32]byte  // fixed-size byte array, no pointers
    NameLen   int
    TagBits   uint64    // represent tags as a bitmask, no pointers
    Timestamp int64
}

In high-frequency allocation scenarios (log processing, event streams), designing data structures as pointer-free types can significantly reduce GC pause times.

runtime.MemStats: A Deep Reference

type MemStats struct {
    // --- Heap allocation stats ---
    Alloc      uint64  // bytes in live objects on heap right now
    TotalAlloc uint64  // cumulative bytes allocated over the process lifetime (including freed)
    Sys        uint64  // total memory requested from OS (all purposes)
    Lookups    uint64  // deprecated, always 0
    Mallocs    uint64  // cumulative objects allocated (including tiny-packed objects)
    Frees      uint64  // cumulative objects freed
    
    // --- Heap details ---
    HeapAlloc    uint64  // same as Alloc
    HeapSys      uint64  // heap memory requested from OS (including free-but-not-released)
    HeapIdle     uint64  // bytes in idle spans (potentially reclaimable by OS)
    HeapInuse    uint64  // bytes in active spans
    HeapReleased uint64  // bytes returned to OS via madvise
    HeapObjects  uint64  // current number of objects on heap
    
    // --- Stack details ---
    StackInuse  uint64  // goroutine stack usage
    StackSys    uint64  // stack memory requested from OS (including system stack)
    
    // --- Off-heap memory (used by Go runtime itself) ---
    MSpanInuse  uint64  // memory used by mspan structs
    MSpanSys    uint64
    MCacheInuse uint64  // memory used by mcache structs
    MCacheSys   uint64
    
    // --- GC stats ---
    GCSys        uint64  // GC metadata memory usage
    NextGC       uint64  // heap size threshold that will trigger the next GC
    LastGC        uint64  // Unix nanosecond timestamp of last GC
    PauseTotalNs uint64  // cumulative GC STW pause duration (nanoseconds)
    PauseNs      [256]uint64  // STW pause duration of the last 256 GCs (circular buffer)
    NumGC        uint32  // total GC count
    NumForcedGC  uint32  // count of manual runtime.GC() calls
    GCCPUFraction float64  // fraction of CPU used by GC (EWMA over past 5 minutes)
}

Practical diagnostic formulas:

Object survival rate = Alloc / TotalAlloc
  โ†’ If this is very low (e.g. 0.01), large numbers of short-lived objects exist; GC pressure is high

Heap fragmentation rate = (HeapInuse - HeapAlloc) / HeapInuse
  โ†’ High fragmentation means many empty slots within spans

Memory return rate = HeapReleased / HeapIdle
  โ†’ Low return rate means OS-level RSS is much higher than actual usage

GC pause fraction = PauseTotalNs / program_runtime_nanoseconds
  โ†’ Exceeding 0.1% usually warrants attention

Understanding Go's memory allocator lets you read every line of GODEBUG=gctrace=1 output, accurately locate allocation hot spots in pprof's heap profile, design GC-friendly data structures, and use sync.Pool or arenas in the right contexts to minimize GC pressure.

The memory allocator and GC are the two most complex systems in the Go runtime, but their design has a powerful internal logic: every decision is made to find the optimal balance among throughput, latency, and memory footprint. Once you understand that logic, you are no longer tuning knobs on a black box โ€” you are making evidence-based engineering decisions grounded in how the system actually works.

Rate this chapter
4.8  / 5  (22 ratings)

๐Ÿ’ฌ Comments