Memory Allocator: tcmalloc Ideas in Go
Memory Allocator: tcmalloc Ideas in Go
Every time you write make([]byte, 1024) or new(MyStruct), an elaborately designed mechanism springs into action. This mechanism determines where memory comes from, in what chunk sizes, how it is returned, and how the GC tracks which memory is still in use.
Most Go developers' understanding of this mechanism stops at "there's a GC that automatically reclaims memory." This understanding is sufficient for everyday coding, but when you start writing high-throughput services, investigating memory leaks, optimizing GC pause times, or trying to understand why your service's RSS suddenly spikes at a particular moment — that surface-level understanding falls far short.
This chapter begins from first principles, explaining why Go's memory allocator is designed the way it is, how its four-tier hierarchy (mheap → mcentral → mcache → mspan) works, which allocation path different object sizes take, and what tools you can use to observe and optimize memory behavior.
Level 1: What You Need to Know
Why a Custom Allocator? The Limitations of malloc
Before discussing Go's allocator, we need to understand the problems it solves. The simplest question: why not just use the C standard library's malloc?
Problem 1: malloc has lock contention in multithreaded scenarios.
Traditional malloc maintains a global free list. Every allocation and deallocation requires acquiring a global lock and updating this list. This is fine in single-threaded programs, but in Go's context there may be thousands of goroutines running simultaneously on a small number of OS threads, generating millions of small object allocations per second. The global lock becomes a severe performance bottleneck.
Experimental data supports this: on an 8-core machine, throughput for malloc/free-intensive programs is often only 2-3x that of single-core, not the expected 8x — lock contention consumes most of the parallelism benefit.
Problem 2: malloc's memory fragmentation.
Traditional malloc uses "best-fit" or "first-fit" strategies: it finds the most precisely fitting or first sufficiently large block in the free list. After long running times, the heap accumulates many small fragments that are individually too small to satisfy new large allocations but large in total. This external fragmentation causes programs to consume far more memory than they actually need.
Problem 3: malloc does not integrate with GC.
GC needs to know precisely which memory addresses store pointers (to trace the object reference graph). Memory allocated by traditional malloc carries no type information — GC cannot distinguish whether a block contains a pointer or an integer. Go's allocator records a pointer map (pointer bitmap) for each object at allocation time, allowing GC to scan precisely without needing to use conservative GC.
tcmalloc's Key Insight: Thread-Local Caches
In 2001, Google's Sanjay Ghemawat and Paul Menage developed tcmalloc (Thread-Caching Malloc), specifically to solve the lock contention problem of multithreaded malloc.
tcmalloc's key insight is very simple: push the free lists for frequently used small objects from a global structure down to per-thread locals.
Traditional malloc:
Thread1 ──┐
Thread2 ──┼──▶ [global lock] → [free list] → allocation
Thread3 ──┘
tcmalloc:
Thread1 → [local cache 1] → lock-free allocation (most cases)
Thread2 → [local cache 2] → lock-free allocation
Thread3 → [local cache 3] → lock-free allocation
│
▼ (lock only when cache is exhausted)
[central cache] → [heap]
Each thread has its own free list; small object allocation and deallocation complete locally with no lock at all. Only when the thread-local cache is exhausted (needs to request more from upper levels) or overflows (needs to return excess to upper levels) does interaction with the central cache occur — and the central cache locks at per-size-class granularity rather than globally.
Size classes are another key design element of tcmalloc. Instead of allocating exactly the right amount for each object, requests are rounded up to one of a few dozen predefined "size classes." For example, a 17-byte allocation actually gets 24 bytes (the next size class); a 100-byte allocation gets 112 bytes.
The benefit: external fragmentation is eliminated. Each size class holds only objects of one size, so freed slots can be directly reused by new objects of the same size — there is no "fragment too small to use" situation. The cost is internal fragmentation: each object wastes at most one size-class interval (typically no more than 12.5%). This trade-off is considered worthwhile in engineering practice.
Go's Allocator: A Redesign on tcmalloc's Shoulders
Go's memory allocator borrows tcmalloc's core ideas but redesigns them deeply, for three reasons:
-
Goroutines are not threads. Go's goroutines are lightweight coroutines running on a small number of OS threads (M). If each OS thread had its own cache, cache-switching when goroutines migrate between threads would cause problems. Go's solution is to bind caches to P (logical processors) rather than OS threads. Since one P only runs one goroutine at a time, access to P-local data is inherently lock-free.
-
Tight GC integration is required. The allocator must maintain metadata that GC needs: pointer bitmaps, object sizes, whether objects contain pointers (to reduce GC scan volume). This metadata is tightly bound to spans (the management unit for memory pages).
-
Large objects and special types need support. Tiny objects (
<16B), ordinary small objects (16B–32KB), and large objects (>32KB) take different allocation paths.noscanobjects (containing no pointers) can be skipped during GC scanning, further reducing GC overhead.
Level 2: The Principle — Four-Tier Memory Hierarchy
Overall Architecture
Go Memory Allocator Hierarchy
goroutine allocation request
│
▼
┌─────────────────────────────────┐
│ mcache (one per P) │ ← Lock-free, fastest path
│ tiny allocator (< 16B) │
│ small free lists (16B ~ 32KB) │
└────────────┬────────────────────┘
│ cache exhausted / overflow
▼
┌─────────────────────────────────┐
│ mcentral (one per size class) │ ← One lock per mcentral
│ list of spans with free slots │
│ list of full spans │
└────────────┬────────────────────┘
│ span exhausted
▼
┌─────────────────────────────────┐
│ mheap (global, single) │ ← Global lock (rare operations)
│ treap of free spans │
│ arena management │
└────────────┬────────────────────┘
│ heap memory insufficient
▼
OS (mmap / VirtualAlloc)
mspan: The Basic Unit of Memory Management
mspan is the basic unit of Go's memory management. It manages a range of contiguous memory pages, with each page being 8KB.
// runtime/mheap.go (simplified)
type mspan struct {
next *mspan // next span in linked list
prev *mspan
startAddr uintptr // start address of the span
npages uintptr // number of pages in the span
spanclass spanClass // size class (with noscan flag)
freeindex uintptr // index of next free slot
nelems uintptr // total number of slots in the span
allocBits *gcBits // bitmap: which slots are allocated
gcmarkBits *gcBits // bitmap: GC mark (which objects are alive)
allocCount uint16 // number of allocated slots
}
Each mspan belongs exclusively to one size class. For example, a size-class-8 mspan holds only 8-byte objects; a size-class-24 mspan holds only 24-byte objects. This "homogeneous" design is the key to eliminating external fragmentation.
mspan lifecycle:
Free pages in mheap
│ mheap.alloc() divides contiguous pages into a span
▼
mcentral's empty span list
│ mcache takes the span
▼
Active span held by mcache
│ object allocation (freeindex advances)
▼
Span is full (all slots allocated)
│ mcache returns span to mcentral
▼
mcentral's full span list
│ GC scans; free slots discovered
▼
mcentral's partial span list (has free slots)
│ span becomes completely empty
▼
Returned to mheap's free page pool
The Size Class System
Go 1.22 has 67 small object size classes (plus size class 0, totaling 68), plus a large object path.
Sample size classes (partial):
class bytes/obj bytes/span objects tail waste max waste
1 8 8192 1024 0 87.50%
2 16 8192 512 0 43.75%
3 24 8192 341 8 29.24%
4 32 8192 256 0 21.88%
5 48 8192 170 32 31.52%
6 64 8192 128 0 23.44%
7 80 8192 102 32 19.07%
...
36 512 8192 16 0 6.05%
37 640 40960 64 0 9.99%
...
66 32768 32768 1 0 12.50%
A request to allocate 18 bytes? Round up to 24 (size class 3), use a slot in mspan3.
The Tiny Allocator: Special Optimization for Very Small Objects
For objects < 16B that contain no pointers (like int8, bool, the content part of short strings, etc.), Go uses a dedicated tiny allocator.
Tiny allocator principle:
16B tiny block
┌────────────────┐
│ obj1 (2B) │
│ obj2 (4B) │
│ obj3 (3B) pad │ ← padded to align to 4B
│ [7B remaining]│ ← tinyoffset points here
└────────────────┘
Next allocation of 5B object:
┌────────────────┐
│ obj1 (2B) │
│ obj2 (4B) │
│ obj3 (3B) pad │
│ obj4 (5B) pad │ ← allocated and aligned to 8B
│ [0B remaining]│ ← block exhausted, request new tiny block
└────────────────┘
The tiny allocator packs multiple small objects into a single 16B slot, significantly reducing mspan usage and the number of objects GC needs to track. Note: only pointer-free objects can use the tiny allocator, because GC scanning processes each 16B slot as a whole unit.
mcache: Each P's Local Cache
Each P (logical processor) has an mcache, whose core is an array of mspan pointers indexed by size class:
// runtime/mcache.go (simplified)
type mcache struct {
// tiny allocator state
tiny uintptr
tinyoffset uintptr
tinyAllocs uintptr
// current active span for each size class
// index: spanClass (size class * 2 + noscan flag)
alloc [numSpanClasses]*mspan
// local cache statistics
local_largefree uintptr
local_nlargefree uintptr
// ...
}
The alloc array has numSpanClasses = 136 slots (67 size classes × 2, distinguishing pointer-containing and pointer-free). Each slot points to the currently active mspan.
Allocation flow (small objects):
// Pseudocode describing the core path of mallocgc
func mallocgc(size uintptr, typ *_type, needzero bool) unsafe.Pointer {
// 1. Get current P's mcache (no lock needed)
c := gomcache()
// 2. Determine size class
var sizeclass uint8
if size <= maxSmallSize {
if size <= maxTinySize && typ.ptrdata == 0 {
// tiny allocation path
return c.tinyAlloc(size)
}
sizeclass = size_to_class[size]
}
// 3. Get a slot from the corresponding span in mcache
span := c.alloc[makeSpanClass(sizeclass, noscan)]
v := span.nextFreeFast() // Lock-free!
if v == 0 {
// 4. Span is full; request a new span from mcentral
v, span = c.nextFree(sizeclass)
}
// 5. Initialize (zero) and return
return unsafe.Pointer(v)
}
nextFreeFast() is the fastest path: it uses freeindex and allocCache (a 64-bit bitmap cache) to quickly locate the next free slot — entirely lock-free, no system calls, approaching the theoretical limit of malloc call speed.
mcentral: The Central Repository for Each Size Class
When mcache's span for a particular size class is exhausted, it requests a new span from mcentral. Each size class has a corresponding mcentral, so the entire system has 136 mcentral instances (also split by pointer-containing vs pointer-free).
// runtime/mcentral.go (simplified)
type mcentral struct {
spanclass spanClass
// Spans with free slots (partial spans)
partial [2]spanSet // [0]: no sweep needed, [1]: needs sweep
// Spans where all slots are allocated (full spans)
full [2]spanSet
}
mcentral operations require locking, but the lock granularity is one lock per size class (not one global lock), dramatically reducing lock contention.
mcentral span acquisition flow:
mcache requests a new span (size class N)
│
▼
mcentral[N].partial[already swept] non-empty?
yes → remove a span and give it to mcache
no
│
▼
mcentral[N].partial[needs sweep] non-empty?
yes → sweep the span, give it to mcache
no
│
▼
mcentral[N].full[needs sweep] non-empty?
yes → sweep; if free slots found, give to mcache; otherwise put back to full
no
│
▼
Request new contiguous pages from mheap, create a new span
mheap: The Global Heap Manager
mheap is the manager of the entire heap, responsible for:
- Requesting large blocks of memory from the OS (via
mmaporVirtualAlloc) - Managing arenas (memory arenas)
- Managing free page allocation (using a treap data structure)
- Maintaining the spans array (mapping addresses to mspan pointers)
// runtime/mheap.go (minimal version)
type mheap struct {
lock mutex // global lock (but operations are infrequent)
// Free page management (using treap)
free mTreap
// Arena management
arenas [1 << arenaL1Bits]*[1 << arenaL2Bits]*heapArena
// Array of central repositories
central [numSpanClasses]struct {
mcentral mcentral
pad [cpu.CacheLinePadSize - ...
}
// Span metadata pool
spanalloc fixalloc
// ...
}
Large objects (> 32KB) go directly to mheap:
// Large object allocation
if size > maxSmallSize {
// Allocate directly from mheap, bypassing mcache and mcentral
s = mheap_.alloc(npages, 0) // allocate by pages
// Large objects use an entire span, with size class 0
}
Large objects bypass the size class system, allocated from mheap directly in however many pages are needed. Each large object allocation requires acquiring mheap's global lock, so frequent large object allocation significantly affects performance.
Level 3: Code Practice
Observing Memory Allocation: GODEBUG Tools
GODEBUG=gcchkmark=1: Verify GC marking correctness
This option enables extra validation during the GC marking phase: after the first marking pass completes, a second full marking pass is done and the two results are compared. If there are differences, it indicates a problem with concurrent GC marking (usually a runtime bug — ordinary users are unlikely to encounter this).
GODEBUG=gcchkmark=1 ./myapp
GODEBUG=gcstoptheworld=1 forces synchronous GC (stop-the-world) for comparing correctness against asynchronous GC:
# Force synchronous GC (stop-the-world), for comparing against async GC correctness
GODEBUG=gcstoptheworld=1 ./myapp
# Print GC statistics
GODEBUG=gctrace=1 ./myapp
Output format for gctrace=1:
gc 1 @0.012s 2%: 0.026+0.99+0.002 ms clock, 0.21+0.46/0.55/0+0.019 ms cpu, 4->4->2 MB, 5 MB goal, 0 MB stacks, 0 MB globals, 8 P
Reading this:
gc 1: 1st GC@0.012s: 12ms after program start2%: GC uses 2% of CPU time0.026+0.99+0.002 ms: three phase durations (STW sweep termination + concurrent mark + STW mark termination)4->4->2 MB: heap size when GC started → heap size when GC ended → size of live objects5 MB goal: trigger threshold for next GC8 P: using 8 Ps
Benchmarking: Quantifying Allocation Cost
// alloc_bench_test.go
package main
import (
"testing"
"sync"
)
// Test 1: stack allocation vs heap allocation
func stackAlloc() int {
x := 0 // allocated on stack
for i := 0; i < 1000; i++ {
x += i
}
return x
}
//go:noinline
func heapAlloc() *int {
x := new(int) // forced heap allocation
*x = 42
return x
}
func BenchmarkStackAlloc(b *testing.B) {
sum := 0
for i := 0; i < b.N; i++ {
sum += stackAlloc()
}
_ = sum
}
func BenchmarkHeapAlloc(b *testing.B) {
for i := 0; i < b.N; i++ {
p := heapAlloc()
_ = *p
}
}
// Test 2: allocation throughput for different object sizes
func BenchmarkAllocSmall(b *testing.B) {
b.ReportAllocs()
for i := 0; i < b.N; i++ {
_ = make([]byte, 64) // small object, goes through mcache
}
}
func BenchmarkAllocLarge(b *testing.B) {
b.ReportAllocs()
for i := 0; i < b.N; i++ {
_ = make([]byte, 64*1024) // large object (64KB), goes through mheap
}
}
// Test 3: scalability of concurrent allocation
func BenchmarkConcurrentAlloc(b *testing.B) {
b.RunParallel(func(pb *testing.PB) {
for pb.Next() {
buf := make([]byte, 256)
_ = buf
}
})
}
$ go test -bench=. -benchmem -cpu=1,4,8 .
BenchmarkStackAlloc-1 500000000 2.1 ns/op 0 B/op 0 allocs/op
BenchmarkHeapAlloc-1 100000000 12.4 ns/op 8 B/op 1 allocs/op
BenchmarkAllocSmall-1 200000000 8.0 ns/op 64 B/op 1 allocs/op
BenchmarkAllocLarge-1 5000000 280.0 ns/op 65536 B/op 1 allocs/op
BenchmarkConcurrentAlloc-8 300000000 4.5 ns/op 256 B/op 1 allocs/op
Key observations:
- Stack allocation (2.1 ns) is ~6x faster than heap allocation (12.4 ns)
- Large object allocation (280 ns) is ~35x slower than small objects (8 ns) — large objects go directly to mheap with a global lock
- Concurrent small object allocation scales well (the -8 variant at 4.5 ns approaches the theoretical 8ns/8 = 1ns ideal)
sync.Pool: The Key to Reducing GC Pressure
sync.Pool is the standard library's object pool, used to reuse short-lived objects and reduce heap allocations and GC pressure.
How it works:
sync.Pool maintains a local object pool (local) and a "victim" pool (available for stealing by other Ps) for each P. Get() preferentially takes objects from the local pool; Put() returns objects to the local pool. During GC, the victim pool is cleared and the local pool becomes the new victim pool — this way, objects are retained for at most one GC cycle, preventing permanent leaks.
// pool_example.go
package main
import (
"bytes"
"fmt"
"sync"
"testing"
)
// Scenario: frequently creating bytes.Buffer in HTTP request handling
var bufPool = sync.Pool{
New: func() interface{} {
return new(bytes.Buffer)
},
}
func processRequest(data []byte) string {
// Get a buffer from the pool
buf := bufPool.Get().(*bytes.Buffer)
defer func() {
buf.Reset() // reset state
bufPool.Put(buf) // return to pool
}()
buf.Write(data)
buf.WriteString(" processed")
return buf.String()
}
// For comparison: without pool
func processRequestNoPool(data []byte) string {
var buf bytes.Buffer // new allocation every time, heap-allocated
buf.Write(data)
buf.WriteString(" processed")
return buf.String()
}
func BenchmarkWithPool(b *testing.B) {
data := []byte("hello world")
b.ReportAllocs()
for i := 0; i < b.N; i++ {
_ = processRequest(data)
}
}
func BenchmarkWithoutPool(b *testing.B) {
data := []byte("hello world")
b.ReportAllocs()
for i := 0; i < b.N; i++ {
_ = processRequestNoPool(data)
}
}
func main() {
data := []byte("hello")
fmt.Println(processRequest(data))
}
$ go test -bench=Benchmark -benchmem .
BenchmarkWithPool-8 50000000 28 ns/op 16 B/op 1 allocs/op
BenchmarkWithoutPool-8 20000000 72 ns/op 112 B/op 2 allocs/op
With sync.Pool, allocations per operation drop from 112B to 16B, with ~2.5x speed improvement.
Important notes on sync.Pool usage:
// Mistake 1: forgetting to Reset after Pool.Get()
buf := pool.Get().(*bytes.Buffer)
buf.Write(newData)
pool.Put(buf) // No Reset! Next caller gets a buffer with leftover data
// Mistake 2: storing stateful connection objects without validating liveness
conn := connPool.Get().(*net.Conn)
// conn may have already timed out! Must ping/check state before use
// Mistake 3: storing large objects causing persistently high memory
// GC only clears the victim pool; if you continuously Put large objects,
// you may accumulate significant memory
runtime.MemStats: Reading Memory State in Depth
// memstats_example.go
package main
import (
"fmt"
"runtime"
"time"
)
func printMemStats(label string) {
var m runtime.MemStats
runtime.ReadMemStats(&m)
fmt.Printf("\n=== %s ===\n", label)
fmt.Printf("Alloc: %7.2f MB (live objects currently on heap)\n", float64(m.Alloc)/1e6)
fmt.Printf("TotalAlloc: %7.2f MB (cumulative bytes allocated)\n", float64(m.TotalAlloc)/1e6)
fmt.Printf("Sys: %7.2f MB (total memory requested from OS)\n", float64(m.Sys)/1e6)
fmt.Printf("HeapAlloc: %7.2f MB (allocated on heap)\n", float64(m.HeapAlloc)/1e6)
fmt.Printf("HeapInuse: %7.2f MB (spans in use on heap)\n", float64(m.HeapInuse)/1e6)
fmt.Printf("HeapIdle: %7.2f MB (idle spans on heap)\n", float64(m.HeapIdle)/1e6)
fmt.Printf("HeapReleased: %7.2f MB (memory returned to OS)\n", float64(m.HeapReleased)/1e6)
fmt.Printf("StackInuse: %7.2f MB (goroutine stack usage)\n", float64(m.StackInuse)/1e6)
fmt.Printf("NumGC: %7d (total GC count)\n", m.NumGC)
fmt.Printf("PauseTotalNs: %7.2f ms (total GC STW pause time)\n", float64(m.PauseTotalNs)/1e6)
if m.NumGC > 0 {
fmt.Printf("LastGC pause: %7.2f ms (most recent GC pause)\n",
float64(m.PauseNs[(m.NumGC+255)%256])/1e6)
}
}
func main() {
printMemStats("Initial state")
// Allocate 100MB of data
data := make([][]byte, 1000)
for i := range data {
data[i] = make([]byte, 100*1024) // 100KB each
}
printMemStats("After allocating 100MB")
// Release references, trigger GC
data = nil
runtime.GC()
printMemStats("After GC")
time.Sleep(time.Second) // wait for background memory return
runtime.GC()
printMemStats("After second GC")
}
Level 4: Advanced Topics and Edge Cases
Memory Fragmentation: Understanding and Mitigation
Go's allocator eliminates external fragmentation through its size class design, but internal fragmentation still exists. More importantly, when large numbers of objects are freed, memory may not be returned to the OS immediately.
How heap fragmentation forms:
Time 1: Allocate 1000 objects of 32KB (large objects, via mheap)
Heap: [xxxxxxxxxxxxxxxxxxxxxxxxx...] 32MB in use
Time 2: Free 500 of them
Heap: [x_x_x_x_x_x_x_x_x_x_x_...] 16MB in use, 16MB free
Time 3: Try to allocate 1 object of 64KB
→ Among the free 32KB blocks, there is no contiguous 64KB!
→ Must request new memory from OS
→ Actual RSS increases, but program "appears" to use only 16MB
This is a typical large-object fragmentation scenario. Go's mheap attempts to merge adjacent free spans (coalescing), but if two 32KB free blocks are not adjacent, they cannot be merged.
Mitigation strategies:
// Use debug.FreeOSMemory() to force memory return to OS (use with care)
import "runtime/debug"
debug.FreeOSMemory() // triggers GC and tries to return maximum memory to OS
// Set GOGC environment variable to control GC trigger threshold
// GOGC=100 (default): GC triggers when heap grows 100%
// GOGC=50: GC triggers when heap grows 50% (more frequent GC, lower memory footprint)
// GOGC=200: GC triggers when heap grows 200% (less frequent GC, higher memory footprint)
// Go 1.19+: GOMEMLIMIT sets a soft memory ceiling
// When process memory exceeds this value, Go runtime GCs more aggressively and returns memory
GOMEMLIMIT=512MiB ./myapp
Arena Allocator (Go 1.20+)
Go 1.20 introduced an experimental arena allocator (arena package, under the goexperiment=arenas build tag), refined to arena.NewArena() in Go 1.21.
The arena's core idea: manage a batch of objects' memory collectively and release them all at once, rather than tracking them individually through GC.
//go:build goexperiment.arenas
package main
import (
"arena"
"fmt"
)
type Node struct {
Value int
Left *Node
Right *Node
}
func buildTree(a *arena.Arena, depth int) *Node {
if depth == 0 {
return nil
}
// Allocate from arena, not tracked by normal GC
node := arena.New[Node](a)
node.Value = depth
node.Left = buildTree(a, depth-1)
node.Right = buildTree(a, depth-1)
return node
}
func main() {
a := arena.NewArena()
defer a.Free() // batch release, no GC involvement
root := buildTree(a, 20) // approximately 2 million nodes
fmt.Println("root value:", root.Value)
// When the function returns, a.Free() releases all nodes at once
// No waiting for GC, no pauses
}
Arena allocator advantages:
- Extremely fast allocation (approaching pointer bumping: each allocation only needs
ptr += size) - Batch release with zero GC overhead
- No fragmentation (releasing the entire arena returns it to the system)
Limitations:
- All objects must have the same lifetime (bound to the arena)
- You cannot store pointers to arena objects in ordinary (non-arena) objects (violates GC write barrier invariants)
- Still experimental API; no stability guarantee
Arena allocators are especially valuable in this scenario: parsing large files, protobufs, or JSON while creating large numbers of short-lived temporary objects. With the traditional approach, these objects trigger heavy GC. With an arena, GC overhead approaches zero.
noscan Objects: Reducing GC Scan Volume
GC's scan phase (mark phase) must traverse all live objects, find their pointer fields, and then trace those pointers to the objects they point to. If an object contains no pointers, it can be entirely skipped during GC scanning, drastically reducing GC's workload.
Go's allocator marks such objects through the span's noscan property. The compiler, when generating code, checks type information to determine whether an object contains pointers and selects the appropriate noscan or regular span.
Engineering practice: designing pointer-free data structures
// Version A: contains pointers (GC scanning required)
type Event struct {
ID int64
Name string // string internally has a pointer (to underlying byte array)
Tags []string // slice internally has pointers
Timestamp int64
}
// Version B: pointer-free (GC can skip)
type EventNoscan struct {
ID int64
NameBuf [32]byte // fixed-size byte array, no pointers
NameLen int
TagBits uint64 // represent tags as a bitmask, no pointers
Timestamp int64
}
In high-frequency allocation scenarios (log processing, event streams), designing data structures as pointer-free types can significantly reduce GC pause times.
runtime.MemStats: A Deep Reference
type MemStats struct {
// --- Heap allocation stats ---
Alloc uint64 // bytes in live objects on heap right now
TotalAlloc uint64 // cumulative bytes allocated over the process lifetime (including freed)
Sys uint64 // total memory requested from OS (all purposes)
Lookups uint64 // deprecated, always 0
Mallocs uint64 // cumulative objects allocated (including tiny-packed objects)
Frees uint64 // cumulative objects freed
// --- Heap details ---
HeapAlloc uint64 // same as Alloc
HeapSys uint64 // heap memory requested from OS (including free-but-not-released)
HeapIdle uint64 // bytes in idle spans (potentially reclaimable by OS)
HeapInuse uint64 // bytes in active spans
HeapReleased uint64 // bytes returned to OS via madvise
HeapObjects uint64 // current number of objects on heap
// --- Stack details ---
StackInuse uint64 // goroutine stack usage
StackSys uint64 // stack memory requested from OS (including system stack)
// --- Off-heap memory (used by Go runtime itself) ---
MSpanInuse uint64 // memory used by mspan structs
MSpanSys uint64
MCacheInuse uint64 // memory used by mcache structs
MCacheSys uint64
// --- GC stats ---
GCSys uint64 // GC metadata memory usage
NextGC uint64 // heap size threshold that will trigger the next GC
LastGC uint64 // Unix nanosecond timestamp of last GC
PauseTotalNs uint64 // cumulative GC STW pause duration (nanoseconds)
PauseNs [256]uint64 // STW pause duration of the last 256 GCs (circular buffer)
NumGC uint32 // total GC count
NumForcedGC uint32 // count of manual runtime.GC() calls
GCCPUFraction float64 // fraction of CPU used by GC (EWMA over past 5 minutes)
}
Practical diagnostic formulas:
Object survival rate = Alloc / TotalAlloc
→ If this is very low (e.g. 0.01), large numbers of short-lived objects exist; GC pressure is high
Heap fragmentation rate = (HeapInuse - HeapAlloc) / HeapInuse
→ High fragmentation means many empty slots within spans
Memory return rate = HeapReleased / HeapIdle
→ Low return rate means OS-level RSS is much higher than actual usage
GC pause fraction = PauseTotalNs / program_runtime_nanoseconds
→ Exceeding 0.1% usually warrants attention
Understanding Go's memory allocator lets you read every line of GODEBUG=gctrace=1 output, accurately locate allocation hot spots in pprof's heap profile, design GC-friendly data structures, and use sync.Pool or arenas in the right contexts to minimize GC pressure.
The memory allocator and GC are the two most complex systems in the Go runtime, but their design has a powerful internal logic: every decision is made to find the optimal balance among throughput, latency, and memory footprint. Once you understand that logic, you are no longer tuning knobs on a black box — you are making evidence-based engineering decisions grounded in how the system actually works.