Chapter 24

Performance Analysis: pprof, trace and Benchmark

There is a category of engineer who, on encountering a performance problem, immediately opens the code, locates the section that "looks slow" by intuition, and starts optimizing. This practice has a name: Guess-Driven Development.

The typical outcome: two days spent tuning code that was not the bottleneck, no measurable performance gain, and three new bugs for the trouble.

Real performance engineering has one inviolable rule: measure first, then optimize. No measurement, no optimization.

This is not a suggestion — it is a constraint. Donald Knuth's 1974 observation that "premature optimization is the root of all evil" is cited endlessly and followed rarely. The reason is that intuition-driven optimization carries a seductive sense of certainty — "I know this is slow because..." — but program performance almost never behaves as intuition predicts, especially in a garbage-collected language.

Go ships a complete performance analysis toolchain: pprof (CPU, memory, goroutine, block, and mutex profiles), trace (concurrency visualization), and testing/benchmark (reproducible performance experiments). Mastering this toolchain is the engineering entry ticket to switching from guessing to measuring.

Level 1: The Performance Analysis Mindset

Why Intuition Fails

Modern program performance is determined by a constellation of nonlinear factors:

CPU cache hierarchy: the same code runs 10x slower with a different access pattern
Memory allocator: small allocations look cheap individually; GC pressure from millions of them is invisible until it explodes
Goroutine scheduling: switch timing is opaque; lock contention does not appear until you profile it
OS scheduling: when an M (OS thread) gets preempted is not under Go's control
Microarchitecture effects: branch prediction, instruction pipelining, and NUMA topology all affect hot paths

Intuition models "source code complexity," not "runtime performance cost." The correlation between these two is far weaker than it feels.

The Correct Sequence

1. Define target metric (P99 latency? throughput? peak memory?)
       │
       ▼
2. Establish a baseline (benchmark or production monitoring data)
       │
       ▼
3. Collect a profile (CPU / memory / goroutine)
       │
       ▼
4. Identify the bottleneck (which function uses the most CPU? where are allocations?)
       │
       ▼
5. Make exactly one change (control variables)
       │
       ▼
6. Re-run the benchmark, compare numbers
       │
       ▼
7. If improved, continue. If not, revert.

Every step requires data. "Feels faster" is not evidence.

Common Premature Optimization Mistakes

Mistake 1: Optimizing a non-hot path

A function called a billion times at 1 ns each contributes 1 s of CPU time. A function called once taking 10 ms contributes 10 ms. Optimizing the billion-call function when the one-call function is the bottleneck is pure waste. Only a profile tells you which is which.

Mistake 2: Equating "allocation" with "slow"

In Go, a small heap allocation costs ~10 ns — fast enough to ignore in isolation. The cost is GC: millions of allocations force GC cycles, whose concurrent mark phase and stop-the-world pauses show up as latency spikes. The allocs profile finds allocation hotspots; it does not mean you should fear every make().

Mistake 3: Micro-optimizing without checking the macro picture

Shaving a function from 50 ns to 40 ns is meaningless if that function accounts for 0.1% of total runtime. Confirm with pprof's flat/cumulative breakdown that a function is actually hot before micro-optimizing it.

Level 2: pprof — Principles and Usage

Enabling the net/http/pprof Endpoint

A single blank import wires pprof into your HTTP server:

import _ "net/http/pprof"

The init() function in that package registers the following routes on DefaultServeMux:

GET /debug/pprof/              overview page (HTML)
GET /debug/pprof/profile       CPU profile (default 30 s)
GET /debug/pprof/heap          heap snapshot
GET /debug/pprof/goroutine     stack traces of all goroutines
GET /debug/pprof/allocs        memory allocation profile
GET /debug/pprof/block         goroutine blocking events
GET /debug/pprof/mutex         mutex contention
GET /debug/pprof/threadcreate  OS thread creation
GET /debug/pprof/trace         runtime execution trace

If your service uses a custom router (not DefaultServeMux), register manually:

import (
    "net/http/pprof"
    "github.com/gorilla/mux"
)

func registerPprof(r *mux.Router) {
    r.HandleFunc("/debug/pprof/", pprof.Index)
    r.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
    r.HandleFunc("/debug/pprof/profile", pprof.Profile)
    r.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
    r.HandleFunc("/debug/pprof/trace", pprof.Trace)
    r.Handle("/debug/pprof/goroutine", pprof.Handler("goroutine"))
    r.Handle("/debug/pprof/heap", pprof.Handler("heap"))
    r.Handle("/debug/pprof/allocs", pprof.Handler("allocs"))
    r.Handle("/debug/pprof/block", pprof.Handler("block"))
    r.Handle("/debug/pprof/mutex", pprof.Handler("mutex"))
}

Production security: pprof endpoints expose program internals. Restrict them to loopback or internal networks:

// Expose pprof on a separate internal port
go func() {
    log.Println(http.ListenAndServe("localhost:6060", nil))
}()

CPU Profile: Sampling and Interpretation

The CPU profiler is a sampling profiler. The Go runtime sends SIGPROF to the process every 10 ms (default), capturing the call stacks of all running M threads at that instant.

Sampling rate:    100 Hz (one sample every 10 ms)
Default duration: 30 s  → ~3000 sample points
Profile content:  complete call stack at each sample
Interpretation:   number of times a function appears ≈ fraction of CPU time

Collecting a CPU profile:

# Via HTTP endpoint (production)
go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30

# Programmatically
import "runtime/pprof"

f, _ := os.Create("cpu.prof")
pprof.StartCPUProfile(f)
defer pprof.StopCPUProfile()
// ... run the code to analyze ...

Analyzing the profile:

go tool pprof cpu.prof

# Common commands
(pprof) top10           # sorted by flat time; top 10 functions
(pprof) top10 -cum      # sorted by cumulative time
(pprof) list funcName   # source-annotated view for one function
(pprof) web             # SVG call graph in browser (requires graphviz)

Flat vs cumulative — what each means:

Call tree:
main()
  └── processRequest()        cumulative = 1000 ms
        ├── parseJSON()            flat = 200 ms, cumulative = 200 ms
        ├── queryDB()              flat = 100 ms, cumulative = 700 ms
        │     └── marshalSQL()         flat = 600 ms, cumulative = 600 ms
        └── serialize()            flat = 100 ms, cumulative = 100 ms

flat time:       time the function itself ran (excluding callees)
cumulative time: total time including all callees in the subtree

marshalSQL has the highest flat → it is the true CPU hotspot
queryDB has high cumulative → it is the entry point to the slow path

Memory Profile: allocs vs heap

Go offers two distinct memory profiles:

allocs profile (allocation trace):

Records every heap allocation since program start, including objects already collected
Use to find what is generating GC pressure
Sampled: one sample per 512 KB of allocations (by default)

heap profile (heap snapshot):

Records currently live heap objects
Use to find memory leaks ("who holds large amounts of memory without releasing it")
Reflects the actual heap state after the last GC

# Allocation profile
go tool pprof http://localhost:6060/debug/pprof/allocs

# Heap snapshot
go tool pprof http://localhost:6060/debug/pprof/heap

# Switch sample index inside the interactive session
(pprof) sample_index=alloc_space    # bytes allocated (allocs profile)
(pprof) sample_index=alloc_objects  # object count allocated
(pprof) sample_index=inuse_space    # bytes currently live (heap profile)
(pprof) sample_index=inuse_objects  # live object count

Goroutine Profile: Detecting Leaks

go tool pprof http://localhost:6060/debug/pprof/goroutine

# Raw text output — often more readable
curl http://localhost:6060/debug/pprof/goroutine?debug=2

Example output (excerpt):

goroutine 1 [running]:
main.main()
    /app/main.go:45 +0x1a

goroutine 18 [chan receive, 3 minutes]:   ← waiting 3 min — likely leaked
net/http.(*persistConn).readLoop(0xc0001a4200)
    /usr/local/go/src/net/http/transport.go:2205 +0x99

goroutine 42 [select]:
database/sql.(*DB).connectionOpener(0xc0000b4000, 0x...)
    /usr/local/go/src/database/sql/sql.go:1189 +0x66

A monotonically increasing goroutine count over time signals a leak. The goroutine profile shows which type of goroutine is accumulating.

block Profile and mutex Profile

Both are disabled by default:

import "runtime"

// block profile: records time goroutines spend blocked on synchronization
// rate=1 captures everything (expensive); use 10000 in production
runtime.SetBlockProfileRate(10000)  // sample one event per 10000 ns of blocking

// mutex profile: records mutex contention events
// fraction=5 means roughly 1/5 of contention events are recorded
runtime.SetMutexProfileFraction(5)

go tool pprof http://localhost:6060/debug/pprof/block
go tool pprof http://localhost:6060/debug/pprof/mutex

block profile reveals:

Channel send/receive stalls
sync.Mutex.Lock() waiting time
Syscall blocking time (for goroutines not managed by netpoll)

mutex profile reveals:

Functions holding a lock for a long time
Highly contended mutexes with many goroutines waiting

Level 3: Code Practice

Writing Benchmarks Correctly

// string_bench_test.go
package strutil_test

import (
    "fmt"
    "math/rand"
    "sort"
    "strings"
    "testing"
    "unicode/utf8"
)

// Benchmark functions must be named Benchmark*
func BenchmarkStringContains(b *testing.B) {
    haystack := strings.Repeat("hello world ", 1000)
    needle := "world"

    // b.ResetTimer() after expensive setup
    // so setup time is excluded from the measurement
    b.ResetTimer()

    for i := 0; i < b.N; i++ {
        _ = strings.Contains(haystack, needle)
    }
}

// Comparative benchmarks — same operation, two implementations
func BenchmarkRuneCount_Len(b *testing.B) {
    s := strings.Repeat("Hello World", 1000)
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        _ = len(s) // byte count — O(1)
    }
}

func BenchmarkRuneCount_Utf8(b *testing.B) {
    s := strings.Repeat("Hello World", 1000)
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        _ = utf8.RuneCountInString(s) // rune count — O(n)
    }
}

// b.ReportAllocs() forces allocation output for this benchmark only.
// Equivalent to -benchmem but scoped to a single function.
func BenchmarkStringBuilder(b *testing.B) {
    b.ReportAllocs()
    b.ResetTimer()

    for i := 0; i < b.N; i++ {
        var sb strings.Builder
        for j := 0; j < 100; j++ {
            sb.WriteString("hello ")
        }
        _ = sb.String()
    }
}

// Table-driven benchmarks for scaling analysis
func BenchmarkSort(b *testing.B) {
    for _, size := range []int{10, 100, 1000, 10000} {
        b.Run(fmt.Sprintf("n=%d", size), func(b *testing.B) {
            src := make([]int, size)
            for i := range src {
                src[i] = rand.Intn(10000)
            }
            b.ResetTimer()
            for i := 0; i < b.N; i++ {
                tmp := make([]int, len(src))
                copy(tmp, src)
                sort.Ints(tmp)
            }
        })
    }
}

Running benchmarks:

# Basic run with allocation stats
go test -bench=. -benchmem ./...

# Longer run for stability (default is 1 s per benchmark)
go test -bench=BenchmarkStringContains -benchtime=10s -count=5

# Run a specific sub-benchmark (regex match)
go test -bench=BenchmarkSort/n=1000 -benchmem

# Collect CPU and memory profiles while benchmarking
go test -bench=. -cpuprofile=cpu.prof -memprofile=mem.prof
go tool pprof cpu.prof

benchstat: Statistical Comparison

Single-run benchmark output contains noise. Use multiple runs and benchstat to establish statistical significance:

# Install
go install golang.org/x/perf/cmd/benchstat@latest

# Collect baseline — 10 runs
go test -bench=BenchmarkStringBuilder -count=10 > old.txt

# Apply your change, then collect again
go test -bench=BenchmarkStringBuilder -count=10 > new.txt

# Compare
benchstat old.txt new.txt

Example output:

name              old time/op    new time/op    delta
StringBuilder-8   4.23µs ± 2%   1.87µs ± 1%   -55.80%  (p=0.000 n=10+10)

name              old alloc/op   new alloc/op   delta
StringBuilder-8   2.56kB ± 0%   1.02kB ± 0%   -60.16%  (p=0.000 n=10+10)

p=0.000 confirms statistical significance (t-test). ± 2% is the coefficient of variation; below 5% means the measurement is stable and the delta is trustworthy.

go tool trace: Concurrency Visualization

pprof answers "which function is slowest." go tool trace answers "in what order did things happen, and where was time lost to scheduling?"

import "runtime/trace"

f, _ := os.Create("trace.out")
trace.Start(f)
defer trace.Stop()
// ... run the code under analysis ...

Via HTTP:

curl http://localhost:6060/debug/pprof/trace?seconds=5 > trace.out
go tool trace trace.out
# Opens an interactive browser UI

Key views in the trace UI:

Goroutine analysis: individual goroutine lifetimes — running, runnable, blocked, sleeping
Scheduler trace: real-time GMP state; P utilization per core
Heap: GC trigger points overlaid on heap growth
Blocking profile: aggregated blocking time by cause

Reading a scheduler trace:

Healthy (CPU-bound, fully utilized):
P0: [====goroutine A====][====goroutine B====][====goroutine A====]
P1: [====goroutine C====][====goroutine D====][====goroutine C====]
→ Both Ps are saturated

Problematic (lock contention):
P0: [=A=][        waiting for lock        ][=A=]
P1: [=B======================][=B=]
→ P0 wastes a full core waiting; the lock on P1 is the bottleneck

Reading Flame Graphs

# Generate a flame graph via the pprof HTTP UI
go tool pprof -http=:8000 cpu.prof
# Select "Flame Graph" in the browser

Flame graph rules:

X axis: time (sample count); width = fraction of total CPU time
Y axis: call stack depth (bottom = main, top = deepest callee)
Color: random, carries no semantic meaning in most tools

Reading tip: look for the widest "plateau" — a function that is wide at the top of a stack means CPU time is concentrated in that function itself (not in its callees). That function is your optimization target.

ASCII flame graph sketch:

                     ┌─────────┐
         ┌───────────┤  encode │
         │  marshal  └─────────┘
┌────────┴──────────────────────┬──────────────┐
│        processRequest         │  serialize   │
├───────────────────────────────┴──────────────┤
│                   handleHTTP                  │
└───────────────────────────────────────────────┘

encode is a wide plateau at 30% → it is the CPU hotspot

Level 4: Advanced Topics

Continuous Profiling in Production

Continuous profiling captures performance data under real traffic — the only way to observe issues that never appear in staging.

Rolling profile collection:

func startContinuousProfiling(dir string) {
    go func() {
        for range time.Tick(5 * time.Minute) {
            path := filepath.Join(dir,
                fmt.Sprintf("cpu-%s.prof", time.Now().Format("20060102-150405")))
            f, err := os.Create(path)
            if err != nil {
                log.Printf("pprof: create file: %v", err)
                continue
            }
            pprof.StartCPUProfile(f)
            time.Sleep(30 * time.Second)
            pprof.StopCPUProfile()
            f.Close()
        }
    }()
}

Open-source / commercial options:

Pyroscope: open-source continuous profiling platform supporting Go, Python, Ruby, and eBPF
Datadog Continuous Profiler: commercial, integrated with APM traces
Google Cloud Profiler: managed solution on GCP

runtime/trace Tasks and Regions

runtime/trace provides user-level tracing primitives for annotating your own code paths:

import "runtime/trace"

func processRequest(ctx context.Context, req *Request) (*Response, error) {
    // Task: a logical operation that may span goroutines
    ctx, task := trace.NewTask(ctx, "processRequest")
    defer task.End()

    // Region: a timed segment within a single goroutine
    trace.WithRegion(ctx, "parseJSON", func() {
        req.body = parseJSON(req.rawBody)
    })

    trace.WithRegion(ctx, "queryDB", func() {
        req.data = queryDB(ctx, req.body)
    })

    // Structured log entry visible in the trace viewer
    trace.Log(ctx, "requestID", req.ID)

    return buildResponse(req), nil
}

In go tool trace's "User-defined tasks" view you see latency distributions for each task and time breakdowns for each region within. This is more precise than sampling-based pprof — useful for diagnosing tail latency in specific business flows.

Escape Analysis and Allocation Profiles

Go's compiler decides whether a variable lives on the stack or the heap during escape analysis. Understanding escape analysis is the key to interpreting allocation profiles.

# Show escape analysis decisions
go build -gcflags="-m" ./...

# Verbose output
go build -gcflags="-m -m" ./...

Common escape triggers:

// Interface boxing — the concrete value escapes to the heap
func printAny(v interface{}) { fmt.Println(v) }
x := 42
printAny(x) // x escapes: boxing requires a heap allocation

// Returning a pointer — the pointed-to value must outlive the function
func newUser() *User {
    u := User{Name: "Alice"}
    return &u // u escapes: caller holds the address
}

// No escape — pointer used locally only
func localPtr() {
    u := User{Name: "Bob"}
    process(&u) // if process does not store &u, u stays on the stack
}

// Oversized allocation — too large for the stack
func bigSlice() []int {
    s := make([]int, 1_000_000) // escapes to heap regardless
    return s
}

The chain: escape → heap allocation → GC pressure → latency spikes.

When the allocs profile shows a function generating heavy allocation, run escape analysis on that code. Find out why variables escape (interface conversion? closure capture? pointer returned?) and eliminate the escape where possible.

False sharing occurs when two independent variables occupy the same CPU cache line (64 bytes) and are written by different goroutines simultaneously:

// Problematic: two counters share one cache line
type Counters struct {
    reads  int64  // offset 0
    writes int64  // offset 8
    // Both fit in one 64-byte cache line!
    // A write to either invalidates the other on every core.
}

// Fixed: padding forces each counter into its own cache line
type PaddedCounter struct {
    value int64
    _     [56]byte // pad to exactly 64 bytes
}

type Counters struct {
    reads  PaddedCounter
    writes PaddedCounter
}

False sharing is hard to see directly in pprof — it appears as elevated CPU time without a clear hot function. Confirmation typically requires hardware performance counters via perf stat on Linux or Instruments on macOS.

In go tool trace, a symptom is mutual scheduling interference between goroutines doing unrelated work: if goroutine A and goroutine B are independent yet each makes the other slower, false sharing is a plausible suspect.

GOGC Tuning Based on Profiling

GOGC controls when GC triggers:

GOGC=100 (default): trigger when heap size reaches 2× live data after previous GC
GOGC=200:           trigger at 3× live data (less frequent GC, more memory used)
GOGC=50:            trigger at 1.5× live data (more frequent GC, lower memory)
GOGC=off:           disable GC (extreme cases only)

Observe GC behavior:

GODEBUG=gctrace=1 ./myapp

# Output line format:
# gc 7 @3.812s 4%: 0.19+2.5+0.041 ms clock, 1.5+4.1/11/29+0.33 ms cpu,
#    42->43->22 MB, 44 MB goal, 8 P

# Fields:
# gc 7:         7th GC cycle
# @3.812s:      3.812 seconds into program execution
# 4%:           GC consumed 4% of CPU time
# 0.19+2.5+0.041 ms: STW sweep-termination + concurrent mark + STW mark-termination
# 42->43->22 MB: heap before GC → heap after GC → live data
# 44 MB goal:   next GC triggers at 44 MB (22 MB × (1 + GOGC/100))

Tuning workflow:

Observe GC fraction in gctrace=1 output
GC fraction > 20%: increase GOGC to reduce frequency (accept more memory usage)
Memory footprint exceeds budget: decrease GOGC or fix allocation hotspots
Go 1.19+: combine GOMEMLIMIT with GOGC for soft memory cap with adaptive GC:

import "runtime/debug"

// GC becomes more aggressive as heap approaches 500 MB;
// prevents OOM while keeping GOGC high most of the time
debug.SetMemoryLimit(500 << 20) // 500 MB

Summary

Performance analysis is empirical science. The complete toolkit:

Tool	Purpose	Question answered
`go test -bench`	Reproducible experiments	How fast is implementation A vs B?
`benchstat`	Statistical comparison	Is the improvement real or noise?
CPU profile (pprof)	Sampled CPU hotspots	Which function uses the most CPU?
allocs profile	Allocation hotspots	Who is generating GC pressure?
heap profile	Live object analysis	Where is memory leaking?
goroutine profile	Goroutine state	Are goroutines leaking?
block profile	Synchronization stall time	Where are goroutines waiting?
mutex profile	Lock contention	Which lock is most contested?
`go tool trace`	Concurrency visualization	Where did scheduling overhead go?

Two rules to memorize: no measurement, no optimization and change exactly one thing at a time. These two rules are the entire secret to switching from intuition-driven guessing to evidence-driven engineering.

Rate this chapter

4.6 / 5 (7 ratings)

Performance Analysis: pprof, trace and Benchmark

Performance Analysis: pprof, trace and Benchmark

Level 1: The Performance Analysis Mindset

Why Intuition Fails

The Correct Sequence

Common Premature Optimization Mistakes

Level 2: pprof — Principles and Usage

Enabling the net/http/pprof Endpoint

CPU Profile: Sampling and Interpretation

Memory Profile: allocs vs heap

Goroutine Profile: Detecting Leaks

block Profile and mutex Profile

Level 3: Code Practice

Writing Benchmarks Correctly

benchstat: Statistical Comparison

go tool trace: Concurrency Visualization

Reading Flame Graphs

Level 4: Advanced Topics

Continuous Profiling in Production

runtime/trace Tasks and Regions

Escape Analysis and Allocation Profiles

Identifying False Sharing

GOGC Tuning Based on Profiling

Summary

💬 Comments