Chapter 24

Performance Analysis: pprof, trace and Benchmark

Performance Analysis: pprof, trace and Benchmark

There is a category of engineer who, on encountering a performance problem, immediately opens the code, locates the section that "looks slow" by intuition, and starts optimizing. This practice has a name: Guess-Driven Development.

The typical outcome: two days spent tuning code that was not the bottleneck, no measurable performance gain, and three new bugs for the trouble.

Real performance engineering has one inviolable rule: measure first, then optimize. No measurement, no optimization.

This is not a suggestion — it is a constraint. Donald Knuth's 1974 observation that "premature optimization is the root of all evil" is cited endlessly and followed rarely. The reason is that intuition-driven optimization carries a seductive sense of certainty — "I know this is slow because..." — but program performance almost never behaves as intuition predicts, especially in a garbage-collected language.

Go ships a complete performance analysis toolchain: pprof (CPU, memory, goroutine, block, and mutex profiles), trace (concurrency visualization), and testing/benchmark (reproducible performance experiments). Mastering this toolchain is the engineering entry ticket to switching from guessing to measuring.


Level 1: The Performance Analysis Mindset

Why Intuition Fails

Modern program performance is determined by a constellation of nonlinear factors:

Intuition models "source code complexity," not "runtime performance cost." The correlation between these two is far weaker than it feels.

The Correct Sequence

1. Define target metric (P99 latency? throughput? peak memory?)
       │
       ▼
2. Establish a baseline (benchmark or production monitoring data)
       │
       ▼
3. Collect a profile (CPU / memory / goroutine)
       │
       ▼
4. Identify the bottleneck (which function uses the most CPU? where are allocations?)
       │
       ▼
5. Make exactly one change (control variables)
       │
       ▼
6. Re-run the benchmark, compare numbers
       │
       ▼
7. If improved, continue. If not, revert.

Every step requires data. "Feels faster" is not evidence.

Common Premature Optimization Mistakes

Mistake 1: Optimizing a non-hot path

A function called a billion times at 1 ns each contributes 1 s of CPU time. A function called once taking 10 ms contributes 10 ms. Optimizing the billion-call function when the one-call function is the bottleneck is pure waste. Only a profile tells you which is which.

Mistake 2: Equating "allocation" with "slow"

In Go, a small heap allocation costs ~10 ns — fast enough to ignore in isolation. The cost is GC: millions of allocations force GC cycles, whose concurrent mark phase and stop-the-world pauses show up as latency spikes. The allocs profile finds allocation hotspots; it does not mean you should fear every make().

Mistake 3: Micro-optimizing without checking the macro picture

Shaving a function from 50 ns to 40 ns is meaningless if that function accounts for 0.1% of total runtime. Confirm with pprof's flat/cumulative breakdown that a function is actually hot before micro-optimizing it.


Level 2: pprof — Principles and Usage

Enabling the net/http/pprof Endpoint

A single blank import wires pprof into your HTTP server:

import _ "net/http/pprof"

The init() function in that package registers the following routes on DefaultServeMux:

GET /debug/pprof/              overview page (HTML)
GET /debug/pprof/profile       CPU profile (default 30 s)
GET /debug/pprof/heap          heap snapshot
GET /debug/pprof/goroutine     stack traces of all goroutines
GET /debug/pprof/allocs        memory allocation profile
GET /debug/pprof/block         goroutine blocking events
GET /debug/pprof/mutex         mutex contention
GET /debug/pprof/threadcreate  OS thread creation
GET /debug/pprof/trace         runtime execution trace

If your service uses a custom router (not DefaultServeMux), register manually:

import (
    "net/http/pprof"
    "github.com/gorilla/mux"
)

func registerPprof(r *mux.Router) {
    r.HandleFunc("/debug/pprof/", pprof.Index)
    r.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
    r.HandleFunc("/debug/pprof/profile", pprof.Profile)
    r.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
    r.HandleFunc("/debug/pprof/trace", pprof.Trace)
    r.Handle("/debug/pprof/goroutine", pprof.Handler("goroutine"))
    r.Handle("/debug/pprof/heap", pprof.Handler("heap"))
    r.Handle("/debug/pprof/allocs", pprof.Handler("allocs"))
    r.Handle("/debug/pprof/block", pprof.Handler("block"))
    r.Handle("/debug/pprof/mutex", pprof.Handler("mutex"))
}

Production security: pprof endpoints expose program internals. Restrict them to loopback or internal networks:

// Expose pprof on a separate internal port
go func() {
    log.Println(http.ListenAndServe("localhost:6060", nil))
}()

CPU Profile: Sampling and Interpretation

The CPU profiler is a sampling profiler. The Go runtime sends SIGPROF to the process every 10 ms (default), capturing the call stacks of all running M threads at that instant.

Sampling rate:    100 Hz (one sample every 10 ms)
Default duration: 30 s  → ~3000 sample points
Profile content:  complete call stack at each sample
Interpretation:   number of times a function appears ≈ fraction of CPU time

Collecting a CPU profile:

# Via HTTP endpoint (production)
go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30

# Programmatically
import "runtime/pprof"

f, _ := os.Create("cpu.prof")
pprof.StartCPUProfile(f)
defer pprof.StopCPUProfile()
// ... run the code to analyze ...

Analyzing the profile:

go tool pprof cpu.prof

# Common commands
(pprof) top10           # sorted by flat time; top 10 functions
(pprof) top10 -cum      # sorted by cumulative time
(pprof) list funcName   # source-annotated view for one function
(pprof) web             # SVG call graph in browser (requires graphviz)

Flat vs cumulative — what each means:

Call tree:
main()
  └── processRequest()        cumulative = 1000 ms
        ├── parseJSON()            flat = 200 ms, cumulative = 200 ms
        ├── queryDB()              flat = 100 ms, cumulative = 700 ms
        │     └── marshalSQL()         flat = 600 ms, cumulative = 600 ms
        └── serialize()            flat = 100 ms, cumulative = 100 ms

flat time:       time the function itself ran (excluding callees)
cumulative time: total time including all callees in the subtree

marshalSQL has the highest flat → it is the true CPU hotspot
queryDB has high cumulative → it is the entry point to the slow path

Memory Profile: allocs vs heap

Go offers two distinct memory profiles:

allocs profile (allocation trace):

heap profile (heap snapshot):

# Allocation profile
go tool pprof http://localhost:6060/debug/pprof/allocs

# Heap snapshot
go tool pprof http://localhost:6060/debug/pprof/heap

# Switch sample index inside the interactive session
(pprof) sample_index=alloc_space    # bytes allocated (allocs profile)
(pprof) sample_index=alloc_objects  # object count allocated
(pprof) sample_index=inuse_space    # bytes currently live (heap profile)
(pprof) sample_index=inuse_objects  # live object count

Goroutine Profile: Detecting Leaks

go tool pprof http://localhost:6060/debug/pprof/goroutine

# Raw text output — often more readable
curl http://localhost:6060/debug/pprof/goroutine?debug=2

Example output (excerpt):

goroutine 1 [running]:
main.main()
    /app/main.go:45 +0x1a

goroutine 18 [chan receive, 3 minutes]:   ← waiting 3 min — likely leaked
net/http.(*persistConn).readLoop(0xc0001a4200)
    /usr/local/go/src/net/http/transport.go:2205 +0x99

goroutine 42 [select]:
database/sql.(*DB).connectionOpener(0xc0000b4000, 0x...)
    /usr/local/go/src/database/sql/sql.go:1189 +0x66

A monotonically increasing goroutine count over time signals a leak. The goroutine profile shows which type of goroutine is accumulating.

block Profile and mutex Profile

Both are disabled by default:

import "runtime"

// block profile: records time goroutines spend blocked on synchronization
// rate=1 captures everything (expensive); use 10000 in production
runtime.SetBlockProfileRate(10000)  // sample one event per 10000 ns of blocking

// mutex profile: records mutex contention events
// fraction=5 means roughly 1/5 of contention events are recorded
runtime.SetMutexProfileFraction(5)
go tool pprof http://localhost:6060/debug/pprof/block
go tool pprof http://localhost:6060/debug/pprof/mutex

block profile reveals:

mutex profile reveals:


Level 3: Code Practice

Writing Benchmarks Correctly

// string_bench_test.go
package strutil_test

import (
    "fmt"
    "math/rand"
    "sort"
    "strings"
    "testing"
    "unicode/utf8"
)

// Benchmark functions must be named Benchmark*
func BenchmarkStringContains(b *testing.B) {
    haystack := strings.Repeat("hello world ", 1000)
    needle := "world"

    // b.ResetTimer() after expensive setup
    // so setup time is excluded from the measurement
    b.ResetTimer()

    for i := 0; i < b.N; i++ {
        _ = strings.Contains(haystack, needle)
    }
}

// Comparative benchmarks — same operation, two implementations
func BenchmarkRuneCount_Len(b *testing.B) {
    s := strings.Repeat("Hello World", 1000)
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        _ = len(s) // byte count — O(1)
    }
}

func BenchmarkRuneCount_Utf8(b *testing.B) {
    s := strings.Repeat("Hello World", 1000)
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        _ = utf8.RuneCountInString(s) // rune count — O(n)
    }
}

// b.ReportAllocs() forces allocation output for this benchmark only.
// Equivalent to -benchmem but scoped to a single function.
func BenchmarkStringBuilder(b *testing.B) {
    b.ReportAllocs()
    b.ResetTimer()

    for i := 0; i < b.N; i++ {
        var sb strings.Builder
        for j := 0; j < 100; j++ {
            sb.WriteString("hello ")
        }
        _ = sb.String()
    }
}

// Table-driven benchmarks for scaling analysis
func BenchmarkSort(b *testing.B) {
    for _, size := range []int{10, 100, 1000, 10000} {
        b.Run(fmt.Sprintf("n=%d", size), func(b *testing.B) {
            src := make([]int, size)
            for i := range src {
                src[i] = rand.Intn(10000)
            }
            b.ResetTimer()
            for i := 0; i < b.N; i++ {
                tmp := make([]int, len(src))
                copy(tmp, src)
                sort.Ints(tmp)
            }
        })
    }
}

Running benchmarks:

# Basic run with allocation stats
go test -bench=. -benchmem ./...

# Longer run for stability (default is 1 s per benchmark)
go test -bench=BenchmarkStringContains -benchtime=10s -count=5

# Run a specific sub-benchmark (regex match)
go test -bench=BenchmarkSort/n=1000 -benchmem

# Collect CPU and memory profiles while benchmarking
go test -bench=. -cpuprofile=cpu.prof -memprofile=mem.prof
go tool pprof cpu.prof

benchstat: Statistical Comparison

Single-run benchmark output contains noise. Use multiple runs and benchstat to establish statistical significance:

# Install
go install golang.org/x/perf/cmd/benchstat@latest

# Collect baseline — 10 runs
go test -bench=BenchmarkStringBuilder -count=10 > old.txt

# Apply your change, then collect again
go test -bench=BenchmarkStringBuilder -count=10 > new.txt

# Compare
benchstat old.txt new.txt

Example output:

name              old time/op    new time/op    delta
StringBuilder-8   4.23µs ± 2%   1.87µs ± 1%   -55.80%  (p=0.000 n=10+10)

name              old alloc/op   new alloc/op   delta
StringBuilder-8   2.56kB ± 0%   1.02kB ± 0%   -60.16%  (p=0.000 n=10+10)

p=0.000 confirms statistical significance (t-test). ± 2% is the coefficient of variation; below 5% means the measurement is stable and the delta is trustworthy.

go tool trace: Concurrency Visualization

pprof answers "which function is slowest." go tool trace answers "in what order did things happen, and where was time lost to scheduling?"

import "runtime/trace"

f, _ := os.Create("trace.out")
trace.Start(f)
defer trace.Stop()
// ... run the code under analysis ...

Via HTTP:

curl http://localhost:6060/debug/pprof/trace?seconds=5 > trace.out
go tool trace trace.out
# Opens an interactive browser UI

Key views in the trace UI:

Reading a scheduler trace:

Healthy (CPU-bound, fully utilized):
P0: [====goroutine A====][====goroutine B====][====goroutine A====]
P1: [====goroutine C====][====goroutine D====][====goroutine C====]
→ Both Ps are saturated

Problematic (lock contention):
P0: [=A=][        waiting for lock        ][=A=]
P1: [=B======================][=B=]
→ P0 wastes a full core waiting; the lock on P1 is the bottleneck

Reading Flame Graphs

# Generate a flame graph via the pprof HTTP UI
go tool pprof -http=:8000 cpu.prof
# Select "Flame Graph" in the browser

Flame graph rules:

Reading tip: look for the widest "plateau" — a function that is wide at the top of a stack means CPU time is concentrated in that function itself (not in its callees). That function is your optimization target.

ASCII flame graph sketch:

                     ┌─────────┐
         ┌───────────┤  encode │
         │  marshal  └─────────┘
┌────────┴──────────────────────┬──────────────┐
│        processRequest         │  serialize   │
├───────────────────────────────┴──────────────┤
│                   handleHTTP                  │
└───────────────────────────────────────────────┘

encode is a wide plateau at 30% → it is the CPU hotspot

Level 4: Advanced Topics

Continuous Profiling in Production

Continuous profiling captures performance data under real traffic — the only way to observe issues that never appear in staging.

Rolling profile collection:

func startContinuousProfiling(dir string) {
    go func() {
        for range time.Tick(5 * time.Minute) {
            path := filepath.Join(dir,
                fmt.Sprintf("cpu-%s.prof", time.Now().Format("20060102-150405")))
            f, err := os.Create(path)
            if err != nil {
                log.Printf("pprof: create file: %v", err)
                continue
            }
            pprof.StartCPUProfile(f)
            time.Sleep(30 * time.Second)
            pprof.StopCPUProfile()
            f.Close()
        }
    }()
}

Open-source / commercial options:

runtime/trace Tasks and Regions

runtime/trace provides user-level tracing primitives for annotating your own code paths:

import "runtime/trace"

func processRequest(ctx context.Context, req *Request) (*Response, error) {
    // Task: a logical operation that may span goroutines
    ctx, task := trace.NewTask(ctx, "processRequest")
    defer task.End()

    // Region: a timed segment within a single goroutine
    trace.WithRegion(ctx, "parseJSON", func() {
        req.body = parseJSON(req.rawBody)
    })

    trace.WithRegion(ctx, "queryDB", func() {
        req.data = queryDB(ctx, req.body)
    })

    // Structured log entry visible in the trace viewer
    trace.Log(ctx, "requestID", req.ID)

    return buildResponse(req), nil
}

In go tool trace's "User-defined tasks" view you see latency distributions for each task and time breakdowns for each region within. This is more precise than sampling-based pprof — useful for diagnosing tail latency in specific business flows.

Escape Analysis and Allocation Profiles

Go's compiler decides whether a variable lives on the stack or the heap during escape analysis. Understanding escape analysis is the key to interpreting allocation profiles.

# Show escape analysis decisions
go build -gcflags="-m" ./...

# Verbose output
go build -gcflags="-m -m" ./...

Common escape triggers:

// Interface boxing — the concrete value escapes to the heap
func printAny(v interface{}) { fmt.Println(v) }
x := 42
printAny(x) // x escapes: boxing requires a heap allocation

// Returning a pointer — the pointed-to value must outlive the function
func newUser() *User {
    u := User{Name: "Alice"}
    return &u // u escapes: caller holds the address
}

// No escape — pointer used locally only
func localPtr() {
    u := User{Name: "Bob"}
    process(&u) // if process does not store &u, u stays on the stack
}

// Oversized allocation — too large for the stack
func bigSlice() []int {
    s := make([]int, 1_000_000) // escapes to heap regardless
    return s
}

The chain: escape → heap allocation → GC pressure → latency spikes.

When the allocs profile shows a function generating heavy allocation, run escape analysis on that code. Find out why variables escape (interface conversion? closure capture? pointer returned?) and eliminate the escape where possible.

Identifying False Sharing

False sharing occurs when two independent variables occupy the same CPU cache line (64 bytes) and are written by different goroutines simultaneously:

// Problematic: two counters share one cache line
type Counters struct {
    reads  int64  // offset 0
    writes int64  // offset 8
    // Both fit in one 64-byte cache line!
    // A write to either invalidates the other on every core.
}

// Fixed: padding forces each counter into its own cache line
type PaddedCounter struct {
    value int64
    _     [56]byte // pad to exactly 64 bytes
}

type Counters struct {
    reads  PaddedCounter
    writes PaddedCounter
}

False sharing is hard to see directly in pprof — it appears as elevated CPU time without a clear hot function. Confirmation typically requires hardware performance counters via perf stat on Linux or Instruments on macOS.

In go tool trace, a symptom is mutual scheduling interference between goroutines doing unrelated work: if goroutine A and goroutine B are independent yet each makes the other slower, false sharing is a plausible suspect.

GOGC Tuning Based on Profiling

GOGC controls when GC triggers:

GOGC=100 (default): trigger when heap size reaches 2× live data after previous GC
GOGC=200:           trigger at 3× live data (less frequent GC, more memory used)
GOGC=50:            trigger at 1.5× live data (more frequent GC, lower memory)
GOGC=off:           disable GC (extreme cases only)

Observe GC behavior:

GODEBUG=gctrace=1 ./myapp

# Output line format:
# gc 7 @3.812s 4%: 0.19+2.5+0.041 ms clock, 1.5+4.1/11/29+0.33 ms cpu,
#    42->43->22 MB, 44 MB goal, 8 P

# Fields:
# gc 7:         7th GC cycle
# @3.812s:      3.812 seconds into program execution
# 4%:           GC consumed 4% of CPU time
# 0.19+2.5+0.041 ms: STW sweep-termination + concurrent mark + STW mark-termination
# 42->43->22 MB: heap before GC → heap after GC → live data
# 44 MB goal:   next GC triggers at 44 MB (22 MB × (1 + GOGC/100))

Tuning workflow:

  1. Observe GC fraction in gctrace=1 output
  2. GC fraction > 20%: increase GOGC to reduce frequency (accept more memory usage)
  3. Memory footprint exceeds budget: decrease GOGC or fix allocation hotspots
  4. Go 1.19+: combine GOMEMLIMIT with GOGC for soft memory cap with adaptive GC:
import "runtime/debug"

// GC becomes more aggressive as heap approaches 500 MB;
// prevents OOM while keeping GOGC high most of the time
debug.SetMemoryLimit(500 << 20) // 500 MB

Summary

Performance analysis is empirical science. The complete toolkit:

Tool Purpose Question answered
go test -bench Reproducible experiments How fast is implementation A vs B?
benchstat Statistical comparison Is the improvement real or noise?
CPU profile (pprof) Sampled CPU hotspots Which function uses the most CPU?
allocs profile Allocation hotspots Who is generating GC pressure?
heap profile Live object analysis Where is memory leaking?
goroutine profile Goroutine state Are goroutines leaking?
block profile Synchronization stall time Where are goroutines waiting?
mutex profile Lock contention Which lock is most contested?
go tool trace Concurrency visualization Where did scheduling overhead go?

Two rules to memorize: no measurement, no optimization and change exactly one thing at a time. These two rules are the entire secret to switching from intuition-driven guessing to evidence-driven engineering.

Rate this chapter
4.6  / 5  (7 ratings)

💬 Comments