Performance Analysis: pprof, trace and Benchmark
Performance Analysis: pprof, trace and Benchmark
There is a category of engineer who, on encountering a performance problem, immediately opens the code, locates the section that "looks slow" by intuition, and starts optimizing. This practice has a name: Guess-Driven Development.
The typical outcome: two days spent tuning code that was not the bottleneck, no measurable performance gain, and three new bugs for the trouble.
Real performance engineering has one inviolable rule: measure first, then optimize. No measurement, no optimization.
This is not a suggestion — it is a constraint. Donald Knuth's 1974 observation that "premature optimization is the root of all evil" is cited endlessly and followed rarely. The reason is that intuition-driven optimization carries a seductive sense of certainty — "I know this is slow because..." — but program performance almost never behaves as intuition predicts, especially in a garbage-collected language.
Go ships a complete performance analysis toolchain: pprof (CPU, memory, goroutine, block, and mutex profiles), trace (concurrency visualization), and testing/benchmark (reproducible performance experiments). Mastering this toolchain is the engineering entry ticket to switching from guessing to measuring.
Level 1: The Performance Analysis Mindset
Why Intuition Fails
Modern program performance is determined by a constellation of nonlinear factors:
- CPU cache hierarchy: the same code runs 10x slower with a different access pattern
- Memory allocator: small allocations look cheap individually; GC pressure from millions of them is invisible until it explodes
- Goroutine scheduling: switch timing is opaque; lock contention does not appear until you profile it
- OS scheduling: when an M (OS thread) gets preempted is not under Go's control
- Microarchitecture effects: branch prediction, instruction pipelining, and NUMA topology all affect hot paths
Intuition models "source code complexity," not "runtime performance cost." The correlation between these two is far weaker than it feels.
The Correct Sequence
1. Define target metric (P99 latency? throughput? peak memory?)
│
▼
2. Establish a baseline (benchmark or production monitoring data)
│
▼
3. Collect a profile (CPU / memory / goroutine)
│
▼
4. Identify the bottleneck (which function uses the most CPU? where are allocations?)
│
▼
5. Make exactly one change (control variables)
│
▼
6. Re-run the benchmark, compare numbers
│
▼
7. If improved, continue. If not, revert.
Every step requires data. "Feels faster" is not evidence.
Common Premature Optimization Mistakes
Mistake 1: Optimizing a non-hot path
A function called a billion times at 1 ns each contributes 1 s of CPU time. A function called once taking 10 ms contributes 10 ms. Optimizing the billion-call function when the one-call function is the bottleneck is pure waste. Only a profile tells you which is which.
Mistake 2: Equating "allocation" with "slow"
In Go, a small heap allocation costs ~10 ns — fast enough to ignore in isolation. The cost is GC: millions of allocations force GC cycles, whose concurrent mark phase and stop-the-world pauses show up as latency spikes. The allocs profile finds allocation hotspots; it does not mean you should fear every make().
Mistake 3: Micro-optimizing without checking the macro picture
Shaving a function from 50 ns to 40 ns is meaningless if that function accounts for 0.1% of total runtime. Confirm with pprof's flat/cumulative breakdown that a function is actually hot before micro-optimizing it.
Level 2: pprof — Principles and Usage
Enabling the net/http/pprof Endpoint
A single blank import wires pprof into your HTTP server:
import _ "net/http/pprof"
The init() function in that package registers the following routes on DefaultServeMux:
GET /debug/pprof/ overview page (HTML)
GET /debug/pprof/profile CPU profile (default 30 s)
GET /debug/pprof/heap heap snapshot
GET /debug/pprof/goroutine stack traces of all goroutines
GET /debug/pprof/allocs memory allocation profile
GET /debug/pprof/block goroutine blocking events
GET /debug/pprof/mutex mutex contention
GET /debug/pprof/threadcreate OS thread creation
GET /debug/pprof/trace runtime execution trace
If your service uses a custom router (not DefaultServeMux), register manually:
import (
"net/http/pprof"
"github.com/gorilla/mux"
)
func registerPprof(r *mux.Router) {
r.HandleFunc("/debug/pprof/", pprof.Index)
r.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
r.HandleFunc("/debug/pprof/profile", pprof.Profile)
r.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
r.HandleFunc("/debug/pprof/trace", pprof.Trace)
r.Handle("/debug/pprof/goroutine", pprof.Handler("goroutine"))
r.Handle("/debug/pprof/heap", pprof.Handler("heap"))
r.Handle("/debug/pprof/allocs", pprof.Handler("allocs"))
r.Handle("/debug/pprof/block", pprof.Handler("block"))
r.Handle("/debug/pprof/mutex", pprof.Handler("mutex"))
}
Production security: pprof endpoints expose program internals. Restrict them to loopback or internal networks:
// Expose pprof on a separate internal port
go func() {
log.Println(http.ListenAndServe("localhost:6060", nil))
}()
CPU Profile: Sampling and Interpretation
The CPU profiler is a sampling profiler. The Go runtime sends SIGPROF to the process every 10 ms (default), capturing the call stacks of all running M threads at that instant.
Sampling rate: 100 Hz (one sample every 10 ms)
Default duration: 30 s → ~3000 sample points
Profile content: complete call stack at each sample
Interpretation: number of times a function appears ≈ fraction of CPU time
Collecting a CPU profile:
# Via HTTP endpoint (production)
go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30
# Programmatically
import "runtime/pprof"
f, _ := os.Create("cpu.prof")
pprof.StartCPUProfile(f)
defer pprof.StopCPUProfile()
// ... run the code to analyze ...
Analyzing the profile:
go tool pprof cpu.prof
# Common commands
(pprof) top10 # sorted by flat time; top 10 functions
(pprof) top10 -cum # sorted by cumulative time
(pprof) list funcName # source-annotated view for one function
(pprof) web # SVG call graph in browser (requires graphviz)
Flat vs cumulative — what each means:
Call tree:
main()
└── processRequest() cumulative = 1000 ms
├── parseJSON() flat = 200 ms, cumulative = 200 ms
├── queryDB() flat = 100 ms, cumulative = 700 ms
│ └── marshalSQL() flat = 600 ms, cumulative = 600 ms
└── serialize() flat = 100 ms, cumulative = 100 ms
flat time: time the function itself ran (excluding callees)
cumulative time: total time including all callees in the subtree
marshalSQL has the highest flat → it is the true CPU hotspot
queryDB has high cumulative → it is the entry point to the slow path
Memory Profile: allocs vs heap
Go offers two distinct memory profiles:
allocs profile (allocation trace):
- Records every heap allocation since program start, including objects already collected
- Use to find what is generating GC pressure
- Sampled: one sample per 512 KB of allocations (by default)
heap profile (heap snapshot):
- Records currently live heap objects
- Use to find memory leaks ("who holds large amounts of memory without releasing it")
- Reflects the actual heap state after the last GC
# Allocation profile
go tool pprof http://localhost:6060/debug/pprof/allocs
# Heap snapshot
go tool pprof http://localhost:6060/debug/pprof/heap
# Switch sample index inside the interactive session
(pprof) sample_index=alloc_space # bytes allocated (allocs profile)
(pprof) sample_index=alloc_objects # object count allocated
(pprof) sample_index=inuse_space # bytes currently live (heap profile)
(pprof) sample_index=inuse_objects # live object count
Goroutine Profile: Detecting Leaks
go tool pprof http://localhost:6060/debug/pprof/goroutine
# Raw text output — often more readable
curl http://localhost:6060/debug/pprof/goroutine?debug=2
Example output (excerpt):
goroutine 1 [running]:
main.main()
/app/main.go:45 +0x1a
goroutine 18 [chan receive, 3 minutes]: ← waiting 3 min — likely leaked
net/http.(*persistConn).readLoop(0xc0001a4200)
/usr/local/go/src/net/http/transport.go:2205 +0x99
goroutine 42 [select]:
database/sql.(*DB).connectionOpener(0xc0000b4000, 0x...)
/usr/local/go/src/database/sql/sql.go:1189 +0x66
A monotonically increasing goroutine count over time signals a leak. The goroutine profile shows which type of goroutine is accumulating.
block Profile and mutex Profile
Both are disabled by default:
import "runtime"
// block profile: records time goroutines spend blocked on synchronization
// rate=1 captures everything (expensive); use 10000 in production
runtime.SetBlockProfileRate(10000) // sample one event per 10000 ns of blocking
// mutex profile: records mutex contention events
// fraction=5 means roughly 1/5 of contention events are recorded
runtime.SetMutexProfileFraction(5)
go tool pprof http://localhost:6060/debug/pprof/block
go tool pprof http://localhost:6060/debug/pprof/mutex
block profile reveals:
- Channel send/receive stalls
sync.Mutex.Lock()waiting time- Syscall blocking time (for goroutines not managed by netpoll)
mutex profile reveals:
- Functions holding a lock for a long time
- Highly contended mutexes with many goroutines waiting
Level 3: Code Practice
Writing Benchmarks Correctly
// string_bench_test.go
package strutil_test
import (
"fmt"
"math/rand"
"sort"
"strings"
"testing"
"unicode/utf8"
)
// Benchmark functions must be named Benchmark*
func BenchmarkStringContains(b *testing.B) {
haystack := strings.Repeat("hello world ", 1000)
needle := "world"
// b.ResetTimer() after expensive setup
// so setup time is excluded from the measurement
b.ResetTimer()
for i := 0; i < b.N; i++ {
_ = strings.Contains(haystack, needle)
}
}
// Comparative benchmarks — same operation, two implementations
func BenchmarkRuneCount_Len(b *testing.B) {
s := strings.Repeat("Hello World", 1000)
b.ResetTimer()
for i := 0; i < b.N; i++ {
_ = len(s) // byte count — O(1)
}
}
func BenchmarkRuneCount_Utf8(b *testing.B) {
s := strings.Repeat("Hello World", 1000)
b.ResetTimer()
for i := 0; i < b.N; i++ {
_ = utf8.RuneCountInString(s) // rune count — O(n)
}
}
// b.ReportAllocs() forces allocation output for this benchmark only.
// Equivalent to -benchmem but scoped to a single function.
func BenchmarkStringBuilder(b *testing.B) {
b.ReportAllocs()
b.ResetTimer()
for i := 0; i < b.N; i++ {
var sb strings.Builder
for j := 0; j < 100; j++ {
sb.WriteString("hello ")
}
_ = sb.String()
}
}
// Table-driven benchmarks for scaling analysis
func BenchmarkSort(b *testing.B) {
for _, size := range []int{10, 100, 1000, 10000} {
b.Run(fmt.Sprintf("n=%d", size), func(b *testing.B) {
src := make([]int, size)
for i := range src {
src[i] = rand.Intn(10000)
}
b.ResetTimer()
for i := 0; i < b.N; i++ {
tmp := make([]int, len(src))
copy(tmp, src)
sort.Ints(tmp)
}
})
}
}
Running benchmarks:
# Basic run with allocation stats
go test -bench=. -benchmem ./...
# Longer run for stability (default is 1 s per benchmark)
go test -bench=BenchmarkStringContains -benchtime=10s -count=5
# Run a specific sub-benchmark (regex match)
go test -bench=BenchmarkSort/n=1000 -benchmem
# Collect CPU and memory profiles while benchmarking
go test -bench=. -cpuprofile=cpu.prof -memprofile=mem.prof
go tool pprof cpu.prof
benchstat: Statistical Comparison
Single-run benchmark output contains noise. Use multiple runs and benchstat to establish statistical significance:
# Install
go install golang.org/x/perf/cmd/benchstat@latest
# Collect baseline — 10 runs
go test -bench=BenchmarkStringBuilder -count=10 > old.txt
# Apply your change, then collect again
go test -bench=BenchmarkStringBuilder -count=10 > new.txt
# Compare
benchstat old.txt new.txt
Example output:
name old time/op new time/op delta
StringBuilder-8 4.23µs ± 2% 1.87µs ± 1% -55.80% (p=0.000 n=10+10)
name old alloc/op new alloc/op delta
StringBuilder-8 2.56kB ± 0% 1.02kB ± 0% -60.16% (p=0.000 n=10+10)
p=0.000 confirms statistical significance (t-test). ± 2% is the coefficient of variation; below 5% means the measurement is stable and the delta is trustworthy.
go tool trace: Concurrency Visualization
pprof answers "which function is slowest." go tool trace answers "in what order did things happen, and where was time lost to scheduling?"
import "runtime/trace"
f, _ := os.Create("trace.out")
trace.Start(f)
defer trace.Stop()
// ... run the code under analysis ...
Via HTTP:
curl http://localhost:6060/debug/pprof/trace?seconds=5 > trace.out
go tool trace trace.out
# Opens an interactive browser UI
Key views in the trace UI:
- Goroutine analysis: individual goroutine lifetimes — running, runnable, blocked, sleeping
- Scheduler trace: real-time GMP state; P utilization per core
- Heap: GC trigger points overlaid on heap growth
- Blocking profile: aggregated blocking time by cause
Reading a scheduler trace:
Healthy (CPU-bound, fully utilized):
P0: [====goroutine A====][====goroutine B====][====goroutine A====]
P1: [====goroutine C====][====goroutine D====][====goroutine C====]
→ Both Ps are saturated
Problematic (lock contention):
P0: [=A=][ waiting for lock ][=A=]
P1: [=B======================][=B=]
→ P0 wastes a full core waiting; the lock on P1 is the bottleneck
Reading Flame Graphs
# Generate a flame graph via the pprof HTTP UI
go tool pprof -http=:8000 cpu.prof
# Select "Flame Graph" in the browser
Flame graph rules:
- X axis: time (sample count); width = fraction of total CPU time
- Y axis: call stack depth (bottom = main, top = deepest callee)
- Color: random, carries no semantic meaning in most tools
Reading tip: look for the widest "plateau" — a function that is wide at the top of a stack means CPU time is concentrated in that function itself (not in its callees). That function is your optimization target.
ASCII flame graph sketch:
┌─────────┐
┌───────────┤ encode │
│ marshal └─────────┘
┌────────┴──────────────────────┬──────────────┐
│ processRequest │ serialize │
├───────────────────────────────┴──────────────┤
│ handleHTTP │
└───────────────────────────────────────────────┘
encode is a wide plateau at 30% → it is the CPU hotspot
Level 4: Advanced Topics
Continuous Profiling in Production
Continuous profiling captures performance data under real traffic — the only way to observe issues that never appear in staging.
Rolling profile collection:
func startContinuousProfiling(dir string) {
go func() {
for range time.Tick(5 * time.Minute) {
path := filepath.Join(dir,
fmt.Sprintf("cpu-%s.prof", time.Now().Format("20060102-150405")))
f, err := os.Create(path)
if err != nil {
log.Printf("pprof: create file: %v", err)
continue
}
pprof.StartCPUProfile(f)
time.Sleep(30 * time.Second)
pprof.StopCPUProfile()
f.Close()
}
}()
}
Open-source / commercial options:
- Pyroscope: open-source continuous profiling platform supporting Go, Python, Ruby, and eBPF
- Datadog Continuous Profiler: commercial, integrated with APM traces
- Google Cloud Profiler: managed solution on GCP
runtime/trace Tasks and Regions
runtime/trace provides user-level tracing primitives for annotating your own code paths:
import "runtime/trace"
func processRequest(ctx context.Context, req *Request) (*Response, error) {
// Task: a logical operation that may span goroutines
ctx, task := trace.NewTask(ctx, "processRequest")
defer task.End()
// Region: a timed segment within a single goroutine
trace.WithRegion(ctx, "parseJSON", func() {
req.body = parseJSON(req.rawBody)
})
trace.WithRegion(ctx, "queryDB", func() {
req.data = queryDB(ctx, req.body)
})
// Structured log entry visible in the trace viewer
trace.Log(ctx, "requestID", req.ID)
return buildResponse(req), nil
}
In go tool trace's "User-defined tasks" view you see latency distributions for each task and time breakdowns for each region within. This is more precise than sampling-based pprof — useful for diagnosing tail latency in specific business flows.
Escape Analysis and Allocation Profiles
Go's compiler decides whether a variable lives on the stack or the heap during escape analysis. Understanding escape analysis is the key to interpreting allocation profiles.
# Show escape analysis decisions
go build -gcflags="-m" ./...
# Verbose output
go build -gcflags="-m -m" ./...
Common escape triggers:
// Interface boxing — the concrete value escapes to the heap
func printAny(v interface{}) { fmt.Println(v) }
x := 42
printAny(x) // x escapes: boxing requires a heap allocation
// Returning a pointer — the pointed-to value must outlive the function
func newUser() *User {
u := User{Name: "Alice"}
return &u // u escapes: caller holds the address
}
// No escape — pointer used locally only
func localPtr() {
u := User{Name: "Bob"}
process(&u) // if process does not store &u, u stays on the stack
}
// Oversized allocation — too large for the stack
func bigSlice() []int {
s := make([]int, 1_000_000) // escapes to heap regardless
return s
}
The chain: escape → heap allocation → GC pressure → latency spikes.
When the allocs profile shows a function generating heavy allocation, run escape analysis on that code. Find out why variables escape (interface conversion? closure capture? pointer returned?) and eliminate the escape where possible.
Identifying False Sharing
False sharing occurs when two independent variables occupy the same CPU cache line (64 bytes) and are written by different goroutines simultaneously:
// Problematic: two counters share one cache line
type Counters struct {
reads int64 // offset 0
writes int64 // offset 8
// Both fit in one 64-byte cache line!
// A write to either invalidates the other on every core.
}
// Fixed: padding forces each counter into its own cache line
type PaddedCounter struct {
value int64
_ [56]byte // pad to exactly 64 bytes
}
type Counters struct {
reads PaddedCounter
writes PaddedCounter
}
False sharing is hard to see directly in pprof — it appears as elevated CPU time without a clear hot function. Confirmation typically requires hardware performance counters via perf stat on Linux or Instruments on macOS.
In go tool trace, a symptom is mutual scheduling interference between goroutines doing unrelated work: if goroutine A and goroutine B are independent yet each makes the other slower, false sharing is a plausible suspect.
GOGC Tuning Based on Profiling
GOGC controls when GC triggers:
GOGC=100 (default): trigger when heap size reaches 2× live data after previous GC
GOGC=200: trigger at 3× live data (less frequent GC, more memory used)
GOGC=50: trigger at 1.5× live data (more frequent GC, lower memory)
GOGC=off: disable GC (extreme cases only)
Observe GC behavior:
GODEBUG=gctrace=1 ./myapp
# Output line format:
# gc 7 @3.812s 4%: 0.19+2.5+0.041 ms clock, 1.5+4.1/11/29+0.33 ms cpu,
# 42->43->22 MB, 44 MB goal, 8 P
# Fields:
# gc 7: 7th GC cycle
# @3.812s: 3.812 seconds into program execution
# 4%: GC consumed 4% of CPU time
# 0.19+2.5+0.041 ms: STW sweep-termination + concurrent mark + STW mark-termination
# 42->43->22 MB: heap before GC → heap after GC → live data
# 44 MB goal: next GC triggers at 44 MB (22 MB × (1 + GOGC/100))
Tuning workflow:
- Observe GC fraction in
gctrace=1output - GC fraction > 20%: increase
GOGCto reduce frequency (accept more memory usage) - Memory footprint exceeds budget: decrease
GOGCor fix allocation hotspots - Go 1.19+: combine
GOMEMLIMITwithGOGCfor soft memory cap with adaptive GC:
import "runtime/debug"
// GC becomes more aggressive as heap approaches 500 MB;
// prevents OOM while keeping GOGC high most of the time
debug.SetMemoryLimit(500 << 20) // 500 MB
Summary
Performance analysis is empirical science. The complete toolkit:
| Tool | Purpose | Question answered |
|---|---|---|
go test -bench |
Reproducible experiments | How fast is implementation A vs B? |
benchstat |
Statistical comparison | Is the improvement real or noise? |
| CPU profile (pprof) | Sampled CPU hotspots | Which function uses the most CPU? |
| allocs profile | Allocation hotspots | Who is generating GC pressure? |
| heap profile | Live object analysis | Where is memory leaking? |
| goroutine profile | Goroutine state | Are goroutines leaking? |
| block profile | Synchronization stall time | Where are goroutines waiting? |
| mutex profile | Lock contention | Which lock is most contested? |
go tool trace |
Concurrency visualization | Where did scheduling overhead go? |
Two rules to memorize: no measurement, no optimization and change exactly one thing at a time. These two rules are the entire secret to switching from intuition-driven guessing to evidence-driven engineering.