Chapter 14: Linux Performance Analysis
Performance analysis is a systematic discipline, not guesswork. This chapter builds a complete Linux performance troubleshooting toolkit: the USE and RED methodologies, the top/vmstat/iostat basics, syscall tracing with strace, hardware counters with perf, kernel-level observability with eBPF and bpftrace, and FlameGraph visualization. A full "CPU 100%" investigation case study ties the tools together.
1. Performance Analysis Methodology
USE Model
Proposed by Brendan Gregg: for every resource (CPU, memory, disk, network, bus) check three dimensions:
- U (Utilization): percentage of time the resource is busy (CPU at 80% means it was doing work 80% of the time)
- S (Saturation): how much work is queued beyond what the resource can serve (run queue length, swap usage, disk await time)
- E (Errors): count of error events (NIC drops, disk errors, memory ECC corrections)
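As a rough illustration of the saturation dimension for CPU, the sketch below divides a load average by the core count. The function name `cpu_saturation` and the 1.0 threshold are assumptions made for this example. Because the Linux load average also counts tasks in uninterruptible sleep, the result only approximates run-queue pressure.

```bash
#!/usr/bin/env bash
# cpu_saturation LOAD NCPU
# Prints load per core and a verdict.
# Hypothetical helper for illustration; the 1.0 cutoff is an assumption.
cpu_saturation() {
  awk -v load="$1" -v ncpu="$2" 'BEGIN {
    ratio = load / ncpu
    printf "%.2f %s\n", ratio, (ratio > 1.0 ? "saturated" : "ok")
  }'
}

# Example against the live system:
#   cpu_saturation "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"
```

Passing the two numbers as arguments keeps the helper usable on figures copied from an alert as well as on the live machine.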
RED Model
Designed for request-oriented microservices, proposed by Tom Wilkie. For each service check:
- R (Rate): requests per second (QPS/RPS)
- E (Errors): failed requests per second (for HTTP, the 5xx count)
- D (Duration): distribution of request processing time (p50/p95/p99)
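The three RED figures can be derived from an access log. The sketch below assumes a simplified input of one request per line in the form `status duration_ms`, which is a made-up format for this example, and uses the nearest-rank method for percentiles. The function name `red_summary` is likewise hypothetical.

```bash
#!/usr/bin/env bash
# red_summary WINDOW_SECONDS
# Reads "status duration_ms" lines on stdin (assumed format) and prints
# Rate, Errors and Duration percentiles for the given time window.
red_summary() {
  local window="$1"
  sort -k2,2n | awk -v window="$window" '
    # Nearest-rank percentile: the value at rank ceil(p/100 * n).
    function pct(p) { return dur[int((n * p + 99) / 100)] }
    { n++; dur[n] = $2; if ($1 >= 500) errors++ }
    END {
      if (n == 0) { print "no requests"; exit }
      printf "rate=%.1f/s errors=%.1f/s p50=%sms p95=%sms p99=%sms\n",
             n / window, errors / window, pct(50), pct(95), pct(99)
    }'
}

# Example (hypothetical log with status in field 9 and duration in field 10):
#   awk '{ print $9, $10 }' access.log | red_summary 60
```

Sorting by duration before the awk pass is what allows the percentile lookup to be a plain array index.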
60-Second Troubleshooting Checklist
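uptime                 # load averages (1/5/15 min); compare them to judge the trend
dmesg | tail -20       # most recent kernel messages and errors
vmstat 1 5             # 5 samples at 1 s intervals: CPU, memory, IO
mpstat -P ALL 1 3      # utilization of each CPU
pidstat 1 5            # CPU usage per process
iostat -xz 1 3         # disk utilization and wait times
free -h                # memory and swap usage
sar -n DEV 1 3         # network interface throughput
sar -n TCP,ETCP 1 3    # TCP connection and error statistics
top                    # overall view, sortable by CPU or memory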
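Several of these commands come from the sysstat package and may be missing on a minimal host. One way to run the checklist without stopping at the first missing tool is a small wrapper like the sketch below. The helper name `run_if_present` is an assumption for this example.

```bash
#!/usr/bin/env bash
# run_if_present CMD [ARGS...]
# Runs CMD if it is installed, otherwise prints a note and carries on.
# Hypothetical helper, not part of any package.
run_if_present() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "== $*"
    "$@"
  else
    echo "== skipped: $1 not installed"
  fi
}

# Example checklist run (top is put in batch mode so it does not block):
#   run_if_present uptime
#   run_if_present vmstat 1 5
#   run_if_present mpstat -P ALL 1 3
#   run_if_present iostat -xz 1 3
#   run_if_present top -b -n 1
```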
2. Core Performance Tools
top / htop
top header CPU line field meanings:
| Field | Meaning | Warning threshold |
|---|---|---|
| %us | User-space CPU (app code) | >70% |
| %sy | Kernel-space CPU (syscalls) | >20% |
| %ni | Niced process CPU | — |
| %id | Idle CPU | <5% |
| %hi | Hardware interrupt handling (driver layer) | >5% |
| %si | Software interrupt (network RX, timers) | >5% |
| %st | CPU stolen by hypervisor (steal time) | >5% |
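The thresholds in the table can be applied mechanically to the CPU line that top prints in batch mode. The sketch below assumes the usual `%Cpu(s): 5.9 us, 2.1 sy, ...` layout with a dot as the decimal separator, and it treats idle below 5% as the warning condition. The function name `check_cpu_line` is hypothetical.

```bash
#!/usr/bin/env bash
# check_cpu_line: reads one top "%Cpu(s):" line on stdin and prints a WARN
# line for each field that crosses the threshold from the table.
# Assumes the procps top layout and "." as decimal separator.
check_cpu_line() {
  sed 's/^%Cpu(s)://' | tr ',' '\n' | awk '
    BEGIN { limit["us"] = 70; limit["sy"] = 20; limit["hi"] = 5
            limit["si"] = 5;  limit["st"] = 5 }
    { val = $1 + 0; name = $2 }          # each line is now "value name"
    (name in limit) && val > limit[name] {
      printf "WARN %s=%s (> %d)\n", name, val, limit[name]
    }
    name == "id" && val < 5 { printf "WARN id=%s (< 5)\n", val }
  '
}

# Example:
#   top -b -n 1 | grep '^%Cpu' | check_cpu_line
```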
vmstat
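vmstat 1 10 # sample every second, 10 samples in total
# Output columns:
# procs: r (run queue length), b (processes blocked in uninterruptible sleep)
# memory: swpd free buff cache (in KB)
# swap: si (swap in, KB/s), so (swap out, KB/s)
# io: bi (blocks read from disk per second), bo (blocks written to disk per second)
# system: in (interrupts/s), cs (context switches/s)
# cpu: us sy id wa st
# Key signals:
# r persistently > number of CPU cores → CPU saturation
# so persistently > 0 → memory is short and the system is swapping out
# cs very high → excessive thread switching or syscalls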
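The first two signals can be counted automatically across a run of samples. The sketch below assumes the default vmstat column order (r in column 1, so in column 8). The function name `vmstat_flags` is hypothetical. Note that the first data row vmstat prints holds averages since boot rather than a live sample.

```bash
#!/usr/bin/env bash
# vmstat_flags NCPU
# Reads vmstat output on stdin and counts samples where the run queue
# exceeds the core count or where pages are being swapped out.
# Assumes the default column order: r b swpd free buff cache si so ...
vmstat_flags() {
  local ncpu="$1"
  awk -v ncpu="$ncpu" '
    $1 ~ /^[0-9]+$/ {                 # data rows only; skips both header lines
      samples++
      if ($1 > ncpu) runq++
      if ($8 > 0)    swapout++
    }
    END {
      printf "samples=%d runq_over_cores=%d swapping_out=%d\n",
             samples, runq, swapout
    }'
}

# Example:
#   vmstat 1 10 | vmstat_flags "$(nproc)"
```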
iostat -dx
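iostat -dx 1 5 # extended disk statistics, sampled every second
# Key metrics:
# r/s w/s       read and write requests per second
# rkB/s wkB/s   read and write throughput in KB/s
# await         average IO wait time (ms); normal range for HDD: …
# … 0 and growing steadily → memory shortage; performance will degrade sharply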
sar — Historical data replay
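# sar data is collected by the sysstat service every 10 minutes and stored in /var/log/sa/ (/var/log/sysstat/ on Debian/Ubuntu)
# CPU utilization for yesterday
sar -u -f /var/log/sa/sa$(date -d yesterday +%d)
# Disk IO, sampled live (1 s interval, 5 samples)
sar -d 1 5
# Network interface throughput
sar -n DEV 1 5
# Run queue length and load averages
sar -q 1 5
# Paging statistics (pages paged in and out)
sar -B 1 5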
3. strace: Syscall Tracing
strace uses ptrace to intercept and record every syscall a target process makes, which makes it well suited to debugging stuck processes, unexpected file access, and permission errors. Warning: strace can slow the target process significantly, so use it with caution in production.
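# Trace a newly started process
strace ls /tmp
# Attach to a running process
strace -p 1234
# Add timestamps and durations (-tt: time of day with microseconds, -T: time spent in each call)
strace -tt -T -p 1234
# Trace only specific syscalls (file access)
strace -e trace=openat,read,write,close -p 1234
# Summary mode: call count and total time per syscall (the most useful mode)
strace -c -p 1234
# Sample output:
# % time seconds usecs/call calls errors syscall
# 45.23 0.001234 12 100 0 epoll_wait
# 23.10 0.000631 6 100 0 read
# Follow child processes and threads
strace -f -p 1234
# Common diagnoses:
# Stuck in futex(...) → lock contention
# Stuck in epoll_wait → normal; the process is waiting for IO events
# Many stat() calls → slow path lookups; check the inode cache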
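The `strace -c` summary is already sorted by time, so the dominant syscalls can be pulled out with a short filter. The sketch below is one way to do that, and the function name `top_syscalls` is hypothetical. It takes the syscall name from the last column because the errors column is left blank when a call never failed. strace writes its report to stderr, so the example uses `-o` to capture it in a file.

```bash
#!/usr/bin/env bash
# top_syscalls [N]
# Reads an `strace -c` summary on stdin and prints the N most expensive
# syscalls (default 3) as "name percent".
top_syscalls() {
  awk '$1 ~ /^[0-9.]+$/ && $NF != "total" { printf "%s %s%%\n", $NF, $1 }' |
    head -n "${1:-3}"
}

# Example (stop strace with Ctrl-C after a few seconds, then summarize):
#   strace -c -o /tmp/strace.sum -p 1234
#   top_syscalls 5 < /tmp/strace.sum
```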
4. ltrace: Library Call Tracing
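# Trace calls into shared library functions (user space; does not enter the kernel)
ltrace ./myapp
# Attach to a running process
ltrace -p 1234
# Trace only the malloc/free family of memory allocation calls
ltrace -e malloc,free,realloc -p 1234
# Count calls
ltrace -c ./myapp
# Show syscalls as well (-S)
ltrace -S ./myapp
# Inspect dynamic link dependencies
ldd /usr/bin/nginx
objdump -p /usr/bin/nginx | grep NEEDED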
5. perf: Hardware Performance Counters
perf directly accesses the CPU's PMU (Performance Monitoring Unit) with extremely low overhead.
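### perf top: live hotspots
```bash
# Show the hottest functions live (like top, but at function and instruction level)
perf top
# Restrict to one process
perf top -p 1234
# perf trace: strace-like, but with lower overhead (built on eBPF/tracepoints)
perf trace -p 1234
perf trace -e openat,read -p 1234
```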
Prerequisites: perf requires ① the linux-tools-$(uname -r) package, ② kernel.perf_event_paranoid=1 (allows non-root sampling), and ③ binaries with debug symbols (Go includes them by default; C/C++ needs -g or the matching -dbg packages).
6. FlameGraph Visualization
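Flame graphs were invented by Brendan Gregg. The X-axis shows the share of samples, so a wider frame means more CPU time; its left-to-right order is alphabetical, not chronological. The Y-axis shows call stack depth: each frame is called by the frame directly below it, and the top edge of a tower is the function that was on-CPU. Colors are chosen at random and carry no meaning. A flame graph makes CPU hotspot functions visible at a glance.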
#!/usr/bin/env bash
# flamegraph.sh: complete flame graph generation script
set -euo pipefail
PID="${1:?Usage: flamegraph.sh PID [duration_seconds]}"
DURATION="${2:-30}"
OUTPUT_DIR="${3:-/tmp/flamegraph-$(date +%Y%m%d-%H%M%S)}"
FLAMEGRAPH_DIR="/opt/FlameGraph"
# Install the FlameGraph tools if they are not present
if [ ! -d "$FLAMEGRAPH_DIR" ]; then
git clone --depth=1 https://github.com/brendangregg/FlameGraph.git "$FLAMEGRAPH_DIR"
fi
mkdir -p "$OUTPUT_DIR"
echo "[1/4] Sampling PID $PID for ${DURATION}s ..."
perf record -g --call-graph dwarf \
-o "$OUTPUT_DIR/perf.data" \
-p "$PID" \
sleep "$DURATION"
echo "[2/4] Generating perf script ..."
perf script -i "$OUTPUT_DIR/perf.data" > "$OUTPUT_DIR/perf.out"
echo "[3/4] Collapsing stacks ..."
"$FLAMEGRAPH_DIR/stackcollapse-perf.pl" \
"$OUTPUT_DIR/perf.out" > "$OUTPUT_DIR/folded.out"
echo "[4/4] Rendering flame graph ..."
"$FLAMEGRAPH_DIR/flamegraph.pl" \
--title "CPU Flame Graph — PID $PID (${DURATION}s)" \
--width 1400 \
"$OUTPUT_DIR/folded.out" > "$OUTPUT_DIR/flamegraph.svg"
echo "Done: $OUTPUT_DIR/flamegraph.svg"
echo "Open with: xdg-open $OUTPUT_DIR/flamegraph.svg"
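The folded.out file written by the script holds one line per unique stack, with frames joined by semicolons and followed by a sample count. Frame width in the SVG is proportional to those counts. The sketch below computes the share of samples per leaf function directly from folded output, which is a quick text-only check of what the graph will show. The function name `leaf_share` is hypothetical.

```bash
#!/usr/bin/env bash
# leaf_share: reads folded stacks ("frame1;frame2;...;leaf COUNT") on stdin
# and prints each leaf function with its share of all samples, largest first.
leaf_share() {
  awk '
    {
      count = $NF                         # last field is the sample count
      stack = $0
      sub(/ [0-9]+$/, "", stack)          # drop the count, keep frames (may contain spaces)
      n = split(stack, frames, ";")
      leaf[frames[n]] += count
      total += count
    }
    END {
      for (f in leaf) printf "%.1f%% %s\n", 100 * leaf[f] / total, f
    }' | sort -rn
}

# Example:
#   leaf_share < /tmp/flamegraph-20240101-120000/folded.out | head
```

The directory name in the example is only a placeholder for whatever OUTPUT_DIR the script used.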
Off-CPU Flame Graph (Wait Time)
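# An off-CPU flame graph shows where threads spend time while not running (IO waits, lock waits)
# Requires eBPF support
git clone https://github.com/brendangregg/FlameGraph.git /opt/FlameGraph
# Collect off-CPU durations with bpftrace (stop with Ctrl-C)
# prev_state 1 = TASK_INTERRUPTIBLE, 2 = TASK_UNINTERRUPTIBLE; literals are used
# because bpftrace does not know these macros without including linux/sched.h
# Output: per-thread histograms of off-CPU time in ns, not folded stacks
bpftrace -e '
tracepoint:sched:sched_switch {
  if (args->prev_state == 1 || args->prev_state == 2) {
    @start[args->prev_pid] = nsecs;
  }
}
tracepoint:sched:sched_switch {
  if (@start[args->next_pid]) {
    @offcpu[args->next_comm, args->next_pid] =
      hist(nsecs - @start[args->next_pid]);
    delete(@start[args->next_pid]);
  }
}' > /tmp/offcpu.txt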
7. eBPF and bpftrace
eBPF (extended Berkeley Packet Filter) allows safely running user-written programs inside the kernel without modifying kernel source or loading kernel modules. eBPF programs pass through a static verifier guaranteeing they cannot crash the kernel. bpftrace is a high-level eBPF scripting language with awk-like syntax.
bpftrace Common One-liners
| Goal | bpftrace command |
|---|---|
| Trace all open() calls | bpftrace -e 'tracepoint:syscalls:sys_enter_openat { printf("%s %s\n", comm, str(args->filename)); }' |
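| Disk IO latency histogram (µs; slow IO shows up in the buckets above 1000) | bpftrace -e 'tracepoint:block:block_rq_issue { @start[args->dev, args->sector] = nsecs; } tracepoint:block:block_rq_complete /@start[args->dev, args->sector]/ { @lat_us = hist((nsecs - @start[args->dev, args->sector]) / 1000); delete(@start[args->dev, args->sector]); }' |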
| Trace new TCP connections | bpftrace -e 'kprobe:tcp_connect { printf("%s → %d\n", comm, pid); }' |
| Count syscalls per process | bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }' |
| Histogram of function latency (ns) | bpftrace -e 'uprobe:/usr/bin/myapp:main { @start[tid] = nsecs; } uretprobe:/usr/bin/myapp:main /@start[tid]/ { @lat = hist(nsecs - @start[tid]); delete(@start[tid]); }' |
| Trace OOM kill events | bpftrace -e 'kprobe:oom_kill_process { printf("OOM kill: %s (pid=%d)\n", comm, pid); }' |
| Count kernel function call rate | bpftrace -e 'kprobe:vfs_read { @[kstack] = count(); } interval:s:5 { print(@); clear(@); }' |
BCC Tool Collection
# Install BCC (Ubuntu/Debian)
apt install bpfcc-tools linux-headers-$(uname -r)
# opensnoop: trace all open calls
opensnoop-bpfcc
# execsnoop: trace new process execution
execsnoop-bpfcc
# tcptop: TCP throughput by process
tcptop-bpfcc
# biolatency: disk IO latency distribution as a histogram (-D: one histogram per disk; one 10 s interval)
biolatency-bpfcc -D 10 1
# runqlat: distribution of CPU run queue wait time
runqlat-bpfcc 10 1
# profile: sample CPU call stacks (for flame graphs)
profile-bpfcc -F 99 -f 30 > /tmp/out.stacks
/opt/FlameGraph/flamegraph.pl /tmp/out.stacks > profile.svg
8. Network and IO Monitoring Tools
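# iotop: live disk IO by process (requires root)
iotop -ao # -a accumulated mode, -o only show processes doing IO
# nethogs: network bandwidth by process (requires root)
nethogs eth0
# iftop: bandwidth by connection pair (requires root)
iftop -i eth0
# nload: live interface-level bandwidth graph
nload eth0
# ss: socket statistics (a faster replacement for netstat)
ss -tunap # all TCP/UDP connections with their processes
ss -s # summary of connection counts
ss -o state ESTABLISHED '( dport = :80 or sport = :80 )' # filter on port 80
# NIC packet drop statistics
ip -s link show eth0
ethtool -S eth0 | grep -i drop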
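The same drop and error counters that `ip -s link` prints are also exposed as plain files under /sys/class/net/IFACE/statistics/, which is easier to script against than the formatted output. The sketch below reads four of them. The function name `drop_counters` and its optional second argument (a base directory, added here only so the function can be pointed at a test tree) are assumptions for this example.

```bash
#!/usr/bin/env bash
# drop_counters IFACE [BASE_DIR]
# Prints drop and error counters for one interface from sysfs.
# BASE_DIR defaults to /sys/class/net; it is a hypothetical parameter
# that exists so the function can be exercised without real hardware.
drop_counters() {
  local iface="$1" base="${2:-/sys/class/net}"
  local dir="$base/$iface/statistics" name
  for name in rx_dropped tx_dropped rx_errors tx_errors; do
    printf '%s=%s\n' "$name" "$(cat "$dir/$name")"
  done
}

# Example: take two readings a few seconds apart and compare them,
# since the counters are cumulative since the interface came up.
#   drop_counters eth0; sleep 5; drop_counters eth0
```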
9. Memory Deep Dive
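# Key fields in /proc/meminfo
cat /proc/meminfo
# MemTotal:        total physical memory
# MemFree:         completely free memory (excludes caches)
# MemAvailable:    memory actually available (includes reclaimable cache)
# Buffers:         block device buffers
# Cached:          page cache (cached file contents)
# SwapCached:      pages swapped back into memory that still have a copy in swap
# Active/Inactive: page activity state (affects reclaim policy)
# Dirty:           dirty pages waiting to be flushed (a high value indicates heavy write pressure)
# Writeback:       pages currently being written back to disk
# Slab:            memory used by the kernel slab allocator
# VmallocUsed:     memory allocated through vmalloc
# slabtop: kernel slab cache usage (the dentry and inode caches are often the largest)
slabtop
# smem: accurate per-process memory usage (PSS is more accurate than RSS)
# PSS = private memory + a proportional share of shared memory
smem -r -k -s pss | head -20
smem -P nginx -k # filter by process name
# valgrind memory leak detection (test environments only)
valgrind --tool=memcheck --leak-check=full ./myapp
# valgrind heap profiling (memory allocation hotspots)
valgrind --tool=massif --pages-as-heap=yes ./myapp
ms_print massif.out.* | head -100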
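Because MemFree excludes reclaimable cache, MemAvailable is usually the better figure to alert on. The sketch below computes it as a percentage of MemTotal from meminfo-formatted input. The function name `mem_available_pct` is hypothetical, and any alert threshold applied to the result would be a local choice.

```bash
#!/usr/bin/env bash
# mem_available_pct: reads /proc/meminfo-style text on stdin and prints
# MemAvailable as a percentage of MemTotal, to one decimal place.
mem_available_pct() {
  awk '
    $1 == "MemTotal:"     { total = $2 }
    $1 == "MemAvailable:" { avail = $2 }
    END { if (total > 0) printf "%.1f\n", 100 * avail / total }'
}

# Example:
#   mem_available_pct < /proc/meminfo
```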
10. Practice: Full "CPU 100%" Investigation
The following is a complete CPU 100% investigation walkthrough, from alert receipt to root cause identification:
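## Step 1: Confirm the symptom
uptime
# Output: load average: 15.23, 14.98, 13.01
# → Sustained high load: the 15-minute average is high too, so this is not a momentary spike
## Step 2: Locate the process
top -b -n 1 | head -30
# PID 2341 (myapp) is using 790% CPU on an 8-core machine
## Step 3: Determine whether the CPU time is user space or kernel space
pidstat -u -p 2341 1 5
# %user=785 %system=5 → user-space hotspot, in application code
## Step 4: Look at per-thread CPU to find the hot thread
top -H -p 2341
# Thread TID 2345 is at 99% CPU
ps -Lp 2341 -o pid,tid,pcpu,comm
# Confirms the hot thread: tid=2345
## Step 5: Sample with perf for 30 seconds
perf record -g --call-graph dwarf -p 2341 sleep 30
perf report --stdio | head -60
# Hottest function: json.Marshal → 74.3%
# → JSON serialization accounts for most of the CPU time
## Step 6: Generate a flame graph to confirm
perf script > /tmp/perf.out
/opt/FlameGraph/stackcollapse-perf.pl /tmp/perf.out > /tmp/folded.out
/opt/FlameGraph/flamegraph.pl /tmp/folded.out > /tmp/flame.svg
# The json.Marshal call chain spans more than 70% of the graph width
## Step 7: Measure call frequency with bpftrace
# The Go symbol contains a slash, so it is quoted
bpftrace -e '
uprobe:/opt/myapp/bin/myapp:"encoding/json.Marshal" {
  @calls = count();
}
interval:s:1 {
  print(@calls);
  clear(@calls);
}' &
# Output: 50,000 calls per second
## Step 8: Confirm at the code level
# Code review shows that the hot HTTP handler serializes the complete
# object on every request, although most fields do not change between requests
# → Root cause: serialized results are not cached
## Step 9: Fix and verify
# Add a sync.Map cache of serialized results with TTL=1s
# After redeploying:
perf stat -p "$(pgrep -o myapp)" sleep 10
# IPC rose from 0.3 to 2.1; CPU usage fell from 790% to 45%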
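The IPC figure quoted in step 9 is instructions divided by cycles, both of which appear in the `perf stat` report. The sketch below recomputes it from saved output. It assumes the events are printed under the plain names `cycles` and `instructions` with comma thousands separators, which varies with CPU type and locale. The function name `ipc_from_perf_stat` is hypothetical.

```bash
#!/usr/bin/env bash
# ipc_from_perf_stat: reads `perf stat` output on stdin and prints
# instructions per cycle to two decimal places.
# Assumes event names "cycles" and "instructions" and "," as the
# thousands separator; other layouts need a different pattern.
ipc_from_perf_stat() {
  awk '
    NF >= 2 {
      count = $1
      gsub(/,/, "", count)                  # 1,234,567 -> 1234567
      if ($2 ~ /^cycles/)       cycles = count + 0
      if ($2 ~ /^instructions/) insns  = count + 0
    }
    END { if (cycles > 0) printf "%.2f\n", insns / cycles }'
}

# Example (perf stat reports on stderr, so redirect it into the pipe):
#   perf stat -e cycles,instructions -p 2341 sleep 10 2>&1 | ipc_from_perf_stat
```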
Investigation summary: the full path was uptime → top → pidstat → top -H → perf record → perf report → FlameGraph → bpftrace verification → code review → fix. Each step narrows the search: machine → process → thread → function → line of code.