System Calls
Chapter 17: Syscalls and Kernel Interface
System calls are the only legitimate boundary between user programs and the kernel — every open(), read(), and write() involves a privilege-level switch. This chapter starts from the x86_64 ABI, then covers full strace call-chain analysis, the /proc virtual filesystem, sysctl kernel parameter tuning, the /sys device tree, kernel module management, and dmesg log interpretation — all tied together by a complete production "file descriptor leak" investigation.
1. Syscall Internals
User Mode vs Kernel Mode
Modern CPUs isolate user code from kernel code with privilege levels, which x86 calls rings. User programs run in Ring 3 (lowest privilege); the kernel runs in Ring 0 (highest privilege). Ring 3 code cannot directly access hardware or other processes' memory; it must trap into the kernel via a system call. This mechanism is both a security boundary and a performance cost: a minimal syscall typically costs on the order of 100–300 ns on current hardware (privilege switch, register save and restore, and, where kernel page-table isolation is enabled, a page-table switch). A syscall does not by itself imply a full context switch to another process.
Two Syscall Trigger Mechanisms
| Mechanism | When Used | Performance |
|---|---|---|
| int 0x80 | Legacy 32-bit x86; traps via a software interrupt | Slower (~1 µs) |
| syscall | x86_64 fast path; the CPU loads the kernel entry point from MSRs | Fast (~100–300 ns) |
| vDSO | Kernel maps some calls (e.g. gettimeofday) into user address space, so they usually complete without entering the kernel | Extremely fast (nanoseconds) |
x86_64 Syscall ABI Convention
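x86_64 Linux uses the following register convention to pass the syscall number and arguments (the syscall numbers themselves are listed in /usr/include/asm/unistd_64.h; some distributions place this header under an architecture-specific directory such as /usr/include/x86_64-linux-gnu/asm/):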
- rax: syscall number (e.g. read=0, write=1, open=2, fork=57)
- rdi, rsi, rdx, r10, r8, r9: 1st–6th arguments
- rax (return value): non-negative on success, -errno on failure
# List syscall numbers (x86_64)
grep "^#define __NR_" /usr/include/asm/unistd_64.h | head -20
# Look them up with ausyscall (provided by the audit package)
ausyscall --dump | grep -E "^(0|1|2|3|60)[[:space:]]"
# Total number of syscalls
grep -c "^#define __NR_" /usr/include/asm/unistd_64.h
# Show the vDSO mapping (the [vdso] line in /proc/self/maps)
grep vdso /proc/self/maps
# 7fff12345000-7fff12346000 r-xp 00000000 00:00 0 [vdso]
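The return-value rule in the ABI list above (non-negative on success, -errno on failure) can be illustrated with a small helper. This is a minimal sketch: decode_syscall_ret is a hypothetical name, and the -4095..-1 error range is an assumption based on how C libraries conventionally tell an error apart from a large valid result.

```bash
# Hypothetical helper: classify a raw syscall return value (the value left in rax).
# Assumption: values in -4095..-1 are errors (-errno); anything else is a result.
decode_syscall_ret() (
    ret=$1
    if [ "$ret" -ge -4095 ] && [ "$ret" -lt 0 ]; then
        echo "error errno=$(( 0 - $ret ))"
    else
        echo "ok value=$ret"
    fi
)

decode_syscall_ret 3     # e.g. a successful open() returning fd 3
decode_syscall_ret -2    # raw -2 corresponds to errno 2 (ENOENT)
```

User programs rarely see the raw value: the C library wrapper turns it into a -1 return plus errno.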
2. strace Deep Dive
strace uses ptrace(2) to stop the target process at every syscall entry and exit, recording arguments and return values. Warning: strace can slow a syscall-heavy process by 10–100x. In production, prefer -c summary mode with a narrow -e filter, or use bpftrace instead.
Common Options Quick Reference
| Option | Meaning |
|---|---|
| -p PID | Attach to running process |
| -tt | Show microsecond-precision absolute timestamps |
| -T | Show time spent in each call at end of line |
| -e trace=TYPE | Filter by category (file/network/process/signal/ipc/desc) |
| -e trace=open,read | Trace only specific call names |
| -c | Summary mode: count calls, errors, and total time per syscall (far less output, although every call is still intercepted) |
| -f | Follow child processes created by fork/clone |
| -o FILE | Write output to file instead of stderr |
| -s 256 | String truncation length (default 32; increase to see full paths) |
| -y | Print fd numbers as file paths (more readable) |
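# Basics: trace every syscall of a newly started command
strace ls /tmp
# Attach to a running process (Ctrl+C detaches; the process keeps running)
strace -p 1234
# Timestamps plus per-call duration (diagnosing slow startup)
strace -tt -T -p 1234 2>&1 | head -50
# File-related calls only (those that take a path: open/stat/chmod/unlink ...)
strace -e trace=file -p 1234
# Network calls only (connect/bind/accept/send/recv)
strace -e trace=network -p 1234
# Summary mode: recommended for production troubleshooting
strace -c -p 1234
# Sample output:
# % time seconds usecs/call calls errors syscall
# ------- ----------- ----------- --------- --------- ----------------
# 52.13 0.002341 23 100 0 epoll_wait
# 21.30 0.000957 9 100 0 read
# 8.44 0.000379 7 54 12 openat
# 7.23 0.000325 3 90 0 write
# Follow child processes (use -ff together with -o to write one file per process)
strace -f -o /tmp/strace.out nginx
# Diagnose slow startup: look for a flood of stat() calls
# (-T appends the duration as <seconds>; strip the brackets before summing)
strace -tt -T -e trace=stat,statx,lstat ./slow_app 2>&1 | \
awk '/<[0-9.]+>$/ { t = $NF; gsub(/[<>]/, "", t); sum += t; n++ } END { print n + 0, "calls, total:", sum + 0, "s" }'
# Diagnose file permission problems: look for EACCES/ENOENT returned by openat
strace -e trace=openat -s 256 -y ./myapp 2>&1 | grep -E "EACCES|ENOENT"
# Diagnose failed network connections
strace -e trace=connect -p 1234 2>&1 | grep -E "ECONNREFUSED|ETIMEDOUT"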
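Summary output like the sample above is easy to post-process. The following is a sketch and not part of strace: syscalls_with_errors is a hypothetical helper that reads an `strace -c` summary on stdin and lists syscalls with a non-zero error count. It assumes the column layout shown above and also tolerates a blank errors column, which some strace versions print instead of 0.

```bash
# Hypothetical helper: list syscalls whose error count is non-zero.
# Input: the text of an `strace -c` summary on stdin.
syscalls_with_errors() {
    awk '
        $1 ~ /^[0-9.]+$/ && $NF != "total" {
            # 6 fields: errors column present; 5 fields: errors column blank
            errors = (NF >= 6) ? $5 : 0
            if (errors + 0 > 0) print $NF, errors
        }
    '
}

# Fed with the sample table from this section:
syscalls_with_errors <<'EOF'
% time seconds usecs/call calls errors syscall
------- ----------- ----------- --------- --------- ----------------
52.13 0.002341 23 100 0 epoll_wait
21.30 0.000957 9 100 0 read
8.44 0.000379 7 54 12 openat
7.23 0.000325 3 90 0 write
EOF
# prints: openat 12
```

A non-zero error count on openat is often harmless (for example, probing for optional config files), so treat the list as a starting point rather than a verdict.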
3. ltrace: Library Call Tracing
ltrace traces calls into dynamic libraries by placing breakpoints on PLT (Procedure Linkage Table) entries, so it reports user-space library calls rather than syscalls. Like strace it relies on ptrace and adds noticeable overhead. It complements strace for analyzing malloc/free memory allocation, libc string operations, and other user-space behavior.
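# Trace dynamic library calls
ltrace ./myapp
# Attach to a running process
ltrace -p 1234
# Trace only memory allocation (hunting memory leaks)
ltrace -e malloc,free,realloc,calloc -p 1234
# Count calls and time spent
ltrace -c ./myapp
# Also show syscalls (-S)
ltrace -S ./myapp
# List a program's shared library dependencies
ldd /usr/bin/curl
# Key differences between strace and ltrace
# strace: kernel-level syscalls (read, write, open, fork...)
# ltrace: user-space library functions (printf, malloc, strcmp, fopen...)
# Suggested workflow: strace first for the kernel layer, then ltrace for the library layer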
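To make the leak-hunting use of ltrace concrete, the sketch below pairs malloc/calloc results with later free calls in a saved trace. unfreed_allocations is a hypothetical helper, and it assumes ltrace's one-call-per-line output of the form `malloc(16) = 0x...` and `free(0x...) = <void>`, optionally prefixed by a caller or PID. It deliberately ignores realloc.

```bash
# Hypothetical helper: print pointers that were allocated but never freed.
# Input: ltrace output on stdin, one call per line.
unfreed_allocations() {
    awk '
        /(^|[ >])(malloc|calloc)\(.*= +0x[0-9a-f]+$/ { live[$NF] = 1; next }
        /(^|[ >])free\(0x[0-9a-f]+\)/ {
            match($0, /free\(0x[0-9a-f]+\)/)
            ptr = substr($0, RSTART + 5, RLENGTH - 6)   # text between "free(" and ")"
            delete live[ptr]
        }
        END { for (p in live) print p }
    ' | sort
}

printf '%s\n' \
    'malloc(16)      = 0x55aa10' \
    'malloc(32)      = 0x55aa40' \
    'free(0x55aa10)  = <void>' | unfreed_allocations
# prints: 0x55aa40
```

Pointers still live at the end of a trace are only candidates: long-lived caches and allocations freed after tracing stopped show up the same way.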
4. /proc Virtual Filesystem
/proc is a virtual filesystem, mostly read-only with some writable entries, through which the kernel exports runtime state. It lives in memory and uses no disk space. Each process has a /proc/PID/ directory; system-wide information lives in top-level entries such as /proc/meminfo and /proc/sys/.
Per-Process Directory /proc/PID/
| Path | Contents | Common Use |
|---|---|---|
| cmdline | Full command line (NUL-separated) | cat /proc/1234/cmdline |
| maps | Memory map regions (address/perms/file) | See loaded .so libraries |
| smaps | Detailed RSS/PSS/Shared stats per segment | Precise memory usage analysis |
| fd/ | All open fds (symlinks to actual files) | ls -la /proc/1234/fd |
| fdinfo/ | Offset and flags for each fd | Debug fd state |
| status | Process status summary (VmRSS/Threads/State/PPid) | grep VmRSS /proc/1234/status |
| stat | Machine-readable process stats (source for ps) | Script-parse process data |
| environ | Environment variables at launch (NUL-separated) | cat /proc/1234/environ |
| cgroup | cgroup hierarchy path for this process | Confirm container membership |
| net/tcp | TCP socket table of the process's network namespace (not limited to this process) | See listening ports |
# Show the process command line
cat /proc/1234/cmdline | tr '\0' ' '; echo
# Count the file descriptors the process has open
ls /proc/1234/fd | wc -l
# Show which file each fd points to
ls -la /proc/1234/fd
# Show memory usage (VmRSS = resident physical memory)
grep -E "^(VmRSS|VmSize|VmSwap|Threads)" /proc/1234/status
# smaps_rollup: quick summary (kernel 4.14+)
cat /proc/1234/smaps_rollup
# Show environment variables
cat /proc/1234/environ | tr '\0' '\n' | grep PATH
# /proc/self refers to whichever process reads it (here, cat itself)
cat /proc/self/status | head -10
# Find all loaded shared libraries via maps
grep "\.so" /proc/1234/maps | awk '{print $6}' | sort -u
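Scripts usually want one value out of /proc/PID/status rather than a grep of whole lines. A minimal sketch, assuming the "Key:<padding>value" layout of that file; status_field is a hypothetical helper name.

```bash
# Hypothetical helper: print the value of one field from /proc/PID/status-style text.
# Input on stdin; field name (without the colon) as $1.
status_field() {
    awk -v key="$1" '
        index($0, key ":") == 1 {
            sub(/^[^:]*:[ \t]*/, "")   # drop "Key:" and the padding after it
            print
        }
    '
}

# Example with the current shell; needs a Linux /proc.
if [ -r "/proc/$$/status" ]; then
    status_field VmRSS < "/proc/$$/status"
fi
```

Reading from stdin keeps the helper usable on saved copies of the file as well as on a live process.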
Global /proc Files
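# Memory overview
cat /proc/meminfo
# CPU information (model, core count, frequency, cache)
cat /proc/cpuinfo | grep -E "model name|cpu cores|cache size" | head -6
# System load (1/5/15-minute averages, running/total tasks, most recent PID)
cat /proc/loadavg
# TCP connection table (hexadecimal addresses that need decoding)
cat /proc/net/tcp
# Interrupt statistics (interrupt counts per CPU)
cat /proc/interrupts | head -20
# Kernel boot parameters
cat /proc/cmdline
# Mount information
cat /proc/mounts
# System-wide file handle usage
cat /proc/sys/fs/file-nr # allocated / allocated but unused / maximum
# /proc/sys is the same interface sysctl uses (readable and writable directly)
cat /proc/sys/net/ipv4/ip_forward
echo 1 > /proc/sys/net/ipv4/ip_forward # temporarily enable IP forwarding (needs root)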
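The hexadecimal addresses in /proc/net/tcp can be decoded with a few lines of shell. This is a sketch for IPv4 only; decode_proc_net_addr is a hypothetical helper, and it assumes a little-endian host such as x86_64, where the four address bytes appear in reverse order.

```bash
# Hypothetical helper: decode an IPv4 "ADDR:PORT" field from /proc/net/tcp.
# Assumption: little-endian host, so the address bytes are reversed;
# the port is plain hexadecimal.
decode_proc_net_addr() (
    hex_ip=${1%%:*}
    hex_port=${1##*:}
    b1=$(printf '%s\n' "$hex_ip" | cut -c7-8)
    b2=$(printf '%s\n' "$hex_ip" | cut -c5-6)
    b3=$(printf '%s\n' "$hex_ip" | cut -c3-4)
    b4=$(printf '%s\n' "$hex_ip" | cut -c1-2)
    printf '%d.%d.%d.%d:%d\n' "0x$b1" "0x$b2" "0x$b3" "0x$b4" "0x$hex_port"
)

decode_proc_net_addr 0100007F:1F90    # prints 127.0.0.1:8080
```

In practice ss or netstat performs this decoding; the helper mainly shows what the raw table contains.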
5. sysctl: Kernel Parameter Tuning
sysctl provides a unified interface to read and write kernel parameters under /proc/sys/. Parameters are hierarchical: net.ipv4.tcp_syncookies maps to /proc/sys/net/ipv4/tcp_syncookies. Changes can be temporary (immediate but lost on reboot) or permanent (written to config files).
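The name-to-path mapping described above is a plain character substitution. A minimal sketch with two hypothetical helpers; it assumes no component of the parameter name itself contains a dot (interface names such as VLAN devices can break that assumption).

```bash
# Hypothetical helpers: convert between sysctl names and /proc/sys paths.
# Assumption: no component of the name itself contains a dot.
sysctl_to_path() {
    printf '/proc/sys/%s\n' "$(printf '%s' "$1" | tr . /)"
}
path_to_sysctl() {
    printf '%s\n' "${1#/proc/sys/}" | tr / .
}

sysctl_to_path net.ipv4.tcp_syncookies    # prints /proc/sys/net/ipv4/tcp_syncookies
path_to_sysctl /proc/sys/vm/swappiness    # prints vm.swappiness
```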
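# Show all parameters
sysctl -a
# Show a single parameter
sysctl net.ipv4.tcp_max_syn_backlog
sysctl vm.swappiness
# Temporary change (reverts to the default after reboot)
sysctl -w net.ipv4.ip_forward=1
sysctl -w fs.file-max=1048576
# Permanent change (write a config file; /etc/sysctl.d/ is recommended)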
cat > /etc/sysctl.d/99-production.conf
6. /sys Device Tree (sysfs)
sysfs is mounted at /sys and exports kernel objects (devices, drivers, buses) to user space as a directory tree. Compared to /proc, sysfs enforces a stricter "one directory per object, one file per attribute" structure and is the foundation for udev rules and hardware management.
# Top-level layout
ls /sys/
# block bus class dev devices firmware fs kernel module power
# List all block devices
ls /sys/block/
# Information about disk sda
cat /sys/block/sda/size # total size in 512-byte sectors
cat /sys/block/sda/queue/rotational # 0 = non-rotational (SSD), 1 = rotational (HDD)
cat /sys/block/sda/queue/scheduler # I/O scheduler
cat /sys/block/sda/device/model # disk model
# Network interface information
cat /sys/class/net/eth0/speed # link speed (Mbps)
cat /sys/class/net/eth0/operstate # up/down
cat /sys/class/net/eth0/statistics/rx_bytes # bytes received
# CPU frequency and power saving
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
# Switch the CPU governor to performance (needs root)
echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
# Change the I/O scheduler (none or mq-deadline is recommended for SSDs)
echo mq-deadline > /sys/block/nvme0n1/queue/scheduler
# udev rules (automatic driver loading, device naming, etc.)
ls /etc/udev/rules.d/
# udevadm info shows device attributes (useful when writing rules)
udevadm info --query=all --name=/dev/sda | head -20
# debugfs (needs root; mounts the debugging interface)
mount -t debugfs none /sys/kernel/debug
ls /sys/kernel/debug/tracing/ # ftrace interface
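Because sysfs exposes one attribute per file, combining attributes is simple shell arithmetic. The sketch below reads the size and rotational attributes shown above; describe_block_device is a hypothetical helper, and its optional second argument (an alternative sysfs root) exists only so it can be tried against a copy of the tree.

```bash
# Hypothetical helper: describe a block device from its sysfs attributes.
# $1 = device name (e.g. sda); $2 = sysfs root, defaulting to /sys.
describe_block_device() (
    dev=$1
    root=${2:-/sys}
    sectors=$(cat "$root/block/$dev/size") || exit 1
    rotational=$(cat "$root/block/$dev/queue/rotational") || exit 1
    # sysfs reports size in 512-byte units regardless of the real sector size
    bytes=$(( $sectors * 512 ))
    if [ "$rotational" -eq 0 ]; then kind=non-rotational; else kind=rotational; fi
    echo "$dev $bytes bytes $kind"
)
```

On a real system it would be called as `describe_block_device sda`. Virtual disks often report rotational=1 regardless of the backing storage, so treat that flag as a hint.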
7. Kernel Modules
Kernel modules (.ko files) are pieces of kernel code that can be loaded and unloaded at runtime, with no kernel recompilation or reboot required. Many drivers, filesystems, and network protocols are built as modules. Loaded modules run in Ring 0, so a bug in a module can crash the whole kernel (kernel panic).
Module Management Commands
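# List loaded modules (name / size / use count / used by)
lsmod
# Show module details (version/parameters/dependencies/description)
modinfo ext4
modinfo nvidia
# Load a module together with all its dependencies (recommended)
modprobe nf_conntrack
# Load a module with parameters
modprobe usbhid quirks=0x1234:0x5678:0x0004
# Unload a module (only if nothing is using it)
modprobe -r nf_conntrack
# Load a .ko file directly (no dependency resolution; for debugging only)
insmod /path/to/mymodule.ko
# Unload directly (does not remove dependencies; use with care)
rmmod mymodule
# List module parameters (each parameter is a file in this directory)
ls /sys/module/ext4/parameters/
ls /sys/module/nf_conntrack/parameters/
# Load-at-boot configuration
cat /etc/modules-load.d/modules.conf
echo "nf_conntrack" >> /etc/modules-load.d/custom.conf
# Persistent module parameter configuration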
cat > /etc/modprobe.d/custom.conf
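The use-count column printed by lsmod can be filtered to find modules nothing currently depends on. A minimal sketch, assuming the usual three-column lsmod layout (Module, Size, Used by); unused_modules is a hypothetical helper name.

```bash
# Hypothetical helper: list modules whose use count is 0.
# Input: `lsmod` output on stdin (columns: Module, Size, Used by).
unused_modules() {
    awk 'NR > 1 && $3 == 0 { print $1 }'
}

printf '%s\n' \
    'Module                  Size  Used by' \
    'nf_conntrack          172032  2 nf_nat,xt_conntrack' \
    'floppy                 81920  0' | unused_modules
# prints: floppy
```

On a live system this would be `lsmod | unused_modules`. A use count of 0 only means no other module or open handle holds a reference; it does not mean the hardware or feature is unneeded.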
8. dmesg: Kernel Log
dmesg reads the kernel's ring buffer log, which contains hardware detection, driver messages, kernel warnings, and errors. After boot, logs are saved to /var/log/kern.log (rsyslog) or accessed via journalctl -k (systemd).
# Show all kernel messages (with human-readable timestamps)
dmesg -T
# Warnings and errors only
dmesg -T -l warn,err
# A specific facility only (kern = kernel, user = user space, daemon = daemons)
dmesg -T -f kern
# Follow in real time (like tail -f; kernel 3.5+)
dmesg --follow
# Filter by keyword
dmesg -T | grep -i "oom\|killed\|out of memory"
dmesg -T | grep -i "error\|fail\|hardware"
dmesg -T | grep -i "sda\|nvme\|I/O error"
# Clear the ring buffer (use with care; needs root)
dmesg -C
# Interpreting common kernel errors:
# OOM killer (out of memory; the kernel kills a process)
# Out of memory: Killed process 1234 (java) total-vm:4096kB, anon-rss:2048kB
# → Fix: add memory, tune vm.swappiness, or cap the process's memory usage
# Hardware error (bad disk sectors)
# end_request: I/O error, dev sda, sector 1234567
# blk_update_request: I/O error, dev sda, sector 1234567 op 0x0:(READ)
# → Fix: check SMART status with smartctl -a /dev/sda, then back up and migrate the data as soon as possible
# Network device dropping packets
# eth0: Dropped oversize packet
# eth0: RX ring buffer not full
# → Fix: increase /proc/sys/net/core/netdev_max_backlog
# Filesystem error
# EXT4-fs error (device sda1): ext4_find_entry:1455: inode #2: comm bash
# → Fix: unmount, then run fsck -y /dev/sda1
# System journal (journalctl)
journalctl -k # kernel log for the current boot
journalctl -k -b -1 # kernel log for the previous boot
journalctl -k -p err # kernel messages at error level or more severe
journalctl -k --since "1 hour ago" # kernel log for the last hour
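OOM-killer lines like the sample above have a regular shape, so the victim's PID and name can be pulled out mechanically. A minimal sketch: oom_victims is a hypothetical helper, and it assumes the "Killed process PID (NAME)" wording shown in this section, which may differ between kernel versions.

```bash
# Hypothetical helper: print "PID NAME" for each OOM-killer victim.
# Input: kernel log text on stdin (e.g. from dmesg or journalctl -k).
oom_victims() {
    sed -n 's/.*Killed process \([0-9][0-9]*\) (\([^)]*\)).*/\1 \2/p'
}

echo 'Out of memory: Killed process 1234 (java) total-vm:4096kB, anon-rss:2048kB' | oom_victims
# prints: 1234 java
```

Typical use would be `dmesg | oom_victims | sort | uniq -c` to see which services are killed most often.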
9. Kernel Debug Interfaces
SysRq
The SysRq key (Magic System Request Key) provides emergency direct-kernel control, potentially responsive even when the system appears frozen. Trigger from scripts via /proc/sysrq-trigger.
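# Enable SysRq (1 = all functions, 438 = a selected subset). This setting governs
# keyboard-triggered SysRq; root writes to /proc/sysrq-trigger are always allowed.
echo 1 > /proc/sys/kernel/sysrq
# Trigger an action (write to /proc/sysrq-trigger)
echo m > /proc/sysrq-trigger # dump memory info to dmesg
echo t > /proc/sysrq-trigger # dump the state of all threads
echo b > /proc/sysrq-trigger # reboot immediately (no cleanup at all)!
echo s > /proc/sysrq-trigger # sync all filesystems
echo u > /proc/sysrq-trigger # remount all filesystems read-only
# Safer reboot (sync + remount read-only + reboot): issue s, u, b in order
echo s > /proc/sysrq-trigger; sync
echo u > /proc/sysrq-trigger
echo b > /proc/sysrq-trigger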
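The selective value 438 mentioned above is a bitmask. The sketch below tests individual bits; sysrq_allows is a hypothetical helper, and the bit meanings in the comment follow the kernel's SysRq documentation as best recalled here, so verify them against your kernel version.

```bash
# Hypothetical helper: does a kernel.sysrq value permit a given function class?
# $1 = value of /proc/sys/kernel/sysrq, $2 = bit for the function class:
#   2 console log level, 4 keyboard control, 8 process/debug dumps, 16 sync,
#   32 remount read-only, 64 signalling processes, 128 reboot/poweroff, 256 RT nicing
sysrq_allows() (
    mask=$1
    bit=$2
    # 1 means "everything enabled"; otherwise test the individual bit
    [ "$mask" -eq 1 ] || [ $(( $mask & $bit )) -ne 0 ]
)

for bit in 8 16 32 128; do
    if sysrq_allows 438 "$bit"; then
        echo "438 allows $bit"
    else
        echo "438 blocks $bit"
    fi
done
```

With 438, sync (16), remount (32) and reboot (128) are permitted from the keyboard, while process dumps (8) are not.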
ftrace
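ftrace is the kernel's built-in function tracer, exposed through the tracing directory under debugfs (newer kernels also mount it at /sys/kernel/tracing). It can trace most kernel functions, and its overhead stays low as long as the set of traced functions is kept small.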
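# Mount debugfs (usually mounted automatically)
mount -t debugfs none /sys/kernel/debug
cd /sys/kernel/debug/tracing
# List available tracers
cat available_tracers
# blk function function_graph wakeup nop
# Use the function graph tracer (traces the call tree)
echo function_graph > current_tracer
# Trace only a specific function (the name must appear in available_filter_functions;
# newer kernels may expose do_sys_openat2 instead of do_sys_open)
echo do_sys_open > set_ftrace_filter
# Start tracing
echo 1 > tracing_on
# View the results
cat trace | head -50
# Stop tracing
echo 0 > tracing_on
echo nop > current_tracer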
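Writing a name to set_ftrace_filter fails if the kernel does not list that function as traceable, so it helps to check first. A minimal sketch: is_traceable is a hypothetical helper that looks the name up in available_filter_functions, assuming lines of the form "name" or "name [module]".

```bash
# Hypothetical helper: check whether ftrace can filter on a function name.
# $1 = function name; $2 = list file, defaulting to the debugfs path used above.
is_traceable() (
    func=$1
    list=${2:-/sys/kernel/debug/tracing/available_filter_functions}
    # Compare the first field only, so "name [module]" lines also match
    awk -v f="$func" '$1 == f { found = 1; exit } END { exit !found }' "$list"
)
```

A guarded setup could then read `is_traceable do_sys_open && echo do_sys_open > set_ftrace_filter`, falling back to another function name when the check fails.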
10. Practice: Investigating a File Descriptor Leak
Production scenario: a web service starts throwing "Too many open files" errors after running for a while, and HTTP requests begin failing. Here is the complete investigation chain:
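## Step 1: Confirm the symptom and find the process PID
systemctl status mywebapp
# Note the PID; assume it is 2341
## Step 2: Check the process's current fd count
ls /proc/2341/fd | wc -l
# Output: 65530 → close to the 65536 limit
## Step 3: Check the process's fd limit (soft and hard)
cat /proc/2341/limits | grep "open files"
# Max open files 65536 65536 files
## Step 4: Break down the fds by type
ls -la /proc/2341/fd | awk '{print $NF}' | \
sed 's|/proc.*||' | sort | uniq -c | sort -rn | head -20
# Finds a large number of /tmp/upload-XXXXXX temporary file entries
## Step 5: Confirm with strace whether files are opened but never closed
strace -e trace=openat,close -c -p 2341
# Summary mode output:
# calls: openat=1000, close=10 → open/close badly out of balance
## Step 6: Use lsof to see the specific leaked files
lsof -p 2341 | grep "/tmp/upload" | head -20
# Finds many upload temp files whose fds were never closed
## Step 7: Check system-wide fd usage
cat /proc/sys/fs/file-nr
# allocated / unused / maximum: 800000 0 1048576
## Step 8: Temporary mitigation: raise the limit (does not fix the root cause; only buys time)
# /etc/security/limits.conf, or LimitNOFILE in the systemd service
# (for a systemd-managed service, LimitNOFILE is the setting that takes effect)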
cat >> /etc/security/limits.d/mywebapp.conf
## Follow-up: fix the leak in the code, then alert when the fd count exceeds 50000
Investigation Summary: The file descriptor leak investigation path is: /proc/PID/fd count → limits check → strace -c unbalanced open/close → lsof file identification → /proc/sys/fs/file-nr system view → code fix → monitoring alert. Core tool combination: /proc + strace + lsof + sysctl, covering the full chain from kernel to code.
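The fd breakdown from step 4 groups poorly when every leaked file has a unique name. The sketch below groups regular files by directory and sockets, pipes, and anonymous inodes by kind. fd_histogram is a hypothetical helper, not the pipeline used in the investigation above.

```bash
# Hypothetical helper: histogram of what the fds in a directory point to.
# $1 = an fd directory such as /proc/2341/fd.
# Sockets, pipes and anon inodes are grouped by kind; regular files by directory.
fd_histogram() (
    for fd in "$1"/*; do
        target=$(readlink "$fd") || continue
        case $target in
            socket:*)     echo socket ;;
            pipe:*)       echo pipe ;;
            anon_inode:*) echo anon_inode ;;
            /*)           echo "${target%/*}/" ;;
            *)            echo other ;;
        esac
    done | sort | uniq -c | sort -rn
)
```

Running `fd_histogram /proc/2341/fd` in the scenario above would be expected to put /tmp/ at the top of the list, pointing at the upload temp files. The same count can feed a monitoring check against the chosen alert threshold.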