Chapter 17: Syscalls and Kernel Interface

System calls are the only legitimate boundary between user programs and the kernel: every open(), read(), and write() involves a privilege-level switch. This chapter starts from the x86_64 ABI, then covers strace call-chain analysis, the /proc virtual filesystem, sysctl kernel parameter tuning, the /sys device tree, kernel module management, and dmesg log interpretation, all tied together by a complete production "file descriptor leak" investigation.

1. Syscall Internals

User Mode vs Kernel Mode

Modern CPUs isolate user code from kernel code with privilege levels, which x86 calls rings. User programs run in Ring 3 (least privileged); the kernel runs in Ring 0 (most privileged). Ring 3 code cannot directly access hardware or other processes' memory; it must trap into the kernel via a system call. This mechanism is both a security boundary and a performance cost: a trivial syscall typically costs on the order of 100-300 ns for the mode switch and register save/restore, and more when mitigations such as kernel page-table isolation add a page-table switch on every entry and exit. A syscall is a mode switch, not a full context switch to another process.

Syscall Trigger Mechanisms

| Mechanism | When Used | Performance |
| --- | --- | --- |
| int 0x80 | Legacy 32-bit x86; traps via a software interrupt | Slower (~1 µs) |
| syscall | x86_64 fast path; the kernel entry point is read from MSRs | Fast (~100-300 ns) |
| vDSO | The kernel maps some calls (e.g. gettimeofday) into user address space, so no kernel entry is needed | Extremely fast (nanoseconds) |

x86_64 Syscall ABI Convention

On x86_64 Linux the syscall number goes in rax and up to six arguments go in rdi, rsi, rdx, r10, r8, and r9. The return value comes back in rax (a negative errno value on failure), and the syscall instruction itself clobbers rcx and r11. Syscall numbers are defined in /usr/include/asm/unistd_64.h (Debian and Ubuntu keep the file under /usr/include/x86_64-linux-gnu/asm/):

# List syscall numbers (x86_64)
grep "^#define __NR_" /usr/include/asm/unistd_64.h | head -20

# Query with ausyscall (provided by the audit package)
ausyscall --dump | grep -E "^(0|1|2|3|60)[[:space:]]"

# Total number of syscalls
grep -c "^#define __NR_" /usr/include/asm/unistd_64.h

# Show the vDSO mapping (the [vdso] line in /proc/self/maps)
grep vdso /proc/self/maps
# 7fff12345000-7fff12346000 r-xp 00000000 00:00 0  [vdso]
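
When a trace shows only a syscall number and ausyscall is not installed, the header itself can serve as the lookup table. The following is a minimal sketch: the helper name syscall_name and its optional second argument (the header path) are illustrative assumptions, and the default path must be adjusted on distributions that store the header elsewhere.

```bash
# Print the syscall name for a number by scanning "#define __NR_<name> <nr>" lines.
# syscall_name is a hypothetical helper; pass the header path as $2 if yours differs.
syscall_name() {
  nr=$1
  header=${2:-/usr/include/asm/unistd_64.h}
  awk -v nr="$nr" '
    $1 == "#define" && $2 ~ /^__NR_/ && $3 == nr {
      sub(/^__NR_/, "", $2)   # strip the prefix, leaving the bare name
      print $2
    }' "$header"
}
```

For example, `syscall_name 60` looks up the entry for number 60, which is exit in the x86_64 table.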

2. strace Deep Dive

strace uses ptrace(2) to stop the target process at every syscall entry and exit, recording arguments and return values. Warning: each stop costs extra context switches, so strace can slow a syscall-heavy process by 10-100x. In production, prefer -c summary mode, keep tracing sessions short, or use bpftrace instead.

Common Options Quick Reference

| Option | Meaning |
| --- | --- |
| -p PID | Attach to a running process |
| -tt | Prefix each line with the wall-clock time, including microseconds |
| -T | Show time spent in each call at end of line |
| -e trace=TYPE | Filter by category (file/network/process/signal/ipc/desc) |
| -e trace=open,read | Trace only specific call names |
| -c | Summary mode: count calls, errors, and total time per syscall (less overhead than full tracing) |
| -f | Follow child processes created by fork/clone |
| -o FILE | Write output to file instead of stderr |
| -s 256 | String truncation length (default 32; increase to see full paths) |
| -y | Annotate fd numbers with the paths they refer to |

# Basic: trace every syscall made by a newly started command
strace ls /tmp

# Attach to an existing process (Ctrl+C detaches; the process keeps running)
strace -p 1234

# Timestamps plus per-call duration (for diagnosing slow startup)
strace -tt -T -p 1234 2>&1 | head -50

# File-related calls only (those that take a path: open/stat/chmod/unlink ...)
strace -e trace=file -p 1234

# Network calls only (connect/bind/accept/send/recv)
strace -e trace=network -p 1234

# Summary mode: recommended for production troubleshooting
strace -c -p 1234
# Sample output:
# % time     seconds  usecs/call     calls    errors syscall
# -------  ----------- ----------- --------- --------- ----------------
#  52.13    0.002341          23       100         0 epoll_wait
#  21.30    0.000957           9       100         0 read
#   8.44    0.000379           7        54        12 openat
#   7.23    0.000325           3        90         0 write

# Follow child processes (-f); with -ff, -o writes one file per PID (FILE.PID)
strace -f -o /tmp/strace.out nginx

# Diagnose slow startup: count stat-family calls and sum their durations
# (-T appends <seconds> to each line; strip the angle brackets before summing)
strace -tt -T -e trace=stat,statx,lstat ./slow_app 2>&1 | \
  awk '/<[0-9.]+>$/ { t = $NF; gsub(/[<>]/, "", t); sum += t; n++ }
       END { print n + 0, "calls, total:", sum + 0, "s" }'

# Diagnose file permission problems: look for EACCES/ENOENT from openat
strace -e trace=openat -s 256 -y ./myapp 2>&1 | grep -E "EACCES|ENOENT"

# Diagnose failed network connections
strace -e trace=connect -p 1234 2>&1 | grep -E "ECONNREFUSED|ETIMEDOUT"
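
The summary table printed by strace -c is easy to post-process. As a rough sketch, the filter below keeps only syscalls that reported errors; the function name strace_errors is an assumption for illustration, and it expects the six-column layout shown in the sample output above (real strace leaves the errors column blank when it is zero, and such rows are simply skipped).

```bash
# Read `strace -c` summary text on stdin and print "<syscall> <errors>/<calls> failed"
# for every row with a non-zero errors column. strace_errors is a hypothetical helper.
strace_errors() {
  awk '
    # Data rows start with a percentage; header, separator and total rows are skipped.
    $1 ~ /^[0-9.]+$/ && NF == 6 && $5 > 0 && $6 != "total" {
      printf "%s %d/%d failed\n", $6, $5, $4
    }'
}
```

Typical use would be `strace -c -p 1234 2>&1 | strace_errors`, since strace writes the summary to stderr when it detaches.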

3. ltrace: Library Call Tracing

ltrace intercepts calls into shared libraries by placing breakpoints on PLT (Procedure Linkage Table) entries. Like strace it is built on ptrace, but what it reports are user-space library calls rather than kernel entries. It complements strace for analyzing malloc/free memory allocation, libc string operations, and other user-space behavior.

# Trace dynamic library calls
ltrace ./myapp

# Attach to a running process
ltrace -p 1234

# Trace memory allocation only (for hunting memory leaks)
# Current ltrace joins names with "+"; very old versions used commas
ltrace -e malloc+free+realloc+calloc -p 1234

# Count calls and time spent
ltrace -c ./myapp

# Show system calls as well (-S)
ltrace -S ./myapp

# Show a program's shared library dependencies
ldd /usr/bin/curl

# Key differences between strace and ltrace
# strace: kernel-side system calls (read, write, open, fork ...)
# ltrace: user-space library functions (printf, malloc, strcmp, fopen ...)
# Suggested workflow: strace first for the kernel layer, then ltrace for the library layer

4. /proc Virtual Filesystem

/proc is a virtual filesystem through which the kernel exports runtime state. Most files are read-only; some, notably under /proc/sys/, are writable. Its contents are generated on demand and use no disk space. Each process has a /proc/PID/ directory; system-wide information lives in top-level files such as /proc/meminfo, and tunable parameters live under /proc/sys/.

Per-Process Directory /proc/PID/

| Path | Contents | Common Use |
| --- | --- | --- |
| cmdline | Full command line (NUL-separated) | cat /proc/1234/cmdline |
| maps | Memory map regions (address/perms/file) | See loaded .so libraries |
| smaps | Detailed RSS/PSS/Shared stats per segment | Precise memory usage analysis |
| fd/ | All open fds (symlinks to actual files) | ls -la /proc/1234/fd |
| fdinfo/ | Offset and flags for each fd | Debug fd state |
| status | Process status summary (VmRSS/Threads/State/PPid) | grep VmRSS /proc/1234/status |
| stat | Machine-readable process stats (source for ps) | Script-parse process data |
| environ | Environment variables at launch (NUL-separated) | cat /proc/1234/environ |
| cgroup | cgroup hierarchy path for this process | Confirm container membership |
| net/tcp | TCP socket table of the process's network namespace (not only its own sockets) | See listening ports |

# Show a process's command line
tr '\0' ' ' < /proc/1234/cmdline; echo

# Count the file descriptors a process has open
ls /proc/1234/fd | wc -l

# Show what each fd points to
ls -la /proc/1234/fd

# Show memory usage (VmRSS = resident physical memory)
grep -E "^(VmRSS|VmSize|VmSwap|Threads)" /proc/1234/status

# smaps_rollup: quick totals (kernel 4.14+)
cat /proc/1234/smaps_rollup

# Show environment variables
tr '\0' '\n' < /proc/1234/environ | grep PATH

# /proc/self resolves to whichever process opens it (here, head itself)
head -10 /proc/self/status

# List all loaded shared libraries via maps
grep "\.so" /proc/1234/maps | awk '{print $6}' | sort -u
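
The symlink targets under /proc/PID/fd follow a few recognizable patterns (socket:[inode], pipe:[inode], anon_inode:..., or an absolute path), which makes a per-type count straightforward. The sketch below is illustrative only: classify_fd_target and fd_summary are hypothetical names, and the category labels are my own choice rather than anything the kernel defines.

```bash
# Map one readlink target from /proc/PID/fd to a coarse category.
classify_fd_target() {
  case $1 in
    socket:*)     echo socket ;;
    pipe:*)       echo pipe ;;
    anon_inode:*) echo anon_inode ;;   # epoll, eventfd, timerfd, ...
    /dev/*)       echo device ;;
    /*)           echo file ;;
    *)            echo other ;;
  esac
}

# Count the open fds of process $1 by category, most common first.
# Reading another user's /proc/PID/fd requires root.
fd_summary() {
  for fd in /proc/"$1"/fd/*; do
    target=$(readlink "$fd" 2>/dev/null) || continue
    classify_fd_target "$target"
  done | sort | uniq -c | sort -rn
}
```

Running `fd_summary 1234` gives a quick picture of whether a process is dominated by sockets, pipes, or regular files before digging into individual entries.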

Global /proc Files

# Memory overview
cat /proc/meminfo

# CPU information (model, core count, cache size)
grep -E "model name|cpu cores|cache size" /proc/cpuinfo | head -6

# System load (1/5/15-minute averages, running/total tasks, most recent PID)
cat /proc/loadavg

# TCP connection table (addresses are hexadecimal and need converting)
cat /proc/net/tcp

# Interrupt statistics (per-CPU interrupt counts)
head -20 /proc/interrupts

# Kernel boot parameters
cat /proc/cmdline

# Mount information
cat /proc/mounts

# File handle usage and limit
cat /proc/sys/fs/file-nr   # allocated / unused / max file handles

# /proc/sys is the same interface as sysctl (readable and writable directly)
cat /proc/sys/net/ipv4/ip_forward
echo 1 > /proc/sys/net/ipv4/ip_forward  # temporarily enable IP forwarding
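
The hexadecimal addresses in /proc/net/tcp can be decoded with plain shell arithmetic. This is a small sketch that assumes an IPv4 entry on a little-endian host such as x86_64, where the four address bytes appear in reverse order; hex_to_ip_port is a hypothetical helper name.

```bash
# Convert an "AABBCCDD:PPPP" field from /proc/net/tcp to dotted-quad:port.
# Assumes IPv4 on a little-endian machine (address bytes are stored reversed).
hex_to_ip_port() {
  hex_ip=${1%%:*}
  hex_port=${1##*:}
  o1=$(printf '%s' "$hex_ip" | cut -c7-8)
  o2=$(printf '%s' "$hex_ip" | cut -c5-6)
  o3=$(printf '%s' "$hex_ip" | cut -c3-4)
  o4=$(printf '%s' "$hex_ip" | cut -c1-2)
  printf '%d.%d.%d.%d:%d\n' "0x$o1" "0x$o2" "0x$o3" "0x$o4" "0x$hex_port"
}

hex_to_ip_port 0100007F:1F90   # prints 127.0.0.1:8080
```

To decode every local address in the table, feed it the second column: `awk 'NR > 1 {print $2}' /proc/net/tcp | while read -r a; do hex_to_ip_port "$a"; done`.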

5. sysctl: Kernel Parameter Tuning

sysctl provides a unified interface to read and write kernel parameters under /proc/sys/. Parameters are hierarchical: net.ipv4.tcp_syncookies maps to /proc/sys/net/ipv4/tcp_syncookies. Changes can be temporary (immediate but lost on reboot) or permanent (written to config files).
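
The name-to-path mapping is mechanical enough to express in one line. A minimal sketch, with sysctl_to_path as a hypothetical helper; it assumes no component of the name contains a dot itself (for names such as VLAN interfaces, sysctl also accepts the slash-separated form directly).

```bash
# Translate a dotted sysctl name into its /proc/sys path.
# Assumption: no path component contains a literal dot.
sysctl_to_path() {
  printf '/proc/sys/%s\n' "$(printf '%s' "$1" | tr '.' '/')"
}

sysctl_to_path net.ipv4.tcp_syncookies   # prints /proc/sys/net/ipv4/tcp_syncookies
```

Reading the resulting file with cat returns the same value that sysctl reports for the parameter.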

# Show all parameters
sysctl -a

# Show a single parameter
sysctl net.ipv4.tcp_max_syn_backlog
sysctl vm.swappiness

# Temporary change (reverts to the default after a reboot)
sysctl -w net.ipv4.ip_forward=1
sysctl -w fs.file-max=1048576

# Permanent change (write a config file; /etc/sysctl.d/ is recommended)
cat > /etc/sysctl.d/99-production.conf 
  
6. /sys Device Tree (sysfs)

sysfs is mounted at /sys and exports kernel objects (devices, drivers, buses) to user space as a directory tree. Compared to /proc, sysfs enforces a stricter "one directory per object, one file per attribute" structure and is the foundation for udev rules and hardware management.

# Top-level layout
ls /sys/
# block  bus  class  dev  devices  firmware  fs  kernel  module  power

# List all block devices
ls /sys/block/

# Information about disk sda
cat /sys/block/sda/size          # size in 512-byte sectors
cat /sys/block/sda/queue/rotational  # 0 = non-rotational (SSD), 1 = rotational (HDD)
cat /sys/block/sda/queue/scheduler   # I/O scheduler
cat /sys/block/sda/device/model      # drive model

# Network interface information
cat /sys/class/net/eth0/speed    # link speed (Mb/s)
cat /sys/class/net/eth0/operstate # up/down
cat /sys/class/net/eth0/statistics/rx_bytes  # bytes received

# CPU frequency and power saving
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

# Change the CPU governor (requires root)
echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

# Change the I/O scheduler (none or mq-deadline is recommended for SSDs)
echo mq-deadline > /sys/block/nvme0n1/queue/scheduler

# udev rules (automatic driver loading, device naming, etc.)
ls /etc/udev/rules.d/
# udevadm info shows device attributes (useful when writing rules)
udevadm info --query=all --name=/dev/sda | head -20

# debugfs (requires root; mounts the debugging interface)
mount -t debugfs none /sys/kernel/debug
ls /sys/kernel/debug/tracing/   # ftrace interface
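
The size attribute is always expressed in 512-byte units, regardless of the device's logical block size, so turning it into bytes is a single multiplication. A short sketch with hypothetical helper names (sectors_to_bytes, disk_size_gib):

```bash
# Convert a sysfs sector count (512-byte units) to bytes.
sectors_to_bytes() {
  echo $(( $1 * 512 ))
}

# Report the size of block device $1 (e.g. sda) in whole GiB.
disk_size_gib() {
  sectors=$(cat "/sys/block/$1/size") || return 1
  echo $(( sectors * 512 / 1024 / 1024 / 1024 ))
}

sectors_to_bytes 1953525168   # prints 1000204886016 (a nominal 1 TB drive)
```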

7. Kernel Modules

Kernel modules (.ko files) are pieces of kernel code that can be loaded and unloaded at runtime, with no kernel recompilation or reboot required. Many drivers, filesystems, and network protocols are shipped as modules. Loaded modules run in Ring 0 with full kernel privileges, so a bug in a module can oops or panic the whole system.
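
Scripts often need to know whether a module is already loaded before acting. The list that lsmod prints comes from /proc/modules, whose first field is the module name, so a check can be sketched as below. module_loaded is a hypothetical helper; the optional second argument exists only so the function can be pointed at a different file, and built-in (non-modular) code will not appear in this list.

```bash
# Return 0 if module $1 appears in /proc/modules (or in the file given as $2).
# The kernel reports names with underscores, so hyphens are normalized first.
module_loaded() {
  name=$(printf '%s' "$1" | sed 's/-/_/g')
  awk -v m="$name" '$1 == m { found = 1 } END { exit !found }' "${2:-/proc/modules}"
}
```

A typical use is `module_loaded nf_conntrack || modprobe nf_conntrack`.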

Module Management Commands

# List loaded modules (name / size / use count / used by)
lsmod

# Show module details (version / parameters / dependencies / description)
modinfo ext4
modinfo nvidia

# Load a module together with all its dependencies (recommended)
modprobe nf_conntrack

# Load a module and pass parameters
modprobe usbhid quirks=0x1234:0x5678:0x0004

# Unload a module (only succeeds if nothing is using it)
modprobe -r nf_conntrack

# Load a .ko file directly (no dependency resolution; for debugging only)
insmod /path/to/mymodule.ko

# Unload directly (unlike modprobe -r, leaves now-unused dependencies loaded)
rmmod mymodule

# Show module parameters and their current values
ls /sys/module/nf_conntrack/parameters/
grep -H . /sys/module/nf_conntrack/parameters/*

# Modules to load automatically at boot
cat /etc/modules-load.d/modules.conf
echo "nf_conntrack" >> /etc/modules-load.d/custom.conf

# Persistent module parameter configuration
cat > /etc/modprobe.d/custom.conf 
  
8. dmesg: Kernel Log

dmesg reads the kernel's ring buffer, which holds hardware detection, driver messages, kernel warnings, and errors. The buffer has a fixed size, so old messages are eventually overwritten; persistent copies are written to /var/log/kern.log (rsyslog) or can be read with journalctl -k (systemd).

# Show the full kernel log (human-readable timestamps)
dmesg -T

# Warnings and errors only
dmesg -T -l warn,err

# A specific facility only (kern = kernel, user = user space, daemon = daemons)
dmesg -T -f kern

# Follow in real time (like tail -f; kernel 3.5+)
dmesg --follow

# Filter by keyword
dmesg -T | grep -i "oom\|killed\|out of memory"
dmesg -T | grep -i "error\|fail\|hardware"
dmesg -T | grep -i "sda\|nvme\|I/O error"

# Clear the ring buffer (use with care; requires root)
dmesg -C

# Reading common kernel errors:

# OOM killer (out of memory; the kernel kills a process)
# Out of memory: Killed process 1234 (java) total-vm:4096kB, anon-rss:2048kB
# → Fix: add memory, tune vm.swappiness, or cap the process's memory usage

# Hardware error (bad disk blocks)
# end_request: I/O error, dev sda, sector 1234567
# blk_update_request: I/O error, dev sda, sector 1234567 op 0x0:(READ)
# → Fix: check SMART status with smartctl -a /dev/sda; back up and migrate data promptly

# Network device dropping packets
# eth0: Dropped oversize packet
# eth0: RX ring buffer not full
# → Fix: increase /proc/sys/net/core/netdev_max_backlog

# Filesystem error
# EXT4-fs error (device sda1): ext4_find_entry:1455: inode #2: comm bash
# → Fix: unmount, then run fsck -y /dev/sda1

# System log (journalctl)
journalctl -k           # kernel log for the current boot
journalctl -k -b -1     # kernel log for the previous boot
journalctl -k -p err    # error priority and above only
journalctl -k --since "1 hour ago"  # kernel log for the last hour
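
OOM-killer lines like the sample above can be reduced to a list of victims with a single sed expression. This is only a sketch: oom_victims is a hypothetical name, and the pattern assumes the "Out of memory: Killed process PID (name)" wording shown in this chapter, which differs between kernel versions.

```bash
# Read kernel log text on stdin and print "<pid> <command>" for each OOM kill.
# Assumes the message wording shown above; adjust the pattern for other kernels.
oom_victims() {
  sed -n 's/.*Out of memory: Killed process \([0-9][0-9]*\) (\([^)]*\)).*/\1 \2/p'
}
```

Either `dmesg -T | oom_victims` or `journalctl -k | oom_victims` can supply the input.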

9. Kernel Debug Interfaces

SysRq

The magic SysRq key (Magic System Request Key) gives emergency, direct control of the kernel and often still responds when the system appears frozen. From scripts, trigger the same actions by writing to /proc/sysrq-trigger.

# Enable SysRq (1 = enable all functions, 438 = enable a selected subset)
echo 1 > /proc/sys/kernel/sysrq

# Trigger an action (write to /proc/sysrq-trigger)
echo m > /proc/sysrq-trigger    # print memory information to dmesg
echo t > /proc/sysrq-trigger    # print the state of all threads
echo b > /proc/sysrq-trigger    # reboot immediately, with no cleanup at all!
echo s > /proc/sysrq-trigger    # sync all filesystems
echo u > /proc/sysrq-trigger    # remount all filesystems read-only

# Safer emergency reboot (sync, remount read-only, reboot): trigger s, u, b in order
echo s > /proc/sysrq-trigger; sync
echo u > /proc/sysrq-trigger
echo b > /proc/sysrq-trigger
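
Values of kernel.sysrq greater than 1 are a bitmask, which is why 438 enables only some functions. The decoder below is a sketch based on the bit meanings listed in the kernel's SysRq documentation; the function name sysrq_decode and the exact label wording are my own.

```bash
# Print which SysRq function groups a kernel.sysrq value enables.
sysrq_decode() {
  mask=$1
  if [ "$mask" -eq 0 ]; then echo "disabled"; return 0; fi
  if [ "$mask" -eq 1 ]; then echo "all functions enabled"; return 0; fi
  for entry in \
    "2:console log level control" \
    "4:keyboard control (SAK, unraw)" \
    "8:process debugging dumps" \
    "16:sync" \
    "32:remount read-only" \
    "64:process signalling (term, kill, oom-kill)" \
    "128:reboot/poweroff" \
    "256:nicing of real-time tasks"
  do
    bit=${entry%%:*}
    name=${entry#*:}
    if [ $(( mask & bit )) -ne 0 ]; then echo "$name"; fi
  done
}
```

`sysrq_decode 438` lists six groups and omits the two most invasive ones, process debugging dumps (8) and process signalling (64). The mask restricts the keyboard combination; writes to /proc/sysrq-trigger by root are always honored.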

ftrace

ftrace is the kernel's built-in function tracer, exposed through tracefs (mounted at /sys/kernel/tracing on current kernels and also reachable at /sys/kernel/debug/tracing). It can trace most kernel functions. Overhead is negligible while tracing is off, but tracing many functions at once is noticeably expensive, so filter narrowly.

# Mount debugfs (usually already mounted)
mount -t debugfs none /sys/kernel/debug

cd /sys/kernel/debug/tracing

# List available tracers
cat available_tracers
# blk function function_graph wakeup nop

# Use the function graph tracer (traces the call tree)
echo function_graph > current_tracer

# Trace only a specific function (it must appear in available_filter_functions;
# newer kernels name this one do_sys_openat2)
echo do_sys_open > set_ftrace_filter

# Start tracing
echo 1 > tracing_on

# View the results
head -50 trace

# Stop tracing
echo 0 > tracing_on
echo nop > current_tracer
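
Because function names change between kernel versions, it helps to check that a function is traceable before writing the filter. The wrapper below is a sketch of the same sequence with that check added; ftrace_start is a hypothetical name, and the tracing directory is a parameter so the path can be switched to /sys/kernel/tracing where that is the mount point.

```bash
# Start function_graph tracing for one kernel function.
# $1 = function name, $2 = tracing directory (defaults to the debugfs location).
ftrace_start() {
  func=$1
  dir=${2:-/sys/kernel/debug/tracing}
  # available_filter_functions lists one traceable function per line
  # (module functions carry a trailing "[module]" tag, hence the $1 comparison).
  if ! awk -v f="$func" '$1 == f { ok = 1 } END { exit !ok }' \
       "$dir/available_filter_functions"; then
    echo "not traceable on this kernel: $func" >&2
    return 1
  fi
  echo "$func" > "$dir/set_ftrace_filter" &&
  echo function_graph > "$dir/current_tracer" &&
  echo 1 > "$dir/tracing_on"
}
```

After `ftrace_start do_sys_openat2`, read the trace file as before and stop with the same two commands shown above.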

10. Practice: Investigating a File Descriptor Leak

Production scenario: a web service starts throwing "Too many open files" errors after running for a while, and HTTP requests begin failing. Here is the complete investigation chain:

## Step 1: confirm the symptom and find the process PID
systemctl status mywebapp
# Note the PID; assume it is 2341

## Step 2: count the process's open fds
ls /proc/2341/fd | wc -l
# Output: 65530  → close to the process's limit of 65536

## Step 3: check the process's fd limit (soft and hard)
grep "open files" /proc/2341/limits
# Max open files         65536      65536       files

## Step 4: analyze the distribution of fd types
## (collapse socket/pipe inodes and group files by directory so uniq -c can count them)
ls -l /proc/2341/fd | awk '/->/ {print $NF}' | \
  sed -E 's/^(socket|pipe|anon_inode):.*/\1/; s|/[^/]*$|/|' | \
  sort | uniq -c | sort -rn | head -20
# Finding: most entries point into /tmp/ (upload-XXXXXX temporary files)

## Step 5: confirm with strace whether files are opened but never closed
strace -e trace=openat,close -c -p 2341
# Summary output:
# calls: openat=1000, close=10 → open/close badly unbalanced

## Step 6: use lsof to identify the leaked files
lsof -p 2341 | grep "/tmp/upload" | head -20
# Finding: many upload temp-file fds were never closed

## Step 7: check system-wide fd usage
cat /proc/sys/fs/file-nr
# allocated / unused / max: 800000 0 1048576

## Step 8: temporary mitigation: raise the limit (buys time; does not fix the root cause)
# /etc/security/limits.conf, or LimitNOFILE in the systemd service unit
cat >> /etc/security/limits.d/mywebapp.conf
# ... 50000 alert

Investigation Summary: the path for a file descriptor leak is /proc/PID/fd count → limits check → strace -c showing unbalanced open/close → lsof file identification → /proc/sys/fs/file-nr system-wide view → code fix → monitoring alert. The core tool combination of /proc + strace + lsof + sysctl covers the full chain from kernel to code.
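
The monitoring-alert step at the end of that chain can be as small as a threshold check run from cron or a health probe. The sketch below is an assumption-laden illustration: fd_count and fd_alert are hypothetical names, and the default threshold of 50000 simply mirrors the figure mentioned in the investigation.

```bash
# Number of fds currently open in process $1 (needs access to /proc/PID/fd).
fd_count() {
  ls "/proc/$1/fd" 2>/dev/null | wc -l
}

# Compare an fd count against a threshold; non-zero exit status means "alert".
# $1 = current count, $2 = threshold (assumed default: 50000).
fd_alert() {
  used=$1
  threshold=${2:-50000}
  if [ "$used" -ge "$threshold" ]; then
    echo "ALERT: $used open fds (threshold $threshold)"
    return 1
  fi
  echo "OK: $used open fds (threshold $threshold)"
}
```

For the service in this scenario the check would be `fd_alert "$(fd_count 2341)"`, with the exit status wired to whatever alerting system is in use.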
