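# Chapter 20: Capstone Project: A Production-Grade Ops Script System
This chapter combines the skills from the whole book into one complete, production-grade operations script system. It covers service health monitoring, resource alerting, log analysis, automated deployment, backup and restore, webhook notifications, and systemd integration, with bats tests alongside. Each part of the code draws on a technique from an earlier chapter. By the end you will have a set of scripts that can serve as the base of your production operations tooling once you have adapted and tested them in your own environment.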
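## 1. Project overview and architecture
The system follows four core design principles: single responsibility (each script does one job), observability (one shared log format), idempotency (running a script again causes no additional side effects), and testability (the key logic is covered by bats tests).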
| Module | Purpose | Related chapters |
|---|---|---|
| lib/common.sh | Shared function library | Ch9, Ch10, Ch11 |
| monitor/health_check.sh | HTTP/TCP/process monitoring | Ch7, Ch5 |
| monitor/resource_alert.sh | CPU/memory/disk alerts | Ch14, Ch8 |
| monitor/log_analyzer.sh | Log keyword scanning | Ch4, Ch11 |
| alert/webhook.sh | DingTalk/Feishu alert notifications | Ch7, Ch12 |
| deploy/rolling_deploy.sh | Rolling deployment and rollback | Ch5, Ch12 |
| backup/backup.sh | rsync backup with encryption | Ch8, Ch15 |
| systemd/ | systemd service and timer units | Ch13 |
| tests/ | bats automated tests | Ch12 |
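Of the four principles, idempotency is the one shell scripts break most easily, for example by appending the same line to a file on every run. The following is a minimal sketch of an idempotent helper. The function name `ensure_line` and the sample setting are illustrative and are not part of the project.

```shell
#!/usr/bin/env bash
# ensure_line FILE LINE: append LINE to FILE only if an identical line is
# not already present. Hypothetical helper used to illustrate idempotency.
ensure_line() {
    local file="$1" line="$2"
    touch "$file"
    # -x: match the whole line, -F: fixed string, -q: no output
    if ! grep -qxF -- "$line" "$file"; then
        printf '%s\n' "$line" >> "$file"
    fi
}

demo_file="$(mktemp)"
ensure_line "$demo_file" "vm.swappiness=10"
ensure_line "$demo_file" "vm.swappiness=10"   # second run changes nothing
grep -c . "$demo_file"                        # number of non-empty lines
```

Running the helper twice leaves a single line in the file. The same check-before-change pattern applies to creating users, directories, or symlinks.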
## 2. Project directory layout
```
ops-scripts/
├── lib/
│   └── common.sh            # shared function library
├── monitor/
│   ├── health_check.sh      # HTTP/TCP/process checks
│   ├── resource_alert.sh    # CPU/memory/disk threshold alerts
│   └── log_analyzer.sh      # log keyword analysis
├── alert/
│   └── webhook.sh           # DingTalk/Feishu/email notifications
├── deploy/
│   └── rolling_deploy.sh    # rolling restart and rollback
├── backup/
│   ├── backup.sh            # rsync snapshot backup with encryption
│   └── restore.sh           # restore script
├── systemd/
│   ├── ops-monitor.service
│   ├── ops-monitor.timer
│   └── ops-backup.service
├── tests/
│   ├── test_common.bats
│   ├── test_webhook.bats
│   └── test_backup.bats
├── config.env               # configuration file (not committed to git)
└── config.env.example       # configuration template
```
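## 3. Shared function library (lib/common.sh)
Every script pulls in the shared library with `source "$SCRIPT_DIR/../lib/common.sh"`, where `SCRIPT_DIR` is resolved from `${BASH_SOURCE[0]}` so the path holds no matter which directory the script is called from. The library provides a uniform log format, dependency checks, a mutual-exclusion lock, an HTTP helper, and config loading.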
```bash
#!/usr/bin/env bash
# lib/common.sh: shared function library
set -euo pipefail

# ── Color constants (enabled only when stderr is a tty) ──
# $'...' turns \033 into a real escape character
if [[ -t 2 ]]; then
    RED=$'\033[0;31m'; YELLOW=$'\033[1;33m'
    GREEN=$'\033[0;32m'; CYAN=$'\033[0;36m'
    NC=$'\033[0m'
else
    RED=''; YELLOW=''; GREEN=''; CYAN=''; NC=''
fi

# ── Logging ──
LOG_FILE="${LOG_FILE:-/var/log/ops-scripts/ops.log}"

_log() {
    local level="$1" color="$2"; shift 2
    local ts
    ts=$(date '+%Y-%m-%dT%H:%M:%S%z')
    # Plain text goes to the log file, the colored copy to stderr.
    # An unwritable log file must not abort the calling script.
    { printf '%s [%s] %s\n' "$ts" "$level" "$*" >> "$LOG_FILE"; } 2>/dev/null || true
    printf '%s [%s] %s%s%s\n' "$ts" "$level" "$color" "$*" "$NC" >&2
}

log_info()  { _log "INFO " "$CYAN"   "$*"; }
log_warn()  { _log "WARN " "$YELLOW" "$*"; }
log_error() { _log "ERROR" "$RED"    "$*"; }

die() {
    log_error "$*"
    exit 1
}

# ── Dependency check ──
require_cmd() {
    local cmd
    for cmd in "$@"; do
        command -v "$cmd" &>/dev/null || die "Required command not found: $cmd"
    done
}

# ── Mutual-exclusion lock ──
LOCK_DIR="${LOCK_DIR:-/tmp/ops-locks}"
mkdir -p "$LOCK_DIR"

lock_file() {
    local name="$1"
    local lock="$LOCK_DIR/${name}.lock"
    # mkdir is atomic: only one process can create the directory
    if ! mkdir "$lock" 2>/dev/null; then
        local pid
        pid=$(cat "$lock/pid" 2>/dev/null || echo "unknown")
        die "Lock held by PID $pid: $lock"
    fi
    echo $$ > "$lock/pid"
    # Release the lock on exit. INT and TERM are turned into a normal exit
    # so that the EXIT trap runs and the script stops after the signal.
    trap "unlock_file '$name'" EXIT
    trap 'exit 130' INT
    trap 'exit 143' TERM
}

unlock_file() {
    local name="$1"
    rm -rf "$LOCK_DIR/${name}.lock"
}

# ── HTTP helper ──
# http_post URL JSON_BODY: send a POST request and print the HTTP status code
http_post() {
    local url="$1"
    local body="$2"
    curl -s -o /dev/null -w "%{http_code}" \
        -H 'Content-Type: application/json' \
        --data "$body" \
        --max-time 10 \
        "$url"
}

# ── Config loading ──
load_config() {
    local cfg="${1:-config.env}"
    [[ -f "$cfg" ]] || die "Config file not found: $cfg"
    # Safe loading: only KEY=VALUE lines are accepted. Comments and blank
    # lines fail the key pattern and are skipped; the file is never executed.
    local key value
    while IFS='=' read -r key value || [[ -n "$key" ]]; do
        [[ "$key" =~ ^[A-Za-z_][A-Za-z0-9_]*$ ]] || continue
        # Strip one pair of surrounding double quotes, as in KEY="value"
        if [[ "$value" == \"*\" ]]; then
            value="${value#\"}"; value="${value%\"}"
        fi
        export "$key=$value"
    done < "$cfg"
}
```
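The lock in the library refuses to run whenever the lock directory exists, even if the process that created it was killed before its trap could run (for example by `kill -9` or a power loss). One possible extension is stale-lock recovery. The sketch below is not part of `lib/common.sh`; `try_lock` and `DEMO_LOCK_DIR` are names invented for the example.

```shell
#!/usr/bin/env bash
# Sketch: mkdir-based lock with stale-lock recovery.
# try_lock and DEMO_LOCK_DIR are illustrative names, not part of lib/common.sh.
DEMO_LOCK_DIR="$(mktemp -d)"

# try_lock NAME: return 0 if the lock was acquired, 1 if a live process holds it
try_lock() {
    local lock="$DEMO_LOCK_DIR/$1.lock" pid
    if mkdir "$lock" 2>/dev/null; then
        echo "$$" > "$lock/pid"
        return 0
    fi
    pid=$(cat "$lock/pid" 2>/dev/null || true)
    # kill -0 sends no signal; it only tests whether the PID can be signaled
    if [[ "$pid" =~ ^[0-9]+$ ]] && kill -0 "$pid" 2>/dev/null; then
        return 1                      # holder is alive: the lock is valid
    fi
    # Holder is gone: reclaim the stale lock and try once more
    rm -rf "$lock"
    mkdir "$lock" 2>/dev/null || return 1
    echo "$$" > "$lock/pid"
}

try_lock demo && echo "acquired"
try_lock demo || echo "busy (held by live PID $(cat "$DEMO_LOCK_DIR/demo.lock/pid"))"
```

Two caveats apply. `kill -0` also fails for a live process owned by another user, so this check suits scripts that always run under the same account. There is also a small window between `rm -rf` and the second `mkdir` in which two reclaiming processes can race; the second `mkdir` still lets only one of them win.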
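## 4. Service health monitoring (monitor/health_check.sh)
The health check script supports three kinds of checks: HTTP status code, TCP port reachability, and process liveness. Targets come from `config.env`, and every failed check calls `alert/webhook.sh` to send an alert.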
```bash
#!/usr/bin/env bash
# monitor/health_check.sh
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
source "$SCRIPT_DIR/../lib/common.sh"
load_config "$SCRIPT_DIR/../config.env"
require_cmd curl nc pgrep

ALERT_SCRIPT="$SCRIPT_DIR/../alert/webhook.sh"
FAIL_COUNT=0

# ── HTTP check ──
check_http() {
    local name="$1" url="$2" expected="${3:-200}"
    local code
    # curl prints the status code through -w even for 4xx/5xx responses and
    # prints 000 when the connection fails, so no -f and no fallback echo:
    # either one would append a second code to the output.
    code=$(curl -s -o /dev/null -w "%{http_code}" \
        --max-time 10 --connect-timeout 5 "$url" 2>/dev/null) || true
    code="${code:-000}"
    if [[ "$code" == "$expected" ]]; then
        log_info "HTTP OK: $name ($url) => $code"
    else
        log_error "HTTP FAIL: $name ($url) expected=$expected got=$code"
        "$ALERT_SCRIPT" "HTTP check failed: $name returned $code (expected $expected)"
        # Not (( FAIL_COUNT++ )): that returns status 1 when the old value
        # is 0, which aborts the script under set -e
        FAIL_COUNT=$(( FAIL_COUNT + 1 ))
    fi
}

# ── TCP check ──
check_tcp() {
    local name="$1" host="$2" port="$3"
    if nc -z -w 3 "$host" "$port" &>/dev/null; then
        log_info "TCP OK: $name ($host:$port)"
    else
        log_error "TCP FAIL: $name ($host:$port) unreachable"
        "$ALERT_SCRIPT" "TCP check failed: $name ($host:$port) is unreachable"
        FAIL_COUNT=$(( FAIL_COUNT + 1 ))
    fi
}

# ── Process check ──
check_process() {
    local name="$1" proc="$2"
    if pgrep -x "$proc" &>/dev/null; then
        log_info "PROC OK: $name ($proc running)"
    else
        log_error "PROC FAIL: $name ($proc not found)"
        "$ALERT_SCRIPT" "Process check failed: $name ($proc) is not running"
        FAIL_COUNT=$(( FAIL_COUNT + 1 ))
    fi
}

# ── Run the checks (target lists are read from config.env) ──
# Format: HTTP_CHECKS="name,url,expected_code name2,url2,200"
if [[ -n "${HTTP_CHECKS:-}" ]]; then
    IFS=' ' read -ra http_list <<< "$HTTP_CHECKS"
    for entry in "${http_list[@]}"; do
        IFS=',' read -r name url expected <<< "$entry"
        check_http "$name" "$url" "${expected:-200}"
    done
fi
# check_tcp and check_process are driven the same way from their own lists
```
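The `HTTP_CHECKS` format noted in the script (space-separated entries, comma-separated fields) extends naturally to the other two check types. The sketch below shows one way to dispatch all three lists through a single loop. `TCP_CHECKS`, `PROC_CHECKS`, and `run_checks` are assumed names, and the check functions are replaced by stubs that only print their arguments, so the example runs without touching the network.

```shell
#!/usr/bin/env bash
# Sketch of a check dispatcher. TCP_CHECKS, PROC_CHECKS, run_checks and the
# stub check functions are assumptions made for this example.
HTTP_CHECKS="web,http://localhost:8080/health,200 api,http://localhost:9000/ping"
TCP_CHECKS="db,127.0.0.1,5432"
PROC_CHECKS="cron,cron"

# Stubs standing in for the real check functions: they only print arguments
check_http()    { echo "http name=$1 url=$2 expected=${3:-200}"; }
check_tcp()     { echo "tcp name=$1 host=$2 port=$3"; }
check_process() { echo "proc name=$1 proc=$2"; }

# run_checks FUNC "entry entry ...": split the list on spaces and each entry
# on commas, then call FUNC with the fields as separate arguments
run_checks() {
    local func="$1" list="$2" entry
    local -a entries fields
    [[ -n "$list" ]] || return 0
    IFS=' ' read -ra entries <<< "$list"
    for entry in "${entries[@]}"; do
        IFS=',' read -ra fields <<< "$entry"
        "$func" "${fields[@]}"
    done
}

run_checks check_http    "${HTTP_CHECKS:-}"
run_checks check_tcp     "${TCP_CHECKS:-}"
run_checks check_process "${PROC_CHECKS:-}"
```

Because entries are split on spaces and commas, names and URLs in these lists cannot contain either character. A list that needs them would call for a different format, such as one target per line.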
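## 5. Resource alerting (monitor/resource_alert.sh)
```bash
#!/usr/bin/env bash
# monitor/resource_alert.sh: CPU, memory, and disk alerts
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
source "$SCRIPT_DIR/../lib/common.sh"
load_config "$SCRIPT_DIR/../config.env"
require_cmd awk df free

ALERT_SCRIPT="$SCRIPT_DIR/../alert/webhook.sh"

# Default thresholds in percent (override them in config.env)
CPU_THRESHOLD="${CPU_THRESHOLD:-85}"
MEM_THRESHOLD="${MEM_THRESHOLD:-90}"
DISK_THRESHOLD="${DISK_THRESHOLD:-85}"

# ── CPU usage ──
# Print "idle total" in clock ticks from the aggregate cpu line of /proc/stat.
# Fields 2-9 are user nice system idle iowait irq softirq steal.
_cpu_snapshot() {
    awk '/^cpu / { t = 0; for (i = 2; i <= 9 && i <= NF; i++) t += $i
                   print $5 + $6, t }' /proc/stat
}

cpu_usage() {
    # Busy share of CPU time between two /proc/stat snapshots taken 1 s apart
    local idle1 idle2 total1 total2
    read -r idle1 total1 < <(_cpu_snapshot)
    sleep 1
    read -r idle2 total2 < <(_cpu_snapshot)
    local dt=$(( total2 - total1 )) di=$(( idle2 - idle1 ))
    if (( dt <= 0 )); then echo 0; return; fi
    echo $(( 100 * (dt - di) / dt ))
}

# ── Memory usage ──
mem_usage() {
    # Share of memory that is not "available" (column 7 of the Mem: line)
    free | awk '/^Mem:/ { printf "%d\n", ($2 - $7) * 100 / $2 }'
}

# ── Disk usage ──
check_disk() {
    local usage mnt
    # One line per real filesystem: "<used percent> <mount point>"
    df -P -x tmpfs -x devtmpfs | awk 'NR > 1 { sub(/%/, "", $5); print $5, $6 }' |
    while read -r usage mnt; do
        if (( usage >= DISK_THRESHOLD )); then
            log_warn "Disk usage ${usage}% on $mnt (threshold: ${DISK_THRESHOLD}%)"
            "$ALERT_SCRIPT" "Disk alert: ${usage}% used on $mnt"
        fi
    done
}

# ── Run the checks ──
cpu=$(cpu_usage)
mem=$(mem_usage)
log_info "Resource snapshot - CPU: ${cpu}% MEM: ${mem}%"

if (( cpu >= CPU_THRESHOLD )); then
    log_warn "CPU usage ${cpu}% exceeds threshold ${CPU_THRESHOLD}%"
    "$ALERT_SCRIPT" "CPU alert: ${cpu}% (threshold: ${CPU_THRESHOLD}%)"
fi
if (( mem >= MEM_THRESHOLD )); then
    log_warn "Memory usage ${mem}% exceeds threshold ${MEM_THRESHOLD}%"
    "$ALERT_SCRIPT" "Memory alert: ${mem}% (threshold: ${MEM_THRESHOLD}%)"
fi
check_disk
```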
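The CPU figure in this script comes from the difference between two `/proc/stat` snapshots rather than from a single reading, because the counters in that file only ever grow. To make the arithmetic concrete, here is a small worked example on made-up sample lines. The function name `cpu_busy_percent` and the numbers are invented for illustration.

```shell
#!/usr/bin/env bash
# cpu_busy_percent SAMPLE1 SAMPLE2: busy share between two "cpu ..." lines
# in /proc/stat format. Function name and sample values are made up.
cpu_busy_percent() {
    local -a a b
    read -ra a <<< "$1"
    read -ra b <<< "$2"
    local i total1=0 total2=0
    # Fields 1..8 after the label: user nice system idle iowait irq softirq steal
    for i in 1 2 3 4 5 6 7 8; do
        total1=$(( total1 + ${a[i]:-0} ))
        total2=$(( total2 + ${b[i]:-0} ))
    done
    local idle1=$(( ${a[4]:-0} + ${a[5]:-0} ))   # idle + iowait
    local idle2=$(( ${b[4]:-0} + ${b[5]:-0} ))
    local dt=$(( total2 - total1 )) di=$(( idle2 - idle1 ))
    (( dt > 0 )) || { echo 0; return 0; }
    echo $(( 100 * (dt - di) / dt ))
}

before="cpu 1000 0 500 8000 500 0 0 0"
after="cpu 1300 0 600 8500 600 0 0 0"
cpu_busy_percent "$before" "$after"   # prints 40
```

Between the two samples the total grows by 1000 ticks and the idle part (idle plus iowait) by 600, so 400 of 1000 ticks were busy, which is 40%.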
## 6. Log analysis (monitor/log_analyzer.sh)
```bash
#!/usr/bin/env bash
# monitor/log_analyzer.sh: keyword scanning and error-rate counting
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
source "$SCRIPT_DIR/../lib/common.sh"
load_config "$SCRIPT_DIR/../config.env"
require_cmd grep awk tail

ALERT_SCRIPT="$SCRIPT_DIR/../alert/webhook.sh"
LOG_TARGET="${APP_LOG:-/var/log/app/app.log}"
ERROR_THRESHOLD="${ERROR_THRESHOLD:-10}"   # error-count threshold per window
WINDOW_SECONDS=60

# ── Count errors within the last N seconds ──
count_recent_errors() {
    local log="$1"
    local since
    since=$(date -d "${WINDOW_SECONDS} seconds ago" '+%Y-%m-%d %H:%M:%S' 2>/dev/null \
        || date -v -"${WINDOW_SECONDS}"S '+%Y-%m-%d %H:%M:%S')   # BSD/macOS fallback
    # Count ERROR, FATAL, or PANIC lines at or after the cutoff. This relies
    # on every line starting with a "YYYY-MM-DD HH:MM:SS" timestamp, so that
    # string order equals chronological order.
    awk -v since="$since" '
        /ERROR|FATAL|PANIC/ && $0 >= since { err++ }
        END { print err + 0 }
    ' "$log"
}

# ── Keyword alerts ──
scan_keywords() {
    local log="$1"
    local -a keywords=("OutOfMemoryError" "SIGSEGV" "disk full" "connection refused")
    local kw
    for kw in "${keywords[@]}"; do
        local cnt
        # grep -c prints 0 and exits 1 when nothing matches, so the exit
        # status is ignored instead of echoing a second 0
        cnt=$(grep -cF -- "$kw" "$log" 2>/dev/null) || true
        cnt="${cnt:-0}"
        if (( cnt > 0 )); then
            log_warn "Keyword '$kw' found $cnt time(s) in $log"
            "$ALERT_SCRIPT" "Log alert: '$kw' occurred $cnt time(s) in $(basename "$log")"
        fi
    done
}

# ── Real-time follow mode (meant to run in the background) ──
follow_log() {
    local log="$1"
    log_info "Starting real-time log tail on $log"
    tail -F "$log" 2>/dev/null | grep --line-buffered -E 'ERROR|FATAL|PANIC' | \
    while IFS= read -r line; do
        log_error "Log event: $line"
        "$ALERT_SCRIPT" "Real-time log alert: $line"
    done
}

# ── Main flow ──
if [[ ! -f "$LOG_TARGET" ]]; then
    log_warn "Log file not found: $LOG_TARGET"
    exit 0
fi

err_count=$(count_recent_errors "$LOG_TARGET")
log_info "Errors in last ${WINDOW_SECONDS}s: $err_count (threshold: $ERROR_THRESHOLD)"
if (( err_count >= ERROR_THRESHOLD )); then
    "$ALERT_SCRIPT" "Log alert: $err_count errors in ${WINDOW_SECONDS}s on $(hostname)"
fi
scan_keywords "$LOG_TARGET"
```
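The time-window count depends on one property of the log format: timestamps written as `YYYY-MM-DD HH:MM:SS` sort the same way as strings and as points in time. The sketch below demonstrates the filter on a small invented log with a fixed cutoff, so the result does not depend on the current time. The function name `count_errors_since` and the log lines are made up for the example.

```shell
#!/usr/bin/env bash
# count_errors_since LOGFILE CUTOFF: count ERROR/FATAL/PANIC lines whose
# leading "YYYY-MM-DD HH:MM:SS" timestamp is at or after CUTOFF.
# The function name and the sample log are invented for this example.
count_errors_since() {
    awk -v since="$2" '
        substr($0, 1, 19) >= since && /ERROR|FATAL|PANIC/ { n++ }
        END { print n + 0 }
    ' "$1"
}

sample_log="$(mktemp)"
cat > "$sample_log" <<'EOF'
2024-05-01 10:00:01 INFO  service started
2024-05-01 10:00:30 ERROR db timeout
2024-05-01 10:01:10 ERROR db timeout
2024-05-01 10:01:15 FATAL out of connections
2024-05-01 10:01:20 INFO  recovered
EOF

count_errors_since "$sample_log" "2024-05-01 10:01:00"   # prints 2
```

Comparing only the first 19 characters keeps the message text out of the comparison. Logs with another timestamp layout (syslog style, for instance) would need to be converted before a string comparison like this is meaningful.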
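## 7. Webhook alerts (alert/webhook.sh)
The alert script supports three channels: a DingTalk bot, a Feishu bot, and email. To avoid alert storms it keeps one timestamp file per alert message, so an identical message is sent at most once every ALERT_COOLDOWN seconds.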
```bash
#!/usr/bin/env bash
# alert/webhook.sh: multi-channel alerts (DingTalk, Feishu, email) with deduplication
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
source "$SCRIPT_DIR/../lib/common.sh"
load_config "$SCRIPT_DIR/../config.env"
require_cmd curl md5sum

MESSAGE="${1:-Alert from $(hostname)}"
ALERT_COOLDOWN="${ALERT_COOLDOWN:-300}"   # 5-minute cooldown
COOLDOWN_DIR="${COOLDOWN_DIR:-/tmp/ops-alert-cooldown}"
mkdir -p "$COOLDOWN_DIR"

# ── Deduplication ──
# The MD5 of the message is the key, which avoids special characters in file names
msg_hash=$(printf '%s' "$MESSAGE" | md5sum | cut -d' ' -f1)
cooldown_file="$COOLDOWN_DIR/$msg_hash"
if [[ -f "$cooldown_file" ]]; then
    last=$(cat "$cooldown_file")
    now=$(date +%s)
    if (( now - last < ALERT_COOLDOWN )); then
        log_info "Alert suppressed (cooldown active): $MESSAGE"
        exit 0
    fi
fi
date +%s > "$cooldown_file"

# ── JSON helper ──
# Escape backslashes, double quotes, newlines and tabs for a JSON string
json_escape() {
    local s="$1"
    s=${s//\\/\\\\}
    s=${s//\"/\\\"}
    s=${s//$'\n'/\\n}
    s=${s//$'\t'/\\t}
    printf '%s' "$s"
}

# ── DingTalk bot ──
send_dingtalk() {
    [[ -z "${DINGTALK_WEBHOOK:-}" ]] && return 0
    local body
    # \\n in the format produces a literal \n inside the JSON string
    body=$(printf '{"msgtype":"text","text":{"content":"[OPS ALERT] %s\\nHost: %s\\nTime: %s"}}' \
        "$(json_escape "$MESSAGE")" "$(hostname)" "$(date '+%Y-%m-%d %H:%M:%S')")
    local code
    code=$(http_post "$DINGTALK_WEBHOOK" "$body") || true
    if [[ "$code" == "200" ]]; then
        log_info "DingTalk alert sent"
    else
        log_warn "DingTalk alert failed (HTTP $code)"
    fi
}

# ── Feishu bot ──
send_feishu() {
    [[ -z "${FEISHU_WEBHOOK:-}" ]] && return 0
    local body
    body=$(printf '{"msg_type":"text","content":{"text":"[OPS ALERT] %s\\nHost: %s\\nTime: %s"}}' \
        "$(json_escape "$MESSAGE")" "$(hostname)" "$(date '+%Y-%m-%d %H:%M:%S')")
    local code
    code=$(http_post "$FEISHU_WEBHOOK" "$body") || true
    if [[ "$code" == "200" ]]; then
        log_info "Feishu alert sent"
    else
        log_warn "Feishu alert failed (HTTP $code)"
    fi
}

# ── Email alert (needs SMTP or a local sendmail to be configured) ──
send_email() {
    [[ -z "${ALERT_EMAIL:-}" ]] && return 0
    local subject="[OPS ALERT] $(hostname) - $(date '+%H:%M')"
    if command -v mail &>/dev/null; then
        echo "$MESSAGE" | mail -s "$subject" "$ALERT_EMAIL"
        log_info "Email alert sent to $ALERT_EMAIL"
    else
        log_warn "mail command not found; email alert skipped"
    fi
}

# ── Send through every configured channel ──
log_warn "Sending alert: $MESSAGE"
send_dingtalk
send_feishu
send_email
```
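The cooldown logic is easier to reason about, and to test, when it is a function that takes the current time as a parameter instead of reading the clock itself. The sketch below is one way to express that. `should_send` and its arguments are illustrative names, and `cksum` stands in for `md5sum` only to keep the example portable.

```shell
#!/usr/bin/env bash
# should_send DIR MESSAGE COOLDOWN [NOW]: return 0 and record the time if the
# alert may be sent, return 1 while it is still cooling down.
# Names are illustrative; NOW exists so the logic can be tested without sleeping.
should_send() {
    local dir="$1" msg="$2" cooldown="$3" now="${4:-$(date +%s)}"
    local key file last
    # cksum is POSIX; the project script uses md5sum for the same purpose
    key=$(printf '%s' "$msg" | cksum | cut -d' ' -f1)
    file="$dir/$key"
    if [[ -f "$file" ]]; then
        last=$(cat "$file")
        if (( now - last < cooldown )); then
            return 1
        fi
    fi
    printf '%s\n' "$now" > "$file"
}

state_dir="$(mktemp -d)"
should_send "$state_dir" "disk full" 300 1000 && echo "t=1000 sent"
should_send "$state_dir" "disk full" 300 1200 || echo "t=1200 suppressed"
should_send "$state_dir" "disk full" 300 1300 && echo "t=1300 sent"
```

The three calls print `t=1000 sent`, `t=1200 suppressed`, and `t=1300 sent`. Note that the key is derived from the full message text: alerts that embed a changing value, such as `CPU alert: 91%` followed by `CPU alert: 92%`, produce different keys and are not deduplicated against each other.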
## 8. Automated deployment (deploy/rolling_deploy.sh)
```bash
#!/usr/bin/env bash
# deploy/rolling_deploy.sh: rolling restart with automatic rollback
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
source "$SCRIPT_DIR/../lib/common.sh"
load_config "$SCRIPT_DIR/../config.env"
require_cmd git systemctl curl make nproc
lock_file "rolling_deploy"

APP_DIR="${APP_DIR:-/opt/app}"
SERVICE_NAME="${SERVICE_NAME:-myapp}"
HEALTH_URL="${HEALTH_URL:-http://localhost:8080/health}"
INSTANCES="${INSTANCES:-instance1 instance2 instance3}"
ROLLBACK_ON_FAIL="${ROLLBACK_ON_FAIL:-true}"

# ── Record the current commit for rollback ──
OLD_COMMIT=$(git -C "$APP_DIR" rev-parse HEAD)

# ── Pull the latest code ──
log_info "Pulling latest code..."
git -C "$APP_DIR" pull --ff-only || die "git pull failed"
NEW_COMMIT=$(git -C "$APP_DIR" rev-parse HEAD)
log_info "Deploying $OLD_COMMIT -> $NEW_COMMIT"

# ── Build ──
log_info "Building..."
make -C "$APP_DIR" -j"$(nproc)" || {
    log_error "Build failed; rolling back"
    # reset --hard keeps the branch checked out; a plain checkout of the
    # commit would leave a detached HEAD and break the next pull
    git -C "$APP_DIR" reset --hard "$OLD_COMMIT"
    die "Build failed"
}

# ── Health-check helper ──
wait_healthy() {
    local url="$1" retries=10 i=0
    while (( i < retries )); do
        if curl -sf -o /dev/null --max-time 5 "$url"; then
            return 0
        fi
        i=$(( i + 1 ))
        sleep 3   # pause between attempts (value assumed here)
    done
    return 1
}
```
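The point of a rolling deployment is that instances are restarted one at a time and the process stops at the first instance that fails its health check, so the remaining instances keep serving the old version. The sketch below shows that control flow only. `restart_instance`, `instance_healthy`, `rollback_instance`, and `rolling_restart` are hypothetical hooks; in a real script they would wrap `systemctl`, a health probe such as `wait_healthy`, and a return to the previous commit.

```shell
#!/usr/bin/env bash
# Sketch of a rolling restart with rollback. restart_instance,
# instance_healthy, rollback_instance and rolling_restart are hypothetical
# hooks; here they are stubs so the control flow can run anywhere.
restart_instance()  { echo "restart $1"; }
instance_healthy()  { [[ "$1" != "${BROKEN_INSTANCE:-}" ]]; }
rollback_instance() { echo "rollback $1"; }

# rolling_restart INSTANCE...: restart one instance at a time; on the first
# unhealthy instance, roll back everything touched so far and return 1
rolling_restart() {
    local inst prev
    local -a touched=()
    for inst in "$@"; do
        restart_instance "$inst"
        touched+=("$inst")
        if ! instance_healthy "$inst"; then
            echo "unhealthy $inst"
            for prev in "${touched[@]}"; do
                rollback_instance "$prev"
            done
            return 1
        fi
    done
    echo "all instances healthy"
}

rolling_restart instance1 instance2 instance3
BROKEN_INSTANCE=instance2 rolling_restart instance1 instance2 instance3 \
    || echo "deploy aborted"
```

In the second call `instance2` fails its check, so `instance1` and `instance2` are rolled back and `instance3` is never restarted. Whether to roll back automatically or stop and wait for an operator is a policy choice, which is what a switch like `ROLLBACK_ON_FAIL` is for.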
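## 9. Backup system (backup/backup.sh)
Backups use `rsync --link-dest` to create snapshot-style incremental backups: every snapshot is a complete directory view, but only changed files take up additional disk space. The backup can optionally be encrypted with AES-256.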
```bash
#!/usr/bin/env bash
# backup/backup.sh: rsync snapshot backup, AES-256 encryption, rotation
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
source "$SCRIPT_DIR/../lib/common.sh"
load_config "$SCRIPT_DIR/../config.env"
require_cmd rsync

BACKUP_SRC="${BACKUP_SRC:-/opt/app/data}"
BACKUP_DEST="${BACKUP_DEST:-/backup/app}"
KEEP_DAILY="${KEEP_DAILY:-7}"
KEEP_WEEKLY="${KEEP_WEEKLY:-4}"
ENCRYPT="${ENCRYPT_BACKUP:-false}"
PASSPHRASE="${BACKUP_PASSPHRASE:-}"

lock_file "backup"

DATE=$(date '+%Y-%m-%d_%H-%M-%S')
SNAPSHOT_DIR="$BACKUP_DEST/daily/$DATE"
LATEST_LINK="$BACKUP_DEST/latest"
mkdir -p "$BACKUP_DEST/daily"

# ── rsync snapshot ──
log_info "Starting rsync snapshot: $BACKUP_SRC -> $SNAPSHOT_DIR"
rsync_opts=(
    -aAX             # archive mode plus ACLs and extended attributes
    --delete         # drop files that no longer exist in the source
    --numeric-ids    # keep numeric UID/GID values
    --info=progress2
)
# With a previous snapshot present, --link-dest hard-links unchanged files
# to it, which saves disk space
if [[ -d "$LATEST_LINK" ]]; then
    rsync_opts+=(--link-dest="$LATEST_LINK")
fi

rsync "${rsync_opts[@]}" "$BACKUP_SRC/" "$SNAPSHOT_DIR/" \
    || die "rsync failed"

# Point the "latest" symlink at the new snapshot
ln -sfn "$SNAPSHOT_DIR" "$LATEST_LINK"
log_info "Snapshot complete: $SNAPSHOT_DIR"

# ── Weekly backup: on Sundays keep one hard-linked copy per ISO week ──
# Copy the snapshot directory itself; cp -a on the symlink would copy only
# the link. %G is the ISO year that belongs to week number %V.
if [[ "$(date '+%u')" == "7" ]]; then
    mkdir -p "$BACKUP_DEST/weekly"
    cp -al "$SNAPSHOT_DIR" "$BACKUP_DEST/weekly/$(date '+%G-W%V')" 2>/dev/null || true
    find "$BACKUP_DEST/weekly" -mindepth 1 -maxdepth 1 | sort -r | \
        tail -n +"$((KEEP_WEEKLY+1))" | xargs -r rm -rf
    log_info "Weekly backup saved"
fi

# ── Optional: AES-256 encryption ──
if [[ "$ENCRYPT" == "true" ]]; then
    [[ -z "$PASSPHRASE" ]] && die "BACKUP_PASSPHRASE must be set when ENCRYPT_BACKUP=true"
    require_cmd openssl tar
    log_info "Encrypting backup..."
    ARCHIVE="$BACKUP_DEST/daily/${DATE}.tar.gz.enc"
    # -pass env: keeps the passphrase out of the process list
    tar -czf - -C "$BACKUP_DEST/daily" "$DATE" | \
        BACKUP_PASSPHRASE="$PASSPHRASE" openssl enc -aes-256-cbc -pbkdf2 -iter 100000 \
            -pass env:BACKUP_PASSPHRASE -out "$ARCHIVE"
    # Remove the plaintext snapshot once encryption has succeeded. The
    # "latest" link goes with it, so encrypted runs cannot use --link-dest.
    rm -rf "$SNAPSHOT_DIR"
    rm -f "$LATEST_LINK"
    log_info "Encrypted archive: $ARCHIVE"
fi

# ── Retention: delete daily backups beyond the newest KEEP_DAILY ──
# Names start with a sortable timestamp, so sorting by name is reliable.
# Directory mtimes are not, because rsync -a copies them from the source.
log_info "Rotating daily backups (keep last $KEEP_DAILY)..."
find "$BACKUP_DEST/daily" -mindepth 1 -maxdepth 1 | sort -r | \
    tail -n +"$((KEEP_DAILY+1))" | xargs -r rm -rf

log_info "Backup finished successfully"
```
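Retention is the part of a backup script that deletes data, so it is worth isolating in a small function that can be exercised on a scratch directory first. The sketch below keeps the newest entries by name, which works because the snapshot names begin with a sortable timestamp. The function name `prune_snapshots` is illustrative, and the code assumes names without whitespace.

```shell
#!/usr/bin/env bash
# prune_snapshots DIR KEEP: delete all but the KEEP newest entries in DIR,
# judging age by name. Assumes timestamp-style names without whitespace.
# The function name is illustrative.
prune_snapshots() {
    local dir="$1" keep="$2" old
    local -a entries
    # Newest first, because the names sort chronologically
    mapfile -t entries < <(find "$dir" -mindepth 1 -maxdepth 1 | sort -r)
    # Everything after the first KEEP entries is older and gets removed
    for old in "${entries[@]:keep}"; do
        rm -rf -- "$old"
    done
}

demo_dir="$(mktemp -d)"
mkdir "$demo_dir"/2024-05-0{1,2,3,4,5}_02-00-00
prune_snapshots "$demo_dir" 3
ls "$demo_dir"
```

After the call only the snapshots for May 3, 4, and 5 remain. `mapfile` requires bash 4 or later.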
## 10. systemd integration
Using systemd timers instead of crontab brings journal integration, dependency ordering, and failure handling (see Chapter 13).
```ini
# systemd/ops-monitor.service
[Unit]
Description=Ops health check and resource alert
Wants=network-online.target
After=network-online.target

[Service]
Type=oneshot
User=opsuser
WorkingDirectory=/opt/ops-scripts
ExecStart=/opt/ops-scripts/monitor/health_check.sh
ExecStart=/opt/ops-scripts/monitor/resource_alert.sh
EnvironmentFile=/opt/ops-scripts/config.env
StandardOutput=journal
StandardError=journal
SyslogIdentifier=ops-monitor
```
```ini
# systemd/ops-monitor.timer
[Unit]
Description=Run ops monitoring every 5 minutes

[Timer]
OnBootSec=2min
OnUnitActiveSec=5min
RandomizedDelaySec=30

[Install]
WantedBy=timers.target
```
```bash
# Deployment steps
sudo cp /opt/ops-scripts/systemd/*.service /etc/systemd/system/
sudo cp /opt/ops-scripts/systemd/*.timer /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable --now ops-monitor.timer
# ops-backup.timer is not in the directory layout shown in section 2;
# add that unit to systemd/ before enabling it
sudo systemctl enable --now ops-backup.timer

# Verify
systemctl list-timers --all | grep ops
journalctl -u ops-monitor.service -f
```
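The deployment steps enable an `ops-backup.timer`, but the chapter does not show the backup units. As a sketch of what such a pair could look like, the script below writes both files into a staging directory for review. The schedule, the unit contents, and the `UNIT_DIR` variable are assumptions, not the project's actual units. `Persistent=true` takes effect here because the timer uses `OnCalendar=`: a run that was missed while the machine was off is caught up at the next boot.

```shell
#!/usr/bin/env bash
# Sketch: write an ops-backup.service / ops-backup.timer pair into a staging
# directory. The schedule, paths and unit contents are assumptions.
UNIT_DIR="${UNIT_DIR:-$(mktemp -d)}"

cat > "$UNIT_DIR/ops-backup.service" <<'EOF'
[Unit]
Description=Ops rsync snapshot backup

[Service]
Type=oneshot
WorkingDirectory=/opt/ops-scripts
ExecStart=/opt/ops-scripts/backup/backup.sh
EnvironmentFile=/opt/ops-scripts/config.env
SyslogIdentifier=ops-backup
EOF

cat > "$UNIT_DIR/ops-backup.timer" <<'EOF'
[Unit]
Description=Run ops backup daily at 02:30

[Timer]
OnCalendar=*-*-* 02:30:00
Persistent=true
RandomizedDelaySec=10min

[Install]
WantedBy=timers.target
EOF

echo "Units written to $UNIT_DIR"
```

After reviewing the generated files, they can be checked with `systemd-analyze verify` and copied to `/etc/systemd/system/` like the monitoring units.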
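## 11. Automated testing with bats
bats (Bash Automated Testing System) is a unit-testing framework for shell scripts. Each `@test` block is one test case: it executes the code under test through the `run` command and then asserts on the output and the exit code.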
```bash
#!/usr/bin/env bats
# tests/test_common.bats
load '../lib/common.sh'

setup() {
    export LOG_FILE="$(mktemp)"
    export LOCK_DIR="$(mktemp -d)"
}

teardown() {
    rm -f "$LOG_FILE"
    rm -rf "$LOCK_DIR"
}

@test "log_info writes timestamped INFO message" {
    run log_info "hello world"
    [ "$status" -eq 0 ]
    grep -q "INFO" "$LOG_FILE"
    grep -q "hello world" "$LOG_FILE"
}

@test "die exits with code 1 and logs error" {
    run die "something went wrong"
    [ "$status" -eq 1 ]
    grep -q "ERROR" "$LOG_FILE"
    grep -q "something went wrong" "$LOG_FILE"
}

@test "require_cmd succeeds for existing command" {
    run require_cmd bash
    [ "$status" -eq 0 ]
}

@test "require_cmd fails for nonexistent command" {
    run require_cmd this_command_does_not_exist_xyz
    [ "$status" -ne 0 ]
}

@test "lock_file prevents double locking" {
    # Simulate an existing holder instead of calling lock_file in the test
    # itself: its EXIT trap would replace the trap bats relies on
    mkdir "$LOCK_DIR/testlock.lock"
    echo "$$" > "$LOCK_DIR/testlock.lock/pid"
    run bash -c 'source "$1"; lock_file testlock' _ "$BATS_TEST_DIRNAME/../lib/common.sh"
    [ "$status" -ne 0 ]
    [[ "$output" == *"Lock held"* ]]
}
```
```bash
#!/usr/bin/env bats
# tests/test_webhook.bats: tests for the alert deduplication logic

setup() {
    export LOG_FILE="$(mktemp)"
    export LOCK_DIR="$(mktemp -d)"
    export ALERT_COOLDOWN=300
    export COOLDOWN_DIR="$(mktemp -d)"
    WEBHOOK="$BATS_TEST_DIRNAME/../alert/webhook.sh"
    # Disable real delivery by clearing the channel variables
    unset DINGTALK_WEBHOOK FEISHU_WEBHOOK ALERT_EMAIL
}

teardown() {
    rm -f "$LOG_FILE"
    rm -rf "$LOCK_DIR" "$COOLDOWN_DIR"
}

@test "first alert goes through" {
    run bash "$WEBHOOK" "test alert message"
    [ "$status" -eq 0 ]
    # The cooldown file should have been created
    local hash
    hash=$(printf '%s' "test alert message" | md5sum | cut -d' ' -f1)
    [ -f "$COOLDOWN_DIR/$hash" ]
}

@test "duplicate alert within cooldown is suppressed" {
    # Send once
    bash "$WEBHOOK" "dup message"
    # The second send should be suppressed
    run bash "$WEBHOOK" "dup message"
    [ "$status" -eq 0 ]
    grep -q "suppressed" "$LOG_FILE"
}
```
### GitHub Actions CI integration
```yaml
# .github/workflows/test.yml
name: Shell Tests
on: [push, pull_request]
jobs:
  bats:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install bats
        run: |
          sudo apt-get update
          sudo apt-get install -y bats
          bats --version
      # The scripts under test load config.env, which is not committed
      - name: Provide a config file
        run: cp config.env.example config.env
      - name: Run bats tests
        run: bats tests/
      - name: Run shellcheck
        run: |
          sudo apt-get install -y shellcheck
          shellcheck lib/*.sh monitor/*.sh alert/*.sh deploy/*.sh backup/*.sh
```
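## 12. Book summary and next steps
The book builds up five layers of capability, and this chapter's project draws on all of them:
| Capability layer | Chapters | How it appears in this project |
|---|---|---|
| Basic operations | Ch1–Ch4 | Directory layout, text processing (awk/grep), log analysis |
| System understanding | Ch5–Ch8 | Process checks (pgrep), network checks (nc/curl), disk monitoring (df/rsync) |
| Shell programming | Ch9–Ch12 | Function library, associative arrays, pipelines, set -euo pipefail, bats tests |
| Production practice | Ch13–Ch16 | systemd service/timer, performance metrics collection, encrypted backups, mutual-exclusion locks |
| Kernel and contribution | Ch17–Ch19 | Understanding system call paths, using kernel-level debugging tools, contributing upstream |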
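### Where to go next
- Go systems programming: shell is good at glue scripts, while Go suits performance-sensitive tools. Recommended reading: 《Systems Programming in Go》
- eBPF observability: BCC and bpftrace are powerful tools for modern Linux performance analysis and a natural extension of Chapter 14
- Ansible / Terraform: the next step up from this chapter's script system, toward declarative infrastructure as code
- Kernel contribution practice: following Chapter 19, start with checkpatch fixes in drivers/staging and submit your first kernel patch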
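Congratulations on finishing the book! The Linux shell has never been just a pile of command-line tools. It is a language for talking to the operating system, and a way of thinking that makes complex systems controllable, observable, and automatable. With what you have learned here, you have the foundation to take responsibility for systems in a production environment.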