Chapter 20: Final Project — Production-Grade DevOps Script System

This chapter synthesizes every skill from the book into a complete, production-grade operations scripting system. It covers service health monitoring, resource alerting, log analysis, automated deployment, backup and recovery, webhook notifications, systemd integration, and bats automated testing. Every line of code maps back to a specific earlier chapter. After finishing this chapter you will have a deployable ops infrastructure foundation.

1. Project Overview and Architecture

The system follows four design principles: single responsibility (each script does one thing), observability (a unified log format), idempotency (running a script again produces the same end state as running it once), and testability (critical logic is covered by bats tests).
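
As a small illustration of the idempotency principle, the sketch below appends a line to a file only when that line is missing, so a second run changes nothing. The function name ensure_line is hypothetical and is not part of the project's scripts.

```bash
#!/usr/bin/env bash
# ensure_line LINE FILE: append LINE to FILE unless an identical line exists.
# Hypothetical helper for illustration; not part of lib/common.sh.
ensure_line() {
  local line="$1" file="$2"
  # -x matches whole lines only, -F treats LINE as a fixed string
  if ! grep -qxF -- "$line" "$file" 2>/dev/null; then
    printf '%s\n' "$line" >> "$file"
  fi
}
```

Calling ensure_line "vm.swappiness=10" /tmp/demo.conf twice leaves a single copy of the line in the file.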

Module	Function	Source Chapter
lib/common.sh	Shared function library	Ch9, Ch10, Ch11
monitor/health_check.sh	HTTP/TCP/process monitoring	Ch7, Ch5
monitor/resource_alert.sh	CPU/memory/disk alerting	Ch14, Ch8
monitor/log_analyzer.sh	Log keyword scanning	Ch4, Ch11
alert/webhook.sh	DingTalk/Feishu notifications	Ch7, Ch12
deploy/rolling_deploy.sh	Rolling deploy and rollback	Ch5, Ch12
backup/backup.sh	rsync+encrypted backup	Ch8, Ch15
systemd/	systemd service/timer units	Ch13
tests/	bats automated tests	Ch12

2. Project Directory Structure

ops-scripts/
├── lib/
│   └── common.sh              # shared function library
├── monitor/
│   ├── health_check.sh        # HTTP/TCP/process checks
│   ├── resource_alert.sh      # resource threshold alerting
│   └── log_analyzer.sh        # log keyword analysis
├── alert/
│   └── webhook.sh             # DingTalk/Feishu/email notify
├── deploy/
│   └── rolling_deploy.sh      # rolling deploy and rollback
├── backup/
│   ├── backup.sh              # rsync snapshot backup+encryption
│   └── restore.sh             # restore script
├── systemd/
│   ├── ops-monitor.service
│   ├── ops-monitor.timer
│   └── ops-backup.service
├── tests/
│   ├── test_common.bats
│   ├── test_webhook.bats
│   └── test_backup.bats
├── config.env                 # config file (not committed to git)
└── config.env.example         # config template

3. Shared Function Library (lib/common.sh)

Every script resolves its own directory into SCRIPT_DIR and then sources the common library with source "$SCRIPT_DIR/../lib/common.sh". The library provides unified log formatting, dependency checking, mutex locking, an HTTP helper, and config loading.

#!/usr/bin/env bash
# lib/common.sh: shared function library
set -euo pipefail

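# ── Color constants (enabled only when stderr is a tty) ──
if [[ -t 2 ]]; then
  # $'...' stores the real escape byte; a plain '\033' would be printed
  # literally because _log formats messages with printf '%s'
  RED=$'\033[0;31m'; YELLOW=$'\033[1;33m'
  GREEN=$'\033[0;32m'; CYAN=$'\033[0;36m'
  NC=$'\033[0m'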
else
  RED=''; YELLOW=''; GREEN=''; CYAN=''; NC=''
fi

# ── Logging functions ────────────────────────────────────
LOG_FILE="${LOG_FILE:-/var/log/ops-scripts/ops.log}"

_log() {
  local level="$1"; shift
  local ts
  ts=$(date '+%Y-%m-%dT%H:%M:%S%z')
  # Write to both stderr and the log file
  printf '%s [%s] %s\n' "$ts" "$level" "$*" | tee -a "$LOG_FILE" >&2
}

log_info()  { _log "INFO " "${CYAN}$*${NC}"; }
log_warn()  { _log "WARN " "${YELLOW}$*${NC}"; }
log_error() { _log "ERROR" "${RED}$*${NC}"; }

die() {
  log_error "$*"
  exit 1
}

# ── Dependency check ─────────────────────────────────────
require_cmd() {
  local cmd
  for cmd in "$@"; do
    command -v "$cmd" &>/dev/null || die "Required command not found: $cmd"
  done
}

# ── Mutex lock ───────────────────────────────────────────
LOCK_DIR="${LOCK_DIR:-/tmp/ops-locks}"
mkdir -p "$LOCK_DIR"

lock_file() {
  local name="$1"
  local lock="$LOCK_DIR/${name}.lock"
  if ! mkdir "$lock" 2>/dev/null; then
    local pid
    pid=$(cat "$lock/pid" 2>/dev/null || echo "unknown")
    die "Lock held by PID $pid: $lock"
  fi
  echo $$ > "$lock/pid"
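  # Release the lock automatically on exit. INT and TERM are turned into a
  # normal exit so the EXIT trap runs and the script stops instead of resuming.
  trap "unlock_file '$name'" EXIT
  trap 'exit 130' INT
  trap 'exit 143' TERM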
}

unlock_file() {
  local name="$1"
  rm -rf "$LOCK_DIR/${name}.lock"
}

# ── HTTP helper ───────────────────────────────────────────
# http_post URL JSON_BODY: send a POST request and print the HTTP status code
http_post() {
  local url="$1"
  local body="$2"
  curl -s -o /dev/null -w "%{http_code}" \
    -H 'Content-Type: application/json' \
    --data "$body" \
    --max-time 10 \
    "$url"
}

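# ── Config loading ────────────────────────────────────────
load_config() {
  local cfg="${1:-config.env}" key value
  [[ -f "$cfg" ]] || die "Config file not found: $cfg"
  # Safe loading: accept only KEY=VALUE lines; skip comments and blank lines
  while IFS='=' read -r key value; do
    [[ "$key" =~ ^[A-Za-z_][A-Za-z0-9_]*$ ]] || continue
    # Assumption: values may be wrapped in double quotes, as in the
    # HTTP_CHECKS format shown in health_check.sh; strip one surrounding pair
    value="${value#\"}"; value="${value%\"}"
    export "$key"="$value"
  done < <(grep -Ev '^[[:space:]]*(#|$)' "$cfg")
}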
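
The mkdir-based mutex in the library refuses to run whenever the lock directory exists, even if the process that created it has since crashed. The following is a minimal sketch of one way to reclaim such stale locks, assuming the lock owner runs as the same user. The name acquire_lock is hypothetical, and unlike lock_file it returns 1 instead of exiting when the lock is busy.

```bash
#!/usr/bin/env bash
# acquire_lock NAME: mkdir-based mutex that reclaims a lock whose owner has exited.
# Hypothetical variant of lock_file, shown for illustration only.
LOCK_DIR="${LOCK_DIR:-/tmp/ops-locks}"

acquire_lock() {
  local lock="$LOCK_DIR/$1.lock" pid
  mkdir -p "$LOCK_DIR"
  if mkdir "$lock" 2>/dev/null; then
    echo "$$" > "$lock/pid"
    return 0
  fi
  pid=$(cat "$lock/pid" 2>/dev/null || true)
  # kill -0 sends no signal; it only tests whether the PID can be signalled.
  # A process owned by another user also fails this test (assumption above).
  if [[ -n "$pid" ]] && kill -0 "$pid" 2>/dev/null; then
    return 1                      # owner is still alive
  fi
  # Owner is gone: remove the stale lock and try once more
  rm -rf "$lock"
  mkdir "$lock" 2>/dev/null || return 1
  echo "$$" > "$lock/pid"
}
```

Reclaiming is not atomic: two processes that find the same stale lock at the same moment can race between rm and mkdir, so this suits infrequent jobs rather than tightly contended ones.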
  
4. Service Health Monitoring (monitor/health_check.sh)

The health check script supports three check types: HTTP status code, TCP port reachability, and process liveness. Configuration comes from config.env; failures call alert/webhook.sh to send alerts.


  
#!/usr/bin/env bash
# monitor/health_check.sh
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
source "$SCRIPT_DIR/../lib/common.sh"
load_config "$SCRIPT_DIR/../config.env"

require_cmd curl nc pgrep

ALERT_SCRIPT="$SCRIPT_DIR/../alert/webhook.sh"
FAIL_COUNT=0

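# ── HTTP check ────────────────────────────────────────────
check_http() {
  local name="$1" url="$2" expected="${3:-200}"
  local code
  # Without -f, curl prints the status code even for 4xx/5xx responses.
  # On a connection failure -w already prints 000, so appending a second
  # "000" with "|| echo" would corrupt the value.
  code=$(curl -s -o /dev/null -w "%{http_code}" \
    --max-time 10 --connect-timeout 5 "$url" 2>/dev/null) || true
  code="${code:-000}"

  if [[ "$code" == "$expected" ]]; then
    log_info "HTTP OK: $name ($url) => $code"
  else
    log_error "HTTP FAIL: $name ($url) expected=$expected got=$code"
    "$ALERT_SCRIPT" "HTTP check failed: $name returned $code (expected $expected)" || true
    # Plain assignment: (( FAIL_COUNT++ )) returns status 1 when the counter
    # is 0, which aborts the script under set -e
    FAIL_COUNT=$(( FAIL_COUNT + 1 ))
  fi
}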

# ── TCP check ─────────────────────────────────────────────
check_tcp() {
  local name="$1" host="$2" port="$3"
  if nc -z -w 3 "$host" "$port" &>/dev/null; then
    log_info "TCP OK: $name ($host:$port)"
  else
    log_error "TCP FAIL: $name ($host:$port) unreachable"
    "$ALERT_SCRIPT" "TCP check failed: $name ($host:$port) is unreachable"
    FAIL_COUNT=$(( FAIL_COUNT + 1 ))   # (( FAIL_COUNT++ )) trips set -e when the counter is 0
  fi
}

# ── Process check ─────────────────────────────────────────
check_process() {
  local name="$1" proc="$2"
  if pgrep -x "$proc" &>/dev/null; then
    log_info "PROC OK: $name ($proc running)"
  else
    log_error "PROC FAIL: $name ($proc not found)"
    "$ALERT_SCRIPT" "Process check failed: $name ($proc) is not running"
    FAIL_COUNT=$(( FAIL_COUNT + 1 ))   # (( FAIL_COUNT++ )) trips set -e when the counter is 0
  fi
}

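# ── Run all checks (targets are read from config.env) ─────
# Format: HTTP_CHECKS="name,url,expected_code name2,url2,200"
# Note: expanding an empty array under set -u needs bash 4.4 or newer
IFS=' ' read -ra http_list <<< "${HTTP_CHECKS:-}"
for entry in "${http_list[@]}"; do
  IFS=',' read -r name url expected <<< "$entry"
  check_http "$name" "$url" "${expected:-200}"
done

# Assumption: TCP_CHECKS ("name,host,port ...") and PROC_CHECKS ("name,process ...")
# are hypothetical variables that follow the same space-separated convention
IFS=' ' read -ra tcp_list <<< "${TCP_CHECKS:-}"
for entry in "${tcp_list[@]}"; do
  IFS=',' read -r name host port <<< "$entry"
  check_tcp "$name" "$host" "$port"
done

IFS=' ' read -ra proc_list <<< "${PROC_CHECKS:-}"
for entry in "${proc_list[@]}"; do
  IFS=',' read -r name proc <<< "$entry"
  check_process "$name" "$proc"
done

log_info "Health check finished: $FAIL_COUNT failure(s)"
(( FAIL_COUNT == 0 )) || exit 1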
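
A single failed probe can be a transient network blip rather than a real outage. One possible refinement, sketched here under that assumption, is to retry a probe a few times before counting it as a failure. The helper name retry is hypothetical and does not exist in lib/common.sh.

```bash
#!/usr/bin/env bash
# retry ATTEMPTS DELAY CMD [ARGS...]: run CMD until it succeeds or ATTEMPTS runs out.
# Returns 0 as soon as CMD succeeds, 1 if every attempt fails.
retry() {
  local attempts="$1" delay="$2" n=1
  shift 2
  until "$@"; do
    if (( n >= attempts )); then
      return 1
    fi
    n=$(( n + 1 ))
    sleep "$delay"
  done
}
```

Inside check_tcp, for example, the probe could become retry 3 2 nc -z -w 3 "$host" "$port", which alerts only after three consecutive failures spaced two seconds apart.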
  
5. Resource Alerting (monitor/resource_alert.sh)


  
#!/usr/bin/env bash
# monitor/resource_alert.sh: CPU/memory/disk alerting
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
source "$SCRIPT_DIR/../lib/common.sh"
load_config "$SCRIPT_DIR/../config.env"

require_cmd awk df free

ALERT_SCRIPT="$SCRIPT_DIR/../alert/webhook.sh"

# Default thresholds (can be overridden in config.env)
CPU_THRESHOLD="${CPU_THRESHOLD:-85}"
MEM_THRESHOLD="${MEM_THRESHOLD:-90}"
DISK_THRESHOLD="${DISK_THRESHOLD:-85}"

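# ── CPU usage ─────────────────────────────────────────────
cpu_usage() {
  # Compute CPU utilization from two /proc/stat snapshots.
  # Assumption: a one-second sampling interval; only the idle column counts as idle.
  local cpu1 cpu2 idle1 idle2 total1 total2 v
  local -a f1 f2
  read -r _ cpu1 < /proc/stat
  sleep 1
  read -r _ cpu2 < /proc/stat
  read -ra f1 <<< "$cpu1"; read -ra f2 <<< "$cpu2"
  idle1=${f1[3]}; idle2=${f2[3]}            # fourth value is idle time
  total1=0; for v in "${f1[@]:0:8}"; do total1=$(( total1 + v )); done
  total2=0; for v in "${f2[@]:0:8}"; do total2=$(( total2 + v )); done
  local dt=$(( total2 - total1 ))
  (( dt > 0 )) || { echo 0; return 0; }
  echo $(( 100 * (dt - (idle2 - idle1)) / dt ))
}

# ── Memory usage ──────────────────────────────────────────
mem_usage() {
  # Assumption: free prints an "available" value in column 7 (procps-ng 3.3.10+)
  free | awk '/^Mem:/ { printf "%d", ($2 - $7) * 100 / $2 }'
}

# ── Disk usage ────────────────────────────────────────────
check_disk() {
  local usage mnt
  # Assumption: only device-backed filesystems (source path starts with /) are checked
  df -P | awk 'NR > 1 && $1 ~ /^\// { sub(/%/, "", $5); print $5, $6 }' |
  while read -r usage mnt; do
    if (( usage >= DISK_THRESHOLD )); then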
      log_warn "Disk usage ${usage}% on $mnt (threshold: ${DISK_THRESHOLD}%)"
      "$ALERT_SCRIPT" "Disk alert: ${usage}% used on $mnt"
    fi
  done
}

# ── Run the checks ────────────────────────────────────────
cpu=$(cpu_usage)
mem=$(mem_usage)

log_info "Resource snapshot — CPU: ${cpu}%  MEM: ${mem}%"

if (( cpu >= CPU_THRESHOLD )); then
  log_warn "CPU usage ${cpu}% exceeds threshold ${CPU_THRESHOLD}%"
  "$ALERT_SCRIPT" "CPU alert: ${cpu}% (threshold: ${CPU_THRESHOLD}%)"
fi

if (( mem >= MEM_THRESHOLD )); then
  log_warn "Memory usage ${mem}% exceeds threshold ${MEM_THRESHOLD}%"
  "$ALERT_SCRIPT" "Memory alert: ${mem}% (threshold: ${MEM_THRESHOLD}%)"
fi

check_disk
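
The two-snapshot CPU calculation is easier to verify when the arithmetic is separated from the sampling. The sketch below is a pure function that takes two "cpu ..." lines as they appear at the top of /proc/stat and prints the busy percentage between them. The name cpu_percent is hypothetical, and iowait is counted as busy time here, which is a simplification.

```bash
#!/usr/bin/env bash
# cpu_percent LINE1 LINE2: busy percentage between two "cpu ..." lines from /proc/stat.
cpu_percent() {
  local -a a b
  local v total1=0 total2=0
  read -ra a <<< "$1"
  read -ra b <<< "$2"
  # Element 0 is the label "cpu"; elements 1..8 are
  # user nice system idle iowait irq softirq steal
  for v in "${a[@]:1:8}"; do total1=$(( total1 + v )); done
  for v in "${b[@]:1:8}"; do total2=$(( total2 + v )); done
  local dt=$(( total2 - total1 ))
  local didle=$(( b[4] - a[4] ))          # element 4 is the idle counter
  if (( dt <= 0 )); then
    echo 0
    return 0
  fi
  echo $(( 100 * (dt - didle) / dt ))
}
```

On a live system the two lines can be captured with read -r s1 < /proc/stat; sleep 1; read -r s2 < /proc/stat and then passed as cpu_percent "$s1" "$s2".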

6. Log Analysis (monitor/log_analyzer.sh)

#!/usr/bin/env bash
# monitor/log_analyzer.sh: keyword scanning and error-rate counting
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
source "$SCRIPT_DIR/../lib/common.sh"
load_config "$SCRIPT_DIR/../config.env"

require_cmd grep awk tail

ALERT_SCRIPT="$SCRIPT_DIR/../alert/webhook.sh"
LOG_TARGET="${APP_LOG:-/var/log/app/app.log}"
ERROR_THRESHOLD="${ERROR_THRESHOLD:-10}"  # error-count threshold per minute
WINDOW_SECONDS=60

# ── Count errors in the last N seconds ────────────────────
count_recent_errors() {
  local log="$1"
  local since
  since=$(date -d "${WINDOW_SECONDS} seconds ago" '+%Y-%m-%d %H:%M:%S' 2>/dev/null \
       || date -v -"${WINDOW_SECONDS}"S '+%Y-%m-%d %H:%M:%S')  # BSD/macOS fallback

  # Count lines containing ERROR, FATAL, or PANIC. The string comparison
  # assumes every log line starts with a "YYYY-MM-DD HH:MM:SS" timestamp.
  awk -v since="$since" '
    /ERROR|FATAL|PANIC/ && $0 >= since { err++ }
    END { print err+0 }
  ' "$log"
}

# ── Keyword alerts ────────────────────────────────────────
scan_keywords() {
  local log="$1"
  local -a keywords=("OutOfMemoryError" "SIGSEGV" "disk full" "connection refused")
  local kw cnt
  for kw in "${keywords[@]}"; do
    # grep -c prints 0 and exits 1 when nothing matches, so "|| echo 0"
    # would produce "0" twice; ignore the exit status instead
    cnt=$(grep -cF -- "$kw" "$log" 2>/dev/null) || true
    cnt="${cnt:-0}"
    if (( cnt > 0 )); then
      log_warn "Keyword '$kw' found $cnt time(s) in $log"
      "$ALERT_SCRIPT" "Log alert: '$kw' occurred $cnt time(s) in $(basename "$log")"
    fi
  done
}

# ── Real-time follow mode (runs in the background) ────────
follow_log() {
  local log="$1"
  log_info "Starting real-time log tail on $log"
  tail -F "$log" 2>/dev/null | grep --line-buffered -E 'ERROR|FATAL|PANIC' | \
  while IFS= read -r line; do
    log_error "Log event: $line"
    "$ALERT_SCRIPT" "Real-time log alert: $line"
  done
}

# ── Main flow ─────────────────────────────────────────────
if [[ ! -f "$LOG_TARGET" ]]; then
  log_warn "Log file not found: $LOG_TARGET"
  exit 0
fi

err_count=$(count_recent_errors "$LOG_TARGET")
log_info "Errors in last ${WINDOW_SECONDS}s: $err_count (threshold: $ERROR_THRESHOLD)"

if (( err_count >= ERROR_THRESHOLD )); then
  "$ALERT_SCRIPT" "Log alert: $err_count errors in ${WINDOW_SECONDS}s on $(hostname)"
fi

scan_keywords "$LOG_TARGET"
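
The time-window filter in count_recent_errors relies on fixed-width timestamps sorting lexically in chronological order. The following sketch isolates that idea with an explicit cutoff argument so it can be tried against a sample file. It assumes each line begins with a 19-character "YYYY-MM-DD HH:MM:SS" timestamp, and the name count_errors_since is hypothetical.

```bash
#!/usr/bin/env bash
# count_errors_since FILE SINCE: count ERROR/FATAL/PANIC lines whose leading
# "YYYY-MM-DD HH:MM:SS" timestamp is at or after SINCE.
count_errors_since() {
  local file="$1" since="$2"
  awk -v since="$since" '
    # Compare only the timestamp prefix; this is a string comparison
    substr($0, 1, 19) >= since && /ERROR|FATAL|PANIC/ { n++ }
    END { print n + 0 }
  ' "$file"
}
```

Comparing only the first 19 characters keeps the rest of the line out of the comparison, which matters when two lines share the same second.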

7. Webhook Alerting (alert/webhook.sh)

The alert script supports three channels: DingTalk robot, Feishu robot, and email. To prevent alert storms, a per-message timestamp file ensures that an identical alert message fires at most once per ALERT_COOLDOWN seconds.

#!/usr/bin/env bash
# alert/webhook.sh: multi-channel alerts (DingTalk/Feishu/email) with deduplication
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
source "$SCRIPT_DIR/../lib/common.sh"
load_config "$SCRIPT_DIR/../config.env"

require_cmd curl

MESSAGE="${1:-Alert from $(hostname)}"
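ALERT_COOLDOWN="${ALERT_COOLDOWN:-300}"  # 5-minute cooldown
# Honor an existing COOLDOWN_DIR so tests can point it at a temp directory
COOLDOWN_DIR="${COOLDOWN_DIR:-/tmp/ops-alert-cooldown}"
mkdir -p "$COOLDOWN_DIR"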

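# ── Alert deduplication ───────────────────────────────────
# Use the MD5 of the message as the key (avoids special characters in file names)
msg_hash=$(printf '%s' "$MESSAGE" | md5sum | cut -d' ' -f1)
cooldown_file="$COOLDOWN_DIR/$msg_hash"

if [[ -f "$cooldown_file" ]]; then
  last=$(cat "$cooldown_file")
  now=$(date +%s)
  if (( now - last < ALERT_COOLDOWN )); then
    # The exact wording is assumed; tests/test_webhook.bats greps for "suppressed"
    log_info "Alert suppressed (within ${ALERT_COOLDOWN}s cooldown): $MESSAGE"
    exit 0
  fi
fi
date +%s > "$cooldown_file"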

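# ── DingTalk robot ────────────────────────────────────────
send_dingtalk() {
  [[ -z "${DINGTALK_WEBHOOK:-}" ]] && return 0
  local body
  # \\n emits a literal \n for JSON; a bare \n in the printf format would
  # produce a raw newline, which is invalid inside a JSON string.
  # MESSAGE is inserted unescaped, so it must not contain " or \
  body=$(printf '{"msgtype":"text","text":{"content":"[OPS ALERT] %s\\nHost: %s\\nTime: %s"}}' \
    "$MESSAGE" "$(hostname)" "$(date '+%Y-%m-%d %H:%M:%S')")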
  local code
  code=$(http_post "$DINGTALK_WEBHOOK" "$body")
  if [[ "$code" == "200" ]]; then
    log_info "DingTalk alert sent"
  else
    log_warn "DingTalk alert failed (HTTP $code)"
  fi
}

# ── Feishu robot ──────────────────────────────────────────
send_feishu() {
  [[ -z "${FEISHU_WEBHOOK:-}" ]] && return 0
  local body
  # \\n emits a literal \n for JSON (see send_dingtalk); MESSAGE is unescaped
  body=$(printf '{"msg_type":"text","content":{"text":"[OPS ALERT] %s\\nHost: %s\\nTime: %s"}}' \
    "$MESSAGE" "$(hostname)" "$(date '+%Y-%m-%d %H:%M:%S')")
  local code
  code=$(http_post "$FEISHU_WEBHOOK" "$body")
  if [[ "$code" == "200" ]]; then
    log_info "Feishu alert sent"
  else
    log_warn "Feishu alert failed (HTTP $code)"
  fi
}

# ── Email alert (requires SMTP or a local sendmail) ───────
send_email() {
  [[ -z "${ALERT_EMAIL:-}" ]] && return 0
  local subject="[OPS ALERT] $(hostname) - $(date '+%H:%M')"
  if command -v mail &>/dev/null; then
    echo "$MESSAGE" | mail -s "$subject" "$ALERT_EMAIL"
    log_info "Email alert sent to $ALERT_EMAIL"
  fi
}

# ── Send to all channels ──────────────────────────────────
log_warn "Sending alert: $MESSAGE"
send_dingtalk
send_feishu
send_email
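
Because the alert text is placed directly into a JSON string, a message containing a double quote, a backslash, or a newline would produce an invalid payload. This is a minimal sketch of an escaping helper in pure bash. The name json_escape is hypothetical, and it covers only the characters listed in the comments, not every control character.

```bash
#!/usr/bin/env bash
# json_escape STRING: escape STRING for use inside a double-quoted JSON string.
# Handles backslash, double quote, newline, carriage return, and tab only.
json_escape() {
  local s="$1"
  s=${s//\\/\\\\}       # backslash first, so later escapes are not doubled
  s=${s//\"/\\\"}       # double quote
  s=${s//$'\n'/\\n}     # newline
  s=${s//$'\r'/\\r}     # carriage return
  s=${s//$'\t'/\\t}     # tab
  printf '%s' "$s"
}
```

In send_dingtalk and send_feishu the first printf argument could then be "$(json_escape "$MESSAGE")" instead of "$MESSAGE". Where jq is available, it is a more complete way to build the payload.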

8. Automated Deployment (deploy/rolling_deploy.sh)

#!/usr/bin/env bash
# deploy/rolling_deploy.sh: rolling restart with automatic rollback
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
source "$SCRIPT_DIR/../lib/common.sh"
load_config "$SCRIPT_DIR/../config.env"

require_cmd git make nproc systemctl curl

lock_file "rolling_deploy"

APP_DIR="${APP_DIR:-/opt/app}"
SERVICE_NAME="${SERVICE_NAME:-myapp}"
HEALTH_URL="${HEALTH_URL:-http://localhost:8080/health}"
INSTANCES="${INSTANCES:-instance1 instance2 instance3}"
ROLLBACK_ON_FAIL="${ROLLBACK_ON_FAIL:-true}"

# ── Record the current commit for rollback ────────────────
OLD_COMMIT=$(git -C "$APP_DIR" rev-parse HEAD)

# ── Pull the latest code ──────────────────────────────────
log_info "Pulling latest code..."
git -C "$APP_DIR" pull --ff-only || die "git pull failed"

NEW_COMMIT=$(git -C "$APP_DIR" rev-parse HEAD)
log_info "Deploying $OLD_COMMIT -> $NEW_COMMIT"

# ── Build ─────────────────────────────────────────────────
log_info "Building..."
make -C "$APP_DIR" -j"$(nproc)" || {
  log_error "Build failed; rolling back"
  git -C "$APP_DIR" checkout "$OLD_COMMIT"
  die "Build failed"
}

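# ── wait_healthy helper ───────────────────────────────────
wait_healthy() {
  local url="$1" retries=10 i=0
  while (( i < retries )); do
    # Assumption: any response below HTTP 400 within 5 seconds counts as healthy
    if curl -sf -o /dev/null --max-time 5 "$url"; then
      return 0
    fi
    i=$(( i + 1 ))
    sleep 3   # assumed pause between attempts
  done
  return 1
}

# ── Rolling restart ───────────────────────────────────────
# Assumption: each instance is a templated unit "$SERVICE_NAME@<instance>"
# and HEALTH_URL reports on the instance that was just restarted.
for inst in $INSTANCES; do
  log_info "Restarting ${SERVICE_NAME}@${inst}"
  systemctl restart "${SERVICE_NAME}@${inst}"
  if ! wait_healthy "$HEALTH_URL"; then
    log_error "Instance $inst is unhealthy after restart"
    if [[ "$ROLLBACK_ON_FAIL" == "true" ]]; then
      log_warn "Rolling back to $OLD_COMMIT"
      git -C "$APP_DIR" checkout "$OLD_COMMIT"
      make -C "$APP_DIR" -j"$(nproc)"
      for r in $INSTANCES; do
        systemctl restart "${SERVICE_NAME}@${r}"
      done
    fi
    die "Deploy failed at instance $inst"
  fi
done

log_info "Deploy complete: $NEW_COMMIT"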
  
9. Backup System (backup/backup.sh)


  
Backups use rsync --link-dest for snapshot-style incremental backups: each snapshot looks like a full directory, but only changed files consume additional disk space. The script can optionally encrypt the backup with AES-256.


  
#!/usr/bin/env bash
# backup/backup.sh: rsync snapshot backup with AES-256 encryption and rotation
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
source "$SCRIPT_DIR/../lib/common.sh"
load_config "$SCRIPT_DIR/../config.env"

require_cmd rsync

BACKUP_SRC="${BACKUP_SRC:-/opt/app/data}"
BACKUP_DEST="${BACKUP_DEST:-/backup/app}"
KEEP_DAILY="${KEEP_DAILY:-7}"
KEEP_WEEKLY="${KEEP_WEEKLY:-4}"
ENCRYPT="${ENCRYPT_BACKUP:-false}"
PASSPHRASE="${BACKUP_PASSPHRASE:-}"

lock_file "backup"

DATE=$(date '+%Y-%m-%d_%H-%M-%S')
SNAPSHOT_DIR="$BACKUP_DEST/daily/$DATE"
LATEST_LINK="$BACKUP_DEST/latest"

mkdir -p "$BACKUP_DEST/daily"

# ── rsync snapshot backup ─────────────────────────────────
log_info "Starting rsync snapshot: $BACKUP_SRC -> $SNAPSHOT_DIR"
rsync_opts=(
  -aAX                    # archive mode + ACLs + extended attributes
  --delete                # delete files that no longer exist in the source
  --numeric-ids           # keep numeric UID/GID values
  --info=progress2
)

# If a previous backup exists, use --link-dest to save disk space (unchanged files become hard links)
if [[ -d "$LATEST_LINK" ]]; then
  rsync_opts+=(--link-dest="$LATEST_LINK")
fi

rsync "${rsync_opts[@]}" "$BACKUP_SRC/" "$SNAPSHOT_DIR/" \
  || die "rsync failed"

# Update the latest symlink
ln -sfn "$SNAPSHOT_DIR" "$LATEST_LINK"
log_info "Snapshot complete: $SNAPSHOT_DIR"

# ── Optional: AES-256 encryption ──────────────────────────
if [[ "$ENCRYPT" == "true" ]]; then
  [[ -z "$PASSPHRASE" ]] && die "BACKUP_PASSPHRASE must be set when ENCRYPT_BACKUP=true"
  require_cmd openssl tar
  log_info "Encrypting backup..."
  ARCHIVE="$BACKUP_DEST/daily/${DATE}.tar.gz.enc"
  # env:BACKUP_PASSPHRASE keeps the passphrase out of the process list;
  # load_config has already exported the variable
  tar -czf - -C "$BACKUP_DEST/daily" "$DATE" | \
    openssl enc -aes-256-cbc -pbkdf2 -iter 100000 \
      -pass env:BACKUP_PASSPHRASE -out "$ARCHIVE"
  # Remove the plaintext snapshot once encryption has succeeded.
  # The latest symlink then dangles, so the next run makes a full copy.
  rm -rf "$SNAPSHOT_DIR"
  log_info "Encrypted archive: $ARCHIVE"
fi

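# ── Retention: delete daily snapshots beyond the limit ────
log_info "Rotating daily backups (keep last $KEEP_DAILY)..."
# ls exits non-zero when no snapshot directory matches (for example after
# encryption); "|| true" stops pipefail from aborting the script in that case
{ ls -1dt "$BACKUP_DEST"/daily/*/ 2>/dev/null || true; } | tail -n +"$((KEEP_DAILY+1))" | \
  xargs -r rm -rf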

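# Weekly backup: keep one copy every Sunday (date +%u prints 7 on Sunday)
if [[ "$(date '+%u')" == "7" ]]; then
  mkdir -p "$BACKUP_DEST/weekly"
  # Link from the snapshot directory itself: cp -a on the latest symlink
  # would copy the link rather than the directory tree behind it.
  # %G is the ISO week-numbering year that matches %V.
  cp -al "$SNAPSHOT_DIR" "$BACKUP_DEST/weekly/$(date '+%G-W%V')" 2>/dev/null || true
  { ls -1dt "$BACKUP_DEST"/weekly/*/ 2>/dev/null || true; } | tail -n +"$((KEEP_WEEKLY+1))" | \
    xargs -r rm -rf
  log_info "Weekly backup saved"
fi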

log_info "Backup finished successfully"
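
The space saving behind --link-dest comes from hard links: an unchanged file in the new snapshot is another name for the same inode as in the previous snapshot. rsync performs the comparison and linking itself. The sketch below only reproduces the resulting layout by hand with cp and ln so the effect can be inspected without rsync. The function name snapshot_demo and the file names are hypothetical.

```bash
#!/usr/bin/env bash
# snapshot_demo WORKDIR: build two snapshots by hand to show the layout
# that rsync --link-dest produces for changed and unchanged files.
snapshot_demo() {
  local work="$1"
  mkdir -p "$work/src" "$work/snap1" "$work/snap2"
  echo "unchanged" > "$work/src/a.txt"
  echo "version 1" > "$work/src/b.txt"

  # First snapshot: a full copy of the source
  cp "$work/src/a.txt" "$work/src/b.txt" "$work/snap1/"

  # b.txt changes before the second snapshot is taken
  echo "version 2" > "$work/src/b.txt"

  # Second snapshot: hard-link the unchanged file, copy the changed one
  ln "$work/snap1/a.txt" "$work/snap2/a.txt"
  cp "$work/src/b.txt" "$work/snap2/b.txt"
}
```

After running it, the test [ snap1/a.txt -ef snap2/a.txt ] succeeds because both names point at one inode, while each snapshot keeps its own version of b.txt. Deleting snap1 later does not affect snap2, since the data is freed only when the last link disappears.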

10. systemd Integration

Use systemd timers instead of crontab for full logging integration, dependency management, and error recovery (see Chapter 13).

# systemd/ops-monitor.service
[Unit]
Description=Ops health check and resource alert
After=network-online.target

[Service]
Type=oneshot
User=opsuser
WorkingDirectory=/opt/ops-scripts
ExecStart=/opt/ops-scripts/monitor/health_check.sh
ExecStart=/opt/ops-scripts/monitor/resource_alert.sh
EnvironmentFile=/opt/ops-scripts/config.env
StandardOutput=journal
StandardError=journal
SyslogIdentifier=ops-monitor

# systemd/ops-monitor.timer
[Unit]
Description=Run ops monitoring every 5 minutes

[Timer]
OnBootSec=2min
OnUnitActiveSec=5min
# Persistent= is omitted: it only affects OnCalendar= timers, not monotonic ones
RandomizedDelaySec=30

[Install]
WantedBy=timers.target

# Deployment steps
sudo cp /opt/ops-scripts/systemd/*.service /etc/systemd/system/
sudo cp /opt/ops-scripts/systemd/*.timer  /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable --now ops-monitor.timer
# ops-backup.timer is not listed in this chapter; it must exist in systemd/ for this to succeed
sudo systemctl enable --now ops-backup.timer

# Verification
systemctl list-timers --all | grep ops
journalctl -u ops-monitor.service -f

11. bats Automated Testing

bats (Bash Automated Testing System) is a unit testing framework for shell scripts. Each @test block is a test case; the run command executes the script under test and you then assert output and exit code.

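#!/usr/bin/env bats
# tests/test_common.bats

setup() {
  export LOG_FILE="$(mktemp)"
  export LOCK_DIR="$(mktemp -d)"
  # Source the library after the variables are set so it picks them up.
  # Some bats versions append .bash to the argument of load, so the .sh
  # file is sourced directly via BATS_TEST_DIRNAME instead.
  source "$BATS_TEST_DIRNAME/../lib/common.sh"
}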

teardown() {
  rm -f "$LOG_FILE"
  rm -rf "$LOCK_DIR"
}

@test "log_info writes timestamped INFO message" {
  run log_info "hello world"
  [ "$status" -eq 0 ]
  grep -q "INFO" "$LOG_FILE"
  grep -q "hello world" "$LOG_FILE"
}

@test "die exits with code 1 and logs error" {
  run die "something went wrong"
  [ "$status" -eq 1 ]
  grep -q "ERROR" "$LOG_FILE"
  grep -q "something went wrong" "$LOG_FILE"
}

@test "require_cmd succeeds for existing command" {
  run require_cmd bash
  [ "$status" -eq 0 ]
}

@test "require_cmd fails for nonexistent command" {
  run require_cmd this_command_does_not_exist_xyz
  [ "$status" -ne 0 ]
}

@test "lock_file prevents double locking" {
  lock_file "testlock"
  run bash -c "source lib/common.sh; lock_file testlock"
  [ "$status" -ne 0 ]
  [[ "$output" == *"Lock held"* ]]
}

#!/usr/bin/env bats
# tests/test_webhook.bats: tests for the alert deduplication logic

setup() {
  export LOG_FILE="$(mktemp)"
  export LOCK_DIR="$(mktemp -d)"
  export ALERT_COOLDOWN=300
  export COOLDOWN_DIR="$(mktemp -d)"
  # Disable real sending by unsetting the webhook variables
  unset DINGTALK_WEBHOOK FEISHU_WEBHOOK ALERT_EMAIL
}

teardown() {
  rm -f "$LOG_FILE"
  rm -rf "$LOCK_DIR" "$COOLDOWN_DIR"
}

@test "first alert goes through" {
  run bash alert/webhook.sh "test alert message"
  [ "$status" -eq 0 ]
  # The cooldown file should have been created
  local hash
  hash=$(printf '%s' "test alert message" | md5sum | cut -d' ' -f1)
  [ -f "$COOLDOWN_DIR/$hash" ]
}

@test "duplicate alert within cooldown is suppressed" {
  # Send once first
  bash alert/webhook.sh "dup message"
  # Send again; this one should be suppressed
  run bash alert/webhook.sh "dup message"
  [ "$status" -eq 0 ]
  grep -q "suppressed" "$LOG_FILE"
}
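
The webhook tests above avoid network traffic by unsetting the webhook variables, which also means the sending path is never exercised. One common alternative, sketched here as an assumption rather than as part of the project, is to put a fake curl first on PATH so that http_post runs against a stub. The names make_curl_stub and CURL_STUB_LOG are hypothetical.

```bash
#!/usr/bin/env bash
# make_curl_stub DIR: create a fake curl in DIR that records its arguments
# and prints 200, the value http_post reads from -w "%{http_code}".
make_curl_stub() {
  local dir="$1"
  mkdir -p "$dir"
  cat > "$dir/curl" <<'EOF'
#!/usr/bin/env bash
# Append the full argument list to the log named by CURL_STUB_LOG, if set
printf '%s\n' "$*" >> "${CURL_STUB_LOG:-/dev/null}"
printf '200'
EOF
  chmod +x "$dir/curl"
}
```

A bats setup function could call make_curl_stub on a temporary directory, prepend that directory to PATH, and export CURL_STUB_LOG. A test can then set DINGTALK_WEBHOOK to any placeholder URL and assert on the recorded arguments.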

GitHub Actions CI Integration

# .github/workflows/test.yml
name: Shell Tests

on: [push, pull_request]

jobs:
  bats:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Install bats
        run: |
          sudo apt-get install -y bats
          bats --version

      - name: Run bats tests
        run: bats tests/

      - name: Run shellcheck
        run: |
          sudo apt-get install -y shellcheck
          shellcheck lib/*.sh monitor/*.sh alert/*.sh deploy/*.sh backup/*.sh

12. Book Summary and Next Steps

This book built five capability layers. The final project is a concrete demonstration of all five:

Capability Layer	Chapters	Embodied In
Foundational Ops	Ch1–Ch4	directory structure, text processing (awk/grep), log analysis
System Understanding	Ch5–Ch8	process checks (pgrep), network probes (nc/curl), disk monitoring (df/rsync)
Shell Scripting	Ch9–Ch12	function library, arrays, pipelines, set -euo pipefail, bats tests
Production Practice	Ch13–Ch16	systemd service/timer, metric collection, encrypted backup, mutex locks
Kernel & Contribution	Ch17–Ch19	understanding syscall paths, kernel-level debugging tools, ability to contribute upstream

Recommended Next Steps

Go Systems Programming — Shell excels at glue scripts; Go suits performance-sensitive tools. See: Systems Programming in Go
eBPF Observability — BCC/bpftrace are modern Linux performance analysis tools — a natural extension of Chapter 14
Ansible / Terraform — Elevate this chapter's script system into declarative infrastructure-as-code
Kernel Contribution Practice — Follow Chapter 19's guidance: start with checkpatch fixes in drivers/staging and submit your first kernel patch

Congratulations on completing the book! Linux Shell is never merely a collection of command-line tools. It is a language for conversing with the operating system — a way of thinking that makes complex systems controllable, observable, and automatable. With the skills acquired in this book, you now have the foundation of a production-capable systems engineer.
