Description

Self-healing infrastructure guardian. Monitors services, diagnoses failures, executes recovery playbooks, and learns from incidents.

Usage Instructions (SKILL.md)

Dead Man's Switch — Self-Healing Infrastructure Guardian

Name: Dead Man's Switch
Author: peres84

You are an autonomous infrastructure guardian. When invoked, you follow a strict diagnostic sequence, execute the appropriate recovery playbooks, log every action, and learn from each incident.

When You Are Triggered

You are triggered when:

The user asks you to "check my services", "run dead man's switch", or "check if everything is up"
A cron job you previously set up calls you with a specific check message
The user reports that a site or service is down
You are run manually via openclaw run deadmans-switch

Diagnostic Sequence — Always Follow This Order

Execute every step in sequence. Do not skip steps even if earlier checks succeed.

Step 1: Check Tailscale Funnel (ALWAYS FIRST)

tailscale funnel status

If output contains (tailnet only):
→ The Tailscale Funnel has dropped. This is a known recurring bug.
→ Read the full recovery procedure in playbooks/tailscale.md
→ Fix it before checking anything else, because a Tailscale outage makes ALL websites appear down

If output contains (Funnel on): → Tailscale is healthy. Continue to Step 2.

WHY TAILSCALE FIRST: If the Tailscale tunnel is down, nginx will return timeouts and 502s for all external requests — NOT because nginx is broken, but because the tunnel is broken. Diagnosing nginx first wastes time and misdiagnoses the real problem.
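
The Step 1 decision can be sketched as a small shell function. This is only an illustration: the function name and the labels it prints (dropped, healthy, unknown) are hypothetical, and the only strings taken from the text above are the two markers (tailnet only) and (Funnel on). The dropped case is tested first so that a mixed output is treated as a failure.

```shell
#!/bin/sh
# Classify the text printed by `tailscale funnel status`.
# Hypothetical helper: the marker strings come from Step 1, the labels are illustrative.
classify_funnel_status() {
  case "$1" in
    *"(tailnet only)"*) echo "dropped" ;;   # Funnel has reverted; fix before anything else
    *"(Funnel on)"*)    echo "healthy" ;;   # safe to continue to Step 2
    *)                  echo "unknown" ;;   # neither marker found; inspect manually
  esac
}

# Example wiring (needs the tailscale CLI on the host, so it is left commented out):
#   state=$(classify_funnel_status "$(tailscale funnel status 2>&1)")
#   [ "$state" = "healthy" ] || echo "Fix Tailscale first (see playbooks/tailscale.md)"
```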

Step 2: Check Configured Websites

For each website in config.websites (e.g., https://your-site.com, https://your-other-site.com):

curl -sI --max-time 10 <url>

Parse the HTTP status code from the response:

200 → Healthy. Log OK. Continue.
502/503/504 → Nginx or upstream issue. Read playbooks/nginx.md.
Timeout (no response) → If Tailscale is healthy, check nginx. Read playbooks/nginx.md.
404 → Wrong nginx config. Check ls /etc/nginx/sites-enabled/. Read playbooks/nginx.md.
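
As a rough sketch, the status table above could be turned into a lookup like the one below. The function names and printed labels are hypothetical. The sketch also assumes that curl's --write-out variable http_code is used to get the bare code, which prints 000 when no response arrives; codes that the table does not list (redirects, for example) are labelled unlisted rather than guessed at.

```shell
#!/bin/sh
# Map an HTTP status code (or curl's 000 for "no response") to the Step 2 action.
# Hypothetical helper; labels are illustrative, the mapping follows the table above.
classify_http_status() {
  case "$1" in
    200)         echo "healthy" ;;          # log OK and continue
    502|503|504) echo "nginx-upstream" ;;   # see playbooks/nginx.md
    404)         echo "nginx-config" ;;     # check /etc/nginx/sites-enabled/
    000|"")      echo "timeout" ;;          # no response within the time limit
    *)           echo "unlisted" ;;         # not covered by the table
  esac
}

# Fetch only the status code for a URL with a HEAD request and a 10 second limit.
fetch_status() {
  curl -s -o /dev/null -I --max-time 10 -w '%{http_code}' "$1"
}

# Example wiring (performs a real request, so it is left commented out):
#   classify_http_status "$(fetch_status https://your-site.com)"
```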

Step 3: Check Disk Space

df -h /

Parse the Use% column for the root filesystem.

≥ 85% used → Disk is filling up. Read playbooks/disk.md.
< 85% → Healthy. Continue.

Also check:

df -h /var /tmp 2>/dev/null
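
One possible way to parse the Use% column is sketched below. The function names and labels are hypothetical. The sketch swaps df -h for df -P, an assumption made so that each filesystem is printed on a single line and the fifth column is always the usage percentage; the 85 threshold is the one stated above.

```shell
#!/bin/sh
# Print the usage percentage of a mount point as a bare integer (e.g. 67).
# -P selects the POSIX output layout, which keeps each filesystem on one line.
disk_use_percent() {
  df -P "$1" | awk 'NR==2 { gsub(/%/, "", $5); print $5 }'
}

# Apply the Step 3 threshold: 85 or more means the disk playbook is needed.
disk_state() {
  if [ "$1" -ge 85 ]; then echo "filling"; else echo "healthy"; fi
}

# Example wiring:
#   disk_state "$(disk_use_percent /)"
```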

Step 4: Check Fix Log for Recurring Patterns

After any fix, read ~/.openclaw/dms-fix-log.jsonl and count how many times this service has failed in the last 24 hours.

Use the dms_status tool to get a summary, or read the file directly.

Cron Creation Decision:

First occurrence → Fix silently, log it, no cron
Second or later occurrence within 24h → Fix + create cron monitoring + notify user

Cron command format:

openclaw cron add \
  --name "DMS: <Service> Monitor" \
  --cron "*/5 * * * *" \
  --session isolated \
  --message "Dead Man's Switch: check <service>. If issue found, fix it using the appropriate playbook." \
  --announce

NEVER create crons preemptively — only when a recurring pattern is detected or the user explicitly asks.
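
The recurrence check in this step could be approximated as follows. This is a sketch under stated assumptions, not how dms_status works: the function names and labels are hypothetical, and the log is assumed to use the compact one-line JSON layout of the documented fix-log example (no spaces around colons). Because the timestamps are ISO 8601 UTC, they can be compared as plain strings against a cutoff.

```shell
#!/bin/sh
# Count fix-log entries for one service at or after a cutoff timestamp.
# $1 = log file, $2 = service name, $3 = cutoff (ISO 8601 UTC, e.g. 2026-03-27T00:15:44Z)
count_recent_incidents() {
  awk -v svc="\"service\":\"$2\"" -v cutoff="$3" '
    index($0, svc) {
      # Extract the value of the "timestamp" field (13 = length of the key prefix).
      if (match($0, /"timestamp":"[^"]*"/)) {
        ts = substr($0, RSTART + 13, RLENGTH - 14)
        if (ts >= cutoff) n++   # string comparison works for same-format UTC stamps
      }
    }
    END { print n + 0 }
  ' "$1"
}

# Apply the cron rule: a first occurrence is fixed silently, a repeat adds monitoring.
cron_decision() {
  if [ "$1" -ge 2 ]; then echo "fix+cron+notify"; else echo "fix-only"; fi
}

# Example wiring (the -d option is GNU date only, so it is left commented out):
#   cutoff=$(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%SZ)
#   cron_decision "$(count_recent_incidents ~/.openclaw/dms-fix-log.jsonl tailscale "$cutoff")"
```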

Step 5: Notify

After completing all checks and fixes:

Always: Output a text summary of what was checked, what was found, and what was fixed.
If ElevenLabs is configured: Generate a voice alert using the ElevenLabs MCP.
- Keep voice messages concise and informative, e.g.:
  - "Your Tailscale tunnel dropped. Recovery was successful."
  - "Nginx returned a 502 on your-site.com. I restarted the upstream process. The site is back online."
  - "All services are healthy."

Fix Log Format

Every incident must be logged. Use the dms_recover tool, which logs automatically, or write the entry directly:

{"timestamp":"2026-03-28T00:15:44Z","service":"tailscale","issue":"funnel reverted to tailnet-only","fix":"ran tailscale-funnel-start.sh","result":"success","duration_ms":3200}

Fields:

timestamp: ISO 8601 UTC
service: tailscale | nginx | disk | process
issue: Human-readable description of what was wrong
fix: What command or action was taken
result: success or failure
duration_ms: How long the fix took
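
For the "write directly" path, a minimal sketch of producing one entry with these fields is shown below. The function name is hypothetical, and the sketch assumes the text fields contain no double quotes or backslashes, since it does no JSON escaping.

```shell
#!/bin/sh
# Print one fix-log line in the documented field order.
# $1 timestamp, $2 service, $3 issue, $4 fix, $5 result, $6 duration_ms
# Assumption: arguments contain no double quotes or backslashes (no JSON escaping is done).
format_fix_log_entry() {
  printf '{"timestamp":"%s","service":"%s","issue":"%s","fix":"%s","result":"%s","duration_ms":%s}\n' \
    "$1" "$2" "$3" "$4" "$5" "$6"
}

# Example wiring: append an entry stamped with the current UTC time.
#   format_fix_log_entry "$(date -u +%Y-%m-%dT%H:%M:%SZ)" nginx \
#     "502 on your-site.com" "restarted upstream" success 1500 \
#     >> ~/.openclaw/dms-fix-log.jsonl
```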

Self-Improvement — Learning From New Errors

If you encounter an error NOT covered by any playbook:

Log the unknown error to the fix log with result: "failure"

Search for a fix using the Tavily MCP:

Query: "<error message> fix ubuntu 24 <service>"

Read the top result and attempt the recommended fix
If the fix works:
- Append what you learned to the relevant playbook file
- Log with result: "success" and note: "Learned new fix via Tavily"
Log: "Learned new fix for <service>: <description>"

Using the dms_recover Tool

Prefer using dms_recover to run recovery scripts — it handles logging automatically:

dms_recover(service="tailscale", reason="funnel reverted to tailnet-only")
dms_recover(service="nginx", reason="502 on your-site.com")
dms_recover(service="disk", reason="disk at 91%")
dms_recover(service="process", reason="app crashed", processName="myapp")

Summary Output Format

After completing a full check, output a summary like:

🦞 Dead Man's Switch — Health Report (2026-03-28 00:15 UTC)

✅ Tailscale Funnel: Healthy (Funnel on)
⚠️  Website your-site.com: Was returning 502 → Fixed (restarted upstream)
✅ Website your-other-site.com: Healthy (200)
✅ Disk space: 67% used

Actions taken: 1 fix
Fix log: ~/.openclaw/dms-fix-log.jsonl

Security Recommendations

What to check before installing or running this skill:

- Privileges: This skill expects to run many sudo commands (restart services, truncate logs, delete files). Only install on hosts where you are comfortable granting that level of control. Review and test each command in a safe environment first.
- Missing/implicit dependencies: The SKILL.md uses helper tools (dms_recover, dms_status, openclaw CLI) and expects scripts under /usr/local/bin and a sudoers NOPASSWD entry. Those are not bundled. Confirm these tools/scripts exist and inspect them before use.
- External services: The playbooks instruct the agent to query external MCPs (Tavily) and to optionally use ElevenLabs for voice alerts. Using those will send error messages and possibly other system output off-host. If you cannot or should not transmit such data, disable those steps or ensure API keys and the endpoints are trusted and auditable.
- File operations: The disk playbook deletes older logs and truncates large logs and will prune Docker artifacts if present. Make backups if logs or images are important. Note the skill explicitly warns not to remove application code, DB files, SSH keys, and the fix log; still, review any destructive find/delete commands before running.
- Cron/autonomy: The skill will create cron jobs when a recurrence threshold is met. If you want to avoid autonomous periodic fixes, deny cron creation or review cron rules it proposes.
- Audit the scripts: Before allowing this skill to act, inspect the actual scripts it expects to run (e.g., /usr/local/bin/tailscale-funnel-start.sh, nginx-check.sh, process-restart.sh). The registry only contains playbooks; the real behavior depends on those local scripts.
- Questions to ask the publisher or to verify locally: Where do dms_recover and dms_status come from? Where would ElevenLabs/Tavily credentials be stored and who has access? Are the suggested sudoers rules acceptable for your security posture? Do the /usr/local/bin scripts exist and are their contents safe?

Given these gaps (undeclared helper tools, external queries, and sudo usage), treat this skill as high-impact and review/lock down environment and scripts before enabling automatic operation.

Capability Assessment

ℹ Purpose & Capability

The name/description align with the actions in SKILL.md: it monitors services (nginx, tailscale), checks disk/processes, and runs recovery steps. Required binaries (tailscale, nginx, curl, systemctl) are reasonable for this purpose. Minor mismatch: the runtime instructions assume additional helper tools (dms_recover, dms_status, openclaw CLI, sudo) and scripts (/usr/local/bin/..., /usr/local/bin/openclaw-skills/*) that are not listed in the declared required binaries or manifest — this is an omission that can break the skill or hide extra dependencies.

⚠ Instruction Scope

The playbooks instruct the agent to run many privileged operations (sudo systemctl restart, journalctl, truncate logs, find -delete, docker prune) and to read many system paths (/etc/nginx, /var/log, ~/.openclaw). Those actions are coherent for recovery tasks but are high-impact. The instructions also direct the agent to query external MCPs (Tavily, ElevenLabs) with raw error text and to incorporate learned fixes into local playbooks, which means potentially sensitive error/log content may be sent to external services. The skill does not declare or justify where credentials for those external services are stored.

✓ Install Mechanism

No install spec — instruction-only. Lowest file-write footprint from the registry side. However, the instructions assume preinstalled scripts and sudoers entries (e.g., tailscale-funnel-start.sh and sudoers.d entry) which are not included; those must already exist on the host for full functionality.

⚠ Credentials

The skill declares no required environment variables, yet it references external services (ElevenLabs, Tavily) and platform tooling (dms_recover, dms_status, openclaw) that normally require credentials or configuration. There is also an implicit need for elevated privileges (sudo) to restart services and modify logs. The lack of declared credentials/config paths makes it unclear how sensitive data or API keys would be handled or stored.

ℹ Persistence & Privilege

The skill is not forced-always and may be invoked manually or via cron it creates. It explicitly instructs creating cron jobs (openclaw cron add) when recurring failures are detected — this is consistent with its purpose but increases runtime autonomy and blast radius if abused. It also expects NOPASSWD sudoers entries for at least one tailscale script; adding such sudoers rules is a privileged action a user should review carefully.

Version History

v0.1.0

Initial release of Dead Man's Switch, a self-healing infrastructure guardian:

- Monitors essential services (Tailscale, websites, disk, process health) in a strict diagnostic order.
- Executes automated recovery playbooks and logs every incident to aid future diagnosis and learning.
- Detects recurring failures, triggers appropriate alerting and optional cron-based monitoring.
- Provides clear summary notifications and (if configured) concise voice alerts after each run.
- Learns from unknown errors by attempting web-sourced solutions and updating playbooks.
- User-invocable, with support for manual and automated (cron) checks.

Metadata

Slug deadmans-switch

Version 0.1.0

License MIT-0

Total installs 0

Current installs 0

Number of versions 1

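FAQ

What is Dead Man's Switch?

Self-healing infrastructure guardian. Monitors services, diagnoses failures, executes recovery playbooks, and learns from incidents. It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 92 cumulative downloads to date.

How do I install Dead Man's Switch?

Run the command "/install deadmans-switch" in the OpenClaw or Claude Code chat box to install it in one step; no additional configuration is required.

Is Dead Man's Switch free?

Yes. Dead Man's Switch is completely free and released under the MIT-0 license, so it can be downloaded, installed, and used freely.

Which platforms does Dead Man's Switch support?

Dead Man's Switch runs cross-platform and can be used in any environment where OpenClaw / Claude Code is deployed (linux).

Who developed Dead Man's Switch?

It is developed and maintained by peres84 (@peres84); the current version is v0.1.0.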
