Description

Provides a structured 7-phase process to investigate, diagnose, restore, prevent, monitor, and document OpenClaw system failures including config loss, crash...

README (SKILL.md)

Incident Response

Name: Incident Response
Author: chunhualiao

Seven phases, in order. Never skip. Never assume — follow the evidence.

Outputs produced by this skill:

Root cause statement (5 Whys chain with evidence citations)
Restore confirmation (what was restored, verified working)
Prevention commit (git commit hash of guard/rule added)
Monitoring cron (job ID + schedule)
Learning entry (appended to ~/.openclaw/learnings/rules.md)

Phase 0: Triage (2 min)

Check current state FIRST before investigating history.

# Is it actually broken right now?
openclaw status
ssh "<remote-host>" "launchctl list | grep openclaw"
# Test with correct protocol (check source: HTTP vs HTTPS?)

If currently working → report "recovered, investigating cause." If still broken → proceed.
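
The branch above depends on reading the `launchctl list` row correctly. The following is a minimal sketch, assuming the usual three-column output (PID, Status, Label) where a PID of `-` means no process is running. `ServiceState`, `parse_launchctl_line`, and `triage` are hypothetical helper names, not part of OpenClaw. A running process is only one signal, so the sketch still asks for a functional check before reporting recovery.

```python
from typing import NamedTuple, Optional


class ServiceState(NamedTuple):
    pid: Optional[int]        # None when launchd shows "-" (no running process)
    last_exit: Optional[int]  # last exit status, None if shown as "-"
    label: str


def parse_launchctl_line(line: str) -> ServiceState:
    """Parse one 'PID  Status  Label' row of `launchctl list` output."""
    pid_s, status_s, label = line.split(None, 2)
    pid = None if pid_s == "-" else int(pid_s)
    last_exit = None if status_s == "-" else int(status_s)
    return ServiceState(pid, last_exit, label.strip())


def triage(line: str) -> str:
    """Map a parsed row to the two outcomes of Phase 0."""
    state = parse_launchctl_line(line)
    if state.pid is not None:
        return "process running: confirm it serves requests, then report recovered"
    return f"not running (last exit status {state.last_exit}): proceed to Phase 1"


# Example row for a stopped gateway (values are illustrative).
print(triage("-\t78\tai.openclaw.gateway"))
```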

Phase 1: Evidence Collection

Gather hard evidence from four sources:

1a. Config backups timeline

# See binding/setting counts over time
ssh "<remote-host>" "python3 << 'EOF'
import datetime, glob, json, os
# glob does not expand '~', so expand it explicitly
pattern = os.path.expanduser('~/.openclaw/config-backups/openclaw-*.json')
for f in sorted(glob.glob(pattern), key=os.path.getmtime):
    with open(f) as fh:
        d = json.load(fh)
    dt = datetime.datetime.fromtimestamp(os.path.getmtime(f)).strftime('%Y-%m-%d %H:%M')
    # Customize: bindings, agents, channels, etc.
    bindings = d.get('bindings', [])
    ids = [b.get('agentId') for b in bindings]
    print(f'{dt} [{len(bindings)}] {ids}')
EOF"

1b. Git audit trail

ssh "<remote-host>" "cd ~/.openclaw && git log --oneline -20"
ssh "<remote-host>" "cd ~/.openclaw && git diff <commit-a> <commit-b> -- openclaw.json | grep '^[+-]' | grep -v '^---\|^+++'"

1c. Session logs (who did what)

# Find sessions that touched the broken config key
ssh "<remote-host>" "rg -l 'keyword' ~/.openclaw/agents/*/sessions/*.jsonl | head -5"

# Extract tool calls from a session
ssh "<remote-host>" "python3 << 'EOF'
import json
for line in open('SESSION.jsonl'):
    obj = json.loads(line)
    if obj.get('type') != 'message': continue
    for block in obj.get('message',{}).get('content',[]):
        if block.get('type') == 'toolCall' and block.get('name') in ['Write','Edit','gateway','exec']:
            print(obj['timestamp'], block['name'], str(block.get('input',''))[:200])
EOF"

1d. Config backup diff (find the exact moment of change)

# Compare before/after a suspicious backup
python3 -c "
import json
a = json.load(open('backup-before.json'))
b = json.load(open('backup-after.json'))
# Compare specific field
print('Before:', a.get('bindings'))
print('After:', b.get('bindings'))
"

Stop and document: Who changed what, when, which session, which tool call.
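
When it is not yet clear which field changed, the 1d comparison can be widened to the whole config. The following is a minimal sketch, assuming each backup is a JSON object and that `bindings` is a list of objects carrying an `agentId` key, as the 1a snippet suggests. `diff_top_level` and `lost_agent_ids` are hypothetical helper names, and the sample dicts stand in for real backups.

```python
def diff_top_level(before: dict, after: dict) -> dict:
    """Report top-level keys that were added, removed, or changed."""
    keys = set(before) | set(after)
    return {
        "added": sorted(k for k in keys if k not in before),
        "removed": sorted(k for k in keys if k not in after),
        "changed": sorted(
            k for k in keys
            if k in before and k in after and before[k] != after[k]
        ),
    }


def agent_ids(cfg: dict) -> set:
    """Collect the agentId of every binding, skipping entries without one."""
    return {
        b["agentId"] for b in cfg.get("bindings", [])
        if b.get("agentId") is not None
    }


def lost_agent_ids(before: dict, after: dict) -> list:
    """agentId values bound in the earlier backup but missing in the later one."""
    return sorted(agent_ids(before) - agent_ids(after))


# Illustrative stand-ins for json.load(open('backup-before.json')) etc.
before = {"bindings": [{"agentId": "a"}, {"agentId": "b"}], "channels": {"x": 1}}
after = {"bindings": [{"agentId": "a"}], "agents": []}

print(diff_top_level(before, after))
print(lost_agent_ids(before, after))
```

The output names the fields worth tracing back through the git and session logs.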

Phase 2: 5 Whys Analysis

Write each "why" as a statement of fact backed by evidence from Phase 1.

Why 1: [Symptom] — e.g. "Bindings dropped from 17 to 1"
  Evidence: backup timestamp + count

Why 2: [Immediate cause] — e.g. "A full config replacement was written at 09:38 PST"
  Evidence: backup mtime + content diff

Why 3: [Mechanism] — e.g. "the agent wrote a new config from scratch, not from current config"
  Evidence: session log tool call + content

Why 4: [System gap] — e.g. "config-validate.sh --merge had no guard against binding count drops"
  Evidence: script inspection showing no such check

Why 5: [Root cause] — e.g. "No automated detection existed between when the config was written and the next user report"
  Evidence: no monitoring cron, no git at the time

Rule: Every "why" must cite a specific file, log entry, timestamp, or command output. No assumptions.
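
The rule can be checked mechanically before the root cause statement is reported. The following is a minimal sketch in which `Why` and `validate_chain` are hypothetical names. It only checks that five entries exist and that each carries a non-empty evidence citation. It cannot judge whether the evidence is sound.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Why:
    statement: str
    evidence: str  # file, log entry, timestamp, or command output


def validate_chain(chain: List[Why]) -> List[str]:
    """Return a list of problems; an empty list means the chain passes."""
    problems = []
    if len(chain) != 5:
        problems.append(f"expected 5 whys, got {len(chain)}")
    for i, why in enumerate(chain, start=1):
        if not why.evidence.strip():
            problems.append(f"Why {i} has no evidence citation")
    return problems


# Chain built from the example statements above; the last entry lacks evidence.
chain = [
    Why("Bindings dropped from 17 to 1", "backup timestamp + count"),
    Why("A full config replacement was written at 09:38 PST", "backup mtime + content diff"),
    Why("The agent wrote a new config from scratch", "session log tool call + content"),
    Why("config-validate.sh --merge had no guard", "script inspection"),
    Why("No automated detection existed", ""),
]
print(validate_chain(chain))
```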

Phase 3: Restore

Restore to last known-good state using backup timeline from Phase 1.

# Restore specific fields (always merge, never replace)
PATCH=$(python3 -c "
import json
good = json.load(open('/path/to/good-backup.json'))
patch = {'bindings': good['bindings']}  # customize field
print(json.dumps(patch))
")
echo "$PATCH" | ssh "<remote-host>" "~/.openclaw/scripts/config-validate.sh --merge"

# Restart gateway
ssh "<remote-host>" "launchctl stop ai.openclaw.gateway && sleep 2 && launchctl start ai.openclaw.gateway"
ssh "<remote-host>" "launchctl list | grep ai.openclaw.gateway"  # verify the Status column (last exit code) is 0

Verify restore: Check that the restored value matches the good backup. Re-run the user's original failing action.
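
To illustrate why the restore merges rather than replaces, here is a minimal sketch of a top-level merge followed by the verification step. The source does not show how `config-validate.sh --merge` is implemented, so `merge_patch` is a hypothetical stand-in for its behavior, and the sample configs are illustrative.

```python
import copy


def merge_patch(current: dict, patch: dict) -> dict:
    """Overwrite only the keys present in the patch; keep everything else."""
    merged = copy.deepcopy(current)
    merged.update(copy.deepcopy(patch))
    return merged


# Broken live config: one binding left, but other settings still valid.
current = {"bindings": [{"agentId": "a"}], "channels": {"x": 1}}
# Last known-good backup identified in Phase 1.
good = {"bindings": [{"agentId": "a"}, {"agentId": "b"}], "channels": {"x": 0}}

patch = {"bindings": good["bindings"]}  # restore only the damaged field
restored = merge_patch(current, patch)

# Verify restore: the patched field matches the backup, the rest is untouched.
assert restored["bindings"] == good["bindings"]
assert restored["channels"] == current["channels"]
print(restored)
```

Replacing the whole file with the backup would also have rolled `channels` back to its older value, which is the kind of silent loss the merge avoids.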

Phase 4: Prevention

Add guards proportional to the severity and recurrence risk. See references/prevention-patterns.md for full patterns. Quick reference:

For config fields that must not decrease: Add guard to config-validate.sh --merge (see references for template)

For agent behavior rules: Add to ~/.openclaw/agents/<id>/agent/SOUL.md as a Hard Rule (HR-NNN)

For recurring mistakes: Add to ~/.openclaw/learnings/rules.md with category and date

For schema validation gaps: Update config-validate.sh valid_keys list after verifying against DeepWiki
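
As an illustration of the first pattern, a guard for fields that must not decrease could look like the sketch below. It is not the template from references/prevention-patterns.md, which is not reproduced here. `check_no_decrease` and its `fields` parameter are hypothetical, and the sketch assumes the guarded fields are lists.

```python
def check_no_decrease(current: dict, proposed: dict, fields=("bindings",)) -> list:
    """Return one violation message per guarded field whose length shrank."""
    violations = []
    for field in fields:
        old_n = len(current.get(field, []))
        new_n = len(proposed.get(field, []))
        if new_n < old_n:
            violations.append(f"{field}: {old_n} -> {new_n}")
    return violations


# Reproduce the example incident: bindings drop from 17 to 1.
current = {"bindings": [{"agentId": f"agent-{i}"} for i in range(17)]}
proposed = {"bindings": [{"agentId": "agent-0"}]}

problems = check_no_decrease(current, proposed)
if problems:
    print("refusing merge:", "; ".join(problems))
```

A real guard would likely need an explicit override for intentional removals; that choice is left to the operator.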

Always commit prevention changes to git:

ssh "<remote-host>" "cd ~/.openclaw && git add -A && git commit -m 'prevention: <what was added> after <incident>'"

Phase 5: Monitor

Set a recurring cron job that runs until user confirms "good enough" (minimum 7 days, 30 days for recurring incidents).

Cron job structure:
- Schedule: every 24h (or every N hours for high-severity)
- Task: check specific metric → compare to baseline → if degraded: restore + 5-why → report
- Report channel: sessions_send to your preferred channel (Signal, Telegram, Discord)
- Auto-escalate: if same fix needed 3+ days in a row → upgrade prevention measure
- Termination: user explicitly says "stop monitoring" or N days without incident

See references/cron-template.md for the full cron job prompt template.
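
The auto-escalate and termination rules above can be sketched as a small decision function. This is an assumption-laden illustration, not the logic in references/cron-template.md: `trailing_run` and `next_action` are hypothetical names, and the history is assumed to be one boolean per daily run (True when the fix had to be reapplied), oldest first.

```python
from typing import List


def trailing_run(history: List[bool], value: bool) -> int:
    """Length of the most recent unbroken run of `value`."""
    n = 0
    for day in reversed(history):
        if day != value:
            break
        n += 1
    return n


def next_action(history: List[bool], escalate_after: int = 3,
                quiet_days_to_stop: int = 7) -> str:
    """Decide what the monitoring job should do after today's check."""
    if trailing_run(history, True) >= escalate_after:
        return "escalate"   # same fix needed 3+ days in a row
    if trailing_run(history, False) >= quiet_days_to_stop:
        return "stop"       # N days without incident
    return "continue"


print(next_action([False, True, True, True]))
print(next_action([True] + [False] * 7))
```

For recurring incidents, `quiet_days_to_stop` would be raised to 30 to match the minimum stated above. An explicit "stop monitoring" from the user ends the job regardless of the history.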

Phase 6: Document

Write to ~/.openclaw/learnings/rules.md if a Hard Rule should be added:

Category: HR (Hard Rule, recurring) or SR (Soft Rule, first offense)
Include: what triggered, what the rule is, date learned, why it matters

Update MEMORY.md with incident summary if it's systemic.
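
The fields listed for a learning entry can be assembled consistently with a small formatter. The source does not show the actual layout of rules.md, so the text layout below is an assumption, and `format_learning` is a hypothetical helper. Only the category codes (HR, SR) and the four required fields come from the description above.

```python
import datetime

VALID_CATEGORIES = {"HR", "SR"}  # Hard Rule (recurring), Soft Rule (first offense)


def format_learning(category: str, trigger: str, rule: str, why: str,
                    learned: datetime.date) -> str:
    """Render one entry; the layout is illustrative, not the official format."""
    if category not in VALID_CATEGORIES:
        raise ValueError(f"category must be one of {sorted(VALID_CATEGORIES)}")
    return (
        f"[{category}] {rule}\n"
        f"  Trigger: {trigger}\n"
        f"  Learned: {learned.isoformat()}\n"
        f"  Why it matters: {why}\n"
    )


entry = format_learning(
    category="SR",
    trigger="Bindings dropped from 17 to 1 after a full config replacement",
    rule="Always merge config changes, never replace the whole file",
    why="A from-scratch write silently discards existing bindings",
    learned=datetime.date.today(),
)
print(entry)
```

Appending the result to the learnings file is then a single write in append mode.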

Configuration

No persistent configuration required. Adapt the following to your environment:

Variable	Description	Example
Remote host	SSH target for remote investigations	`<remote-host>` → your Titan/server hostname
Config backup path	Where OpenClaw stores automatic config backups	`~/.openclaw/config-backups/`
Session key	Your messaging session key for cron reports	`agent:main-signal:signal:<your-number>`
Learnings path	Where rules are persisted	`~/.openclaw/learnings/rules.md`

See references/cron-template.md for full cron report configuration.

Quick Diagnosis Checklists

See references/checklists.md for:

Gateway crash checklist
Binding loss checklist
Config key disappeared checklist
Agent routing wrong checklist
Vector search not finding content checklist

Usage Guidance

This skill is coherent with its stated purpose (incident response) but carries high operational power. Before installing or running it:

1) Verify the target host OS and adjust commands (launchctl is macOS; on Linux you may need systemctl).
2) Only grant exec/SSH and filesystem access to trusted agents: the skill reads session logs and config backups, and it can restart services and commit config changes.
3) Test the workflow in a staging environment first (especially restore/merge and cron templates).
4) Confirm how reporting is configured: replace sessions_send(sessionKey='<your-session-key>') placeholders with a safe, internal reporting channel, and avoid sending sensitive outputs to external endpoints.
5) Ensure backups are taken before applying the restore/merge steps and that git commits and chmod changes are reviewed.
6) If you rely on DeepWiki or other helper scripts referenced in prevention patterns, install and validate those tools separately.

Capability Analysis

Type: OpenClaw Skill
Name: incident-response
Version: 1.0.0

The skill bundle requests high-privilege capabilities, including SSH access, cron job management, and the ability to read all historical session logs and configuration backups. While these functions are aligned with the stated purpose of 'Incident Response,' the skill provides instructions for the agent to exfiltrate system status and audit data to external messaging platforms (Signal, Telegram, Discord) as detailed in 'references/cron-template.md'. Additionally, the shell and Python snippets in 'SKILL.md' and 'references/prevention-patterns.md' lack input sanitization, which could be leveraged for command injection if the agent processes malicious user input during an investigation.

Capability Assessment

ℹ Purpose & Capability

The name/description (incident response for OpenClaw) matches the declared permissions and the runnable commands: SSH, git, python3, read config backups, restart gateway, and add prevention rules. That said, there is a notable mismatch: skill.yml declares runtime: linux but many runtime commands use macOS's launchctl (launchctl stop/start ai.openclaw.gateway). This OS mismatch could lead to confusing or harmful behavior if run on the wrong platform. The prevention patterns also reference a DeepWiki helper (~/.openclaw/skills/deepwiki/scripts/deepwiki.sh) which is assumed present but not declared as a dependency.

ℹ Instruction Scope

SKILL.md instructs the agent to run many sensitive operations (SSH into hosts, read session JSONL logs that may contain user messages or secrets, edit config files, git commit, change file permissions, and restart services). Those actions are appropriate for incident response, but the instructions also include placeholders and templates that could be misused if not filled carefully (e.g., sessions_send(sessionKey='<your-session-key>'), ssh "<remote-host>"). The skill requires reading and sometimes writing sensitive local files (config backups, session logs) — this is within scope but high-privilege. The file-read/write recommendations are explicit and not covert.

✓ Install Mechanism

There is no install spec and no code files that execute on install; this is an instruction-only skill. That minimizes install-time risk because nothing is downloaded or written automatically.

✓ Credentials

The skill does not request environment variables or external credentials. It does require exec/SSH access and filesystem read/write permissions which are appropriate for an on-host incident response tool. Ensure the agent invoking this skill has only the required host access (least privilege).

✓ Persistence & Privilege

always:false and background_eligible:false; the skill is user-invocable and will not be force-included in every agent run. It will create/commit prevention rules and schedule cron jobs as part of normal operations (documented), which is expected for a remediation workflow.

Version History

v1.0.0

v1.0.0: Initial release. 7-phase structured IR workflow built from real production incidents — binding loss, gateway crashes, config regressions.

Metadata

Slug incident-response

Version 1.0.0

License —

All-time Installs 7

Active Installs 7

Total Versions 1

Frequently Asked Questions

What is Incident Response?

Provides a structured 7-phase process to investigate, diagnose, restore, prevent, monitor, and document OpenClaw system failures including config loss, crash... It is an AI Agent Skill for Claude Code / OpenClaw, with 504 downloads so far.

How do I install Incident Response?

Run "/install incident-response" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is Incident Response free?

Yes, Incident Response is completely free (open-source). You can download, install and use it at no cost.

Which platforms does Incident Response support?

Incident Response is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created Incident Response?

It is built and maintained by Chunhua Liao (@chunhualiao); the current version is v1.0.0.
