Description

Post-mortem analysis for AI agent failures. Capture state, reconstruct timelines, identify root causes. When your agent breaks, know what happened, why, and...

README (SKILL.md)


# Incident Replay: Agent Failure Forensics

Name: Incident Replay
Author: theshadowrose

Post-mortem analysis for AI agent failures. Capture state, reconstruct timelines, identify root causes.

When your agent breaks, you need to know what happened, why, and how to prevent it next time. Incident Replay captures workspace state at points in time, detects when things go wrong, reconstructs the sequence of events, and classifies root causes with actionable remediation steps.

---

## The Problem

Your agent crashed overnight. Files are missing. The config looks wrong. The logs are a wall of text. What happened? When? Why?

Without forensics tooling, post-mortem analysis is manual detective work: diffing files by hand, grepping logs, guessing at causation. Incident Replay automates the mechanics so you can focus on understanding.

## What It Does

### 1. Capture (`incident_capture.py`)

- Take point-in-time snapshots of your workspace (files, sizes, hashes, content)
- Configurable include/exclude patterns (track what matters, ignore noise)
- Automatic snapshot pruning (keep the last N)
- Compare any two snapshots to see exactly what changed
- Trigger detection: automatically flag incidents based on:
  - Log patterns (tracebacks, errors, fatal messages)
  - File changes (unexpected deletions, config modifications)
  - Content patterns (secrets in output, constraint violations)
  - Empty output files
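At its core, the Capture step hashes files and compares two hash maps. The sketch below is not the code in `incident_capture.py`. It is a minimal stdlib illustration that assumes a snapshot is a plain dict mapping relative path to size and SHA-256, that include patterns match file names, and that exclude patterns match relative paths. `take_snapshot`, `diff_snapshots`, and the default patterns are hypothetical.

```python
import fnmatch
import hashlib
import tempfile
from pathlib import Path

# Hypothetical defaults; the real tool reads INCLUDE_PATTERNS / EXCLUDE_PATTERNS
# from incident_config.json.
DEFAULT_INCLUDE = ("*.py", "*.md", "*.txt", "*.json", "*.log")
DEFAULT_EXCLUDE = ("incident_data/*",)


def take_snapshot(root, include=DEFAULT_INCLUDE, exclude=DEFAULT_EXCLUDE):
    """Map each matching file (relative POSIX path) to its size and SHA-256."""
    root = Path(root)
    snapshot = {}
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(root).as_posix()
        # Include patterns match the file name; exclude patterns match the relative path.
        if not any(fnmatch.fnmatch(path.name, pat) for pat in include):
            continue
        if any(fnmatch.fnmatch(rel, pat) for pat in exclude):
            continue
        data = path.read_bytes()
        snapshot[rel] = {"size": len(data), "sha256": hashlib.sha256(data).hexdigest()}
    return snapshot


def diff_snapshots(before, after):
    """Classify paths as added, removed, or modified between two snapshots."""
    return {
        "added": sorted(after.keys() - before.keys()),
        "removed": sorted(before.keys() - after.keys()),
        "modified": sorted(
            p for p in before.keys() & after.keys()
            if before[p]["sha256"] != after[p]["sha256"]
        ),
    }


# Usage: edit a config file and add a note between two snapshots.
with tempfile.TemporaryDirectory() as ws:
    Path(ws, "config.json").write_text('{"retries": 3}')
    before = take_snapshot(ws)
    Path(ws, "config.json").write_text('{"retries": 0}')
    Path(ws, "notes.md").write_text("new file")
    after = take_snapshot(ws)
    changes = diff_snapshots(before, after)

print(changes)  # {'added': ['notes.md'], 'removed': [], 'modified': ['config.json']}
```

Comparing hashes rather than timestamps means a file that was rewritten with identical content does not show up as modified.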

### 2. Replay (`incident_replay.py`)

- Build chronological timelines from snapshots, file changes, and triggers
- Extract decision chains from agent logs and memory files
- Heuristic root cause classification:
  - **Config error:** misconfiguration caused the failure
  - **Data corruption:** input data was malformed or missing
  - **Drift:** gradual degradation of workspace state
  - **External failure:** an API, network, or filesystem dependency failed
  - **Logic error:** a bug in agent logic or prompt
  - **Resource exhaustion:** ran out of memory, disk, tokens, or time
- Remediation suggestions tailored to each root cause category
- Incident database with persistent storage and pattern tracking
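Root cause classification is described as heuristic. One plausible way to implement such a heuristic is keyword scoring over error text plus the list of changed files. The category keys, regexes, and drift threshold below are illustrative assumptions, not the rules shipped in `ROOT_CAUSE_CATEGORIES`.

```python
import re

# Illustrative keyword heuristics, one list per root cause category.
# These are assumptions, not the rules in ROOT_CAUSE_CATEGORIES.
CATEGORY_PATTERNS = {
    "config_error": [r"config(uration)? error", r"invalid (setting|option)", r"missing (key|setting)"],
    "data_corruption": [r"json ?decode", r"malformed", r"unexpected end of (file|data)"],
    "external_failure": [r"timeout", r"connection (refused|reset)", r"\bhttp 5\d\d\b"],
    "resource_exhaustion": [r"memoryerror", r"no space left", r"token limit", r"rate limit"],
    "logic_error": [r"assertionerror", r"typeerror", r"keyerror"],
}
CONFIG_SUFFIXES = (".json", ".yaml", ".yml", ".toml", ".ini")


def classify(evidence_lines, changed_files=(), drift_threshold=3):
    """Return the best-scoring category key for a set of error lines and changed files."""
    scores = {category: 0 for category in CATEGORY_PATTERNS}
    for line in evidence_lines:
        lowered = line.lower()
        for category, patterns in CATEGORY_PATTERNS.items():
            scores[category] += sum(1 for p in patterns if re.search(p, lowered))
    # A modified config file is treated as weak evidence of a config error.
    scores["config_error"] += sum(1 for f in changed_files if f.endswith(CONFIG_SUFFIXES))
    best = max(scores, key=scores.get)  # ties go to the first category listed
    if scores[best] > 0:
        return best
    # No error evidence at all: many silent changes look more like drift than a crash.
    return "drift" if len(changed_files) >= drift_threshold else "unclassified"


print(classify(["OSError: [Errno 28] No space left on device"]))  # resource_exhaustion
```

Drift is handled as a fallback because it rarely leaves an error message; it shows up as many changed files with no failure signal. The threshold is a tuning knob, not a recommendation.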

### 3. Report (`incident_report.py`)

- Full incident reports with timeline, changes, triggers, and remediation
- Summary reports across all incidents with severity and root cause breakdowns
- Decision chain visualisation (what the agent decided and why)
- Export as Markdown or JSON

---
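The Report step combines a chronological timeline with remediation. Here is a rough sketch of Markdown and JSON export, assuming a hypothetical incident dict whose `events` carry ISO 8601 timestamps. The tool's actual incident schema is not documented here, so `render_report` and the field names are illustrative.

```python
import json
from datetime import datetime


def render_report(incident):
    """Render a (hypothetical) incident dict as Markdown, events in time order."""
    events = sorted(incident["events"], key=lambda e: datetime.fromisoformat(e["time"]))
    lines = [
        f"# {incident['id']}: {incident['title']}",
        "",
        f"**Root cause:** {incident['root_cause']}",
        "",
        "## Timeline",
        "",
        "| Time | Source | Event |",
        "|------|--------|-------|",
    ]
    lines += [f"| {e['time']} | {e['source']} | {e['detail']} |" for e in events]
    lines += ["", "## Remediation", ""]
    lines += [f"- {step}" for step in incident["remediation"]]
    return "\n".join(lines) + "\n"


# Illustrative incident; events arrive out of order and are sorted for the timeline.
incident = {
    "id": "INC-0001",
    "title": "Agent crashed during deployment",
    "root_cause": "config_error",
    "events": [
        {"time": "2024-05-01T02:14:00", "source": "trigger", "detail": "Traceback in agent.log"},
        {"time": "2024-05-01T02:10:00", "source": "file_change", "detail": "config.json modified"},
    ],
    "remediation": ["Validate the config before each run"],
}
report_md = render_report(incident)
report_json = json.dumps(incident, indent=2)
print(report_md)
```

Sorting by parsed timestamps instead of raw strings keeps the timeline correct even when sources format times slightly differently.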

## Quick Start

```bash
# 1. Configure
cp config_example.json incident_config.json
# Edit workspace root, triggers, log patterns

# 2. Take a baseline snapshot
python3 incident_capture.py --config incident_config.json --snapshot --label baseline

# 3. ... agent does work, something breaks ...

# 4. Take a post-incident snapshot
python3 incident_capture.py --config incident_config.json --snapshot --label post-incident

# 5. See what changed (SNAP1/SNAP2 stand for the snapshot files from steps 2 and 4)
python3 incident_capture.py --config incident_config.json \
  --diff incident_data/snapshots/SNAP1.json incident_data/snapshots/SNAP2.json

# 6. Check triggers
python3 incident_capture.py --config incident_config.json \
  --triggers incident_data/snapshots/SNAP1.json incident_data/snapshots/SNAP2.json

# 7. Full analysis: creates an incident with timeline, root cause, remediation
python3 incident_replay.py --config incident_config.json \
  --analyze incident_data/snapshots/SNAP1.json incident_data/snapshots/SNAP2.json \
  --title "Agent crashed during deployment"

# 8. Generate incident report
python3 incident_report.py --config incident_config.json --incident INC-0001

# 9. View all incidents and patterns
python3 incident_replay.py --config incident_config.json --incidents
python3 incident_replay.py --config incident_config.json --patterns
python3 incident_report.py --config incident_config.json --summary
```
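The `SNAP1.json` / `SNAP2.json` paths in the Quick Start are placeholders. If you want the two newest snapshots without copying names by hand, a small helper like the one below could find them. It assumes snapshots are `.json` files under `incident_data/snapshots` and that modification time reflects capture order. The tool's file naming scheme is not documented here, and `latest_snapshots` is hypothetical.

```python
import os
import tempfile
from pathlib import Path


def latest_snapshots(snapshot_dir="incident_data/snapshots", count=2):
    """Return the `count` most recently modified snapshot files, oldest first."""
    files = sorted(Path(snapshot_dir).glob("*.json"), key=lambda p: p.stat().st_mtime)
    if len(files) < count:
        raise FileNotFoundError(f"need {count} snapshots in {snapshot_dir}, found {len(files)}")
    return files[-count:]


# Demo in a throwaway directory with explicit modification times.
with tempfile.TemporaryDirectory() as d:
    for offset, name in enumerate(["b.json", "c.json", "a.json"]):
        snap = Path(d, name)
        snap.write_text("{}")
        os.utime(snap, (1_000 + offset, 1_000 + offset))
    demo = [p.name for p in latest_snapshots(d)]

print(demo)  # ['c.json', 'a.json']
```

In a real workspace, `before, after = latest_snapshots()` yields the two paths to pass to `--diff`, `--triggers`, or `--analyze`.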

## Programmatic Usage

```python
from incident_capture import Capturer, Snapshot, _load_config
from incident_replay import Analyzer

cfg = _load_config("incident_config.json")
cap = Capturer(cfg)
analyzer = Analyzer(cfg)

# Take snapshots
before = cap.take_snapshot(label="before")
# ... agent runs ...
after = cap.take_snapshot(label="after")

# Analyse
changes = cap.diff_snapshots(before, after)
triggers = cap.check_triggers(before, after)
decisions = analyzer.extract_decisions(after)
timeline = analyzer.build_timeline(
    [before, after],
    triggers=[t.to_dict() for t in triggers],
    changes=changes,
)

# Create incident
incident = analyzer.create_incident(
    title="Agent failed during task X",
    timeline=timeline,
    triggers=[t.to_dict() for t in triggers],
    file_changes=changes,
    decisions=decisions,
)
print(f"Created {incident.id}: {incident.root_cause}")
```

---

## Use Cases

- **Overnight failure analysis:** Agent ran unattended and broke: what happened?
- **Config change impact:** Track exactly what changed after a config update
- **Drift detection:** Compare weekly snapshots to catch gradual degradation
- **Secret leak detection:** Catch credentials or sensitive data in agent outputs
- **Regression forensics:** Agent used to work and now doesn't: find the divergence point
- **Team incident management:** Track incidents over time, find recurring patterns
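Several of these use cases (overnight failures, secret leaks) depend on the content-pattern triggers listed under Capture. Below is a hedged sketch of that kind of scan. The regexes are illustrative stand-ins for whatever `TRIGGERS` contains in your config, `scan_text` is hypothetical, and the scanner deliberately records only the trigger type and line number so that matched secrets are not copied into incident data.

```python
import re

# Illustrative trigger patterns; real ones would come from TRIGGERS in the config.
LOG_PATTERNS = {
    "traceback": re.compile(r"^Traceback \(most recent call last\):"),
    "error": re.compile(r"\b(ERROR|FATAL|CRITICAL)\b"),
}
SECRET_PATTERNS = {
    "aws_access_key": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    "password_assignment": re.compile(r"(?i)\bpassword\s*[:=]\s*\S+"),
}
ALL_PATTERNS = {**LOG_PATTERNS, **SECRET_PATTERNS}


def scan_text(name, text):
    """Return one hit per (line, trigger); the matched text itself is never stored."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for trigger, pattern in ALL_PATTERNS.items():
            if pattern.search(line):
                hits.append({"file": name, "line": lineno, "trigger": trigger})
    # Whitespace-only output is its own trigger (see "Empty output files" above).
    if not text.strip():
        hits.append({"file": name, "line": 0, "trigger": "empty_output"})
    return hits


sample = "starting\nTraceback (most recent call last):\nERROR: db unreachable\n"
for hit in scan_text("agent.log", sample):
    print(hit)
```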

## What's Included

| File | Purpose |
|------|---------|
| `incident_capture.py` | State snapshot and change detection |
| `incident_replay.py` | Timeline reconstruction, analysis, incident management |
| `incident_report.py` | Report generation (Markdown, JSON) |
| `config_example.json` | Full configuration template |
| `LIMITATIONS.md` | What this tool doesn't do |
| `LICENSE` | MIT License |

## Requirements

- Python 3.8+
- No external dependencies (stdlib only)
- Works on any OS
- Agent-agnostic (works with any file-based AI agent workspace)

## Configuration

See `config_example.json` for the complete reference. Key areas:

- **`WORKSPACE_ROOT`**: directory to monitor
- **`INCLUDE_PATTERNS` / `EXCLUDE_PATTERNS`**: which files to capture
- **`DATA_DIR`**: where snapshots, incidents, and reports are stored (defaults to `incident_data`)
- **`TRIGGERS`**: conditions that flag incidents (log patterns, file changes, content scans)
- **`ROOT_CAUSE_CATEGORIES`**: classification categories with descriptions and remediation
- **`DECISION_MARKERS`**: regex patterns to extract agent decisions from logs
- **`LOG_FILES`**: which files to scan for decision chains
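`DECISION_MARKERS` are described as regex patterns that pull agent decisions out of `LOG_FILES`. The marker patterns and the named `decision` group below are assumptions for illustration, and `extract_decisions` here is a standalone sketch, not the `Analyzer.extract_decisions` method. Check `config_example.json` for the real marker format.

```python
import re

# Hypothetical DECISION_MARKERS: each regex captures the decision text in a
# named group "decision". The real config format may differ.
DECISION_MARKERS = [
    r"(?i)^DECISION:\s*(?P<decision>.+)",
    r"(?i)\bdecid(?:ed|ing) to\s+(?P<decision>.+)",
    r"(?i)\bchose\s+(?P<decision>.+)",
]


def extract_decisions(lines, markers=DECISION_MARKERS):
    """Return an ordered decision chain: one entry per line that matches a marker."""
    compiled = [re.compile(marker) for marker in markers]
    chain = []
    for lineno, line in enumerate(lines, start=1):
        for pattern in compiled:
            match = pattern.search(line)
            if match:
                chain.append({"line": lineno, "decision": match.group("decision").strip()})
                break  # first matching marker wins
    return chain


log_lines = [
    "Loaded 3 tasks",
    "Decided to skip task 2 because input was empty",
    "DECISION: retry upload with backoff",
]
for step in extract_decisions(log_lines):
    print(step)
```

Against a real log you would feed `Path(log_file).read_text().splitlines()` for each entry in `LOG_FILES`.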

---

## License

MIT. See the `LICENSE` file.

---

## ⚠️ Security Note: Config File

Configuration is loaded from a JSON file, which is parsed as data and never executed, so it is safe to share.

- The config path is checked for existence and size (1MB cap) before loading
- The path must end in `.json`; anything else raises `ValueError`
- Keep your config under version control, since it defines which triggers are watched and what is protected
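If you load the same config from your own scripts, you can mirror the checks above in a few lines. This is an illustrative loader, not the tool's `_load_config`. It assumes the 1MB cap means 1 MiB, and raising `FileNotFoundError` for a missing file is also an assumption.

```python
import json
import tempfile
from pathlib import Path

# Assumed to mean 1 MiB; the security note only says "1MB cap".
MAX_CONFIG_BYTES = 1024 * 1024


def load_config(path):
    """Load a JSON config after extension, existence, and size checks (illustrative)."""
    p = Path(path)
    if p.suffix.lower() != ".json":
        raise ValueError(f"config must be a .json file: {p}")
    if not p.is_file():
        # Assumption: the real loader's exception type for a missing file is not documented.
        raise FileNotFoundError(f"config not found: {p}")
    if p.stat().st_size > MAX_CONFIG_BYTES:
        raise ValueError(f"config exceeds {MAX_CONFIG_BYTES} bytes: {p}")
    with p.open(encoding="utf-8") as fh:
        return json.load(fh)  # json.load parses data only; nothing is executed


with tempfile.TemporaryDirectory() as d:
    good = Path(d, "incident_config.json")
    good.write_text('{"WORKSPACE_ROOT": "./workspace"}', encoding="utf-8")
    cfg = load_config(good)
    try:
        load_config(Path(d, "incident_config.py"))
    except ValueError as exc:
        rejected = str(exc)

print(cfg, "|", rejected)
```

The extension check runs first, so a `.py` path is rejected before the file is ever opened.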

## ⚠️ Disclaimer

This software is provided "AS IS", without warranty of any kind, express or implied.

**USE AT YOUR OWN RISK.**

- The author(s) are NOT liable for any damages, losses, or consequences arising from the use or misuse of this software, including but not limited to financial loss, data loss, security breaches, business interruption, or any indirect/consequential damages.
- This software does NOT constitute financial, legal, trading, or professional advice.
- Users are solely responsible for evaluating whether this software is suitable for their use case, environment, and risk tolerance.
- No guarantee is made regarding accuracy, reliability, completeness, or fitness for any particular purpose.
- The author(s) are not responsible for how third parties use, modify, or distribute this software after purchase.

By downloading, installing, or using this software, you acknowledge that you have read this disclaimer and agree to use the software entirely at your own risk.

**DATA DISCLAIMER:** This software processes and stores data locally on your system. The author(s) are not responsible for data loss, corruption, or unauthorized access resulting from software bugs, system failures, or user error. Always maintain independent backups of important data. This software does not transmit data externally unless explicitly configured by the user.

---

## Support & Links

| | |
|---|---|
| 🐛 **Bug Reports** | [email protected] |
| ☕ **Ko-fi** | [ko-fi.com/theshadowrose](https://ko-fi.com/theshadowrose) |
| 🛒 **Gumroad** | [shadowyrose.gumroad.com](https://shadowyrose.gumroad.com) |
| 🐦 **Twitter** | [@TheShadowyRose](https://twitter.com/TheShadowyRose) |
| 🐙 **GitHub** | [github.com/TheShadowRose](https://github.com/TheShadowRose) |
| 🧠 **PromptBase** | [promptbase.com/profile/shadowrose](https://promptbase.com/profile/shadowrose) |

*Built with [OpenClaw](https://github.com/openclaw/openclaw). Thank you for making this possible.*

---

🛠️ **Need something custom?** Custom OpenClaw agents & skills starting at $500. If you can describe it, I can build it. → [Hire me on Fiverr](https://www.fiverr.com/s/jjmlZ0v)

Usage Guidance

This skill appears to do what it says: local forensic snapshots, diffs, trigger detection, analysis, and reporting using only the Python standard library. Before installing or running it: (1) set WORKSPACE_ROOT to the smallest useful directory (avoid running from '/', your home dir, or other broad roots), (2) tighten INCLUDE_PATTERNS/EXCLUDE_PATTERNS so you don't accidentally capture secrets or unrelated files, (3) confirm the DATA_DIR location and secure its permissions (incident data contains captured file contents), (4) review the code if you need assurance there are no outbound network calls (the provided files use only stdlib file/regex/json operations), and (5) run first in a sandbox or test workspace to validate behavior. If you need the agent to run this autonomously, remember autonomous invocation plus the ability to read the workspace increases the potential blast radius—only permit that if you trust the agent's policies and inputs.
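You can enforce points (1) and (3) of this guidance before a run. Below is a minimal sketch that assumes a POSIX system for the permission change. `check_paths` is a hypothetical helper. It rejects only the filesystem root and the home directory itself, so it sets a floor and does not replace choosing a narrow root.

```python
import os
import stat
import tempfile
from pathlib import Path


def check_paths(workspace_root, data_dir):
    """Reject obviously broad roots and make DATA_DIR owner-only (hypothetical helper)."""
    root = Path(workspace_root).resolve()
    # Refuse the filesystem root and the home directory itself, per the Usage Guidance.
    if root == Path(root.anchor) or root == Path.home().resolve():
        raise ValueError(f"WORKSPACE_ROOT is too broad: {root}")
    data = Path(data_dir).resolve()
    data.mkdir(parents=True, exist_ok=True)
    if os.name == "posix":
        data.chmod(0o700)  # snapshots hold captured file contents
    return root, data


with tempfile.TemporaryDirectory() as d:
    Path(d, "workspace").mkdir()
    root, data = check_paths(Path(d, "workspace"), Path(d, "incident_data"))
    data_created = data.is_dir()
    data_mode = stat.S_IMODE(data.stat().st_mode)
```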

Capability Analysis

Type: OpenClaw Skill
Name: incident-replay
Version: 1.0.6

The Incident Replay bundle is a legitimate forensic utility designed to capture workspace snapshots, reconstruct event timelines, and analyze AI agent failures. The implementation (incident_capture.py, incident_replay.py) relies exclusively on the Python standard library, contains no network-reaching code or shell execution commands, and includes proactive security features such as secret detection patterns and configuration file size validation. The behavior is entirely consistent with the stated purpose of post-mortem analysis.

Capability Assessment

✓ Purpose & Capability

The name and description (post-mortem forensics) align with the code and SKILL.md. The package uses only filesystem access, hashing, regex, and JSON storage to capture snapshots, build timelines, classify root causes, and generate reports, all consistent with its forensic intent.

ℹ Instruction Scope

SKILL.md instructs the agent to read workspace files, take snapshots, diff them, scan logs for patterns (including API key/password patterns), and write incident data and reports locally. This is expected for a forensic tool, but it means the skill will capture the contents of included files (by default *.py, *.md, *.txt, *.json, logs). Review and tighten include/exclude patterns and WORKSPACE_ROOT before use to avoid capturing unrelated sensitive files.

✓ Install Mechanism

No install spec; it's an instruction-and-code skill relying on Python stdlib. Nothing is downloaded or executed from remote URLs, and no third-party packages are pulled in.

✓ Credentials

The skill requests no environment variables or external credentials. The default config looks broad (captures many text file types), which is reasonable for forensic analysis but should be tuned to avoid unnecessary exposure of secrets.

ℹ Persistence & Privilege

The skill persists snapshots, incidents, and reports under a configurable DATA_DIR (defaults to incident_data). It is not 'always' enabled and does not modify other skills. Because it can read and store file contents locally, run it with a safe WORKSPACE_ROOT and tuned include/exclude patterns; ensure appropriate filesystem permissions and backups for the incident_data directory.

Version History

v1.0.6

- No file or documentation changes in this version.
- The version number in the documentation remains 1.0.5; 1.0.6 is not reflected.
- No new features, fixes, or updates introduced.

v1.0.5

**Configuration and usability update: now uses JSON config files instead of Python for safer and simpler setup.**

- Switched the configuration format from Python (`.py`) to JSON (`.json`) for increased security and shareability.
- Added `config_example.json` as the new configuration template; removed the legacy Python config from the quick start and examples.
- Updated README, SKILL.md, and all usage instructions to reference `.json` config files throughout.
- Enhanced config file safety: explicit size cap, extension check, and no code execution.
- Revised documentation and examples for clarity and JSON-based setup.

v1.0.3

- Updated the skill name to remove encoding issues (“�” character).
- Bumped the version number from 1.0.2 to 1.0.3.
- No functional or code changes; documentation update only.

v1.0.2

- Added detailed documentation and usage instructions to SKILL.md.
- Expanded feature descriptions for incident capture, replay, and reporting.
- Provided example commands for common workflows and programmatic usage.
- Highlighted key use cases and configuration advice.
- Included explicit security and disclaimer sections regarding config file execution and usage risks.

Metadata

Slug incident-replay

Version 1.0.6

License MIT-0

All-time Installs 1

Active Installs 1

Total Versions 4

Frequently Asked Questions

What is Incident Replay?

Post-mortem analysis for AI agent failures. Capture state, reconstruct timelines, identify root causes. When your agent breaks, know what happened, why, and... It is an AI Agent Skill for Claude Code / OpenClaw, with 345 downloads so far.

How do I install Incident Replay?

Run "/install incident-replay" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is Incident Replay free?

Yes, Incident Replay is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does Incident Replay support?

Incident Replay is cross-platform and runs anywhere OpenClaw or Claude Code is available.

Who created Incident Replay?

It is built and maintained by Shadow Rose (@theshadowrose); the current version is v1.0.6.
