Resilience Monitor
/install resilience-monitor
Resilience Skill
LLM API error tracking, classification, retry, and task recovery for OpenClaw.
Overview
This skill provides visibility into API call health and automated retry management. Use it to:
- Monitor API error rates and patterns
- View per-model performance statistics
- Configure retry strategies
- Generate error reports
- Track task recovery status
Tools
resilience_dashboard
Open the live web dashboard in your browser for real-time error stats and retry strategy management.
Parameters:
action: "open" (default) | "status" | "stop"
Features:
- Live error overview (today / hour / active retries)
- Model breakdown table
- Recent errors feed
- Retry strategy cards — set default, adjust max retries
- Auto-refresh: 5s, 60s, 5min, 1h, or off
URL: http://127.0.0.1:18765/ (default port, configurable via dashboardPort)
Voice / natural language examples:
- "Open the error statistics page" → resilience_dashboard({ action: "open" })
- "Open the monitoring panel" → resilience_dashboard({ action: "open" })
- "Open the resilience panel" → resilience_dashboard({ action: "open" })
The dashboard starts automatically when OpenClaw Gateway starts (unless dashboardEnabled: false).
Configuration lives in ~/.openclaw/openclaw.json under plugins.entries.resilience.config (not only api.pluginConfig at hook time). Example:
"resilience": {
  "enabled": true,
  "config": {
    "dashboardPort": 18765,
    "dashboardEnabled": true,
    "instanceLabel": "my-workspace"
  }
}
At gateway_start, config is read from ctx.config + ctx.workspaceDir.
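As a rough illustration of how the settings above might be resolved, the sketch below pulls the resilience block out of an already-parsed openclaw.json object and fills in defaults. The default port (18765) and the dashboard being on unless dashboardEnabled is false come from this page; the ResilienceConfig type and the resolveResilienceConfig helper are hypothetical names used for illustration, not the plugin's actual code.

```typescript
// Hypothetical shape of the plugin config; field names follow the example above.
interface ResilienceConfig {
  dashboardPort: number;
  dashboardEnabled: boolean;
  instanceLabel: string | undefined;
}

// Hypothetical helper: reads plugins.entries.resilience.config from a parsed
// openclaw.json object and applies the defaults described on this page.
function resolveResilienceConfig(openclawJson: unknown): ResilienceConfig {
  const root = (openclawJson ?? {}) as Record<string, any>;
  const raw = root.plugins?.entries?.resilience?.config ?? {};
  return {
    dashboardPort: typeof raw.dashboardPort === "number" ? raw.dashboardPort : 18765,
    // The dashboard stays on unless it is explicitly disabled.
    dashboardEnabled: raw.dashboardEnabled !== false,
    instanceLabel: typeof raw.instanceLabel === "string" ? raw.instanceLabel : undefined,
  };
}

const resolvedConfig = resolveResilienceConfig({
  plugins: { entries: { resilience: { enabled: true, config: { dashboardPort: 19000 } } } },
});
console.log(resolvedConfig.dashboardPort, resolvedConfig.dashboardEnabled);
```

Reading the file from disk and merging it with ctx.config is left out on purpose, since this page does not describe how that merge works.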
Multi-instance: Use the instance dropdown to view all instances (aggregated) or a single Gateway. Each instance stores data under ~/.openclaw/plugins/resilience/instances/<id>/. Strategy edits apply only to the local Gateway instance.
resilience_stats
View API error statistics by time period or model.
Parameters:
query (optional): Natural language query
- "today" or empty: today's full summary
- "hour": current hour stats
- "week": current week stats
- Any model name (e.g., "mimo-v2.5"): model-specific stats
Examples:
- "Show today's error statistics" → resilience_stats({ query: "today" })
- "Show the error rate for mimo-v2.5" → resilience_stats({ query: "mimo-v2.5" })
- "Show this week's error rate" → resilience_stats({ query: "week" })
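The query rules for resilience_stats can be sketched as a small parser. This is a minimal sketch under stated assumptions: the StatsScope type and parseStatsQuery function are hypothetical, keywords are matched exactly after trimming, and anything that is not a keyword is treated as a model name, which is one plausible reading of the parameter description.

```typescript
// Hypothetical result type covering the documented query forms.
type StatsScope =
  | { kind: "today" }
  | { kind: "hour" }
  | { kind: "week" }
  | { kind: "model"; model: string };

// Hypothetical parser: empty or "today" means today's summary; "hour" and
// "week" select those periods; any other value is taken as a model name.
function parseStatsQuery(query?: string): StatsScope {
  const q = (query ?? "").trim();
  if (q === "" || q === "today") return { kind: "today" };
  if (q === "hour") return { kind: "hour" };
  if (q === "week") return { kind: "week" };
  return { kind: "model", model: q };
}

console.log(parseStatsQuery("mimo-v2.5"));
```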
resilience_strategies
View, add, update, or reset retry strategies.
Parameters:
- action: "list" (default) | "add" | "update" | "reset"
- strategyName: Strategy name (required for add/update)
- updates: Fields to update (for add/update)
Examples:
- "Show all current strategy configurations" → resilience_strategies({ action: "list" })
- "Change the timeout retry strategy to exponential backoff" → resilience_strategies({ action: "update", strategyName: "default-exponential", updates: { type: "exponential" } })
- "Add a custom retry strategy" → resilience_strategies({ action: "add", strategyName: "my-strategy", updates: { type: "custom", maxRetries: 3, intervals: [60000, 300000, 600000] } })
- "Reset strategies to defaults" → resilience_strategies({ action: "reset" })
resilience_report
Generate detailed error reports.
Parameters:
- reportType: "daily" (default) | "model" | "recovery" | "full"
- target: Model name or date (YYYY-MM-DD)
Examples:
- "Generate today's daily error report" → resilience_report({ reportType: "daily" })
- "Show the detailed report for mimo-v2.5" → resilience_report({ reportType: "model", target: "mimo-v2.5" })
- "Show task recovery status" → resilience_report({ reportType: "recovery" })
- "Generate a full status report" → resilience_report({ reportType: "full" })
Error Categories
| Category | Description | Retryable |
|---|---|---|
| rate_limit | 429 Too Many Requests | ✅ |
| server_overload | 503 Service Unavailable | ✅ |
| timeout | Request timeout | ✅ |
| auth_failed | 401/403 Authentication failed | ❌ |
| network_error | Connection errors | ✅ |
| model_unavailable | Model not found or offline | ✅ |
| context_too_long | Context length exceeded | ❌ |
| unknown | Unclassified errors | ❌ |
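To make the category table concrete, here is a minimal classification sketch. The category names, the retryable flags, and the status-code mappings (429, 503, 401/403) are taken from the table; the classifyError function and its message patterns are assumptions made for illustration and may not match how the plugin actually inspects errors.

```typescript
type ErrorCategory =
  | "rate_limit" | "server_overload" | "timeout" | "auth_failed"
  | "network_error" | "model_unavailable" | "context_too_long" | "unknown";

// Retryable flags copied from the Error Categories table.
const RETRYABLE: Record<ErrorCategory, boolean> = {
  rate_limit: true,
  server_overload: true,
  timeout: true,
  auth_failed: false,
  network_error: true,
  model_unavailable: true,
  context_too_long: false,
  unknown: false,
};

// Hypothetical classifier. Only the status-code checks come from the table;
// the message patterns below are illustrative guesses.
function classifyError(status: number | undefined, message = ""): ErrorCategory {
  const msg = message.toLowerCase();
  if (status === 429) return "rate_limit";
  if (status === 503) return "server_overload";
  if (status === 401 || status === 403) return "auth_failed";
  if (/context length/.test(msg)) return "context_too_long";
  if (/timeout|timed out/.test(msg)) return "timeout";
  if (/model not found/.test(msg)) return "model_unavailable";
  if (/econnreset|econnrefused/.test(msg)) return "network_error";
  return "unknown";
}

const category = classifyError(429);
console.log(category, RETRYABLE[category]); // rate_limit true
```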
Retry Strategies
Strategy Types
- fixed: Fixed interval between retries (e.g., every 30s)
- exponential: Exponential backoff (1min → 2min → 4min → 8min...)
- custom: User-defined interval schedule (e.g., [1min, 3min, 5min, 15min])
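The three strategy types can be sketched as a single delay function. The type, maxRetries, and intervals fields appear in the resilience_strategies examples on this page; intervalMs, baseMs, maxMs, and the nextDelayMs function are hypothetical names, and reusing the last custom interval when the schedule is shorter than maxRetries is an assumption.

```typescript
// Hypothetical strategy shapes; delays are in milliseconds.
type RetryStrategy =
  | { type: "fixed"; maxRetries: number; intervalMs: number }
  | { type: "exponential"; maxRetries: number; baseMs: number; maxMs?: number }
  | { type: "custom"; maxRetries: number; intervals: number[] };

// Returns the delay before retry number `attempt` (1-based),
// or null once the retry budget is exhausted.
function nextDelayMs(strategy: RetryStrategy, attempt: number): number | null {
  if (attempt < 1 || attempt > strategy.maxRetries) return null;
  if (strategy.type === "fixed") return strategy.intervalMs;
  if (strategy.type === "exponential") {
    // The delay doubles on every attempt, optionally capped at maxMs.
    const delay = strategy.baseMs * 2 ** (attempt - 1);
    return strategy.maxMs === undefined ? delay : Math.min(delay, strategy.maxMs);
  }
  // custom: assumption - reuse the last interval if the schedule runs out.
  const index = Math.min(attempt, strategy.intervals.length) - 1;
  return strategy.intervals[index] ?? null;
}

const backoff: RetryStrategy = { type: "exponential", maxRetries: 4, baseMs: 60000 };
console.log([1, 2, 3, 4].map((n) => nextDelayMs(backoff, n)));
// [ 60000, 120000, 240000, 480000 ], i.e. 1min, 2min, 4min, 8min
```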
Default Strategies
| Name | Type | Max Retries | Intervals | Error Types |
|---|---|---|---|---|
| default-exponential | exponential | 5 | 1m→15m | rate_limit, server_overload, timeout, network_error |
| rate-limit-fixed | fixed | 3 | 30s | rate_limit |
| model-backoff | custom | 6 | 1m→2h | server_overload, model_unavailable |
| network-retry | exponential | 4 | 5s→1m | network_error |
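The default strategy table can also be read as data. The sketch below encodes the name, type, max retries, and error types of each row and looks up which strategies list a given error category. The StrategyRow type and strategiesFor helper are hypothetical, and because this page does not say which strategy wins when several match (rate_limit appears under both default-exponential and rate-limit-fixed), the helper simply returns every candidate in table order.

```typescript
// Hypothetical row type mirroring the Default Strategies table.
interface StrategyRow {
  name: string;
  type: "fixed" | "exponential" | "custom";
  maxRetries: number;
  errorTypes: string[];
}

const DEFAULT_STRATEGIES: StrategyRow[] = [
  { name: "default-exponential", type: "exponential", maxRetries: 5,
    errorTypes: ["rate_limit", "server_overload", "timeout", "network_error"] },
  { name: "rate-limit-fixed", type: "fixed", maxRetries: 3, errorTypes: ["rate_limit"] },
  { name: "model-backoff", type: "custom", maxRetries: 6,
    errorTypes: ["server_overload", "model_unavailable"] },
  { name: "network-retry", type: "exponential", maxRetries: 4, errorTypes: ["network_error"] },
];

// Returns the names of all default strategies that list the given error type.
function strategiesFor(errorType: string): string[] {
  return DEFAULT_STRATEGIES
    .filter((row) => row.errorTypes.some((t) => t === errorType))
    .map((row) => row.name);
}

console.log(strategiesFor("rate_limit"));
// [ 'default-exponential', 'rate-limit-fixed' ]
```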
Data Storage
Per-instance data: ~/.openclaw/plugins/resilience/instances/<instance-id>/ (stats, logs, strategies, tasks). Legacy root layout is still read as default.
~/.openclaw/plugins/resilience/instances/<instance-id>/
├── meta.json
├── stats.json
├── strategies.json
├── active-retries.json
├── logs/YYYY-MM-DD.jsonl
└── tasks/
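For scripts that need to locate these files, the layout above can be expressed as a small path builder. This is a sketch only: instancePaths is a hypothetical helper, the home directory and instance id are passed in rather than discovered, paths use forward slashes, and naming log files by UTC date is an assumption (this page only gives the YYYY-MM-DD pattern).

```typescript
// Hypothetical helper that builds the per-instance paths listed above.
function instancePaths(homeDir: string, instanceId: string, day: Date) {
  const base = `${homeDir}/.openclaw/plugins/resilience/instances/${instanceId}`;
  // Assumption: the log file name uses the UTC calendar date.
  const ymd = day.toISOString().slice(0, 10);
  return {
    base,
    meta: `${base}/meta.json`,
    stats: `${base}/stats.json`,
    strategies: `${base}/strategies.json`,
    activeRetries: `${base}/active-retries.json`,
    log: `${base}/logs/${ymd}.jsonl`,
    tasks: `${base}/tasks`,
  };
}

// "gw-1" is a made-up instance id used only for this example.
const examplePaths = instancePaths("/home/alice", "gw-1", new Date(Date.UTC(2025, 0, 15)));
console.log(examplePaths.log);
// /home/alice/.openclaw/plugins/resilience/instances/gw-1/logs/2025-01-15.jsonl
```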
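Installation and Usage
- Make sure OpenClaw is installed (local or Docker deployment).
- Enter the install command in the chat box: /install resilience-monitor
- Once installed, invoke the Skill by name or trigger it with /resilience-monitor.
- Provide the inputs described in the Skill's parameter documentation to get structured output.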
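What is Resilience Monitor?
Resilience Monitor lets you monitor and manage OpenClaw API errors, track model performance, configure retry strategies, generate reports, and oversee task recovery status. It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 16 downloads to date.
How do I install Resilience Monitor?
Run the command /install resilience-monitor in the OpenClaw or Claude Code chat box to install it in one step; no extra configuration is required.
Is Resilience Monitor free?
Yes. Resilience Monitor is completely free and released under the MIT-0 license, so you can download, install, and use it freely.
Which platforms does Resilience Monitor support?
Resilience Monitor is cross-platform and runs in any environment where OpenClaw / Claude Code is deployed.
Who develops Resilience Monitor?
It is developed and maintained by leiJack-lo (@leijack-lo). The current version is v0.3.0.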