Incident Runbook Templates
Author: Solomon Neas · v1.0.1 · MIT-0
Total downloads: 214 · Favorites: 0 · Current installs: 0 · Versions: 2
Install in OpenClaw:
/install incident-runbook-templates
Description
Production-ready incident response runbook templates. Step-by-step procedures for detection, triage, mitigation, resolution, and communication. Includes esca...
Usage (SKILL.md)
Incident Runbook Templates
Production-ready templates for incident response runbooks covering detection, triage, mitigation, resolution, and communication.
Do not use this skill when
- The task is unrelated to incident runbook templates
- You need a different domain or tool outside this scope
Instructions
- Clarify goals, constraints, and required inputs.
- Apply relevant best practices and validate outcomes.
- Provide actionable steps and verification.
- If detailed examples are required, open `resources/implementation-playbook.md`.
Use this skill when
- Creating incident response procedures
- Building service-specific runbooks
- Establishing escalation paths
- Documenting recovery procedures
- Responding to active incidents
- Onboarding on-call engineers
Core Concepts
1. Incident Severity Levels
| Severity | Impact | Response Time | Example |
|---|---|---|---|
| SEV1 | Complete outage, data loss | 15 min | Production down |
| SEV2 | Major degradation | 30 min | Critical feature broken |
| SEV3 | Minor impact | 2 hours | Non-critical bug |
| SEV4 | Minimal impact | Next business day | Cosmetic issue |
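As a rough illustration of how the table could be encoded for tooling, the sketch below maps each severity to its response-time target and checks whether a response is overdue. The names `RESPONSE_TARGETS`, `response_deadline`, and `is_overdue` are hypothetical, and treating "next business day" as 24 hours is an assumption made only to keep the example short.

```python
from datetime import datetime, timedelta

# Response-time targets copied from the severity table above.
# SEV4 ("next business day") is approximated as 24 hours: an assumption.
RESPONSE_TARGETS = {
    "SEV1": timedelta(minutes=15),
    "SEV2": timedelta(minutes=30),
    "SEV3": timedelta(hours=2),
    "SEV4": timedelta(hours=24),
}


def response_deadline(severity: str, detected_at: datetime) -> datetime:
    """Return the time by which a responder should have engaged."""
    return detected_at + RESPONSE_TARGETS[severity.upper()]


def is_overdue(severity: str, detected_at: datetime, now: datetime) -> bool:
    """True if the response target for this severity has already passed."""
    return now > response_deadline(severity, detected_at)
```

A pager integration might call `is_overdue` on a timer and escalate when it returns `True`.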
2. Runbook Structure
1. Overview & Impact
2. Detection & Alerts
3. Initial Triage
4. Mitigation Steps
5. Root Cause Investigation
6. Resolution Procedures
7. Verification & Rollback
8. Communication Templates
9. Escalation Matrix
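One way to keep new runbooks consistent with this structure is to generate an empty skeleton from the section list. The sketch below is a minimal example; `SECTIONS` and `runbook_skeleton` are hypothetical names, and the heading style is an assumption modeled on the templates in this skill.

```python
# The nine sections listed above, in order.
SECTIONS = [
    "Overview & Impact",
    "Detection & Alerts",
    "Initial Triage",
    "Mitigation Steps",
    "Root Cause Investigation",
    "Resolution Procedures",
    "Verification & Rollback",
    "Communication Templates",
    "Escalation Matrix",
]


def runbook_skeleton(service_name: str) -> str:
    """Build an empty Markdown runbook with one numbered heading per section."""
    lines = [f"# {service_name} Outage Runbook", ""]
    for number, title in enumerate(SECTIONS, start=1):
        lines.append(f"## {number}. {title}")
        lines.append("")
    return "\n".join(lines)
```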
Runbook Templates
Template 1: Service Outage Runbook
# [Service Name] Outage Runbook
## Overview
**Service**: Payment Processing Service
**Owner**: Platform Team
**Slack**: #payments-incidents
**PagerDuty**: payments-oncall
## Impact Assessment
- [ ] Which customers are affected?
- [ ] What percentage of traffic is impacted?
- [ ] Are there financial implications?
- [ ] What's the blast radius?
## Detection
### Alerts
- `payment_error_rate > 5%` (PagerDuty)
- `payment_latency_p99 > 2s` (Slack)
- `payment_success_rate < 95%` (PagerDuty)
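The three alert conditions can be read as simple threshold checks. The sketch below shows one possible encoding, assuming rates are expressed as fractions (0.05 for 5%) and latency in seconds; `PaymentMetrics` and `firing_alerts` are hypothetical names, not part of any monitoring product.

```python
from dataclasses import dataclass


@dataclass
class PaymentMetrics:
    error_rate: float      # fraction of failed requests, 0.05 == 5%
    latency_p99_s: float   # 99th percentile latency in seconds
    success_rate: float    # fraction of successful payments


def firing_alerts(m: PaymentMetrics) -> list[tuple[str, str]]:
    """Return (alert, destination) pairs for every threshold that is breached."""
    alerts = []
    if m.error_rate > 0.05:
        alerts.append(("payment_error_rate > 5%", "PagerDuty"))
    if m.latency_p99_s > 2.0:
        alerts.append(("payment_latency_p99 > 2s", "Slack"))
    if m.success_rate < 0.95:
        alerts.append(("payment_success_rate < 95%", "PagerDuty"))
    return alerts
```

The comparisons are strict, so a value sitting exactly on a threshold does not fire.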
### Dashboards
- [Payment Service Dashboard](https://grafana/d/payments)
- [Error Tracking](https://sentry.io/payments)
- [Dependency Status](https://status.stripe.com)
## Initial Triage (First 5 Minutes)
### 1. Assess Scope
```bash
# Check service health
kubectl get pods -n payments -l app=payment-service
# Check recent deployments
kubectl rollout history deployment/payment-service -n payments
# Check error rates (-G with --data-urlencode stops curl from globbing {} and [])
curl -sG "http://prometheus:9090/api/v1/query" \
  --data-urlencode "query=sum(rate(http_requests_total{status=~'5..'}[5m]))"
```
### 2. Quick Health Checks
- Can you reach the service? `curl -I https://api.company.com/payments/health`
- Database connectivity? Check connection pool metrics
- External dependencies? Check Stripe, bank API status
- Recent changes? Check deploy history
### 3. Initial Classification
| Symptom | Likely Cause | Go To Section |
|---|---|---|
| All requests failing | Service down | Section 4.1 |
| High latency | Database/dependency | Section 4.2 |
| Partial failures | Code bug | Section 4.3 |
| Spike in errors | Traffic surge | Section 4.4 |
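The classification table is effectively a lookup from symptom to likely cause and mitigation section. A minimal sketch, assuming symptoms are matched by exact (case-insensitive) text; `TRIAGE_TABLE` and `classify` are hypothetical names:

```python
from typing import Optional

# Symptom -> (likely cause, mitigation section), copied from the table above.
TRIAGE_TABLE = {
    "all requests failing": ("Service down", "4.1"),
    "high latency": ("Database/dependency", "4.2"),
    "partial failures": ("Code bug", "4.3"),
    "spike in errors": ("Traffic surge", "4.4"),
}


def classify(symptom: str) -> Optional[tuple[str, str]]:
    """Return (likely_cause, section) for a known symptom, or None if unknown."""
    return TRIAGE_TABLE.get(symptom.strip().lower())
```

Returning `None` for an unrecognized symptom leaves the decision to the responder rather than guessing a section.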
## Mitigation Procedures
### 4.1 Service Completely Down
```bash
# Step 1: Check pod status
kubectl get pods -n payments
# Step 2: If pods are crash-looping, check logs
kubectl logs -n payments -l app=payment-service --tail=100
# Step 3: Check recent deployments
kubectl rollout history deployment/payment-service -n payments
# Step 4: ROLLBACK if recent deploy is suspect
kubectl rollout undo deployment/payment-service -n payments
# Step 5: Scale up if resource constrained
kubectl scale deployment/payment-service -n payments --replicas=10
# Step 6: Verify recovery
kubectl rollout status deployment/payment-service -n payments
```
### 4.2 High Latency
```bash
# Step 1: Check database connections
kubectl exec -n payments deploy/payment-service -- \
  curl -s localhost:8080/metrics | grep db_pool
# Step 2: Check slow queries (if DB issue)
# A column alias cannot be referenced in WHERE, so the duration is computed inline.
psql -h "$DB_HOST" -U "$DB_USER" -c "
SELECT pid, now() - query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active' AND now() - query_start > interval '5 seconds'
ORDER BY duration DESC;"
# Step 3: Kill long-running queries if needed
# Replace <PID> with a pid returned by Step 2.
psql -h "$DB_HOST" -U "$DB_USER" -c "SELECT pg_terminate_backend(<PID>);"
# Step 4: Check external dependency latency
# Requires a local curl-format.txt write-out template.
curl -w "@curl-format.txt" -o /dev/null -s https://api.stripe.com/v1/health
# Step 5: Enable circuit breaker if dependency is slow
kubectl set env deployment/payment-service \
  STRIPE_CIRCUIT_BREAKER_ENABLED=true -n payments
```
### 4.3 Partial Failures (Specific Errors)
```bash
# Step 1: Identify error pattern
kubectl logs -n payments -l app=payment-service --tail=500 | \
  grep -i error | sort | uniq -c | sort -rn | head -20
# Step 2: Check error tracking
# Go to Sentry: https://sentry.io/payments
# Step 3: If specific endpoint, enable feature flag to disable
curl -X POST https://api.company.com/internal/feature-flags \
  -H "Content-Type: application/json" \
  -d '{"flag": "DISABLE_PROBLEMATIC_FEATURE", "enabled": true}'
# Step 4: If data issue, check recent data changes
psql -h "$DB_HOST" -c "
SELECT * FROM audit_log
WHERE table_name = 'payment_methods'
AND created_at > now() - interval '1 hour';"
```
### 4.4 Traffic Surge
```bash
# Step 1: Check current load (kubectl top reports CPU and memory per pod, not request rate)
kubectl top pods -n payments
# Step 2: Scale horizontally
kubectl scale deployment/payment-service -n payments --replicas=20
# Step 3: Enable rate limiting
kubectl set env deployment/payment-service \
  RATE_LIMIT_ENABLED=true \
  RATE_LIMIT_RPS=1000 -n payments
# Step 4: If attack, block suspicious IPs
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: block-suspicious
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app: payment-service
  ingress:
  - from:
    - ipBlock:
        cidr: 0.0.0.0/0
        except:
        - 192.168.1.0/24  # Suspicious range
EOF
```
## Verification Steps
```bash
# Verify service is healthy
curl -s https://api.company.com/payments/health | jq .
# Verify error rate is back to normal
curl -sG "http://prometheus:9090/api/v1/query" \
  --data-urlencode "query=sum(rate(http_requests_total{status=~'5..'}[5m]))" \
  | jq '.data.result[0].value[1]'
# Verify latency is acceptable
curl -sG "http://prometheus:9090/api/v1/query" \
  --data-urlencode "query=histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))" \
  | jq .
# Smoke test critical flows
./scripts/smoke-test-payments.sh
```
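The error-rate check can also be scripted instead of read by eye. The sketch below parses the JSON body of a Prometheus instant query and compares the first sample with a threshold. The function names and the threshold are assumptions; the runbook does not define what "normal" means, so pick a value from your own baseline. An empty result is treated as "not verified" rather than "healthy", which is a deliberately conservative choice.

```python
import json
from typing import Optional


def instant_value(response_body: str) -> Optional[float]:
    """Extract the first sample value from a Prometheus instant-query response."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    results = payload["data"]["result"]
    if not results:
        return None  # no series matched the query
    # Each sample is [timestamp, "value-as-string"].
    return float(results[0]["value"][1])


def error_rate_recovered(response_body: str, max_errors_per_second: float) -> bool:
    """True only if a value was returned and it is at or below the threshold."""
    value = instant_value(response_body)
    return value is not None and value <= max_errors_per_second
```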
## Rollback Procedures
```bash
# Rollback Kubernetes deployment
kubectl rollout undo deployment/payment-service -n payments
# Rollback database migration (if applicable)
./scripts/db-rollback.sh "$MIGRATION_VERSION"
# Rollback feature flag
curl -X POST https://api.company.com/internal/feature-flags \
  -H "Content-Type: application/json" \
  -d '{"flag": "NEW_PAYMENT_FLOW", "enabled": false}'
```
## Escalation Matrix
| Condition | Escalate To | Contact |
|---|---|---|
| > 15 min unresolved SEV1 | Engineering Manager | @manager (Slack) |
| Data breach suspected | Security Team | #security-incidents |
| Financial impact > $10k | Finance + Legal | @finance-oncall |
| Customer communication needed | Support Lead | @support-lead |
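The matrix can be expressed as a set of independent conditions, each adding an escalation target. The following is a sketch only; `IncidentState` and `escalation_targets` are hypothetical names, and the contacts are the placeholders from the table.

```python
from dataclasses import dataclass


@dataclass
class IncidentState:
    severity: str
    minutes_unresolved: int
    data_breach_suspected: bool = False
    financial_impact_usd: float = 0.0
    customer_comms_needed: bool = False


def escalation_targets(state: IncidentState) -> list[tuple[str, str]]:
    """Return (role, contact) pairs for every matrix row whose condition holds."""
    targets = []
    if state.severity == "SEV1" and state.minutes_unresolved > 15:
        targets.append(("Engineering Manager", "@manager (Slack)"))
    if state.data_breach_suspected:
        targets.append(("Security Team", "#security-incidents"))
    if state.financial_impact_usd > 10_000:
        targets.append(("Finance + Legal", "@finance-oncall"))
    if state.customer_comms_needed:
        targets.append(("Support Lead", "@support-lead"))
    return targets
```

Because the rows are evaluated independently, a single incident can escalate to several teams at once.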
## Communication Templates
### Initial Notification (Internal)
🚨 INCIDENT: Payment Service Degradation
Severity: SEV2
Status: Investigating
Impact: ~20% of payment requests failing
Start Time: [TIME]
Incident Commander: [NAME]
Current Actions:
- Investigating root cause
- Scaling up service
- Monitoring dashboards
Updates in #payments-incidents
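If notifications are posted by a bot, the template above can be filled in programmatically. A minimal sketch, assuming plain-text output and a hypothetical helper named `initial_notification`:

```python
def initial_notification(
    title: str,
    severity: str,
    impact: str,
    start_time: str,
    commander: str,
    actions: list[str],
    channel: str,
) -> str:
    """Render the internal initial-notification message as plain text."""
    lines = [
        f"🚨 INCIDENT: {title}",
        f"Severity: {severity}",
        "Status: Investigating",
        f"Impact: {impact}",
        f"Start Time: {start_time}",
        f"Incident Commander: {commander}",
        "Current Actions:",
    ]
    lines += [f"- {action}" for action in actions]
    lines.append(f"Updates in {channel}")
    return "\n".join(lines)
```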
### Status Update
📊 UPDATE: Payment Service Incident
Status: Mitigating
Impact: Reduced to ~5% failure rate
Duration: 25 minutes
Actions Taken:
- Rolled back deployment v2.3.4 → v2.3.3
- Scaled service from 5 → 10 replicas
Next Steps:
- Continuing to monitor
- Root cause analysis in progress
ETA to Resolution: ~15 minutes
### Resolution Notification
✅ RESOLVED: Payment Service Incident
Duration: 45 minutes
Impact: ~5,000 affected transactions
Root Cause: Memory leak in v2.3.4
Resolution:
- Rolled back to v2.3.3
- Transactions auto-retried successfully
Follow-up:
- Postmortem scheduled for [DATE]
- Bug fix in progress
### Template 2: Database Incident Runbook
# Database Incident Runbook
## Quick Reference
| Issue | Command |
|-------|---------|
| Check connections | `SELECT count(*) FROM pg_stat_activity;` |
| Kill query | `SELECT pg_terminate_backend(pid);` |
| Check replication lag | `SELECT extract(epoch from (now() - pg_last_xact_replay_timestamp()));` |
| Check locks | `SELECT * FROM pg_locks WHERE NOT granted;` |
## Connection Pool Exhaustion
```sql
-- Check current connections
SELECT datname, usename, state, count(*)
FROM pg_stat_activity
GROUP BY datname, usename, state
ORDER BY count(*) DESC;
-- Identify long-running connections
SELECT pid, usename, datname, state, query_start, query
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY query_start;
-- Terminate connections that have been idle for more than 10 minutes
-- (state_change records when the backend entered its current state;
-- the current session is excluded so the command does not kill itself)
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
AND state_change < now() - interval '10 minutes'
AND pid <> pg_backend_pid();
```
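To judge how close the pool is to exhaustion, the grouped connection counts can be compared against the server's `max_connections` setting. The sketch below assumes the rows come from the first query above as `(datname, usename, state, count)` tuples; `pool_usage` is a hypothetical helper and the caller supplies `max_connections`.

```python
def pool_usage(rows: list[tuple[str, str, str, int]], max_connections: int) -> dict:
    """Summarize total and idle connections and the fraction of the limit in use."""
    total = sum(count for _, _, _, count in rows)
    idle = sum(count for _, _, state, count in rows if state == "idle")
    return {
        "total": total,
        "idle": idle,
        "utilization": total / max_connections,
    }
```

A high utilization with a large idle count suggests that terminating idle connections is likely to help; a high utilization with few idle connections points toward genuine load.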
## Replication Lag
```sql
-- Check lag on replica
SELECT
  CASE
    WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn() THEN 0
    ELSE extract(epoch from now() - pg_last_xact_replay_timestamp())
  END AS lag_seconds;
-- If lag > 60s, consider:
-- 1. Check network between primary/replica
-- 2. Check replica disk I/O
-- 3. Consider failover if unrecoverable
```
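The 60-second rule of thumb above could be wired into a health check along these lines. This is a sketch under the assumption that `lag_seconds` is the value returned by the query; `lag_action` is a hypothetical name and the threshold is configurable.

```python
def lag_action(lag_seconds: float, threshold_seconds: float = 60.0) -> str:
    """Map replica lag to the next step suggested by the runbook."""
    if lag_seconds <= threshold_seconds:
        return "ok"
    return "investigate: network, replica disk I/O, then consider failover"
```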
## Disk Space Critical
```bash
# Check disk usage
df -h /var/lib/postgresql/data
# Find large tables
psql -c "SELECT relname, pg_size_pretty(pg_total_relation_size(relid))
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 10;"
# VACUUM to reclaim space
# Caution: VACUUM FULL takes an ACCESS EXCLUSIVE lock and writes a new copy of
# the table, so it needs free disk space roughly equal to the table's size.
psql -c "VACUUM FULL large_table;"
# If emergency, delete old data or expand disk
```
## Best Practices
### Do's
- **Keep runbooks updated** - Review after every incident
- **Test runbooks regularly** - Game days, chaos engineering
- **Include rollback steps** - Always have an escape hatch
- **Document assumptions** - What must be true for steps to work
- **Link to dashboards** - Quick access during stress
### Don'ts
- **Don't assume knowledge** - Write for 3 AM brain
- **Don't skip verification** - Confirm each step worked
- **Don't forget communication** - Keep stakeholders informed
- **Don't work alone** - Escalate early
- **Don't skip postmortems** - Learn from every incident
## Resources
- [Google SRE Book - Incident Management](https://sre.google/sre-book/managing-incidents/)
- [PagerDuty Incident Response](https://response.pagerduty.com/)
- [Atlassian Incident Management](https://www.atlassian.com/incident-management)
Safe usage advice
This appears to be a legitimate set of incident runbook templates, but be aware: the templates include concrete, privileged commands and reference secrets and internal endpoints that the skill does not declare. If you plan to let an agent execute steps derived from these templates, do not allow autonomous execution against production systems. Confirm required credentials (DB user/host, API tokens) separately, run the playbooks in a staging environment first, and require explicit human approval before running any rollback, DB-termination, or network-policy apply commands. If you only want passive guidance, restrict the agent to return instructions rather than execute commands, or modify the skill to explicitly declare any environment variables/permissions it needs.
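One way to enforce the human-approval requirement is to screen each command before an agent runs it. The sketch below flags the side-effecting commands that appear in these templates. The pattern list and the `requires_approval` name are assumptions, and a pattern match is only a coarse filter, not a substitute for proper access control.

```python
import re

# Commands from the templates that change production state.
SIDE_EFFECT_PATTERNS = [
    r"\bkubectl\s+rollout\s+undo\b",
    r"\bkubectl\s+scale\b",
    r"\bkubectl\s+apply\b",
    r"\bkubectl\s+set\s+env\b",
    r"\bpg_terminate_backend\b",
    r"\bVACUUM\s+FULL\b",
]


def requires_approval(command: str) -> bool:
    """True if the command matches any known side-effecting pattern."""
    return any(
        re.search(pattern, command, flags=re.IGNORECASE)
        for pattern in SIDE_EFFECT_PATTERNS
    )
```

Read-only commands such as `kubectl get pods` pass through, while rollbacks, scaling, and backend termination are held for review.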
Capability analysis
Type: OpenClaw Skill
Name: incident-runbook-templates
Version: 1.0.1
The skill bundle provides standard incident response runbook templates and SRE best practices. The included commands (kubectl, psql, curl) are appropriate for the stated purpose of service triage and mitigation, and there is no evidence of malicious intent, data exfiltration, or prompt injection in SKILL.md.
Capability assessment
Purpose & Capability
The name and description match the content: this is a library of incident response runbook templates with concrete triage and mitigation steps. The included commands (kubectl, psql, curl, feature-flag APIs) are appropriate for runbooks. However, many of the steps assume privileged access and side-effecting operations (rollbacks, scaling, terminating DB backends) which are stronger capabilities than a passive template; the skill does not state these operational requirements explicitly.
Instruction Scope
SKILL.md contains explicit shell commands and procedural steps that reference system tools (kubectl, psql, curl, grep, kubectl apply) and internal endpoints (prometheus, grafana, api.company.com, Sentry, Stripe). It also references environment variables ($DB_HOST, $DB_USER) and a local resources/implementation-playbook.md file that are not declared. The instructions include destructive/privileged actions (pg_terminate_backend, kubectl rollout undo, applying NetworkPolicy) which go beyond read-only guidance and could cause production impact if executed.
Install Mechanism
No install spec and no code files — this is instruction-only. That minimizes direct filesystem or network install risk because nothing is downloaded or executed as part of an installation step.
Credentials
The skill declares no required environment variables or credentials, but the runbook examples reference credentials and variables (DB host/user, API endpoints, internal auth) and call APIs (Stripe, internal feature-flag endpoints, PagerDuty/Slack references) that would normally require secrets. The lack of declared env requirements is inconsistent with the operational commands provided.
Persistence & Privilege
always:false and no special persistence is requested, which is appropriate. However, the skill allows model invocation (platform default). Combined with the instruction scope concerns — i.e., explicit commands that can change production state — you should be cautious about allowing autonomous invocation or automatic execution of these runbook steps without human review.
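How to use
- Make sure OpenClaw is installed (locally or via Docker).
- Enter the install command in the chat box: `/install incident-runbook-templates`
- After installation, call the skill by name or trigger it with `/incident-runbook-templates`.
- Provide the inputs described in the skill's parameter notes to get structured output.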
Version history
v1.0.1
Natural description rewrite
v1.0.0
Initial release of incident-runbook-templates, offering structured guidance for managing incidents.
- Provides production-ready templates covering detection, triage, mitigation, resolution, and communication steps for incident response.
- Defines incident severity levels and recommended response times.
- Includes a comprehensive service outage runbook with actionable checklists and command examples for various outage scenarios.
- Supplies templates for impact assessment, verification, rollback, escalation, and communication during active incidents.
- Guides onboarding of on-call engineers and establishes clear escalation paths.
FAQ
What is Incident Runbook Templates?
Production-ready incident response runbook templates. Step-by-step procedures for detection, triage, mitigation, resolution, and communication. Includes esca... It is an AI agent skill plugin for Claude Code / OpenClaw, with 214 total downloads to date.
How do I install Incident Runbook Templates?
Run the command `/install incident-runbook-templates` in the OpenClaw or Claude Code chat box to install it in one step; no additional configuration is needed.
Is Incident Runbook Templates free?
Yes. Incident Runbook Templates is completely free under the MIT-0 license and can be downloaded, installed, and used freely.
Which platforms does Incident Runbook Templates support?
Incident Runbook Templates is cross-platform and can be used in any environment where OpenClaw / Claude Code is deployed.
Who developed Incident Runbook Templates?
It is developed and maintained by Solomon Neas (@solomonneas); the current version is v1.0.1.