
ITIL Ops

Author: chefboyrdave21 · v1.0.0 · MIT-0
cross-platform · ✓ Security check passed
141 total downloads · 0 favorites · 0 current installs · 1 version
Install in OpenClaw
/install itil-ops
Description
ITIL-aligned incident, problem, and change management for AI agents. Use when: detecting service crashes, analyzing recurring failures, tracking incidents to...
Instructions (SKILL.md)

ITIL Ops — IT Service Management for AI Agents

Structured incident, problem, and change management adapted from ITIL 4 for autonomous agent operations.

Core Concepts

Severity Levels

| Level | Meaning | Response | Example |
|-------|---------|----------|---------|
| P1 | Critical — service down, data at risk | Immediate alert + auto-remediate | Crash loop, disk full, OOM |
| P2 | High — degraded service | Alert within 1h | Service restarts, auth failures |
| P3 | Medium — non-critical issue | Next review cycle | Cron timeouts, broken files |
| P4 | Low — cosmetic/minor | Track, fix when convenient | Log warnings, config drift |

Incident vs Problem vs Change

  • Incident: Something broke. Restore service ASAP. (reactive)
  • Problem: Pattern of incidents. Find and fix root cause. (proactive)
  • Change: Planned modification. Assess risk before executing. (controlled)

Incident Management

Detection Sources

Scan these in order of criticality:

  1. Service crashes — journalctl --user -u SERVICE --since "12 hours ago" for watchdog timeouts, SIGABRT, SIGSEGV, core dumps
  2. Cron failures — consecutive error count > 2 in job state files
  3. Health endpoints — HTTP health checks returning non-200
  4. Resource pressure — disk > 80%, RAM > 80%, swap active
  5. Data integrity — schema validation failures, broken files, load errors
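
As a rough illustration of the resource-pressure check in item 4, the Python sketch below parses /proc/meminfo-style text and applies the 80% thresholds from the list. The function names and the returned labels are assumptions made for this example; they are not taken from scripts/itil-review.sh, which may compute these figures differently.

```python
import shutil


def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo-style text into {field: value in kB}."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def resource_pressure(meminfo: dict[str, int], disk_used_pct: float) -> list[str]:
    """Return which pressure signals fire (labels are hypothetical)."""
    signals = []
    if disk_used_pct > 80:
        signals.append("disk")
    total = meminfo.get("MemTotal", 0)
    available = meminfo.get("MemAvailable", 0)
    if total and 100 * (total - available) / total > 80:
        signals.append("ram")
    # Swap counts as "active" when any of it is in use.
    if meminfo.get("SwapTotal", 0) - meminfo.get("SwapFree", 0) > 0:
        signals.append("swap")
    return signals


def disk_used_percent(path: str = "/") -> float:
    """Percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100 * usage.used / usage.total
```

On a Linux host, a caller would read /proc/meminfo, pass its contents to parse_meminfo, and feed the result to resource_pressure together with disk_used_percent("/").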

Detection Script

Run scripts/itil-review.sh to scan all sources. It outputs:

  • ITIL_CLEAR if nothing found (reply HEARTBEAT_OK)
  • Formatted report with incidents and problems if issues detected

Incident Lifecycle

DETECTED → CLASSIFIED (P1-P4) → DIAGNOSED → RESOLVED → CLOSED
                                      ↓
                              (3+ occurrences)
                                      ↓
                              ESCALATE TO PROBLEM
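
The lifecycle above can be read as a small forward-only state machine. The sketch below is one way to model it in Python; the class and function names are illustrative assumptions, and the skill itself may track state quite differently.

```python
from enum import Enum


class State(Enum):
    DETECTED = "DETECTED"
    CLASSIFIED = "CLASSIFIED"
    DIAGNOSED = "DIAGNOSED"
    RESOLVED = "RESOLVED"
    CLOSED = "CLOSED"


# Forward-only transitions, in the order the diagram shows them.
NEXT_STATE = {
    State.DETECTED: State.CLASSIFIED,
    State.CLASSIFIED: State.DIAGNOSED,
    State.DIAGNOSED: State.RESOLVED,
    State.RESOLVED: State.CLOSED,
}

ESCALATION_THRESHOLD = 3  # "3+ occurrences" in the diagram


def advance(state: State) -> State:
    """Move an incident to its next lifecycle state; CLOSED is terminal."""
    if state not in NEXT_STATE:
        raise ValueError(f"{state.value} is a terminal state")
    return NEXT_STATE[state]


def should_escalate(occurrences: int) -> bool:
    """True when an incident has recurred often enough to open a problem."""
    return occurrences >= ESCALATION_THRESHOLD
```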

Auto-Classification Rules

# P1 — Critical
- Service crash count >= 3 in 12h (crash loop)
- Disk usage >= 90%
- RAM usage >= 90%
- Data loss detected

# P2 — High
- Service crashed 1-2 times
- 3+ services down simultaneously
- Auth/token failures affecting operations
- Cron job with 5+ consecutive failures

# P3 — Medium
- Broken data files (schema violations)
- Memory load errors > 10 in 12h
- Cron job with 3-4 consecutive failures
- Disk usage 80-89%

# P4 — Low
- 1 service down (non-critical)
- Config warnings
- Log noise
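
The rules above can be expressed as a first-match cascade from P1 down to P4. The following is a minimal sketch under stated assumptions: the Signals fields are hypothetical names for the measurements the rules mention, "log noise" is left out because the rules give no threshold for it, and the rules do not say how to rate exactly two services down, so the sketch treats that case as P4.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Signals:
    """Inputs to classification; field names are illustrative only."""
    crashes_12h: int = 0
    disk_pct: float = 0.0
    ram_pct: float = 0.0
    data_loss: bool = False
    services_down: int = 0
    auth_failures: bool = False
    cron_consecutive_failures: int = 0
    broken_data_files: int = 0
    memory_load_errors_12h: int = 0
    config_warnings: int = 0


def classify(s: Signals) -> Optional[str]:
    """Return the highest matching severity, or None when no rule fires."""
    if s.crashes_12h >= 3 or s.disk_pct >= 90 or s.ram_pct >= 90 or s.data_loss:
        return "P1"
    if (1 <= s.crashes_12h <= 2 or s.services_down >= 3
            or s.auth_failures or s.cron_consecutive_failures >= 5):
        return "P2"
    if (s.broken_data_files > 0 or s.memory_load_errors_12h > 10
            or 3 <= s.cron_consecutive_failures <= 4 or 80 <= s.disk_pct < 90):
        return "P3"
    # Assumption: any remaining outage (one or two services) is rated P4.
    if s.services_down >= 1 or s.config_warnings > 0:
        return "P4"
    return None
```

Checking P1 first means a host with both a single crash and a 95% full disk is reported once, at the higher severity.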

Creating Incident Tickets

When incidents are found, create coordination tasks:

Title: [ITIL-INC] <brief description>
Body:
- Severity: P1/P2/P3/P4
- Category: service|cron|memory|disk|security
- Detected: <timestamp>
- Detail: <what happened>
- Impact: <what's affected>
- Action: <what to do>
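
A small helper can fill in this template and reject values outside the listed severities and categories. The sketch below is illustrative only: format_incident is a hypothetical name, the UTC timestamp format is borrowed from the last_review field in itil-state.json, and how the result is handed to the coordination system is left out.

```python
from datetime import datetime, timezone
from typing import Optional

SEVERITIES = {"P1", "P2", "P3", "P4"}
CATEGORIES = {"service", "cron", "memory", "disk", "security"}


def format_incident(brief: str, severity: str, category: str, detail: str,
                    impact: str, action: str,
                    detected: Optional[datetime] = None) -> tuple[str, str]:
    """Return (title, body) for an incident ticket in the template above.

    `detected` should be timezone-aware; it defaults to the current UTC time.
    """
    if severity not in SEVERITIES:
        raise ValueError(f"unknown severity: {severity}")
    if category not in CATEGORIES:
        raise ValueError(f"unknown category: {category}")
    when = (detected or datetime.now(timezone.utc)).astimezone(timezone.utc)
    stamp = when.strftime("%Y-%m-%dT%H:%M:%SZ")
    title = f"[ITIL-INC] {brief}"
    body = "\n".join([
        f"- Severity: {severity}",
        f"- Category: {category}",
        f"- Detected: {stamp}",
        f"- Detail: {detail}",
        f"- Impact: {impact}",
        f"- Action: {action}",
    ])
    return title, body
```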

Problem Management

Pattern Detection

An incident becomes a problem when:

  • Same error occurs 3+ times in 24h
  • Same incident type recurs across 2+ review cycles
  • Multiple related incidents share a common root cause
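
The first two criteria are mechanical enough to sketch in code. The example below assumes incidents are available as (error key, timestamp) pairs and that each review cycle yields a set of incident types; both shapes, and the function names, are assumptions for illustration. The third criterion (shared root cause) needs human or agent judgment and is not modeled.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def recurring_errors(events, window=timedelta(hours=24), threshold=3):
    """Return error keys seen at least `threshold` times within any `window`.

    `events` is an iterable of (error_key, timestamp) pairs.
    """
    by_key = defaultdict(list)
    for key, ts in events:
        by_key[key].append(ts)
    problems = set()
    for key, stamps in by_key.items():
        stamps.sort()
        # Compare each timestamp with the one (threshold - 1) positions later.
        for i in range(len(stamps) - threshold + 1):
            if stamps[i + threshold - 1] - stamps[i] <= window:
                problems.add(key)
                break
    return problems


def recurs_across_cycles(cycles, min_cycles=2):
    """Return incident types present in at least `min_cycles` review cycles.

    `cycles` is a list with one collection of incident types per review.
    """
    counts = defaultdict(int)
    for types in cycles:
        for incident_type in set(types):
            counts[incident_type] += 1
    return {t for t, n in counts.items() if n >= min_cycles}
```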

Root Cause Analysis (RCA)

When a problem is identified:

  1. Gather evidence — journal logs, error messages, state files, recent changes
  2. Timeline — reconstruct the sequence of events
  3. 5 Whys — ask why iteratively until you reach the actual root cause
  4. Fix classification:
    • Quick fix — config change, file repair, timeout bump
    • Code fix — bug in script or daemon, needs PR
    • Architecture fix — design flaw, needs redesign

Problem Ticket Format

Title: [ITIL-PRB] <root cause description>
Body:
- Related incidents: <list>
- Root cause: <what's actually broken>
- Evidence: <logs, patterns, data>
- Fix applied: <immediate remediation>
- Fix needed: <permanent solution>
- Prevention: <how to prevent recurrence>

Known Error Database

Track resolved problems in the state file (itil-state.json):

{
  "last_review": "2026-03-22T04:19:50Z",
  "last_incident_count": 2,
  "last_problem_count": 1,
  "known_errors": {
    "memory-content-dict": {
      "description": "Scripts writing content as dict instead of string",
      "root_cause": "Missing json.dumps() in memory file writers",
      "fix": "Wrap content in json.dumps() before saving",
      "fixed_date": "2026-03-22"
    }
  }
}
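
For illustration, a known error could be added to a file of this shape as follows. This is a sketch, not the review script's actual logic: the function names are hypothetical, and the write-to-temp-then-rename step is an assumption added here so that an interrupted write cannot leave a truncated state file.

```python
import json
import os
import tempfile


def load_state(path):
    """Load the state file, or start with an empty known-error table."""
    try:
        with open(path, encoding="utf-8") as fh:
            return json.load(fh)
    except FileNotFoundError:
        return {"known_errors": {}}


def record_known_error(path, key, description, root_cause, fix, fixed_date):
    """Add or update one known-error entry and save the state file."""
    state = load_state(path)
    state.setdefault("known_errors", {})[key] = {
        "description": description,
        "root_cause": root_cause,
        "fix": fix,
        "fixed_date": fixed_date,
    }
    # Write to a temporary file in the same directory, then rename over the target.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(state, fh, indent=2)
    os.replace(tmp_path, path)
    return state
```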

Change Management

Pre-Change Checklist

Before modifying services, configs, or infrastructure:

  1. What's changing? — specific files, services, configs
  2. Why? — linked incident/problem ticket
  3. Risk? — what could go wrong
  4. Rollback plan? — how to undo if it breaks
  5. Test? — how to verify it worked
  6. Notify? — does the human need to know

Change Categories

| Type | Approval | Example |
|------|----------|---------|
| Standard | Pre-approved, just do it | Restart service, bump timeout |
| Normal | Inform human, wait for OK | New cron job, config change |
| Emergency | Fix now, inform after | Service down, data at risk |
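
One way to make the checklist and the approval rules enforceable is to capture them in a small record. The sketch below is a possible shape under assumptions: ChangeRequest and its method names are hypothetical, and the "Notify?" item is represented only through the approval and inform-after rules for each category.

```python
from dataclasses import dataclass
from enum import Enum


class ChangeType(Enum):
    STANDARD = "standard"    # pre-approved, just do it
    NORMAL = "normal"        # inform human, wait for OK
    EMERGENCY = "emergency"  # fix now, inform after


@dataclass
class ChangeRequest:
    what: str       # specific files, services, configs
    why: str        # linked incident/problem ticket
    risk: str       # what could go wrong
    rollback: str   # how to undo if it breaks
    test: str       # how to verify it worked
    change_type: ChangeType
    human_approved: bool = False

    def missing_items(self) -> list[str]:
        """Names of checklist answers that were left blank."""
        answers = {"what": self.what, "why": self.why, "risk": self.risk,
                   "rollback": self.rollback, "test": self.test}
        return [name for name, value in answers.items() if not value.strip()]

    def may_execute_now(self) -> bool:
        """Normal changes wait for the human; the other types do not."""
        if self.change_type is ChangeType.NORMAL:
            return self.human_approved
        return True

    def must_inform_after(self) -> bool:
        """Emergency changes are reported once the fix is in place."""
        return self.change_type is ChangeType.EMERGENCY
```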

Post-Change Verification

After any change:

  1. Check service status — systemctl --user status SERVICE
  2. Watch logs for 60s — journalctl --user -u SERVICE -f --since "now"
  3. Run health check — scripts/itil-review.sh
  4. Verify no new errors in first 5 minutes

Event Management

Log Monitoring Patterns

# Service crashes
journalctl --user -u SERVICE --since "12h ago" | grep -ciE "watchdog timeout|killed|SIGABRT|SIGSEGV|failed with"

# Memory/resource issues
journalctl --user -u SERVICE --since "12h ago" | grep -c "Failed to load"

# Auth failures
journalctl --user -u SERVICE --since "12h ago" | grep -ciE "unauthorized|403|token expired|auth fail"

Health Check Endpoints

Check services with curl:

curl -sf --max-time 5 "$URL" >/dev/null 2>&1 || echo "DOWN"

Configure endpoints in the review script for your environment.
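
If the check has to run from Python rather than the shell, a rough equivalent using only the standard library might look like this. It is a sketch with known differences from the curl command: urlopen follows redirects, its timeout applies to each blocking socket operation rather than to the whole request, and only HTTP 200 counts as up, matching the "non-200" wording under Detection Sources. The opener parameter is an assumption added to make the function testable without a network.

```python
import urllib.error
import urllib.request


def is_up(url: str, timeout: float = 5.0, opener=urllib.request.urlopen) -> bool:
    """Return True only when the endpoint answers with HTTP 200."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        # Covers HTTP error statuses, refused connections, timeouts, bad URLs.
        return False


def down_endpoints(urls, **kwargs):
    """Return the subset of `urls` that failed the health check."""
    return [url for url in urls if not is_up(url, **kwargs)]
```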

Continual Improvement

Review Cadence

| Review | Frequency | Purpose |
|--------|-----------|---------|
| Incident review | Every 12h | Detect and classify new issues |
| Problem review | Weekly | Identify patterns, track RCA progress |
| Capacity review | Weekly | Disk, RAM, memory count trends |
| Process review | Monthly | Are our detection rules catching real issues? |

KPIs to Track

  • MTTR (Mean Time to Resolve) — how fast do we fix incidents?
  • Incident recurrence rate — are the same things breaking?
  • False positive rate — are we alerting on non-issues?
  • Known error resolution — are problems getting permanent fixes?
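
The first three KPIs reduce to simple arithmetic once incident records exist. The definitions below are one reasonable reading, not the skill's own formulas: MTTR averages detected-to-resolved durations, recurrence rate is the share of incidents whose key was already seen, and false positive rate is the share of alerts judged non-actionable. All names are illustrative.

```python
from datetime import datetime, timedelta


def mttr(incidents):
    """Mean time to resolve over (detected, resolved) pairs.

    Incidents with resolved=None are still open and are skipped.
    Returns None when nothing has been resolved yet.
    """
    durations = [resolved - detected
                 for detected, resolved in incidents if resolved is not None]
    if not durations:
        return None
    return sum(durations, timedelta()) / len(durations)


def recurrence_rate(incident_keys):
    """Fraction of incidents whose key appeared earlier in the sequence."""
    seen = set()
    repeats = 0
    for key in incident_keys:
        if key in seen:
            repeats += 1
        seen.add(key)
    return repeats / len(incident_keys) if incident_keys else 0.0


def false_positive_rate(alerts_total, alerts_not_actionable):
    """Fraction of alerts that turned out not to be real issues."""
    return alerts_not_actionable / alerts_total if alerts_total else 0.0
```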

State Tracking

The review script maintains itil-state.json with:

  • Last review timestamp and results
  • Incident/problem counts per review
  • System metrics (disk, RAM, restart count)
  • Cross-review pattern detection data

Cron Setup

Recommended Schedule

# Incident review — every 12 hours
openclaw cron add --name "itil-review" --every "12h" \
  --model "anthropic/claude-sonnet-4-6" --timeout-seconds 180 \
  --session isolated \
  --message "Run ITIL review: bash ~/.skcapstone/agents/lumina/scripts/itil-review.sh"

# Weekly problem review (Sunday 9 AM)
# Analyze the week's incidents, identify patterns, suggest improvements

File Structure

itil-ops/
├── SKILL.md              # This file
├── scripts/
│   └── itil-review.sh    # Main review script (scan + classify + report)
└── references/
    └── itil4-agent-mapping.md  # ITIL 4 → Agent operations reference

Integration Points

  • Coordination tasks — skcapstone coord create for incident/problem tickets
  • Memory snapshots — skmemory_snapshot to record resolutions for future reference
  • Heartbeat — integrate with existing heartbeat to run lightweight checks
  • Cron — scheduled reviews via OpenClaw cron system
  • Alerting — Telegram/Discord delivery for P1/P2 issues
Security Recommendations
This skill appears to do what it says: local incident detection, classification, and ticketing for an agent environment. Before installing, review the script (scripts/itil-review.sh) line-by-line and confirm you are comfortable with it reading your agent's memory and coordination directories and writing its state file (itil-state.json). If you have sensitive data in agent memory or task files, consider restricting file permissions or running the script with a least-privilege account. Also confirm whether any additional code (not included here) performs automatic remediation — the shipped script mainly detects and records issues, while the docs reference "auto-remediate" and autonomous fixes; if you need or want automatic changes, explicitly audit that logic and limit what can be changed without human approval.
Functional Analysis
Type: OpenClaw Skill · Name: itil-ops · Version: 1.0.0
The itil-ops skill implements ITIL-aligned monitoring and incident management for OpenClaw agents. The core logic in scripts/itil-review.sh performs standard system health checks, including log analysis via journalctl, resource monitoring (df, /proc/meminfo), and local service heartbeats via curl to localhost. All actions are consistent with the stated purpose of operational oversight, and there is no evidence of data exfiltration, unauthorized remote execution, or malicious intent.
Capability Assessment
Purpose & Capability
Name/description (ITIL-style incident/problem/change mgmt for agents) match the shipped instructions and script: the script reads journalctl, cron job state, agent memory, checks local health endpoints, classifies severities, and records state. The resources accessed (user journals, agent memory, coordination tasks) are expected for on-host monitoring of agent services.
Instruction Scope
The SKILL.md and scripts focus on detection, classification, ticket creation, and state storage. The script reads many local files and agent memory directories ($HOME/.skcapstone, $HOME/.openclaw, coordination tasks), which is appropriate for this purpose, but note that it also writes: it saves itil-state.json and may create coordination tasks. The documentation says agents can "detect AND fix many incidents autonomously" (and mentions "auto-remediate" in places), but the visible script primarily detects and records; if you rely on automatic remediation behavior, confirm whether additional code (not present in the provided files) performs changes.
Install Mechanism
No install spec is provided (instruction-only plus a shipped script). This minimizes supply-chain risk — nothing is downloaded or installed automatically by the skill. The included Bash script is run locally and will be written to disk when the skill is installed, which is expected for an ops helper.
Credentials
The skill requests no credentials or special environment variables, which fits its local-monitoring role. However it reads potentially sensitive local data: journal logs, agent memory JSON files, coordination task files, and local health endpoints. These accesses are reasonable for an on-host monitoring tool but mean the skill will see any secrets present in those stores — review what your agent stores in its memory and coordination paths before enabling.
Persistence & Privilege
The skill is not always-enabled and does not request system-wide configuration changes. It writes state to its own agent memory path and logs to an agent-local log file, which is expected. It does not request elevated credentials or alter other skills' configs in the provided code.
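How to Use
  1. Make sure OpenClaw is installed (local or Docker deployment).
  2. Enter the install command in the chat box: /install itil-ops
  3. Once installed, invoke the skill by name or trigger it with /itil-ops.
  4. Provide the inputs the skill's parameter description asks for to get structured output.
Version History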
v1.0.0
v1.0.0 — Initial release. ITIL 4 incident, problem, and change management for AI agents. Born from fixing a midnight watchdog crash loop. Includes automated detection script, severity classification, pattern-based escalation, root cause analysis workflows, and ITIL-to-agent-ops reference mapping.
Metadata
Slug: itil-ops
Version: 1.0.0
License: MIT-0
Total installs: 0
Current installs: 0
Versions: 1
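FAQ

What is ITIL Ops?

ITIL-aligned incident, problem, and change management for AI agents. Use when: detecting service crashes, analyzing recurring failures, tracking incidents to... It is an AI agent skill plugin for Claude Code / OpenClaw, with 141 total downloads so far.

How do I install ITIL Ops?

Run the command "/install itil-ops" in the OpenClaw or Claude Code chat box for a one-step install; no additional configuration is needed.

Is ITIL Ops free?

Yes. ITIL Ops is completely free and released under the MIT-0 license; you can download, install, and use it freely.

Which platforms does ITIL Ops support?

ITIL Ops is cross-platform and can be used in any environment where OpenClaw / Claude Code is deployed.

Who developed ITIL Ops?

ITIL Ops is developed and maintained by chefboyrdave21 (@chefboyrdave21); the current version is v1.0.0.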
