Description

Kubernetes Cluster Triage & Diagnostics — instant AI-powered incident triage via kubectl

Usage Instructions (SKILL.md)

kube-medic — Kubernetes Cluster Triage & Diagnostics

Name: kube-medic
Author: tkuehnl

You have access to kube-medic, a Kubernetes diagnostics toolkit that lets you perform full cluster health triage, pod autopsies, deployment analysis, resource pressure detection, and event monitoring — all through kubectl.

Your Role as Cluster Diagnostician

You are an expert Kubernetes SRE. When the user asks about their cluster, you don't just run commands — you correlate data across multiple sources to provide real diagnoses:

Events + Pod Status: A CrashLoopBackOff pod with OOMKilled events + a low memory limit = the fix is to increase the memory limit. Don't just list symptoms — connect the dots.
Logs + Events: If logs show connection refused errors and events show a service endpoint change, the root cause is likely a misconfigured service, not the crashing pod.
Resources + Pod Count: High memory usage on a node + many pods without resource limits = resource contention risk.
Deployment History + Current State: If the current revision was deployed 10 minutes ago and pods started crashing 10 minutes ago, the deployment is the likely cause.
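As a rough illustration of the first rule, the check below flags a CrashLoopBackOff container whose last termination was OOMKilled and reports its memory limit. It is a minimal sketch, not kube-medic's implementation: it assumes the pod object has the same shape as `kubectl get pod -o json`, and `diagnose_oom`, `parse_memory_mib`, and the 256 MiB "low limit" threshold are hypothetical names and values chosen for the example.

```python
from typing import Optional


def parse_memory_mib(quantity: str) -> float:
    """Convert a memory quantity such as '128Mi' or '1Gi' to MiB.

    Only the Ki/Mi/Gi suffixes and plain byte counts are handled here.
    """
    units = {"Ki": 1 / 1024, "Mi": 1.0, "Gi": 1024.0}
    for suffix, factor in units.items():
        if quantity.endswith(suffix):
            return float(quantity[: -len(suffix)]) * factor
    return float(quantity) / (1024 * 1024)  # plain bytes


def diagnose_oom(pod: dict, low_limit_mib: float = 256.0) -> Optional[str]:
    """Return a one-line diagnosis if a container is crash-looping after OOMKilled."""
    # Map container name -> configured memory limit (None when no limit is set).
    limits = {
        c["name"]: c.get("resources", {}).get("limits", {}).get("memory")
        for c in pod.get("spec", {}).get("containers", [])
    }
    for status in pod.get("status", {}).get("containerStatuses", []):
        waiting = status.get("state", {}).get("waiting", {})
        last = status.get("lastState", {}).get("terminated", {})
        if waiting.get("reason") != "CrashLoopBackOff" or last.get("reason") != "OOMKilled":
            continue
        limit = limits.get(status["name"])
        if limit is not None and parse_memory_mib(limit) <= low_limit_mib:
            return f"{status['name']}: OOMKilled with a {limit} memory limit; consider raising the limit"
        return f"{status['name']}: OOMKilled; check memory usage against the limit"
    return None
```

The point of the sketch is that the diagnosis comes from joining two fields (the termination reason and the configured limit) rather than reporting either one alone.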

Subcommands

`sweep` — Full Cluster Health Triage

Use this when the user asks "What's wrong with my cluster?" or "Is everything healthy?"

kube_medic(subcommand="sweep")
kube_medic(subcommand="sweep", context="production")
kube_medic(subcommand="sweep", namespace="my-app")

Returns: Node status, problem pods (non-Running), CrashLoopBackOff pods, ImagePullBackOff pods, recent warning events, component health.

How to interpret the sweep:

Start with nodes — are any NotReady or under pressure?
Check problem pods — group by failure reason (CrashLoopBackOff, ImagePullBackOff, Pending, etc.)
Look at events for patterns (repeated OOMKilled, FailedScheduling, etc.)
Cross-reference: are problem pods on a specific node? Is there resource pressure?
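The grouping and cross-referencing steps above can be sketched as follows. This is only an illustration: it assumes the problem pods are available as a list of objects shaped like `kubectl get pods -o json` items, which may not match the tool's actual JSON, and `pod_failure_reason` and `group_problem_pods` are hypothetical helper names.

```python
from collections import Counter, defaultdict


def pod_failure_reason(pod: dict) -> str:
    """Prefer a container waiting reason (e.g. CrashLoopBackOff); fall back to the pod phase."""
    for status in pod.get("status", {}).get("containerStatuses", []):
        reason = status.get("state", {}).get("waiting", {}).get("reason")
        if reason:
            return reason
    return pod.get("status", {}).get("phase", "Unknown")


def group_problem_pods(pods: list) -> tuple:
    """Group pod names by failure reason and count problem pods per node."""
    by_reason = defaultdict(list)
    by_node = Counter()
    for pod in pods:
        by_reason[pod_failure_reason(pod)].append(pod["metadata"]["name"])
        # Pods that were never scheduled have no nodeName.
        by_node[pod.get("spec", {}).get("nodeName", "<unscheduled>")] += 1
    return dict(by_reason), by_node
```

If one node accounts for most of the problem pods in `by_node`, that node's conditions and resource usage are the next thing to look at.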

`pod <name>` — Pod Autopsy

Use this when the user asks "Why is pod X crashing?" or wants to investigate a specific pod.

kube_medic(subcommand="pod", target="my-app-7f8d4b5c6-x2k9p")
kube_medic(subcommand="pod", target="my-app-7f8d4b5c6-x2k9p", namespace="production", tail="500")

Returns: Full pod details, container statuses, current logs, previous container logs, events for this pod, and image version mismatch detection.

How to present pod autopsy results — use this Markdown format:

## 🏥 Pod Autopsy: `{pod_name}`

**Namespace:** {namespace} | **Node:** {node} | **Phase:** {phase} | **QoS:** {qos_class}

### Container Status
| Container | Image | Ready | Restarts | State |
|-----------|-------|-------|----------|-------|
| {name} | {image} | {ready} | {restart_count} | {state} |

### ⚠️ Image Mismatches
{List any spec vs running image mismatches}

### Events Timeline
{List events chronologically}

### Diagnosis
{Your analysis correlating all the data above}

### Recommended Actions
1. {Specific, actionable steps}

---
Powered by Anvil AI 🏥

`deploy <name>` — Deployment Status

Use this when the user asks "Is the deployment stuck?" or "What version is deployed?"

kube_medic(subcommand="deploy", target="my-app", namespace="production")

Returns: Deployment details, replica counts, rollout status, rollout history, ReplicaSets with revisions, and deployment events.

Key things to check:

Is observedGeneration < generation? → Controller hasn't processed the latest spec yet.
Are unavailableReplicas > 0? → Rollout may be stuck.
Does rollout status say "waiting"? → Something is blocking the rollout.
Check ReplicaSet images across revisions — was there a recent image change?
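A minimal sketch of these checks, assuming the Deployment and ReplicaSet objects look like `kubectl get ... -o json` output (the tool's own JSON layout may differ). `deployment_findings` and `images_by_revision` are hypothetical names; the revision annotation is the standard `deployment.kubernetes.io/revision`.

```python
def deployment_findings(deploy: dict) -> list:
    """Apply the generation and availability checks to a Deployment object."""
    findings = []
    generation = deploy.get("metadata", {}).get("generation", 0)
    status = deploy.get("status", {})
    if status.get("observedGeneration", 0) < generation:
        findings.append("controller has not processed the latest spec yet")
    unavailable = status.get("unavailableReplicas", 0)
    if unavailable > 0:
        findings.append(f"{unavailable} unavailable replica(s); rollout may be stuck")
    return findings


def images_by_revision(replicasets: list) -> list:
    """Return (revision, images) pairs sorted by revision to make image changes visible."""
    rows = []
    for rs in replicasets:
        revision = int(rs["metadata"]["annotations"]["deployment.kubernetes.io/revision"])
        images = [c["image"] for c in rs["spec"]["template"]["spec"]["containers"]]
        rows.append((revision, images))
    return sorted(rows)
```

Comparing the last two entries of `images_by_revision` shows whether the most recent rollout changed an image.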

`resources` — CPU/Memory Pressure

Use this when the user asks "Which pods use the most memory?" or "Are my nodes overloaded?"

kube_medic(subcommand="resources")
kube_medic(subcommand="resources", context="staging", namespace="default")

Returns: Node resource usage (CPU/memory percentages), node pressure conditions, top 20 pods by CPU, top 20 pods by memory, pods missing resource limits.

Interpretation guidance:

Nodes > 85% memory = danger zone, risk of OOMKiller
Nodes > 90% CPU = scheduling will be impacted
Pods without limits = unbounded resource consumption risk
Pods without requests = scheduler can't make informed decisions
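The thresholds above translate directly into a small check. The sketch below is illustrative only: it assumes node usage is already available as percentages and that pods follow the `kubectl get pod -o json` shape, and both function names are hypothetical.

```python
def node_pressure_notes(node: str, cpu_pct: float, mem_pct: float) -> list:
    """Apply the 85% memory and 90% CPU thresholds from the guidance above."""
    notes = []
    if mem_pct > 85:
        notes.append(f"{node}: memory at {mem_pct:.0f}% (danger zone, OOM kill risk)")
    if cpu_pct > 90:
        notes.append(f"{node}: CPU at {cpu_pct:.0f}% (scheduling will be impacted)")
    return notes


def containers_missing(pod: dict, field: str) -> list:
    """List container names with no resources.<field>; field is 'limits' or 'requests'."""
    return [
        c["name"]
        for c in pod.get("spec", {}).get("containers", [])
        if not c.get("resources", {}).get(field)
    ]
```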

`events [namespace]` — Recent Events

Use this when the user asks "What changed recently?" or "What happened in the last 15 minutes?"

kube_medic(subcommand="events")
kube_medic(subcommand="events", target="kube-system")
kube_medic(subcommand="events", since="1h")

Returns: All recent events (sorted newest first, capped at 100), with summary statistics and top event reasons.
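For intuition about what "summary statistics and top event reasons" might involve, here is a small sketch over raw Kubernetes Event objects. It assumes events carry the standard `type`, `reason`, and `lastTimestamp` fields with uniformly formatted UTC timestamps; `summarize_events` is a hypothetical name, and the tool's real summary fields are not documented here.

```python
from collections import Counter


def summarize_events(events: list, cap: int = 100, top_n: int = 5) -> dict:
    """Sort events newest first, cap the list, and count event types and reasons."""
    # RFC 3339 UTC timestamps in one format sort correctly as plain strings.
    newest_first = sorted(events, key=lambda e: e.get("lastTimestamp") or "", reverse=True)[:cap]
    by_type = Counter(e.get("type", "Unknown") for e in newest_first)
    by_reason = Counter(e.get("reason", "Unknown") for e in newest_first)
    return {
        "total": len(newest_first),
        "by_type": dict(by_type),
        "top_reasons": by_reason.most_common(top_n),
        "events": newest_first,
    }
```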

Write Operations (DANGER — Requires User Confirmation)

kube-medic is read-only by default. When you determine a fix is needed, you MUST:

Show the user the exact command you want to run
Explain what it will do and any risks
Wait for explicit confirmation ("yes", "do it", "go ahead")
Only then use confirm_write to execute

Example flow:

You: Based on the triage, deployment `my-app` revision 5 introduced a broken image.
     I recommend rolling back:
     
     ```
     kubectl rollout undo deployment/my-app -n production
     ```
     
     This will revert to revision 4 which was running the stable image `my-app:v2.3.1`.
     Shall I proceed?

User: Yes, do it.

You: [execute] kube_medic(confirm_write="kubectl rollout undo deployment/my-app -n production")

Allowed write commands:

kubectl rollout undo ... — Rollback a deployment
kubectl rollout restart ... — Restart pods in a deployment
kubectl scale ... — Scale a deployment
kubectl delete pod ... — Delete a specific pod (to force restart)
kubectl cordon ... / kubectl uncordon ... — Drain management

NEVER execute write commands without user approval. NEVER run kubectl exec.
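To make the allowlist idea concrete, the sketch below shows one way such a gate could be written. It is not the validation code in scripts/kube-medic.sh (which is a shell script); `validate_write_command`, `ALLOWED_PREFIXES`, and the metacharacter set are assumptions made for this example.

```python
import shlex

# Command prefixes taken from the allowed write commands listed above.
ALLOWED_PREFIXES = [
    ("kubectl", "rollout", "undo"),
    ("kubectl", "rollout", "restart"),
    ("kubectl", "scale"),
    ("kubectl", "delete", "pod"),
    ("kubectl", "cordon"),
    ("kubectl", "uncordon"),
]
# Hypothetical set of characters rejected outright to rule out shell injection.
SHELL_METACHARACTERS = set(";|&$`<>(){}\\\n")


def validate_write_command(command: str) -> list:
    """Return the parsed argv if the command is allowlisted; raise ValueError otherwise."""
    if any(ch in SHELL_METACHARACTERS for ch in command):
        raise ValueError("shell metacharacters are not allowed")
    argv = shlex.split(command)
    if not any(tuple(argv[: len(prefix)]) == prefix for prefix in ALLOWED_PREFIXES):
        raise ValueError("command is not on the write allowlist")
    return argv
```

Because `kubectl exec` matches no prefix, it is rejected without a special case. A prefix check alone does not constrain the remaining flags, so a real gate would likely validate those as well; the returned list is meant to be executed as an argument vector rather than through a shell.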

Multi-Cluster Support

When the user manages multiple clusters, always ask which context to use or let them specify one with --context. You can help them list contexts (for example, kubectl config get-contexts shows the contexts in their kubeconfig) and then ask:

"Which cluster would you like me to check? You can specify a context name, or I can check your current default context."

Error Handling

RBAC errors: If a command returns a permission error, tell the user which permission is missing and suggest the RBAC role/clusterrole they need.
kubectl not found: Direct them to https://kubernetes.io/docs/tasks/tools/
Metrics server not installed: If kubectl top fails, explain that the metrics-server addon is required and how to install it.
Connection errors: Suggest checking kubeconfig, VPN, or cluster status.
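The mapping from error to guidance can be pictured as a simple lookup on the error text. The substrings below are assumptions based on commonly seen kubectl and shell messages, not strings confirmed from kube-medic's output, and `suggest_for_error` is a hypothetical helper.

```python
def suggest_for_error(stderr: str) -> str:
    """Map an error message to the guidance category described above."""
    text = stderr.lower()
    if "forbidden" in text:
        return "RBAC: name the missing permission and the Role/ClusterRole that grants it"
    if "command not found" in text:
        return "kubectl not found: see https://kubernetes.io/docs/tasks/tools/"
    if "metrics api not available" in text:
        return "Metrics: the metrics-server addon is required for kubectl top"
    if "unable to connect to the server" in text or "connection refused" in text:
        return "Connection: check kubeconfig, VPN, or cluster status"
    return "Unrecognized error: show the raw message to the user"
```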

Smart Context Management for Large Clusters

When dealing with large clusters (many pods, many namespaces):

The sweep command already filters to non-Running pods and recent warning events
For events, the output is capped at 100 most recent
For resources, top consumers are limited to top 20
Suggest the user narrow with --namespace if output is overwhelming

Triage Workflow

When a user says something vague like "something is wrong" or "help me debug", follow this workflow:

Start with sweep — get the big picture
Identify the most critical issues — CrashLoopBackOff pods, NotReady nodes, failed deployments
Deep-dive with pod — autopsy the most suspicious pods
Check resources — is this a resource exhaustion issue?
Check events — what changed recently that might have caused this?
Correlate and diagnose — connect all the data into a coherent explanation
Recommend specific actions — with exact commands the user can approve

Discord v2 Delivery Mode (OpenClaw v2026.2.14+)

When the conversation is happening in a Discord channel:

Send a compact triage summary first (cluster health, top impacted workload, top 3 findings), then ask if the user wants the full dump.
Keep the first response under ~1200 characters and avoid wide tables in the first message.
If Discord components are available, include quick actions:
- Run Full Sweep
- Pod Autopsy
- Show Recent Warning Events
If components are not available, provide the same follow-ups as a numbered list.
Prefer short follow-up chunks (<=15 lines per message) for long event/log outputs.
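Splitting long output into follow-up messages of at most 15 lines can be sketched as below. `chunk_lines` is a hypothetical helper; how messages are actually sent to Discord is outside this example.

```python
def chunk_lines(text: str, max_lines: int = 15) -> list:
    """Split text into consecutive chunks of at most max_lines lines each."""
    lines = text.splitlines()
    return ["\n".join(lines[i : i + max_lines]) for i in range(0, len(lines), max_lines)]
```

Each returned chunk can then be sent as its own message, in order.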

Output Format

All tool output is structured JSON. Parse it and present findings in clear, actionable Markdown. Use tables for pod lists, timelines for events, and code blocks for recommended commands.

Always end your triage reports with:

Powered by Anvil AI 🏥

Security Recommendations

What to consider before installing:

- Functional fit: kube-medic legitimately needs kubectl and jq and access to your kubeconfig/context. The registry metadata incorrectly states "no required binaries"; expect to have kubectl and jq installed and configured.
- Sensitive data: the skill fetches pod logs and cluster events and passes them to the LLM for analysis. Logs can contain PII, secrets, or internal URLs. Restrict pods/log permission with RBAC, use the --namespace flag to scope queries, and avoid running the skill against production until you are comfortable.
- RBAC practice: start with a read-only ClusterRole (the SECURITY.md provides one). Only bind the write ClusterRole if you trust the write allowlist and the implementation. Prefer testing in a disposable cluster (kind/minikube) first.
- Review the code: because the runtime is a shell script included in the skill, review scripts/kube-medic.sh in full with attention to the confirm-write parsing/validation. The provided materials claim a strict allowlist and no shell-eval of user input, but the script was truncated in the supplied files so this could not be fully confirmed.
- Validation steps: (1) Inspect the remaining script content, especially the confirm-write branch and any command-building logic. (2) Run the tool in a non-production test cluster with minimal permissions. (3) Verify that outputs never echo kubeconfig paths or token values in your tests.

If you want, provide the rest of scripts/kube-medic.sh (the truncated portion) or ask me to look specifically at the confirm-write code path and I will re-evaluate; that could raise confidence to high or allow a benign verdict.

Functional Analysis

Type: OpenClaw Skill
Name: kube-medic
Version: 1.0.3

The OpenClaw AgentSkills bundle 'kube-medic' is classified as benign. The `SKILL.md` explicitly instructs the AI agent to require user confirmation for all write operations and to never execute `kubectl exec`. The core script `scripts/kube-medic.sh` robustly implements these security controls, including `set -euo pipefail`, extensive use of `jq --arg` for safe JSON output, double-quoting of all variables in `kubectl` commands, and a highly secure `cmd_confirm_write` function that blocks shell metacharacters, parses commands into arrays, and enforces a strict allowlist for approved `kubectl` write operations while explicitly forbidding `kubectl exec`. The `SECURITY.md` and `TESTING.md` further confirm a strong, well-thought-out security posture.

Capability Assessment

ℹ Purpose & Capability

The skill's name/description (Kubernetes triage via kubectl) matches the included script and SKILL.md: the tool runs kubectl queries for nodes, pods, events, logs, and deployments and correlates the results. However, the registry metadata at the top lists "Required binaries: none" while the SKILL.md and scripts clearly depend on kubectl and jq. This inconsistency is small but important: the skill will not work unless kubectl and jq are installed and kubectl is configured, and once they are, it will access your cluster.

⚠ Instruction Scope

The runtime instructions tell the agent to run the packaged shell script which executes many kubectl read operations (get, describe, logs, top, events) and returns structured JSON. That behavior is expected for diagnostics, but it means the agent will fetch pod logs and other cluster state which may contain sensitive data (PII, secrets, tokens in logs). The SKILL.md/SECURITY.md acknowledge this, but the instructions do not enforce automatic redaction — they rely on the user to scope RBAC or namespaces. Also the script accepts a --confirm-write string which will be executed after validation; we could not fully review the write-validation code because the provided script is truncated, so there remains uncertainty about whether the write gate is implemented correctly in every edge case.

✓ Install Mechanism

This is an instruction-only skill with an included shell script; there is no install spec that downloads remote code. That keeps disk/write risk low. The script itself is included in the bundle (scripts/kube-medic.sh) so you can review it locally before running.

ℹ Credentials

The skill requests no environment variables or external credentials in registry metadata, which is appropriate. It does, however, rely on the user's existing kubectl configuration (kubeconfig) to access clusters — which implicitly grants access to cluster credentials. This is expected for a kubectl-based tool but is high-scope: the skill will use whatever kubectl context the agent is running under. No extra unrelated secrets are requested, which is good.

✓ Persistence & Privilege

The skill is not marked always:true and is user-invocable only. It does support write operations gated behind --confirm-write and an allowlist, which is an appropriate model. Because we could not inspect the full confirm-write validation logic in the truncated script, you should verify the allowlist parsing and that there is no injection path before granting write-level RBAC to this skill.

Version History

v1.0.3

Rebrand to Anvil AI. Remove CacheForge marketing copy. Normalize install commands.

v1.0.2

Docs: normalize CacheForge footer and CTA.

v1.0.1

Launch: CacheForge wave 2. Discord v2 delivery, security hardened, production-grade.

Metadata

Slug kube-medic

Version 1.0.3

License (none listed)

Total installs 9

Current installs 8

Versions released 3

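FAQ

What is kube-medic?

Kubernetes Cluster Triage & Diagnostics: instant AI-powered incident triage via kubectl. It is an AI Agent Skill plugin for Claude Code / OpenClaw and has been downloaded 1013 times to date.

How do I install kube-medic?

Run the command "/install kube-medic" in the OpenClaw or Claude Code chat box to install it in one step, with no additional configuration.

Is kube-medic free?

Yes. kube-medic is completely free (free and open source) and can be freely downloaded, installed, and used.

Which platforms does kube-medic support?

kube-medic is cross-platform and can be used in any environment where OpenClaw / Claude Code is deployed.

Who developed kube-medic?

It is developed and maintained by Todd Kuehnl (@tkuehnl). The current version is v1.0.3.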
