K8s Debug

Author: qq280948982 · GitHub · v0.1.0 · MIT-0
cross-platform ⚠ suspicious
Total downloads: 407
Favorites: 0
Current installs: 3
Versions: 1
Install in OpenClaw
/install k8s-debug
Description
Diagnose and fix Kubernetes pods, CrashLoopBackOff, Pending, DNS, networking, storage, and rollout failures with kubectl.
Usage (SKILL.md)

Kubernetes Debugging Skill

Overview

Systematic toolkit for debugging Kubernetes clusters, workloads, networking, and storage with a deterministic, safety-first workflow.

Trigger Phrases

Use this skill when requests resemble:

  • "My pod is in CrashLoopBackOff; help me find the root cause."
  • "Service DNS works in one pod but not another."
  • "Deployment rollout is stuck."
  • "Pods are Pending and not scheduling."
  • "Cluster health looks degraded after a change."
  • "PVC is pending and pods cannot mount storage."

Prerequisites

Run from the skill directory (devops-skills-plugin/skills/k8s-debug) so relative script paths work as written.

Required

  • kubectl installed and configured.
  • An active cluster context.
  • Read access to namespaces, pods, events, services, and nodes.

Quick preflight:

kubectl config current-context
kubectl auth can-i get pods -A
kubectl auth can-i get events -A
kubectl get ns
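
When the preflight has to run before every session, the same four checks can be scripted. The following is a minimal Python sketch and is not part of the skill's bundled scripts; `run_preflight` and `run_command` are hypothetical names, and the injectable `runner` argument is an assumption made so the logic can be exercised without a cluster.

```python
import subprocess
from typing import Callable, List, Sequence, Tuple

# The same four read-only checks listed in the quick preflight.
PREFLIGHT_COMMANDS: List[List[str]] = [
    ["kubectl", "config", "current-context"],
    ["kubectl", "auth", "can-i", "get", "pods", "-A"],
    ["kubectl", "auth", "can-i", "get", "events", "-A"],
    ["kubectl", "get", "ns"],
]


def run_command(argv: Sequence[str]) -> int:
    """Run one command and return its exit code (127 if the binary is missing)."""
    try:
        return subprocess.run(list(argv), capture_output=True, text=True).returncode
    except FileNotFoundError:
        return 127


def run_preflight(
    runner: Callable[[Sequence[str]], int] = run_command,
) -> List[Tuple[str, bool]]:
    """Run every preflight command and report (command, passed) pairs."""
    return [(" ".join(cmd), runner(cmd) == 0) for cmd in PREFLIGHT_COMMANDS]
```

Calling `run_preflight()` with no arguments invokes the real `kubectl`; any pair whose second element is `False` points at the access or context problem to fix first.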

Optional but Recommended

  • jq for more precise filtering in ./scripts/cluster_health.sh.
  • Metrics API (metrics-server) for kubectl top.
  • In-container debug tools (nslookup, getent, curl, wget, ip) for deep network tests.

Fallback behavior:

  • If optional tools are missing, scripts continue and print warnings with reduced output.
  • If kubectl top is unavailable, continue with kubectl describe and events.

When to Use This Skill

Use this skill for:

  • Pod failures (CrashLoopBackOff, ImagePullBackOff, Pending, OOMKilled)
  • Service connectivity or DNS resolution issues
  • Network policy or ingress problems
  • Volume and storage mount failures
  • Deployment rollout issues
  • Cluster health or performance degradation
  • Resource exhaustion (CPU/memory)
  • Configuration problems (ConfigMaps, Secrets, RBAC)

Safety Rules for Disruptive Commands

Default mode is read-only diagnosis first. Only execute disruptive commands after confirming blast radius and rollback.

Commands requiring explicit confirmation:

  • kubectl delete pod ... --force --grace-period=0
  • kubectl drain ...
  • kubectl rollout restart ...
  • kubectl rollout undo ...
  • kubectl debug ... --copy-to=...
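
An agent or wrapper can enforce this list mechanically before running anything. The sketch below is one possible gate, assuming the kubectl verb directly follows `kubectl` (global flags placed before the verb are not handled); `needs_confirmation` is a hypothetical helper, not something the skill provides.

```python
import shlex


def needs_confirmation(command: str) -> bool:
    """Return True if the command matches one of the disruptive patterns listed above."""
    tokens = shlex.split(command)
    if not tokens or tokens[0] != "kubectl":
        return False
    args = tokens[1:]
    if args[:1] == ["drain"]:
        return True
    if args[:2] in (["rollout", "restart"], ["rollout", "undo"]):
        return True
    # Only forced pod deletion is on the list; a plain delete is not matched here.
    if args[:2] == ["delete", "pod"] and "--force" in args:
        return True
    if args[:1] == ["debug"] and any(
        a == "--copy-to" or a.startswith("--copy-to=") for a in args
    ):
        return True
    return False
```

The gate is deliberately narrow: it mirrors the five listed patterns and nothing else, so extending the list means extending the function.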

Before disruptive actions:

# Snapshot current state for rollback and incident notes
kubectl get deploy,rs,pod,svc -n <namespace> -o wide
kubectl get pod <pod-name> -n <namespace> -o yaml > before-<pod-name>.yaml
kubectl get events -n <namespace> --sort-by='.lastTimestamp' > before-events.txt

Reference Navigation Map

Load only the section needed for the observed symptom.

| Symptom / Need | Open | Start section |
|---|---|---|
| You need an end-to-end diagnosis path | ./references/troubleshooting_workflow.md | General Debugging Workflow |
| Pod state is Pending, CrashLoopBackOff, or ImagePullBackOff | ./references/troubleshooting_workflow.md | Pod Lifecycle Troubleshooting |
| Service reachability or DNS failure | ./references/troubleshooting_workflow.md | Network Troubleshooting Workflow |
| Node pressure or performance regression | ./references/troubleshooting_workflow.md | Resource and Performance Workflow |
| PVC / PV / storage class issues | ./references/troubleshooting_workflow.md | Storage Troubleshooting Workflow |
| Quick symptom-to-fix lookup | ./references/common_issues.md | matching issue heading |
| Post-mortem fix options for known issues | ./references/common_issues.md | Solutions sections |
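
The navigation map can also be kept as data so that a tool loads only the one section it needs. This is a small illustrative sketch; the short keys such as `"network"` and the `reference_for` helper are hypothetical, while the file paths and section names are taken from the map.

```python
from typing import Dict, Tuple

WORKFLOW = "./references/troubleshooting_workflow.md"
COMMON = "./references/common_issues.md"

# Hypothetical short keys mapped to (file to open, section to start at).
NAVIGATION: Dict[str, Tuple[str, str]] = {
    "end-to-end": (WORKFLOW, "General Debugging Workflow"),
    "pod-state": (WORKFLOW, "Pod Lifecycle Troubleshooting"),
    "network": (WORKFLOW, "Network Troubleshooting Workflow"),
    "performance": (WORKFLOW, "Resource and Performance Workflow"),
    "storage": (WORKFLOW, "Storage Troubleshooting Workflow"),
    "quick-lookup": (COMMON, "matching issue heading"),
    "post-mortem": (COMMON, "Solutions sections"),
}


def reference_for(symptom_key: str) -> Tuple[str, str]:
    """Return the (file, section) pair for a symptom key; raises KeyError if unknown."""
    return NAVIGATION[symptom_key]
```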

Scripts Overview

| Script | Purpose | Required args | Optional args | Output | Fallback behavior |
|---|---|---|---|---|---|
| ./scripts/cluster_health.sh | Cluster-wide health snapshot (nodes, workloads, events, common failure states) | None | --strict, K8S_REQUEST_TIMEOUT env var | Sectioned report to stdout | Continues on check failures, tracks them in summary and exit code |
| ./scripts/network_debug.sh | Pod-centric network and DNS diagnostics | <pod-name> (<namespace> defaults to default) | --strict, --insecure, K8S_REQUEST_TIMEOUT env var | Sectioned report to stdout | Uses secure API probe by default; insecure TLS requires explicit --insecure |
| ./scripts/pod_diagnostics.py | Deep pod diagnostics (status, describe, YAML, events, per-container logs, node context) | <pod-name> | -n/--namespace, -o/--output | Sectioned report to stdout or file | Fails fast on missing access; skips optional metrics/log blocks with clear messages |

Script Exit Codes

./scripts/cluster_health.sh and ./scripts/network_debug.sh share the same contract:

  • 0: checks completed with no check failures (warnings allowed unless --strict is set).
  • 1: one or more checks failed, or warnings occurred in --strict mode.
  • 2: blocked preconditions (for example: missing kubectl, no active context, inaccessible namespace/pod).
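
A caller that wraps the two shell scripts can branch on this contract. The sketch below only encodes the three documented codes; the function names and the wording of the messages are illustrative assumptions.

```python
# Meanings paraphrased from the documented exit-code contract.
EXIT_CODE_MEANINGS = {
    0: "completed: no check failures",
    1: "failed: one or more checks failed, or warnings occurred in --strict mode",
    2: "blocked: preconditions not met (kubectl, context, or namespace/pod access)",
}


def interpret_exit_code(code: int) -> str:
    """Map a script exit code to a short human-readable meaning."""
    return EXIT_CODE_MEANINGS.get(code, f"unexpected exit code {code}")


def is_blocked(code: int) -> bool:
    """True when the script could not run its checks at all (exit code 2)."""
    return code == 2
```

Separating "blocked" from "failed" matters in practice: a 2 means the diagnosis never ran, so the next step is fixing access or context rather than reading the report.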

Deterministic Debugging Workflow

Follow this systematic approach for any Kubernetes issue:

1. Preflight and Scope

kubectl config current-context
kubectl get ns
kubectl auth can-i get pods -n <namespace>

If preflight fails, stop and fix access/context first.

2. Identify the Problem Layer

Categorize the issue:

  • Application Layer: Application crashes, errors, bugs
  • Pod Layer: Pod not starting, restarting, or pending
  • Service Layer: Network connectivity, DNS issues
  • Node Layer: Node not ready, resource exhaustion
  • Cluster Layer: Control plane issues, API problems
  • Storage Layer: Volume mount failures, PVC issues
  • Configuration Layer: ConfigMap, Secret, RBAC issues

3. Gather Diagnostics with the Right Script

Use the appropriate diagnostic script based on scope:

Pod-Level Diagnostics

Use ./scripts/pod_diagnostics.py for comprehensive pod analysis:

python3 ./scripts/pod_diagnostics.py <pod-name> -n <namespace>

This script gathers:

  • Pod status and description
  • Pod events
  • Container logs (current and previous)
  • Resource usage
  • Node information
  • YAML configuration

Output can be saved for analysis:

python3 ./scripts/pod_diagnostics.py <pod-name> -n <namespace> -o diagnostics.txt
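
If the script is launched from another program rather than typed by hand, building the argument list explicitly keeps pod names out of any shell string. A minimal sketch, assuming only the `-n` and `-o` flags documented here; `pod_diagnostics_argv` is a hypothetical helper.

```python
from typing import List, Optional


def pod_diagnostics_argv(
    pod_name: str,
    namespace: Optional[str] = None,
    output: Optional[str] = None,
) -> List[str]:
    """Build the argv list for ./scripts/pod_diagnostics.py without using a shell."""
    argv = ["python3", "./scripts/pod_diagnostics.py", pod_name]
    if namespace:
        argv += ["-n", namespace]
    if output:
        argv += ["-o", output]
    return argv
```

The resulting list can be handed to `subprocess.run(argv)` from the skill directory; because no shell is involved, characters in the pod name are passed through as a single argument.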

Cluster-Level Health Check

Use ./scripts/cluster_health.sh for overall cluster diagnostics:

./scripts/cluster_health.sh > cluster-health-$(date +%Y%m%d-%H%M%S).txt

This script checks:

  • Cluster info and version
  • Node status and resources
  • Pods across all namespaces
  • Failed/pending pods
  • Recent events
  • Deployments, services, statefulsets, daemonsets
  • PVCs and PVs
  • Component health
  • Common error states (CrashLoopBackOff, ImagePullBackOff)
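
To illustrate what a check for common error states involves, the sketch below filters the JSON from `kubectl get pods -A -o json` for containers waiting in CrashLoopBackOff or ImagePullBackOff. It is not the implementation used by cluster_health.sh, which is not shown here; `pods_in_error_states` is a hypothetical name.

```python
from typing import Any, Dict, List, Tuple

ERROR_REASONS = {"CrashLoopBackOff", "ImagePullBackOff"}


def pods_in_error_states(pod_list: Dict[str, Any]) -> List[Tuple[str, str, str]]:
    """Return (namespace, pod, reason) for containers waiting in a known error state."""
    hits = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        for status in pod.get("status", {}).get("containerStatuses", []):
            # state.waiting.reason is set while the container is not running.
            reason = status.get("state", {}).get("waiting", {}).get("reason")
            if reason in ERROR_REASONS:
                hits.append((meta.get("namespace", ""), meta.get("name", ""), reason))
    return hits
```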

Network Diagnostics

Use ./scripts/network_debug.sh for connectivity issues:

./scripts/network_debug.sh <namespace> <pod-name>
# Or treat warnings as failures (--strict), or allow insecure TLS (--insecure) only when explicitly needed:
./scripts/network_debug.sh --strict <namespace> <pod-name>
./scripts/network_debug.sh --insecure <namespace> <pod-name>

This script analyzes:

  • Pod network configuration
  • DNS setup and resolution
  • Service endpoints
  • Network policies
  • Connectivity tests
  • CoreDNS logs

4. Follow Issue-Specific Reference Workflow

Based on the identified issue, consult ./references/troubleshooting_workflow.md:

  • Pod Pending: Resource/scheduling workflow
  • CrashLoopBackOff: Application crash workflow
  • ImagePullBackOff: Image pull workflow
  • Service issues: Network connectivity workflow
  • DNS failures: DNS troubleshooting workflow
  • Resource exhaustion: Performance investigation workflow
  • Storage issues: PVC binding workflow
  • Deployment stuck: Rollout workflow

5. Apply Targeted Fixes

Refer to ./references/common_issues.md for symptom-specific fixes.

6. Verify and Close

Run final verification:

kubectl get pods -n <namespace> -o wide
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20
kubectl rollout status deployment/<name> -n <namespace>

The issue is resolved when user-visible behavior is healthy and no new critical warning events appear.
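
The "no new warning events" condition can be checked against the JSON form of the events list (`kubectl get events -n <namespace> -o json`). The sketch below is one way to do that, assuming events carry a `lastTimestamp` in the usual UTC `YYYY-MM-DDTHH:MM:SSZ` form so that string comparison orders them correctly; events without that field are skipped. `warning_events_since` is a hypothetical helper.

```python
from typing import Any, Dict, List


def warning_events_since(event_list: Dict[str, Any], since: str) -> List[str]:
    """Return 'reason: message' for Warning events last seen at or after `since`."""
    found = []
    for item in event_list.get("items", []):
        if item.get("type") != "Warning":
            continue
        last_seen = item.get("lastTimestamp") or ""
        # Same-format UTC timestamps compare correctly as plain strings.
        if last_seen >= since:
            found.append(f"{item.get('reason', '')}: {item.get('message', '')}")
    return found
```

Passing the time the fix was applied as `since` gives an empty list when nothing new has gone wrong.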

Example Flows

Example 1: CrashLoopBackOff in payments Namespace

python3 ./scripts/pod_diagnostics.py payments-api-7c97f95dfb-q9l7k -n payments -o payments-diagnostics.txt
kubectl logs payments-api-7c97f95dfb-q9l7k -n payments --previous --tail=100
kubectl get deploy payments-api -n payments -o yaml | grep -A 8 livenessProbe

Then open ./references/common_issues.md and apply the CrashLoopBackOff solutions.
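
For a crash loop, the most useful evidence is usually how the previous container instance ended. As a rough sketch, the function below pulls that out of the pod's JSON (`kubectl get pod <pod-name> -n <namespace> -o json`); `summarize_restarts` and the keys of the returned dictionaries are hypothetical, while the status fields read are standard Pod status fields.

```python
from typing import Any, Dict, List


def summarize_restarts(pod: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Summarize restart count and last termination details for each container."""
    summary = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        terminated = status.get("lastState", {}).get("terminated", {})
        waiting = status.get("state", {}).get("waiting", {})
        summary.append({
            "container": status.get("name"),
            "restarts": status.get("restartCount", 0),
            "waiting_reason": waiting.get("reason"),
            "last_exit_reason": terminated.get("reason"),
            "last_exit_code": terminated.get("exitCode"),
        })
    return summary
```

A `last_exit_reason` of OOMKilled points toward memory limits, while a non-zero exit code with reason Error points back at the application logs from `--previous`.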

Example 2: Service DNS/Connectivity Failure

./scripts/network_debug.sh checkout checkout-api-75f49c9d8f-z6qtm
kubectl get svc checkout-api -n checkout
kubectl get endpoints checkout-api -n checkout
kubectl get networkpolicies -n checkout

Then follow Service Connectivity Workflow in ./references/troubleshooting_workflow.md.
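
A Service with no ready endpoints is a common cause of this symptom, and it shows up directly in the Endpoints object. The following sketch counts ready and not-ready addresses from `kubectl get endpoints <service-name> -n <namespace> -o json`; `count_endpoint_addresses` is a hypothetical helper.

```python
from typing import Any, Dict, Tuple


def count_endpoint_addresses(endpoints: Dict[str, Any]) -> Tuple[int, int]:
    """Return (ready, not_ready) address counts across all subsets."""
    ready = 0
    not_ready = 0
    # "subsets" is absent when no pod matches the Service selector.
    for subset in endpoints.get("subsets") or []:
        ready += len(subset.get("addresses") or [])
        not_ready += len(subset.get("notReadyAddresses") or [])
    return ready, not_ready
```

A result of `(0, 0)` suggests a selector mismatch, whereas `(0, n)` suggests the backing pods exist but are failing readiness.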

Essential Manual Commands

Pod Debugging

# View pod status
kubectl get pods -n <namespace> -o wide

# Detailed pod information
kubectl describe pod <pod-name> -n <namespace>

# View logs
kubectl logs <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous  # Previous container
kubectl logs <pod-name> -n <namespace> -c <container>  # Specific container

# Execute commands in pod
kubectl exec <pod-name> -n <namespace> -it -- /bin/sh

# Get pod YAML
kubectl get pod <pod-name> -n <namespace> -o yaml

Service and Network Debugging

# Check services
kubectl get svc -n <namespace>
kubectl describe svc <service-name> -n <namespace>

# Check endpoints
kubectl get endpoints -n <namespace>

# Test DNS
kubectl exec <pod-name> -n <namespace> -- nslookup kubernetes.default

# View events
kubectl get events -n <namespace> --sort-by='.lastTimestamp'

Resource Monitoring

# Node resources
kubectl top nodes
kubectl describe nodes

# Pod resources
kubectl top pods -n <namespace>
kubectl top pod <pod-name> -n <namespace> --containers
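
To compare usage numerically, the text output of `kubectl top pods` has to be converted from quantities such as `250m` and `128Mi`. The sketch below assumes the three-column layout (name, CPU, memory) printed without `--containers` and handles only the common suffixes; the function names are hypothetical.

```python
from typing import List, Tuple

_MEMORY_UNITS = {"Ki": 1024, "Mi": 1024 ** 2, "Gi": 1024 ** 3, "Ti": 1024 ** 4}


def cpu_to_millicores(value: str) -> int:
    """Convert a CPU quantity such as '250m' or '1' to millicores."""
    if value.endswith("m"):
        return int(value[:-1])
    return int(round(float(value) * 1000))


def memory_to_bytes(value: str) -> int:
    """Convert a memory quantity such as '128Mi' to bytes."""
    for suffix, factor in _MEMORY_UNITS.items():
        if value.endswith(suffix):
            return int(value[: -len(suffix)]) * factor
    return int(value)


def parse_top_pods(output: str) -> List[Tuple[str, int, int]]:
    """Parse 'kubectl top pods' text into (pod, millicores, bytes) rows."""
    rows = []
    for line in output.strip().splitlines()[1:]:  # skip the header row
        name, cpu, memory = line.split()[:3]
        rows.append((name, cpu_to_millicores(cpu), memory_to_bytes(memory)))
    return rows
```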

Emergency Operations

# Restart deployment
kubectl rollout restart deployment/<name> -n <namespace>

# Rollback deployment
kubectl rollout undo deployment/<name> -n <namespace>

# Force delete stuck pod
kubectl delete pod <pod-name> -n <namespace> --force --grace-period=0

# Drain node (maintenance)
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# Cordon node (prevent scheduling)
kubectl cordon <node-name>

Completion Criteria

Troubleshooting session is complete when all are true:

  • Cluster context and namespace are confirmed.
  • Relevant diagnostic script output is captured.
  • Root cause is identified and tied to evidence (events/logs/config/state).
  • Any disruptive action was preceded by snapshot and rollback plan.
  • Fix verification commands show healthy state.
  • Reference path used (./references/troubleshooting_workflow.md or ./references/common_issues.md) is documented in notes.

Related Tools

Useful additional tools for Kubernetes debugging:

  • kubectl-debug: Advanced debugging plugin
  • stern: Multi-pod log tailing
  • kubectx/kubens: Context and namespace switching
  • k9s: Terminal UI for Kubernetes
  • lens: Desktop IDE for Kubernetes
  • Prometheus/Grafana: Monitoring and alerting
  • Jaeger/Zipkin: Distributed tracing
Security Recommendations
This skill appears coherent for Kubernetes troubleshooting. Before installing or running it: ensure you trust the author (source unknown), verify you have a correct kubectl context, and review the scripts locally. Be aware the network diagnostic script may kubectl exec into pods and read the container's serviceaccount token to test API access — that is normal for in-pod API probes but is sensitive data. Disruptive commands (delete, drain, rollout undo/restart) are present in examples and marked as requiring explicit confirmation; do not run those without understanding blast radius and having backups/rollbacks available.
Capability Analysis
Type: OpenClaw Skill
Name: k8s-debug
Version: 0.1.0
The k8s-debug skill bundle provides comprehensive Kubernetes diagnostic capabilities but contains high-risk behaviors and security vulnerabilities. The scripts `scripts/network_debug.sh` and `scripts/pod_diagnostics.py` perform sensitive operations including executing arbitrary commands inside pods via `kubectl exec` and reading service account tokens to probe the Kubernetes API. Additionally, `scripts/network_debug.sh` and `scripts/cluster_health.sh` are vulnerable to shell injection because they interpolate variables like `$POD_NAME` directly into command strings executed via `bash -c` in the `run_pipe_or_warn` function. While these actions are plausibly intended for debugging, the combination of high-privilege access and lack of input sanitization poses a significant risk.
Capability Assessment
Purpose & Capability
The name/description match the included scripts and reference docs. The files implement kubectl-driven cluster, network, and pod diagnostics (cluster_health.sh, network_debug.sh, pod_diagnostics.py) which are appropriate for a K8s debugging skill.
Instruction Scope
SKILL.md instructs the agent to run local scripts and kubectl commands, perform read-only diagnostics by default, and require explicit confirmation for disruptive commands. Scripts operate on local kubectl context and in-cluster pods; they do not contain calls to external servers or instructions to exfiltrate data.
Install Mechanism
No install spec is provided (instruction-only with bundled scripts). That minimizes install-time risk; the skill expects existing tools like kubectl/jq rather than pulling remote binaries.
Credentials
The skill declares no required env vars or credentials, which is appropriate. Note: network_debug.sh may exec into a pod and read the in-pod serviceaccount token (from /var/run/secrets/...) to run authenticated API probes — this is consistent with deep network diagnostics but is sensitive data inside the container. The scripts check RBAC and fall back if exec or tokens are not available.
Persistence & Privilege
always is false and there is no install-time modification of other skills or persistent agent settings. The skill runs on-demand and requires the user's kubectl context to operate.
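How to Use
  1. Make sure OpenClaw is installed (locally or via Docker).
  2. Enter the install command in the chat box: /install k8s-debug
  3. Once installation finishes, invoke the skill by name or trigger it with /k8s-debug.
  4. Provide the inputs described in the skill's parameter documentation to get structured output.
Version History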
v0.1.0
Initial release of k8s-debug, a systematic toolkit for diagnosing and fixing Kubernetes pod, rollout, network, DNS, and storage issues.
  • Provides trigger phrases to detect when Kubernetes debugging is needed.
  • Details prerequisites and safety practices, including restricted use of disruptive kubectl commands.
  • Introduces modular troubleshooting scripts for pod, network, and cluster diagnostics.
  • Includes structured workflows for isolating and resolving common cluster issues.
  • Offers reference navigation for quick symptom-to-solution lookup.
  • Documents expected script usage, arguments, and fallback behaviors.
Metadata
Slug: k8s-debug
Version: 0.1.0
License: MIT-0
Total installs: 3
Current installs: 3
Versions: 1
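Frequently Asked Questions

What is K8s Debug?

Diagnose and fix Kubernetes pods, CrashLoopBackOff, Pending, DNS, networking, storage, and rollout failures with kubectl. It is an AI agent skill plugin for Claude Code / OpenClaw, with 407 total downloads to date.

How do I install K8s Debug?

Run the command "/install k8s-debug" in the OpenClaw or Claude Code chat box to install it in one step; no additional configuration is required.

Is K8s Debug free?

Yes. K8s Debug is completely free and released under the MIT-0 license, so it can be downloaded, installed, and used freely.

Which platforms does K8s Debug support?

K8s Debug is cross-platform and can be used in any environment where OpenClaw / Claude Code is deployed.

Who develops K8s Debug?

It is developed and maintained by qq280948982 (@qq280948982); the current version is v0.1.0.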
