Chaos Test Designer

Author: charlie-morrison · v1.0.0 · MIT-0
Tags: cross-platform · ⚠ suspicious
Total downloads: 30 · Favorites: 0 · Current installs: 0 · Versions: 1
Install in OpenClaw:
/install chaos-test-designer
Description
Design chaos engineering experiments to test system resilience. Generate failure injection scenarios, define steady-state hypotheses, blast radius controls,...
Usage (SKILL.md)

Chaos Test Designer

Design chaos engineering experiments that safely test your system's resilience. Define steady-state hypotheses, inject controlled failures (service crashes, network partitions, resource exhaustion, dependency outages), measure impact, and produce runnable experiment definitions for Chaos Monkey, Litmus, Gremlin, or plain scripts.

Use when: "design chaos test", "test system resilience", "what happens if this service dies", "failure injection", "game day planning", "chaos engineering", "test our fallbacks", or before declaring a service production-ready.

Commands

1. design — Create Chaos Experiment

Step 1: Understand the System

# Discover services and dependencies
kubectl get deployments -A 2>/dev/null | grep -v kube-system
docker compose config --services 2>/dev/null
# Or read architecture docs
find . -maxdepth 3 -name "*.md" -exec grep -li "architecture\|topology\|dependency" {} + 2>/dev/null

Map the dependency graph:

  • Which services call which?
  • What are the single points of failure?
  • Where are the circuit breakers, retries, fallbacks?
  • What external dependencies exist (databases, caches, queues, third-party APIs)?
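
One way to make the answers to these questions checkable is to write the dependency graph down as data. The sketch below is only an illustration: the service names, the `DEPENDS_ON` and `REPLICAS` tables, and the rule "a single-replica service that something else depends on is a single-point-of-failure candidate" are all assumptions, not output of this skill.

```python
from collections import defaultdict

# Hypothetical dependency map: service -> services it calls directly.
DEPENDS_ON = {
    "web": ["orders", "auth"],
    "orders": ["payment", "postgres"],
    "payment": ["postgres"],
    "auth": ["redis"],
    "postgres": [],
    "redis": [],
}

# Hypothetical replica counts per service.
REPLICAS = {"web": 3, "orders": 2, "payment": 3, "auth": 2, "postgres": 1, "redis": 2}


def transitive_dependents(depends_on):
    """Return {service: set of services that reach it directly or indirectly}."""
    dependents = defaultdict(set)
    for caller in depends_on:
        stack = list(depends_on[caller])
        seen = set()
        while stack:
            callee = stack.pop()
            if callee in seen or callee == caller:  # skip repeats and cycles
                continue
            seen.add(callee)
            dependents[callee].add(caller)
            stack.extend(depends_on.get(callee, []))
    return dependents


def spof_candidates(depends_on, replicas):
    """Single-replica services with at least one dependent, most depended-on first."""
    dependents = transitive_dependents(depends_on)
    found = [
        (name, sorted(dependents[name]))
        for name in depends_on
        if replicas.get(name, 1) == 1 and dependents.get(name)
    ]
    return sorted(found, key=lambda item: (-len(item[1]), item[0]))


if __name__ == "__main__":
    for name, callers in spof_candidates(DEPENDS_ON, REPLICAS):
        print(f"{name}: 1 replica, needed by {', '.join(callers)}")
```

Services flagged this way are reasonable first targets for a Level 1 experiment, since losing them affects the most callers.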

Step 2: Define Steady-State Hypothesis

Before breaking anything, define what "normal" looks like:

## Steady-State Hypothesis
- Homepage loads in < 500ms (p95)
- API error rate < 0.1%
- Orders processed within 30 seconds of submission
- Background jobs backlog < 100 items
- All health check endpoints return 200

This is the baseline you'll compare against during the experiment.
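
A hypothesis like this is easier to compare against if it is expressed as thresholds. The following is a minimal sketch under stated assumptions: the metric names, the units, and the `violations` helper are hypothetical, and how the measurements are collected is left out.

```python
import operator

# Hypothetical thresholds mirroring the hypothesis: metric -> (comparison, limit).
STEADY_STATE = {
    "homepage_p95_ms": ("<", 500),
    "api_error_rate": ("<", 0.001),          # 0.1% expressed as a fraction
    "order_processing_s": ("<=", 30),
    "job_backlog": ("<", 100),
    "healthy_endpoints_ratio": ("==", 1.0),  # every health check returned 200
}

OPS = {"<": operator.lt, "<=": operator.le, "==": operator.eq}


def violations(measured, hypothesis=STEADY_STATE):
    """Return descriptions of metrics that are missing or break the hypothesis."""
    broken = []
    for metric, (op, limit) in hypothesis.items():
        value = measured.get(metric)
        if value is None or not OPS[op](value, limit):
            broken.append(f"{metric}: got {value}, expected {op} {limit}")
    return broken


if __name__ == "__main__":
    sample = {
        "homepage_p95_ms": 420,
        "api_error_rate": 0.0004,
        "order_processing_s": 12,
        "job_backlog": 250,
        "healthy_endpoints_ratio": 1.0,
    }
    for line in violations(sample):
        print("VIOLATION", line)
```

Running the same check before, during, and after the experiment gives three comparable snapshots; a missing metric counts as a violation so that gaps in monitoring are not mistaken for health.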

Step 3: Select Failure Mode

Common failure modes ranked by severity:

Level 1 — Service Failures (start here)

  • Kill a single pod/container instance
  • Restart a service with delay
  • Reduce replica count to 1

Level 2 — Network Failures

  • Add latency (100ms, 500ms, 2000ms) to inter-service calls
  • Drop 10% of packets to a specific service
  • DNS resolution failures
  • Block traffic to a specific dependency

Level 3 — Resource Exhaustion

  • Fill disk to 95%
  • Consume all available memory (OOM scenarios)
  • Saturate CPU
  • Exhaust database connection pool
  • Fill message queue to capacity

Level 4 — Dependency Failures

  • External API returns 500 for all requests
  • Database becomes read-only
  • Cache becomes unavailable
  • Message broker stops accepting messages

Level 5 — Infrastructure Failures (advanced)

  • Availability zone failure (kill all resources in one AZ)
  • Region failover
  • Complete network partition between services

Step 4: Generate Experiment Definition

# Chaos Experiment: [Service] [Failure Type]
# Generated by chaos-test-designer

experiment:
  name: "payment-service-pod-kill"
  description: "Kill payment service pod to verify retry logic and circuit breaker"
  
  steady_state:
    - probe: http
      url: "http://payment-service:8080/health"
      expect_status: 200
    - probe: prometheus
      query: "rate(http_requests_total{service='payment',status=~'5..'}[1m])"
      expect: "< 0.01"
  
  method:
    - action: kill-pod
      target:
        namespace: production
        label_selector: "app=payment-service"
      count: 1
      
  rollback:
    - action: scale
      target:
        namespace: production
        deployment: payment-service
      replicas: 3
      
  controls:
    blast_radius: "single pod in production"
    duration: "5 minutes"
    abort_conditions:
      - "error_rate > 5%"
      - "p99_latency > 10s"
    business_hours_only: true
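
The `abort_conditions` strings in this definition are only useful if something evaluates them while the experiment runs. As a rough sketch, assuming ratios are reported as fractions and latencies in seconds, a small parser could look like this; `parse_condition` and `should_abort` are hypothetical helpers, not part of any chaos tool.

```python
import re

# Matches strings such as "error_rate > 5%" or "p99_latency > 10s".
CONDITION = re.compile(r"^\s*(\w+)\s*(<=|>=|<|>)\s*(\d+(?:\.\d+)?)\s*(%|ms|s)?\s*$")

# Assumed conventions: "%" becomes a fraction, "ms" and "s" become seconds.
UNIT_DIVISOR = {"%": 100, "ms": 1000, "s": 1, None: 1}

COMPARE = {
    ">": lambda a, b: a > b,
    ">=": lambda a, b: a >= b,
    "<": lambda a, b: a < b,
    "<=": lambda a, b: a <= b,
}


def parse_condition(text):
    """Turn 'metric OP number[unit]' into (metric, op, normalized limit)."""
    match = CONDITION.match(text)
    if match is None:
        raise ValueError(f"unrecognized abort condition: {text!r}")
    metric, op, number, unit = match.groups()
    return metric, op, float(number) / UNIT_DIVISOR[unit]


def should_abort(conditions, metrics):
    """Return the conditions that are currently true; any hit means abort."""
    hits = []
    for text in conditions:
        metric, op, limit = parse_condition(text)
        # A missing metric counts as a hit: losing visibility is a reason to stop.
        if metric not in metrics or COMPARE[op](metrics[metric], limit):
            hits.append(text)
    return hits


if __name__ == "__main__":
    conditions = ["error_rate > 5%", "p99_latency > 10s"]
    print(should_abort(conditions, {"error_rate": 0.02, "p99_latency": 12.5}))
```

Treating a missing metric as an abort is a deliberate design choice here: an experiment that cannot be observed cannot be shown to be safe.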

For Kubernetes (Litmus):

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: payment-chaos
spec:
  appinfo:
    appns: production
    applabel: app=payment-service
    appkind: deployment
  engineState: active
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "300"
            - name: CHAOS_INTERVAL
              value: "60"
            - name: FORCE
              value: "false"

For plain bash:

#!/usr/bin/env bash
# Chaos: Kill payment-service pod
set -euo pipefail

# PROMETHEUS must hold the Prometheus base URL; fail early if it is unset.
: "${PROMETHEUS:?set PROMETHEUS to the Prometheus base URL}"

# Print the payment-service 5xx rate, or 0 when no 5xx series exist yet.
error_rate() {
  curl -s "$PROMETHEUS/api/v1/query" --data-urlencode \
    'query=sum(rate(http_requests_total{service="payment",status=~"5.."}[1m]))' | \
    python3 -c "import json,sys;r=json.load(sys.stdin)['data']['result'];print(r[0]['value'][1] if r else 0)"
}

echo "📊 Capturing steady state..."
BASELINE_ERROR_RATE=$(error_rate)
echo "Baseline error rate: $BASELINE_ERROR_RATE"

echo "💥 Injecting failure: killing one payment-service pod..."
POD=$(kubectl get pods -l app=payment-service -o jsonpath='{.items[0].metadata.name}')
kubectl delete pod "$POD" --grace-period=0

echo "⏱️  Observing for 5 minutes..."
sleep 300

echo "📊 Measuring impact..."
POST_ERROR_RATE=$(error_rate)
echo "Post-chaos error rate: $POST_ERROR_RATE"

echo "✅ Verifying recovery..."
kubectl get pods -l app=payment-service

2. gameday — Plan a Game Day

Generate a full game day schedule:

  • Pre-game briefing (objectives, safety controls, escalation contacts)
  • Experiment sequence (ordered by risk, with breaks between)
  • Observation assignments (who monitors what dashboard)
  • Go/no-go criteria between experiments
  • Post-game debrief template
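
The schedule itself can be derived mechanically once each experiment has a risk level and a duration. The sketch below assumes a hypothetical backlog and arbitrary default lengths for the briefing and the breaks; it orders experiments from lowest to highest severity level and puts a go/no-go checkpoint after each one.

```python
from datetime import datetime, timedelta

# Hypothetical backlog: (experiment name, severity level 1-5, duration in minutes).
BACKLOG = [
    ("cache-unavailable", 4, 20),
    ("payment-pod-kill", 1, 10),
    ("orders-latency-500ms", 2, 15),
]


def build_schedule(backlog, start, briefing_min=30, break_min=15):
    """Return (start_time, label) rows: briefing, experiments by rising risk, debrief."""
    rows = [(start, "Pre-game briefing")]
    clock = start + timedelta(minutes=briefing_min)
    for name, level, minutes in sorted(backlog, key=lambda e: (e[1], e[0])):
        rows.append((clock, f"Experiment (level {level}): {name}"))
        clock += timedelta(minutes=minutes)
        rows.append((clock, "Break and go/no-go check"))
        clock += timedelta(minutes=break_min)
    rows.append((clock, "Post-game debrief"))
    return rows


if __name__ == "__main__":
    for when, label in build_schedule(BACKLOG, datetime(2024, 1, 15, 10, 0)):
        print(when.strftime("%H:%M"), label)
```

Observer assignments and escalation contacts would be added per row; they are omitted here to keep the sketch short.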

3. audit — Assess Chaos Readiness

Before running chaos experiments, verify the system has:

  • Health check endpoints on every service
  • Monitoring and alerting in place
  • Circuit breakers or retry logic
  • Graceful degradation modes
  • Runbooks for common failures
  • Rollback procedures tested recently

Score readiness 0-100 and recommend prerequisites before first chaos experiment.
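
The text does not say how the 0-100 score is computed. One simple possibility, shown purely as an assumption, is a weighted checklist over the six prerequisites; the weights and check names below are hypothetical.

```python
# Hypothetical weights for the six prerequisites; they sum to 100.
CHECKS = {
    "health_checks": 20,
    "monitoring_alerting": 20,
    "circuit_breakers_or_retries": 15,
    "graceful_degradation": 15,
    "runbooks": 15,
    "rollback_tested_recently": 15,
}


def readiness(passed):
    """Return (score 0-100, prerequisites still missing) for a set of passed checks."""
    unknown = set(passed) - set(CHECKS)
    if unknown:
        raise ValueError(f"unknown checks: {sorted(unknown)}")
    score = sum(weight for name, weight in CHECKS.items() if name in passed)
    missing = [name for name in CHECKS if name not in passed]
    return score, missing


if __name__ == "__main__":
    score, missing = readiness({"health_checks", "monitoring_alerting"})
    print(f"Readiness: {score}/100")
    for name in missing:
        print("Prerequisite before first experiment:", name)
```

The list of missing items doubles as the recommended prerequisites before the first chaos experiment.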

4. report — Analyze Experiment Results

After running an experiment, produce a findings report:

  • Did the steady-state hypothesis hold?
  • What broke? Was it expected?
  • How long until the system self-healed?
  • What was the actual blast radius compared with the expected one?
  • Remediation recommendations (add circuit breaker, fix retry logic, add redundancy)
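
The self-healing question can be answered from the same metric samples used for the steady-state check. As a sketch, assuming samples are `(timestamp_seconds, value)` pairs sorted by time and that "healed" means the metric stays under a chosen threshold until the end of the observation window, a hypothetical `recovery_seconds` helper might look like this:

```python
def recovery_seconds(samples, injected_at, threshold):
    """Seconds from injection until the metric is back under threshold for good.

    Returns 0.0 if the threshold was never reached after injection, or None
    if the metric is still at or above the threshold at the last sample.
    """
    after = [(t, v) for t, v in samples if t >= injected_at]
    last_breach_index = None
    for i, (_, value) in enumerate(after):
        if value >= threshold:
            last_breach_index = i
    if last_breach_index is None:
        return 0.0
    if last_breach_index == len(after) - 1:
        return None
    # The first sample after the final breach marks the recovery.
    return float(after[last_breach_index + 1][0] - injected_at)


if __name__ == "__main__":
    error_rate = [(0, 0.001), (60, 0.001), (120, 0.08), (180, 0.03), (240, 0.002), (300, 0.001)]
    print(recovery_seconds(error_rate, injected_at=100, threshold=0.01))
```

A result of `0.0` means the steady-state hypothesis held for that metric, and `None` means the observation window ended before recovery. The resolution is limited by the sampling interval.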
Safety recommendations
This skill is coherent with chaos-engineering activities, but the package metadata is incomplete and the runtime steps are explicitly destructive. Before installing or enabling it:
  1. Don't allow autonomous runs against production; require human confirmation for any destructive step.
  2. Ensure the agent's kubectl context and service account are restricted to a non-production namespace with least privilege.
  3. Confirm that the required binaries (kubectl, docker compose, curl, python3) and the PROMETHEUS endpoint are declared and provided intentionally.
  4. Test all generated experiments in staging only, and have clear abort procedures and runbooks.
  5. If you cannot verify who operates the agent or the exact cluster targeted, do not enable this skill in environments with sensitive production data.
Functional analysis
Type: OpenClaw Skill
Name: chaos-test-designer
Version: 1.0.0
The skill is designed for chaos engineering and failure injection, which involves inherently risky operations such as terminating Kubernetes pods and simulating network or resource failures. While the behavior aligns with the stated purpose, the broad discovery commands (e.g., 'kubectl get deployments -A' and filesystem searches in SKILL.md) and the generation of destructive shell scripts constitute high-risk capabilities that could be misused. No evidence of malicious intent, such as data exfiltration or unauthorized persistence, was found.
Capability tags
crypto, can-make-purchases
Capability assessment
Purpose & Capability
The described purpose (design chaos experiments) legitimately requires access to Kubernetes, Docker, monitoring, and the ability to run shell tools; however the registry metadata declares no required binaries, env vars, or credentials. That mismatch (metadata says 'none' while instructions call for kubectl, docker compose, curl, python3, and cluster credentials) is inconsistent and surprising.
Instruction Scope
The SKILL.md gives explicit, destructive steps (kubectl delete pod, scaling deployments, killing resources in namespace=production, AZ/region failover scenarios) and commands that read local files (find/grep) and call monitoring endpoints. Those actions are within the domain of chaos engineering but are high-risk; the instructions also reference $PROMETHEUS and assume access to cluster context and privileged service accounts (e.g., litmus-admin) without any safety gates in the metadata.
Install Mechanism
This is an instruction-only skill with no install spec, which minimizes disk-write/install risk. However, lack of install does not remove risk because the runtime instructions invoke external tools and cluster operations.
Credentials
No environment variables or credentials are declared, but the runtime examples reference $PROMETHEUS and implicitly require Kubernetes cluster credentials (kubectl) and possibly privileged service accounts. The skill should declare these requirements and justify them; as-is, it asks for operations that need sensitive permissions but doesn't tell the user what will be needed.
Persistence & Privilege
always is false and autonomous invocation is allowed (the platform default). Autonomous invocation combined with destructive instructions could be dangerous if the agent is allowed to run skills without human confirmation — that risk stems from operational use, not the skill metadata itself.
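How to use
  1. Make sure OpenClaw is installed (locally or via Docker).
  2. Enter the install command in the chat box: /install chaos-test-designer
  3. Once installation finishes, invoke the Skill by name or trigger it with /chaos-test-designer.
  4. Provide the inputs described in the Skill's parameter documentation to get structured output.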
Version history
v1.0.0
Initial release of chaos-test-designer.
  • Design chaos engineering experiments to test system resilience through failure injection.
  • Generate steady-state hypotheses, define blast radius controls, and provide rollback procedures.
  • Supports experiment definitions for tools like Chaos Monkey, Litmus, Gremlin, and plain scripts.
  • Includes commands to design experiments, plan game days, audit system readiness, and report findings.
Metadata
Slug: chaos-test-designer
Version: 1.0.0
License: MIT-0
Total installs: 0
Current installs: 0
Versions: 1
FAQ

What is Chaos Test Designer?

Design chaos engineering experiments to test system resilience. Generate failure injection scenarios, define steady-state hypotheses, blast radius controls,... It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 30 total downloads so far.

How do I install Chaos Test Designer?

Run the command "/install chaos-test-designer" in the OpenClaw or Claude Code chat box to install it in one step; no extra configuration is needed.

Is Chaos Test Designer free?

Yes. Chaos Test Designer is completely free and is released under the MIT-0 license, so it can be downloaded, installed, and used freely.

Which platforms does Chaos Test Designer support?

Chaos Test Designer is cross-platform and can be used in any environment where OpenClaw / Claude Code is deployed.

Who develops Chaos Test Designer?

It is developed and maintained by charlie-morrison (@charlie-morrison). The current version is v1.0.0.
