Description

Design chaos engineering experiments to test system resilience. Generate failure injection scenarios, define steady-state hypotheses, blast radius controls,...

Usage (SKILL.md)

Chaos Test Designer

Name: Chaos Test Designer
Author: charlie-morrison

Design chaos engineering experiments that safely test your system's resilience. Define steady-state hypotheses, inject controlled failures (service crashes, network partitions, resource exhaustion, dependency outages), measure impact, and produce runnable experiment definitions for Chaos Monkey, Litmus, Gremlin, or plain scripts.

Use when: "design chaos test", "test system resilience", "what happens if this service dies", "failure injection", "game day planning", "chaos engineering", "test our fallbacks", or before declaring a service production-ready.

Commands

1. `design` — Create Chaos Experiment

Step 1: Understand the System

# Discover services and dependencies
kubectl get deployments -A 2>/dev/null | grep -v kube-system
docker compose config --services 2>/dev/null
# Or read architecture docs
find . -maxdepth 3 -name "*.md" -exec grep -liE "architecture|topology|dependency" {} + 2>/dev/null

Map the dependency graph:

Which services call which?
What are the single points of failure?
Where are the circuit breakers, retries, fallbacks?
What external dependencies exist (databases, caches, queues, third-party APIs)?
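One way to answer the single-point-of-failure question is to invert the call graph and count how many services sit upstream of each node. The sketch below assumes a hand-written, hypothetical `CALLS` map (the service names are illustrative, not taken from any real system); in practice the map would come from the discovery step above.

```python
from collections import deque

# Hypothetical call graph: each service maps to the dependencies it calls.
CALLS = {
    "web": ["api"],
    "api": ["payment", "catalog", "cache"],
    "payment": ["postgres", "stripe-api"],
    "catalog": ["postgres", "cache"],
    "worker": ["queue", "postgres"],
}


def dependents_of(target, calls):
    """Return every service that directly or transitively calls `target`."""
    # Invert the graph: dependency -> set of direct callers.
    callers = {}
    for service, deps in calls.items():
        for dep in deps:
            callers.setdefault(dep, set()).add(service)

    # Breadth-first walk up the inverted graph.
    seen, pending = set(), deque([target])
    while pending:
        node = pending.popleft()
        for caller in callers.get(node, ()):
            if caller not in seen:
                seen.add(caller)
                pending.append(caller)
    return seen


def rank_by_blast_radius(calls):
    """Sort all nodes by how many services are affected if the node fails."""
    nodes = set(calls) | {d for deps in calls.values() for d in deps}
    return sorted(
        ((n, len(dependents_of(n, calls))) for n in nodes),
        key=lambda pair: (-pair[1], pair[0]),
    )
```

Nodes at the top of `rank_by_blast_radius(CALLS)` are the likeliest single points of failure and deserve the tightest blast radius controls when they become experiment targets.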

Step 2: Define Steady-State Hypothesis

Before breaking anything, define what "normal" looks like:

## Steady-State Hypothesis
- Homepage loads in < 500ms (p95)
- API error rate < 0.1%
- Orders processed within 30 seconds of submission
- Background jobs backlog < 100 items
- All health check endpoints return 200

This is the baseline you'll compare against during the experiment.
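A hypothesis like this is easier to compare against when it is written as data. The following is a minimal sketch, assuming hypothetical metric keys (`homepage_p95_ms`, `api_error_rate_pct`, and so on) that you would map to your own monitoring system; the health-check bullet is modelled here as a count of unhealthy endpoints.

```python
import operator
from dataclasses import dataclass

OPS = {
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "==": operator.eq,
}


@dataclass(frozen=True)
class Threshold:
    metric: str   # hypothetical metric key
    op: str       # one of the keys in OPS
    limit: float


# The five steady-state bullets expressed as thresholds.
STEADY_STATE = [
    Threshold("homepage_p95_ms", "<", 500),
    Threshold("api_error_rate_pct", "<", 0.1),
    Threshold("order_processing_s", "<=", 30),
    Threshold("job_backlog", "<", 100),
    Threshold("unhealthy_endpoints", "==", 0),
]


def violations(observed, thresholds=STEADY_STATE):
    """Return the thresholds that the observed metrics break.

    A missing metric counts as a violation because it cannot be verified.
    """
    broken = []
    for t in thresholds:
        value = observed.get(t.metric)
        if value is None or not OPS[t.op](value, t.limit):
            broken.append(t)
    return broken
```

Running `violations()` once before injection and again during the experiment gives a direct answer to whether the hypothesis held.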

Step 3: Select Failure Mode

Common failure modes ranked by severity:

Level 1 — Service Failures (start here)

Kill a single pod/container instance
Restart a service with delay
Reduce replica count to 1

Level 2 — Network Failures

Add latency (100ms, 500ms, 2000ms) to inter-service calls
Drop 10% of packets to a specific service
DNS resolution failures
Block traffic to a specific dependency

Level 3 — Resource Exhaustion

Fill disk to 95%
Consume all available memory (OOM scenarios)
Saturate CPU
Exhaust database connection pool
Fill message queue to capacity

Level 4 — Dependency Failures

External API returns 500 for all requests
Database becomes read-only
Cache becomes unavailable
Message broker stops accepting messages

Level 5 — Infrastructure Failures (advanced)

Availability zone failure (kill all resources in one AZ)
Region failover
Complete network partition between services

Step 4: Generate Experiment Definition

# Chaos Experiment: [Service] [Failure Type]
# Generated by chaos-test-designer

experiment:
  name: "payment-service-pod-kill"
  description: "Kill payment service pod to verify retry logic and circuit breaker"
  
  steady_state:
    - probe: http
      url: "http://payment-service:8080/health"
      expect_status: 200
    - probe: prometheus
      query: "rate(http_requests_total{service='payment',status=~'5..'}[1m])"
      expect: "< 0.01"
  
  method:
    - action: kill-pod
      target:
        namespace: production
        label_selector: "app=payment-service"
      count: 1
      
  rollback:
    - action: scale
      target:
        namespace: production
        deployment: payment-service
      replicas: 3
      
  controls:
    blast_radius: "single pod in production"
    duration: "5 minutes"
    abort_conditions:
      - "error_rate > 5%"
      - "p99_latency > 10s"
    business_hours_only: true
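The `abort_conditions` strings only protect the system if something evaluates them while the experiment runs. The sketch below is one possible reading of that format, not the behaviour of any particular chaos tool: it assumes `metric op number[unit]`, that percentages are supplied as percentages, and that latencies are normalised to seconds. The function names are hypothetical.

```python
import re

# Matches strings such as "error_rate > 5%" or "p99_latency > 10s".
CONDITION = re.compile(r"^\s*(\w+)\s*(>=|<=|>|<)\s*([0-9.]+)\s*(%|ms|s)?\s*$")

COMPARE = {
    ">": lambda a, b: a > b,
    ">=": lambda a, b: a >= b,
    "<": lambda a, b: a < b,
    "<=": lambda a, b: a <= b,
}


def parse_condition(text):
    """Split an abort condition into (metric, operator, numeric limit)."""
    match = CONDITION.match(text)
    if not match:
        raise ValueError(f"unrecognised abort condition: {text!r}")
    metric, op, number, unit = match.groups()
    value = float(number)
    if unit == "ms":
        value /= 1000.0  # normalise milliseconds to seconds
    return metric, op, value


def tripped(conditions, metrics):
    """Return the conditions that the current metrics satisfy."""
    hits = []
    for text in conditions:
        metric, op, limit = parse_condition(text)
        if metric in metrics and COMPARE[op](metrics[metric], limit):
            hits.append(text)
    return hits
```

A runner would poll metrics during the experiment window and trigger the rollback actions as soon as `tripped()` returns a non-empty list.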

For Kubernetes (Litmus):

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: payment-chaos
spec:
  engineState: "active"
  appinfo:
    appns: production
    applabel: app=payment-service
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "300"
            - name: CHAOS_INTERVAL
              value: "60"
            - name: FORCE
              value: "false"

For plain bash:

#!/usr/bin/env bash
# Chaos: kill one payment-service pod and compare the 5xx rate before and after
set -euo pipefail

# Fail fast if the Prometheus base URL is not provided
: "${PROMETHEUS:?Set PROMETHEUS to the Prometheus base URL}"

# sum() collapses all matching series; "or vector(0)" yields 0 when there are no 5xx samples
QUERY='sum(rate(http_requests_total{service="payment",status=~"5.."}[1m])) or vector(0)'

# Print the current 5xx request rate (requests per second)
error_rate() {
  curl -sf "$PROMETHEUS/api/v1/query" --data-urlencode "query=$QUERY" | \
    python3 -c "import json,sys;print(json.load(sys.stdin)['data']['result'][0]['value'][1])"
}

echo "📊 Capturing steady state..."
BASELINE_ERROR_RATE=$(error_rate)
echo "Baseline error rate: $BASELINE_ERROR_RATE"

echo "💥 Injecting failure: killing one payment-service pod..."
POD=$(kubectl get pods -l app=payment-service -o jsonpath='{.items[0].metadata.name}')
# --grace-period=0 only takes effect together with --force
kubectl delete pod "$POD" --grace-period=0 --force

echo "⏱️  Observing for 5 minutes..."
sleep 300

echo "📊 Measuring impact..."
POST_ERROR_RATE=$(error_rate)
echo "Post-chaos error rate: $POST_ERROR_RATE"

echo "✅ Verifying recovery..."
kubectl get pods -l app=payment-service

2. `gameday` — Plan a Game Day

Generate a full game day schedule:

Pre-game briefing (objectives, safety controls, escalation contacts)
Experiment sequence (ordered by risk, with breaks between)
Observation assignments (who monitors what dashboard)
Go/no-go criteria between experiments
Post-game debrief template
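The experiment sequence can be drafted mechanically once each experiment has a risk level and a duration. This is a rough sketch under assumed inputs: the tuple layout, the 15-minute default break, and the experiment names are illustrative choices, not part of the skill.

```python
from datetime import datetime, timedelta


def build_schedule(experiments, start, break_minutes=15):
    """Order experiments by risk level and lay them out with breaks.

    experiments: list of (name, risk_level, duration_minutes) tuples,
    where risk_level follows the Level 1-5 scale from Step 3.
    Returns a list of (start_time, end_time, label) slots.
    """
    ordered = sorted(experiments, key=lambda e: e[1])  # lowest risk first
    slots, cursor = [], start
    for name, level, minutes in ordered:
        end = cursor + timedelta(minutes=minutes)
        slots.append((cursor, end, f"L{level} {name}"))
        # Gap for the go/no-go decision and for the system to settle.
        cursor = end + timedelta(minutes=break_minutes)
    return slots
```

For example, `build_schedule([("cache outage", 4, 20), ("pod kill", 1, 5)], datetime(2025, 1, 15, 10, 0))` places the pod kill first and starts the cache outage only after the break has elapsed.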

3. `audit` — Assess Chaos Readiness

Before running chaos experiments, verify the system has:

Health check endpoints on every service
Monitoring and alerting in place
Circuit breakers or retry logic
Graceful degradation modes
Runbooks for common failures
Rollback procedures tested recently

Score readiness 0-100 and recommend prerequisites before first chaos experiment.
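A simple way to turn the checklist into a 0-100 score is a weighted sum. The weights and check names below are illustrative assumptions; the skill does not specify how the score is computed.

```python
# Illustrative weights (an assumption); they sum to 100.
CHECKS = {
    "health_checks": 20,
    "monitoring_alerting": 20,
    "circuit_breakers_or_retries": 20,
    "graceful_degradation": 15,
    "runbooks": 10,
    "rollback_tested": 15,
}


def readiness(passed):
    """Return (score, missing prerequisites) for a set of passed checks."""
    unknown = set(passed) - set(CHECKS)
    if unknown:
        raise ValueError(f"unknown checks: {sorted(unknown)}")
    score = sum(weight for name, weight in CHECKS.items() if name in passed)
    missing = [name for name in CHECKS if name not in passed]
    return score, missing
```

The `missing` list doubles as the prerequisite recommendations: anything in it should be addressed before the first experiment.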

4. `report` — Analyze Experiment Results

After running an experiment, produce a findings report:

Did the steady-state hypothesis hold?
What broke? Was it expected?
How long until the system self-healed?
How did the actual blast radius compare with the expected one?
Remediation recommendations (add circuit breaker, fix retry logic, add redundancy)
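The self-healing question can be answered from the metric samples collected during the run. The sketch below assumes samples are `(timestamp_seconds, error_rate)` pairs sorted by time and that "recovered" means the error rate is back at or below a chosen threshold for the rest of the observation window; both are simplifying assumptions.

```python
def time_to_recover(samples, injected_at, threshold):
    """Seconds from injection until the error rate stays at or below threshold.

    Returns 0.0 if the threshold was never exceeded after injection,
    and None if the system was still degraded at the last sample.
    """
    after = [(t, v) for t, v in samples if t >= injected_at]
    breaches = [i for i, (_, v) in enumerate(after) if v > threshold]
    if not breaches:
        return 0.0  # hypothesis held throughout
    last = breaches[-1]
    if last == len(after) - 1:
        return None  # no healthy sample after the final breach
    # First healthy sample after the final breach marks recovery.
    return float(after[last + 1][0] - injected_at)
```

The result is only as precise as the sampling interval, so a 10-second scrape interval gives a recovery time accurate to roughly 10 seconds.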

Safe Usage Recommendations

This skill is coherent with chaos-engineering activities, but the package metadata is incomplete and the runtime steps are explicitly destructive. Before installing or enabling it: 1) do not allow autonomous runs against production, and require human confirmation for any destructive step; 2) ensure the agent's kubectl context and service account are restricted to a non-production namespace with least privilege; 3) confirm that the required binaries (kubectl, docker compose, curl, python3) and the PROMETHEUS endpoint are declared and provided intentionally; 4) test all generated experiments in staging only, with clear abort criteria and runbooks; 5) if you cannot verify who operates the agent or the exact cluster targeted, do not enable this skill in environments with sensitive production data.

Functional Analysis

Type: OpenClaw Skill
Name: chaos-test-designer
Version: 1.0.0

The skill is designed for chaos engineering and failure injection, which involves inherently risky operations such as terminating Kubernetes pods and simulating network or resource failures. While the behavior aligns with the stated purpose, the broad discovery commands (e.g., 'kubectl get deployments -A' and filesystem searches in SKILL.md) and the generation of destructive shell scripts constitute high-risk capabilities that could be misused. No evidence of malicious intent, such as data exfiltration or unauthorized persistence, was found.

Capability Tags

crypto
can-make-purchases

Capability Assessment

⚠ Purpose & Capability

The described purpose (design chaos experiments) legitimately requires access to Kubernetes, Docker, monitoring, and the ability to run shell tools; however the registry metadata declares no required binaries, env vars, or credentials. That mismatch (metadata says 'none' while instructions call for kubectl, docker compose, curl, python3, and cluster credentials) is inconsistent and surprising.

⚠ Instruction Scope

The SKILL.md gives explicit, destructive steps (kubectl delete pod, scaling deployments, killing resources in namespace=production, AZ/region failover scenarios) and commands that read local files (find/grep) and call monitoring endpoints. Those actions are within the domain of chaos engineering but are high-risk; the instructions also reference $PROMETHEUS and assume access to cluster context and privileged service accounts (e.g., litmus-admin) without any safety gates in the metadata.

✓ Install Mechanism

This is an instruction-only skill with no install spec, which minimizes disk-write/install risk. However, lack of install does not remove risk because the runtime instructions invoke external tools and cluster operations.

⚠ Credentials

No environment variables or credentials are declared, but the runtime examples reference $PROMETHEUS and implicitly require Kubernetes cluster credentials (kubectl) and possibly privileged service accounts. The skill should declare these requirements and justify them; as-is, it asks for operations that need sensitive permissions but doesn't tell the user what will be needed.

ℹ Persistence & Privilege

always is false and autonomous invocation is allowed (the platform default). Autonomous invocation combined with destructive instructions could be dangerous if the agent is allowed to run skills without human confirmation — that risk stems from operational use, not the skill metadata itself.

Version History

v1.0.0

Initial release of chaos-test-designer.
- Design chaos engineering experiments to test system resilience through failure injection.
- Generate steady-state hypotheses, define blast radius controls, and provide rollback procedures.
- Supports experiment definitions for tools like Chaos Monkey, Litmus, Gremlin, and plain scripts.
- Includes commands to design experiments, plan game days, audit system readiness, and report findings.

Metadata

Slug chaos-test-designer

Version 1.0.0

License MIT-0

Total installs 0

Current installs 0

Number of versions 1

FAQ