Chaos Test Designer
/install chaos-test-designer
Design chaos engineering experiments that safely test your system's resilience. Define steady-state hypotheses, inject controlled failures (service crashes, network partitions, resource exhaustion, dependency outages), measure impact, and produce runnable experiment definitions for Chaos Monkey, Litmus, Gremlin, or plain scripts.
Use when: "design chaos test", "test system resilience", "what happens if this service dies", "failure injection", "game day planning", "chaos engineering", "test our fallbacks", or before declaring a service production-ready.
Commands
1. design — Create Chaos Experiment
Step 1: Understand the System
# Discover services and dependencies
kubectl get deployments -A 2>/dev/null | grep -v kube-system
docker compose config --services 2>/dev/null
# Or read architecture docs
find . -maxdepth 3 -name "*.md" -exec grep -li "architecture\|topology\|dependency" {} + 2>/dev/null
Map the dependency graph:
- Which services call which?
- What are the single points of failure?
- Where are the circuit breakers, retries, fallbacks?
- What external dependencies exist (databases, caches, queues, third-party APIs)?
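Once the call graph is written down, the questions above can be turned into a quick check. The following is a minimal sketch, assuming a hand-maintained map of which service calls which and a replica count per component. The service names and the `find_single_points_of_failure` helper are hypothetical and not part of the skill.

```python
from collections import defaultdict

# Hypothetical call graph: service -> components it calls.
CALLS = {
    "web": ["orders", "payment"],
    "orders": ["postgres", "payment"],
    "payment": ["postgres", "stripe-api"],
}
# Hypothetical replica counts; anything missing is treated as a single instance.
REPLICAS = {"web": 3, "orders": 2, "payment": 3, "postgres": 1}


def find_single_points_of_failure(calls, replicas):
    """Return {component: sorted callers} for called components with at most one instance."""
    callers = defaultdict(set)
    for service, deps in calls.items():
        for dep in deps:
            callers[dep].add(service)
    return {
        dep: sorted(users)
        for dep, users in callers.items()
        if replicas.get(dep, 1) <= 1
    }


spofs = find_single_points_of_failure(CALLS, REPLICAS)
for dep, users in sorted(spofs.items()):
    print(f"{dep}: single instance, used by {', '.join(users)}")
```

Components flagged here are natural first targets for a Level 1 or Level 4 experiment, since losing them affects every listed caller.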
Step 2: Define Steady-State Hypothesis
Before breaking anything, define what "normal" looks like:
## Steady-State Hypothesis
- Homepage loads in < 500ms (p95)
- API error rate < 0.1%
- Orders processed within 30 seconds of submission
- Background jobs backlog < 100 items
- All health check endpoints return 200
This is the baseline you'll compare against during the experiment.
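A hypothesis like this is easiest to compare against when it is encoded as data. The sketch below assumes the measurements have already been collected into a dictionary; the metric names and the `check_steady_state` helper are hypothetical, and the health check item is left out for brevity.

```python
import operator

# Thresholds taken from the hypothesis above; metric names are hypothetical.
HYPOTHESIS = [
    ("homepage_p95_ms", "<", 500),
    ("api_error_rate_pct", "<", 0.1),
    ("order_processing_s", "<=", 30),
    ("job_backlog_items", "<", 100),
]
OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt, ">=": operator.ge}


def check_steady_state(measurements, hypothesis=HYPOTHESIS):
    """Return a list of violations; an empty list means the hypothesis holds."""
    violations = []
    for metric, op, limit in hypothesis:
        value = measurements.get(metric)
        if value is None:
            violations.append(f"{metric}: no measurement")
        elif not OPS[op](value, limit):
            violations.append(f"{metric}: {value} is not {op} {limit}")
    return violations


baseline = {"homepage_p95_ms": 420, "api_error_rate_pct": 0.04,
            "order_processing_s": 12, "job_backlog_items": 37}
during = dict(baseline, api_error_rate_pct=2.5)

print(check_steady_state(baseline))
print(check_steady_state(during))
```

Running the same check before, during, and after the experiment gives three comparable snapshots. A missing measurement is reported as a violation so that a broken probe is not mistaken for a healthy system.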
Step 3: Select Failure Mode
Common failure modes ranked by severity:
Level 1 — Service Failures (start here)
- Kill a single pod/container instance
- Restart a service with delay
- Reduce replica count to 1
Level 2 — Network Failures
- Add latency (100ms, 500ms, 2000ms) to inter-service calls
- Drop 10% of packets to a specific service
- DNS resolution failures
- Block traffic to a specific dependency
Level 3 — Resource Exhaustion
- Fill disk to 95%
- Consume all available memory (OOM scenarios)
- Saturate CPU
- Exhaust database connection pool
- Fill message queue to capacity
Level 4 — Dependency Failures
- External API returns 500 for all requests
- Database becomes read-only
- Cache becomes unavailable
- Message broker stops accepting messages
Level 5 — Infrastructure Failures (advanced)
- Availability zone failure (kill all resources in one AZ)
- Region failover
- Complete network partition between services
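Because the levels are meant to be worked through in order, a team can track which ones have already passed and pick the lowest remaining one. This is a small sketch of that rule; the `next_level` helper and the idea of storing passed levels as a set are assumptions, not something the skill prescribes.

```python
# Levels mirror the list above; tracking passed levels as a set is an assumption.
LEVELS = {
    1: "Service Failures",
    2: "Network Failures",
    3: "Resource Exhaustion",
    4: "Dependency Failures",
    5: "Infrastructure Failures",
}


def next_level(passed_levels):
    """Return the lowest level not yet passed, or None when all five have passed."""
    for level in sorted(LEVELS):
        if level not in passed_levels:
            return level
    return None


level = next_level({1, 2})
print(f"Next: Level {level} ({LEVELS[level]})")
```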
Step 4: Generate Experiment Definition
# Chaos Experiment: [Service] [Failure Type]
# Generated by chaos-test-designer
experiment:
  name: "payment-service-pod-kill"
  description: "Kill payment service pod to verify retry logic and circuit breaker"
  steady_state:
    - probe: http
      url: "http://payment-service:8080/health"
      expect_status: 200
    - probe: prometheus
      query: "rate(http_requests_total{service='payment',status='5xx'}[1m])"
      expect: "< 0.01"
  method:
    - action: kill-pod
      target:
        namespace: production
        label_selector: "app=payment-service"
        count: 1
  rollback:
    - action: scale
      target:
        namespace: production
        deployment: payment-service
        replicas: 3
  controls:
    blast_radius: "single pod in production"
    duration: "5 minutes"
    abort_conditions:
      - "error_rate > 5%"
      - "p99_latency > 10s"
    business_hours_only: true
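The `abort_conditions` strings in the definition only help if something evaluates them while the experiment runs. Below is a minimal sketch of such an evaluator. It assumes `error_rate` is reported in percent and `p99_latency` in seconds, and the `parse_condition` and `should_abort` helpers are hypothetical names rather than part of any chaos tool.

```python
import re

# Matches conditions such as "error_rate > 5%" or "p99_latency > 10s".
CONDITION = re.compile(r"^\s*(\w+)\s*(>=|<=|>|<)\s*([\d.]+)\s*(%|ms|s)?\s*$")
COMPARE = {
    ">": lambda a, b: a > b,
    ">=": lambda a, b: a >= b,
    "<": lambda a, b: a < b,
    "<=": lambda a, b: a <= b,
}


def parse_condition(text):
    """Split a condition into (metric, operator, threshold); ms is converted to seconds."""
    match = CONDITION.match(text)
    if not match:
        raise ValueError(f"unparseable abort condition: {text!r}")
    metric, op, number, unit = match.groups()
    threshold = float(number)
    if unit == "ms":
        threshold /= 1000.0
    return metric, op, threshold


def should_abort(conditions, metrics):
    """Return the conditions that currently hold; a missing metric counts as tripped."""
    tripped = []
    for text in conditions:
        metric, op, threshold = parse_condition(text)
        value = metrics.get(metric)
        if value is None or COMPARE[op](value, threshold):
            tripped.append(text)
    return tripped


abort_conditions = ["error_rate > 5%", "p99_latency > 10s"]
# Assumed units: error_rate in percent, p99_latency in seconds.
print(should_abort(abort_conditions, {"error_rate": 0.8, "p99_latency": 2.1}))
print(should_abort(abort_conditions, {"error_rate": 7.2, "p99_latency": 2.1}))
```

Treating a missing metric as a tripped condition is a deliberate fail-safe choice: if monitoring goes dark during the experiment, the safest reading is to stop and roll back.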
For Kubernetes (Litmus):
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: payment-chaos
spec:
  appinfo:
    appns: production
    applabel: app=payment-service
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "300"
            - name: CHAOS_INTERVAL
              value: "60"
            - name: FORCE
              value: "false"
For plain bash:
#!/usr/bin/env bash
# Chaos: Kill payment-service pod
set -euo pipefail
# PROMETHEUS is the Prometheus base URL; fail fast if it is not set.
: "${PROMETHEUS:?set PROMETHEUS to the Prometheus base URL}"
# Print the summed 5xx request rate, or 0 when no series match.
error_rate() {
  curl -s "$PROMETHEUS/api/v1/query" --data-urlencode \
    'query=sum(rate(http_requests_total{service="payment",status=~"5.."}[1m]))' | \
    python3 -c "import json,sys;r=json.load(sys.stdin)['data']['result'];print(r[0]['value'][1] if r else 0)"
}
echo "📊 Capturing steady state..."
BASELINE_ERROR_RATE=$(error_rate)
echo "Baseline error rate: $BASELINE_ERROR_RATE"
echo "💥 Injecting failure: killing one payment-service pod..."
POD=$(kubectl get pods -l app=payment-service -o jsonpath='{.items[0].metadata.name}')
# kubectl only honors --grace-period=0 together with --force (immediate kill, like a crash).
kubectl delete pod "$POD" --grace-period=0 --force
echo "⏱️ Observing for 5 minutes..."
sleep 300
echo "📊 Measuring impact..."
POST_ERROR_RATE=$(error_rate)
echo "Post-chaos error rate: $POST_ERROR_RATE"
echo "✅ Verifying recovery..."
kubectl get pods -l app=payment-service
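The script reads the error rate out of the Prometheus JSON response with an inline Python one-liner. A slightly more defensive version of that parsing is sketched below. It relies on the standard instant-query response shape (`data.result` as a list of series, each with a `[timestamp, "value"]` pair); the `total_rate` helper name and the sample payloads are made up for illustration.

```python
import json


def total_rate(response_text):
    """Sum the sample values of an instant-vector response; 0.0 when no series match."""
    payload = json.loads(response_text)
    if payload.get("status") != "success":
        raise RuntimeError(f"Prometheus query failed: {payload.get('error', 'unknown error')}")
    # Each result entry looks like {"metric": {...}, "value": [timestamp, "string-number"]}.
    return sum((float(series["value"][1]) for series in payload["data"]["result"]), 0.0)


empty = '{"status":"success","data":{"resultType":"vector","result":[]}}'
two_series = json.dumps({
    "status": "success",
    "data": {"resultType": "vector", "result": [
        {"metric": {"pod": "payment-a"}, "value": [1700000000, "0.25"]},
        {"metric": {"pod": "payment-b"}, "value": [1700000000, "0.5"]},
    ]},
})
print(total_rate(empty))
print(total_rate(two_series))
```

Summing across series matters when the metric is labeled per pod: after a pod is killed, the first series in the result may belong to a pod that no longer exists. An empty result is treated as a rate of zero, which is what Prometheus returns when no 5xx responses have been recorded.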
2. gameday — Plan a Game Day
Generate a full game day schedule:
- Pre-game briefing (objectives, safety controls, escalation contacts)
- Experiment sequence (ordered by risk, with breaks between)
- Observation assignments (who monitors what dashboard)
- Go/no-go criteria between experiments
- Post-game debrief template
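The schedule structure above can be sketched as a small generator that orders experiments by risk and inserts a break with a go/no-go check between them. The experiment names, durations, start time, and the `build_schedule` helper are all hypothetical placeholders.

```python
from datetime import datetime, timedelta

# Hypothetical experiments: (name, risk level 1-5, duration in minutes).
EXPERIMENTS = [
    ("cache-unavailable", 4, 20),
    ("payment-pod-kill", 1, 10),
    ("orders-latency-500ms", 2, 15),
]


def build_schedule(experiments, start, briefing_min=30, break_min=15):
    """Return (start_time, label) rows: briefing, experiments by ascending risk, debrief."""
    rows = [(start, "Pre-game briefing")]
    clock = start + timedelta(minutes=briefing_min)
    ordered = sorted(experiments, key=lambda e: e[1])
    for index, (name, risk, minutes) in enumerate(ordered):
        rows.append((clock, f"Experiment: {name} (risk {risk})"))
        clock += timedelta(minutes=minutes)
        if index < len(ordered) - 1:
            # Pause between experiments to confirm steady state before continuing.
            rows.append((clock, "Break and go/no-go check"))
            clock += timedelta(minutes=break_min)
    rows.append((clock, "Post-game debrief"))
    return rows


schedule = build_schedule(EXPERIMENTS, datetime(2025, 1, 15, 10, 0))
for when, label in schedule:
    print(when.strftime("%H:%M"), label)
```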
3. audit — Assess Chaos Readiness
Before running chaos experiments, verify the system has:
- Health check endpoints on every service
- Monitoring and alerting in place
- Circuit breakers or retry logic
- Graceful degradation modes
- Runbooks for common failures
- Rollback procedures tested recently
Score readiness from 0 to 100 and recommend prerequisites to complete before the first chaos experiment.
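One simple way to arrive at a 0-100 score is a weighted checklist. The sketch below uses the six items from the list above; the weights are an assumption chosen to sum to 100, and the `readiness` helper is hypothetical.

```python
# Checklist items from the audit list; the weights are an assumption and sum to 100.
CHECKS = {
    "health_checks": 20,
    "monitoring_alerting": 20,
    "circuit_breakers_or_retries": 20,
    "graceful_degradation": 15,
    "runbooks": 10,
    "rollback_tested": 15,
}


def readiness(results):
    """Return (score 0-100, missing prerequisites) from {check_name: bool}."""
    score = sum(weight for name, weight in CHECKS.items() if results.get(name, False))
    missing = [name for name in CHECKS if not results.get(name, False)]
    return score, missing


score, missing = readiness({
    "health_checks": True,
    "monitoring_alerting": True,
    "circuit_breakers_or_retries": True,
    "graceful_degradation": False,
    "runbooks": True,
    "rollback_tested": False,
})
print(f"Readiness: {score}/100")
print("Prerequisites:", ", ".join(missing) or "none")
```

The list of missing items doubles as the prerequisite recommendations: each unchecked entry is something to put in place before the first experiment.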
4. report — Analyze Experiment Results
After running an experiment, produce a findings report:
- Did the steady-state hypothesis hold?
- What broke? Was it expected?
- How long until the system self-healed?
- How did the actual blast radius in production compare with the expected one?
- Remediation recommendations (add circuit breaker, fix retry logic, add redundancy)
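The self-healing question can be answered from the metric samples collected during the run. The following sketch computes time to recovery as the interval from injection until the metric drops back to the threshold and stays there. The sample data, the threshold, and the `time_to_recovery` helper are illustrative assumptions.

```python
def time_to_recovery(samples, injected_at, threshold):
    """Seconds from injection until the metric is back at or below threshold for good.

    samples is a list of (seconds, value) pairs sorted by time. Returns None if
    the metric never breached the threshold or had not recovered by the last sample.
    """
    after = [(t, v) for t, v in samples if t >= injected_at]
    last_breach = None
    for t, v in after:
        if v > threshold:
            last_breach = t
    if last_breach is None:
        return None
    # The first sample after the last breach marks the recovery point.
    for t, _ in after:
        if t > last_breach:
            return t - injected_at
    return None


# Hypothetical error rate (percent) sampled every 15 seconds; failure injected at t=20.
samples = [(0, 0.02), (15, 0.03), (30, 4.1), (45, 6.0), (60, 1.2), (75, 0.04), (90, 0.02)]
recovery = time_to_recovery(samples, injected_at=20, threshold=0.1)
print(f"Recovered after {recovery} seconds")
```

The result is only as precise as the sampling interval, so a 15-second scrape interval bounds the error at roughly 15 seconds.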
Install and Use
- Make sure OpenClaw is installed (local or Docker)
- Run the install command in chat: /install chaos-test-designer
- After installation, invoke the skill by name or use /chaos-test-designer
- Provide required inputs per the skill's parameter spec and get structured output
What is Chaos Test Designer?
Design chaos engineering experiments to test system resilience. Generate failure injection scenarios, define steady-state hypotheses, blast radius controls,... It is an AI Agent Skill for Claude Code / OpenClaw, with 30 downloads so far.
How do I install Chaos Test Designer?
Run "/install chaos-test-designer" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.
Is Chaos Test Designer free?
Yes, Chaos Test Designer is completely free, licensed under MIT-0. You can download, install and use it at no cost.
Which platforms does Chaos Test Designer support?
Chaos Test Designer is cross-platform and runs anywhere OpenClaw / Claude Code is available.
Who created Chaos Test Designer?
It is built and maintained by charlie-morrison (@charlie-morrison); the current version is v1.0.0.