AB Test Eval
/install ab-test-eval
AB Test Eval — Automated Component Benchmarking via Subagents
Evaluate any OpenClaw component (skill, script, hook, cron job) by spawning parallel subagents and comparing arms. Supports multiple eval modes, auto-grading, and regression tracking.
Step 1: Choose the Eval Mode
Pick the mode that matches the user's intent:
| Mode | Question | Arms |
|---|---|---|
| baseline | Does the skill help at all? | with-skill vs without-skill |
| regression | Did changes break anything? | skill-v2 vs skill-v1 |
| model-swap | Works on another model? | model-A vs model-B |
| prompt-variant | Which description works better? | variant-A vs variant-B |
| trigger-accuracy | Dispatches correctly? | should-trigger vs should-not |
| adversarial | Robust against bad inputs? | clean vs perturbed |
| script-test | Script produces correct output? | script-A vs script-B |
| hook-dryrun | Hook responds correctly? | with-hook vs without |
| cron-dryrun | Cron payload does the right thing? | cron-run vs baseline |
| integration | Full stack works together? | full vs missing-component |
Default to baseline if unclear.
Step 2: Prepare Directory Structure
Create the eval workspace as a sibling to the skill directory:
\x3Cskill-dir>/evals/evals.json
\x3Cskill-dir>/\x3Cskill-name>-workspace/
iteration-1/
\x3Ceval-name>/
\x3Carm-a>/
outputs/commands.md
timing.json
grading.json
\x3Carm-b>/
outputs/commands.md
timing.json
grading.json
eval_metadata.json
benchmark.json
benchmark.md
iteration-2/
...
history.jsonl
Create directories with mkdir -p. Use descriptive arm names (e.g. with_skill, without_skill, new_version, old_version).
Step 3: Define or Generate Evals
If evals already exist
Read \x3Cskill-dir>/evals/evals.json and present the cases to the user for confirmation before running. Do not auto-run without sign-off.
If evals are missing
Generate them by reading the skill's SKILL.md and creating 4-6 realistic eval cases:
- Happy path — clear request the skill should nail
- Ambiguous request — could go multiple ways
- Edge case — unusual params or corner case
- Negative case — similar but should NOT trigger this skill
- Multi-step case — complex multi-tool request
- Adversarial case (if mode=adversarial) — misleading / typo / injected junk
Write to \x3Cskill-dir>/evals/evals.json:
{
"skill_name": "my-skill",
"evals": [
{
"id": 1,
"prompt": "Realistic user request",
"expected_output": "What correct behavior looks like",
"files": []
}
]
}
Then show them to the user: "Here are the test cases I plan to run. Do these look right, or do you want to add more?"
Wait for approval before spawning subagents.
Step 4: Efficiency Controls — Dry-Run Preview & Smoke Test
Before spawning expensive subagents, offer the user two efficiency controls (especially useful when eval count > 3 or arms > 2).
--dry Preview
Generate a preview report that lists exactly what will run, without spawning any subagents:
# Eval Preview Report
- Mode: baseline
- Evals: 4
- Arms per eval: 2 (with-skill, without-skill)
- Model: current
- Estimated subagent calls: 8
Evals:
1. happy-path-basic — 2 arms, 3 assertions
2. ambiguous-request — 2 arms, 3 assertions
Present this to the user and ask: "This looks like X evals across Y arms. Should I proceed, or do you want to trim the list?"
--smoke Smoke Test
If the user wants a quick confidence check, run only the first eval end-to-end (all arms + grading). This verifies the pipeline works before committing to the full run.
After a successful smoke test, ask: "Smoke test passed. Should I run the remaining N evals now?"
Step 5: Write Assertions
While waiting for user approval (or while subagents run), draft assertions in eval_metadata.json for each eval.
Save to \x3Cworkspace>/iteration-N/\x3Ceval-name>/eval_metadata.json:
{
"eval_id": 1,
"eval_name": "happy-path-basic",
"prompt": "The user's task prompt",
"assertions": [
{
"text": "Uses the --force flag",
"expected": true
},
{
"text": "Warns about OAuth timeout gotcha",
"expected": true
}
]
}
Assertions use text and expected fields. These are the basis for grading.
Step 6: Spawn Subagents in Parallel
For each eval, spawn all arms in the same turn. Launch as many as the environment allows concurrently.
Baseline mode
- with_skill: load
SKILL.md, execute prompt, save outputs - without_skill: same prompt, no skill, save outputs
Regression mode
- new_skill: load updated
SKILL.md - old_skill: load a snapshot of the previous version (make a
cp -rsnapshot before editing)
Model-swap mode
- model-a: run with skill + model A override
- model-b: run with skill + model B override
Prompt-variant mode
- variant-a: load skill variant A's
SKILL.md - variant-b: load skill variant B's
SKILL.md
Trigger-accuracy mode
Each prompt gets ONE subagent tasked as the dispatcher:
"You are the dispatcher. Given this user prompt, would you load
\x3Cskill-path>/SKILL.mdbefore responding? Answer yes/no and explain why." Save yes/no explanations, then grade TP/FP/TN/FN.
Adversarial mode
- clean: normal prompt + skill
- perturbed: prompt with typos / injected irrelevance / misleading framing + skill
Script-test mode
- Run the bundled script with controlled inputs and assert on stdout, exit code, and generated files.
- Arms can be: current-script vs previous-script, or script-with-skill-guidance vs naive-approach.
- Assertions focus on correctness, idempotency, and edge-case handling.
Hook-dryrun mode
- Simulate a hook event by spawning a subagent and telling it: "Pretend you are an OpenClaw agent receiving a
\x3Chook-type>event with this payload. Given this hook'sSKILL.mdor config, what would you do?" - Do NOT modify actual system hook registrations. This is a read-only simulation.
Cron-dryrun mode
- Extract the cron job's payload (task command or script path from
jobs.jsonor cron config). - Run the payload in an isolated subagent or
execdry-run context. - Assert on expected side effects, file outputs, or command sequence.
- Also verify the cron expression is valid and produces expected schedule times.
Integration mode
- Test the full stack: user prompt → skill dispatch → script execution → hook response.
- Arms: full-stack vs missing-script vs missing-hook vs skill-only.
Task template for standard arms:
Execute this task:
- Arm: \x3Carm-name>
- Skill path: \x3Cabsolute-path> or "none"
- Model override: \x3Cmodel> or "default"
- Task: \x3Ceval prompt>
- Input files: \x3Cfiles or "none">
- Save outputs to: \x3Cworkspace>/iteration-N/\x3Ceval-name>/\x3Carm>/outputs/commands.md
- Execute the task using available tools — if the subagent has tool access, run commands for real; if not, document what would be done.
Step 7: Capture Timing from Notifications
When each subagent completes, its notification includes total_tokens and duration_ms. This is the only chance to capture it.
Save to \x3Carm>/timing.json:
{
"total_tokens": 84852,
"duration_ms": 23332,
"total_duration_seconds": 23.3
}
Process each notification as it arrives rather than batching.
Step 8: Auto-Grade with LLM-as-Judge
Spawn a grading subagent per eval to compare all arms against the assertions:
Read the following files:
- \x3Cworkspace>/iteration-N/\x3Ceval-name>/\x3Carm-a>/outputs/commands.md
- \x3Cworkspace>/iteration-N/\x3Ceval-name>/\x3Carm-b>/outputs/commands.md
Eval prompt: \x3Cprompt>
Expected output: \x3Cexpected_output>
Grade each arm against these assertions:
\x3Cassertions from eval_metadata.json>
For each arm, save a separate grading.json with:
- A top-level "expectations" array with text/passed/evidence
- A "summary" with passed/failed/total/pass_rate
Save arm A results to: \x3Cworkspace>/iteration-N/\x3Ceval-name>/\x3Carm-a>/grading.json
Save arm B results to: \x3Cworkspace>/iteration-N/\x3Ceval-name>/\x3Carm-b>/grading.json
Each grading.json schema:
{
"expectations": [
{
"text": "Uses the --force flag",
"passed": true,
"evidence": "Output contains 'clawhub update --force'"
}
],
"summary": {
"passed": 3,
"failed": 0,
"total": 3,
"pass_rate": 1.0
}
}
For trigger-accuracy runs, save a separate trigger_grading.json with tp, fp, tn, fn tallies at the eval level.
Step 9: Aggregate and Generate Report
Write benchmark.json:
{
"metadata": {
"skill_name": "my-skill",
"mode": "baseline",
"model": "current-model-id",
"timestamp": "2026-04-11T05:30:00+08:00",
"evals_run": [1, 2, 3],
"arms": ["with_skill", "without_skill"]
},
"evals": [
{
"eval_id": 1,
"eval_name": "happy-path-basic",
"arms": {
"with_skill": {
"pass_rate": 1.0,
"passed": 3,
"total": 3,
"tokens": 12345,
"duration_seconds": 17.6
},
"without_skill": {
"pass_rate": 0.67,
"passed": 2,
"total": 3,
"tokens": 18590,
"duration_seconds": 29.0
}
}
}
],
"totals": {
"with_skill": { "pass_rate": 0.85, "passed": 17, "total": 20 },
"without_skill": { "pass_rate": 0.45, "passed": 9, "total": 20 }
},
"delta": {
"pass_rate": "+0.40"
},
"notes": [
"Arm with_skill consistently better on safety assertions",
"eval-3 edge-case shows no difference — consider strengthening skill"
]
}
Append a compact line to history.jsonl for regression tracking.
Then write benchmark.md with:
- Executive summary (delta, winner, biggest weaknesses)
- Per-eval breakdown table
- Notable failures with quotes
- Recommendations for improving the skill
Present the summary to the user directly in chat.
Step 10: Iterate Based on Feedback
- Discuss results with the user
- Improve the skill based on failed assertions
- Rerun into
iteration-(N+1)/ - Compare
history.jsonlentries for trend - Repeat until satisfied
Mode-Specific Notes
- Regression: Snapshot old skill before editing (
cp -r). Use previous version as baseline. - Model-swap: Use
sessions_spawnwithmodeloverride. - Prompt-variant: Create two temp skill copies with different descriptions.
- Trigger-accuracy: Generate 10 queries (5 should-trigger, 5 near-miss should-not). Grade precision/recall/F1.
- Adversarial: Perturbations include typos, irrelevant context injection, misleading framing. Report degradation score = clean_avg - perturbed_avg.
- Script-test: Run via
execfor deterministic results unless script invokes LLM. Check happy-path AND error handling. - Hook-dryrun: Simulate event via subagent with exact payload JSON. Do NOT modify actual hook registrations.
- Cron-dryrun: Validate cron expression and list next N execution times. If payload sends messages, use dry-run constraint.
- Integration: For missing-component arms, tell subagent: "You do NOT have access to \x3Ccomponent X>."
Hard Constraints
- Do not auto-run evals without user sign-off — present evals and wait for approval before spawning
- Respect
--dryand--smoke— offer preview / smoke-test paths to improve UX and reduce wasted tokens
- Make sure OpenClaw is installed (local or Docker)
- Run the install command in chat:
/install ab-test-eval - After installation, invoke the skill by name or use
/ab-test-eval - Provide required inputs per the skill's parameter spec and get structured output
What is AB Test Eval?
Run A/B evaluation tests for any OpenClaw skill, script, hook, or cron job. Make sure to use this skill whenever the user mentions testing, benchmarking, com... It is an AI Agent Skill for Claude Code / OpenClaw, with 169 downloads so far.
How do I install AB Test Eval?
Run "/install ab-test-eval" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.
Is AB Test Eval free?
Yes, AB Test Eval is completely free, licensed under MIT-0. You can download, install and use it at no cost.
Which platforms does AB Test Eval support?
AB Test Eval is cross-platform and runs anywhere OpenClaw / Claude Code is available (cross-platform).
Who created AB Test Eval?
It is built and maintained by Siyuan Huang (@cyrushuang1995-cmyk); the current version is v2.1.2.