Description

Run A/B evaluation tests for any OpenClaw skill, script, hook, or cron job. Make sure to use this skill whenever the user mentions testing, benchmarking, com...

README (SKILL.md)

AB Test Eval — Automated Component Benchmarking via Subagents

Name: AB Test Eval
Author: cyrushuang1995-cmyk

Evaluate any OpenClaw component (skill, script, hook, cron job) by spawning parallel subagents and comparing arms. Supports multiple eval modes, auto-grading, and regression tracking.

Step 1: Choose the Eval Mode

Pick the mode that matches the user's intent:

Mode	Question	Arms
baseline	Does the skill help at all?	with-skill vs without-skill
regression	Did changes break anything?	skill-v2 vs skill-v1
model-swap	Works on another model?	model-A vs model-B
prompt-variant	Which description works better?	variant-A vs variant-B
trigger-accuracy	Dispatches correctly?	should-trigger vs should-not
adversarial	Robust against bad inputs?	clean vs perturbed
script-test	Script produces correct output?	script-A vs script-B
hook-dryrun	Hook responds correctly?	with-hook vs without
cron-dryrun	Cron payload does the right thing?	cron-run vs baseline
integration	Full stack works together?	full vs missing-component

Default to baseline if unclear.

Step 2: Prepare Directory Structure

Create the eval workspace as a sibling to the skill directory:

<skill-dir>/evals/evals.json
<skill-dir>/<skill-name>-workspace/
  iteration-1/
    <eval-name>/
      <arm-a>/
        outputs/commands.md
        timing.json
        grading.json
      <arm-b>/
        outputs/commands.md
        timing.json
        grading.json
      eval_metadata.json
    benchmark.json
    benchmark.md
  iteration-2/
    ...
  history.jsonl

Create directories with mkdir -p. Use descriptive arm names (e.g. with_skill, without_skill, new_version, old_version).
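
The layout above can also be scaffolded from a script. The following is a minimal sketch, assuming the workspace lives at the path printed in the tree; the function name `create_workspace` and its parameters are hypothetical and not part of the skill.

```python
from pathlib import Path


def create_workspace(skill_dir, skill_name, eval_names, arms, iteration=1):
    """Create <skill-dir>/<skill-name>-workspace/iteration-N/<eval>/<arm>/outputs."""
    iteration_dir = Path(skill_dir) / f"{skill_name}-workspace" / f"iteration-{iteration}"
    for eval_name in eval_names:
        for arm in arms:
            # parents=True with exist_ok=True behaves like `mkdir -p`
            (iteration_dir / eval_name / arm / "outputs").mkdir(parents=True, exist_ok=True)
    return iteration_dir
```

Calling it twice with the same arguments is harmless, because existing directories are left alone.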

Step 3: Define or Generate Evals

If evals already exist

Read <skill-dir>/evals/evals.json and present the cases to the user for confirmation before running. Do not auto-run without sign-off.

If evals are missing

Generate them by reading the skill's SKILL.md and creating 4-6 realistic eval cases:

Happy path — clear request the skill should nail
Ambiguous request — could go multiple ways
Edge case — unusual params or corner case
Negative case — similar but should NOT trigger this skill
Multi-step case — complex multi-tool request
Adversarial case (if mode=adversarial) — misleading / typo / injected junk

Write to <skill-dir>/evals/evals.json:

{
  "skill_name": "my-skill",
  "evals": [
    {
      "id": 1,
      "prompt": "Realistic user request",
      "expected_output": "What correct behavior looks like",
      "files": []
    }
  ]
}

Then show them to the user: "Here are the test cases I plan to run. Do these look right, or do you want to add more?"

Wait for approval before spawning subagents.
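
Before showing generated cases to the user, it can help to sanity-check the file. The sketch below is one possible check, assuming the four per-case keys from the example above are all required; `validate_evals` and `REQUIRED_KEYS` are illustrative names, not something the skill defines.

```python
import json

# Keys taken from the evals.json example; treating all four as required is an assumption.
REQUIRED_KEYS = {"id", "prompt", "expected_output", "files"}


def validate_evals(doc):
    """Return a list of problems; an empty list means the structure looks usable."""
    problems = []
    if not isinstance(doc.get("skill_name"), str) or not doc.get("skill_name"):
        problems.append("skill_name must be a non-empty string")
    evals = doc.get("evals")
    if not isinstance(evals, list) or not evals:
        problems.append("evals must be a non-empty list")
        return problems
    seen_ids = set()
    for index, case in enumerate(evals):
        missing = REQUIRED_KEYS - set(case)
        if missing:
            problems.append(f"evals[{index}] is missing: {sorted(missing)}")
        if case.get("id") in seen_ids:
            problems.append(f"evals[{index}] reuses id {case.get('id')}")
        seen_ids.add(case.get("id"))
    return problems


sample = json.loads(
    '{"skill_name": "my-skill", "evals": [{"id": 1, "prompt": "Realistic user request", '
    '"expected_output": "What correct behavior looks like", "files": []}]}'
)
print(validate_evals(sample))  # prints []
```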

Step 4: Efficiency Controls — Dry-Run Preview & Smoke Test

Before spawning expensive subagents, offer the user two efficiency controls (especially useful when eval count > 3 or arms > 2).

`--dry` Preview

Generate a preview report that lists exactly what will run, without spawning any subagents:

# Eval Preview Report
- Mode: baseline
- Evals: 4
- Arms per eval: 2 (with-skill, without-skill)
- Model: current
- Estimated subagent calls: 8

Evals:
1. happy-path-basic — 2 arms, 3 assertions
2. ambiguous-request — 2 arms, 3 assertions

Present this to the user and ask: "This looks like X evals across Y arms. Should I proceed, or do you want to trim the list?"
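
The preview is cheap to compute because the estimated call count is simply evals multiplied by arms (grading subagents are not counted in the example above). A minimal sketch follows, assuming each eval is given as a (name, assertion count) pair; `build_preview` is a hypothetical helper and its output only approximates the layout shown.

```python
def build_preview(mode, evals, arms, model="current"):
    """Render a preview report; `evals` is a list of (name, assertion_count) pairs."""
    lines = [
        "# Eval Preview Report",
        f"- Mode: {mode}",
        f"- Evals: {len(evals)}",
        f"- Arms per eval: {len(arms)} ({', '.join(arms)})",
        f"- Model: {model}",
        # One subagent per arm per eval; grading subagents are not included here.
        f"- Estimated subagent calls: {len(evals) * len(arms)}",
        "",
        "Evals:",
    ]
    for number, (name, assertion_count) in enumerate(evals, start=1):
        lines.append(f"{number}. {name}: {len(arms)} arms, {assertion_count} assertions")
    return "\n".join(lines)
```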

`--smoke` Smoke Test

If the user wants a quick confidence check, run only the first eval end-to-end (all arms + grading). This verifies the pipeline works before committing to the full run.

After a successful smoke test, ask: "Smoke test passed. Should I run the remaining N evals now?"

Step 5: Write Assertions

While waiting for user approval (or while subagents run), draft assertions in eval_metadata.json for each eval.

Save to <workspace>/iteration-N/<eval-name>/eval_metadata.json:

{
  "eval_id": 1,
  "eval_name": "happy-path-basic",
  "prompt": "The user's task prompt",
  "assertions": [
    {
      "text": "Uses the --force flag",
      "expected": true
    },
    {
      "text": "Warns about OAuth timeout gotcha",
      "expected": true
    }
  ]
}

Assertions use text and expected fields. These are the basis for grading.

Step 6: Spawn Subagents in Parallel

For each eval, spawn all arms in the same turn. Launch as many as the environment allows concurrently.

Baseline mode

with_skill: load SKILL.md, execute prompt, save outputs
without_skill: same prompt, no skill, save outputs

Regression mode

new_skill: load updated SKILL.md
old_skill: load a snapshot of the previous version (make a cp -r snapshot before editing)

Model-swap mode

model-a: run with skill + model A override
model-b: run with skill + model B override

Prompt-variant mode

variant-a: load skill variant A's SKILL.md
variant-b: load skill variant B's SKILL.md

Trigger-accuracy mode

Each prompt gets ONE subagent tasked as the dispatcher:

"You are the dispatcher. Given this user prompt, would you load <skill-path>/SKILL.md before responding? Answer yes/no and explain why."

Save the yes/no answers and explanations, then grade each one as TP/FP/TN/FN.

Adversarial mode

clean: normal prompt + skill
perturbed: prompt with typos / injected irrelevance / misleading framing + skill

Script-test mode

Run the bundled script with controlled inputs and assert on stdout, exit code, and generated files.
Arms can be: current-script vs previous-script, or script-with-skill-guidance vs naive-approach.
Assertions focus on correctness, idempotency, and edge-case handling.

Hook-dryrun mode

Simulate a hook event by spawning a subagent and telling it: "Pretend you are an OpenClaw agent receiving a <hook-type> event with this payload. Given this hook's SKILL.md or config, what would you do?"
Do NOT modify actual system hook registrations. This is a read-only simulation.

Cron-dryrun mode

Extract the cron job's payload (task command or script path from jobs.json or cron config).
Run the payload in an isolated subagent or exec dry-run context.
Assert on expected side effects, file outputs, or command sequence.
Also verify the cron expression is valid and produces expected schedule times.

Integration mode

Test the full stack: user prompt → skill dispatch → script execution → hook response.
Arms: full-stack vs missing-script vs missing-hook vs skill-only.

Task template for standard arms:

Execute this task:
- Arm: <arm-name>
- Skill path: <absolute-path> or "none"
- Model override: <model> or "default"
- Task: <eval prompt>
- Input files: <files or "none">
- Save outputs to: <workspace>/iteration-N/<eval-name>/<arm>/outputs/commands.md
- Execute the task using available tools — if the subagent has tool access, run commands for real; if not, document what would be done.
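
Filling the template by hand for every arm is error-prone, so it may be worth rendering it from a function. This is a sketch only: `render_task` and its argument names are hypothetical, and the final "Execute the task" instruction line is left out for brevity.

```python
TASK_TEMPLATE = """Execute this task:
- Arm: {arm}
- Skill path: {skill_path}
- Model override: {model}
- Task: {prompt}
- Input files: {files}
- Save outputs to: {workspace}/iteration-{iteration}/{eval_name}/{arm}/outputs/commands.md"""


def render_task(arm, prompt, workspace, iteration, eval_name,
                skill_path=None, model=None, files=None):
    """Fill the standard-arm template, using the documented fallbacks for empty fields."""
    return TASK_TEMPLATE.format(
        arm=arm,
        skill_path=skill_path or "none",   # arms without a skill get the literal "none"
        model=model or "default",
        prompt=prompt,
        files=", ".join(files) if files else "none",
        workspace=workspace,
        iteration=iteration,
        eval_name=eval_name,
    )
```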

Step 7: Capture Timing from Notifications

When each subagent completes, its notification includes total_tokens and duration_ms. This is the only chance to capture these values, so record them immediately.

Save to <arm>/timing.json:

{
  "total_tokens": 84852,
  "duration_ms": 23332,
  "total_duration_seconds": 23.3
}

Process each notification as it arrives rather than batching.
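
The third field in timing.json is derived from the second (23332 ms rounds to 23.3 s). A minimal sketch of writing the file is shown below; it assumes the two notification values have already been extracted, since the notification's exact shape is not described here, and `save_timing` is a hypothetical name.

```python
import json
from pathlib import Path


def save_timing(arm_dir, total_tokens, duration_ms):
    """Write timing.json for one arm and return the record that was written."""
    timing = {
        "total_tokens": total_tokens,
        "duration_ms": duration_ms,
        # Seconds are derived from milliseconds and rounded to one decimal place.
        "total_duration_seconds": round(duration_ms / 1000, 1),
    }
    (Path(arm_dir) / "timing.json").write_text(json.dumps(timing, indent=2))
    return timing
```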

Step 8: Auto-Grade with LLM-as-Judge

Spawn a grading subagent per eval to compare all arms against the assertions:

Read the following files:
- <workspace>/iteration-N/<eval-name>/<arm-a>/outputs/commands.md
- <workspace>/iteration-N/<eval-name>/<arm-b>/outputs/commands.md

Eval prompt: <prompt>
Expected output: <expected_output>

Grade each arm against these assertions:
<assertions from eval_metadata.json>

For each arm, save a separate grading.json with:
- A top-level "expectations" array with text/passed/evidence
- A "summary" with passed/failed/total/pass_rate

Save arm A results to: <workspace>/iteration-N/<eval-name>/<arm-a>/grading.json
Save arm B results to: <workspace>/iteration-N/<eval-name>/<arm-b>/grading.json

Each grading.json schema:

{
  "expectations": [
    {
      "text": "Uses the --force flag",
      "passed": true,
      "evidence": "Output contains 'clawhub update --force'"
    }
  ],
  "summary": {
    "passed": 3,
    "failed": 0,
    "total": 3,
    "pass_rate": 1.0
  }
}
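
The summary object is fully determined by the expectations array, so it can be recomputed to cross-check what the grading subagent wrote. The helper below is an illustrative sketch; `summarize` is not a name defined by the skill, and rounding pass_rate to two decimals is an assumption based on the 0.67 value that appears in benchmark.json.

```python
def summarize(expectations):
    """Derive the grading.json summary from a list of graded expectations."""
    total = len(expectations)
    passed = sum(1 for item in expectations if item["passed"])
    return {
        "passed": passed,
        "failed": total - passed,
        "total": total,
        # Guard against an empty assertion list to avoid dividing by zero.
        "pass_rate": round(passed / total, 2) if total else 0.0,
    }
```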

For trigger-accuracy runs, save a separate trigger_grading.json with tp, fp, tn, fn tallies at the eval level.
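
The four tallies lead directly to precision, recall, and F1. A minimal sketch follows, assuming each dispatcher result is recorded as a pair of booleans (whether the prompt should trigger the skill, and whether the dispatcher answered yes); `trigger_metrics` is a hypothetical helper.

```python
def trigger_metrics(results):
    """results: list of (should_trigger, dispatcher_said_yes) boolean pairs."""
    tp = sum(1 for expected, actual in results if expected and actual)
    fp = sum(1 for expected, actual in results if not expected and actual)
    fn = sum(1 for expected, actual in results if expected and not actual)
    tn = sum(1 for expected, actual in results if not expected and not actual)
    # Each ratio falls back to 0.0 when its denominator is zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"tp": tp, "fp": fp, "tn": tn, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}
```

For example, if 4 of 5 should-trigger prompts get a yes and 1 of 5 near-miss prompts also gets a yes, precision and recall are both 4/5.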

Step 9: Aggregate and Generate Report

Write benchmark.json:

{
  "metadata": {
    "skill_name": "my-skill",
    "mode": "baseline",
    "model": "current-model-id",
    "timestamp": "2026-04-11T05:30:00+08:00",
    "evals_run": [1, 2, 3],
    "arms": ["with_skill", "without_skill"]
  },
  "evals": [
    {
      "eval_id": 1,
      "eval_name": "happy-path-basic",
      "arms": {
        "with_skill": {
          "pass_rate": 1.0,
          "passed": 3,
          "total": 3,
          "tokens": 12345,
          "duration_seconds": 17.6
        },
        "without_skill": {
          "pass_rate": 0.67,
          "passed": 2,
          "total": 3,
          "tokens": 18590,
          "duration_seconds": 29.0
        }
      }
    }
  ],
  "totals": {
    "with_skill": { "pass_rate": 0.85, "passed": 17, "total": 20 },
    "without_skill": { "pass_rate": 0.45, "passed": 9, "total": 20 }
  },
  "delta": {
    "pass_rate": "+0.40"
  },
  "notes": [
    "Arm with_skill consistently better on safety assertions",
    "eval-3 edge-case shows no difference — consider strengthening skill"
  ]
}

Append a compact line to history.jsonl for regression tracking.
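
The totals and delta fields in benchmark.json can be derived from the per-eval arm results rather than typed by hand. The sketch below assumes the per-eval structure shown in the example and pools passed/total counts across evals; `aggregate` is an illustrative name, and the signed two-decimal delta string mirrors the "+0.40" in the example.

```python
def aggregate(evals, arm_a, arm_b):
    """Return (totals, delta) for two arms from the benchmark.json "evals" list."""
    totals = {}
    for arm in (arm_a, arm_b):
        passed = sum(entry["arms"][arm]["passed"] for entry in evals)
        total = sum(entry["arms"][arm]["total"] for entry in evals)
        totals[arm] = {
            "pass_rate": round(passed / total, 2) if total else 0.0,
            "passed": passed,
            "total": total,
        }
    # Delta is arm_a minus arm_b, formatted with an explicit sign.
    difference = totals[arm_a]["pass_rate"] - totals[arm_b]["pass_rate"]
    return totals, {"pass_rate": f"{difference:+.2f}"}
```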

Then write benchmark.md with:

Executive summary (delta, winner, biggest weaknesses)
Per-eval breakdown table
Notable failures with quotes
Recommendations for improving the skill

Present the summary to the user directly in chat.

Step 10: Iterate Based on Feedback

Discuss results with the user
Improve the skill based on failed assertions
Rerun into iteration-(N+1)/
Compare history.jsonl entries for trend
Repeat until satisfied
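
Comparing history.jsonl entries amounts to reading one JSON object per line and looking at how the pass rate moves between iterations. The schema of a history line is not specified here, so the sketch below assumes each line carries a "totals" object shaped like the one in benchmark.json; both that layout and the name `pass_rate_trend` are hypothetical.

```python
import json


def pass_rate_trend(history_text, arm="with_skill"):
    """Return the change in pass rate between consecutive history entries for one arm."""
    rates = []
    for line in history_text.splitlines():
        if line.strip():  # skip blank lines between entries
            entry = json.loads(line)
            rates.append(entry["totals"][arm]["pass_rate"])
    return [round(later - earlier, 2) for earlier, later in zip(rates, rates[1:])]
```

A positive number means the later iteration improved on the previous one; a negative number signals a regression worth investigating.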

Mode-Specific Notes

Regression: Snapshot old skill before editing (cp -r). Use previous version as baseline.
Model-swap: Use sessions_spawn with model override.
Prompt-variant: Create two temp skill copies with different descriptions.
Trigger-accuracy: Generate 10 queries (5 should-trigger, 5 near-miss should-not). Grade precision/recall/F1.
Adversarial: Perturbations include typos, irrelevant context injection, misleading framing. Report degradation score = clean_avg - perturbed_avg.
Script-test: Run via exec for deterministic results unless script invokes LLM. Check happy-path AND error handling.
Hook-dryrun: Simulate event via subagent with exact payload JSON. Do NOT modify actual hook registrations.
Cron-dryrun: Validate cron expression and list next N execution times. If payload sends messages, use dry-run constraint.
Integration: For missing-component arms, tell subagent: "You do NOT have access to <component X>."
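
As a worked example of the adversarial degradation score defined above (clean_avg - perturbed_avg), the sketch below averages per-eval pass rates for each arm and subtracts. Using pass rate as the averaged quantity is an assumption, and `degradation_score` is an illustrative name.

```python
from statistics import mean


def degradation_score(clean_pass_rates, perturbed_pass_rates):
    """clean_avg - perturbed_avg; larger values mean perturbations hurt more."""
    return mean(clean_pass_rates) - mean(perturbed_pass_rates)
```

With clean pass rates of 1.0 and 0.8 and perturbed pass rates of 0.6 and 0.4, the averages are 0.9 and 0.5, so the score is about 0.4.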

Hard Constraints

Do not auto-run evals without user sign-off — present evals and wait for approval before spawning
Respect --dry and --smoke — offer preview / smoke-test paths to improve UX and reduce wasted tokens

Usage Guidance

This skill appears to do what it claims: it creates local eval workspaces, reads a target skill's SKILL.md to generate tests, snapshots versions with cp, and spawns subagents to run arms. Before installing or using it, consider:

1) It will read other skills' SKILL.md files. Do those files contain anything you consider sensitive?
2) It can spawn many parallel subagents. Use the built-in dry-run and smoke-test options first, and limit concurrency to avoid unexpected resource usage or costs.
3) Confirm it will not be run automatically without your explicit approval (the instructions say it waits for sign-off).

If you want additional safety, run it in a restricted environment or with a small eval set first and inspect the generated evals/eval_metadata.json before allowing full runs.

Capability Analysis

Type: OpenClaw Skill
Name: ab-test-eval
Version: 2.1.2

The skill is a benchmarking framework that automates the evaluation of OpenClaw components by spawning subagents to execute tasks. It contains high-risk capabilities, specifically instructions in SKILL.md (Step 6) for subagents to "run commands for real" and the automated generation of test cases from component files (Step 3). These features are plausibly necessary for its stated purpose of integration testing and benchmarking, and the skill includes safety controls like user sign-off and dry-run modes. Even so, the ability to trigger arbitrary command execution via subagents based on generated prompts constitutes a significant risk profile.

Capability Tags

crypto
requires-oauth-token

Capability Assessment

✓ Purpose & Capability

Name and description match what the SKILL.md instructs: creating an eval workspace, reading a target skill's SKILL.md, generating eval cases, snapshotting via cp, and orchestrating parallel subagents. Required binaries (mkdir, cp) are appropriate and proportional for workspace creation and making snapshots.

ℹ Instruction Scope

Instructions largely stay inside the stated purpose, but they do require reading other skills' SKILL.md files and spawning parallel subagents to run arms. Reading SKILL.md for target components is necessary for generating realistic evals, and the doc explicitly requires user approval before running. These behaviors are expected for an evaluator but do mean the skill will access other skills' contents and may run many subagents (resource use).

✓ Install Mechanism

No install spec; instruction-only skills are lowest risk because nothing is downloaded or written at install time.

✓ Credentials

No environment variables, credentials, or config paths are requested. The declared requirements (mkdir, cp) are minimal and appropriate.

✓ Persistence & Privilege

always:false and normal model invocation. The skill writes evaluation workspaces and history files in a sibling workspace (expected for a tester) but does not request persistent platform-wide privileges or modify other skills' configurations.

Version History

v2.1.2

Remove shasum from metadata bins — no longer used in SKILL.md body after fingerprint/cache removal. Keeping mkdir and cp which are actually referenced.

v2.1.1

Restore shasum and add cp to metadata bins. Scanner requires all commands used in SKILL.md body to be declared in metadata.

v2.1.0

Remove fingerprint/cache mechanism and shasum dependency. Caching was over-engineered for A/B testing and triggered a misleading 'Crypto' capability signal on ClawHub. Each eval now runs independently for reliable results.

v2.0.3

Add metadata.openclaw.requires.bins declaration for shasum and mkdir to match ClawHub security scan expectations.

v2.0.2

Declare operational behaviors (subagent spawning, exec/shasum, file read/write) in description to address ClawHub suspicious scan.

v2.0.1

Polish: improved description with pushy trigger language per skill-creator best practices, removed redundant When to Activate section (now in description), removed hardcoded DO NOT run constraint, compressed mode-specific reference from ~60 lines to ~12 lines, total SKILL.md now 374 lines.

v2.0.0

Full rewrite: 10 eval modes (baseline, regression, model-swap, prompt-variant, trigger-accuracy, adversarial, script-test, hook-dryrun, cron-dryrun, integration). Self-contained — no external dependencies. LLM-as-Judge grading, fingerprint-based result caching, --dry preview and --smoke test, per-arm grading.json, benchmark.json report generation, and history.jsonl regression tracking.

v1.0.2

Add example evals, remove empty dirs

v1.0.1

Add example evals.json, remove empty dirs

v1.0.0

Initial release: AB test eval framework for OpenClaw skills

Metadata

Slug ab-test-eval

Version 2.1.2

License MIT-0

All-time Installs 0

Active Installs 0

Total Versions 10

Frequently Asked Questions

What is AB Test Eval?

Run A/B evaluation tests for any OpenClaw skill, script, hook, or cron job. Make sure to use this skill whenever the user mentions testing, benchmarking, com... It is an AI Agent Skill for Claude Code / OpenClaw, with 169 downloads so far.

How do I install AB Test Eval?

Run "/install ab-test-eval" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is AB Test Eval free?

Yes, AB Test Eval is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does AB Test Eval support?

AB Test Eval is cross-platform and runs anywhere OpenClaw or Claude Code is available.

Who created AB Test Eval?

It is built and maintained by Siyuan Huang (@cyrushuang1995-cmyk); the current version is v2.1.2.
