/install benchmarking
benchmarking
Use this skill when you need to:
- benchmark models or agents
- compare providers for real work
- evaluate which model should own cron/ops/coding/research tasks
- turn real work into reusable evaluation packs
- create league tables, scorecards, or benchmark infographics
Goal
Benchmark operator leverage, not just output prettiness.
A good benchmark should tell you:
- who chooses the right tool/runtime
- who respects hidden constraints
- who recovers from failure intelligently
- who verifies before claiming success
- who is worth routing real work to
Benchmark modes
1) Design mode
Use when you need to create a benchmark or full suite.
Expected outputs:
- README.md
- tasks.json
- answer-key.json or answer-key guidelines
- rubric.md
- optional judge-notes.md
2) Execution mode
Use when you need to run models through an existing benchmark.
Expected outputs:
- results-raw.json
- results-scored.json
- README.md
- optional infographic / league-table PNG
3) Expansion mode
Use when you want to make a benchmark harder or add new tracks. Do not reinvent baseline tasks unless needed.
Benchmark design rules
- Ground tasks in real work, not toy prompts.
- Hide important constraints in environment/context, not all in the prompt.
- Weight judgment above syntax.
- Include at least one task where the right move is not to act now.
- Include at least one task about tool/runtime choice.
- Include at least one task about failure recovery.
- Include at least one task requiring proof-oriented completion.
- Separate model failure from provider/harness failure.
Large-run execution rules
- Confirm the actual model roster from the environment.
- Run in batches for large rosters.
- Save raw outputs after each batch.
- Score after raw outputs are locked.
- Generate charts last.
- Better a verified no-PNG pack than an incomplete run with pretty graphics.
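The batching and save-after-each-batch rules above could be sketched roughly as follows. This is a minimal illustration, not the skill's actual harness: `run_model`, `run_in_batches`, and the result layout are hypothetical names chosen for the example.

```python
import json
from pathlib import Path
from typing import Callable, Iterable


def batched(items: list[str], size: int) -> Iterable[list[str]]:
    """Yield consecutive slices of `items` with at most `size` entries."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def run_in_batches(
    roster: list[str],
    run_model: Callable[[str], dict],  # hypothetical: runs one model, returns raw output
    out_path: Path,
    batch_size: int = 3,
) -> dict:
    """Run every model in `roster`, writing raw results to disk after each batch."""
    raw: dict[str, dict] = {}
    for batch in batched(roster, batch_size):
        for model in batch:
            try:
                raw[model] = {"status": "ok", "output": run_model(model)}
            except Exception as exc:
                # Record the error as-is; classifying it is a later scoring step.
                raw[model] = {"status": "error", "error": repr(exc)}
        # Checkpoint: a crash in a later batch cannot lose earlier results.
        out_path.write_text(json.dumps(raw, indent=2))
    return raw
```

Scoring and chart generation would then read the saved file rather than live outputs, which keeps the raw results locked before anything is interpreted.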
Scoring rules
- Prefer deterministic scoring for structured parts.
- Add a human judge layer for operator judgment.
- Keep syntax <=20% of score for hard benchmarks.
- Classify failures explicitly instead of treating all failures as model weakness.
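One way to enforce the syntax cap when combining score components is sketched below. The component names (`syntax`, `structured`, `judgment`) and the function `combined_score` are assumptions made for illustration; the skill itself does not prescribe a formula.

```python
def combined_score(
    components: dict[str, float],
    weights: dict[str, float],
    syntax_cap: float = 0.20,
) -> float:
    """Weighted sum of component scores, each expected in [0, 1].

    Rejects weightings that do not sum to 1 or that give the
    hypothetical "syntax" component more than `syntax_cap`.
    """
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    if weights.get("syntax", 0.0) > syntax_cap:
        raise ValueError(f"syntax weight exceeds cap of {syntax_cap}")
    missing = set(weights) - set(components)
    if missing:
        raise ValueError(f"missing component scores: {sorted(missing)}")
    return sum(weights[name] * components[name] for name in weights)


# Example weighting: deterministic checks plus a human judge layer.
weights = {"syntax": 0.2, "structured": 0.4, "judgment": 0.4}
score = combined_score(
    {"syntax": 1.0, "structured": 0.5, "judgment": 0.75}, weights
)
```

With these example numbers the score comes out at about 0.7; a weighting that gave syntax 0.3 would be rejected outright rather than silently accepted.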
Failure taxonomy
Use these classes when interpreting results:
- MF — Model failure: reasoning/tool choice/accuracy failure with valid harness
- HF — Harness failure: benchmark harness itself broke or mis-scored
- PF — Provider failure: rate limit, provider unsupported, transport failure, 404 model path
- CF — Context failure: prompt too large, missing required context, context-window collapse
- PB — Policy block: task blocked by approval/policy/tool restriction
- SF — Schema/format failure: invalid JSON/structure or repeated parsing failure
- DF — Delegation failure: subagent/runtime orchestration failure, bad handoff, missing proof
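The taxonomy above maps naturally onto an enumeration, so that results carry an explicit failure class instead of a bare "failed" flag. The sketch below is illustrative only: the record layout (`model`, `failure`) and the `tally_failures` helper are assumptions, not part of the skill.

```python
from collections import Counter
from enum import Enum


class FailureClass(Enum):
    MF = "Model failure"
    HF = "Harness failure"
    PF = "Provider failure"
    CF = "Context failure"
    PB = "Policy block"
    SF = "Schema/format failure"
    DF = "Delegation failure"


def tally_failures(records: list[dict]) -> dict[str, Counter]:
    """Count failures per model, keyed by taxonomy code.

    Each record is assumed to look like {"model": str, "failure": code or None}.
    An unknown code raises KeyError instead of being silently lumped in.
    """
    counts: dict[str, Counter] = {}
    for record in records:
        code = record.get("failure")
        if code is None:
            continue  # successful run, nothing to classify
        failure = FailureClass[code]
        counts.setdefault(record["model"], Counter())[failure.name] += 1
    return counts
```

Keeping the counts per class makes it possible to report, for example, that a model's low score came from provider rate limits (PF) rather than from its own reasoning (MF).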
Proof rules
Before saying DONE, provide:
- benchmark pack path
- results path(s)
- score summary
- list of failed/skipped models and why
- note on anything still unverified
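A completion report covering the five items above could be assembled and checked against the filesystem along these lines. This is a hedged sketch: `proof_report` and its field names are hypothetical, and the "done" condition is one possible interpretation of the proof rules.

```python
from pathlib import Path


def proof_report(
    pack_dir: Path,
    result_files: list[str],
    score_summary: dict,
    failed_or_skipped: dict[str, str],
    unverified: list[str],
) -> dict:
    """Build a completion report and verify that the claimed files exist."""
    missing = [name for name in result_files if not (pack_dir / name).is_file()]
    return {
        "pack_path": str(pack_dir),
        "results_paths": [
            str(pack_dir / name) for name in result_files if name not in missing
        ],
        "missing_results": missing,
        "score_summary": score_summary,
        "failed_or_skipped": failed_or_skipped,  # model -> reason
        "unverified": unverified,
        # Assumption: DONE requires an existing pack, all result files, and scores.
        "done": pack_dir.is_dir() and not missing and bool(score_summary),
    }
```

The same check is a reasonable gate for sub-agents: a claim of success is accepted only when `done` is true and `missing_results` is empty.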
Recommended folder naming
Use:
- output/benchmarks/YYYY-MM-DD-<benchmark-name>/
- keep machine-readable and human-readable files together
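A small helper can generate folder names in the recommended pattern. The slug rule used here (lowercase, non-alphanumeric runs collapsed to hyphens) is an assumption for the example; the skill only specifies the date prefix and the benchmark name.

```python
import re
from datetime import date
from pathlib import Path


def benchmark_dir(
    name: str,
    day: date,
    root: Path = Path("output/benchmarks"),
) -> Path:
    """Return root/YYYY-MM-DD-<benchmark-name> for the given name and date."""
    # Assumed slug rule: lowercase, runs of other characters become one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")
    if not slug:
        raise ValueError("benchmark name must contain letters or digits")
    return root / f"{day.isoformat()}-{slug}"


# Example with a made-up benchmark name and date.
path = benchmark_dir("Ops Routing v2", date(2025, 1, 31))
```

Passing the date explicitly, rather than reading the clock inside the function, keeps reruns of the same benchmark pointed at the same folder.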
Suggested artifact pack
Every serious benchmark should produce most or all of:
- README.md
- tasks.json
- answer-key.json
- rubric.md
- judge-notes.md
- results-raw.json
- results-scored.json
- league-table.png or infographic
- harness.py or equivalent scorer if automation exists
If spawning sub-agents
- use a self-heal pattern
- require checkpoints
- require proof
- do not let agents claim success without file-path evidence
Success criterion
A good benchmark changes routing decisions. If the result would not alter which model you use for real work, the benchmark is probably too soft.
- Make sure OpenClaw is installed (local or Docker)
- Run the install command in chat: /install benchmarking
- After installation, invoke the skill by name or use /benchmarking
- Provide required inputs per the skill's parameter spec and get structured output
What is benchmarking?
Evaluate and compare models or providers on real-work tasks by creating, running, and expanding benchmarks that assess tool choice, failure recovery, and pro... It is an AI Agent Skill for Claude Code / OpenClaw, with 44 downloads so far.
How do I install benchmarking?
Run "/install benchmarking" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.
Is benchmarking free?
Yes, benchmarking is completely free, licensed under MIT-0. You can download, install and use it at no cost.
Which platforms does benchmarking support?
benchmarking is cross-platform and runs anywhere OpenClaw / Claude Code is available.
Who created benchmarking?
It is built and maintained by HiM (@h-mascot); the current version is v1.0.0.