benchmarking
Use this skill when you need to:
- benchmark models or agents
- compare providers for real work
- evaluate which model should own cron/ops/coding/research tasks
- turn real work into reusable evaluation packs
- create league tables, scorecards, or benchmark infographics
Goal
Benchmark operator leverage, not just output prettiness.
A good benchmark should tell you:
- who chooses the right tool/runtime
- who respects hidden constraints
- who recovers from failure intelligently
- who verifies before claiming success
- who is worth routing real work to
Benchmark modes
1) Design mode
Use when you need to create a benchmark or full suite.
Expected outputs:
- README.md
- tasks.json
- answer-key.json or answer-key guidelines
- rubric.md
- optional judge-notes.md
2) Execution mode
Use when you need to run models through an existing benchmark.
Expected outputs:
- results-raw.json
- results-scored.json
- README.md
- optional infographic / league-table PNG
3) Expansion mode
Use when you want to make a benchmark harder or add new tracks. Do not reinvent baseline tasks unless needed.
Benchmark design rules
- Ground tasks in real work, not toy prompts.
- Hide important constraints in environment/context, not all in the prompt.
- Weight judgment above syntax.
- Include at least one task where the right move is not to act now.
- Include at least one task about tool/runtime choice.
- Include at least one task about failure recovery.
- Include at least one task requiring proof-oriented completion.
- Separate model failure from provider/harness failure.
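The "at least one task" rules above lend themselves to a mechanical check. The following is a minimal sketch, assuming each task entry carries a hypothetical tags field and that the tag names (defer, tool_choice, failure_recovery, proof) are agreed on by the benchmark author; the skill itself does not prescribe a tagging scheme.

```python
# Hypothetical tag names mapped to the design rule each one stands for.
REQUIRED_TAGS = {
    "defer": "right move is not to act now",
    "tool_choice": "tool/runtime choice",
    "failure_recovery": "failure recovery",
    "proof": "proof-oriented completion",
}


def missing_coverage(tasks):
    """Return the required tags that no task in the suite carries."""
    seen = set()
    for task in tasks:
        # A task without a "tags" field simply contributes nothing.
        seen.update(task.get("tags", []))
    return sorted(tag for tag in REQUIRED_TAGS if tag not in seen)


suite = [
    {"id": "t1", "tags": ["tool_choice"]},
    {"id": "t2", "tags": ["failure_recovery", "proof"]},
]
print(missing_coverage(suite))  # this suite has no "defer" task yet
```

An empty return value means every coverage rule has at least one task behind it; it says nothing about whether those tasks are any good.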
Large-run execution rules
- Confirm the actual model roster from the environment.
- Run in batches for large rosters.
- Save raw outputs after each batch.
- Score after raw outputs are locked.
- Generate charts last.
- Better a verified no-PNG pack than an incomplete run with pretty graphics.
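One way to follow the batching and save-after-each-batch rules is sketched below. It assumes a hypothetical run_model callable that takes a model name and returns a JSON-serializable result; how that callable reaches a provider is deliberately left out, and the batch size of 5 is an arbitrary default.

```python
import json
from pathlib import Path


def batches(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def run_in_batches(roster, run_model, out_path, batch_size=5):
    """Run every model and rewrite the raw results file after each batch.

    `run_model` is a hypothetical callable: model name -> raw output.
    """
    out_path = Path(out_path)
    results = {}
    for batch in batches(roster, batch_size):
        for model in batch:
            try:
                results[model] = {"status": "ok", "output": run_model(model)}
            except Exception as exc:
                # Record the failure and keep going; classification comes later.
                results[model] = {"status": "error", "error": str(exc)}
        # Persist after each batch so a crash loses at most one batch of work.
        out_path.write_text(json.dumps(results, indent=2))
    return results
```

Scoring and chart generation would then read the saved file rather than the in-memory results, which keeps the raw outputs locked before anything is judged.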
Scoring rules
- Prefer deterministic scoring for structured parts.
- Add a human judge layer for operator judgment.
- Keep syntax <=20% of score for hard benchmarks.
- Classify failures explicitly instead of treating all failures as model weakness.
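The syntax cap can be enforced in the scorer rather than by convention. This is a minimal sketch under stated assumptions: per-dimension scores are in the range 0 to 1, and the dimension names (syntax, judgment, verification) and their weights are hypothetical, not taken from the skill.

```python
def combined_score(parts, weights, syntax_key="syntax", syntax_cap=0.20):
    """Weighted average of per-dimension scores, each expected in [0, 1].

    Raises ValueError if the syntax dimension carries more than
    `syntax_cap` of the total weight.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    syntax_share = weights.get(syntax_key, 0) / total
    if syntax_share > syntax_cap:
        raise ValueError(
            f"syntax weight is {syntax_share:.0%} of the score, cap is {syntax_cap:.0%}"
        )
    return sum(parts[name] * weight for name, weight in weights.items()) / total


# Hypothetical weights: syntax sits exactly at the 20% cap.
weights = {"syntax": 20, "judgment": 50, "verification": 30}
parts = {"syntax": 1.0, "judgment": 0.5, "verification": 0.0}
print(combined_score(parts, weights))
```

In this example the judgment and verification values would come from the human judge layer, while the syntax value can be computed deterministically.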
Failure taxonomy
Use these classes when interpreting results:
- MF — Model failure: reasoning/tool choice/accuracy failure with valid harness
- HF — Harness failure: benchmark harness itself broke or mis-scored
- PF — Provider failure: rate limit, provider unsupported, transport failure, 404 model path
- CF — Context failure: prompt too large, missing required context, context-window collapse
- PB — Policy block: task blocked by approval/policy/tool restriction
- SF — Schema/format failure: invalid JSON/structure or repeated parsing failure
- DF — Delegation failure: subagent/runtime orchestration failure, bad handoff, missing proof
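The taxonomy maps naturally onto an enumeration, which keeps result files from drifting into free-text failure labels. A minimal sketch, assuming each result record is a dict with an optional failure field holding one of the codes above; that record shape is an assumption, not part of the skill.

```python
from enum import Enum


class FailureClass(Enum):
    MF = "Model failure"
    HF = "Harness failure"
    PF = "Provider failure"
    CF = "Context failure"
    PB = "Policy block"
    SF = "Schema/format failure"
    DF = "Delegation failure"


def tally(results):
    """Count failures per class across a list of result records.

    Records without a "failure" field (or with None) count as successes.
    An unknown code raises KeyError instead of being silently bucketed.
    """
    counts = {fc.name: 0 for fc in FailureClass}
    for record in results:
        code = record.get("failure")
        if code is None:
            continue
        counts[FailureClass[code].name] += 1
    return counts
```

Reporting the MF count separately from HF and PF is what lets a reader tell a weak model from a broken harness or a flaky provider.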
Proof rules
Before saying DONE, provide:
- benchmark pack path
- results path(s)
- score summary
- list of failed/skipped models and why
- note on anything still unverified
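The checklist above can double as a gate before a run is reported as DONE. The sketch below assumes a hypothetical report dict whose field names (pack_path, results_paths, score_summary, failed_or_skipped, unverified) are invented here for illustration.

```python
from pathlib import Path

# Hypothetical field names, one per item in the proof checklist.
REQUIRED_FIELDS = (
    "pack_path",
    "results_paths",
    "score_summary",
    "failed_or_skipped",
    "unverified",
)


def proof_gaps(report):
    """Return the reasons a report may not yet say DONE (empty list = ok)."""
    gaps = [f"missing field: {key}" for key in REQUIRED_FIELDS if key not in report]
    paths = [report.get("pack_path")] + list(report.get("results_paths", []))
    for path in paths:
        # Claimed paths must exist on disk, not just appear in the report.
        if path and not Path(path).exists():
            gaps.append(f"path does not exist: {path}")
    return gaps
```

The failed_or_skipped and unverified fields may be empty lists, but they must be present, so that "nothing failed" is an explicit claim rather than an omission.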
Recommended folder naming
Use:
- output/benchmarks/YYYY-MM-DD-<benchmark-name>/
- keep machine-readable and human-readable files together
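A small helper can keep folder names consistent across runs. This is a sketch only: the slug rule (lowercase, non-alphanumeric runs collapsed to a hyphen) is an assumption, since the skill specifies the date prefix and the benchmark name but not how the name is normalized.

```python
import re
from datetime import date
from pathlib import Path


def benchmark_dir(name, day=None, root="output/benchmarks"):
    """Build the path <root>/YYYY-MM-DD-<benchmark-name> for a run."""
    day = day or date.today()
    # Assumed normalization: lowercase, collapse anything else to "-".
    slug = re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")
    if not slug:
        raise ValueError("benchmark name has no usable characters")
    return Path(root) / f"{day.isoformat()}-{slug}"


print(benchmark_dir("Ops Routing v2", date(2025, 1, 31)).as_posix())
# output/benchmarks/2025-01-31-ops-routing-v2
```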
Suggested artifact pack
Every serious benchmark should produce most or all of:
- README.md
- tasks.json
- answer-key.json
- rubric.md
- judge-notes.md
- results-raw.json
- results-scored.json
- league-table.png or infographic
- harness.py or equivalent scorer if automation exists
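To see at a glance how complete a pack is, the artifact list can be checked against the pack folder. A minimal sketch, assuming the files use exactly the names listed above; a pack that ships an infographic or a scorer under a different name would show up as missing here and needs a manual look.

```python
from pathlib import Path

ARTIFACTS = [
    "README.md",
    "tasks.json",
    "answer-key.json",
    "rubric.md",
    "judge-notes.md",
    "results-raw.json",
    "results-scored.json",
    "league-table.png",
    "harness.py",
]


def pack_status(pack_dir):
    """Map each expected artifact name to whether it exists in the pack."""
    pack = Path(pack_dir)
    return {name: (pack / name).is_file() for name in ARTIFACTS}


def missing_artifacts(pack_dir):
    """List absent artifacts, in the same order as ARTIFACTS."""
    return [name for name, present in pack_status(pack_dir).items() if not present]
```

Since the skill asks for "most or all" of these files, the missing list is best treated as a prompt for a note in the README rather than a hard failure.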
If spawning sub-agents
- use a self-heal pattern
- require checkpoints
- require proof
- do not let agents claim success without file-path evidence
Success criterion
A good benchmark changes routing decisions. If the result would not alter which model you use for real work, the benchmark is probably too soft.
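Install and use
- Make sure OpenClaw is installed (locally or via Docker).
- Enter the install command in the chat box: /install benchmarking
- Once installed, invoke the skill by name or trigger it with /benchmarking.
- Provide the required inputs described in the skill's parameters to get structured output.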
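What is benchmarking?
Evaluate and compare models or providers on real-work tasks by creating, running, and expanding benchmarks that assess tool choice, failure recovery, and pro... It is an AI agent skill plugin for Claude Code / OpenClaw, with 44 downloads to date.
How do I install benchmarking?
Run the command "/install benchmarking" in the OpenClaw or Claude Code chat box for a one-step install; no extra configuration is needed.
Is benchmarking free?
Yes. benchmarking is completely free under the MIT-0 license and can be freely downloaded, installed, and used.
Which platforms does benchmarking support?
benchmarking is cross-platform and runs in any environment where OpenClaw / Claude Code is deployed.
Who developed benchmarking?
It is developed and maintained by HiM (@h-mascot); the current version is v1.0.0.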