h-mascot

benchmarking

Author: HiM · GitHub · v1.0.0 · MIT-0
cross-platform ✓ Security check passed
Total downloads: 44 · Favorites: 0 · Current installs: 0 · Versions: 1
Install in OpenClaw
/install benchmarking
Description
Evaluate and compare models or providers on real-work tasks by creating, running, and expanding benchmarks that assess tool choice, failure recovery, and pro...
Usage instructions (SKILL.md)

benchmarking

Use this skill when you need to:

  • benchmark models or agents
  • compare providers for real work
  • evaluate which model should own cron/ops/coding/research tasks
  • turn real work into reusable evaluation packs
  • create league tables, scorecards, or benchmark infographics

Goal

Benchmark operator leverage, not just output prettiness.

A good benchmark should tell you:

  • who chooses the right tool/runtime
  • who respects hidden constraints
  • who recovers from failure intelligently
  • who verifies before claiming success
  • who is worth routing real work to

Benchmark modes

1) Design mode

Use when you need to create a benchmark or full suite.

Expected outputs:

  • README.md
  • tasks.json
  • answer-key.json or answer-key guidelines
  • rubric.md
  • optional judge-notes.md

2) Execution mode

Use when you need to run models through an existing benchmark.

Expected outputs:

  • results-raw.json
  • results-scored.json
  • README.md
  • optional infographic / league-table PNG

3) Expansion mode

Use when you want to make a benchmark harder or add new tracks. Do not reinvent baseline tasks unless needed.

Benchmark design rules

  1. Ground tasks in real work, not toy prompts.
  2. Hide important constraints in environment/context, not all in the prompt.
  3. Weight judgment above syntax.
  4. Include at least one task where the right move is not to act now.
  5. Include at least one task about tool/runtime choice.
  6. Include at least one task about failure recovery.
  7. Include at least one task requiring proof-oriented completion.
  8. Separate model failure from provider/harness failure.
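
Rules 4 through 7 each demand at least one task of a given kind, so coverage can be checked mechanically. The sketch below assumes a hypothetical `tags` list on each task and hypothetical tag names; the skill does not define a schema for tasks.json.

```python
# Hypothetical tag names; the skill does not prescribe a tasks.json schema.
REQUIRED_TAGS = {
    "defer": "rule 4: the right move is not to act now",
    "tool_choice": "rule 5: tool/runtime choice",
    "failure_recovery": "rule 6: failure recovery",
    "proof": "rule 7: proof-oriented completion",
}


def missing_coverage(tasks):
    """Return descriptions of the design rules that no task in the pack covers."""
    seen = set()
    for task in tasks:
        seen.update(task.get("tags", []))
    return [desc for tag, desc in REQUIRED_TAGS.items() if tag not in seen]


if __name__ == "__main__":
    example_tasks = [
        {"id": "t1", "tags": ["tool_choice"]},
        {"id": "t2", "tags": ["failure_recovery", "proof"]},
    ]
    for gap in missing_coverage(example_tasks):
        print("missing:", gap)
```

An empty return value only shows that each kind of task is present; whether the tasks are grounded in real work still needs a human read.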

Large-run execution rules

  1. Confirm the actual model roster from the environment.
  2. Run in batches for large rosters.
  3. Save raw outputs after each batch.
  4. Score after raw outputs are locked.
  5. Generate charts last.
  6. Better a verified no-PNG pack than an incomplete run with pretty graphics.
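
Rules 2 and 3 can be sketched as a loop that rewrites the raw results file after every batch. This is a minimal illustration, not the skill's harness: `run_model` is a hypothetical caller-supplied function that takes a model name and returns its raw output, and the result layout is an assumption.

```python
import json
from pathlib import Path


def run_in_batches(models, run_model, out_path, batch_size=5):
    """Run models in batches, rewriting the raw results file after each batch.

    run_model is a hypothetical caller-supplied function: model name -> raw output.
    """
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    results = {}
    for start in range(0, len(models), batch_size):
        for model in models[start:start + batch_size]:
            try:
                results[model] = {"status": "ok", "output": run_model(model)}
            except Exception as exc:  # record the failure now, classify it later
                results[model] = {"status": "error", "error": str(exc)}
        # Persist after every batch so an interrupted run keeps finished work.
        out_path.write_text(json.dumps(results, indent=2))
    return results
```

Errors are recorded rather than raised so that one provider outage does not discard the rest of the roster.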

Scoring rules

  • Prefer deterministic scoring for structured parts.
  • Add a human judge layer for operator judgment.
  • Keep syntax <=20% of score for hard benchmarks.
  • Classify failures explicitly instead of treating all failures as model weakness.
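
One way to enforce the 20% ceiling on syntax is to reject any weight above it when blending scores. The function below is a sketch under assumptions: both inputs are taken to be on a 0 to 1 scale, and the names `syntax`, `judgment`, and `syntax_weight` are illustrative rather than defined by the skill.

```python
def combined_score(syntax, judgment, syntax_weight=0.2):
    """Blend a syntax score and a judgment score, both assumed to be in 0..1.

    syntax_weight may not exceed 0.2, following the "syntax <=20%" rule.
    """
    if not 0.0 <= syntax_weight <= 0.2:
        raise ValueError("syntax_weight must be between 0 and 0.2")
    return syntax_weight * syntax + (1.0 - syntax_weight) * judgment
```

With the default weight, a model with perfect syntax but a judgment score of 0.5 ends up at 0.6, so formatting alone cannot carry a weak run.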

Failure taxonomy

Use these classes when interpreting results:

  • MF — Model failure: reasoning/tool choice/accuracy failure with valid harness
  • HF — Harness failure: benchmark harness itself broke or mis-scored
  • PF — Provider failure: rate limit, provider unsupported, transport failure, 404 model path
  • CF — Context failure: prompt too large, missing required context, context-window collapse
  • PB — Policy block: task blocked by approval/policy/tool restriction
  • SF — Schema/format failure: invalid JSON/structure or repeated parsing failure
  • DF — Delegation failure: subagent/runtime orchestration failure, bad handoff, missing proof
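
The seven classes map naturally onto an enumeration, which keeps result files from drifting into free-text labels. In this sketch the input shape (model name mapped to a class code, or `None` for a clean run) is an assumption.

```python
from collections import Counter
from enum import Enum


class FailureClass(Enum):
    MF = "Model failure"
    HF = "Harness failure"
    PF = "Provider failure"
    CF = "Context failure"
    PB = "Policy block"
    SF = "Schema/format failure"
    DF = "Delegation failure"


def summarize_failures(results):
    """Count failures per class; results maps model name -> class code or None."""
    counts = Counter(FailureClass[code] for code in results.values() if code)
    return {cls.name: counts.get(cls, 0) for cls in FailureClass}
```

An unknown code raises `KeyError`, which is deliberate here: an unclassified failure should stop the summary instead of being counted as model weakness.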

Proof rules

Before saying DONE, provide:

  • benchmark pack path
  • results path(s)
  • score summary
  • list of failed/skipped models and why
  • note on anything still unverified

Recommended folder naming

Use:

  • output/benchmarks/YYYY-MM-DD-<benchmark-name>/
  • keep machine-readable and human-readable files together
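
The naming pattern can be built with the standard library. The helper name `benchmark_dir` and its parameters are illustrative, not part of the skill.

```python
from datetime import date
from pathlib import Path


def benchmark_dir(name, run_date=None, root="output/benchmarks"):
    """Build the dated benchmark folder path: <root>/YYYY-MM-DD-<name>."""
    run_date = run_date or date.today()
    return Path(root) / f"{run_date.isoformat()}-{name}"
```

The function only builds the path; creating the directory is left to the caller, for example with `mkdir(parents=True, exist_ok=True)`.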

Suggested artifact pack

Every serious benchmark should produce most or all of:

  • README.md
  • tasks.json
  • answer-key.json
  • rubric.md
  • judge-notes.md
  • results-raw.json
  • results-scored.json
  • league-table.png or infographic
  • harness.py or equivalent scorer if automation exists
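
A pack can be checked against this list before it is reported as complete. The sketch below treats the seven Markdown and JSON files as the core set and leaves out the PNG and harness.py, which the list itself marks as conditional; that split is an assumption.

```python
from pathlib import Path

# File names come from the list above; the image and harness.py are excluded
# here because the list describes them as alternatives or conditional.
CORE_ARTIFACTS = [
    "README.md",
    "tasks.json",
    "answer-key.json",
    "rubric.md",
    "judge-notes.md",
    "results-raw.json",
    "results-scored.json",
]


def missing_artifacts(pack_dir):
    """Return the core artifact file names that are absent from pack_dir."""
    pack = Path(pack_dir)
    return [name for name in CORE_ARTIFACTS if not (pack / name).is_file()]
```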

If spawning sub-agents

  • use a self-heal pattern
  • require checkpoints
  • require proof
  • do not let agents claim success without file-path evidence
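
The file-path evidence rule can be enforced by refusing any success claim whose cited files do not exist. The report shape used here (a `status` string and an `evidence` list of paths) is hypothetical; the skill does not specify a handoff format.

```python
from pathlib import Path


def verify_claim(report):
    """Accept a sub-agent's success claim only if every cited file exists.

    report is a hypothetical dict: {"status": str, "evidence": [file paths]}.
    Returns (accepted, missing_paths).
    """
    evidence = report.get("evidence", [])
    missing = [p for p in evidence if not Path(p).is_file()]
    accepted = report.get("status") == "done" and bool(evidence) and not missing
    return accepted, missing
```

A claim with an empty evidence list is rejected as well, since "done" with no paths is exactly the case the rule is meant to catch.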

Success criterion

A good benchmark changes routing decisions. If the result would not alter which model you use for real work, the benchmark is probably too soft.

Safe-use advice
Install only if you want an agent to help create or run model benchmark artifacts. When using execution mode, confirm the model roster and provider access first because benchmark runs can consume quota and create local result files.
Capability assessment
Purpose & Capability
The stated purpose is benchmarking models and agents, and the artifact content consistently focuses on benchmark design, execution, scoring, failure taxonomy, and result reporting.
Instruction Scope
Instructions ask the agent to create benchmark packs, run comparisons, save raw and scored results, and optionally coordinate sub-agents with checkpoints and proof; these are disclosed and aligned with the benchmarking purpose.
Install Mechanism
The artifact contains only README.md and SKILL.md. There are no scripts, dependencies, installer hooks, executables, or package setup files.
Credentials
Execution mode can involve checking available model rosters and running batches, which may use provider access or quota, but this is expected for model benchmarking and should remain user-directed.
Persistence & Privilege
Persistence is limited to benchmark output files such as results and rubrics under a recommended output folder. The skill does not request elevated privileges, background services, credential stores, or long-running persistence.
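How to use
  1. Make sure OpenClaw is installed (locally or via Docker).
  2. Enter the install command in the chat box: /install benchmarking
  3. Once installed, call the Skill by name or trigger it with /benchmarking.
  4. Provide the inputs described in the Skill's parameter notes to get structured output.
Version history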
v1.0.0
Synced from SuperAda.ai resources
Metadata
Slug: benchmarking
Version: 1.0.0
License: MIT-0
Total installs: 0
Current installs: 0
Versions: 1
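FAQ

What is benchmarking?

Evaluate and compare models or providers on real-work tasks by creating, running, and expanding benchmarks that assess tool choice, failure recovery, and pro... It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 44 total downloads to date.

How do I install benchmarking?

Run the command "/install benchmarking" in the OpenClaw or Claude Code chat box to install it in one step; no extra configuration is needed.

Is benchmarking free?

Yes. benchmarking is completely free and released under the MIT-0 license, so it can be downloaded, installed, and used freely.

Which platforms does benchmarking support?

benchmarking is cross-platform and runs in any environment where OpenClaw / Claude Code is deployed.

Who develops benchmarking?

It is developed and maintained by HiM (@h-mascot); the current version is v1.0.0.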
