
benchmarking

Name: benchmarking
Author: h-mascot

by HiM · GitHub ↗ · v1.0.0 · MIT-0

cross-platform ✓ Security Clean

Install in OpenClaw

/install benchmarking

Description

Evaluate and compare models or providers on real-work tasks by creating, running, and expanding benchmarks that assess tool choice, failure recovery, and pro...

README (SKILL.md)

benchmarking

Use this skill when you need to:

benchmark models or agents
compare providers for real work
evaluate which model should own cron/ops/coding/research tasks
turn real work into reusable evaluation packs
create league tables, scorecards, or benchmark infographics

Goal

Benchmark operator leverage, not just output prettiness.

A good benchmark should tell you:

who chooses the right tool/runtime
who respects hidden constraints
who recovers from failure intelligently
who verifies before claiming success
who is worth routing real work to

Benchmark modes

1) Design mode

Use when you need to create a benchmark or full suite.

Expected outputs:

README.md
tasks.json
answer-key.json or answer-key guidelines
rubric.md
optional judge-notes.md

2) Execution mode

Use when you need to run models through an existing benchmark.

Expected outputs:

results-raw.json
results-scored.json
README.md
optional infographic / league-table PNG

3) Expansion mode

Use when you want to make a benchmark harder or add new tracks. Do not reinvent baseline tasks unless needed.

Benchmark design rules

Ground tasks in real work, not toy prompts.
Hide important constraints in environment/context, not all in the prompt.
Weight judgment above syntax.
Include at least one task where the right move is not to act now.
Include at least one task about tool/runtime choice.
Include at least one task about failure recovery.
Include at least one task requiring proof-oriented completion.
Separate model failure from provider/harness failure.
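One way to enforce the "at least one task" rules above is a coverage check over the task list. This is a minimal sketch: the `tags` field and the four tag names are hypothetical, since the skill does not define a schema for tasks.json.

```python
# Hypothetical tag names, one per "include at least one task" rule above.
REQUIRED_TAGS = {"no-action", "tool-choice", "failure-recovery", "proof"}


def missing_coverage(tasks: list[dict]) -> set[str]:
    """Return the required tags that no task in the pack carries."""
    covered: set[str] = set()
    for task in tasks:
        # Assumption: each task stores its categories in a "tags" list.
        covered.update(task.get("tags", []))
    return REQUIRED_TAGS - covered
```

An empty result means every required category has at least one task. A non-empty result names the categories still missing from the pack.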

Large-run execution rules

Confirm the actual model roster from the environment.
Run in batches for large rosters.
Save raw outputs after each batch.
Score after raw outputs are locked.
Generate charts last.
Better a verified no-PNG pack than an incomplete run with pretty graphics.
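The batching and save-after-each-batch rules above could look roughly like the following. It is a sketch under assumptions: `run_model` is a hypothetical callable supplied by your harness, and the layout of the raw results file is illustrative rather than a format the skill prescribes.

```python
import json
from pathlib import Path
from typing import Callable, Iterator


def batched(items: list[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def run_in_batches(
    roster: list[str],
    run_model: Callable[[str], dict],  # hypothetical: runs one model, returns its raw output
    raw_path: Path,
    batch_size: int = 5,
) -> dict:
    """Run every model in `roster`, rewriting `raw_path` after each batch."""
    raw: dict[str, dict] = {}
    for batch in batched(roster, batch_size):
        for model in batch:
            try:
                raw[model] = {"status": "ok", "output": run_model(model)}
            except Exception as exc:
                # Record the failure now; classify it when scoring.
                raw[model] = {"status": "error", "error": repr(exc)}
        # Persist after each batch so an interrupted run keeps what finished.
        raw_path.parent.mkdir(parents=True, exist_ok=True)
        raw_path.write_text(json.dumps(raw, indent=2))
    return raw
```

Scoring and chart generation would read from the saved file only after the final batch, which keeps raw outputs locked before anything is derived from them.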

Scoring rules

Prefer deterministic scoring for structured parts.
Add a human judge layer for operator judgment.
Keep syntax <=20% of score for hard benchmarks.
Classify failures explicitly instead of treating all failures as model weakness.
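As an illustration of the syntax cap, a scorer can reject any weighting that gives syntax more than 20% before it combines component scores. The component names and the example weights below are assumptions, not values taken from the skill.

```python
SYNTAX_CAP = 0.20  # from the rule above: syntax <= 20% of the score


def combine_scores(components: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of component scores in [0, 1]; weights must sum to 1."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    if weights.get("syntax", 0.0) > SYNTAX_CAP + 1e-9:
        raise ValueError("syntax weight exceeds the 20% cap")
    return sum(weights[name] * components[name] for name in weights)


# Hypothetical weighting: deterministic checks and the human judge layer
# dominate, and syntax sits exactly at the cap.
WEIGHTS = {"syntax": 0.20, "deterministic": 0.40, "judge": 0.40}
```

Failing loudly on a bad weighting is a deliberate choice here: a rubric that quietly over-weights syntax would reward well-formatted output over operator judgment.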

Failure taxonomy

Use these classes when interpreting results:

MF — Model failure: reasoning/tool choice/accuracy failure with valid harness
HF — Harness failure: benchmark harness itself broke or mis-scored
PF — Provider failure: rate limit, provider unsupported, transport failure, 404 model path
CF — Context failure: prompt too large, missing required context, context-window collapse
PB — Policy block: task blocked by approval/policy/tool restriction
SF — Schema/format failure: invalid JSON/structure or repeated parsing failure
DF — Delegation failure: subagent/runtime orchestration failure, bad handoff, missing proof
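The taxonomy maps naturally onto an enum, which lets a scorer keep harness, provider, and policy problems out of a model's failure rate. In this sketch, treating HF, PF, and PB as "not about the model" is an assumption; how to count CF and DF is left to the benchmark author.

```python
from enum import Enum
from typing import Optional


class FailureClass(Enum):
    MF = "Model failure"
    HF = "Harness failure"
    PF = "Provider failure"
    CF = "Context failure"
    PB = "Policy block"
    SF = "Schema/format failure"
    DF = "Delegation failure"


# Assumption: these classes say nothing about the model itself,
# so runs that hit them are excluded rather than counted as failures.
NON_MODEL = {FailureClass.HF, FailureClass.PF, FailureClass.PB}


def model_failure_rate(outcomes: list[Optional[FailureClass]]) -> float:
    """Share of valid runs that failed; None marks a passing run."""
    valid = [o for o in outcomes if o not in NON_MODEL]
    if not valid:
        return 0.0
    return sum(o is not None for o in valid) / len(valid)
```

With this split, a model that was rate-limited on half its tasks is reported as having fewer valid runs instead of a worse score.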

Proof rules

Before saying DONE, provide:

benchmark pack path
results path(s)
score summary
list of failed/skipped models and why
note on anything still unverified
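The proof checklist can be checked mechanically before an agent reports DONE. The sketch below assumes a hypothetical `report` dict with the keys `score_summary`, `failed_or_skipped`, and `unverified`; the skill lists the items to provide but does not fix a format for them.

```python
from pathlib import Path


def proof_gaps(pack_dir: Path, result_files: list[str], report: dict) -> list[str]:
    """Return missing proof items; an empty list means DONE may be claimed."""
    gaps = []
    if not pack_dir.is_dir():
        gaps.append(f"missing pack dir: {pack_dir}")
    for name in result_files:
        # File-path evidence: the results must exist on disk, not just be claimed.
        if not (pack_dir / name).is_file():
            gaps.append(f"missing results file: {name}")
    # Hypothetical report keys mirroring the last three checklist items.
    for key in ("score_summary", "failed_or_skipped", "unverified"):
        if key not in report:
            gaps.append(f"report lacks: {key}")
    return gaps
```

The same check works as a gate on sub-agent handoffs, since it accepts only evidence that exists as a file path.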

Recommended folder naming

Use:

output/benchmarks/YYYY-MM-DD-<benchmark-name>/
keep machine-readable and human-readable files together
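A small helper can build the folder path from a benchmark name and a date. Lowercasing the name and replacing other characters with hyphens is an assumption added for illustration; the skill specifies only the date-prefixed pattern.

```python
import re
from datetime import date
from pathlib import Path


def benchmark_dir(name: str, day: date, root: Path = Path("output/benchmarks")) -> Path:
    """Build <root>/YYYY-MM-DD-<benchmark-name> for one benchmark pack."""
    # Assumption: normalize the name to a lowercase, hyphen-separated slug.
    slug = re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")
    return root / f"{day.isoformat()}-{slug}"
```

For example, `benchmark_dir("Ops Triage v2", date(2025, 1, 31))` yields `output/benchmarks/2025-01-31-ops-triage-v2`.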

Suggested artifact pack

Every serious benchmark should produce most or all of:

README.md
tasks.json
answer-key.json
rubric.md
judge-notes.md
results-raw.json
results-scored.json
league-table.png or infographic
harness.py or equivalent scorer if automation exists

If spawning sub-agents

use a self-heal pattern
require checkpoints
require proof
do not let agents claim success without file-path evidence

Success criterion

A good benchmark changes routing decisions. If the result would not alter which model you use for real work, the benchmark is probably too soft.

Usage Guidance

Install only if you want an agent to help create or run model benchmark artifacts. When using execution mode, confirm the model roster and provider access first because benchmark runs can consume quota and create local result files.

Capability Assessment

✓ Purpose & Capability

The stated purpose is benchmarking models and agents, and the artifact content consistently focuses on benchmark design, execution, scoring, failure taxonomy, and result reporting.

✓ Instruction Scope

Instructions ask the agent to create benchmark packs, run comparisons, save raw and scored results, and optionally coordinate sub-agents with checkpoints and proof; these are disclosed and aligned with the benchmarking purpose.

✓ Install Mechanism

The artifact contains only README.md and SKILL.md. There are no scripts, dependencies, installer hooks, executables, or package setup files.

ℹ Credentials

Execution mode can involve checking available model rosters and running batches, which may use provider access or quota, but this is expected for model benchmarking and should remain user-directed.

✓ Persistence & Privilege

Persistence is limited to benchmark output files such as results and rubrics under a recommended output folder. The skill does not request elevated privileges, background services, credential stores, or long-running persistence.

How to Use

Make sure OpenClaw is installed (local or Docker)
Run the install command in chat: /install benchmarking
After installation, invoke the skill by name or use /benchmarking
Provide required inputs per the skill's parameter spec and get structured output

Version History

v1.0.0

Synced from SuperAda.ai resources

Metadata

Slug benchmarking

Version 1.0.0

License MIT-0

All-time Installs 0

Active Installs 0

Total Versions 1

Frequently Asked Questions

What is benchmarking?

Evaluate and compare models or providers on real-work tasks by creating, running, and expanding benchmarks that assess tool choice, failure recovery, and pro... It is an AI Agent Skill for Claude Code / OpenClaw, with 44 downloads so far.

How do I install benchmarking?

Run "/install benchmarking" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is benchmarking free?

Yes, benchmarking is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does benchmarking support?

benchmarking is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created benchmarking?

It is built and maintained by HiM (@h-mascot); the current version is v1.0.0.
