lanyasheng

Benchmark Store

by _silhouette · GitHub ↗ · v1.1.1 · MIT-0
cross-platform · ⚠ suspicious · 117 Downloads · 0 Stars · 1 Active Installs · 3 Versions
Install in OpenClaw
/install benchmark-store
Description
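Use this skill to initialize the benchmark database, compare a skill's scores against historical baselines, check the Pareto front for per-dimension regressions, or look up the quality tier standards. Not for scoring candidates (use improvement-discriminator) or automated improvement (use improvement-learner).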
README (SKILL.md)

Benchmark Store

Frozen benchmarks, hidden tests, Pareto front, and evaluation standards.

When to Use

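  • Initialize or query the benchmark database
  • Compare a skill's scores against frozen baselines
  • Check the Pareto front (a regression of more than 5% in any dimension means rejection)
  • Look up the quality tier standards (POWERFUL/SOLID/GENERIC/WEAK)
  • Add new frozen test cases to the benchmark store
  • View a skill's historical best score in every dimension
  • Supply Pareto baseline data to the RegressionGate in improvement-gate
  • List all registered benchmark entries in batch evaluation scenarios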

When NOT to Use

  • Scoring candidates → use improvement-discriminator
  • Automated improvement → use improvement-learner
  • Full pipeline → use improvement-orchestrator
  • Executing changes → use improvement-executor
  • Gate decisions → use improvement-gate (it consumes benchmark-store data)

Quality Tiers

Tier          Score     Ship?
POWERFUL ⭐   ≥ 85%     Marketplace ready
SOLID         70–84%    GitHub
GENERIC       55–69%    Needs iteration
WEAK          < 55%     Reject or rewrite

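Tiers are based on the weighted composite score across all dimensions. Each dimension has its own weight in evaluation-standards.md: accuracy and coverage carry the highest weight (0.2 each), security is weighted 0.15, and the remaining dimensions 0.1–0.15 each. improvement-orchestrator uses the tier to decide whether to keep iterating: POWERFUL stops the loop, and WEAK forces a retry.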
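As a rough illustration of the tiering rule, the sketch below computes a weighted composite and maps it to a tier. Only the accuracy, coverage, and security weights come from the text; the 0.15 weights for the other three dimensions, the dimension set itself, and the helper names `composite_score`, `tier_for`, and `next_action` are assumptions made for the example, not the contents of evaluation-standards.md.

```python
# Hypothetical weights: accuracy/coverage (0.2) and security (0.15) are stated
# in the text; the remaining 0.15 values are assumed so the weights sum to 1.0.
WEIGHTS = {
    "accuracy": 0.20,
    "coverage": 0.20,
    "security": 0.15,
    "reliability": 0.15,
    "efficiency": 0.15,
    "trigger_quality": 0.15,
}


def composite_score(scores: dict) -> float:
    """Weighted sum of per-dimension scores, each expected in [0, 1]."""
    return sum(weight * scores[dim] for dim, weight in WEIGHTS.items())


def tier_for(score: float) -> str:
    """Map a composite score in [0, 1] to the tier names from the table."""
    if score >= 0.85:
        return "POWERFUL"
    if score >= 0.70:
        return "SOLID"
    if score >= 0.55:
        return "GENERIC"
    return "WEAK"


def next_action(tier: str) -> str:
    """Only the POWERFUL and WEAK behaviours are specified in the text."""
    if tier == "POWERFUL":
        return "stop"
    if tier == "WEAK":
        return "retry"
    return "unspecified"  # SOLID/GENERIC handling is not described here


example_scores = {
    "accuracy": 0.82, "coverage": 0.90, "reliability": 1.0,
    "efficiency": 0.95, "security": 1.0, "trigger_quality": 0.75,
}
example_tier = tier_for(composite_score(example_scores))  # "POWERFUL"
print(example_tier, next_action(example_tier))
```

With these assumed weights the example scores give a composite of about 0.899, which lands in the POWERFUL tier.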

Why Pareto Front Instead of Weighted Average

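Tradeoff: Using a weighted average to detect regressions looks simple (a single number to compare), but it hides trade-offs between dimensions. For example, if accuracy drops from 0.9 to 0.6 while coverage rises from 0.5 to 0.8, the weighted average may stay flat or even rise, masking the large accuracy regression. Because the Pareto front is checked dimension by dimension, a regression of more than 5% in any single dimension triggers rejection, which ensures that an accepted change is a true Pareto improvement (no dimension gets worse) rather than a pseudo-improvement bought at the expense of another dimension.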

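The cost of this design is that improvements are harder to get accepted (the pass rate is roughly 40–50%), but it avoids the drift problem where each round of changes wanders further off target. In autoloop scenarios, Pareto protection is the key mechanism that keeps the LLM from gradually losing existing strengths over many iterations.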
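The accuracy/coverage example above can be worked through numerically. This is a minimal sketch that assumes equal weights of 0.2 for the two dimensions and treats the 5% threshold as an absolute drop of 0.05; neither assumption is confirmed by the skill's code.

```python
# Assumed weights for the two dimensions in the example (0.2 each).
WEIGHT = {"accuracy": 0.2, "coverage": 0.2}
TOLERANCE = 0.05  # assumed to be an absolute score difference

before = {"accuracy": 0.9, "coverage": 0.5}
after = {"accuracy": 0.6, "coverage": 0.8}


def weighted(scores: dict) -> float:
    """Weighted sum over the dimensions in WEIGHT."""
    return sum(WEIGHT[dim] * scores[dim] for dim in WEIGHT)


# The weighted view: the two changes cancel out, so nothing looks wrong.
avg_delta = weighted(after) - weighted(before)

# The per-dimension view: accuracy fell by 0.3, far beyond the tolerance.
regressed_dims = [dim for dim in before if before[dim] - after[dim] > TOLERANCE]

print(f"weighted delta: {avg_delta:+.3f}")
print(f"regressed dimensions: {regressed_dims}")
```

The weighted delta is zero (up to floating-point noise), while the per-dimension check flags accuracy, which is the case the Pareto check is meant to catch.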

Pareto Front

ParetoFront.check_regression(new_scores) → {"regressed": bool, "regressions": [...]}
# 5% tolerance — minor fluctuations allowed

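The full Pareto regression check flow: load the historical best scores → compare dimension by dimension → mark any dimension that drops by more than the 5% threshold as a regression → if any regression exists, the overall result is regressed=True.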

# Pareto regression check — full example
from lib.pareto import ParetoFront

pf = ParetoFront("state/pareto.json")
new_scores = {"accuracy": 0.82, "coverage": 0.90, "reliability": 1.0,
              "efficiency": 0.95, "security": 1.0, "trigger_quality": 0.75}
result = pf.check_regression(new_scores)
# result = {"regressed": True,
#           "regressions": [{"dim": "accuracy", "before": 0.90, "after": 0.82, "delta": -0.08}]}
# accuracy dropped 8% (> 5% threshold) → regression detected → candidate rejected
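For readers who want to see the flow end to end, here is a standalone sketch of a check with the same return shape. It is not the code in lib.pareto: the baseline is passed in as a plain dict because the schema of state/pareto.json is not shown here, the function name `check_regression_sketch` and the baseline values are hypothetical, and the tolerance is assumed to be an absolute difference of 0.05.

```python
def check_regression_sketch(best: dict, new_scores: dict,
                            tolerance: float = 0.05) -> dict:
    """Compare new scores against historical bests, one dimension at a time."""
    regressions = []
    for dim, before in best.items():
        if dim not in new_scores:
            continue  # dimension not measured this round; skipped in this sketch
        after = new_scores[dim]
        delta = round(after - before, 4)  # rounding hides float noise
        if delta < -tolerance:  # dropped by more than the tolerance
            regressions.append(
                {"dim": dim, "before": before, "after": after, "delta": delta}
            )
    # A single regressed dimension is enough to reject the candidate.
    return {"regressed": bool(regressions), "regressions": regressions}


# Hypothetical baseline values for illustration only.
baseline = {"accuracy": 0.90, "coverage": 0.88}
sketch_result = check_regression_sketch(
    baseline, {"accuracy": 0.82, "coverage": 0.90}
)
print(sketch_result)
```

In this sketch a drop of exactly 0.05 is still accepted, since only drops strictly greater than the tolerance count as regressions.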

<example>
Correct: check the Pareto front for regressions
$ python3 -c "from lib.pareto import ParetoFront; pf = ParetoFront('state/pareto.json'); print(pf.check_regression({'accuracy': 0.9, 'coverage': 0.8}))"
→ {'regressed': False, 'regressions': []}  # no regression, safe to accept
</example>

<anti-example>
Wrong: using benchmark-store to score candidates → benchmark-store only stores data; use improvement-discriminator for scoring
</anti-example>

CLI

# List benchmarks
python3 scripts/benchmark_db.py --action list --db-path benchmarks.db

# Compare skill against baselines
python3 scripts/benchmark_db.py --action compare --skill-path /path/to/skill --category general --db-path benchmarks.db

# Add a benchmark
python3 scripts/benchmark_db.py --action add --category general --test-name "test1" --db-path benchmarks.db

Output Artifacts

Request        Deliverable
Init           SQLite database with schema
Compare        JSON comparison with per-dimension delta
Pareto check   JSON with regressed flag and details
List           All registered benchmarks with metadata
Add            Confirmation of new benchmark insertion

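The Compare output contains a before/after/delta triple for each dimension, plus the overall tier change as tier_before/tier_after. When a skill is upgraded from SOLID to POWERFUL or downgraded from GENERIC to WEAK, the output is additionally annotated with tier_changed: true.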
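The shape of such a comparison payload might look like the sketch below. The field names before/after/delta, tier_before/tier_after, and tier_changed come from the description above; the top-level "dimensions" key, the helper names, and the choice to set tier_changed on any tier difference (the text names only two transitions) are assumptions.

```python
import json


def tier_for(score: float) -> str:
    """Tier thresholds from the Quality Tiers table, on a 0-1 scale."""
    if score >= 0.85:
        return "POWERFUL"
    if score >= 0.70:
        return "SOLID"
    if score >= 0.55:
        return "GENERIC"
    return "WEAK"


def build_comparison(before: dict, after: dict,
                     composite_before: float, composite_after: float) -> dict:
    """Assemble a per-dimension delta report plus the tier transition."""
    dimensions = {
        dim: {
            "before": before[dim],
            "after": after[dim],
            "delta": round(after[dim] - before[dim], 4),
        }
        for dim in before
        if dim in after
    }
    tier_before = tier_for(composite_before)
    tier_after = tier_for(composite_after)
    report = {
        "dimensions": dimensions,  # hypothetical key name
        "tier_before": tier_before,
        "tier_after": tier_after,
    }
    if tier_before != tier_after:  # assumption: any tier change is flagged
        report["tier_changed"] = True
    return report


comparison = build_comparison(
    {"accuracy": 0.8}, {"accuracy": 0.9},
    composite_before=0.80, composite_after=0.86,
)
print(json.dumps(comparison, indent=2))
```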

Related Skills

  • improvement-learner: Imports ParetoFront for self-improvement loop
  • improvement-gate: RegressionGate uses Pareto data to block regressions
  • improvement-discriminator: References evaluation standards for scoring context
  • improvement-orchestrator: Full pipeline queries benchmark-store at gate stage
  • autoloop-controller: Uses historical benchmark data to detect convergence plateau

Data Files

  • data/evaluation-standards.md — Quality tiers, dimensions, weights (v2.0.0)
  • data/fixtures/ — Frozen test fixtures
  • state/pareto.json — Per-skill Pareto front historical best scores
  • benchmarks.db — SQLite database storing all benchmark entries and results
Usage Guidance
What to check before installing/using this skill:
  • Functionality vs CLI: The compare CLI examples assume the script will run by itself, but scripts/benchmark_db.py's compare path expects an evaluator callable (i.e., a function to run tests and return scores). Confirm how you will supply/implement that evaluator when invoking --action compare; otherwise the script deliberately refuses to return mocked scores.
  • Hidden tests / secrets: Hidden test loading and decryption APIs require a password/key (not declared as an env var). Decide how to store/provide any decryption password (avoid embedding in repo). Ensure you understand where any decrypted test data will be kept in plaintext and who can access it.
  • Filesystem writes: The skill will create or update benchmarks.db and state/pareto.json and can export reports. Run it in a sandbox or with a controlled working directory to avoid accidental writes to sensitive locations.
  • Code review: Because this package contains Python code (interfaces/, scripts/), review the full main/entry code paths (the truncated main() in benchmark_db.py and any functions that may import dynamic evaluators or execute code) to ensure there is no dynamic execution of untrusted code (exec/eval/os.system). The static check here did not flag obvious exec/os.system usage, but you should verify the omitted/truncated portions.
  • No network creds by default: The package does not request API keys or network credentials; if you later integrate external benchmark imports or remote data sources, require explicit review and avoid putting secrets into repo files.
  • Next steps: If you intend to run comparisons, prepare or inspect an evaluator implementation that conforms to the expected signature, confirm how encrypted hidden tests are unlocked (password handling), and run the skill in a disposable environment first.
Capability Analysis
Type: OpenClaw Skill
Name: benchmark-store
Version: 1.1.1
The benchmark-store skill bundle is a framework designed for the evaluation and regression testing of AI agent skills. It implements a SQLite-based benchmark database (scripts/benchmark_db.py), Pareto front tracking for multi-dimensional performance analysis (scripts/pareto.py), and mechanisms for 'hidden' tests to prevent model overfitting (interfaces/hidden_tests.py). While the bundle contains known attack payloads (e.g., SQL injection and path traversal strings in data/test-cases.yaml and data/red-team-guide.md), these are explicitly documented as test vectors for 'Red Teaming' other skills to ensure their safety. The code follows secure practices such as using parameterized SQL queries and SHA256 integrity verification, and lacks any indicators of data exfiltration or unauthorized execution.
Capability Assessment
Purpose & Capability
Name/description align with the provided files: there are scripts for a benchmarks DB, Pareto checks, evaluation standards, fixtures, and interfaces for frozen/hidden tests. However there are small inconsistencies (SKILL.md frontmatter version 0.1.0 vs registry version 1.1.1) and the runtime CLI/compare workflow described in SKILL.md does not cleanly match the implementation details in scripts/benchmark_db.py (see instruction_scope). Overall capability requested (local DB, JSON state, hidden tests) is coherent with the stated purpose.
Instruction Scope
SKILL.md instructs running scripts/benchmark_db.py --action compare as a CLI action, but the compare implementation (compare_with_benchmark) requires an evaluator callable parameter and will raise ValueError if evaluator is None. The SKILL.md examples omit how this evaluator is provided; that is a functional incoherence (CLI implies standalone invocation but code requires an injected callable). Also SKILL.md and code reference state/pareto.json and hidden test loading that requires a password/key for decryption, but no guidance is given for supplying or protecting that secret. The instructions also assume read/write access to local files (benchmarks.db, state/pareto.json) — expected for this skill but should be explicit.
Install Mechanism
No install spec is provided and the skill is delivered as source/CLIs and docs only. That is low-risk from an install-source perspective (no remote downloads or package installs).
Credentials
The registry metadata declares no required env vars or credentials, but the code supports loading encrypted hidden tests and derives/uses passwords (HiddenTestDataSource/FileHiddenTestDataSource, HiddenTestSuite.unlock/load requiring password). The SKILL.md does not declare how passwords/keys are provided (no primaryEnv or envVars), so secret handling is under-specified. The scripts will create and write local files (benchmarks.db, state/pareto.json, exported reports) — that is expected but you should ensure file paths are appropriate and writable. No network endpoints or external credentials are requested in the files reviewed.
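Since the skill does not say how the hidden-test password should be supplied, one option is to read it from the environment and fall back to an interactive prompt, so it never lands in the repo. This is only a sketch: the variable name `BENCHMARK_STORE_HIDDEN_TEST_PASSWORD` and the helper `get_hidden_test_password` are hypothetical and are not declared by the skill, and the unlock call it would feed is not shown because its signature is not given here.

```python
import getpass
import os

# Hypothetical variable name; the skill itself declares no env vars.
ENV_VAR = "BENCHMARK_STORE_HIDDEN_TEST_PASSWORD"


def get_hidden_test_password(env=None) -> str:
    """Return the password from the environment, else prompt for it."""
    env = os.environ if env is None else env
    password = env.get(ENV_VAR)
    if password:
        return password
    # Interactive fallback: getpass does not echo the typed characters.
    return getpass.getpass("Hidden test password: ")
```

The returned value would then be passed to whichever unlock or load call the hidden-test interface exposes, and should not be written to logs or exported reports.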
Persistence & Privilege
The skill does not request always:true or other elevated platform privileges. It reads/writes local state files (SQLite DB, state/pareto.json) and exports reports; that is consistent with a benchmark store and within expected privilege for such a skill.
How to Use
  1. Make sure OpenClaw is installed (local or Docker)
  2. Run the install command in chat: /install benchmark-store
  3. After installation, invoke the skill by name or use /benchmark-store
  4. Provide required inputs per the skill's parameter spec and get structured output
Version History
v1.1.1
v2.0: 9-dim evaluation (coverage/completeness split), 4-role weighted scoring with category modifiers (tool/knowledge/orchestration/rule), per-dim Pareto tolerances (security 2%, efficiency 10%, others 5%), LLM-as-Judge default, enriched SKILL.md docs, 11 test fixes, GEPA references removed.
v1.1.0
v1.1.0: Fix 4 critical pipeline bugs (Ralph Wiggum/Autoloop/Evaluator verdict), scoring overhaul (base 4->2, LLM weight 50%, semantic relevance), generator LLM-first, learner/gate/executor fixes
v1.0.0
Initial release of benchmark-store skill.
  • Provides tools for initializing, querying, and comparing benchmark data.
  • Offers Pareto front regression checks (flags if any skill dimension regresses >5%).
  • Exposes quality grading standards: POWERFUL, SOLID, GENERIC, WEAK.
  • Delivers artifacts for benchmarking: SQLite DB, comparison deltas, and regression JSON.
  • Clearly separates benchmarking from scoring and automated improvement workflows.
Metadata
Slug benchmark-store
Version 1.1.1
License MIT-0
All-time Installs 1
Active Installs 1
Total Versions 3
Frequently Asked Questions

What is Benchmark Store?

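Use it to initialize the benchmark database, compare a skill's scores against historical baselines, check the Pareto front for per-dimension regressions, or look up the quality tier standards. It is not for scoring candidates (use improvement-discriminator) or automated improvement (use improvement-learner). It is an AI Agent Skill for Claude Code / OpenClaw, with 117 downloads so far.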

How do I install Benchmark Store?

Run "/install benchmark-store" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is Benchmark Store free?

Yes, Benchmark Store is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does Benchmark Store support?

Benchmark Store is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created Benchmark Store?

It is built and maintained by _silhouette (@lanyasheng); the current version is v1.1.1.
