Description

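Use this skill to initialize the benchmark database, compare a skill's scores against historical baselines, check the Pareto front for per-dimension regressions, or look up the quality tier standards. Do not use it to score candidates (use improvement-discriminator) or to improve skills automatically (use improvement-learner).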

README (SKILL.md)

Benchmark Store

Name: Benchmark Store
Author: lanyasheng

Frozen benchmarks, hidden tests, Pareto front, and evaluation standards.

When to Use

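Initialize or query the benchmark database
Compare a skill's scores against frozen baselines
Check the Pareto front (a regression of more than 5% in any dimension means rejection)
Look up the quality tier standards (POWERFUL/SOLID/GENERIC/WEAK)
Add new frozen test cases to the benchmark store
View a skill's historical best score in every dimension
Supply Pareto baseline data to the RegressionGate in improvement-gate
List all registered benchmark entries in batch evaluation scenarios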

When NOT to Use

Scoring candidates → use improvement-discriminator
Automatic improvement → use improvement-learner
End-to-end pipeline → use improvement-orchestrator
Applying changes → use improvement-executor
Gate decisions → use improvement-gate (it consumes benchmark-store data)

Quality Tiers

Tier	Score	Ship?
POWERFUL ⭐	≥ 85%	Marketplace ready
SOLID	70–84%	GitHub
GENERIC	55–69%	Needs iteration
WEAK	< 55%	Reject or rewrite

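Tiers are based on the weighted composite score across all dimensions. Each dimension has its own weight in evaluation-standards.md. accuracy and coverage carry the highest weight (0.2 each), security is weighted 0.15, and each remaining dimension is weighted between 0.1 and 0.15. improvement-orchestrator uses the tier to decide whether to keep iterating: POWERFUL stops the loop, and WEAK forces a retry.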
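
The tier mapping can be sketched as a weighted mean followed by a threshold lookup. This is a minimal illustration, not the skill's own code: only the accuracy, coverage and security weights come from the text, the 0.15 weights for the other dimensions are an assumption, and the function names `composite_score` and `tier` are hypothetical.

```python
# Hypothetical weights: only accuracy, coverage and security are stated in the text;
# the 0.15 values for the remaining dimensions are an assumption.
WEIGHTS = {
    "accuracy": 0.20, "coverage": 0.20, "security": 0.15,
    "reliability": 0.15, "efficiency": 0.15, "trigger_quality": 0.15,
}


def composite_score(scores, weights=WEIGHTS):
    """Weighted mean of per-dimension scores in [0, 1], normalized by total weight."""
    total = sum(weights[d] for d in scores)
    return sum(scores[d] * weights[d] for d in scores) / total


def tier(score):
    """Map a composite score in [0, 1] to a tier using the table's thresholds."""
    if score >= 0.85:
        return "POWERFUL"
    if score >= 0.70:
        return "SOLID"
    if score >= 0.55:
        return "GENERIC"
    return "WEAK"


scores = {"accuracy": 0.82, "coverage": 0.90, "reliability": 1.0,
          "efficiency": 0.95, "security": 1.0, "trigger_quality": 0.75}
print(tier(composite_score(scores)))
```

With these assumed weights the sample scores land in the POWERFUL tier; the real weights in evaluation-standards.md may give a different result.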

Why Pareto Front Instead of Weighted Average

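Tradeoff: Judging regressions by weighted average looks simple (one number to compare), but it hides tradeoffs between dimensions. For example, if accuracy drops from 0.9 to 0.6 while coverage rises from 0.5 to 0.8, the weighted average may stay flat or even rise, and the large accuracy regression is masked. Because the Pareto front is checked dimension by dimension, a regression beyond the 5% threshold in any single dimension triggers rejection. This ensures an accepted change is a genuine Pareto improvement (no dimension gets worse) rather than a pseudo-improvement bought at the expense of another dimension.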

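The cost of this design is that improvements are harder to get accepted (the pass rate is roughly 40-50%), but it avoids the drift problem where successive changes push a skill further off course. In autoloop scenarios, Pareto protection is the key mechanism that keeps an LLM from gradually losing existing strengths over many iterations.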
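
The accuracy/coverage example can be checked with a few lines of arithmetic. The sketch below uses the two 0.2 weights stated in the Quality Tiers section and treats the 5% threshold as an absolute drop of 0.05, which is an assumption.

```python
# Weights taken from the Quality Tiers section (accuracy and coverage are 0.2 each).
weights = {"accuracy": 0.2, "coverage": 0.2}
before = {"accuracy": 0.9, "coverage": 0.5}
after = {"accuracy": 0.6, "coverage": 0.8}


def weighted_avg(scores):
    """Weighted mean over the dimensions listed in `weights`."""
    total = sum(weights.values())
    return sum(scores[d] * weights[d] for d in weights) / total


avg_before = weighted_avg(before)
avg_after = weighted_avg(after)

# Per-dimension view: list every dimension that dropped by more than 0.05.
dropped = [d for d in before if before[d] - after[d] > 0.05]

print(round(avg_before, 3), round(avg_after, 3))
print(dropped)
```

Both averages come out at 0.7, so a single-number comparison sees no change, while the per-dimension view reports accuracy as dropped.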

Pareto Front

ParetoFront.check_regression(new_scores) → {"regressed": bool, "regressions": [...]}
# 5% tolerance — minor fluctuations allowed

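The full Pareto regression check flow: load the historical best scores → compare each dimension → mark every dimension that dropped by more than the 5% threshold as a regression → if any dimension is marked, the overall result is regressed=True.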

# Pareto regression check — full example
from lib.pareto import ParetoFront

pf = ParetoFront("state/pareto.json")
new_scores = {"accuracy": 0.82, "coverage": 0.90, "reliability": 1.0,
              "efficiency": 0.95, "security": 1.0, "trigger_quality": 0.75}
result = pf.check_regression(new_scores)
# result = {"regressed": True,
#           "regressions": [{"dim": "accuracy", "before": 0.90, "after": 0.82, "delta": -0.08}]}
# accuracy dropped 8% (> 5% threshold) → regression detected → candidate rejected
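
For readers who want to see the flow without the package installed, here is a standalone sketch of the same check. It is not the code in lib.pareto: the flat JSON layout (dimension name to best score), the absolute interpretation of the tolerance, and the `tolerances` override parameter are all assumptions made for illustration.

```python
import json
from pathlib import Path

DEFAULT_TOLERANCE = 0.05  # assumed to be an absolute drop in score


def load_best(path):
    """Load historical best scores; a missing file means no history yet."""
    p = Path(path)
    return json.loads(p.read_text()) if p.exists() else {}


def check_regression(best, new_scores, tolerances=None):
    """Compare new scores with historical bests, dimension by dimension."""
    tolerances = tolerances or {}
    regressions = []
    for dim, before in best.items():
        if dim not in new_scores:
            continue  # dimensions missing from the candidate are skipped in this sketch
        after = new_scores[dim]
        delta = round(after - before, 4)  # rounding avoids float noise at the threshold
        if -delta > tolerances.get(dim, DEFAULT_TOLERANCE):
            regressions.append(
                {"dim": dim, "before": before, "after": after, "delta": delta}
            )
    return {"regressed": bool(regressions), "regressions": regressions}


best = {"accuracy": 0.90, "coverage": 0.88, "security": 1.0}
result = check_regression(best, {"accuracy": 0.82, "coverage": 0.90, "security": 1.0})
print(result)
```

The optional `tolerances` mapping lets a caller tighten or loosen the threshold for individual dimensions, for example `{"security": 0.02}`.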

<example>
Correct: check the Pareto front for regressions
$ python3 -c "from lib.pareto import ParetoFront; pf = ParetoFront('state/pareto.json'); print(pf.check_regression({'accuracy': 0.9, 'coverage': 0.8}))"
→ {"regressed": false, "regressions": []}  # no regression, the candidate can be accepted
</example>

<anti-example> Wrong: using benchmark-store to score candidates → benchmark-store only stores data; use improvement-discriminator for scoring </anti-example>

CLI

# List benchmarks
python3 scripts/benchmark_db.py --action list --db-path benchmarks.db

# Compare skill against baselines
python3 scripts/benchmark_db.py --action compare --skill-path /path/to/skill --category general --db-path benchmarks.db

# Add a benchmark
python3 scripts/benchmark_db.py --action add --category general --test-name "test1" --db-path benchmarks.db
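
To make the add and list actions concrete, the following sketch shows one way a SQLite-backed store could handle them with parameterized queries. The table layout and the helper names `init_db`, `add_benchmark` and `list_benchmarks` are hypothetical; the actual schema in scripts/benchmark_db.py is not shown in this document and may differ.

```python
import sqlite3

# Hypothetical schema: the real table layout in scripts/benchmark_db.py may differ.
SCHEMA = """
CREATE TABLE IF NOT EXISTS benchmarks (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    category TEXT NOT NULL,
    test_name TEXT NOT NULL,
    UNIQUE (category, test_name)
)
"""


def init_db(conn):
    """Create the table if it does not exist yet."""
    conn.execute(SCHEMA)
    conn.commit()


def add_benchmark(conn, category, test_name):
    """Insert one entry; placeholders keep user-supplied values out of the SQL text."""
    conn.execute(
        "INSERT INTO benchmarks (category, test_name) VALUES (?, ?)",
        (category, test_name),
    )
    conn.commit()


def list_benchmarks(conn, category=None):
    """Return (category, test_name) rows, optionally filtered by category."""
    if category is None:
        rows = conn.execute("SELECT category, test_name FROM benchmarks ORDER BY id")
    else:
        rows = conn.execute(
            "SELECT category, test_name FROM benchmarks WHERE category = ? ORDER BY id",
            (category,),
        )
    return rows.fetchall()


conn = sqlite3.connect(":memory:")  # in-memory database, so the sketch leaves no files behind
init_db(conn)
add_benchmark(conn, "general", "test1")
print(list_benchmarks(conn))
```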

Output Artifacts

Request	Deliverable
Init	SQLite database with schema
Compare	JSON comparison with per-dimension delta
Pareto check	JSON with regressed flag and details
List	All registered benchmarks with metadata
Add	Confirmation of new benchmark insertion

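The Compare output contains a before/after/delta triple for each dimension, plus the overall tier change as tier_before/tier_after. When a skill is promoted from SOLID to POWERFUL or demoted from GENERIC to WEAK, the output additionally carries tier_changed: true.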
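
A report with this shape could be assembled roughly as follows. The field names before, after, delta, tier_before, tier_after and tier_changed come from the description above; the `dimensions` wrapper key, the helper names, and the choice to flag every tier change (not only the two transitions mentioned) are assumptions.

```python
import json


def tier(score):
    """Tier thresholds from the Quality Tiers table, on a 0-1 scale."""
    if score >= 0.85:
        return "POWERFUL"
    if score >= 0.70:
        return "SOLID"
    if score >= 0.55:
        return "GENERIC"
    return "WEAK"


def build_comparison(before, after, overall_before, overall_after):
    """Assemble a compare-style report; the 'dimensions' key is an assumption."""
    dims = {
        d: {"before": before[d], "after": after[d],
            "delta": round(after[d] - before[d], 4)}
        for d in before if d in after
    }
    report = {
        "dimensions": dims,
        "tier_before": tier(overall_before),
        "tier_after": tier(overall_after),
    }
    if report["tier_before"] != report["tier_after"]:
        report["tier_changed"] = True  # this sketch flags any tier change
    return report


report = build_comparison(
    {"accuracy": 0.80, "coverage": 0.78},
    {"accuracy": 0.88, "coverage": 0.86},
    overall_before=0.79,
    overall_after=0.87,
)
print(json.dumps(report, indent=2))
```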

Related Skills

improvement-learner: Imports ParetoFront for self-improvement loop
improvement-gate: RegressionGate uses Pareto data to block regressions
improvement-discriminator: References evaluation standards for scoring context
improvement-orchestrator: Full pipeline queries benchmark-store at gate stage
autoloop-controller: Uses historical benchmark data to detect convergence plateau

Data Files

data/evaluation-standards.md — Quality tiers, dimensions, weights (v2.0.0)
data/fixtures/ — Frozen test fixtures
state/pareto.json — Per-skill Pareto front historical best scores
benchmarks.db — SQLite database storing all benchmark entries and results

Usage Guidance

What to check before installing or using this skill:
- Functionality vs CLI: The compare CLI examples assume the script will run by itself, but the compare path in scripts/benchmark_db.py expects an evaluator callable (i.e., a function that runs tests and returns scores). Confirm how you will supply or implement that evaluator when invoking --action compare; otherwise the script deliberately refuses to return mocked scores.
- Hidden tests / secrets: Hidden test loading and decryption APIs require a password/key (not declared as an env var). Decide how to store and provide any decryption password (avoid embedding it in the repo). Make sure you understand where any decrypted test data will be kept in plaintext and who can access it.
- Filesystem writes: The skill will create or update benchmarks.db and state/pareto.json and can export reports. Run it in a sandbox or with a controlled working directory to avoid accidental writes to sensitive locations.
- Code review: Because this package contains Python code (interfaces/, scripts/), review the full main/entry code paths (the truncated main() in benchmark_db.py and any functions that may import dynamic evaluators or execute code) to ensure there is no dynamic execution of untrusted code (exec/eval/os.system). The static check here did not flag obvious exec/os.system usage, but you should verify the omitted/truncated portions.
- No network creds by default: The package does not request API keys or network credentials. If you later integrate external benchmark imports or remote data sources, require explicit review and avoid putting secrets into repo files.
- Next steps: If you intend to run comparisons, prepare or inspect an evaluator implementation that conforms to the expected signature, confirm how encrypted hidden tests are unlocked (password handling), and run the skill in a disposable environment first.

Capability Analysis

Type: OpenClaw Skill
Name: benchmark-store
Version: 1.1.1

The benchmark-store skill bundle is a framework designed for the evaluation and regression testing of AI agent skills. It implements a SQLite-based benchmark database (scripts/benchmark_db.py), Pareto front tracking for multi-dimensional performance analysis (scripts/pareto.py), and mechanisms for 'hidden' tests to prevent model overfitting (interfaces/hidden_tests.py). While the bundle contains known attack payloads (e.g., SQL injection and path traversal strings in data/test-cases.yaml and data/red-team-guide.md), these are explicitly documented as test vectors for 'Red Teaming' other skills to ensure their safety. The code follows secure practices such as using parameterized SQL queries and SHA256 integrity verification, and lacks any indicators of data exfiltration or unauthorized execution.

Capability Assessment

ℹ Purpose & Capability

Name/description align with the provided files: there are scripts for a benchmarks DB, Pareto checks, evaluation standards, fixtures, and interfaces for frozen/hidden tests. However there are small inconsistencies (SKILL.md frontmatter version 0.1.0 vs registry version 1.1.1) and the runtime CLI/compare workflow described in SKILL.md does not cleanly match the implementation details in scripts/benchmark_db.py (see instruction_scope). Overall capability requested (local DB, JSON state, hidden tests) is coherent with the stated purpose.

⚠ Instruction Scope

SKILL.md instructs running scripts/benchmark_db.py --action compare as a CLI action, but the compare implementation (compare_with_benchmark) requires an evaluator callable parameter and will raise ValueError if evaluator is None. The SKILL.md examples omit how this evaluator is provided; that is a functional incoherence (CLI implies standalone invocation but code requires an injected callable). Also SKILL.md and code reference state/pareto.json and hidden test loading that requires a password/key for decryption, but no guidance is given for supplying or protecting that secret. The instructions also assume read/write access to local files (benchmarks.db, state/pareto.json) — expected for this skill but should be explicit.
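
The evaluator requirement can be illustrated with a stand-in function. This is only a sketch of the pattern described above: the real parameters of compare_with_benchmark are not shown in the reviewed material, so the signature, the zero-argument evaluator and the `fixed_evaluator` helper are all hypothetical.

```python
def compare_with_benchmark(baseline, evaluator=None):
    """Hypothetical stand-in: the real function's parameters are not shown in the source."""
    if evaluator is None:
        # Mirrors the described behaviour: without an evaluator, no mocked scores.
        raise ValueError("an evaluator callable is required")
    scores = evaluator()
    return {d: round(scores[d] - baseline[d], 4) for d in baseline if d in scores}


def fixed_evaluator():
    """A controlled evaluator that returns known scores, useful for a dry run."""
    return {"accuracy": 0.9, "coverage": 0.8}


deltas = compare_with_benchmark({"accuracy": 0.85, "coverage": 0.8},
                                evaluator=fixed_evaluator)
print(deltas)

try:
    compare_with_benchmark({"accuracy": 0.85})
except ValueError as exc:
    print("refused:", exc)
```

A driver script that imports the real function and passes a reviewed evaluator would follow the same shape, once the actual signature has been confirmed in the code.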

✓ Install Mechanism

No install spec is provided and the skill is delivered as source/CLIs and docs only. That is low-risk from an install-source perspective (no remote downloads or package installs).

⚠ Credentials

The registry metadata declares no required env vars or credentials, but the code supports loading encrypted hidden tests and derives/uses passwords (HiddenTestDataSource/FileHiddenTestDataSource, HiddenTestSuite.unlock/load requiring password). The SKILL.md does not declare how passwords/keys are provided (no primaryEnv or envVars), so secret handling is under-specified. The scripts will create and write local files (benchmarks.db, state/pareto.json, exported reports) — that is expected but you should ensure file paths are appropriate and writable. No network endpoints or external credentials are requested in the files reviewed.
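
One way to keep the hidden-test password out of the repository is to read it from the environment and fall back to an interactive prompt. The sketch below is a suggestion only: the variable name BENCHMARK_HIDDEN_TEST_PASSWORD and the helper `get_hidden_test_password` are hypothetical and are not defined by the skill.

```python
import os
from getpass import getpass

ENV_VAR = "BENCHMARK_HIDDEN_TEST_PASSWORD"  # hypothetical name, not defined by the skill


def get_hidden_test_password(env=os.environ, prompt=getpass):
    """Return the password from the environment, else ask for it interactively."""
    value = env.get(ENV_VAR)
    if value:
        return value
    return prompt("Hidden test password: ")
```

The returned value would then be passed to whatever unlock or load call the hidden-test interface exposes; how that call is named and parameterized should be checked against interfaces/hidden_tests.py.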

✓ Persistence & Privilege

The skill does not request always:true or other elevated platform privileges. It reads/writes local state files (SQLite DB, state/pareto.json) and exports reports; that is consistent with a benchmark store and within expected privilege for such a skill.

Version History

v1.1.1

v2.0: 9-dim evaluation (coverage/completeness split), 4-role weighted scoring with category modifiers (tool/knowledge/orchestration/rule), per-dim Pareto tolerances (security 2%, efficiency 10%, others 5%), LLM-as-Judge default, enriched SKILL.md docs, 11 test fixes, GEPA references removed.

v1.1.0

v1.1.0: Fix 4 critical pipeline bugs (Ralph Wiggum/Autoloop/Evaluator verdict), scoring overhaul (base 4->2, LLM weight 50%, semantic relevance), generator LLM-first, learner/gate/executor fixes

v1.0.0

Initial release of benchmark-store skill.
- Provides tools for initializing, querying, and comparing benchmark data.
- Offers Pareto front regression checks (flags if any skill dimension regresses >5%).
- Exposes quality grading standards: POWERFUL, SOLID, GENERIC, WEAK.
- Delivers artifacts for benchmarking: SQLite DB, comparison deltas, and regression JSON.
- Clearly separates benchmarking from scoring and automated improvement workflows.

Metadata

Slug benchmark-store

Version 1.1.1

License MIT-0

All-time Installs 1

Active Installs 1

Total Versions 3

Frequently Asked Questions

What is Benchmark Store?

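Use it to initialize the benchmark database, compare a skill's scores against historical baselines, check the Pareto front for per-dimension regressions, or look up the quality tier standards. It is not meant for scoring candidates (use improvement-discriminator) or for automatic improvement (use improvement-learner). It is an AI Agent Skill for Claude Code / OpenClaw, with 117 downloads so far.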

How do I install Benchmark Store?

Run "/install benchmark-store" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is Benchmark Store free?

Yes, Benchmark Store is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does Benchmark Store support?

Benchmark Store is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created Benchmark Store?

It is built and maintained by _silhouette (@lanyasheng); the current version is v1.1.1.
