lanyasheng

Improvement Evaluator

Author: _silhouette · GitHub · v1.0.0 · MIT-0
cross-platform ⚠ suspicious
82 total downloads · 0 favorites · 0 current installs · 1 version
Install in OpenClaw
/install auto-improvement-evaluator
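Description
Use when you need to verify that a Skill improvement actually improves AI execution. Runs AI tasks from a predefined task suite (YAML), judges each as pass/fail, and outputs execution_pass_rate. Not for document structure scoring (use improvement-learner) or candidate scoring (use improvement-discriminator).
Usage (SKILL.md)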

Improvement Evaluator

Measures whether a Skill actually makes AI perform better on real tasks.

When to Use

  • Verify that a SKILL.md change improves AI task execution (not just document structure)
  • Run a task suite against a candidate SKILL.md and compare with baseline
  • Get execution_pass_rate as a concrete quality metric
  • Run standalone evaluation on current SKILL.md to discover baseline failures

When NOT to Use

  • Only need to check SKILL.md structural quality → use improvement-learner
  • Only need to score candidates → use improvement-discriminator
  • Running the full pipeline → use improvement-orchestrator

2 Modes

| Mode | When | Required Params |
| --- | --- | --- |
| Pipeline | Called by orchestrator after discriminator | --input, --candidate-id, --task-suite, --state-root |
| Standalone | Direct evaluation of current SKILL.md | --standalone, --task-suite, --state-root, --skill-path |

CLI

# Pipeline mode: evaluate candidate vs baseline
python3 scripts/evaluate.py --input ranking.json --candidate-id cand-01-docs \
  --task-suite tasks.yaml --state-root ./state \
  [--pass-k 1] [--eval-threshold 6.0] [--baseline-cache-dir /cache] [--mock] [--output eval.json]

# Standalone mode: evaluate current SKILL.md directly
python3 scripts/evaluate.py --standalone --task-suite tasks.yaml \
  --state-root ./state --skill-path /path/to/skill [--mock]

| Param | Default | When to change |
| --- | --- | --- |
| --eval-threshold | 6.0 | Orchestrator sets per-category thresholds (e.g., docs=5.0, prompt=7.0) |
| --pass-k | 1 | Raise to 3 for flaky tasks |
| --mock | false | Use in CI or when claude CLI is not installed |
| --baseline-cache-dir | None | Set to avoid re-running baseline on unchanged SKILL.md |
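
The two invocations above can be assembled programmatically. The following is a minimal sketch, not part of the skill: `build_evaluate_argv` is a hypothetical helper, and it assumes that `--pass-k` and `--eval-threshold` apply only to pipeline mode, since the usage lines show them only there. It uses only the flags documented above.

```python
from typing import List, Optional


def build_evaluate_argv(
    task_suite: str,
    state_root: str,
    *,
    standalone: bool = False,
    skill_path: Optional[str] = None,
    ranking_input: Optional[str] = None,
    candidate_id: Optional[str] = None,
    pass_k: int = 1,
    eval_threshold: float = 6.0,
    mock: bool = False,
) -> List[str]:
    """Assemble an argv list for scripts/evaluate.py (hypothetical helper)."""
    argv = ["python3", "scripts/evaluate.py"]
    if standalone:
        # Standalone mode: --standalone plus --skill-path are required.
        if skill_path is None:
            raise ValueError("standalone mode requires --skill-path")
        argv += ["--standalone", "--skill-path", skill_path]
    else:
        # Pipeline mode: --input and --candidate-id are required.
        if ranking_input is None or candidate_id is None:
            raise ValueError("pipeline mode requires --input and --candidate-id")
        argv += [
            "--input", ranking_input,
            "--candidate-id", candidate_id,
            "--pass-k", str(pass_k),
            "--eval-threshold", str(eval_threshold),
        ]
    # Both modes require a task suite and a state root.
    argv += ["--task-suite", task_suite, "--state-root", state_root]
    if mock:
        argv.append("--mock")
    return argv
```

The resulting list can be passed to `subprocess.run` from the skill's root directory; passing `mock=True` avoids calling the claude CLI.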

3 Judge Types

| Judge | type in YAML | Mechanism | Use When |
| --- | --- | --- | --- |
| ContainsJudge | contains | Check output contains all strings in expected list | Deterministic keyword/format checks |
| PytestJudge | pytest | Run pytest on fixtures/{test_file} against AI output | Structured output validation (JSON, code) |
| LLMRubricJudge | llm-rubric | LLM scores output against rubric text (mock mode: random pass) | Semantic quality evaluation |
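
To make the simplest of the three concrete, here is a minimal sketch of the ContainsJudge rule as described in the table. `contains_judge` is a hypothetical function name, and case-sensitive substring matching is an assumption; the real implementation in interfaces/judges.py may differ.

```python
from typing import Iterable


def contains_judge(output: str, expected: Iterable[str]) -> bool:
    """Pass only if every expected string occurs in the AI output.

    Assumption: plain, case-sensitive substring matching.
    """
    return all(item in output for item in expected)


# A task passes when all keywords appear, and fails if any one is missing.
print(contains_judge("keyword1 then keyword2", ["keyword1", "keyword2"]))
print(contains_judge("keyword1 only", ["keyword1", "keyword2"]))
```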

Task Suite YAML Format

skill_id: my-skill
version: "1.0"
tasks:
  - id: task-001
    prompt: "Given X, produce Y"
    judge: {type: contains, expected: ["keyword1", "keyword2"]}
  - id: task-002
    prompt: "Generate a config file"
    judge: {type: pytest, test_file: fixtures/test_config.py}
  - id: task-003
    prompt: "Explain concept Z"
    judge: {type: llm-rubric, rubric: "Must cover A, B, C with examples"}
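
A suite in this shape can be sanity-checked before a run. The sketch below works on the suite after it has been loaded into a dict by a YAML parser of your choice. `validate_suite` and `REQUIRED_JUDGE_FIELD` are hypothetical names, and treating every key shown in the example as mandatory is an assumption rather than documented behavior.

```python
from typing import Any, Dict, List

# Assumed mapping from judge type to the field it needs, taken from the example suite.
REQUIRED_JUDGE_FIELD = {
    "contains": "expected",
    "pytest": "test_file",
    "llm-rubric": "rubric",
}


def validate_suite(suite: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the suite looks well-formed."""
    problems: List[str] = []
    for key in ("skill_id", "version", "tasks"):
        if key not in suite:
            problems.append(f"missing top-level key: {key}")
    for index, task in enumerate(suite.get("tasks", [])):
        label = task.get("id", f"tasks[{index}]")
        for key in ("id", "prompt", "judge"):
            if key not in task:
                problems.append(f"{label}: missing key {key}")
        judge = task.get("judge")
        if judge is None:
            continue
        judge_type = judge.get("type")
        required = REQUIRED_JUDGE_FIELD.get(judge_type)
        if required is None:
            problems.append(f"{label}: unknown judge type {judge_type!r}")
        elif required not in judge:
            problems.append(f"{label}: judge type {judge_type!r} needs field {required!r}")
    return problems
```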

Conditional Evaluation

  • Score threshold: candidates with discriminator score < --eval-threshold are skipped (verdict=skipped)
  • Baseline abort: if baseline pass_rate < 0.2 (20%), evaluation aborts with verdict=error; this indicates a broken task suite
  • Baseline caching: SHA256(skill_content + suite_path) → 7-day TTL cache to avoid re-running unchanged baselines
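
The three rules above can be sketched as follows. All function and constant names are hypothetical, the order of the two checks is an assumption, and the cache key assumes the skill content and suite path are concatenated as UTF-8 text before hashing; the actual code in scripts/evaluate.py may differ.

```python
import hashlib
from typing import Optional

BASELINE_ABORT_RATE = 0.2             # baseline pass_rate below this aborts
CACHE_TTL_SECONDS = 7 * 24 * 60 * 60  # 7-day TTL


def precheck_verdict(score: float, eval_threshold: float,
                     baseline_pass_rate: float) -> Optional[str]:
    """Return 'skipped' or 'error' when evaluation should not proceed, else None."""
    if score < eval_threshold:
        return "skipped"   # discriminator score too low
    if baseline_pass_rate < BASELINE_ABORT_RATE:
        return "error"     # task suite is probably broken
    return None


def baseline_cache_key(skill_content: str, suite_path: str) -> str:
    """SHA256 over the skill content followed by the suite path."""
    return hashlib.sha256((skill_content + suite_path).encode("utf-8")).hexdigest()


def cache_entry_is_fresh(created_at: float, now: float) -> bool:
    """True while a cached baseline is younger than the 7-day TTL (timestamps in seconds)."""
    return (now - created_at) < CACHE_TTL_SECONDS
```

Because the key covers the skill content, editing SKILL.md changes the key and forces a fresh baseline run even within the TTL window.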

<example>
Pipeline mode: candidate vs baseline comparison
$ python3 scripts/evaluate.py --input ranking.json --candidate-id c1 --task-suite tasks.yaml --state-root ./state
→ Candidate pass rate: 0.80 (4/5 tasks passed)
→ Baseline pass rate: 0.60 (3/5 tasks passed)
→ {"execution_pass_rate": 0.80, "baseline_pass_rate": 0.60, "delta": 0.20, "verdict": "pass"}
</example>

<anti-example>
Using evaluator without task suite:
→ Evaluator requires --task-suite. Without it, orchestrator skips evaluator entirely.
→ No --standalone without --task-suite either: both modes require it.
</anti-example>

Output Artifact

{"stage": "evaluated", "verdict": "pass",
 "evaluation": {"execution_pass_rate": 0.80, "baseline_pass_rate": 0.60, "delta": 0.20},
 "candidate_results": [{"task_id": "t1", "passed": true, "score": 1.0}],
 "next_step": "gate_decision", "next_owner": "gate"}

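A downstream consumer might read the artifact like this. This is a minimal sketch with a hypothetical `summarize_artifact` helper; it assumes `delta` is the candidate pass rate minus the baseline pass rate, which matches the sample values (0.80 - 0.60 = 0.20) but is not stated explicitly in the documentation.

```python
import json
import math
from typing import Any, Dict


def summarize_artifact(raw: str) -> Dict[str, Any]:
    """Parse an evaluation artifact and cross-check its delta field."""
    artifact = json.loads(raw)
    evaluation = artifact["evaluation"]
    results = artifact.get("candidate_results", [])
    # Assumed relationship: delta = execution_pass_rate - baseline_pass_rate.
    expected_delta = evaluation["execution_pass_rate"] - evaluation["baseline_pass_rate"]
    return {
        "verdict": artifact["verdict"],
        "tasks_passed": sum(1 for result in results if result["passed"]),
        "tasks_total": len(results),
        # Compare with a tolerance because pass rates are floats.
        "delta_consistent": math.isclose(evaluation["delta"], expected_delta, abs_tol=1e-9),
    }
```
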
Related Skills

  • improvement-discriminator: Provides scores; evaluator checks score >= eval_threshold
  • improvement-gate: RegressionGate checks evaluator verdict via --evaluation artifact
  • improvement-orchestrator: Calls evaluator as stage 3; runs standalone baseline, injects failures to generator
  • improvement-generator: Consumes baseline-failures.json for targeted SKILL.md fixes
Security Recommendations
This evaluator appears to implement the stated function, but take these precautions before installing or running it:
  • Expect the evaluator to call the external claude CLI by default (unless you pass --mock). The registry metadata did not declare this binary dependency, so make sure you only run it where 'claude' is intentionally available.
  • The runner reads SKILL.md (candidate or baseline) from disk and prepends it to prompts sent to the LLM. Do NOT run this on machines containing secrets or private files you wouldn't want to send to an external LLM; validate the 'target' paths the orchestrator will provide.
  • Pytest-based judges execute test code (via pytest) against AI outputs. Review any fixture/test files included in task suites to ensure they don't run malicious code or perform unwanted side effects. The PytestJudge checks that test_file starts with 'fixtures/' and resolves under the skill's tests/fixtures directory, but your task suites' fixture layout must match the implementation; otherwise tests may fail or be ignored.
  • If you need to run evaluations in CI or on sensitive data, use the --mock flag to avoid calling the claude CLI and to get deterministic, safe behavior.
  • Consider asking the skill publisher to: (1) declare the 'claude' dependency in metadata; (2) document explicitly that SKILL.md and candidate skill contents will be sent to an external LLM; (3) clarify where fixtures should live so pytest judges operate on the intended test files.
If you can, audit the task suites you intend to run and run the evaluator in a restricted environment (no sensitive files, network-restricted or using --mock) until you are comfortable with its behavior.
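
The fixture path check described above could look roughly like the following. This is a sketch under stated assumptions, not the code from interfaces/judges.py: `resolve_fixture` is a hypothetical name, and it assumes a test_file of 'fixtures/x.py' maps to <skill_root>/tests/fixtures/x.py.

```python
from pathlib import Path


def resolve_fixture(skill_root: Path, test_file: str) -> Path:
    """Resolve a judge's test_file and reject anything outside tests/fixtures."""
    if not test_file.startswith("fixtures/"):
        raise ValueError("test_file must start with 'fixtures/'")
    fixtures_dir = (skill_root / "tests" / "fixtures").resolve()
    candidate = (skill_root / "tests" / test_file).resolve()
    # After resolving '..' segments and symlinks, the file must still sit under fixtures_dir.
    if fixtures_dir not in candidate.parents:
        raise ValueError(f"test_file escapes the fixtures directory: {test_file}")
    return candidate
```

A prefix check alone is not enough, because 'fixtures/../../secret.py' passes it; resolving the path first is what catches the traversal.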
Functional Analysis
Type: OpenClaw Skill
Name: auto-improvement-evaluator
Version: 1.0.0
The bundle provides a framework for evaluating the effectiveness of AI skills by running task suites and comparing pass rates against a baseline. The core logic in `scripts/evaluate.py` and `scripts/task_runner.py` uses the `claude` CLI to execute tasks and various 'judges' in `interfaces/judges.py` to score results. Notably, the `PytestJudge` includes explicit security checks to prevent path traversal by validating that test files remain within the `fixtures/` directory. The code is well-structured, includes comprehensive unit tests, and its behavior aligns strictly with the stated purpose of skill improvement evaluation.
Capability Assessment
Purpose & Capability
The skill's code and SKILL.md align with the stated purpose (run task suites and compute pass rates). However the registry metadata claims no required binaries while the runtime requires the external 'claude' CLI unless --mock is used. Also the package contains multiple executable scripts and test harnesses (not just an instruction-only skill), which is consistent with its function but is more powerful than the 'no binaries / no env' metadata implies.
Instruction Scope
Runtime behavior prepends SKILL.md (or candidate content) into prompts and sends them to an external LLM via the claude CLI. The runner reads SKILL.md from an arbitrary 'target' path if provided, writes AI outputs to temp files, and can run pytest tests against those outputs. That means arbitrary file contents can be read and included in prompts and arbitrary test code (fixtures) can be executed. The PytestJudge implements path checks to restrict test_file to tests/fixtures in the skill, but task suite semantics in the docs/examples suggest fixtures may live in suite folders — there is a mismatch that could cause unexpected behaviour or missing fixtures.
Install Mechanism
No install spec is provided (instruction-only style with included scripts). This is low-risk from arbitrary downloads or install-time code execution. The code will run subprocesses at runtime (claude, pytest) but nothing in the package downloads or extracts external archives.
Credentials
The skill declares no required credentials or env vars, yet it invokes an external LLM (claude CLI) and forwards the process environment into pytest runs. It will read SKILL.md or candidate content from disk and send it to the LLM — this can leak secrets or sensitive files if a 'target' path points to sensitive locations. There is also no explicit declaration that network access (via claude) will be used.
Persistence & Privilege
The skill does not request 'always: true' and does not modify other skills or global agent config. It runs as an on-demand evaluator; autonomous invocation is allowed by default but that is platform normal and not by itself flagged here.
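How to Use
  1. Make sure OpenClaw is installed (locally or via Docker)
  2. Enter the install command in the chat box: /install auto-improvement-evaluator
  3. Once installed, invoke the Skill by name or trigger it with /auto-improvement-evaluator
  4. Provide the required inputs according to the Skill's parameter documentation to get structured output
Version History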
v1.0.0
Initial release: closed-loop skill improvement pipeline
Metadata
Slug: auto-improvement-evaluator
Version: 1.0.0
License: MIT-0
Total installs: 0
Current installs: 0
Versions: 1
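FAQ

What is Improvement Evaluator?

Use it when you need to verify that a Skill improvement actually improves AI execution. It runs AI tasks from a predefined task suite (YAML), judges each as pass/fail, and outputs execution_pass_rate. It is not for document structure scoring (use improvement-learner) or candidate scoring (use improvement-discriminator). It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 82 total downloads so far.

How do I install Improvement Evaluator?

Run the command "/install auto-improvement-evaluator" in the OpenClaw or Claude Code chat box for a one-step install; no additional configuration is required.

Is Improvement Evaluator free?

Yes. Improvement Evaluator is completely free under the MIT-0 license and can be freely downloaded, installed, and used.

Which platforms does Improvement Evaluator support?

Improvement Evaluator is cross-platform and can be used in any environment where OpenClaw / Claude Code is deployed.

Who develops Improvement Evaluator?

It is developed and maintained by _silhouette (@lanyasheng). The current version is v1.0.0.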
