lanyasheng

Improvement Evaluator

Author: _silhouette · GitHub · v1.1.1 · MIT-0
cross-platform ⚠ suspicious
Total downloads: 102 · Favorites: 0 · Current installs: 1 · Versions: 2
Install in OpenClaw
/install improvement-evaluator
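Description
Use this skill when you need to verify that a Skill improvement actually improves AI execution results. It runs AI tasks from a predefined task suite (YAML), judges each task as pass/fail, and outputs execution_pass_rate. It is not for document-structure scoring (use improvement-learner) or candidate scoring (use improvement-discriminator).
Usage (SKILL.md)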

Improvement Evaluator

Measures whether a Skill actually makes AI perform better on real tasks, not just whether the SKILL.md document looks well-structured.

Why Execution Testing Matters

Structural scoring (word count, section presence, formatting) correlates poorly with actual AI task performance. Internal benchmarks showed R²=0.00 between document-structure scores and execution pass rates across 40+ skill evaluations. A perfectly formatted SKILL.md can still produce failing task outputs if the instructions mislead the model or omit critical constraints.

Tradeoff: Execution testing is slower and more expensive than structural checks because it invokes the AI model once per task. A 7-task suite at pass@1 costs roughly 7 API calls per candidate plus 7 for the baseline. This is acceptable because structural scoring alone gives no signal about whether the skill actually works. To offset cost, the evaluator caches baseline results for 7 days and supports --pass-k 1 (single attempt) as the default to keep runs lean.
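
The cost arithmetic above can be sketched as a small helper. This is a rough estimate rather than the evaluator's own accounting: the function name is hypothetical, it assumes one model call per attempt, that the baseline run uses the same number of attempts as the candidate, and that a cached baseline costs nothing. Extra calls made by LLM-rubric judges are not counted.

```python
def estimate_model_calls(num_tasks: int, pass_k: int = 1, baseline_cached: bool = False) -> int:
    """Upper bound on task-execution calls for evaluating one candidate.

    Assumptions (not confirmed by the skill's code): one call per attempt,
    every task uses all pass_k attempts, baseline uses the same pass_k.
    """
    if num_tasks < 0 or pass_k < 1:
        raise ValueError("num_tasks must be >= 0 and pass_k must be >= 1")
    candidate_calls = num_tasks * pass_k
    baseline_calls = 0 if baseline_cached else num_tasks * pass_k
    return candidate_calls + baseline_calls


print(estimate_model_calls(7))                        # 14: 7 candidate + 7 baseline
print(estimate_model_calls(7, baseline_cached=True))  # 7: baseline served from cache
```

Under these assumptions, a warm baseline cache halves the cost of each additional candidate evaluated against the same suite.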

When to Use

  • Verify that a SKILL.md change improves AI task execution, not just document structure
  • Run a task suite against a candidate SKILL.md and compare pass rate with baseline
  • Get execution_pass_rate as a concrete quality metric for gating decisions
  • Validate that a newly written task suite produces a sane baseline (>20% pass rate)
  • Compare two versions of a skill on the same task suite to detect regressions
  • Feed execution deltas into the improvement-gate for accept/reject decisions
  • Debug low scores by inspecting per-task pass/fail details in the output artifact
  • Run standalone evaluations during skill development without a full pipeline

When NOT to Use

  • Checking SKILL.md structure quality only (use improvement-learner instead)
  • Scoring candidates with semantic rubrics before execution (use improvement-discriminator)
  • Running the full generate-score-evaluate-execute-gate pipeline (use improvement-orchestrator)
  • Measuring document formatting, section counts, or word-level metrics

Task Suite Format

A task suite is a YAML file that defines what tasks to run and how to judge them. Each suite targets a specific skill and contains 5-10 tasks covering the skill's core behaviors. The schema is versioned at "1.0".

# task_suite.yaml -- minimal complete example
skill_id: "target-skill-name"
version: "1.0"
tasks:
  - id: "task-keyword-check"
    description: "Verify output mentions required concepts"
    prompt: "Given these scores {accuracy: 0.9}, what quality tier?"
    judge:
      type: "contains"
      expected: ["POWERFUL"]
    timeout_seconds: 30

  - id: "task-semantic-quality"
    description: "Rubric-scored analysis quality"
    prompt: "Accuracy dropped 0.9 to 0.8 but coverage rose. Accept?"
    judge:
      type: "llm-rubric"
      rubric: "Must mention trade-off analysis and give a recommendation"
      pass_threshold: 0.7
    timeout_seconds: 120

Validation rules enforced at load time:

  • skill_id must be non-empty.
  • version must equal "1.0".
  • Every task needs a unique id, a non-empty prompt, and a judge block.
  • Judge type must be one of contains, pytest, or llm-rubric.
  • For contains: expected must be a non-empty list of strings.
  • For pytest: test_file must start with fixtures/ (path-traversal guard).
  • For llm-rubric: rubric must be non-empty.
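
The validation rules above can be expressed as a single checking function. The following is a minimal sketch, not the evaluator's actual loader: it assumes the YAML has already been parsed into a plain dict, and the function name and error messages are hypothetical.

```python
ALLOWED_JUDGE_TYPES = {"contains", "pytest", "llm-rubric"}


def validate_task_suite(suite: dict) -> list:
    """Return a list of human-readable validation errors (empty if valid)."""
    errors = []
    if not suite.get("skill_id"):
        errors.append("skill_id must be non-empty")
    if suite.get("version") != "1.0":
        errors.append('version must equal "1.0"')

    seen_ids = set()
    for index, task in enumerate(suite.get("tasks") or []):
        task_id = task.get("id")
        label = task_id or f"tasks[{index}]"
        if not task_id:
            errors.append(f"{label}: id is required")
        elif task_id in seen_ids:
            errors.append(f"{label}: duplicate id")
        else:
            seen_ids.add(task_id)
        if not task.get("prompt"):
            errors.append(f"{label}: prompt must be non-empty")

        judge = task.get("judge")
        if not isinstance(judge, dict):
            errors.append(f"{label}: judge block is required")
            continue
        judge_type = judge.get("type")
        if judge_type not in ALLOWED_JUDGE_TYPES:
            errors.append(f"{label}: unknown judge type {judge_type!r}")
        elif judge_type == "contains":
            expected = judge.get("expected")
            if not (isinstance(expected, list) and expected
                    and all(isinstance(item, str) for item in expected)):
                errors.append(f"{label}: expected must be a non-empty list of strings")
        elif judge_type == "pytest":
            if not str(judge.get("test_file", "")).startswith("fixtures/"):
                errors.append(f"{label}: test_file must start with fixtures/")
        elif not str(judge.get("rubric") or "").strip():  # llm-rubric
            errors.append(f"{label}: rubric must be non-empty")
    return errors


suite = {
    "skill_id": "target-skill-name",
    "version": "1.0",
    "tasks": [{
        "id": "task-keyword-check",
        "prompt": "Given these scores {accuracy: 0.9}, what quality tier?",
        "judge": {"type": "contains", "expected": ["POWERFUL"]},
    }],
}
print(validate_task_suite(suite))  # []
```

Collecting all errors instead of raising on the first one lets a suite author fix several problems in one pass; whether the real loader does this is not stated in the source.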

See references/task-format.md and references/writing-tasks-guide.md for detailed patterns and anti-patterns.

Judge Types

The evaluator supports three judge types. Choose based on determinism needs and output complexity.

| Judge | Mechanism | Best For |
|---|---|---|
| ContainsJudge | Checks all expected keywords appear (case-insensitive) | Deterministic presence checks, format validation |
| PytestJudge | Runs pytest on AI output via AI_OUTPUT_FILE env var | Structured output, JSON schema validation |
| LLMRubricJudge | LLM scores output against a rubric (0.0-1.0) | Semantic quality, open-ended evaluation |

Because deterministic judges (Contains, Pytest) are fast and free while LLM judges cost an API call per evaluation, prefer deterministic judges when the pass condition can be expressed as keyword presence or structured format. Reserve LLMRubricJudge for tasks where semantic quality matters and no deterministic proxy exists.
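
To make the deterministic case concrete, here is a minimal sketch of the kind of check ContainsJudge is described as performing. It is an illustration, not the skill's implementation: the function name is hypothetical, and plain substring matching is an assumption.

```python
def contains_all(output: str, expected: list) -> bool:
    """Pass only if every expected keyword occurs in the output, ignoring case."""
    haystack = output.lower()
    return all(keyword.lower() in haystack for keyword in expected)


output = "Add input validation, sanitize paths, and improve Error Handling."
print(contains_all(output, ["validation", "sanitiz", "error handling"]))  # True
print(contains_all(output, ["validation", "rate limit"]))                 # False
```

With substring matching, a stem such as "sanitiz" accepts both "sanitize" and "sanitization", which keeps keyword lists short.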

Judge configuration examples:

# ContainsJudge -- all keywords must appear (case-insensitive)
judge:
  type: "contains"
  expected: ["validation", "sanitiz", "error handling"]

# PytestJudge -- test file receives AI output path via AI_OUTPUT_FILE
judge:
  type: "pytest"
  test_file: "fixtures/test_output_format.py"

# LLMRubricJudge -- score 0.0-1.0, pass if >= threshold
judge:
  type: "llm-rubric"
  rubric: |
    Score 0.0-1.0:
    - 0.8+: Correct analysis with actionable recommendation
    - 0.5-0.8: Partial analysis, missing specifics
    - <0.5: Generic or incorrect
  pass_threshold: 0.7

LLMRubricJudge supports --mock mode for local testing without API calls. In mock mode the judge returns a fixed passing score so you can verify the pipeline wiring without incurring cost.

<example>
Evaluate a candidate skill in pipeline mode:
$ python3 scripts/evaluate.py \
    --input ranking.json \
    --candidate-id c1 \
    --task-suite tasks.yaml \
    --state-root /tmp/eval-state
→ {"execution_pass_rate": 0.80, "baseline_pass_rate": 0.70, "delta": 0.10, "verdict": "pass"}
</example>

<anti-example>
Running the evaluator without a task suite file:
→ Preflight fails with "Task suite not found" -- the evaluator requires a valid task_suite.yaml.

Running with a broken task suite (baseline pass rate < 20%):
→ Aborts with verdict="error" and reason="baseline pass rate X < 0.2". Fix the suite first.
</anti-example>

CLI Reference

Two operating modes: pipeline mode (with ranking artifact from discriminator) and standalone mode (direct evaluation during development).

# Pipeline mode -- requires ranking artifact from discriminator stage
python3 scripts/evaluate.py \
  --input ranking-artifact.json \
  --candidate-id cand-01-docs \
  --task-suite task_suites/target-skill/task_suite.yaml \
  --state-root /tmp/eval-state \
  --pass-k 1 \
  --baseline-cache-dir /tmp/baseline-cache \
  --eval-threshold 6.0 \
  --output /tmp/eval-result.json

# Standalone mode -- evaluate a skill directly without pipeline artifacts
python3 scripts/evaluate.py \
  --standalone \
  --task-suite task_suites/deslop/task_suite.yaml \
  --skill-path ./skills/deslop \
  --state-root /tmp/eval-state \
  --mock

| Flag | Required | Default | Purpose |
|---|---|---|---|
| --input | pipeline | -- | Path to ranking artifact JSON from discriminator |
| --candidate-id | pipeline | -- | ID of candidate to evaluate |
| --standalone | standalone | false | Run without ranking artifact |
| --task-suite | always | -- | Path to task suite YAML |
| --state-root | always | -- | Directory for evaluation state and output |
| --skill-path | standalone | -- | Path to SKILL.md or skill directory |
| --pass-k | no | 1 | Attempts per task (passes if any attempt succeeds) |
| --baseline-cache-dir | no | none | Cache baseline results (7-day TTL) |
| --eval-threshold | no | 6.0 | Minimum discriminator score to proceed |
| --mock | no | false | Use mock execution, no claude CLI needed |
| --output | no | auto | Override output path (default: <state-root>/evaluations/<run-id>.json) |
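
The --pass-k semantics ("passes if any attempt succeeds") can be sketched as follows. This is an illustration under stated assumptions, not the evaluator's code: the function names and the shape of the per-task attempt lists are hypothetical.

```python
def task_passed(attempt_results: list) -> bool:
    """A task passes if at least one of its attempts passed."""
    return any(attempt_results)


def pass_rate(results_by_task: dict) -> float:
    """Fraction of tasks that passed; 0.0 for an empty suite."""
    if not results_by_task:
        return 0.0
    passed = sum(task_passed(attempts) for attempts in results_by_task.values())
    return passed / len(results_by_task)


# Illustrative attempt outcomes for a 4-task suite run with --pass-k 2.
results = {
    "task-a": [False, True],   # passes on the second attempt
    "task-b": [False, False],  # fails both attempts
    "task-c": [True],          # no second attempt needed
    "task-d": [True],
}
print(pass_rate(results))  # 0.75
```

Raising k can only keep or increase the pass rate, so comparisons between a candidate and a baseline are meaningful only when both use the same k.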

Output Artifacts

The evaluator writes a JSON artifact to <state-root>/evaluations/<run-id>.json (or the path specified by --output). Downstream consumers are the improvement-gate and improvement-orchestrator.

| Field | Type | Description |
|---|---|---|
| execution_pass_rate | float | Candidate pass rate (0.0-1.0) |
| baseline_pass_rate | float | Original SKILL.md pass rate (0.0-1.0) |
| delta | float | candidate - baseline; non-negative means the candidate is at least as good as the baseline |
| verdict | string | pass, fail, skipped, or error |
| candidate_results | array | Per-task breakdown with task_id, passed, score, duration_ms |
| baseline_results | array | Same structure for baseline run |
| truth_anchor | string | Absolute path to this artifact for audit trail |
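
When debugging a low score, the per-task breakdown is usually the first thing to inspect. The sketch below lists failing candidate tasks from an artifact with the fields described above; the sample values are made up for illustration and the helper name is hypothetical.

```python
import json


def failed_tasks(artifact: dict) -> list:
    """Return the task_id of every candidate task that did not pass."""
    return [
        result["task_id"]
        for result in artifact.get("candidate_results", [])
        if not result.get("passed")
    ]


# Illustrative artifact; a real one is read from <state-root>/evaluations/<run-id>.json.
artifact = json.loads("""
{
  "execution_pass_rate": 0.5,
  "baseline_pass_rate": 0.5,
  "delta": 0.0,
  "verdict": "pass",
  "candidate_results": [
    {"task_id": "task-keyword-check", "passed": true, "score": 1.0, "duration_ms": 840},
    {"task_id": "task-semantic-quality", "passed": false, "score": 0.4, "duration_ms": 5120}
  ]
}
""")
print(failed_tasks(artifact))  # ['task-semantic-quality']
```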

Verdict logic: pass when delta >= 0 (candidate is at least as good as baseline); by implication, fail when delta < 0. skipped when the candidate's discriminator score is below --eval-threshold. error when baseline pass rate < 20% (broken task suite).
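
The verdict rules can be sketched as one function. This is a reading of the documented behavior, not the evaluator's source: the function name is hypothetical, the order in which the checks run is an assumption, and returning fail for a negative delta is inferred from the list of verdict values.

```python
def decide_verdict(candidate_rate: float, baseline_rate: float,
                   discriminator_score=None, eval_threshold: float = 6.0,
                   min_baseline: float = 0.2) -> str:
    """Map evaluation numbers to pass / fail / skipped / error."""
    # Assumed order: skip low-scoring candidates before spending on execution.
    if discriminator_score is not None and discriminator_score < eval_threshold:
        return "skipped"
    # A baseline below 20% signals a broken task suite, not a bad candidate.
    if baseline_rate < min_baseline:
        return "error"
    delta = candidate_rate - baseline_rate
    return "pass" if delta >= 0 else "fail"


print(decide_verdict(0.80, 0.70))                           # pass
print(decide_verdict(0.60, 0.70))                           # fail
print(decide_verdict(0.10, 0.10))                           # error
print(decide_verdict(0.80, 0.70, discriminator_score=5.0))  # skipped
```

Because a delta of exactly zero passes, a candidate that changes nothing measurable is accepted at this stage; stricter policies would have to be applied downstream.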

Related Skills

  • improvement-discriminator -- Runs semantic scoring before this stage. Produces the ranking artifact that this evaluator consumes. Use discriminator when you need LLM panel review scores, not execution-based pass rates.
  • improvement-gate -- Consumes this evaluator's output artifact. Applies a 6-layer mechanical gate (Schema, Compile, Lint, Regression, Review, HumanReview) to decide whether to accept or reject the change.
  • improvement-orchestrator -- Coordinates the full pipeline: generate, discriminate, evaluate, execute, gate. Use orchestrator when you want the end-to-end flow rather than running individual stages.
  • improvement-learner -- Structural quality scoring (6-dimension). Use learner when you only care about document quality metrics, not execution effectiveness.
Security Recommendations
This skill appears to implement the described evaluator, but there are a few things to check before use:
  • Dependency: The code expects the 'claude' CLI for real LLM evaluations (scripts call 'claude -p'). The registry metadata did not declare this required binary. If you don't have 'claude' installed or don't want to use it, run with --mock to avoid external CLI calls.
  • Review packaged tests/fixtures: The PytestJudge runs pytest on test files under the skill's tests/fixtures directory. Those tests run as subprocesses with your agent's environment (os.environ) available to them. Inspect any fixture test code before running evaluations to make sure it doesn't read or exfiltrate environment variables, read local files you care about, or perform unexpected network calls.
  • Secrets exposure: Because child processes inherit environment variables, any secrets present in your agent/process environment could be visible to tests. Consider running evaluations in an isolated environment (no sensitive env vars), or use --mock mode when developing or when you cannot guarantee test file safety.
  • Production use: If you intend to run this in production, ask the skill author to (a) declare 'claude' as a required binary in metadata, and (b) minimize env propagation to subprocesses or explicitly document which env vars are required. Prefer running with a dedicated service account and a sanitized environment.
Overall: the package is functional and coherent with its purpose, but the missing declared dependency and the potential for test code to access inherited environment variables are real operational/security concerns. Use caution, inspect fixtures, and prefer --mock for exploratory runs.
Functional Analysis
Type: OpenClaw Skill · Name: improvement-evaluator · Version: 1.1.1
The bundle implements an evaluation framework for measuring the effectiveness of AI skill improvements. It uses subprocess execution to run 'pytest' for structured output validation and the 'claude' CLI for rubric-based scoring; while these are high-privilege operations, they are functionally necessary for the stated purpose. The implementation in 'interfaces/judges.py' includes explicit security mitigations, such as path traversal guards and symlink resolution, to ensure that 'pytest' only executes code within the designated 'fixtures/' directory, demonstrating a defensive posture rather than malicious intent.
Capability Assessment
Purpose & Capability
The skill's name and description align with the included code: it runs task suites, invokes an LLM client, and applies multiple judge types. However the package fails to declare a key runtime dependency: the code requires the 'claude' CLI (used for LLM-driven evaluation) but the registry metadata lists no required binaries or environment. That mismatch (missing declared dependency) is a coherence issue.
Instruction Scope
Runtime instructions and code prepend SKILL.md text to prompts and invoke the external 'claude' CLI, and the PytestJudge runs pytest on packaged test files. Pytest is executed with the process environment inherited (os.environ) and AI output passed via AI_OUTPUT_FILE. That means test code executed by the evaluator runs as a subprocess with access to the agent's environment and file system; malicious or poorly written tests could read environment secrets or perform I/O. The skill does include path-traversal guards (test_file must start with 'fixtures/' and resolution is constrained to tests/fixtures), but it does not limit what a fixture test can do once executed.
Install Mechanism
There is no install spec (instruction-only in the registry), so nothing is automatically downloaded or executed during install. The package includes Python scripts and tests that will run at runtime; that is fine but means runtime behavior depends on local environment (presence of 'claude' CLI).
Credentials
The skill declares no required env vars or credentials, yet at runtime it inherits and forwards the full process environment to subprocesses (pytest invocation merges os.environ into the child env). Because tests and judges run as subprocesses, any environment variables (including secrets present in the agent environment) will be visible to them. In addition, the code calls external 'claude' CLI (unless run with --mock), which may itself use credentials/config from the environment or local config files. The skill does not request or document these needs in metadata.
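
One way to reduce the exposure described above is to launch subprocesses with an allowlisted environment instead of a copy of os.environ. The sketch below is a possible mitigation, not something the skill does today: the allowlist and both function names are hypothetical, and it assumes a POSIX system (Windows child processes typically need additional variables such as SYSTEMROOT).

```python
import os
import subprocess
import sys

# Hypothetical allowlist; extend it only with variables the child really needs.
SAFE_ENV_KEYS = ("PATH", "HOME", "LANG", "TMPDIR")


def sanitized_env(extra=None) -> dict:
    """Build a child environment from the allowlist plus explicit extras."""
    env = {key: os.environ[key] for key in SAFE_ENV_KEYS if key in os.environ}
    env.update(extra or {})
    return env


def run_isolated(cmd: list, extra_env=None, timeout: int = 60):
    """Run cmd without inheriting secrets from the parent environment."""
    return subprocess.run(cmd, env=sanitized_env(extra_env),
                          capture_output=True, text=True, timeout=timeout)


# Demonstration: the child sees AI_OUTPUT_FILE but not the parent's secret.
os.environ["FAKE_SECRET_TOKEN"] = "do-not-leak"
child_code = ("import os; "
              "print('FAKE_SECRET_TOKEN' in os.environ, os.environ.get('AI_OUTPUT_FILE'))")
result = run_isolated([sys.executable, "-c", child_code],
                      {"AI_OUTPUT_FILE": "/tmp/ai-output.txt"})
print(result.stdout.strip())  # False /tmp/ai-output.txt
```

The same helper could wrap a pytest invocation, for example `run_isolated([sys.executable, "-m", "pytest", path], {"AI_OUTPUT_FILE": output_path})`. It limits environment leakage only; fixture code can still read files and use the network, so inspecting fixtures remains necessary.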
Persistence & Privilege
The skill does not request permanent/always-on inclusion, does not modify other skills' configs, and has no install-time hooks. always:false and normal autonomous invocation are used. No elevated persistence privileges are requested.
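How to Use
  1. Make sure OpenClaw is installed (locally or via Docker)
  2. Enter the install command in the chat box: /install improvement-evaluator
  3. After installation, invoke the Skill by name or trigger it with /improvement-evaluator
  4. Provide the required inputs according to the Skill's parameter description to get structured output
Version History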
v1.1.1
v2.1: 13/13 POWERFUL at 92.2%, enriched SKILL.md docs, README added, example tag fix
v1.1.0
v1.1.0: Fix 4 critical pipeline bugs (Ralph Wiggum/Autoloop/Evaluator verdict), scoring overhaul (base 4->2, LLM weight 50%, semantic relevance), generator LLM-first, learner/gate/executor fixes
Metadata
Slug: improvement-evaluator
Version: 1.1.1
License: MIT-0
Total installs: 1
Current installs: 1
Versions: 2
FAQ

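What is Improvement Evaluator?

Use it when you need to verify that a Skill improvement actually improves AI execution results. It runs AI tasks from a predefined task suite (YAML), judges each task as pass/fail, and outputs execution_pass_rate. It is not for document-structure scoring (use improvement-learner) or candidate scoring (use improvement-discriminator). It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 102 total downloads to date.

How do I install Improvement Evaluator?

Run the command "/install improvement-evaluator" in the OpenClaw or Claude Code chat box to install it in one step; no additional configuration is required.

Is Improvement Evaluator free?

Yes. Improvement Evaluator is completely free and released under the MIT-0 license; you can download, install, and use it freely.

Which platforms does Improvement Evaluator support?

Improvement Evaluator is cross-platform and can be used in any environment where OpenClaw / Claude Code is deployed.

Who develops Improvement Evaluator?

It is developed and maintained by _silhouette (@lanyasheng); the current version is v1.1.1.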
