lanyasheng

Improvement Discriminator

Author: _silhouette · GitHub · v1.0.0 · MIT-0
cross-platform ⚠ suspicious
Total downloads: 79 · Favorites: 0 · Current installs: 0 · Versions: 1
Install in OpenClaw
/install auto-improvement-discriminator
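Description
Use when you need to score improvement candidates with a multi-reviewer blind panel, run LLM-based semantic evaluation, decide whether a candidate should be accepted, or find out why every result came back as hold. Supports --panel for multi-reviewer blind review and --llm-judge for semantic evaluation. Not for structural evaluation (use improvement-learner) or gate decisions (use improvement-gate).
Usage (SKILL.md)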

Improvement Discriminator

Multi-signal scoring engine. Blends heuristic rules, evaluator rubrics, LLM-as-Judge, and multi-reviewer blind panel to score, rank, and recommend actions on improvement candidates.

When to Use / NOT to Use

  • Score and rank candidates, run panel blind review, run LLM-as-Judge semantic evaluation, diagnose hold results
  • NOT for structural evaluation (improvement-learner), gate decisions (improvement-gate), or file changes (improvement-executor)

CLI

python3 scripts/score.py --input CANDS.json [--output SCORED.json] [--state-root DIR]
  [--panel] [--llm-judge {claude,openai,mock}] [--use-evaluator-evidence]
| Param | Description |
|---|---|
| --input | Required. Candidate artifact JSON from generator |
| --output | Output path. Default: {state-root}/rankings/{run_id}.json |
| --state-root | State directory. Default: state/ |
| --panel | Enable 4-reviewer blind panel (structural, conservative, user_advocate, security_auditor) |
| --llm-judge | Enable LLM-as-Judge. Backends: claude (Anthropic API), openai, mock (deterministic, no key) |
| --use-evaluator-evidence | Blend skill-evaluator rubric/category/boundary evidence |
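
The flag set above can be mirrored with a small argparse definition. This is only a sketch of how such a parser might look, assuming standard argparse conventions; `build_parser` is a hypothetical name and the real scripts/score.py may be organized differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags of scripts/score.py."""
    parser = argparse.ArgumentParser(prog="score.py")
    parser.add_argument("--input", required=True,
                        help="Candidate artifact JSON from generator")
    parser.add_argument("--output", default=None,
                        help="Defaults to {state-root}/rankings/{run_id}.json")
    parser.add_argument("--state-root", default="state/",
                        help="State directory")
    parser.add_argument("--panel", action="store_true",
                        help="Enable the 4-reviewer blind panel")
    parser.add_argument("--llm-judge", choices=["claude", "openai", "mock"],
                        default=None, help="Enable LLM-as-Judge with this backend")
    parser.add_argument("--use-evaluator-evidence", action="store_true",
                        help="Blend skill-evaluator evidence")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch is side-effect free.
args = build_parser().parse_args(
    ["--input", "candidates.json", "--panel", "--llm-judge", "mock"]
)
print(args.input, args.panel, args.llm_judge, args.state_root)
```

Using `choices` for --llm-judge makes an unsupported backend name fail at parse time rather than deep inside the scoring run.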

Scoring Modes and Blending Weights

| Mode | Blending |
|---|---|
| Heuristic only (default) | 100% heuristic (base 4.0 + category bonus + source refs - risk penalty) |
| --use-evaluator-evidence | 70% heuristic + 30% evaluator |
| --llm-judge | 60% heuristic + 40% LLM |
| Both flags | 50% heuristic + 30% LLM + 20% evaluator |
| --panel | 4 reviewers score independently; cognitive label decides final recommendation |
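
The blending weights above amount to a weighted sum whose weights depend on which signals are present. The following sketch shows that arithmetic; `blend_scores` is a hypothetical helper, and it assumes the caller has already put all component scores on the same scale (the source does not say how the 0.0-1.0 LLM score is rescaled).

```python
from typing import Optional


def blend_scores(heuristic: float,
                 llm: Optional[float] = None,
                 evaluator: Optional[float] = None) -> float:
    """Blend component scores with the documented weights.

    Assumption: every input is already normalized to a common scale.
    """
    if llm is not None and evaluator is not None:
        # Both flags: 50% heuristic + 30% LLM + 20% evaluator
        return 0.5 * heuristic + 0.3 * llm + 0.2 * evaluator
    if llm is not None:
        # --llm-judge only: 60% heuristic + 40% LLM
        return 0.6 * heuristic + 0.4 * llm
    if evaluator is not None:
        # --use-evaluator-evidence only: 70% heuristic + 30% evaluator
        return 0.7 * heuristic + 0.3 * evaluator
    # Default mode: 100% heuristic
    return heuristic


print(blend_scores(8.0, llm=6.0, evaluator=4.0))
```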

Category bonuses: docs=4.0, reference=3.5, guardrail=3.5, workflow=1.5, tests=1.5, prompt=1.0. Risk penalties: low=0.0, medium=2.0, high=4.5. Protected path adds +2.5.
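
As a worked illustration of the heuristic formula, the sketch below plugs the listed constants into base + category bonus + source refs - risk penalty. Several details are assumptions: `heuristic_score` is a hypothetical name, the size of the source-reference bonus is not given in the source (so it is a caller-supplied parameter), unknown categories are given a bonus of 0, and the protected-path +2.5 is read as an addition to the penalty.

```python
BASE_SCORE = 4.0
CATEGORY_BONUS = {"docs": 4.0, "reference": 3.5, "guardrail": 3.5,
                  "workflow": 1.5, "tests": 1.5, "prompt": 1.0}
RISK_PENALTY = {"low": 0.0, "medium": 2.0, "high": 4.5}
PROTECTED_PATH_PENALTY = 2.5  # assumption: added to the risk penalty


def heuristic_score(category: str, risk: str,
                    source_ref_bonus: float = 0.0,
                    protected: bool = False) -> float:
    """base + category bonus + source refs - risk penalty, per the documented formula."""
    penalty = RISK_PENALTY[risk]
    if protected:
        penalty += PROTECTED_PATH_PENALTY
    # Assumption: categories outside the documented list get no bonus.
    bonus = CATEGORY_BONUS.get(category, 0.0)
    return BASE_SCORE + bonus + source_ref_bonus - penalty


print(heuristic_score("docs", "low"))                      # 4.0 + 4.0 - 0.0
print(heuristic_score("prompt", "high", protected=True))   # 4.0 + 1.0 - 7.0
```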

Multi-Reviewer Panel

| Reviewer | Focus | Risk Sensitivity |
|---|---|---|
| structural | docs (5.0), reference (4.0) | 1.0x |
| conservative | guardrail (5.0), penalizes prompt (0.5) | 1.5x |
| user_advocate | workflow (4.0), prompt (3.0) | 0.8x |
| security_auditor | guardrail (5.0), tests (3.0) | 2.0x |
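
One plausible reading of the Risk Sensitivity column is a per-reviewer multiplier on the risk penalty (low=0.0, medium=2.0, high=4.5). The sketch below encodes the reviewer profiles as data and applies that multiplier; this interpretation and the names `REVIEWERS` and `scaled_risk_penalty` are assumptions, since the source does not spell out how each reviewer computes its score.

```python
RISK_PENALTY = {"low": 0.0, "medium": 2.0, "high": 4.5}

# Reviewer profiles copied from the panel table; the dict layout is illustrative.
REVIEWERS = {
    "structural": {"focus": {"docs": 5.0, "reference": 4.0}, "risk_sensitivity": 1.0},
    "conservative": {"focus": {"guardrail": 5.0, "prompt": 0.5}, "risk_sensitivity": 1.5},
    "user_advocate": {"focus": {"workflow": 4.0, "prompt": 3.0}, "risk_sensitivity": 0.8},
    "security_auditor": {"focus": {"guardrail": 5.0, "tests": 3.0}, "risk_sensitivity": 2.0},
}


def scaled_risk_penalty(reviewer: str, risk: str) -> float:
    """Assumed behavior: the base risk penalty times the reviewer's sensitivity."""
    return RISK_PENALTY[risk] * REVIEWERS[reviewer]["risk_sensitivity"]


for name in REVIEWERS:
    print(name, scaled_risk_penalty(name, "medium"))
```

Under this reading, the same medium-risk candidate costs the security_auditor twice as much as it costs the structural reviewer, which is one way reviewers could end up disagreeing.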

Cognitive labels: CONSENSUS (all agree) -> shared recommendation. VERIFIED (2+ agree) -> majority. DISPUTED (no majority) -> forced hold.
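
The label rules can be sketched as a vote count over the reviewers' recommendations. `cognitive_label` is a hypothetical helper, the recommendation strings are taken from the output summary keys, and treating a tie for first place (for example 2 vs 2) as "no majority" is an assumption the source does not confirm.

```python
from collections import Counter
from typing import List, Tuple


def cognitive_label(recommendations: List[str]) -> Tuple[str, str]:
    """Return (label, final_recommendation) for a non-empty list of reviewer votes."""
    counts = Counter(recommendations).most_common()
    top_rec, top_n = counts[0]
    if top_n == len(recommendations):
        # All reviewers agree: use the shared recommendation.
        return "CONSENSUS", top_rec
    # Assumption: a tie for first place does not count as a majority.
    tied = len(counts) > 1 and counts[1][1] == top_n
    if top_n >= 2 and not tied:
        return "VERIFIED", top_rec
    # No majority: forced hold.
    return "DISPUTED", "hold"


print(cognitive_label(["accept_for_execution"] * 4))
print(cognitive_label(["accept_for_execution", "accept_for_execution", "hold", "reject"]))
print(cognitive_label(["accept_for_execution", "accept_for_execution", "reject", "reject"]))
```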

LLM Judge

Evaluates 4 dimensions (0.0-1.0): clarity, specificity, consistency, safety. Thresholds: approve >= 0.75, reject < 0.40, else conditional.
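
The threshold rule maps an overall judge score to a decision. A minimal sketch follows; `llm_decision` is a hypothetical name, and how the four dimension scores are combined into the overall score is not stated in the source, so the function takes the overall score as given.

```python
APPROVE_THRESHOLD = 0.75
REJECT_THRESHOLD = 0.40


def llm_decision(score: float) -> str:
    """Map an overall judge score in [0.0, 1.0] to approve / conditional / reject."""
    if score >= APPROVE_THRESHOLD:
        return "approve"
    if score < REJECT_THRESHOLD:
        return "reject"
    # Everything from 0.40 up to (but not including) 0.75 is conditional.
    return "conditional"


for s in (0.82, 0.75, 0.40, 0.39):
    print(s, llm_decision(s))
```

Note the boundary behavior: exactly 0.75 approves, while exactly 0.40 is conditional rather than reject.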

| Backend | Model | Key | Fallback |
|---|---|---|---|
| claude | claude-sonnet-4-20250514 | ANTHROPIC_API_KEY (supports ANTHROPIC_BASE_URL) | mock |
| openai | gpt-4o-mini | OPENAI_API_KEY | mock |
| mock | none | none | deterministic, confidence=0.5 |
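
The Fallback column suggests that a real backend degrades to mock when it cannot be used. The sketch below shows one way that selection could work, assuming the trigger is a missing API key; the source does not say what actually triggers the fallback, and `resolve_backend` is a hypothetical name.

```python
import os
from typing import Mapping, Optional

# Environment variable each backend needs, per the backend table (mock needs none).
BACKEND_KEYS = {"claude": "ANTHROPIC_API_KEY", "openai": "OPENAI_API_KEY", "mock": None}


def resolve_backend(requested: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Return the backend to use, falling back to mock when the key is absent.

    Assumption: a missing or empty key is what triggers the fallback.
    """
    env = os.environ if env is None else env
    key_var = BACKEND_KEYS[requested]
    if key_var is None or env.get(key_var):
        return requested
    return "mock"


# An explicit env mapping keeps the example independent of the real environment.
print(resolve_backend("claude", env={}))
print(resolve_backend("openai", env={"OPENAI_API_KEY": "sk-example"}))
```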

Blockers

protected_target, executor_not_supported, not_auto_keep_category, risk_medium/risk_high, skill_level_insufficient_for_structural_change, evaluator_reject, llm_judge_reject

Output JSON Example

{
  "run_id": "abc-123", "stage": "ranked", "critic_mode": "multi-reviewer-panel",
  "scored_candidates": [{
    "id": "cand-001", "score": 7.25, "recommendation": "accept_for_execution",
    "blockers": [], "judge_notes": ["低风险候选,可交给 executor。"],
    "panel": {
      "panel_reviews": [{"reviewer": "structural", "score": 8.5}, {"reviewer": "conservative", "score": 6.0}],
      "cognitive_label": "CONSENSUS", "aggregated_score": 7.25
    },
    "llm_verdict": {"score": 0.82, "decision": "approve",
      "dimensions": {"clarity": 0.85, "specificity": 0.80, "consistency": 0.80, "safety": 0.90}}
  }],
  "summary": {"accept_for_execution": 1, "hold": 0, "reject": 0}
}
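
When every candidate comes back as hold, tallying the blockers in the ranking output is a quick first diagnostic. The sketch below assumes the output shape shown in the example and that held candidates carry a non-empty blockers list; the sample payload, its ids, and the helper name `blocker_histogram` are made up for illustration.

```python
import json
from collections import Counter

# Made-up ranking payload in the documented shape; ids and blockers are illustrative only.
SAMPLE_RANKING = json.loads("""
{
  "run_id": "abc-123",
  "scored_candidates": [
    {"id": "cand-001", "recommendation": "hold", "blockers": ["risk_medium", "protected_target"]},
    {"id": "cand-002", "recommendation": "hold", "blockers": ["risk_medium"]},
    {"id": "cand-003", "recommendation": "accept_for_execution", "blockers": []}
  ]
}
""")


def blocker_histogram(ranking: dict) -> Counter:
    """Count how often each blocker appears on candidates recommended as hold."""
    counts = Counter()
    for candidate in ranking.get("scored_candidates", []):
        if candidate.get("recommendation") == "hold":
            counts.update(candidate.get("blockers", []))
    return counts


for blocker, n in blocker_histogram(SAMPLE_RANKING).most_common():
    print(f"{blocker}: {n}")
```

To run this against a real file, replace the sample with `json.load` on the path passed to --output.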

<example> Panel + LLM judge: $ python3 scripts/score.py --input candidates.json --panel --llm-judge mock --output scored.json </example>

<anti-example> --panel and --llm-judge are NOT mutually exclusive. Each reviewer independently calls the LLM judge. </anti-example>

Related Skills

  • improvement-generator -- produces candidates | improvement-gate -- keep/revert/reject
  • improvement-learner -- structural 6-dim eval | benchmark-store -- frozen baselines
Security Recommendations
Key points before installing/using:
  • The SKILL.md mentions LLM backends that require ANTHROPIC_API_KEY or OPENAI_API_KEY, but the registry metadata lists no required env vars; treat API keys as optional only if you use the mock backend. If you enable --llm-judge with a real backend, provide credentials only if you trust the skill.
  • The evaluator can load and call arbitrary Python modules (RealSkillEvaluator / importlib). Do not point it at untrusted directories or candidate artifacts that contain executable code unless you sandbox the run (e.g., isolated container, restricted runtime).
  • The skill reads and writes local state (default state/). Review scripts/score.py and the RealSkillEvaluator implementation to confirm exactly what files are read/written and whether any network calls beyond LLM providers occur (some files were truncated in the listing).
  • If you want to avoid external API calls, run with --llm-judge mock and/or inspect/disable the LLMJudge backend code.
  • Recommended actions: inspect scripts/score.py and the RealSkillEvaluator code paths in full, run the skill in a sandboxed environment first, and only supply API keys if needed and you understand what vectors (file loading, outbound network) will be used.
Functional Analysis
Type: OpenClaw Skill
Name: auto-improvement-discriminator
Version: 1.0.0

The skill bundle provides a framework for evaluating agent skill improvements using heuristics, LLM-based judging, and automated testing. It contains high-risk capabilities, most notably the RealSkillEvaluator in interfaces/critic_engine.py, which dynamically loads and executes arbitrary Python code through importlib's exec_module. While intended for evaluating skill logic, this facilitates Remote Code Execution (RCE) if the input paths are untrusted. Furthermore, the LLMJudge in interfaces/llm_judge.py is susceptible to indirect prompt injection from candidate content, and the JUnitXMLRegressionAdapter in interfaces/external_regression.py uses the insecure xml.etree.ElementTree parser, posing an XML External Entity (XXE) risk. These components represent significant vulnerabilities rather than intentional malware.
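
One way to reduce the exposure from dynamic module loading is to check that a module path sits under a trusted root before it is ever handed to importlib. The sketch below is not part of the skill; `is_within_allowed_root` and the example paths are hypothetical, and a path check alone does not replace sandboxing.

```python
from pathlib import Path


def is_within_allowed_root(candidate_path: str, allowed_root: str) -> bool:
    """Return True only if candidate_path resolves to allowed_root or something below it."""
    root = Path(allowed_root).resolve()
    # resolve() collapses ".." segments and follows symlinks before the comparison.
    target = Path(candidate_path).resolve()
    return target == root or root in target.parents


# Hypothetical paths, used only to show the traversal case being rejected.
print(is_within_allowed_root("/srv/skills/demo/module.py", "/srv/skills"))
print(is_within_allowed_root("/srv/skills/../etc/passwd", "/srv/skills"))
```

A caller would run this check first and refuse to load anything for which it returns False.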
Capability Assessment
Purpose & Capability
The skill intends to score and rank improvement candidates and to optionally use an LLM judge — that aligns with the included code (critic engine, human review, llm_judge). However the registry metadata lists no required environment variables while SKILL.md and llm_judge.py explicitly document/use ANTHROPIC_API_KEY / OPENAI_API_KEY (and optional base URLs). This mismatch is noteworthy: the skill can operate in mock mode without keys, but using real LLM backends requires credentials that were not declared in the registry metadata.
Instruction Scope
SKILL.md instructs running scripts/score.py with --input and optional flags (--panel, --llm-judge, --use-evaluator-evidence). That matches the repo's scripts. The instructions do not overtly request arbitrary system secrets, but the implementation will read/write state (default state/ path) and can save human review receipts to disk. More importantly, the Critic/RealSkillEvaluator supports loading Python skill modules from file paths and invoking evaluate()/execute() functions; this can execute arbitrary code supplied as a candidate or present on disk, which expands runtime scope beyond simple scoring.
Install Mechanism
No install spec is provided (instruction-only install), and no external downloads are performed by the package itself. The code may import third-party SDKs (anthropic, openai) at runtime if the user enables those backends, but there is no automatic installer or external URL fetch in the provided metadata.
Credentials
Registry metadata declares no required env vars, but SKILL.md and interfaces/llm_judge.py document/use ANTHROPIC_API_KEY and OPENAI_API_KEY (and support ANTHROPIC_BASE_URL). The skill will attempt to call networked LLM backends when --llm-judge is used, which requires those keys. This is a mismatch between declared requirements and actual code. Additionally, the code inserts a sibling 'benchmark-store' path and attempts to import from it — that implies access to other local skill code/config, which increases the scope of data accessible at runtime.
Persistence & Privilege
always:false and autonomous invocation are default/normal. The skill writes state and review receipts (HumanReviewReceipt.save writes files under state paths) and the critic engine may load external benchmark or skill modules from disk. It does not explicitly modify other skills' configurations, but loading/executing other Python modules grants it runtime privilege equivalent to executing arbitrary code if untrusted modules or paths are supplied.
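How to Use
  1. Make sure OpenClaw is installed (locally or via Docker).
  2. Enter the install command in the chat box: /install auto-improvement-discriminator
  3. Once installed, invoke the Skill by name or trigger it with /auto-improvement-discriminator
  4. Provide the inputs described in the Skill's parameter reference to get structured output.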
Version History
v1.0.0
Initial release: closed-loop skill improvement pipeline
Metadata
Slug: auto-improvement-discriminator
Version: 1.0.0
License: MIT-0
Total installs: 0
Current installs: 0
Versions: 1
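FAQ

What is Improvement Discriminator?

Use it when you need to score improvement candidates with a multi-reviewer blind panel, run LLM-based semantic evaluation, decide whether a candidate should be accepted, or find out why every result came back as hold. It supports --panel for multi-reviewer blind review and --llm-judge for semantic evaluation. It is not for structural evaluation (use improvement-learner) or gate decisions (use improvement-gate). It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 79 total downloads to date.

How do I install Improvement Discriminator?

Run the command "/install auto-improvement-discriminator" in the OpenClaw or Claude Code chat box for a one-step install; no additional configuration is required.

Is Improvement Discriminator free?

Yes. Improvement Discriminator is completely free and released under the MIT-0 license; you can download, install, and use it freely.

Which platforms does Improvement Discriminator support?

Improvement Discriminator is cross-platform and runs in any environment where OpenClaw / Claude Code is deployed.

Who develops Improvement Discriminator?

It is developed and maintained by _silhouette (@lanyasheng); the current version is v1.0.0.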
