
Improvement Discriminator

Name: Improvement Discriminator
Author: lanyasheng

Author _silhouette · GitHub · v1.0.0 · MIT-0

cross-platform ⚠ suspicious

Install in OpenClaw

/install auto-improvement-discriminator

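Description

Use this skill when you need multi-reviewer blind scoring of improvement candidates, LLM-based semantic evaluation, a decision on whether a candidate should be accepted, or an explanation of why every scoring result came back as hold. It supports --panel for multi-reviewer blind review and --llm-judge for semantic evaluation. It is not for structural evaluation (use improvement-learner) or gate decisions (use improvement-gate).

Usage (SKILL.md)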

Improvement Discriminator

Multi-signal scoring engine. Blends heuristic rules, evaluator rubrics, LLM-as-Judge, and multi-reviewer blind panel to score, rank, and recommend actions on improvement candidates.

When to Use / NOT to Use

Score and rank candidates, run panel blind review, run LLM-as-Judge semantic evaluation, diagnose hold results
NOT for structural evaluation (improvement-learner), gate decisions (improvement-gate), or file changes (improvement-executor)

CLI

python3 scripts/score.py --input CANDS.json [--output SCORED.json] [--state-root DIR]
  [--panel] [--llm-judge {claude,openai,mock}] [--use-evaluator-evidence]

| Param | Description |
| --- | --- |
| `--input` | Required. Candidate artifact JSON from generator |
| `--output` | Output path. Default: `{state-root}/rankings/{run_id}.json` |
| `--state-root` | State directory. Default: `state/` |
| `--panel` | Enable 4-reviewer blind panel (structural, conservative, user_advocate, security_auditor) |
| `--llm-judge` | Enable LLM-as-Judge. Backends: `claude` (Anthropic API), `openai`, `mock` (deterministic, no key) |
| `--use-evaluator-evidence` | Blend skill-evaluator rubric/category/boundary evidence |
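
The flags above can be assembled programmatically. The following is a minimal sketch, not part of the skill: `build_score_command` is a hypothetical helper that only builds the argument list from the documented flags and does not run anything.

```python
from typing import List, Optional

# Backends listed for --llm-judge in the parameter table above.
VALID_JUDGES = ("claude", "openai", "mock")


def build_score_command(
    input_path: str,
    output_path: Optional[str] = None,
    state_root: Optional[str] = None,
    panel: bool = False,
    llm_judge: Optional[str] = None,
    use_evaluator_evidence: bool = False,
) -> List[str]:
    """Build the argv list for scripts/score.py (hypothetical helper)."""
    if llm_judge is not None and llm_judge not in VALID_JUDGES:
        raise ValueError(f"unknown --llm-judge backend: {llm_judge!r}")
    cmd = ["python3", "scripts/score.py", "--input", input_path]
    if output_path:
        cmd += ["--output", output_path]
    if state_root:
        cmd += ["--state-root", state_root]
    if panel:
        cmd.append("--panel")
    if llm_judge:
        cmd += ["--llm-judge", llm_judge]
    if use_evaluator_evidence:
        cmd.append("--use-evaluator-evidence")
    return cmd


example_cmd = build_score_command(
    "candidates.json", output_path="scored.json", panel=True, llm_judge="mock"
)
print(" ".join(example_cmd))
```

The resulting list can be passed to `subprocess.run(example_cmd, check=True)` from the skill directory. Building a list rather than a shell string avoids quoting problems with file paths.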

Scoring Modes and Blending Weights

| Mode | Blending |
| --- | --- |
| Heuristic only (default) | 100% heuristic (base 4.0 + category bonus + source refs - risk penalty) |
| `--use-evaluator-evidence` | 70% heuristic + 30% evaluator |
| `--llm-judge` | 60% heuristic + 40% LLM |
| Both flags | 50% heuristic + 30% LLM + 20% evaluator |
| `--panel` | 4 reviewers score independently; cognitive label decides final recommendation |

Category bonuses: docs=4.0, reference=3.5, guardrail=3.5, workflow=1.5, tests=1.5, prompt=1.0. Risk penalties: low=0.0, medium=2.0, high=4.5. Protected path adds +2.5.
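
The arithmetic above can be sketched as follows. This is an illustration under stated assumptions, not the code in scripts/score.py: the function names are hypothetical, the source-reference bonus is passed in because its value is not documented here, the protected-path 2.5 is assumed to be added to the penalty, and the LLM and evaluator scores are assumed to have already been converted to the same scale as the heuristic score (the LLM judge itself reports 0.0-1.0, and the rescaling is not documented here).

```python
from typing import Optional

BASE_SCORE = 4.0
CATEGORY_BONUS = {
    "docs": 4.0, "reference": 3.5, "guardrail": 3.5,
    "workflow": 1.5, "tests": 1.5, "prompt": 1.0,
}
RISK_PENALTY = {"low": 0.0, "medium": 2.0, "high": 4.5}
# ASSUMPTION: "Protected path adds +2.5" is read as an extra penalty.
PROTECTED_PATH_PENALTY = 2.5


def heuristic_score(category: str, risk: str,
                    source_ref_bonus: float = 0.0,
                    protected: bool = False) -> float:
    """base + category bonus + source refs - risk penalty."""
    penalty = RISK_PENALTY[risk]
    if protected:
        penalty += PROTECTED_PATH_PENALTY
    # ASSUMPTION: categories without a listed bonus get 0.0.
    return BASE_SCORE + CATEGORY_BONUS.get(category, 0.0) + source_ref_bonus - penalty


def blend(heuristic: float,
          llm: Optional[float] = None,
          evaluator: Optional[float] = None) -> float:
    """Apply the documented weights; inputs are assumed to share one scale."""
    if llm is not None and evaluator is not None:
        return 0.5 * heuristic + 0.3 * llm + 0.2 * evaluator
    if llm is not None:
        return 0.6 * heuristic + 0.4 * llm
    if evaluator is not None:
        return 0.7 * heuristic + 0.3 * evaluator
    return heuristic


docs_low = heuristic_score("docs", "low")        # 4.0 + 4.0 - 0.0
prompt_high = heuristic_score("prompt", "high")  # 4.0 + 1.0 - 4.5
print(docs_low, prompt_high)
```

Under these assumptions a low-risk docs candidate starts at 8.0 and a high-risk prompt candidate at 0.5, which shows how strongly risk level dominates the heuristic.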

Multi-Reviewer Panel

| Reviewer | Focus | Risk Sensitivity |
| --- | --- | --- |
| structural | docs (5.0), reference (4.0) | 1.0x |
| conservative | guardrail (5.0), penalizes prompt (0.5) | 1.5x |
| user_advocate | workflow (4.0), prompt (3.0) | 0.8x |
| security_auditor | guardrail (5.0), tests (3.0) | 2.0x |

Cognitive labels: CONSENSUS (all agree) -> shared recommendation. VERIFIED (2+ agree) -> majority. DISPUTED (no majority) -> forced hold.
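
The labeling rule can be sketched as below. This is a hypothetical illustration, not the skill's implementation: `cognitive_label` is an invented name, and the handling of ties (for example a 2-versus-2 split among four reviewers) is an assumption, since the text does not say whether a tie counts as a majority. Here a tie for first place is treated as DISPUTED.

```python
from collections import Counter
from typing import List, Tuple


def cognitive_label(recommendations: List[str]) -> Tuple[str, str]:
    """Return (label, final_recommendation) from per-reviewer recommendations."""
    if not recommendations:
        raise ValueError("at least one reviewer recommendation is required")
    counts = Counter(recommendations).most_common()
    top_rec, top_count = counts[0]
    if top_count == len(recommendations):
        return "CONSENSUS", top_rec
    # ASSUMPTION: a tie for first place counts as "no majority".
    tied = len(counts) > 1 and counts[1][1] == top_count
    if top_count >= 2 and not tied:
        return "VERIFIED", top_rec
    return "DISPUTED", "hold"


print(cognitive_label(["accept_for_execution"] * 4))
print(cognitive_label(["accept_for_execution", "accept_for_execution", "hold", "reject"]))
print(cognitive_label(["accept_for_execution", "hold", "reject", "hold"]))
```

One practical consequence of a rule like this: a run in which reviewers frequently split evenly will yield many forced holds, which is worth checking when every result comes back as hold.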

LLM Judge

Evaluates 4 dimensions (0.0-1.0): clarity, specificity, consistency, safety. Thresholds: approve >= 0.75, reject < 0.40, else conditional.

| Backend | Model | Key | Fallback |
| --- | --- | --- | --- |
| claude | claude-sonnet-4-20250514 | `ANTHROPIC_API_KEY` (supports `ANTHROPIC_BASE_URL`) | mock |
| openai | gpt-4o-mini | `OPENAI_API_KEY` | mock |
| mock | none | none | deterministic, confidence=0.5 |
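
The decision thresholds stated for the LLM judge map onto a small function. This sketch takes the overall score as input because the way the four dimension scores are combined into one score is not documented here; `judge_decision` is a hypothetical name.

```python
APPROVE_THRESHOLD = 0.75
REJECT_THRESHOLD = 0.40


def judge_decision(score: float) -> str:
    """Map an overall judge score in [0.0, 1.0] to a decision string."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    if score >= APPROVE_THRESHOLD:
        return "approve"
    if score < REJECT_THRESHOLD:
        return "reject"
    return "conditional"


for s in (0.82, 0.75, 0.50, 0.40, 0.39):
    print(s, judge_decision(s))
```

Note the boundary behavior implied by the thresholds: exactly 0.75 approves, while exactly 0.40 is conditional rather than reject.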

Blockers

protected_target, executor_not_supported, not_auto_keep_category, risk_medium/risk_high, skill_level_insufficient_for_structural_change, evaluator_reject, llm_judge_reject

Output JSON Example

{
  "run_id": "abc-123", "stage": "ranked", "critic_mode": "multi-reviewer-panel",
  "scored_candidates": [{
    "id": "cand-001", "score": 7.25, "recommendation": "accept_for_execution",
    "blockers": [], "judge_notes": ["低风险候选，可交给 executor。"],
    "panel": {
      "panel_reviews": [{"reviewer": "structural", "score": 8.5}, {"reviewer": "conservative", "score": 6.0}],
      "cognitive_label": "CONSENSUS", "aggregated_score": 7.25
    },
    "llm_verdict": {"score": 0.82, "decision": "approve",
      "dimensions": {"clarity": 0.85, "specificity": 0.80, "consistency": 0.80, "safety": 0.90}}
  }],
  "summary": {"accept_for_execution": 1, "hold": 0, "reject": 0}
}
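
To diagnose why candidates ended up on hold, the ranking JSON can be summarized by blocker. The sketch below relies only on the field names shown in the example above; `summarize_holds` and the sample data are hypothetical and exist purely for illustration.

```python
from collections import Counter
from typing import Any, Dict


def summarize_holds(ranking: Dict[str, Any]) -> Dict[str, Any]:
    """Collect held candidates, their blockers, and any DISPUTED panel results."""
    held = [
        c for c in ranking.get("scored_candidates", [])
        if c.get("recommendation") == "hold"
    ]
    blocker_counts = Counter(b for c in held for b in c.get("blockers", []))
    disputed = [
        c["id"] for c in held
        if (c.get("panel") or {}).get("cognitive_label") == "DISPUTED"
    ]
    return {
        "held_ids": [c["id"] for c in held],
        "blocker_counts": dict(blocker_counts),
        "disputed_ids": disputed,
    }


# Hypothetical sample that mirrors the structure of the example output.
sample = {
    "run_id": "abc-123",
    "scored_candidates": [
        {"id": "cand-001", "recommendation": "accept_for_execution", "blockers": []},
        {"id": "cand-002", "recommendation": "hold",
         "blockers": ["risk_medium", "protected_target"]},
        {"id": "cand-003", "recommendation": "hold", "blockers": ["risk_medium"],
         "panel": {"cognitive_label": "DISPUTED"}},
    ],
}
report = summarize_holds(sample)
print(report)
```

For a real run, load the file first with `json.load` from the `--output` path (or the default location under the state root) and pass the resulting dict to the function.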

<example> Panel + LLM judge: $ python3 scripts/score.py --input candidates.json --panel --llm-judge mock --output scored.json </example>

<anti-example> --panel and --llm-judge are NOT mutually exclusive. Each reviewer independently calls the LLM judge. </anti-example>

Related Skills

improvement-generator -- produces candidates | improvement-gate -- keep/revert/reject
improvement-learner -- structural 6-dim eval | benchmark-store -- frozen baselines

Security Recommendations

Key points before installing or using:
- SKILL.md mentions LLM backends that require ANTHROPIC_API_KEY or OPENAI_API_KEY, but the registry metadata lists no required env vars. Treat API keys as optional only if you use the mock backend. If you enable --llm-judge with a real backend, provide credentials only if you trust the skill.
- The evaluator can load and call arbitrary Python modules (RealSkillEvaluator / importlib). Do not point it at untrusted directories or candidate artifacts that contain executable code unless you sandbox the run (e.g., isolated container, restricted runtime).
- The skill reads and writes local state (default state/). Review scripts/score.py and the RealSkillEvaluator implementation to confirm exactly which files are read and written and whether any network calls beyond LLM providers occur (some files were truncated in the listing).
- If you want to avoid external API calls, run with --llm-judge mock and/or inspect or disable the LLMJudge backend code.
- Recommended actions: inspect scripts/score.py and the RealSkillEvaluator code paths in full, run the skill in a sandboxed environment first, and supply API keys only if needed and only once you understand which vectors (file loading, outbound network) will be used.
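
One way to act on the "mock backend, no credentials" advice is to launch the scorer with the API-related variables removed from its environment. The sketch below is an assumption-laden illustration: `scrubbed_env` is a hypothetical helper, and it only strips the three variables named in this listing. It does not sandbox file access or module loading, so it complements a container or restricted runtime rather than replacing one.

```python
import os
from typing import Dict, Mapping, Optional

# Variables named in this listing as used by the LLM judge backends.
SENSITIVE_VARS = ("ANTHROPIC_API_KEY", "ANTHROPIC_BASE_URL", "OPENAI_API_KEY")


def scrubbed_env(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment without the LLM credential variables."""
    source = os.environ if env is None else env
    return {k: v for k, v in source.items() if k not in SENSITIVE_VARS}


# Command taken from the documented CLI, using the deterministic mock judge.
offline_cmd = [
    "python3", "scripts/score.py",
    "--input", "candidates.json",
    "--llm-judge", "mock",
]

demo = scrubbed_env({"PATH": "/usr/bin", "OPENAI_API_KEY": "sk-example"})
print(demo)
# To run for real (from the skill directory):
#   import subprocess
#   subprocess.run(offline_cmd, env=scrubbed_env(), check=True)
```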

Functional Analysis

Type: OpenClaw Skill
Name: auto-improvement-discriminator
Version: 1.0.0

The skill bundle provides a framework for evaluating agent skill improvements using heuristics, LLM-based judging, and automated testing. It contains high-risk capabilities, most notably the RealSkillEvaluator in interfaces/critic_engine.py, which dynamically loads and executes arbitrary Python code through importlib (the module loader's exec_module). While intended for evaluating skill logic, this facilitates Remote Code Execution (RCE) if the input paths are untrusted. Furthermore, the LLMJudge in interfaces/llm_judge.py is susceptible to indirect prompt injection from candidate content, and the JUnitXMLRegressionAdapter in interfaces/external_regression.py parses XML with xml.etree.ElementTree, which the analysis flags as an XML External Entity (XXE) risk on untrusted input. These components represent significant vulnerabilities rather than intentional malware.

Capability Assessment

ℹ Purpose & Capability

The skill intends to score and rank improvement candidates and to optionally use an LLM judge — that aligns with the included code (critic engine, human review, llm_judge). However the registry metadata lists no required environment variables while SKILL.md and llm_judge.py explicitly document/use ANTHROPIC_API_KEY / OPENAI_API_KEY (and optional base URLs). This mismatch is noteworthy: the skill can operate in mock mode without keys, but using real LLM backends requires credentials that were not declared in the registry metadata.

ℹ Instruction Scope

SKILL.md instructs running scripts/score.py with --input and optional flags (--panel, --llm-judge, --use-evaluator-evidence). That matches the repo's scripts. The instructions do not overtly request arbitrary system secrets, but the implementation will read and write state (default state/ path) and can save human review receipts to disk. More importantly, the Critic/RealSkillEvaluator supports loading Python skill modules from file paths and invoking evaluate()/execute() functions. This can execute arbitrary code supplied as a candidate or present on disk, which expands runtime scope beyond simple scoring.

✓ Install Mechanism

No install spec is provided (instruction-only install), and no external downloads are performed by the package itself. The code may import third-party SDKs (anthropic, openai) at runtime if the user enables those backends, but there is no automatic installer or external URL fetch in the provided metadata.

⚠ Credentials

Registry metadata declares no required env vars, but SKILL.md and interfaces/llm_judge.py document/use ANTHROPIC_API_KEY and OPENAI_API_KEY (and support ANTHROPIC_BASE_URL). The skill will attempt to call networked LLM backends when --llm-judge is used, which requires those keys. This is a mismatch between declared requirements and actual code. Additionally, the code inserts a sibling 'benchmark-store' path and attempts to import from it — that implies access to other local skill code/config, which increases the scope of data accessible at runtime.

ℹ Persistence & Privilege

always:false and autonomous invocation are default/normal. The skill writes state and review receipts (HumanReviewReceipt.save writes files under state paths) and the critic engine may load external benchmark or skill modules from disk. It does not explicitly modify other skills' configurations, but loading/executing other Python modules grants it runtime privilege equivalent to executing arbitrary code if untrusted modules or paths are supplied.

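How to Use

Make sure OpenClaw is installed (locally or via Docker)
Enter the install command in the chat box: /install auto-improvement-discriminator
Once installation finishes, invoke the skill by name or trigger it with /auto-improvement-discriminator
Provide the required inputs as described in the skill's parameter documentation to get structured output

Version History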

v1.0.0

Initial release: closed-loop skill improvement pipeline

Metadata

Slug auto-improvement-discriminator

Version 1.0.0

License MIT-0

Total installs 0

Current installs 0

Versions 1

FAQ