lanyasheng

Improvement Discriminator

by _silhouette · GitHub ↗ · v1.0.0 · MIT-0
cross-platform ⚠ suspicious
79 Downloads · 0 Stars · 0 Active Installs · 1 Version
Install in OpenClaw
/install auto-improvement-discriminator
Description
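Use when you need multi-reviewer blind scoring of improvement candidates, LLM-based semantic evaluation, a decision on whether a candidate should be accepted, or an explanation of why every scoring result came back as hold. Supports --panel multi-reviewer blind review and --llm-judge semantic evaluation. Not for structural evaluation (use improvement-learner) or gate decisions (use improvement-gate).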
README (SKILL.md)

Improvement Discriminator

Multi-signal scoring engine. Blends heuristic rules, evaluator rubrics, LLM-as-Judge, and multi-reviewer blind panel to score, rank, and recommend actions on improvement candidates.

When to Use / NOT to Use

  • Score and rank candidates, run panel blind review, run LLM-as-Judge semantic evaluation, diagnose hold results
  • NOT for structural evaluation (improvement-learner), gate decisions (improvement-gate), or file changes (improvement-executor)

CLI

python3 scripts/score.py --input CANDS.json [--output SCORED.json] [--state-root DIR]
  [--panel] [--llm-judge {claude,openai,mock}] [--use-evaluator-evidence]
Param                      Description
--input                    Required. Candidate artifact JSON from generator
--output                   Output path. Default: {state-root}/rankings/{run_id}.json
--state-root               State directory. Default: state/
--panel                    Enable 4-reviewer blind panel (structural, conservative, user_advocate, security_auditor)
--llm-judge                Enable LLM-as-Judge. Backends: claude (Anthropic API), openai, mock (deterministic, no key)
--use-evaluator-evidence   Blend skill-evaluator rubric/category/boundary evidence
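
The flags above can be assembled programmatically before handing them to a process runner. The following is a minimal sketch, not part of the skill: `build_score_command` and `LLM_BACKENDS` are hypothetical names, and only the flags documented in the synopsis are used.

```python
from typing import List, Optional

# Backends listed in the CLI synopsis above.
LLM_BACKENDS = {"claude", "openai", "mock"}


def build_score_command(
    input_path: str,
    output: Optional[str] = None,
    state_root: Optional[str] = None,
    panel: bool = False,
    llm_judge: Optional[str] = None,
    use_evaluator_evidence: bool = False,
) -> List[str]:
    """Assemble the argv list for scripts/score.py (hypothetical helper)."""
    if llm_judge is not None and llm_judge not in LLM_BACKENDS:
        raise ValueError(f"unknown --llm-judge backend: {llm_judge!r}")

    # --input is the only required flag.
    cmd = ["python3", "scripts/score.py", "--input", input_path]
    if output is not None:
        cmd += ["--output", output]
    if state_root is not None:
        cmd += ["--state-root", state_root]
    if panel:
        cmd.append("--panel")
    if llm_judge is not None:
        cmd += ["--llm-judge", llm_judge]
    if use_evaluator_evidence:
        cmd.append("--use-evaluator-evidence")
    return cmd


# Offline run: the mock backend needs no API key.
print(build_score_command("candidates.json", panel=True, llm_judge="mock"))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the skill's root directory; building a list rather than a shell string avoids quoting problems with paths that contain spaces.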

Scoring Modes and Blending Weights

Mode                       Blending
Heuristic only (default)   100% heuristic (base 4.0 + category bonus + source refs - risk penalty)
--use-evaluator-evidence   70% heuristic + 30% evaluator
--llm-judge                60% heuristic + 40% LLM
Both flags                 50% heuristic + 30% LLM + 20% evaluator
--panel                    4 reviewers score independently; cognitive label decides final recommendation

Category bonuses: docs=4.0, reference=3.5, guardrail=3.5, workflow=1.5, tests=1.5, prompt=1.0. Risk penalties: low=0.0, medium=2.0, high=4.5. Protected path adds +2.5.
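
The heuristic formula and blending weights above can be written out as a small worked example. This is a sketch under stated assumptions rather than the skill's actual code: the size of the source-refs bonus is not documented, so it is passed in as a parameter; the protected-path "+2.5" is assumed to increase the penalty; unknown categories are assumed to earn no bonus; and all scores are assumed to be on the same scale before blending (the LLM judge reports 0.0-1.0, so it would need rescaling first).

```python
from typing import Optional

BASE_SCORE = 4.0
CATEGORY_BONUS = {
    "docs": 4.0, "reference": 3.5, "guardrail": 3.5,
    "workflow": 1.5, "tests": 1.5, "prompt": 1.0,
}
RISK_PENALTY = {"low": 0.0, "medium": 2.0, "high": 4.5}
PROTECTED_PATH_PENALTY = 2.5  # assumption: the "+2.5" is added to the penalty


def heuristic_score(category: str, risk: str,
                    source_ref_bonus: float = 0.0,
                    protected: bool = False) -> float:
    """base + category bonus + source refs - risk penalty."""
    penalty = RISK_PENALTY[risk]
    if protected:
        penalty += PROTECTED_PATH_PENALTY
    # Assumption: categories outside the table get no bonus.
    bonus = CATEGORY_BONUS.get(category, 0.0)
    return BASE_SCORE + bonus + source_ref_bonus - penalty


def blend(heuristic: float,
          llm: Optional[float] = None,
          evaluator: Optional[float] = None) -> float:
    """Apply the documented weights; inputs are assumed to share one scale."""
    if llm is not None and evaluator is not None:
        return 0.5 * heuristic + 0.3 * llm + 0.2 * evaluator
    if llm is not None:
        return 0.6 * heuristic + 0.4 * llm
    if evaluator is not None:
        return 0.7 * heuristic + 0.3 * evaluator
    return heuristic


docs_low = heuristic_score("docs", "low")  # 4.0 + 4.0 - 0.0
print(docs_low, blend(docs_low, llm=6.0))
```

Under these assumptions a low-risk docs candidate scores 8.0 on heuristics alone, while a high-risk prompt change on a protected path scores 4.0 + 1.0 - 7.0 = -2.0, which shows how quickly risk penalties dominate the category bonus.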

Multi-Reviewer Panel

Reviewer           Focus                                     Risk Sensitivity
structural         docs (5.0), reference (4.0)               1.0x
conservative       guardrail (5.0), penalizes prompt (0.5)   1.5x
user_advocate      workflow (4.0), prompt (3.0)              0.8x
security_auditor   guardrail (5.0), tests (3.0)              2.0x

Cognitive labels: CONSENSUS (all agree) -> shared recommendation. VERIFIED (2+ agree) -> majority. DISPUTED (no majority) -> forced hold.
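
The labelling rule above can be sketched as a vote count. This is an illustrative reading, not the skill's implementation: `cognitive_label` is a hypothetical function, and a tie for first place (for example a 2-2 split among four reviewers) is assumed to count as "no majority". With four reviewers and three possible recommendations at least two always agree, so under this reading DISPUTED arises only from ties.

```python
from collections import Counter
from typing import List, Tuple


def cognitive_label(recommendations: List[str]) -> Tuple[str, str]:
    """Return (label, final_recommendation) for one candidate's panel votes."""
    if not recommendations:
        raise ValueError("at least one reviewer recommendation is required")

    counts = Counter(recommendations).most_common()
    top, top_votes = counts[0]

    # Every reviewer gave the same recommendation.
    if top_votes == len(recommendations):
        return "CONSENSUS", top

    # Assumption: a tie for first place means there is no majority.
    tied = len(counts) > 1 and counts[1][1] == top_votes
    if top_votes >= 2 and not tied:
        return "VERIFIED", top

    # No majority: the panel forces a hold.
    return "DISPUTED", "hold"


print(cognitive_label(["accept_for_execution", "accept_for_execution", "hold", "reject"]))
```

This also suggests one way to diagnose runs where everything comes back as hold: check whether the panel votes are split evenly, since a DISPUTED label overrides the individual scores.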

LLM Judge

Evaluates 4 dimensions (0.0-1.0): clarity, specificity, consistency, safety. Thresholds: approve >= 0.75, reject < 0.40, else conditional.
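
The thresholds translate directly into a three-way decision. The sketch below maps an overall judge score to a decision; `judge_decision` is a hypothetical name, and how the four dimension scores are combined into the overall score is not documented here, so the function deliberately takes the overall score as its input.

```python
APPROVE_THRESHOLD = 0.75  # approve when score >= 0.75
REJECT_THRESHOLD = 0.40   # reject when score < 0.40


def judge_decision(score: float) -> str:
    """Map an overall LLM-judge score in [0.0, 1.0] to a decision."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    if score >= APPROVE_THRESHOLD:
        return "approve"
    if score < REJECT_THRESHOLD:
        return "reject"
    return "conditional"


for s in (0.82, 0.40, 0.39):
    print(s, judge_decision(s))
```

Note the boundary behaviour: exactly 0.75 approves, while exactly 0.40 is conditional rather than rejected.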

Backend   Model                      Key                                               Fallback
claude    claude-sonnet-4-20250514   ANTHROPIC_API_KEY (supports ANTHROPIC_BASE_URL)   mock
openai    gpt-4o-mini                OPENAI_API_KEY                                    mock
mock      none                       none                                              deterministic, confidence=0.5
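
A pre-flight check can tell you which backend a run is likely to end up on. The sketch below is an assumption-laden illustration: `resolve_backend` is a hypothetical helper, and it assumes the fallback to mock is triggered by a missing API key, which the table above does not state explicitly.

```python
from typing import Mapping, Optional

# Environment variable each backend needs, per the table above.
REQUIRED_KEY: Mapping[str, Optional[str]] = {
    "claude": "ANTHROPIC_API_KEY",
    "openai": "OPENAI_API_KEY",
    "mock": None,
}


def resolve_backend(requested: str, env: Mapping[str, str]) -> str:
    """Return the backend expected to run, assuming a missing key means mock."""
    if requested not in REQUIRED_KEY:
        raise ValueError(f"unknown backend: {requested!r}")
    key = REQUIRED_KEY[requested]
    # mock needs no key; real backends need a non-empty key value.
    if key is None or env.get(key):
        return requested
    return "mock"


print(resolve_backend("claude", {}))  # no key supplied in this mapping
```

In practice you would pass `os.environ` as `env`. Checking up front makes a silent fallback visible, which matters because mock verdicts are deterministic and carry a fixed confidence of 0.5.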

Blockers

protected_target, executor_not_supported, not_auto_keep_category, risk_medium/risk_high, skill_level_insufficient_for_structural_change, evaluator_reject, llm_judge_reject

Output JSON Example

{
  "run_id": "abc-123", "stage": "ranked", "critic_mode": "multi-reviewer-panel",
  "scored_candidates": [{
    "id": "cand-001", "score": 7.25, "recommendation": "accept_for_execution",
    "blockers": [], "judge_notes": ["低风险候选,可交给 executor。"],
    "panel": {
      "panel_reviews": [{"reviewer": "structural", "score": 8.5}, {"reviewer": "conservative", "score": 6.0}],
      "cognitive_label": "CONSENSUS", "aggregated_score": 7.25
    },
    "llm_verdict": {"score": 0.82, "decision": "approve",
      "dimensions": {"clarity": 0.85, "specificity": 0.80, "consistency": 0.80, "safety": 0.90}}
  }],
  "summary": {"accept_for_execution": 1, "hold": 0, "reject": 0}
}
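
To diagnose hold results from a ranking file, it helps to pull out each held candidate together with its blockers and panel label. The following sketch uses only field names that appear in the example above; `held_candidates` is a hypothetical helper and the sample data is invented for illustration.

```python
import json
from typing import Dict


def held_candidates(ranking: dict) -> Dict[str, dict]:
    """Map each held candidate id to its blockers and panel label (if any)."""
    held = {}
    for cand in ranking.get("scored_candidates", []):
        if cand.get("recommendation") != "hold":
            continue
        held[cand["id"]] = {
            "blockers": list(cand.get("blockers", [])),
            # "panel" is only present when --panel was used.
            "cognitive_label": cand.get("panel", {}).get("cognitive_label"),
        }
    return held


# Invented sample shaped like the output example above.
sample = json.loads("""
{
  "run_id": "abc-123", "stage": "ranked",
  "scored_candidates": [
    {"id": "cand-001", "score": 7.25, "recommendation": "accept_for_execution", "blockers": []},
    {"id": "cand-002", "score": 3.5, "recommendation": "hold",
     "blockers": ["risk_medium"], "panel": {"cognitive_label": "DISPUTED"}}
  ],
  "summary": {"accept_for_execution": 1, "hold": 1, "reject": 0}
}
""")

print(held_candidates(sample))
```

For a real run, replace `sample` with `json.load(open(path))` on the file written to `--output` or to the default rankings path.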

<example>
Panel + LLM judge:
$ python3 scripts/score.py --input candidates.json --panel --llm-judge mock --output scored.json
</example>

<anti-example>
--panel and --llm-judge are NOT mutually exclusive. Each reviewer independently calls the LLM judge.
</anti-example>

Related Skills

  • improvement-generator -- produces candidates
  • improvement-gate -- keep/revert/reject
  • improvement-learner -- structural 6-dim eval
  • benchmark-store -- frozen baselines
Usage Guidance
Key points before installing/using:

  • The SKILL.md mentions LLM backends that require ANTHROPIC_API_KEY or OPENAI_API_KEY, but the registry metadata lists no required env vars. Treat API keys as optional only if you use the mock backend. If you enable --llm-judge with a real backend, provide credentials only if you trust the skill.
  • The evaluator can load and call arbitrary Python modules (RealSkillEvaluator / importlib). Do not point it at untrusted directories or candidate artifacts that contain executable code unless you sandbox the run (e.g., isolated container, restricted runtime).
  • The skill reads and writes local state (default state/). Review scripts/score.py and the RealSkillEvaluator implementation to confirm exactly what files are read/written and whether any network calls beyond LLM providers occur (some files were truncated in the listing).
  • If you want to avoid external API calls, run with --llm-judge mock and/or inspect/disable the LLMJudge backend code.
  • Recommended actions: inspect scripts/score.py and the RealSkillEvaluator code paths in full, run the skill in a sandboxed environment first, and only supply API keys if needed and you understand what vectors (file loading, outbound network) will be used.
Capability Analysis
Type: OpenClaw Skill
Name: auto-improvement-discriminator
Version: 1.0.0

The skill bundle provides a framework for evaluating agent skill improvements using heuristics, LLM-based judging, and automated testing. It contains high-risk capabilities, most notably the RealSkillEvaluator in interfaces/critic_engine.py, which dynamically loads and executes arbitrary Python code through importlib (the loader's exec_module call). While intended for evaluating skill logic, this facilitates Remote Code Execution (RCE) if the input paths are untrusted. Furthermore, the LLMJudge in interfaces/llm_judge.py is susceptible to indirect prompt injection from candidate content, and the JUnitXMLRegressionAdapter in interfaces/external_regression.py uses the xml.etree.ElementTree parser, which the Python documentation warns is not secure against maliciously constructed data; the analysis flags this as an XML External Entity (XXE) risk. These components represent significant vulnerabilities rather than intentional malware.
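
To see why importlib-based loading is equivalent to running the file, consider the standard loading pattern below. This is a generic illustration using only the Python standard library, not the RealSkillEvaluator code; `load_module_from_path` and the temporary `skill.py` file are hypothetical.

```python
import importlib.util
import tempfile
from pathlib import Path


def load_module_from_path(path: str, name: str = "candidate_module"):
    """Load a Python file as a module; its top-level code runs immediately."""
    spec = importlib.util.spec_from_file_location(name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # executes every top-level statement in the file
    return module


# Write a harmless stand-in for an untrusted skill module and load it.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "skill.py"
    target.write_text("SIDE_EFFECT = 'top-level code ran'\n", encoding="utf-8")
    loaded = load_module_from_path(str(target))

print(loaded.SIDE_EFFECT)  # prints: top-level code ran
```

The assignment here is benign, but any statement at the top level of the file would run with the same privileges as the scoring process, before any evaluate() or execute() function is called. That is why the paths handed to the evaluator need to be trusted or the run sandboxed.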
Capability Assessment
Purpose & Capability
The skill intends to score and rank improvement candidates and to optionally use an LLM judge — that aligns with the included code (critic engine, human review, llm_judge). However the registry metadata lists no required environment variables while SKILL.md and llm_judge.py explicitly document/use ANTHROPIC_API_KEY / OPENAI_API_KEY (and optional base URLs). This mismatch is noteworthy: the skill can operate in mock mode without keys, but using real LLM backends requires credentials that were not declared in the registry metadata.
Instruction Scope
SKILL.md instructs running scripts/score.py with --input and optional flags (--panel, --llm-judge, --use-evaluator-evidence). That matches the repo's scripts. The instructions do not overtly request arbitrary system secrets, but the implementation will read/write state (default state/ path) and can save human review receipts to disk. More importantly, the Critic/RealSkillEvaluator supports loading Python skill modules from file paths and invoking evaluate()/execute() functions. This can execute arbitrary code supplied as a candidate or present on disk, which expands runtime scope beyond simple scoring.
Install Mechanism
No install spec is provided (instruction-only install), and no external downloads are performed by the package itself. The code may import third-party SDKs (anthropic, openai) at runtime if the user enables those backends, but there is no automatic installer or external URL fetch in the provided metadata.
Credentials
Registry metadata declares no required env vars, but SKILL.md and interfaces/llm_judge.py document/use ANTHROPIC_API_KEY and OPENAI_API_KEY (and support ANTHROPIC_BASE_URL). The skill will attempt to call networked LLM backends when --llm-judge is used, which requires those keys. This is a mismatch between declared requirements and actual code. Additionally, the code inserts a sibling 'benchmark-store' path and attempts to import from it — that implies access to other local skill code/config, which increases the scope of data accessible at runtime.
Persistence & Privilege
always:false and autonomous invocation are default/normal. The skill writes state and review receipts (HumanReviewReceipt.save writes files under state paths) and the critic engine may load external benchmark or skill modules from disk. It does not explicitly modify other skills' configurations, but loading/executing other Python modules grants it runtime privilege equivalent to executing arbitrary code if untrusted modules or paths are supplied.
How to Use
  1. Make sure OpenClaw is installed (local or Docker)
  2. Run the install command in chat: /install auto-improvement-discriminator
  3. After installation, invoke the skill by name or use /auto-improvement-discriminator
  4. Provide required inputs per the skill's parameter spec and get structured output
Version History
v1.0.0
Initial release: closed-loop skill improvement pipeline
Metadata
Slug auto-improvement-discriminator
Version 1.0.0
License MIT-0
All-time Installs 0
Active Installs 0
Total Versions 1
Frequently Asked Questions

What is Improvement Discriminator?

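Use it when you need multi-reviewer blind scoring of improvement candidates, LLM-based semantic evaluation, a decision on whether a candidate should be accepted, or an explanation of why every scoring result came back as hold. It supports --panel multi-reviewer blind review and --llm-judge semantic evaluation. It is not for structural evaluation (use improvement-learner) or gate decisions (use improvement-gate). It is an AI Agent Skill for Claude Code / OpenClaw, with 79 downloads so far.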

How do I install Improvement Discriminator?

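Run "/install auto-improvement-discriminator" in the OpenClaw or Claude Code chat to install it in one step. The default heuristic mode and the mock LLM backend need no extra setup; the claude and openai backends additionally need ANTHROPIC_API_KEY or OPENAI_API_KEY.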

Is Improvement Discriminator free?

Yes, Improvement Discriminator is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does Improvement Discriminator support?

Improvement Discriminator is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created Improvement Discriminator?

It is built and maintained by _silhouette (@lanyasheng); the current version is v1.0.0.
