
Improvement Discriminator

Name: Improvement Discriminator
Author: lanyasheng

by _silhouette · GitHub ↗ · v1.0.0 · MIT-0

cross-platform ⚠ suspicious

Install in OpenClaw

/install auto-improvement-discriminator

Description

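Use when you need multi-reviewer blind scoring of improvement candidates, LLM-based semantic evaluation, a decision on whether a candidate should be accepted, or an explanation of why every scored result is hold. Supports --panel multi-reviewer blind review and --llm-judge semantic evaluation. Not for structural evaluation (use improvement-learner) or gate decisions (use improvement-gate).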

README (SKILL.md)

Improvement Discriminator

Multi-signal scoring engine. Blends heuristic rules, evaluator rubrics, LLM-as-Judge, and a multi-reviewer blind panel to score, rank, and recommend actions on improvement candidates.

When to Use / NOT to Use

Score and rank candidates, run panel blind review, run LLM-as-Judge semantic evaluation, diagnose hold results
NOT for structural evaluation (improvement-learner), gate decisions (improvement-gate), or file changes (improvement-executor)

CLI

python3 scripts/score.py --input CANDS.json [--output SCORED.json] [--state-root DIR]
  [--panel] [--llm-judge {claude,openai,mock}] [--use-evaluator-evidence]

Param	Description
`--input`	Required. Candidate artifact JSON from generator
`--output`	Output path. Default: `{state-root}/rankings/{run_id}.json`
`--state-root`	State directory. Default: `state/`
`--panel`	Enable 4-reviewer blind panel (structural, conservative, user_advocate, security_auditor)
`--llm-judge`	Enable LLM-as-Judge. Backends: `claude` (Anthropic API), `openai`, `mock` (deterministic, no key)
`--use-evaluator-evidence`	Blend skill-evaluator rubric/category/boundary evidence

Scoring Modes and Blending Weights

Mode	Blending
Heuristic only (default)	100% heuristic (base 4.0 + category bonus + source refs - risk penalty)
`--use-evaluator-evidence`	70% heuristic + 30% evaluator
`--llm-judge`	60% heuristic + 40% LLM
Both flags	50% heuristic + 30% LLM + 20% evaluator
`--panel`	4 reviewers score independently; cognitive label decides final recommendation

Category bonuses: docs=4.0, reference=3.5, guardrail=3.5, workflow=1.5, tests=1.5, prompt=1.0. Risk penalties: low=0.0, medium=2.0, high=4.5. Protected path adds +2.5.
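
The weights above can be sketched in a few lines of Python. This is a minimal illustration, not the code in scripts/score.py: the function names `heuristic_score` and `blend` are hypothetical, the source-reference bonus is left as a caller-supplied number because its value is not documented here, the protected-path 2.5 is assumed to be added to the risk penalty, and all blended inputs are assumed to already share one scale.

```python
# Constants copied from the tables above; everything else is an assumption.
BASE_SCORE = 4.0
CATEGORY_BONUS = {"docs": 4.0, "reference": 3.5, "guardrail": 3.5,
                  "workflow": 1.5, "tests": 1.5, "prompt": 1.0}
RISK_PENALTY = {"low": 0.0, "medium": 2.0, "high": 4.5}
PROTECTED_PATH_PENALTY = 2.5  # assumed to increase the risk penalty


def heuristic_score(category, risk, source_ref_bonus=0.0, protected=False):
    """base + category bonus + source refs - risk penalty (hypothetical helper)."""
    penalty = RISK_PENALTY[risk]
    if protected:
        penalty += PROTECTED_PATH_PENALTY
    # Categories missing from the table get no bonus in this sketch.
    bonus = CATEGORY_BONUS.get(category, 0.0)
    return BASE_SCORE + bonus + source_ref_bonus - penalty


def blend(heuristic, llm=None, evaluator=None):
    """Apply the documented weights; inputs are assumed to share one scale."""
    if llm is not None and evaluator is not None:
        return 0.5 * heuristic + 0.3 * llm + 0.2 * evaluator
    if llm is not None:
        return 0.6 * heuristic + 0.4 * llm
    if evaluator is not None:
        return 0.7 * heuristic + 0.3 * evaluator
    return heuristic  # heuristic only (default mode)
```

Under these assumptions a low-risk docs candidate with no source-reference bonus scores 4.0 + 4.0 = 8.0 before blending.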

Multi-Reviewer Panel

Reviewer	Focus	Risk Sensitivity
structural	docs (5.0), reference (4.0)	1.0x
conservative	guardrail (5.0), penalizes prompt (0.5)	1.5x
user_advocate	workflow (4.0), prompt (3.0)	0.8x
security_auditor	guardrail (5.0), tests (3.0)	2.0x

Cognitive labels: CONSENSUS (all agree) -> shared recommendation. VERIFIED (2+ agree) -> majority. DISPUTED (no majority) -> forced hold.
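
The labeling rule can be sketched as a small vote count. This is an illustration under stated assumptions rather than the skill's implementation: `cognitive_label` is a hypothetical name, and a tie for first place (for example 2 vs 2) is treated here as "no majority", which the text above does not spell out.

```python
from collections import Counter


def cognitive_label(recommendations):
    """Map reviewer recommendations to (label, final recommendation)."""
    if not recommendations:
        raise ValueError("at least one reviewer recommendation is required")
    ranked = Counter(recommendations).most_common()
    top, top_votes = ranked[0]
    if top_votes == len(recommendations):
        return "CONSENSUS", top  # all reviewers agree
    # Assumption: a tie for first place counts as no majority.
    tied = len(ranked) > 1 and ranked[1][1] == top_votes
    if top_votes >= 2 and not tied:
        return "VERIFIED", top  # 2+ agree, majority wins
    return "DISPUTED", "hold"  # no majority forces hold
```

For example, `cognitive_label(["accept_for_execution", "accept_for_execution", "hold", "reject"])` yields `("VERIFIED", "accept_for_execution")` in this sketch, while a 2 vs 2 split yields `("DISPUTED", "hold")`.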

LLM Judge

Evaluates 4 dimensions (0.0-1.0): clarity, specificity, consistency, safety. Thresholds: approve >= 0.75, reject < 0.40, else conditional.
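
The thresholds map an overall judge score to a decision as sketched below. The function name `judge_decision` is hypothetical, and the sketch takes the overall score as input because how the four dimensions are combined into that score is not stated here.

```python
APPROVE_THRESHOLD = 0.75  # approve >= 0.75
REJECT_THRESHOLD = 0.40   # reject < 0.40


def judge_decision(score):
    """Return 'approve', 'reject', or 'conditional' for a 0.0-1.0 score."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be within 0.0-1.0")
    if score >= APPROVE_THRESHOLD:
        return "approve"
    if score < REJECT_THRESHOLD:
        return "reject"
    return "conditional"
```

Note the boundary behavior: a score of exactly 0.75 approves, and a score of exactly 0.40 is conditional rather than rejected.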

Backend	Model	Key	Fallback
claude	claude-sonnet-4-20250514	`ANTHROPIC_API_KEY` (supports `ANTHROPIC_BASE_URL`)	mock
openai	gpt-4o-mini	`OPENAI_API_KEY`	mock
mock	none	none	deterministic, confidence=0.5

Blockers

protected_target, executor_not_supported, not_auto_keep_category, risk_medium/risk_high, skill_level_insufficient_for_structural_change, evaluator_reject, llm_judge_reject

Output JSON Example

{
  "run_id": "abc-123", "stage": "ranked", "critic_mode": "multi-reviewer-panel",
  "scored_candidates": [{
    "id": "cand-001", "score": 7.25, "recommendation": "accept_for_execution",
    "blockers": [], "judge_notes": ["低风险候选，可交给 executor。"],
    "panel": {
      "panel_reviews": [{"reviewer": "structural", "score": 8.5}, {"reviewer": "conservative", "score": 6.0}],
      "cognitive_label": "CONSENSUS", "aggregated_score": 7.25
    },
    "llm_verdict": {"score": 0.82, "decision": "approve",
      "dimensions": {"clarity": 0.85, "specificity": 0.80, "consistency": 0.80, "safety": 0.90}}
  }],
  "summary": {"accept_for_execution": 1, "hold": 0, "reject": 0}
}
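
When every result comes back as hold, the output file itself usually says why. The sketch below reads the fields shown in the example above (`scored_candidates`, `recommendation`, `blockers`, `panel.cognitive_label`) and lists the held candidates. The helper name `held_candidates` and the `cand-002` record are hypothetical, added only for illustration.

```python
import json


def held_candidates(ranking):
    """List each held candidate with its blockers and panel label, if any."""
    report = []
    for cand in ranking.get("scored_candidates", []):
        if cand.get("recommendation") != "hold":
            continue
        report.append({
            "id": cand.get("id"),
            "blockers": cand.get("blockers", []),
            # A DISPUTED panel forces hold even when no blocker is listed.
            "cognitive_label": cand.get("panel", {}).get("cognitive_label"),
        })
    return report


# Hypothetical ranking file content, shaped like the example above.
sample = json.loads("""
{"scored_candidates": [
  {"id": "cand-001", "recommendation": "accept_for_execution", "blockers": []},
  {"id": "cand-002", "recommendation": "hold", "blockers": ["risk_medium"],
   "panel": {"cognitive_label": "DISPUTED"}}
]}
""")

for row in held_candidates(sample):
    print(row["id"], row["blockers"], row["cognitive_label"])
```

For a real run, replace `sample` with the parsed contents of the file passed to `--output`.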

<example> Panel + LLM judge: $ python3 scripts/score.py --input candidates.json --panel --llm-judge mock --output scored.json </example>

<anti-example> --panel and --llm-judge are NOT mutually exclusive. Each reviewer independently calls the LLM judge. </anti-example>

Related Skills

improvement-generator -- produces candidates | improvement-gate -- keep/revert/reject
improvement-learner -- structural 6-dim eval | benchmark-store -- frozen baselines

Usage Guidance

Key points before installing/using:
- The SKILL.md mentions LLM backends that require ANTHROPIC_API_KEY or OPENAI_API_KEY, but the registry metadata lists no required env vars; treat API keys as optional only if you use the mock backend. If you enable --llm-judge with a real backend, provide credentials only if you trust the skill.
- The evaluator can load and call arbitrary Python modules (RealSkillEvaluator / importlib). Do not point it at untrusted directories or candidate artifacts that contain executable code unless you sandbox the run (e.g., isolated container, restricted runtime).
- The skill reads and writes local state (default state/). Review scripts/score.py and the RealSkillEvaluator implementation to confirm exactly what files are read/written and whether any network calls beyond LLM providers occur (some files were truncated in the listing).
- If you want to avoid external API calls, run with --llm-judge mock and/or inspect/disable the LLMJudge backend code.
- Recommended actions: inspect scripts/score.py and the RealSkillEvaluator code paths in full, run the skill in a sandboxed environment first, and only supply API keys if needed and you understand what vectors (file loading, outbound network) will be used.
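
One way to follow the mock-backend advice is to launch the scorer with the provider variables removed from its environment. The sketch below only builds the command and environment; the helper name `mock_invocation` is hypothetical. It uses only flags documented above, and it is not a sandbox: it keeps credentials away from the child process but does nothing about file access or dynamic module loading.

```python
import os
import sys

# Variables named in the LLM Judge table above.
PROVIDER_VARS = ("ANTHROPIC_API_KEY", "OPENAI_API_KEY", "ANTHROPIC_BASE_URL")


def mock_invocation(input_path, output_path, environ=None):
    """Return (cmd, env) for a mock-judge run without provider credentials."""
    env = dict(os.environ if environ is None else environ)
    for name in PROVIDER_VARS:
        env.pop(name, None)  # drop the variable if it is set
    cmd = [sys.executable, "scripts/score.py",
           "--input", input_path,
           "--llm-judge", "mock",
           "--output", output_path]
    return cmd, env
```

The pair can then be passed to `subprocess.run(cmd, env=env, check=True)` from the skill's root directory, ideally inside the isolated container the guidance recommends.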

Capability Analysis

Type: OpenClaw Skill
Name: auto-improvement-discriminator
Version: 1.0.0

The skill bundle provides a framework for evaluating agent skill improvements using heuristics, LLM-based judging, and automated testing. It contains high-risk capabilities, most notably the RealSkillEvaluator in interfaces/critic_engine.py, which dynamically loads and executes arbitrary Python code using importlib.util.exec_module. While intended for evaluating skill logic, this facilitates Remote Code Execution (RCE) if the input paths are untrusted. Furthermore, the LLMJudge in interfaces/llm_judge.py is susceptible to indirect prompt injection from candidate content, and the JUnitXMLRegressionAdapter in interfaces/external_regression.py uses the insecure xml.etree.ElementTree parser, posing an XML External Entity (XXE) risk. These components represent significant vulnerabilities rather than intentional malware.

Capability Assessment

ℹ Purpose & Capability

The skill intends to score and rank improvement candidates and to optionally use an LLM judge — that aligns with the included code (critic engine, human review, llm_judge). However the registry metadata lists no required environment variables while SKILL.md and llm_judge.py explicitly document/use ANTHROPIC_API_KEY / OPENAI_API_KEY (and optional base URLs). This mismatch is noteworthy: the skill can operate in mock mode without keys, but using real LLM backends requires credentials that were not declared in the registry metadata.

ℹ Instruction Scope

SKILL.md instructs running scripts/score.py with --input and optional flags (panel, --llm-judge, --use-evaluator-evidence). That matches the repo's scripts. The instructions do not overtly request arbitrary system secrets, but the implementation will read/write state (default state/ path) and can save human review receipts to disk. More importantly, the Critic/RealSkillEvaluator supports loading Python skill modules from file paths and invoking evaluate()/execute() functions — this can execute arbitrary code supplied as a candidate or present on disk, which expands runtime scope beyond simple scoring.

✓ Install Mechanism

No install spec is provided (instruction-only install), and no external downloads are performed by the package itself. The code may import third-party SDKs (anthropic, openai) at runtime if the user enables those backends, but there is no automatic installer or external URL fetch in the provided metadata.

⚠ Credentials

Registry metadata declares no required env vars, but SKILL.md and interfaces/llm_judge.py document/use ANTHROPIC_API_KEY and OPENAI_API_KEY (and support ANTHROPIC_BASE_URL). The skill will attempt to call networked LLM backends when --llm-judge is used, which requires those keys. This is a mismatch between declared requirements and actual code. Additionally, the code inserts a sibling 'benchmark-store' path and attempts to import from it — that implies access to other local skill code/config, which increases the scope of data accessible at runtime.

ℹ Persistence & Privilege

always:false and autonomous invocation are default/normal. The skill writes state and review receipts (HumanReviewReceipt.save writes files under state paths) and the critic engine may load external benchmark or skill modules from disk. It does not explicitly modify other skills' configurations, but loading/executing other Python modules grants it runtime privilege equivalent to executing arbitrary code if untrusted modules or paths are supplied.

How to Use

Make sure OpenClaw is installed (local or Docker)
Run the install command in chat: /install auto-improvement-discriminator
After installation, invoke the skill by name or use /auto-improvement-discriminator
Provide required inputs per the skill's parameter spec and get structured output

Version History

v1.0.0

Initial release: closed-loop skill improvement pipeline

Metadata

Slug auto-improvement-discriminator

Version 1.0.0

License MIT-0

All-time Installs 0

Active Installs 0

Total Versions 1

Frequently Asked Questions

What is Improvement Discriminator?

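Use it when you need multi-reviewer blind scoring of improvement candidates, LLM-based semantic evaluation, a decision on whether a candidate should be accepted, or an explanation of why every scored result is hold. It supports --panel multi-reviewer blind review and --llm-judge semantic evaluation. It is not for structural evaluation (use improvement-learner) or gate decisions (use improvement-gate). It is an AI Agent Skill for Claude Code / OpenClaw, with 79 downloads so far.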

How do I install Improvement Discriminator?

Run "/install auto-improvement-discriminator" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is Improvement Discriminator free?

Yes, Improvement Discriminator is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does Improvement Discriminator support?

Improvement Discriminator is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created Improvement Discriminator?

It is built and maintained by _silhouette (@lanyasheng); the current version is v1.0.0.
