ngmeyer

LLM as Judge

Author: Neal Meyer · GitHub ↗ · v1.2.0 · MIT-0
cross-platform ✓ Security check passed
Total downloads: 185
Favorites: 0
Current installs: 1
Versions: 3
Install in OpenClaw
/install llm-as-judge
Description
Cross-model verification for complex tasks. Spawn a judge subagent with a different model to review plans, code, architecture, or decisions before execution....
Usage (SKILL.md)

LLM-as-Judge

Core principle: Same model = same blind spots. Different model = fresh perspective. Cross-model review catches ~85% of issues vs ~60% for self-reflection.

Activation Criteria

Use this pattern for:

  • Architecture or system design decisions
  • Multi-file changes affecting >5 files or >500 LOC
  • Security-critical code (auth, payments, crypto/DeFi)
  • Financial/trading systems (market making, quant strategies)
  • Planning documents that will drive weeks of work
  • Problems still unsolved after 3+ failed attempts

Skip it for:

  • Simple edits, config tweaks, bug fixes with obvious cause
  • Documentation updates
  • Single-file changes under 100 LOC
  • Tasks where self-review is sufficient
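
The criteria above can be sketched as a small predicate. This is only an illustration: the `Task` fields and the `needs_judge` name are hypothetical and not part of the skill, and the source does not say what to do with a single-file change between 100 and 500 LOC, so this sketch falls back to self-review in that case.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical task descriptor; field names are illustrative only."""
    files_changed: int = 0
    lines_changed: int = 0
    is_architecture: bool = False
    is_security_critical: bool = False
    is_financial: bool = False
    is_long_horizon_plan: bool = False
    failed_attempts: int = 0
    docs_only: bool = False


def needs_judge(task: Task) -> bool:
    """Return True when the activation criteria call for cross-model review."""
    # High-stakes categories always go to the judge.
    if task.is_security_critical or task.is_financial:
        return True
    if task.is_architecture or task.is_long_horizon_plan:
        return True
    # Stuck after 3+ failed attempts on the same problem.
    if task.failed_attempts >= 3:
        return True
    # Documentation updates are on the skip list.
    if task.docs_only:
        return False
    # Size thresholds from the criteria: >5 files or >500 LOC.
    # Assumption: anything smaller falls back to self-review.
    return task.files_changed > 5 or task.lines_changed > 500
```

For example, `needs_judge(Task(files_changed=1, lines_changed=80))` is false, while a task with `failed_attempts=3` qualifies regardless of size.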

The Pattern

Executor (Model A) → Output → Judge (Model B) → Verdict → Action

Verdicts: APPROVE | REVISE (with specific feedback) | REJECT (restart)
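
One way to read the flow above is as a bounded loop. The sketch below assumes the executor and judge are plain callables supplied by the caller. The `run_with_judge` name, the callable signatures, and the `max_rounds` cap are assumptions made for illustration; the skill itself does not define a retry limit.

```python
from enum import Enum
from typing import Callable, Optional, Tuple


class Verdict(Enum):
    APPROVE = "APPROVE"
    REVISE = "REVISE"
    REJECT = "REJECT"


# Hypothetical signatures:
#   executor(task, feedback) -> output
#   judge(task, output)      -> (verdict, feedback)
Executor = Callable[[str, Optional[str]], str]
Judge = Callable[[str, str], Tuple[Verdict, str]]


def run_with_judge(task: str, executor: Executor, judge: Judge,
                   max_rounds: int = 3) -> Optional[str]:
    """Run Executor -> Judge rounds; return the approved output or None."""
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        output = executor(task, feedback)
        verdict, notes = judge(task, output)
        if verdict is Verdict.APPROVE:
            return output
        # REVISE carries the judge's feedback into the next attempt;
        # REJECT restarts from scratch, so the feedback is dropped.
        feedback = notes if verdict is Verdict.REVISE else None
    return None  # no approval within max_rounds (the cap is an assumption)
```

Returning `None` after the last round leaves the escalation decision (human review, a different judge, and so on) to the caller.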

Model Pairing

Use a different provider than the executor to avoid shared blind spots:

  • Executor: Claude → Judge: Kimi, Grok, or Gemini Pro
  • Executor: Kimi/Gemini → Judge: Claude Opus
  • Principle: Different provider, similar capability tier
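
The different-provider rule can be enforced mechanically. The following is a minimal sketch, assuming a hand-maintained lookup table: the `PROVIDER_OF` map and the `pick_judge` helper are hypothetical, and the capability-tier half of the principle is not modeled here.

```python
from typing import Sequence

# Illustrative lookup from model label to provider. The labels follow the
# pairing list above; the table itself is an assumption, not part of the skill.
PROVIDER_OF = {
    "claude": "anthropic",
    "opus": "anthropic",
    "kimi": "moonshot",
    "grok": "xai",
    "gemini": "google",
    "gemini-pro": "google",
}


def pick_judge(executor_model: str, candidates: Sequence[str]) -> str:
    """Return the first candidate whose provider differs from the executor's."""
    executor_provider = PROVIDER_OF[executor_model]
    for model in candidates:
        if PROVIDER_OF[model] != executor_provider:
            return model
    raise ValueError(
        f"no candidate from a different provider than {executor_model!r}"
    )
```

With this table, `pick_judge("claude", ["opus", "grok"])` skips `opus` (same provider) and returns `"grok"`.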

Judge Prompt Templates

Plan/Architecture Review

See references/judge-prompts.md for full templates covering:

  • Plan completeness, feasibility, risk, testing strategy
  • Architecture review with scoring (0-10 per dimension)
  • Code review checklist (correctness, design, safety, maintainability)

Integration Points

  • With adversarial review: This IS the formalized version of "spawn a separate model to review"
  • With planning-protocol: Judge reviews the plan before the Execute phase
  • With coding workflows: Code → cross-model review → fix findings → test → build → push

Quick Decision

Simple task?           → Self-review
Complex / high stakes? → LLM-as-Judge
Stuck after retries?   → LLM-as-Judge (fresh perspective)
Financial/security?    → LLM-as-Judge (mandatory)

Gotchas

  • Same provider defeats the purpose — Claude Opus judging Claude Sonnet shares the same training distribution. Use a different provider (Grok judging Claude, Gemini judging GPT, etc.).
  • Vague judge output is useless — If the judge says "looks good" without specifics, the prompt is too weak. Always require the judge to produce scored dimensions + specific actionable items, even if approving.
  • Judge scope creep — Judges sometimes rewrite the entire plan instead of reviewing it. Constrain the verdict to APPROVE / REVISE / REJECT with specific feedback, not a replacement solution.
  • Approval rate drift — If the judge approves >80% of submissions, the model pairing is too similar or the prompts are too lenient. Target 60-70% approval rate.
  • Don't judge trivial tasks — A 50-line CSS fix doesn't need cross-model review. Use the activation criteria in this skill strictly.
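
Two of the gotchas above, vague judge output and approval rate drift, lend themselves to simple automated checks. The sketch below assumes the judge is prompted to emit a line of the form `VERDICT: <value>`. That line format and the helper names are assumptions for illustration; the actual templates live in references/judge-prompts.md.

```python
import re
from typing import Sequence

# Assumed reply convention: a standalone line such as "VERDICT: REVISE".
VERDICT_RE = re.compile(
    r"^\s*VERDICT:\s*(APPROVE|REVISE|REJECT)\s*$", re.MULTILINE
)


def parse_verdict(judge_reply: str) -> str:
    """Extract the verdict; reject replies that lack an explicit one."""
    match = VERDICT_RE.search(judge_reply)
    if match is None:
        # A bare "looks good" ends up here and should be re-prompted.
        raise ValueError("judge reply has no VERDICT line")
    return match.group(1)


def approval_rate(verdicts: Sequence[str]) -> float:
    """Fraction of verdicts that are APPROVE (0.0 for an empty history)."""
    if not verdicts:
        return 0.0
    return sum(v == "APPROVE" for v in verdicts) / len(verdicts)


def is_drifting(verdicts: Sequence[str], threshold: float = 0.80) -> bool:
    """Flag a history whose approval rate exceeds the 80% warning level."""
    return approval_rate(verdicts) > threshold
```

A history that trips `is_drifting` suggests revisiting the model pairing or tightening the judge prompt, as the gotcha describes.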
Security Recommendations
This is a coherent, low-risk prompt/workflow template for cross-model review. Before installing or using it, confirm your agent/platform can actually call alternative models/providers as the skill expects; if that involves third‑party APIs, avoid sending sensitive secrets or personal data to judge models, and check provider data handling and costs. If you plan to run security‑critical or proprietary code through an external judge model, obtain explicit consent and consider local/private review alternatives.
Functional Analysis
Type: OpenClaw Skill
Name: llm-as-judge
Version: 1.2.0

The 'llm-as-judge' skill implements a standard cross-model verification pattern designed to improve output quality and catch errors in complex tasks like architecture design and security reviews. The files (SKILL.md and references/judge-prompts.md) contain legitimate workflow instructions and prompt templates for peer review without any evidence of data exfiltration, malicious execution, or prompt injection attacks.
Capability Assessment
Purpose & Capability
Name and description match the content: the skill defines a prompt-and-workflow for spawning a judge subagent using a different model for review. It does not request binaries, credentials, or system access that would be out of scope for a cross-model review pattern. It does reference specific providers/models (Claude, Kimi, Grok, Gemini, Opus), which is an expectation about available model providers rather than a secret or extra entitlement.
Instruction Scope
SKILL.md and templates are scoped to reviewing plans, code, and high‑stakes systems and constrain judge output to APPROVE/REVISE/REJECT with scored feedback. There is no instruction to read unrelated files, access environment secrets, or call external endpoints. Note: in practice using third‑party judge models may involve sending potentially sensitive project data to another provider — the skill does not explicitly warn about avoiding secrets or PHI when sending content to an external judge model.
Install Mechanism
Instruction-only skill with no install spec and no code files. This is low risk and expected for a prompt/workflow template.
Credentials
The skill declares no required environment variables or credentials, which is coherent for a pattern. However it presumes the agent/platform can invoke alternative models/providers; in real use you may need provider credentials or API keys (not declared here). Consider whether your agent will route judge calls to third‑party providers and whether those providers will receive sensitive data.
Persistence & Privilege
always is set to false, no persistent presence is requested, and the skill does not attempt to modify other skills or system settings. Autonomous invocation is allowed (platform default), but this is not combined with elevated privileges or secret access.
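How to Use
  1. Make sure OpenClaw is installed (locally or via Docker)
  2. Enter the install command in the chat box: /install llm-as-judge
  3. Once installed, invoke the Skill by name or trigger it with /llm-as-judge
  4. Provide the inputs described in the Skill's parameter notes to get structured output
Version History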
v1.2.0
Remove project-specific references (QuantFlow, internal agent names). Fully generic and framework-agnostic. Activation criteria, model pairing, and gotchas unchanged.
v1.1.0
Add Gotchas section: same-provider blind spots, vague output, scope creep, approval rate drift, trivial task exclusions.
v1.0.0
Initial release of the LLM-as-Judge skill implementing cross-model verification.
  • Introduces the LLM-as-Judge pattern: spawn a subagent using a different model to review plans, code, or decisions.
  • Provides decision guidance on when to use LLM-as-Judge vs. self-review.
  • Includes review prompt templates for plans and code.
  • Offers best practices for model pairing across providers for fresh perspectives.
  • Lists anti-patterns, integration examples, and metrics for measuring effectiveness.
Metadata
Slug: llm-as-judge
Version: 1.2.0
License: MIT-0
Total installs: 1
Current installs: 1
Versions: 3
FAQ

What is LLM as Judge?

Cross-model verification for complex tasks. Spawn a judge subagent with a different model to review plans, code, architecture, or decisions before execution.... It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 185 total downloads to date.

How do I install LLM as Judge?

Run the command "/install llm-as-judge" in the OpenClaw or Claude Code chat box to install it in one step; no additional configuration is required.

Is LLM as Judge free?

Yes. LLM as Judge is completely free and released under the MIT-0 license, so you can download, install, and use it freely.

Which platforms does LLM as Judge support?

LLM as Judge is cross-platform and can be used in any environment where OpenClaw / Claude Code is deployed.

Who developed LLM as Judge?

It is developed and maintained by Neal Meyer (@ngmeyer); the current version is v1.2.0.
