LLM as Judge
/install llm-as-judge
Core principle: Same model = same blind spots. Different model = fresh perspective. Cross-model review catches ~85% of issues vs ~60% for self-reflection.
Activation Criteria
Use this pattern when:
- Architecture or system design decisions
- Multi-file changes affecting >5 files or >500 LOC
- Security-critical code (auth, payments, crypto/DeFi)
- Financial/trading systems (market making, quant strategies)
- Planning documents that will drive weeks of work
- Stuck after 3+ failed attempts on same problem
Skip when:
- Simple edits, config tweaks, bug fixes with obvious cause
- Documentation updates
- Single-file changes under 100 LOC
- Tasks where self-review is sufficient
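The criteria above can be read as a simple predicate. The following is a minimal sketch, assuming a hypothetical `Task` record whose field names are invented here for illustration; the thresholds (more than 5 files, more than 500 LOC, 3 or more failed attempts) are taken from the lists above.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical task description; field names are illustrative only."""
    is_architecture_decision: bool = False
    is_planning_document: bool = False   # a plan that will drive weeks of work
    is_security_critical: bool = False   # auth, payments, crypto/DeFi
    is_financial_system: bool = False    # market making, quant strategies
    files_changed: int = 1
    loc_changed: int = 0
    failed_attempts: int = 0


def needs_judge(task: Task) -> bool:
    """Return True when any activation criterion from the list holds."""
    return (
        task.is_architecture_decision
        or task.is_planning_document
        or task.is_security_critical
        or task.is_financial_system
        or task.files_changed > 5
        or task.loc_changed > 500
        or task.failed_attempts >= 3
    )
```

A single-file change under 100 LOC with no other flags set falls through every test and is left to self-review, which matches the skip list.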
The Pattern
Executor (Model A) → Output → Judge (Model B) → Verdict → Action
Verdicts: APPROVE | REVISE (with specific feedback) | REJECT (restart)
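One way to picture the executor, judge, verdict, action flow is a small review loop. This is a sketch under stated assumptions, not the skill's actual implementation: `executor` and `judge` are hypothetical callables standing in for calls to Model A and Model B, and `max_rounds` is an invented safety limit.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Verdict(Enum):
    APPROVE = "APPROVE"
    REVISE = "REVISE"
    REJECT = "REJECT"


@dataclass
class JudgeResult:
    verdict: Verdict
    feedback: str = ""


# Hypothetical signatures (assumptions, not a real API):
#   executor(task, feedback) -> output produced by Model A
#   judge(task, output)      -> JudgeResult produced by Model B
Executor = Callable[[str, Optional[str]], str]
Judge = Callable[[str, str], JudgeResult]


def run_with_judge(task: str, executor: Executor, judge: Judge,
                   max_rounds: int = 3) -> Optional[str]:
    """Return the first approved output, or None if no round is approved."""
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        output = executor(task, feedback)
        result = judge(task, output)
        if result.verdict is Verdict.APPROVE:
            return output
        if result.verdict is Verdict.REVISE:
            # Carry the judge's specific feedback into the next attempt.
            feedback = result.feedback
        else:
            # REJECT: restart from scratch, without the previous feedback.
            feedback = None
    return None
```

Returning `None` after `max_rounds` leaves the escalation decision (for example, handing the task to a human) to the caller.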
Model Pairing
Use a different provider than the executor to avoid shared blind spots:
- Executor: Claude → Judge: kimi or grok or gemini-pro
- Executor: Kimi/Gemini → Judge: opus
- Principle: Different provider, similar capability tier
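The pairings above could be encoded as a lookup table. The sketch below assumes lowercase provider keys and a hypothetical `available` collection of judge models the caller can reach; only the pairings themselves come from the list above.

```python
from typing import Collection

# Pairings from the list above; the lowercase keys are an assumption.
JUDGES_BY_EXECUTOR = {
    "claude": ["kimi", "grok", "gemini-pro"],
    "kimi": ["opus"],
    "gemini": ["opus"],
}


def pick_judge(executor: str, available: Collection[str]) -> str:
    """Pick the first listed judge for this executor that is available."""
    candidates = JUDGES_BY_EXECUTOR.get(executor.lower())
    if candidates is None:
        raise ValueError(f"no judge pairing defined for executor {executor!r}")
    for model in candidates:
        if model in available:
            return model
    raise ValueError(f"no cross-provider judge available for {executor!r}")
```

Raising instead of falling back to a same-provider judge keeps the "different provider" principle from being silently violated.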
Judge Prompt Templates
Plan/Architecture Review
See references/judge-prompts.md for full templates covering:
- Plan completeness, feasibility, risk, testing strategy
- Architecture review with scoring (0-10 per dimension)
- Code review checklist (correctness, design, safety, maintainability)
Integration Points
- With adversarial review: This IS the formalized version of "spawn a separate model to review"
- With planning-protocol: Judge reviews the plan before the Execute phase
- With coding workflows: Code → cross-model review → fix findings → test → build → push
Quick Decision
Simple task? → Self-review
Complex / high stakes? → LLM-as-Judge
Stuck after retries? → LLM-as-Judge (fresh perspective)
Financial/security? → LLM-as-Judge (mandatory)
Gotchas
- Same provider defeats the purpose — Claude Opus judging Claude Sonnet shares the same training distribution. Use a different provider (Grok judging Claude, Gemini judging GPT, etc.).
- Vague judge output is useless — If the judge says "looks good" without specifics, the prompt is too weak. Always require the judge to produce scored dimensions + specific actionable items, even if approving.
- Judge scope creep — Judges sometimes rewrite the entire plan instead of reviewing it. Constrain the verdict to APPROVE / REVISE / REJECT with specific feedback, not a replacement solution.
- Approval rate drift — If the judge approves >80% of submissions, the model pairing is too similar or the prompts are too lenient. Target 60-70% approval rate.
- Don't judge trivial tasks — A 50-line CSS fix doesn't need cross-model review. Use the activation criteria in this skill strictly.
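Two of the gotchas above lend themselves to mechanical checks: rejecting vague judge output and watching for approval rate drift. The following is a rough sketch, assuming a hypothetical `Review` record (the field names are invented here); the 0-10 score range and the 80% ceiling come from this document.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence

VALID_VERDICTS = {"APPROVE", "REVISE", "REJECT"}


@dataclass
class Review:
    """Hypothetical parsed judge output; field names are illustrative."""
    verdict: str
    scores: Dict[str, int]    # dimension name -> score on a 0-10 scale
    action_items: List[str]   # specific, actionable findings


def review_problems(review: Review) -> List[str]:
    """List the reasons a review is too vague to act on (empty if usable)."""
    problems = []
    if review.verdict not in VALID_VERDICTS:
        problems.append("unknown verdict")
    if not review.scores:
        problems.append("no scored dimensions")
    elif any(not 0 <= s <= 10 for s in review.scores.values()):
        problems.append("score outside 0-10")
    if not review.action_items:
        # Required even when the verdict is APPROVE.
        problems.append("no specific action items")
    return problems


def approval_rate(verdicts: Sequence[str]) -> float:
    """Fraction of verdicts that are APPROVE."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    return sum(v == "APPROVE" for v in verdicts) / len(verdicts)


def approval_drift(verdicts: Sequence[str], ceiling: float = 0.80) -> bool:
    """True when the judge approves more often than the ceiling allows."""
    return approval_rate(verdicts) > ceiling
```

A review that fails `review_problems` can be sent back to the judge with a stricter prompt, and a `True` from `approval_drift` suggests revisiting the model pairing or the prompt leniency.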
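Installation
- Make sure OpenClaw is installed (locally or via Docker).
- Enter the install command in the chat box: /install llm-as-judge
- Once installed, invoke the Skill by name or trigger it with /llm-as-judge
- Supply the inputs described in the Skill's parameter notes to get structured output.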
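What is LLM as Judge?
Cross-model verification for complex tasks. Spawn a judge subagent with a different model to review plans, code, architecture, or decisions before execution.... It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 185 downloads to date.
How do I install LLM as Judge?
Run the command "/install llm-as-judge" in the OpenClaw or Claude Code chat box to install it in one step; no further configuration is needed.
Is LLM as Judge free?
Yes. LLM as Judge is completely free and released under the MIT-0 license, so it can be downloaded, installed, and used freely.
Which platforms does LLM as Judge support?
LLM as Judge is cross-platform: it runs in any environment where OpenClaw / Claude Code is deployed.
Who develops LLM as Judge?
It is developed and maintained by Neal Meyer (@ngmeyer); the current version is v1.2.0.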