ngmeyer

LLM as Judge

by Neal Meyer · GitHub ↗ · v1.2.0 · MIT-0
cross-platform · ✓ Security Clean
185 Downloads · 0 Stars · 1 Active Installs · 3 Versions
Install in OpenClaw
/install llm-as-judge
Description
Cross-model verification for complex tasks. Spawn a judge subagent with a different model to review plans, code, architecture, or decisions before execution....
README (SKILL.md)

LLM-as-Judge

Core principle: Same model = same blind spots. Different model = fresh perspective. Cross-model review catches ~85% of issues vs ~60% for self-reflection.

Activation Criteria

Use this pattern for:

  • Architecture or system design decisions
  • Multi-file changes affecting >5 files or >500 LOC
  • Security-critical code (auth, payments, crypto/DeFi)
  • Financial/trading systems (market making, quant strategies)
  • Planning documents that will drive weeks of work
  • Stuck after 3+ failed attempts on same problem

Skip it for:

  • Simple edits, config tweaks, bug fixes with obvious cause
  • Documentation updates
  • Single-file changes under 100 LOC
  • Tasks where self-review is sufficient
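
The activation criteria above can be read as a simple gate. The following is a minimal sketch, not part of the skill itself: the `Task` fields and the domain labels are hypothetical names chosen for illustration, and the thresholds are the ones listed above.

```python
from dataclasses import dataclass

# Domain labels are assumptions for this sketch; the skill names the
# categories in prose only (architecture, security, financial, planning).
HIGH_STAKES_DOMAINS = {"architecture", "security", "financial", "planning"}


@dataclass
class Task:
    """Hypothetical description of a unit of work."""
    domain: str = "general"     # e.g. "security", "docs", "general"
    files_changed: int = 1
    lines_changed: int = 0
    failed_attempts: int = 0


def should_use_judge(task: Task) -> bool:
    """Return True when the listed activation criteria call for a judge."""
    if task.domain in HIGH_STAKES_DOMAINS:
        return True
    # Multi-file changes affecting >5 files or >500 LOC
    if task.files_changed > 5 or task.lines_changed > 500:
        return True
    # Stuck after 3+ failed attempts on the same problem
    if task.failed_attempts >= 3:
        return True
    return False
```

For example, `should_use_judge(Task(domain="docs", lines_changed=80))` falls through every check and returns `False`, which matches the skip list.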

The Pattern

Executor (Model A) → Output → Judge (Model B) → Verdict → Action

Verdicts: APPROVE | REVISE (with specific feedback) | REJECT (restart)
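
One way to wire the Verdict → Action step is sketched below. It assumes the judge has been told to end its reply with a line such as `Verdict: REVISE`; that line format and the action names (`execute`, `revise`, `restart`) are illustrative choices, not something the skill prescribes.

```python
import re
from enum import Enum


class Verdict(Enum):
    APPROVE = "APPROVE"
    REVISE = "REVISE"
    REJECT = "REJECT"


# Assumed reply format: a line beginning with "Verdict:" followed by a keyword.
_VERDICT_LINE = re.compile(r"^\s*Verdict:\s*(APPROVE|REVISE|REJECT)\b", re.MULTILINE)


def parse_verdict(judge_reply: str) -> Verdict:
    """Extract the verdict from the judge's reply.

    Raises ValueError when no verdict line is present, so a vague
    reply such as "looks good" fails loudly instead of passing.
    """
    match = _VERDICT_LINE.search(judge_reply)
    if match is None:
        raise ValueError("judge reply has no 'Verdict: ...' line")
    return Verdict(match.group(1))


def next_action(verdict: Verdict) -> str:
    """Map a verdict to the executor's next step (names are illustrative)."""
    return {
        Verdict.APPROVE: "execute",   # proceed with the reviewed output
        Verdict.REVISE: "revise",     # apply the judge's specific feedback
        Verdict.REJECT: "restart",    # discard and start over
    }[verdict]
```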

Model Pairing

Use a different provider than the executor to avoid shared blind spots:

  • Executor: Claude → Judge: kimi, grok, or gemini-pro
  • Executor: Kimi/Gemini → Judge: opus
  • Principle: Different provider, similar capability tier
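
The different-provider rule can be checked mechanically. The sketch below is an assumption-laden illustration: the `PROVIDER_OF` table and the `pick_judge` helper are hypothetical, the provider labels are informal, and the "similar capability tier" half of the principle is not modeled.

```python
# Illustrative alias-to-provider table; not an official registry.
PROVIDER_OF = {
    "claude": "anthropic",
    "opus": "anthropic",
    "kimi": "moonshot",
    "grok": "xai",
    "gemini-pro": "google",
}


def pick_judge(executor: str, candidates: list[str]) -> str:
    """Return the first candidate whose provider differs from the executor's."""
    executor_provider = PROVIDER_OF[executor]
    for model in candidates:
        if PROVIDER_OF[model] != executor_provider:
            return model
    raise LookupError(f"no candidate from a different provider than {executor!r}")
```

With a Claude executor and candidates `["opus", "kimi", "grok"]`, the helper skips `opus` (same provider) and returns `kimi`.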

Judge Prompt Templates

Plan/Architecture Review

See references/judge-prompts.md for full templates covering:

  • Plan completeness, feasibility, risk, testing strategy
  • Architecture review with scoring (0-10 per dimension)
  • Code review checklist (correctness, design, safety, maintainability)
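
The full templates live in references/judge-prompts.md and are not reproduced here. As a rough sketch of the shape such a prompt might take, the hypothetical builder below asks for a 0-10 score on the four plan dimensions named above, specific action items, and a single verdict line. The wording is an assumption, not the skill's own template.

```python
PLAN_DIMENSIONS = ("completeness", "feasibility", "risk", "testing strategy")


def build_plan_review_prompt(plan: str, dimensions=PLAN_DIMENSIONS) -> str:
    """Assemble a judge prompt that demands scores, action items, and a verdict."""
    score_lines = "\n".join(f"- {name}: <0-10>" for name in dimensions)
    return (
        "You are reviewing a plan written by another model.\n"
        "Review it; do not rewrite it or propose a replacement.\n\n"
        f"Score each dimension from 0 to 10:\n{score_lines}\n\n"
        "List specific, actionable items, even if you approve.\n"
        "End with one line: Verdict: APPROVE, REVISE, or REJECT.\n\n"
        f"PLAN:\n{plan}\n"
    )
```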

Integration Points

  • With adversarial review: This IS the formalized version of "spawn a separate model to review"
  • With planning-protocol: Judge reviews the plan before the Execute phase
  • With coding workflows: Code → cross-model review → fix findings → test → build → push

Quick Decision

Simple task?           → Self-review
Complex / high stakes? → LLM-as-Judge
Stuck after retries?   → LLM-as-Judge (fresh perspective)
Financial/security?    → LLM-as-Judge (mandatory)

Gotchas

  • Same provider defeats the purpose — Claude Opus judging Claude Sonnet shares the same training distribution. Use a different provider (Grok judging Claude, Gemini judging GPT, etc.).
  • Vague judge output is useless — If the judge says "looks good" without specifics, the prompt is too weak. Always require the judge to produce scored dimensions + specific actionable items, even if approving.
  • Judge scope creep — Judges sometimes rewrite the entire plan instead of reviewing it. Constrain the verdict to APPROVE / REVISE / REJECT with specific feedback, not a replacement solution.
  • Approval rate drift — If the judge approves >80% of submissions, the model pairing is too similar or the prompts are too lenient. Target 60-70% approval rate.
  • Don't judge trivial tasks — A 50-line CSS fix doesn't need cross-model review. Use the activation criteria in this skill strictly.
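
Approval rate drift is easy to monitor if verdicts are logged. The following sketch assumes verdicts are stored as plain strings; the function names are hypothetical, and the 80% ceiling is the one stated in the gotcha above.

```python
from typing import Optional


def approval_rate(verdicts: list[str]) -> float:
    """Fraction of logged verdicts that are APPROVE."""
    if not verdicts:
        raise ValueError("no verdicts logged yet")
    return sum(v == "APPROVE" for v in verdicts) / len(verdicts)


def drift_warning(verdicts: list[str], ceiling: float = 0.80) -> Optional[str]:
    """Return a warning when the judge approves more than `ceiling` of submissions."""
    rate = approval_rate(verdicts)
    if rate > ceiling:
        return (
            f"approval rate {rate:.0%} exceeds {ceiling:.0%}: "
            "model pairing may be too similar or prompts too lenient"
        )
    return None
```

A history of 13 approvals, 5 revisions, and 2 rejections gives a 65% approval rate, inside the 60-70% target, so `drift_warning` returns `None`.
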
Usage Guidance
This is a coherent, low-risk prompt/workflow template for cross-model review. Before installing or using it, confirm your agent/platform can actually call alternative models/providers as the skill expects; if that involves third‑party APIs, avoid sending sensitive secrets or personal data to judge models, and check provider data handling and costs. If you plan to run security‑critical or proprietary code through an external judge model, obtain explicit consent and consider local/private review alternatives.
Capability Analysis
Type: OpenClaw Skill
Name: llm-as-judge
Version: 1.2.0

The 'llm-as-judge' skill implements a standard cross-model verification pattern designed to improve output quality and catch errors in complex tasks like architecture design and security reviews. The files (SKILL.md and references/judge-prompts.md) contain legitimate workflow instructions and prompt templates for peer review without any evidence of data exfiltration, malicious execution, or prompt injection attacks.
Capability Assessment
Purpose & Capability
Name and description match the content: the skill defines a prompt-and-workflow for spawning a judge subagent using a different model for review. It does not request binaries, credentials, or system access that would be out of scope for a cross-model review pattern. It does reference specific providers/models (Claude, Kimi, Grok, Gemini, Opus), which is an expectation about available model providers rather than a secret or extra entitlement.
Instruction Scope
SKILL.md and templates are scoped to reviewing plans, code, and high‑stakes systems and constrain judge output to APPROVE/REVISE/REJECT with scored feedback. There is no instruction to read unrelated files, access environment secrets, or call external endpoints. Note: in practice using third‑party judge models may involve sending potentially sensitive project data to another provider — the skill does not explicitly warn about avoiding secrets or PHI when sending content to an external judge model.
Install Mechanism
Instruction-only skill with no install spec and no code files. This is low risk and expected for a prompt/workflow template.
Credentials
The skill declares no required environment variables or credentials, which is coherent for a pattern. However it presumes the agent/platform can invoke alternative models/providers; in real use you may need provider credentials or API keys (not declared here). Consider whether your agent will route judge calls to third‑party providers and whether those providers will receive sensitive data.
Persistence & Privilege
The always setting is false, the skill requests no persistent presence, and it does not attempt to modify other skills or system settings. Autonomous invocation is allowed (the platform default), but it is not combined with elevated privileges or secret access.
How to Use
  1. Make sure OpenClaw is installed (local or Docker)
  2. Run the install command in chat: /install llm-as-judge
  3. After installation, invoke the skill by name or use /llm-as-judge
  4. Provide required inputs per the skill's parameter spec and get structured output
Version History
v1.2.0
Remove project-specific references (QuantFlow, internal agent names). Fully generic and framework-agnostic. Activation criteria, model pairing, and gotchas unchanged.
v1.1.0
Add Gotchas section: same-provider blind spots, vague output, scope creep, approval rate drift, trivial task exclusions.
v1.0.0
Initial release of the LLM-as-Judge skill implementing cross-model verification.
  • Introduces the LLM-as-Judge pattern: spawn a subagent using a different model to review plans, code, or decisions.
  • Provides decision guidance on when to use LLM-as-Judge vs. self-review.
  • Includes review prompt templates for plans and code.
  • Offers best practices for model pairing across providers for fresh perspectives.
  • Lists anti-patterns, integration examples, and metrics for measuring effectiveness.
Metadata
Slug llm-as-judge
Version 1.2.0
License MIT-0
All-time Installs 1
Active Installs 1
Total Versions 3
Frequently Asked Questions

What is LLM as Judge?

Cross-model verification for complex tasks. Spawn a judge subagent with a different model to review plans, code, architecture, or decisions before execution.... It is an AI Agent Skill for Claude Code / OpenClaw, with 185 downloads so far.

How do I install LLM as Judge?

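Run "/install llm-as-judge" in the OpenClaw or Claude Code chat to install it in one step. The skill itself needs no extra setup, though routing judge calls to another provider may require that provider's credentials or API keys on your platform.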

Is LLM as Judge free?

Yes, LLM as Judge is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does LLM as Judge support?

LLM as Judge is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created LLM as Judge?

It is built and maintained by Neal Meyer (@ngmeyer); the current version is v1.2.0.
