
LLM as Judge

Name: LLM as Judge
Author: ngmeyer

by Neal Meyer · GitHub ↗ · v1.2.0 · MIT-0

cross-platform ✓ Security Clean

Downloads: 185

Install in OpenClaw

/install llm-as-judge

Description

Cross-model verification for complex tasks. Spawn a judge subagent with a different model to review plans, code, architecture, or decisions before execution....

README (SKILL.md)

LLM-as-Judge

Core principle: Same model = same blind spots. Different model = fresh perspective. Cross-model review catches ~85% of issues vs ~60% for self-reflection.

Activation Criteria

Use this pattern for:

Architecture or system design decisions
Multi-file changes affecting >5 files or >500 LOC
Security-critical code (auth, payments, crypto/DeFi)
Financial/trading systems (market making, quant strategies)
Planning documents that will drive weeks of work
Problems still unsolved after 3+ failed attempts

Skip it for:

Simple edits, config tweaks, bug fixes with an obvious cause
Documentation updates
Single-file changes under 100 LOC
Tasks where self-review is sufficient
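
The activation criteria above can be sketched as a single predicate. This is only an illustration: the Task fields and the needs_judge name are hypothetical, and the thresholds are copied from the lists above.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical description of a unit of work (field names are assumptions)."""
    files_changed: int = 0
    lines_changed: int = 0
    is_architecture_decision: bool = False
    is_security_critical: bool = False
    is_financial: bool = False
    is_long_range_plan: bool = False
    failed_attempts: int = 0


def needs_judge(task: Task) -> bool:
    """Return True when any activation criterion from the skill applies."""
    return (
        task.is_architecture_decision
        or task.files_changed > 5
        or task.lines_changed > 500
        or task.is_security_critical
        or task.is_financial
        or task.is_long_range_plan
        or task.failed_attempts >= 3
    )


small_fix = Task(files_changed=1, lines_changed=80)
auth_change = Task(files_changed=2, lines_changed=120, is_security_critical=True)
```

Here needs_judge(small_fix) is False and needs_judge(auth_change) is True, since security-critical work qualifies regardless of size.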

The Pattern

Executor (Model A) → Output → Judge (Model B) → Verdict → Action

Verdicts: APPROVE | REVISE (with specific feedback) | REJECT (restart)
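
A minimal sketch of this loop, assuming the executor and judge are plain callables that wrap whatever model calls your platform provides. The names run_with_judge, stub_executor, and stub_judge, and the max_rounds limit, are illustrative assumptions and not part of the skill. REJECT is interpreted here as restarting without carrying feedback forward.

```python
from enum import Enum
from typing import Callable, Tuple


class Verdict(Enum):
    APPROVE = "APPROVE"
    REVISE = "REVISE"
    REJECT = "REJECT"


# Hypothetical callables: in practice each would wrap a call to a model.
Executor = Callable[[str, str], str]               # (task, feedback) -> output
Judge = Callable[[str, str], Tuple[Verdict, str]]  # (task, output) -> (verdict, feedback)


def run_with_judge(task: str, executor: Executor, judge: Judge, max_rounds: int = 3) -> str:
    """Run executor -> judge until the judge approves or the round limit is hit."""
    feedback = ""
    for _ in range(max_rounds):
        output = executor(task, feedback)
        verdict, notes = judge(task, output)
        if verdict is Verdict.APPROVE:
            return output
        # REVISE carries the judge's feedback forward; REJECT restarts clean.
        feedback = notes if verdict is Verdict.REVISE else ""
    raise RuntimeError(f"no approval after {max_rounds} rounds")


# Stub models for demonstration only.
def stub_executor(task: str, feedback: str) -> str:
    return f"{task} (revised)" if feedback else f"{task} (draft)"


def stub_judge(task: str, output: str) -> Tuple[Verdict, str]:
    if output.endswith("(revised)"):
        return Verdict.APPROVE, ""
    return Verdict.REVISE, "cover the rollback path"


approved = run_with_judge("migration plan", stub_executor, stub_judge)
```

With these stubs the first draft is sent back with feedback and the second round is approved, so approved holds the revised output.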

Model Pairing

Use a different provider than the executor to avoid shared blind spots:

Executor: Claude → Judge: Kimi, Grok, or Gemini Pro
Executor: Kimi/Gemini → Judge: Claude Opus
Principle: Different provider, similar capability tier
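
One way to encode the pairing principle is to filter a model catalog by provider and then pick the closest capability tier. The catalog below is purely illustrative: the short model names come from this document, while the numeric tiers and the pick_judge helper are assumptions.

```python
from typing import List, NamedTuple, Optional


class Model(NamedTuple):
    name: str
    provider: str
    tier: int  # illustrative capability tier (assumption); higher = more capable


# Hypothetical catalog; the tier values are placeholders, not benchmarks.
CATALOG = [
    Model("opus", "anthropic", 3),
    Model("sonnet", "anthropic", 2),
    Model("gemini-pro", "google", 3),
    Model("grok", "xai", 3),
    Model("kimi", "moonshot", 3),
]


def pick_judge(executor: Model, catalog: List[Model] = CATALOG) -> Optional[Model]:
    """Pick a judge from a different provider with the closest capability tier."""
    candidates = [m for m in catalog if m.provider != executor.provider]
    if not candidates:
        return None  # no cross-provider judge is available
    return min(candidates, key=lambda m: abs(m.tier - executor.tier))


judge_for_claude = pick_judge(Model("opus", "anthropic", 3))
judge_for_kimi = pick_judge(Model("kimi", "moonshot", 3))
```

Ties are broken by catalog order, so reorder the list to express a preference among equally matched judges.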

Judge Prompt Templates

Plan/Architecture Review

See references/judge-prompts.md for full templates covering:

Plan completeness, feasibility, risk, testing strategy
Architecture review with scoring (0-10 per dimension)
Code review checklist (correctness, design, safety, maintainability)
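
The templates themselves live in references/judge-prompts.md and are not reproduced here. As a sketch of how a scored review might be checked on the receiving side, the snippet below assumes the judge has been asked to reply in JSON with verdict, scores, and action_items keys. That reply format and the parse_judge_reply helper are assumptions for illustration, not the skill's actual format.

```python
import json

ALLOWED_VERDICTS = {"APPROVE", "REVISE", "REJECT"}


def parse_judge_reply(raw: str) -> dict:
    """Validate a judge reply in the assumed JSON shape and return it as a dict."""
    reply = json.loads(raw)
    if not isinstance(reply, dict):
        raise ValueError("reply must be a JSON object")

    if reply.get("verdict") not in ALLOWED_VERDICTS:
        raise ValueError(f"unknown verdict: {reply.get('verdict')!r}")

    scores = reply.get("scores")
    if not isinstance(scores, dict) or not scores:
        raise ValueError("missing per-dimension scores")
    for dimension, score in scores.items():
        is_number = isinstance(score, (int, float)) and not isinstance(score, bool)
        if not is_number or not 0 <= score <= 10:
            raise ValueError(f"score for {dimension!r} must be a number from 0 to 10")

    items = reply.get("action_items")
    if not isinstance(items, list) or not items:
        raise ValueError("reply must list specific action items")
    if not all(isinstance(item, str) and item.strip() for item in items):
        raise ValueError("action items must be non-empty strings")
    return reply


good = parse_judge_reply(
    '{"verdict": "REVISE", "scores": {"feasibility": 7, "risk": 4},'
    ' "action_items": ["Add a rollback step for the schema migration"]}'
)
```

A reply that only says "looks good" fails this check because it carries no scores or action items, which is the failure mode the Gotchas section warns about.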

Integration Points

With adversarial review: This IS the formalized version of "spawn a separate model to review"
With planning-protocol: Judge reviews the plan before the Execute phase
With coding workflows: Code → cross-model review → fix findings → test → build → push

Quick Decision

Simple task?           → Self-review
Complex / high stakes? → LLM-as-Judge
Stuck after retries?   → LLM-as-Judge (fresh perspective)
Financial/security?    → LLM-as-Judge (mandatory)

Gotchas

Same provider defeats the purpose — Claude Opus judging Claude Sonnet shares the same training distribution. Use a different provider (Grok judging Claude, Gemini judging GPT, etc.).
Vague judge output is useless — If the judge says "looks good" without specifics, the prompt is too weak. Always require the judge to produce scored dimensions + specific actionable items, even if approving.
Judge scope creep — Judges sometimes rewrite the entire plan instead of reviewing it. Constrain the verdict to APPROVE / REVISE / REJECT with specific feedback, not a replacement solution.
Approval rate drift — If the judge approves >80% of submissions, the model pairing is too similar or the prompts are too lenient. Target 60-70% approval rate.
Don't judge trivial tasks — A 50-line CSS fix doesn't need cross-model review. Use the activation criteria in this skill strictly.
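
The approval rate drift check can be tracked with a few lines of bookkeeping. This sketch assumes verdicts are logged as plain strings; the function names and the "off_target" label for rates outside both ranges are illustrative, since the skill only specifies the >80% warning and the 60-70% target.

```python
from typing import List


def approval_rate(verdicts: List[str]) -> float:
    """Fraction of logged verdicts that were APPROVE."""
    if not verdicts:
        raise ValueError("no verdicts recorded yet")
    return verdicts.count("APPROVE") / len(verdicts)


def drift_status(verdicts: List[str]) -> str:
    """Classify the approval rate against the thresholds stated in the skill."""
    rate = approval_rate(verdicts)
    if rate > 0.80:
        return "too_lenient"  # pairing too similar or prompts too lenient
    if 0.60 <= rate <= 0.70:
        return "on_target"
    return "off_target"       # outside the target band but below the 80% warning


history = ["APPROVE"] * 13 + ["REVISE"] * 5 + ["REJECT"] * 2
status = drift_status(history)
```

With 13 approvals out of 20 verdicts the rate is 65%, so status is "on_target". A log with 9 approvals out of 10 would be flagged "too_lenient".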

Usage Guidance

This is a coherent, low-risk prompt/workflow template for cross-model review. Before installing or using it, confirm your agent/platform can actually call alternative models/providers as the skill expects; if that involves third‑party APIs, avoid sending sensitive secrets or personal data to judge models, and check provider data handling and costs. If you plan to run security‑critical or proprietary code through an external judge model, obtain explicit consent and consider local/private review alternatives.

Capability Analysis

Type: OpenClaw Skill
Name: llm-as-judge
Version: 1.2.0

The 'llm-as-judge' skill implements a standard cross-model verification pattern designed to improve output quality and catch errors in complex tasks like architecture design and security reviews. The files (SKILL.md and references/judge-prompts.md) contain legitimate workflow instructions and prompt templates for peer review without any evidence of data exfiltration, malicious execution, or prompt injection attacks.

Capability Assessment

✓ Purpose & Capability

Name and description match the content: the skill defines a prompt-and-workflow for spawning a judge subagent using a different model for review. It does not request binaries, credentials, or system access that would be out of scope for a cross-model review pattern. It does reference specific providers/models (Claude, Kimi, Grok, Gemini, Opus), which is an expectation about available model providers rather than a secret or extra entitlement.

ℹ Instruction Scope

SKILL.md and templates are scoped to reviewing plans, code, and high‑stakes systems and constrain judge output to APPROVE/REVISE/REJECT with scored feedback. There is no instruction to read unrelated files, access environment secrets, or call external endpoints. Note: in practice using third‑party judge models may involve sending potentially sensitive project data to another provider — the skill does not explicitly warn about avoiding secrets or PHI when sending content to an external judge model.

✓ Install Mechanism

Instruction-only skill with no install spec and no code files. This is low risk and expected for a prompt/workflow template.

ℹ Credentials

The skill declares no required environment variables or credentials, which is coherent for a pattern. However, it presumes the agent/platform can invoke alternative models/providers; in real use you may need provider credentials or API keys (not declared here). Consider whether your agent will route judge calls to third‑party providers and whether those providers will receive sensitive data.

✓ Persistence & Privilege

The always flag is false, the skill requests no persistent presence, and it does not attempt to modify other skills or system settings. Autonomous invocation is allowed (platform default), but this is not combined with elevated privileges or secret access.

How to Use

Make sure OpenClaw is installed (local or Docker)
Run the install command in chat: /install llm-as-judge
After installation, invoke the skill by name or use /llm-as-judge
Provide required inputs per the skill's parameter spec and get structured output

Version History

v1.2.0

Remove project-specific references (QuantFlow, internal agent names). Fully generic and framework-agnostic. Activation criteria, model pairing, and gotchas unchanged.

v1.1.0

Add Gotchas section: same-provider blind spots, vague output, scope creep, approval rate drift, trivial task exclusions.

v1.0.0

Initial release of the LLM-as-Judge skill implementing cross-model verification.

Introduces the LLM-as-Judge pattern: spawn a subagent using a different model to review plans, code, or decisions.
Provides decision guidance on when to use LLM-as-Judge vs. self-review.
Includes review prompt templates for plans and code.
Offers best practices for model pairing across providers for fresh perspectives.
Lists anti-patterns, integration examples, and metrics for measuring effectiveness.

Metadata

Slug llm-as-judge

Version 1.2.0

License MIT-0

All-time Installs 1

Active Installs 1

Total Versions 3

Frequently Asked Questions

What is LLM as Judge?

Cross-model verification for complex tasks. Spawn a judge subagent with a different model to review plans, code, architecture, or decisions before execution.... It is an AI Agent Skill for Claude Code / OpenClaw, with 185 downloads so far.

How do I install LLM as Judge?

Run "/install llm-as-judge" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is LLM as Judge free?

Yes, LLM as Judge is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does LLM as Judge support?

LLM as Judge is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created LLM as Judge?

It is built and maintained by Neal Meyer (@ngmeyer); the current version is v1.2.0.
