Description

Build a cost-efficient LLM evaluation ensemble with sampling, tiebreakers, and deterministic validators. Learned from 600+ production runs judging local Olla...

Usage Instructions (SKILL.md)

LLM-as-Judge

Name: Llm As Judge
Author: nissan

Build a cost-efficient LLM evaluation ensemble for comparing and scoring generative AI outputs at scale.

When to Use

Evaluating generative AI outputs across multiple models at scale (100+ runs)
Comparing local/OSS models against cloud baselines in shadow-testing pipelines
Building promotion gates where models must prove quality before serving production traffic
Any scenario where deterministic tests alone can't capture output quality

When NOT to Use

One-off evaluations (just read the output yourself)
Tasks with deterministic correct answers (use exact-match or unit tests)
When you can't afford any external API calls (this pattern uses Claude/GPT as judges)

Architecture: Three-Layer Evaluation

Layer 1: Deterministic Validators (Free, Instant)

Run on 100% of outputs. Zero cost. Catches obvious failures before burning judge tokens.

JSON schema validation — does the output parse? Does it match the expected schema?
Regex checks — required fields present, format constraints met
Length bounds — output within acceptable min/max character count
Entity presence — do required entities from the input appear in the output?

If Layer 1 fails, score is 0.0 — no need to invoke expensive judges.
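The Layer 1 checks can be sketched as follows, assuming the output is expected to be a JSON object. The function name `validate_layer1`, its parameters, and the sample values are hypothetical, and a required-keys check stands in for full JSON schema validation so that the sketch needs only the standard library.

```python
import json
import re

def validate_layer1(output, required_keys=(), patterns=(), min_len=1,
                    max_len=10_000, required_entities=()):
    """Return (passed, reasons). Hypothetical helper, stdlib only."""
    reasons = []

    # Length bounds: character count must fall inside [min_len, max_len].
    if not (min_len <= len(output) <= max_len):
        reasons.append("length out of bounds")

    # JSON parse plus required keys (a stand-in for real schema validation).
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        reasons.append("invalid JSON")
    else:
        if not isinstance(data, dict):
            reasons.append("top-level JSON value is not an object")
        else:
            missing = [k for k in required_keys if k not in data]
            if missing:
                reasons.append(f"missing keys: {missing}")

    # Regex checks: every pattern must match somewhere in the raw output.
    for pat in patterns:
        if not re.search(pat, output):
            reasons.append(f"pattern not found: {pat}")

    # Entity presence: case-insensitive substring match.
    lowered = output.lower()
    for ent in required_entities:
        if ent.lower() not in lowered:
            reasons.append(f"missing entity: {ent}")

    return (not reasons, reasons)
```

A caller would assign a score of 0.0 and skip Layers 2 and 3 whenever `passed` is false; keeping the list of reasons makes later failure analysis easier.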

Layer 2: Heuristic Drift Detection (Cheap, Fast)

Run on 100% of outputs that pass Layer 1. Minimal cost (local computation only).

Entity overlap — what fraction of entities in the ground truth appear in the candidate?
Numerical consistency — do numbers in the output match source data?
Novel fact detection — does the output introduce facts not present in the input/context? Novel facts suggest hallucination.
Structural similarity — does the output follow the same structural pattern as ground truth?

Layer 2 produces heuristic scores (0.0–1.0) that contribute to the final weighted score.
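Two of these heuristics can be sketched with simple string matching. This is an illustration under stated assumptions, not the skill's implementation: the function names are hypothetical, entities are supplied by the caller rather than extracted, and numbers are compared as literal strings. Both functions return `None` when there is nothing to measure, so an unmeasurable check is not mistaken for a failing one.

```python
import re

NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")

def entity_overlap(truth_entities, candidate):
    """Fraction of ground-truth entities that appear in the candidate."""
    if not truth_entities:
        return None  # nothing to measure
    text = candidate.lower()
    hits = sum(1 for e in truth_entities if e.lower() in text)
    return hits / len(truth_entities)

def numerical_consistency(source, candidate):
    """Fraction of numbers in the candidate that also occur in the source."""
    cand_nums = NUMBER_RE.findall(candidate)
    if not cand_nums:
        return None  # the candidate states no numbers
    src_nums = set(NUMBER_RE.findall(source))
    return sum(1 for n in cand_nums if n in src_nums) / len(cand_nums)

source = "Acme Corp reported revenue of 4.2 million in 2023, up 12 percent."
candidate = "Acme Corp revenue hit 4.2 million, up 15 percent."

print(entity_overlap(["Acme Corp", "2023"], candidate))  # 0.5
print(numerical_consistency(source, candidate))          # 0.5
```

Here the candidate drops the year and changes 12 to 15, so each heuristic scores 0.5. A production version would likely normalise number formats (for example "4.2M" versus "4.2 million") before comparing.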

Layer 3: LLM Judges (Expensive, High Quality)

Sampled at 15% of runs to control cost. Forced to 100% during promotion gates.

Two independent judges (e.g., Claude + GPT-4o) score the output. Each judge evaluates all 6 dimensions independently.

Tiebreaker pattern: When primary judges disagree by Δ ≥ 0.20 on any dimension, a third judge is invoked. The tiebreaker score replaces the outlier. This reduced score variance by 34% at only 8% additional cost.
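The sampling and tiebreaker rules can be sketched for a single dimension as below. Two points are assumptions rather than details from the text above: the "outlier" is taken to be the primary score farther from the tiebreaker's score, and the two remaining scores are combined by a plain mean. The function names are hypothetical.

```python
import random

def should_judge(sample_rate=0.15, promotion_gate=False, rng=random):
    """Decide whether Layer 3 runs for this output."""
    return promotion_gate or rng.random() < sample_rate

def resolve_dimension(a, b, tiebreak, threshold=0.20):
    """Combine two primary judge scores for one dimension.

    `tiebreak` is a zero-argument callable, so the third judge is only
    invoked (and paid for) when the primaries disagree by >= threshold.
    Returns (score, tiebreaker_used).
    """
    if abs(a - b) < threshold:
        return (a + b) / 2, False
    t = tiebreak()
    # Assumption: the outlier is the primary score farther from t.
    if abs(a - t) > abs(b - t):
        a = t
    else:
        b = t
    return (a + b) / 2, True
```

Passing the tiebreaker as a callable keeps the extra API call lazy. For example, `resolve_dimension(0.9, 0.4, lambda: 0.85)` replaces the 0.4 with 0.85 and averages 0.9 and 0.85.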

The 6 Scoring Dimensions

Dimension	Weight	What It Measures
Structural accuracy	0.20	Format compliance, schema adherence
Semantic similarity	0.25	Meaning preservation vs ground truth
Factual accuracy	0.25	Correctness of facts, numbers, entities
Task completion	0.15	Does it actually answer the question?
Tool use correctness	0.05	Valid tool calls (when applicable)
Latency	0.10	Response time within acceptable bounds

Weights are configurable per task type. Tool use weight is redistributed when not applicable.
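One way to redistribute a weight is to drop the inapplicable dimension and rescale the rest proportionally so they sum to 1.0 again. The text above does not say which redistribution rule is used, so proportional rescaling is an assumption here, as are the `tool_use` key and the `redistribute` helper name.

```python
DEFAULT_WEIGHTS = {
    "structural": 0.20,
    "semantic": 0.25,
    "factual": 0.25,
    "completion": 0.15,
    "tool_use": 0.05,
    "latency": 0.10,
}

def redistribute(weights, not_applicable):
    """Drop inapplicable dimensions and rescale the rest to sum to 1.0."""
    kept = {k: w for k, w in weights.items() if k not in not_applicable}
    total = sum(kept.values())
    if total == 0:
        raise ValueError("no applicable dimensions left")
    return {k: w / total for k, w in kept.items()}

# For a task with no tool calls, each remaining weight is divided by 0.95.
summarise_weights = redistribute(DEFAULT_WEIGHTS, {"tool_use"})
```

Proportional rescaling preserves the relative importance of the remaining dimensions; assigning the freed 0.05 to a single dimension would be an equally valid per-task choice.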

Critical Lesson: None ≠ 0.0

When a dimension is not sampled (LLM judge not invoked on this run), record the score as null, not 0.0. Unsampled dimensions must be excluded from the weighted average, not treated as failures.

Early bug: recording unsampled dimensions as 0.0 created a systematic 0.03–0.08 downward bias across all models. The fix: null means "not measured", which is fundamentally different from "scored zero".

# WRONG — penalises unsampled dimensions
weighted = sum(s * w for s, w in zip(scores, weights)) / sum(weights)

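# RIGHT: exclude null dimensions (None means "not measured")
pairs = [(s, w) for s, w in zip(scores, weights) if s is not None]
total_weight = sum(w for _, w in pairs)
# Guard against the case where every dimension is unsampled
weighted = sum(s * w for s, w in pairs) / total_weight if total_weight else None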

Cost Estimate

With 15% LLM sampling, average cost per evaluated run: ~$0.003

Layer 1 + Layer 2: $0.00 (local computation)
Layer 3 (15% of runs): ~$0.02 per judged run × 0.15 = ~$0.003
Tiebreaker (fires ~12% of judged runs): adds ~$0.0003

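At 200 runs for promotion: total judge cost ≈ $0.60 per model per task type (200 × ~$0.003, at 15% sampling). Runs forced to 100% judging at a promotion gate cost ~$0.02 each instead.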
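The arithmetic behind these figures can be reproduced in a few lines. The constants come from the estimates above; the per-judged-run tiebreaker cost is derived from them (~$0.0003 per evaluated run at 15% sampling), and the function name is hypothetical.

```python
JUDGED_RUN_COST = 0.02                   # USD per run that reaches the primary judges
SAMPLE_RATE = 0.15                       # share of runs sent to Layer 3
TIEBREAK_PER_JUDGED_RUN = 0.0003 / 0.15  # derived: ~$0.002 per judged run

def expected_judge_cost(runs, sample_rate=SAMPLE_RATE, include_tiebreaker=False):
    """Expected Layer 3 spend in USD for a batch of evaluated runs."""
    per_judged = JUDGED_RUN_COST
    if include_tiebreaker:
        per_judged += TIEBREAK_PER_JUDGED_RUN
    return runs * sample_rate * per_judged

print(f"{expected_judge_cost(200):.2f}")                           # 0.60
print(f"{expected_judge_cost(200, include_tiebreaker=True):.2f}")  # 0.66
print(f"{expected_judge_cost(200, sample_rate=1.0):.2f}")          # 4.00
```

The last line shows why the sampling rate matters: judging all 200 runs costs roughly $4.00 before tiebreakers, against $0.60 at 15% sampling.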

Worked Example: Summarisation Evaluation

from evaluation import JudgeEnsemble, DeterministicValidator, HeuristicScorer

# Layer 1: must be valid text, 50-500 chars
validator = DeterministicValidator(
    min_length=50,
    max_length=500,
    required_format="text",
)

# Layer 2: check entity overlap with source
heuristic = HeuristicScorer(
    check_entity_overlap=True,
    check_novel_facts=True,
    check_numerical_consistency=True,
)

# Layer 3: LLM judges (sampled)
ensemble = JudgeEnsemble(
    judges=["claude-sonnet-4-20250514", "gpt-4o"],
    tiebreaker="claude-sonnet-4-20250514",
    sample_rate=0.15,
    tiebreaker_threshold=0.20,
    dimensions=["structural", "semantic", "factual", "completion", "latency"],
)

# Evaluate
result = ensemble.evaluate(
    task_type="summarize",
    ground_truth=gt_response,
    candidate=candidate_response,
    source_text=original_text,
    validator=validator,
    heuristic=heuristic,
)

print(f"Weighted score: {result.weighted_score:.3f}")
print(f"Dimensions: {result.scores}")  # {semantic: 0.95, factual: 0.88, ...}
# None values for unsampled dimensions

Tips

Start with Layer 1 — you'd be surprised how many outputs fail basic validation
Log everything — store raw judge responses for debugging score disputes
Calibrate on 50 runs — before trusting the ensemble, manually review 50 outputs against judge scores
Watch for judge drift — LLM judges can be inconsistent across API versions; pin model versions
Force judges at gates — 15% sampling is fine for monitoring, but promotion decisions need 100% coverage on the final batch
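The calibration tip can be made concrete with a small report comparing manual review scores against judge scores for the same outputs. This is a sketch under assumptions: the function name, the 0.10 tolerance, and the choice of metrics are illustrative and not part of the skill.

```python
from statistics import mean

def calibration_report(human, judge, tolerance=0.10):
    """Compare human review scores with judge scores for the same outputs."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two equal-length, non-empty score lists")
    diffs = [j - h for h, j in zip(human, judge)]
    return {
        # Average size of the disagreement, ignoring direction.
        "mean_abs_error": mean([abs(d) for d in diffs]),
        # Signed average: above 0 means the judges score higher than humans.
        "bias": mean(diffs),
        # Share of outputs where judge and human agree within the tolerance.
        "within_tolerance": sum(abs(d) <= tolerance for d in diffs) / len(diffs),
    }

report = calibration_report(
    human=[0.80, 0.60, 0.90, 0.50],
    judge=[0.85, 0.60, 0.70, 0.55],
)
```

In this toy sample three of the four outputs agree within 0.10; the remaining one (0.90 versus 0.70) is the kind of case where reading the stored raw judge response is most useful.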

Security Recommendations

This skill appears to implement what it says (a three-layer judge ensemble that uses local validators plus sampled cloud LLM judges), but there are a couple of practical mismatches to consider before installing:

It declares both ANTHROPIC_API_KEY and OPENAI_API_KEY as required. If you only intend to run a single cloud judge or only local Ollama inference, you should confirm whether both keys are actually needed or if the skill could accept one optionally. Avoid supplying unnecessary credentials.
It lists ollama as a required binary even though local use is described as optional. If you won't run Ollama locally, ensure the skill can operate without it rather than installing/using a local runtime you don't want.
Because the skill performs outbound network calls to third-party APIs, expect billing and data sent to Anthropic/OpenAI. Review the upstream behavior (repository or code) to confirm what request payloads include and whether any sample data or logs would be transmitted.

If you decide to use it: provide only the credentials needed for your intended mode (prefer least privilege), monitor API usage/costs during initial runs, and review the upstream repo (https://github.com/reddinft/skill-llm-as-judge) or any implementation before giving full access to production credentials.

Functional Analysis

Type: OpenClaw Skill
Name: reddi-llm-judge
Version: 1.0.1

The skill is classified as benign. It transparently declares its need for outbound network access to call Anthropic and OpenAI APIs for LLM judging, which is consistent with its stated purpose. It also requests `ANTHROPIC_API_KEY` and `OPENAI_API_KEY` environment variables, which are necessary for these API calls. The `SKILL.md` content is purely instructional and descriptive, with no evidence of prompt injection attempts, hidden commands, or other malicious instructions for the agent. The illustrative Python code snippet also appears benign and aligns with the skill's functionality.

Capability Assessment

ℹ Purpose & Capability

The skill claims to build an ensemble that uses local (Ollama) and cloud judges (Anthropic + OpenAI). Requiring python3 plus ollama, and both ANTHROPIC_API_KEY and OPENAI_API_KEY is consistent with the described two-cloud-plus-local architecture. However, the docs state judges are sampled and local inference can be used alone; making ollama and both cloud keys mandatory (in requires lists) is stricter than the description implies and may be unnecessary for some legitimate uses (e.g., cloud-only or local-only evaluation).

✓ Instruction Scope

SKILL.md contains architecture, validation and scoring rules, and example Python usage. It explicitly describes network outbound calls to Anthropic and OpenAI and local inference via Ollama. The instructions do not direct the agent to read unrelated system files or other secrets, nor do they instruct exfiltration to unexpected endpoints.

✓ Install Mechanism

This is instruction-only (no install spec and no code files), so nothing is written to disk by an installer. That lowers installation risk. The skill does rely on external binaries being present on PATH rather than installing them itself.

⚠ Credentials

The skill requires two cloud API keys (ANTHROPIC_API_KEY and OPENAI_API_KEY) and marks Anthropic as the primary credential. Requiring both keys is reasonable if the ensemble always needs two cloud judges, but the doc describes sampling and optional components (e.g., optional Gemini tiebreaker, local Ollama use). Making both cloud keys mandatory and insisting on an ollama executable is disproportionate to some described modes of operation and reduces flexibility. If you only plan to run local/offline evaluations or only one cloud provider, forcing both keys and ollama may be unnecessary and increases secret exposure.

✓ Persistence & Privilege

The skill is not always-enabled and does not request special persistent privileges. It does permit autonomous invocation by default (platform normal), but there's no indication it modifies other skills or system-wide configs.

Version History

v1.0.1

Added network disclosure, primaryEnv, bins

v1.0.0

Initial release of llm-judge-ensemble.
Introduces a three-layer evaluation architecture: deterministic validators, heuristic drift detection, and sampled LLM judges with a tiebreaker pattern.
Supports evaluation across 6 configurable scoring dimensions with weighted averaging.
Optimizes cost by sampling expensive LLM judges at 15% of runs and resolving disagreements via a third judge.
Addresses proper handling of unsampled dimensions (null vs. 0.0) to prevent scoring bias.
Provides practical usage guidance, cost estimates, and example code for rapid integration.

Metadata

Slug reddi-llm-judge

Version 1.0.1

License —

Total installs 1

Current installs 1

Number of versions 2

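FAQ

What is Llm As Judge?

Build a cost-efficient LLM evaluation ensemble with sampling, tiebreakers, and deterministic validators. Learned from 600+ production runs judging local Olla... It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 391 total downloads to date.

How do I install Llm As Judge?

Run the command "/install reddi-llm-judge" in the OpenClaw or Claude Code chat box to install it in one step; no additional configuration is required.

Is Llm As Judge free?

Yes, Llm As Judge is completely free (free and open source) and can be freely downloaded, installed, and used.

Which platforms does Llm As Judge support?

Llm As Judge is cross-platform and runs in any environment where OpenClaw / Claude Code is deployed.

Who developed Llm As Judge?

It is developed and maintained by Nissan Dookeran (@nissan); the current version is v1.0.1.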
