anderskev

Llm Judge

by Kevin Anderson · GitHub ↗ · v1.0.3 · MIT-0
cross-platform ✓ Security Clean
278 Downloads · 0 Stars · 1 Active Installs · 4 Versions
Install in OpenClaw
/install llm-judge
Description
Use when comparing two or more code implementations against a spec or requirements doc. Triggers on "which repo is better", "compare these implementations",...
README (SKILL.md)

LLM Judge

Compare code implementations across multiple repositories using structured evaluation.

Usage

/beagle-analysis:llm-judge <spec> <repo1> <repo2> [repo3...] [--labels=...] [--weights=...] [--branch=...]

Arguments

| Argument | Required | Description |
| --- | --- | --- |
| spec | Yes | Path to spec/requirements document |
| repos | Yes | 2+ paths to repositories to compare |
| --labels | No | Comma-separated labels (default: directory names) |
| --weights | No | Override weights, e.g. functionality:40,security:30 |
| --branch | No | Branch to compare against main (default: main) |

Workflow

  1. Parse $ARGUMENTS into spec_path, repo_paths, labels, weights, and branch.
  2. Validate the spec file, each repo path, and the minimum repo count.
  3. Read the spec document into memory.
  4. Load this skill and the supporting reference files.
  5. Spawn one Phase 1 repo agent per repository to gather facts only.
  6. Validate the repo-agent JSON results before proceeding.
  7. Spawn one Phase 2 judge agent per dimension.
  8. Aggregate scores, compute weighted totals, rank repos, and write the report.
  9. Display the markdown summary and verify the JSON report.

Hard gates

Sequenced workflow: do not start the next phase until the current gate passes. Each pass condition must be checkable (file on disk, non-empty content, or json.load succeeds)—not “I reviewed internally.”

| Gate | Pass condition | Unblocks |
| --- | --- | --- |
| A — Inputs | spec_path is a readable file and non-empty; len(repo_paths) ≥ 2; each path contains .git. | Phase 1 repo agents |
| B — Phase 1 facts | For each repo agent output: stdin/stdout parses as JSON; required keys/shape match references/fact-schema.md. | Phase 2 judge agents |
| C — Phase 2 scores | Five judge outputs (one per dimension) each parse as JSON; each includes a score (and justification) for every repo label. | Aggregation |
| D — Report file | .beagle/llm-judge-report.json exists; python3 -c "import json; json.load(open('.beagle/llm-judge-report.json'))" exits 0. | Markdown summary to the user |
| E — Consistency | Summary table and verdict use the same labels, weights, and per-dimension scores as the JSON report. | Mark task complete |

Parallelism is allowed within a phase (all Phase 1 tasks together; all Phase 2 tasks together), but Phase 2 must not start until Gate B passes, and the user-visible summary must not precede Gate D.
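
The sequencing rule above can be sketched in Python. This is a minimal illustration under stated assumptions, not the skill's implementation: `run_phase`, `GateError`, and the callables passed in are hypothetical names, and real agents would be spawned through the host's Task mechanism rather than as plain functions.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


class GateError(RuntimeError):
    """Raised when a phase finishes but its gate does not pass (hypothetical)."""


def run_phase(
    tasks: Dict[str, Callable[[], str]],
    gate: Callable[[Dict[str, str]], bool],
) -> Dict[str, str]:
    """Run all tasks of one phase in parallel, then apply the gate.

    Results are returned only if the gate passes, so the caller cannot
    start the next phase with unchecked output.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(tasks))) as pool:
        futures = {name: pool.submit(task) for name, task in tasks.items()}
        # Leaving the `with` block waits for every task in this phase.
        results = {name: future.result() for name, future in futures.items()}
    if not gate(results):
        raise GateError("gate failed; next phase must not start")
    return results
```

A caller would invoke `run_phase` once for Phase 1 (with a Gate B check) and only then build and run the Phase 2 tasks from the returned facts.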

Command Workflow

Step 1: Parse Arguments

Parse $ARGUMENTS to extract:

  • spec_path: first positional argument
  • repo_paths: remaining positional arguments (must be 2+)
  • labels: from --labels or derived from directory names
  • weights: from --weights or defaults
  • branch: from --branch or main

Default Weights:

{
  "functionality": 30,
  "security": 25,
  "tests": 20,
  "overengineering": 15,
  "dead_code": 10
}
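
A minimal Python sketch of the parsing step, assuming `--name=value` option syntax as shown under Usage. The names `parse_arguments` and `JudgeArgs` are hypothetical, and so is the merge behavior: overrides replace the named default weights and leave the others unchanged, without renormalizing to 100. The source does not say how partial overrides are handled.

```python
import os
from typing import Dict, List, NamedTuple

DEFAULT_WEIGHTS = {
    "functionality": 30,
    "security": 25,
    "tests": 20,
    "overengineering": 15,
    "dead_code": 10,
}


class JudgeArgs(NamedTuple):
    spec_path: str
    repo_paths: List[str]
    labels: List[str]
    weights: Dict[str, int]
    branch: str


def parse_arguments(argv: List[str]) -> JudgeArgs:
    positional = [a for a in argv if not a.startswith("--")]
    options = dict(
        a[2:].split("=", 1) for a in argv if a.startswith("--") and "=" in a
    )
    if len(positional) < 3:
        raise ValueError("expected a spec path and at least 2 repository paths")
    spec_path, repo_paths = positional[0], positional[1:]

    if "labels" in options:
        labels = options["labels"].split(",")
    else:
        # Default labels: the directory name of each repo path.
        labels = [os.path.basename(os.path.normpath(p)) for p in repo_paths]
    if len(labels) != len(repo_paths):
        raise ValueError("need exactly one label per repository")

    # Assumption: overrides are merged onto the defaults, not renormalized.
    weights = dict(DEFAULT_WEIGHTS)
    for pair in filter(None, options.get("weights", "").split(",")):
        name, value = pair.split(":", 1)
        if name not in weights:
            raise ValueError(f"unknown dimension in --weights: {name}")
        weights[name] = int(value)

    return JudgeArgs(spec_path, repo_paths, labels, weights, options.get("branch", "main"))
```

For example, `parse_arguments(["spec.md", "impl-a", "impl-b", "--weights=functionality:40"])` would keep the default labels and branch while raising the functionality weight to 40.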

Step 2: Validate Inputs

[ -f "$SPEC_PATH" ] || { echo "Error: Spec file not found: $SPEC_PATH"; exit 1; }

for repo in "${REPO_PATHS[@]}"; do
  [ -d "$repo/.git" ] || { echo "Error: Not a git repository: $repo"; exit 1; }
done

[ ${#REPO_PATHS[@]} -ge 2 ] || { echo "Error: Need at least 2 repositories to compare"; exit 1; }

Step 3: Read Spec Document

SPEC_CONTENT=$(cat "$SPEC_PATH") || { echo "Error: Failed to read spec file: $SPEC_PATH"; exit 1; }
[ -n "$SPEC_CONTENT" ] || { echo "Error: Spec file is empty: $SPEC_PATH"; exit 1; }
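
The input checks from Steps 2 and 3 (Gate A) can also be expressed as one Python function that reports every problem instead of exiting on the first. This is a sketch only; `check_gate_a` is a hypothetical name, and it accepts any `.git` entry (file or directory), which is slightly more permissive than the `-d` test in the shell version.

```python
from pathlib import Path
from typing import List


def check_gate_a(spec_path: str, repo_paths: List[str]) -> List[str]:
    """Return a list of Gate A problems; an empty list means the gate passes."""
    problems = []

    spec = Path(spec_path)
    if not spec.is_file():
        problems.append(f"Spec file not found: {spec_path}")
    elif not spec.read_bytes().strip():
        problems.append(f"Spec file is empty: {spec_path}")

    if len(repo_paths) < 2:
        problems.append("Need at least 2 repositories to compare")

    for repo in repo_paths:
        # Assumption: any `.git` entry counts, including worktree-style files.
        if not (Path(repo) / ".git").exists():
            problems.append(f"Not a git repository: {repo}")

    return problems
```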

Step 4: Load the Skill

Load the llm-judge skill: Skill(skill: "beagle-analysis:llm-judge")

Step 5: Phase 1 - Spawn Repo Agents

Spawn one Task per repo:

You are a Phase 1 Repo Agent for the LLM Judge evaluation.

**Your Repo:** $LABEL at $REPO_PATH

**Spec Document:**
$SPEC_CONTENT

**Instructions:**
1. Load skill: Skill(skill: "beagle-analysis:llm-judge")
2. Read references/repo-agent.md for detailed instructions
3. Read references/fact-schema.md for the output format
4. Load Skill(skill: "beagle-core:llm-artifacts-detection") for analysis

Explore the repository and gather facts. Return ONLY valid JSON following the fact schema.

Do NOT score or judge. Only gather facts.

Collect all repo outputs into ALL_FACTS.

Step 6: Validate Phase 1 Results

printf '%s' "$FACTS" | python3 -c "import json,sys; json.load(sys.stdin)" 2>/dev/null || { echo "Error: Invalid JSON from $LABEL"; exit 1; }
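
The one-liner above only confirms that the output parses. Gate B also asks for the required keys from references/fact-schema.md, which is not reproduced here, so the sketch below takes them as a parameter. `parse_facts` and the `required_keys` argument are hypothetical.

```python
import json
from typing import Any, Dict, Iterable


def parse_facts(raw_outputs: Dict[str, str], required_keys: Iterable[str]) -> Dict[str, Any]:
    """Parse each repo agent's raw output and check top-level keys.

    `raw_outputs` maps repo label to the agent's raw text. The caller
    supplies `required_keys` from the fact schema; none are assumed here.
    """
    required = set(required_keys)
    all_facts = {}
    for label, raw in raw_outputs.items():
        try:
            facts = json.loads(raw)
        except json.JSONDecodeError as exc:
            raise ValueError(f"Invalid JSON from {label}: {exc}") from exc
        if not isinstance(facts, dict):
            raise ValueError(f"Facts from {label} must be a JSON object")
        missing = required - set(facts)
        if missing:
            raise ValueError(f"Facts from {label} lack keys: {sorted(missing)}")
        all_facts[label] = facts
    return all_facts
```

The returned mapping can serve as ALL_FACTS for Phase 2, since it only exists once every repo's output has passed the check.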

Step 7: Phase 2 - Spawn Judge Agents

Spawn five judge agents, one per dimension:

You are the $DIMENSION Judge for the LLM Judge evaluation.

**Spec Document:**
$SPEC_CONTENT

**Facts from all repos:**
$ALL_FACTS_JSON

**Instructions:**
1. Load skill: Skill(skill: "beagle-analysis:llm-judge")
2. Read references/judge-agents.md for detailed instructions
3. Read references/scoring-rubrics.md for the $DIMENSION rubric

Score each repo on $DIMENSION. Return ONLY valid JSON with scores and justifications.
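
Gate C can be checked mechanically once the judge outputs are parsed. The sketch below assumes the shape used by the aggregation step, `judge_outputs[dimension]['scores'][label]['score']`, plus a `justification` field whose exact name is an assumption. The 1 to 5 range comes from the Scoring Scale. `check_gate_c` is a hypothetical name.

```python
from typing import Any, Dict, List


def check_gate_c(
    judge_outputs: Dict[str, Any], dimensions: List[str], labels: List[str]
) -> List[str]:
    """Return Gate C problems; an empty list means every repo is scored."""
    problems = []
    for dimension in dimensions:
        output = judge_outputs.get(dimension)
        if not isinstance(output, dict) or not isinstance(output.get("scores"), dict):
            problems.append(f"{dimension}: missing judge output or 'scores' object")
            continue
        for label in labels:
            entry = output["scores"].get(label)
            if not isinstance(entry, dict):
                problems.append(f"{dimension}/{label}: no score entry")
                continue
            score = entry.get("score")
            is_number = isinstance(score, (int, float)) and not isinstance(score, bool)
            if not is_number or not 1 <= score <= 5:
                problems.append(f"{dimension}/{label}: score must be between 1 and 5")
            # Assumption: the justification lives under the key 'justification'.
            if not str(entry.get("justification", "")).strip():
                problems.append(f"{dimension}/{label}: missing justification")
    return problems
```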

Step 8: Aggregate Scores

# Collect per-dimension scores for each repo and compute its weighted total.
scores = {}
for repo_label in labels:
    scores[repo_label] = {}
    for dimension in dimensions:
        scores[repo_label][dimension] = judge_outputs[dimension]['scores'][repo_label]

    # Scores are on the 1-5 scale; weights are percentages.
    weighted_total = sum(
        scores[repo_label][dim]['score'] * weights[dim] / 100
        for dim in dimensions
    )
    scores[repo_label]['weighted_total'] = round(weighted_total, 2)

# Highest weighted total first.
ranking = sorted(labels, key=lambda label: scores[label]['weighted_total'], reverse=True)
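
As a worked example of this arithmetic, the sketch below wraps the same logic in a function and runs it on made-up scores with the default weights. The repo labels `alpha` and `beta`, the scores, and the function name `aggregate` are illustrative only.

```python
from typing import Any, Dict, List, Tuple

DEFAULT_WEIGHTS = {
    "functionality": 30,
    "security": 25,
    "tests": 20,
    "overengineering": 15,
    "dead_code": 10,
}


def aggregate(
    judge_outputs: Dict[str, Any], labels: List[str], weights: Dict[str, int]
) -> Tuple[Dict[str, Any], List[str]]:
    """Return (scores, ranking) using score * weight / 100 per dimension."""
    dimensions = list(weights)
    scores: Dict[str, Any] = {}
    for label in labels:
        scores[label] = {dim: judge_outputs[dim]["scores"][label] for dim in dimensions}
        total = sum(scores[label][dim]["score"] * weights[dim] / 100 for dim in dimensions)
        scores[label]["weighted_total"] = round(total, 2)
    ranking = sorted(labels, key=lambda l: scores[l]["weighted_total"], reverse=True)
    return scores, ranking


# Illustrative scores per dimension, in DEFAULT_WEIGHTS order.
sample = {"alpha": [4, 3, 5, 4, 2], "beta": [5, 4, 3, 3, 4]}
judge_outputs = {
    dim: {"scores": {label: {"score": values[i]} for label, values in sample.items()}}
    for i, dim in enumerate(DEFAULT_WEIGHTS)
}
scores, ranking = aggregate(judge_outputs, list(sample), DEFAULT_WEIGHTS)
# alpha: (4*30 + 3*25 + 5*20 + 4*15 + 2*10) / 100 = 3.75
# beta:  (5*30 + 4*25 + 3*20 + 3*15 + 4*10) / 100 = 3.95
print(ranking, scores["alpha"]["weighted_total"], scores["beta"]["weighted_total"])
```

With weights that sum to 100, the weighted total stays on the same 1 to 5 scale as the individual scores.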

Step 9: Generate Verdict

Name the winner, explain why they won, and note any close calls or trade-offs.

Step 10: Write JSON Report

mkdir -p .beagle

Write .beagle/llm-judge-report.json with version, timestamp, repo metadata, weights, scores, ranking, and verdict.
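
A minimal sketch of writing the report in Python. The field list follows the sentence above, but the exact key names and the `write_report` function are assumptions, and the report `version` value is left to the caller because the source does not specify it.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Any, Dict, List


def write_report(
    base_dir: str,
    version: str,
    repos: List[Dict[str, str]],
    weights: Dict[str, int],
    scores: Dict[str, Any],
    ranking: List[str],
    verdict: str,
) -> Path:
    """Write .beagle/llm-judge-report.json under base_dir and re-read it."""
    report = {
        "version": version,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "repos": repos,  # assumed shape: [{"label": ..., "path": ...}, ...]
        "weights": weights,
        "scores": scores,
        "ranking": ranking,
        "verdict": verdict,
    }
    path = Path(base_dir) / ".beagle" / "llm-judge-report.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(report, indent=2), encoding="utf-8")
    # Same check as Gate D: the file must load as JSON.
    with path.open(encoding="utf-8") as fh:
        json.load(fh)
    return path
```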

Step 11: Display Summary

Render a markdown summary with the scores table, ranking, verdict, and detailed justifications.
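
One way to render the scores table from the aggregated data is sketched below. The column layout and the `render_summary` name are assumptions; the skill does not prescribe a table format, and detailed justifications are omitted to keep the example short.

```python
from typing import Any, Dict, List


def render_summary(
    scores: Dict[str, Any], ranking: List[str], dimensions: List[str], verdict: str
) -> str:
    """Build a markdown table of per-dimension scores in ranking order."""
    header = "| Rank | Repo | " + " | ".join(dimensions) + " | Weighted total |"
    divider = "| --- " * (len(dimensions) + 3) + "|"
    rows = []
    for rank, label in enumerate(ranking, start=1):
        cells = [str(scores[label][dim]["score"]) for dim in dimensions]
        total = f"{scores[label]['weighted_total']:.2f}"
        rows.append(f"| {rank} | {label} | " + " | ".join(cells) + f" | {total} |")
    return "\n".join([header, divider, *rows, "", f"Verdict: {verdict}"])
```

Reading `scores`, `ranking`, and `verdict` back from the written JSON report, rather than from in-memory values, keeps the summary consistent with the report.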

Step 12: Verification

python3 -c "import json; json.load(open('.beagle/llm-judge-report.json'))" && echo "Valid report"

Output Shape

The generated report should include:

  • repo labels and paths
  • per-dimension scores and justifications
  • weighted totals and ranking
  • a verdict explaining the winner

Reference Files

| File | Purpose |
| --- | --- |
| references/fact-schema.md | JSON schema for Phase 1 facts |
| references/scoring-rubrics.md | Detailed rubrics for each dimension |
| references/repo-agent.md | Instructions for Phase 1 agents |
| references/judge-agents.md | Instructions for Phase 2 judges |

Scoring Model

| Dimension | Default Weight | Evaluates |
| --- | --- | --- |
| Functionality | 30% | Spec compliance, test pass rate |
| Security | 25% | Vulnerabilities, security patterns |
| Test Quality | 20% | Coverage, DRY, mock boundaries |
| Overengineering | 15% | Unnecessary complexity |
| Dead Code | 10% | Unused code, TODOs |

Scoring Scale

| Score | Meaning |
| --- | --- |
| 5 | Excellent - Exceeds expectations |
| 4 | Good - Meets requirements, minor issues |
| 3 | Average - Functional but notable gaps |
| 2 | Below Average - Significant issues |
| 1 | Poor - Fails basic requirements |

Phase 1: Spawning Repo Agents

For each repository, spawn a Task agent with:

You are a Phase 1 Repo Agent for the LLM Judge evaluation.

**Your Repo:** $REPO_LABEL at $REPO_PATH
**Spec Document:**
$SPEC_CONTENT

**Instructions:** Read @beagle:llm-judge references/repo-agent.md

Gather facts and return a JSON object following the schema in references/fact-schema.md.

Load @beagle:llm-artifacts-detection for dead code and overengineering analysis.

Return ONLY valid JSON, no markdown or explanations.

Collect all repo-agent outputs into ALL_FACTS.

Phase 2: Spawning Judge Agents

After all Phase 1 agents complete, spawn 5 judge agents, one per dimension:

You are the $DIMENSION Judge for the LLM Judge evaluation.

**Spec Document:**
$SPEC_CONTENT

**Facts from all repos:**
$ALL_FACTS_JSON

**Instructions:** Read @beagle:llm-judge references/judge-agents.md

Score each repo on $DIMENSION using the rubric in references/scoring-rubrics.md.

Return ONLY valid JSON following the judge output schema.

Aggregation

  1. Collect the five judge outputs.
  2. Compute each repo's weighted total with the configured weights.
  3. Rank repos by weighted total in descending order.
  4. Generate a verdict that explains the result and any close calls.
  5. Write .beagle/llm-judge-report.json.

Output

Display a markdown summary with scores, ranking, verdict, and detailed justifications.

Verification

Before completing (maps to Hard gates D and E):

  1. Gate D: .beagle/llm-judge-report.json exists and json.load succeeds.
  2. Gate E / completeness: Every repo label has scores for every dimension; each weighted_total equals the sum over dimensions of (score × weight / 100) using the configured weights; markdown summary matches the JSON report.
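
The completeness and arithmetic parts of this check can be automated against the loaded report. The sketch below assumes the report stores `weights`, `scores`, and `ranking` under those key names (an assumption, as is the `check_report_consistency` name); comparing the markdown summary to the report is not covered.

```python
from typing import Any, Dict, List


def check_report_consistency(report: Dict[str, Any], dimensions: List[str]) -> List[str]:
    """Return problems found; an empty list means the checks pass."""
    problems = []
    weights = report["weights"]
    for label, entry in report["scores"].items():
        missing = [dim for dim in dimensions if dim not in entry]
        if missing:
            problems.append(f"{label}: missing scores for {missing}")
            continue
        # Recompute the weighted total from the stored scores and weights.
        expected = round(
            sum(entry[dim]["score"] * weights[dim] / 100 for dim in dimensions), 2
        )
        actual = entry.get("weighted_total")
        if actual is None or round(actual, 2) != expected:
            problems.append(f"{label}: weighted_total {actual} != expected {expected}")
    # The ranking must list totals in non-increasing order.
    totals = [report["scores"][label].get("weighted_total", 0) for label in report["ranking"]]
    if any(earlier < later for earlier, later in zip(totals, totals[1:])):
        problems.append("ranking is not in descending order of weighted_total")
    return problems
```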

Rules

  • Always validate inputs before proceeding
  • Spawn Phase 1 agents in parallel, then wait before Phase 2
  • Spawn Phase 2 agents in parallel, one per dimension
  • Every score must have a justification
  • Write the JSON report before displaying the summary
Usage Guidance
This skill appears coherent and matches its stated purpose, but it will run repository-local commands (git, tests) and load another analysis skill (beagle-core:llm-artifacts-detection). Before using:

  1. Run it against repositories in a sandbox, container, or CI runner to avoid executing untrusted code on your workstation.
  2. Inspect the repositories' test suites and package scripts for network or destructive actions.
  3. Ensure the other referenced skill(s) are trusted.
  4. Be aware that it will write .beagle/llm-judge-report.json locally.

If you need stronger guarantees, limit network access and secrets availability in the execution environment.
Capability Analysis
Type: OpenClaw Skill · Name: llm-judge · Version: 1.0.3

The llm-judge skill implements a structured, multi-phase agentic workflow to evaluate and rank multiple code repositories against a requirements specification. It utilizes a Phase 1 fact-gathering stage (repo-agent.md) and a Phase 2 scoring stage (judge-agents.md) with defined rubrics (scoring-rubrics.md) and JSON schemas (fact-schema.md). While the skill executes shell commands to run test suites (e.g., pytest, npm test) and git operations, these actions are strictly aligned with its stated purpose of code analysis and quality assessment, with no evidence of malicious intent, data exfiltration, or unauthorized persistence.
Capability Assessment
Purpose & Capability
Name/description (compare codebases vs a spec) matches the declared behavior: reading a spec, inspecting repos, running git and tests, spawning fact/judge agents, and producing a JSON report. No unrelated environment variables, binaries, or third-party credentials are requested.
Instruction Scope
Instructions intentionally instruct agents to read repo files, run git commands, and execute tests (pytest/npm/go test). This is within scope for judging implementations, but running tests and executing repository code can execute arbitrary code or network calls — an operational risk that is expected but should be mitigated by running in an isolated environment.
Install Mechanism
No install spec and no code files beyond instructions and reference docs. No downloads or package installs are requested by the skill itself.
Credentials
The skill declares no required environment variables, credentials, or config paths. The referenced operations (git, reading files, running tests) are proportional to the task and don't request unrelated secrets.
Persistence & Privilege
always is false and disable-model-invocation is set (skill cannot autonomously invoke the model), so it does not request elevated persistent privileges. It will write a local report file (.beagle/llm-judge-report.json) as part of normal operation.
How to Use
  1. Make sure OpenClaw is installed (local or Docker)
  2. Run the install command in chat: /install llm-judge
  3. After installation, invoke the skill by name or use /llm-judge
  4. Provide required inputs per the skill's parameter spec and get structured output
Version History
v1.0.3
- Added an explicit "hard gates" section defining concrete, checkable pass conditions between workflow phases (inputs, Phase 1 facts, Phase 2 scores, report file, and result consistency).
- Clarified that each phase must not proceed until the corresponding gate is passed, improving validation and reliability.
- Emphasized that parallelism is allowed only within phases, enforcing strict sequencing between them.
- Updated workflow description to match these new gating and sequencing requirements.
v1.0.2
- Expanded skill description to specify trigger phrases, intended use cases, and clear exclusions (e.g., not for single codebase review or strategy docs).
- Description now clarifies the need for a spec file and 2+ repo paths for operation.
- No changes to workflow or technical steps; all changes are to the skill metadata/documentation.
v1.0.1
- Major rewrite of SKILL.md: added explicit, step-by-step CLI workflow and robust argument validation for automated use.
- Improved instructions for input parsing, validation, repo/spec loading, and error handling.
- Clarified and detailed the multi-phase agent interaction and JSON validation requirements.
- Enhanced reporting: new output verification, explicit JSON report fields, and summary rendering.
- Set disable-model-invocation: true in skill metadata for safety and reproducibility.
- Retained scoring criteria, rubrics, and reference documentation with clearer separation of phases.
v1.0.0
- Initial release of llm-judge skill.
- Enables automated, rubric-based comparison of code implementations across repositories.
- Implements a two-phase agent process: fact gathering and dimension-specific judging.
- Evaluates repositories on functionality, security, test quality, overengineering, and dead code with weighted scoring.
- Generates both JSON and markdown summary reports for results.
Metadata
Slug llm-judge
Version 1.0.3
License MIT-0
All-time Installs 1
Active Installs 1
Total Versions 4
Frequently Asked Questions

What is Llm Judge?

Llm Judge is an AI agent skill for Claude Code / OpenClaw that compares two or more code implementations against a spec or requirements doc. It triggers on phrases such as "which repo is better" and "compare these implementations", and has 278 downloads so far.

How do I install Llm Judge?

Run "/install llm-judge" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is Llm Judge free?

Yes, Llm Judge is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does Llm Judge support?

Llm Judge is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created Llm Judge?

It is built and maintained by Kevin Anderson (@anderskev); the current version is v1.0.3.
