Description

Use when creating or updating test judgement definitions (judge_definitions) for an agent skill evaluation YAML config. Analyzes a skill's SKILL.md and refer...

Usage (SKILL.md)

Generate Judgements for Skill Evaluation

Name: Generate Judgements
Author: panlm

Analyze a skill's source files and produce fine-grained judge_definitions for the mlflow-skills automated evaluation framework. Each judgement is a yes/no question that an LLM judge answers by reading the execution trace.

Prerequisites

Access to the target skill directory (must contain SKILL.md)
Familiarity with the mlflow-skills YAML config format (see references/yaml-config-spec.md)

Workflow

digraph generate_judgements {
  rankdir=TB;
  node [shape=box];

  collect [label="Phase 1\nCollect & Analyze Skill Files"];
  infer [label="Phase 2\nInfer Scopes"];
  confirm_scope [label="User confirms scopes" shape=diamond];
  generate [label="Phase 3\nGenerate Judgements per Scope"];
  present [label="Phase 4\nPresent to User"];
  confirm_judge [label="User approves?" shape=diamond];
  write [label="Phase 5\nWrite / Update YAML"];

  collect -> infer;
  infer -> confirm_scope;
  confirm_scope -> generate [label="approved"];
  confirm_scope -> infer [label="revise"];
  generate -> present;
  present -> confirm_judge;
  confirm_judge -> write [label="approved"];
  confirm_judge -> generate [label="revise"];
}

Phase 1: Collect and Analyze Skill Files

Ask the user for two inputs (or auto-detect them):

Skill directory path — the folder containing SKILL.md
Existing test config YAML path (optional) — if provided, the tool will update its judge_definitions section instead of creating a new file

Then read all available files in this order:

| Priority | File | Purpose |
|----------|------|---------|
| 1 | `SKILL.md` | Primary source: workflow steps, behavior rules, output format |
| 2 | `references/*` | Supporting details: templates, CLI commands, query patterns |
| 3 | `README.md` / `README_CN.md` | Additional context: scope boundaries, limitations |
| 4 | Existing test config YAML | Understand current judgements to avoid duplication |

While reading, extract and note:

Workflow steps — numbered steps the skill must follow
Behavior rules — "must", "always", "never", "do not" directives
Output format requirements — file naming, sections, tables, mandatory fields
Conditional branches — if/else paths that lead to different outputs
Important guidelines — the "Important Guidelines" or similar section at the end
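As a rough illustration of the "Behavior rules" extraction above, the sketch below scans Markdown text for lines containing one of the four directive keywords. The function name `extract_directive_lines` and the sample text are hypothetical; the skill itself performs this step by reading the files, not by running a script.

```python
import re

# Directive keywords taken from the "Behavior rules" item above.
DIRECTIVE_PATTERN = re.compile(r"\b(must|always|never|do not)\b", re.IGNORECASE)


def extract_directive_lines(markdown_text: str) -> list[str]:
    """Return the non-empty lines that contain a behavior-rule directive."""
    hits = []
    for raw_line in markdown_text.splitlines():
        line = raw_line.strip()
        if line and DIRECTIVE_PATTERN.search(line):
            hits.append(line)
    return hits


# Hypothetical SKILL.md fragment, used only to exercise the function.
sample = """\
## Workflow
1. Search the docs sequentially.
2. You must read every page returned.
Never call tools in parallel.
Write the checklist file.
"""
directives = extract_directive_lines(sample)
for line in directives:
    print(line)
```

Each line found this way is a candidate for one judgement; a keyword match only flags the line, so a human or the agent still has to decide whether it is testable.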

Phase 2: Infer Scopes

Analyze the skill for distinct execution paths that produce different outputs or follow different logic. Each distinct path becomes a scope.

How to identify scopes:

Look for conditional branches in the workflow (e.g., "If X → do A; else → do B")
Look for optional steps (e.g., "Only execute this step if...")
Look for different output modes (e.g., "checklist only" vs "assessment report")

Scope naming rules:

Use lowercase, single-word or hyphenated names: checklist, assessment, research
The scope all is reserved — it means "always run regardless of test_scope"
Every skill has at least the implicit all scope for common/shared behavior
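The naming rules above can be checked mechanically. The following is a minimal sketch, assuming digits are acceptable after the first letter (the rules do not say either way); `check_branch_scopes` is a hypothetical helper for validating the branch-specific scopes, with `all` rejected because it is reserved for shared behavior.

```python
import re

# Lowercase, single-word or hyphenated. Allowing digits is an assumption.
SCOPE_NAME = re.compile(r"[a-z][a-z0-9]*(?:-[a-z0-9]+)*")
RESERVED_SCOPE = "all"


def check_branch_scopes(names: list[str]) -> list[str]:
    """Return a list of human-readable problems; an empty list means all names pass."""
    problems = []
    seen = set()
    for name in names:
        if name == RESERVED_SCOPE:
            problems.append(f"{name!r} is reserved for shared behavior")
        elif not SCOPE_NAME.fullmatch(name):
            problems.append(f"{name!r} is not lowercase single-word or hyphenated")
        elif name in seen:
            problems.append(f"{name!r} is listed twice")
        seen.add(name)
    return problems


problems = check_branch_scopes(["checklist", "assessment", "Live_Assessment", "all"])
for problem in problems:
    print(problem)
```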

Present inferred scopes to the user with a brief description of each:

I found the following execution branches in this skill:

1. `all` — Common behavior shared across all paths
   (skill loading, doc search, categorization, source annotations)

2. `checklist` — Checklist-only output path
   (no live resource, generates checklist file, offers next steps)

3. `assessment` — Live assessment path
   (runs AWS CLI, generates assessment report, no separate checklist)

Does this look right? Should I add, remove, or rename any scope?

Wait for user confirmation before proceeding.

Phase 3: Generate Judgements

For each confirmed scope, generate fine-grained judge_definitions. Follow these rules:

3.1 Granularity Principle

One check point per judgement. Each judgement tests exactly ONE behavior or requirement.

# GOOD — one specific check
- name: sequential-mcp-calls
  scope: all
  question: >
    Check that MCP tool calls were executed sequentially...

# BAD — multiple checks crammed into one
- name: workflow-correct
  scope: all
  question: >
    Check that the agent searched docs sequentially, read pages,
    extracted items into 5 categories, and wrote the file...

3.2 Judgement Categories

Generate judgements in this order, for each scope:

Category A: Skill Loading & Invocation (scope: all)

Was the skill loaded (SKILL.md read)?
Were reference files read when needed?

Category B: Workflow Behavior (scope: all or scope-specific)

Did each workflow step execute correctly?
Were sequential/parallel execution rules followed?
Were error handling / retry rules followed?
Were conditional branches taken correctly?

Category C: Output Quality (scope: all or scope-specific)

Does the output contain all mandatory sections/categories?
Does the output follow the naming convention?
Does the output include required metadata (source annotations, IDs, etc.)?
Are quantities within expected ranges?

Category D: Scope-Specific Behavior (per non-all scope)

What is unique to this execution path?
What should NOT happen in this path? (negative checks)
What additional output/actions are expected?

Category E: Guidelines Compliance (scope: all)

Are "always/never/must" directives respected?
Is the output language correct?
Are edge cases handled?

3.3 Naming Convention

Use kebab-case names that describe the check:

skill-invoked              — skill was loaded
sequential-mcp-calls       — tool calls are sequential
doc-search-coverage        — search queries cover required topics
five-categories-complete   — output has all 5 categories
file-naming-convention     — output file name matches pattern
aws-cli-commands-executed  — CLI commands were run
no-separate-checklist-file — negative check: no extra file
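One way to arrive at names like those above is to derive them from a short description of the check. This is only a sketch: `to_kebab_case` is a hypothetical helper, and the input phrases are illustrative.

```python
import re


def to_kebab_case(description: str) -> str:
    """Lowercase the alphanumeric words of a description and join them with hyphens."""
    words = re.findall(r"[A-Za-z0-9]+", description)
    return "-".join(word.lower() for word in words)


print(to_kebab_case("Sequential MCP calls"))
print(to_kebab_case("No separate checklist file"))
```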

3.4 Question Writing Rules

Each question field must be a self-contained instruction for the LLM judge. Follow the patterns in references/judgement-patterns.md.

Required elements in every question:

What to check — "Check that..." or "Verify that..."
Where to look — "Look in the trace for...", "Look for tool calls..."
Success criteria — specific, measurable condition for answering "yes"
Leniency guidance (when appropriate) — "Be lenient...", "Answer 'yes' if at least..."

Important clarifications to include when relevant:

Distinguish between parallel tool CALLS vs batched requests in one call
Specify minimum thresholds (e.g., "at least 4 of 5", "roughly 30-50")
Clarify what counts (e.g., "each URL in the requests array counts as a separate page")
State default answer when evidence is ambiguous (e.g., "benefit of the doubt → yes")
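The required elements above lend themselves to a quick lint pass before presenting judgements to the user. The sketch below is a heuristic, not part of the skill: `lint_question` is a hypothetical helper, and the string checks (an opener, a "look" evidence pointer, an explicit "yes" condition) are assumptions about how those elements usually surface in the wording.

```python
def lint_question(question: str) -> list[str]:
    """Return the required elements that appear to be missing from a judge question."""
    text = " ".join(question.split())  # collapse YAML folded-scalar whitespace
    lowered = text.lower()
    problems = []
    if not lowered.startswith(("check that", "verify that")):
        problems.append("what to check: no 'Check that...' or 'Verify that...' opener")
    if "look " not in lowered:
        problems.append("where to look: no 'Look in the trace for...' style pointer")
    if "yes" not in lowered:
        problems.append("success criteria: no explicit condition for answering 'yes'")
    return problems


good = (
    "Check that MCP tool calls were executed sequentially. "
    "Look in the trace for tool calls with overlapping timestamps. "
    "Answer 'yes' if no two MCP calls overlap."
)
bad = "Did the workflow run correctly?"

print(lint_question(good))
print(lint_question(bad))
```

A clean lint result does not mean the question is good; it only shows that none of the three elements is obviously absent.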

3.5 Negative Checks

For each scope, also generate negative judgements — things that should NOT happen:

In checklist scope: assessment-only artifacts should NOT appear
In assessment scope: checklist-only artifacts should NOT appear
Across all scopes: forbidden behaviors (e.g., parallel calls when sequential is required)

Phase 4: Present Judgements to User

Present the generated judgements grouped by scope with clear section headers:

## Generated Judgements

### Scope: all (7 judgements)
| # | Name | Check |
|---|------|-------|
| 1 | skill-invoked | Skill was loaded from .claude/skills/ |
| 2 | sequential-mcp-calls | MCP calls are sequential, not parallel |
| ... | ... | ... |

### Scope: checklist (2 judgements)
| # | Name | Check |
|---|------|-------|
| 1 | file-naming-convention | Output file follows naming pattern |
| ... | ... | ... |

### Scope: assessment (8 judgements)
| ... | ... | ... |

Total: 17 judgements across 3 scopes.

Does this look right? Should I add, remove, or modify any judgement?

Wait for user confirmation. Iterate if the user requests changes.
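The grouped presentation above can be produced from a flat list of judgements. The following is a minimal sketch under stated assumptions: each judgement is a dict with `name` and `scope`, plus a hypothetical `summary` field holding the one-line text for the Check column (the YAML examples in this document only show `name`, `scope`, and `question`).

```python
from collections import defaultdict


def render_summary(judgements: list[dict]) -> str:
    """Render judgements as one Markdown table per scope, with 'all' listed first."""
    by_scope = defaultdict(list)
    for judgement in judgements:
        by_scope[judgement["scope"]].append(judgement)

    # Stable sort: 'all' moves to the front, other scopes keep first-seen order.
    ordered = sorted(by_scope, key=lambda scope: scope != "all")

    lines = ["## Generated Judgements", ""]
    for scope in ordered:
        items = by_scope[scope]
        lines.append(f"### Scope: {scope} ({len(items)} judgements)")
        lines.append("| # | Name | Check |")
        lines.append("|---|------|-------|")
        for index, item in enumerate(items, start=1):
            lines.append(f"| {index} | {item['name']} | {item['summary']} |")
        lines.append("")
    lines.append(f"Total: {len(judgements)} judgements across {len(ordered)} scopes.")
    return "\n".join(lines)


# Illustrative data only; 'summary' is a hypothetical field.
demo = [
    {"name": "file-naming-convention", "scope": "checklist",
     "summary": "Output file follows naming pattern"},
    {"name": "skill-invoked", "scope": "all",
     "summary": "Skill was loaded from .claude/skills/"},
    {"name": "sequential-mcp-calls", "scope": "all",
     "summary": "MCP calls are sequential, not parallel"},
]
report = render_summary(demo)
print(report)
```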

Phase 5: Write / Update YAML

Once approved, write the output:

If an existing YAML config was provided:

Replace only the judge_definitions: section. Preserve all other fields (name, prompt, skills, timeout_seconds, environment, etc.) exactly as they are.

Add the standard scope comment block above the judge_definitions: key:

# ==============================================================
# Judge Definitions
#
# scope values:
#   all        — runs in all test scenarios
#   {scope1}   — only when test_scope={scope1}
#   {scope2}   — only when test_scope={scope2}
# ==============================================================
judge_definitions:
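To illustrate "replace only the judge_definitions: section", the sketch below splices a new section into the config as plain text, since a typical load-and-dump round trip through a YAML parser can drop comments and reorder or reformat the other fields. `replace_judge_definitions` is a hypothetical helper. It assumes `judge_definitions:` is a top-level key at column 0 and that the section ends at the next unindented key; it does not deduplicate an existing scope comment block, and any column-0 comments inside the old section are replaced along with it.

```python
def replace_judge_definitions(config_text: str, new_section: str) -> str:
    """Swap the top-level judge_definitions block, leaving every other line untouched."""
    lines = config_text.splitlines()
    new_lines = new_section.rstrip("\n").splitlines()

    start = next(
        (i for i, line in enumerate(lines) if line.startswith("judge_definitions:")),
        None,
    )
    if start is None:
        # No existing section: append the new one at the end of the file.
        return "\n".join(lines + [""] + new_lines) + "\n"

    # The section runs until the next top-level key (an unindented line that is
    # not a list item or a comment) or until the end of the file.
    end = len(lines)
    for i in range(start + 1, len(lines)):
        line = lines[i]
        if line and not line[0].isspace() and not line.startswith(("-", "#")):
            end = i
            break

    return "\n".join(lines[:start] + new_lines + lines[end:]) + "\n"


# Illustrative config; field values are made up.
original = (
    "name: demo\n"
    "judge_definitions:\n"
    "  - name: old\n"
    "    scope: all\n"
    "timeout_seconds: 60\n"
)
replacement = (
    "judge_definitions:\n"
    "  - name: skill-invoked\n"
    "    scope: all\n"
)
updated = replace_judge_definitions(original, replacement)
print(updated)
```

Reviewing the resulting diff before writing the file is still advisable, because the boundary detection is deliberately simple.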

If no existing YAML was provided:

Generate a complete YAML config file. Ask the user for:

name — test run name
project_dir — temp project directory name
prompt — default prompt for the test
test_scope — default scope to use

Use sensible defaults from the skill directory name for the rest. See references/yaml-config-spec.md for the full config structure.

File naming: {skill-name}.yaml placed in the appropriate tests/configs/ directory.

After writing, inform the user of the file path and remind them:

They can override test_scope and prompt from the CLI
Empty environment values won't override existing env vars
Judges with scope: all always run
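The scope semantics in the last reminder can be summarized in a few lines. This is a sketch of the rule as described in this document, not the framework's actual implementation, which is not shown here; `select_judges` is a hypothetical name.

```python
def select_judges(judge_definitions: list[dict], test_scope: str) -> list[dict]:
    """Keep judges whose scope is 'all' or matches the requested test_scope."""
    return [
        judge for judge in judge_definitions
        if judge["scope"] in ("all", test_scope)
    ]


judges = [
    {"name": "skill-invoked", "scope": "all"},
    {"name": "file-naming-convention", "scope": "checklist"},
    {"name": "aws-cli-commands-executed", "scope": "assessment"},
]
selected = [judge["name"] for judge in select_judges(judges, "checklist")]
print(selected)
```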

Important Guidelines

Be exhaustive: Extract every testable behavior from the skill. It's better to have too many judgements than to miss an important check. The user can always remove extras.
One point per judgement: Never combine multiple checks. If you're tempted to use "and" in a question, split it into two judgements.
Write for an LLM judge: The question will be answered by an LLM reading a raw MLflow trace (JSON with tool calls and responses). Be explicit about where to find evidence in the trace.
Include thresholds: When the skill specifies numbers (e.g., "5 search queries", "30-50 items", "at least 3 per category"), encode those in the judgement.
Respect language: Write judgement questions in English (they are consumed by an LLM judge, not shown to end users). But interact with the user in their language.
Preserve existing work: When updating an existing YAML, review current judgements first. Keep well-written ones, improve weak ones, add missing ones.

Security Recommendations

This looks safe for its stated purpose. Use it only on the intended skill directory, review any YAML diff before approving writes, and do not store real API keys in generated config files. The supplied SKILL.md excerpt is marked truncated, so review the complete SKILL.md if available before relying on this assessment.

Functional Analysis

Type: OpenClaw Skill
Name: generate-judgements
Version: 1.0.0

The skill bundle is a developer utility designed to automate the creation of test judgement definitions for the mlflow-skills evaluation framework. Its workflow involves reading local skill source files (SKILL.md, README.md), inferring execution scopes, and generating structured YAML configuration files. The instructions in SKILL.md and the reference documents (judgement-patterns.md, yaml-config-spec.md) are consistent with the stated purpose and do not contain any indicators of malicious intent, data exfiltration, or unauthorized command execution.

Capability Tags

requires-sensitive-credentials

Capability Assessment

✓ Purpose & Capability

The stated purpose is to generate or update judge_definitions by reading a target skill's SKILL.md and related reference files; the visible workflow matches that purpose.

ℹ Instruction Scope

The skill may write or update a YAML config, but the visible workflow gates this on user review and approval.

✓ Install Mechanism

There is no install spec, no code, no required binaries, and no required environment variables; the static scanner reported no findings.

ℹ Credentials

A credential capability signal appears tied to empty OpenAI environment placeholders in the YAML config reference, not to credential collection by this skill.

✓ Persistence & Privilege

No background execution, autonomous persistence, privileged install path, session-store access, or ongoing agent behavior is shown.

Version History

v1.0.0

Initial release of the “generate-judgements” skill for test config automation.

- Analyzes a skill’s SKILL.md and reference files to extract workflow steps, output rules, and behavior requirements.
- Infers execution “scopes” (distinct behavioral paths) with user confirmation.
- Generates detailed, one-per-requirement judge definitions per scope, following standardized naming and clarity guidelines.
- Groups judgements by category (loading, workflow, output, scope-specific, compliance), including negative checks.
- Interactive workflow ensures user review before finalizing judgements.

Metadata

Slug generate-judgements

Version 1.0.0

License MIT-0

Total installs 0

Current installs 0

Historical versions 1

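FAQ

What is Generate Judgements?

Use when creating or updating test judgement definitions (judge_definitions) for an agent skill evaluation YAML config. Analyzes a skill's SKILL.md and refer... It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 30 downloads to date.

How do I install Generate Judgements?

Run the command "/install generate-judgements" in the OpenClaw or Claude Code chat box to install it in one step; no additional configuration is needed.

Is Generate Judgements free?

Yes. Generate Judgements is completely free and is released under the MIT-0 license, so it can be downloaded, installed, and used freely.

Which platforms does Generate Judgements support?

Generate Judgements is cross-platform and can be used in any environment where OpenClaw / Claude Code is deployed.

Who developed Generate Judgements?

It is developed and maintained by panlm (@panlm); the current version is v1.0.0.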
