
agentic-eval

Name: agentic-eval
Author: boleyn

By santian · v1.0.0 · MIT-0

cross-platform ✓ Security check passed

Total downloads: 403

Install in OpenClaw

/install agentic-eval

Description

Patterns and techniques for evaluating and improving AI agent outputs. Use this skill when: - Implementing self-critique and reflection loops - Building eval...

Usage (SKILL.md)

Agentic Evaluation Patterns

Patterns for self-improvement through iterative evaluation and refinement.

Overview

Evaluation patterns enable agents to assess and improve their own outputs, moving beyond single-shot generation to iterative refinement loops.

Generate → Evaluate → Critique → Refine → Output
    ↑                              │
    └──────────────────────────────┘

When to Use

Quality-critical generation: Code, reports, analysis requiring high accuracy
Tasks with clear evaluation criteria: Defined success metrics exist
Content requiring specific standards: Style guides, compliance, formatting

Pattern 1: Basic Reflection

Agent evaluates and improves its own output through self-critique.

import json

# `llm` is assumed to be a user-supplied helper: llm(prompt: str) -> str.

def reflect_and_refine(task: str, criteria: list[str], max_iterations: int = 3) -> str:
    """Generate with reflection loop."""
    output = llm(f"Complete this task:\n{task}")

    for _ in range(max_iterations):
        # Self-critique: ask for the exact JSON shape the parsing below relies on
        critique = llm(f"""
        Evaluate this output against criteria: {criteria}
        Output: {output}
        Rate each criterion PASS/FAIL with feedback. Return JSON shaped as
        {{"<criterion>": {{"status": "PASS" or "FAIL", "feedback": "..."}}}}.
        """)

        critique_data = json.loads(critique)
        all_pass = all(c["status"] == "PASS" for c in critique_data.values())
        if all_pass:
            return output

        # Refine based on the feedback for the failed criteria only
        failed = {k: v["feedback"] for k, v in critique_data.items() if v["status"] == "FAIL"}
        output = llm(f"Improve to address: {failed}\nOriginal: {output}")

    # Iteration budget exhausted: return the latest attempt
    return output

Key insight: Use structured JSON output for reliable parsing of critique results.
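Model replies are not guaranteed to be clean JSON, so the parsing step is worth isolating. The following is a minimal sketch, not part of the original skill: `parse_critique` is a hypothetical helper, and it assumes the critique is a JSON object mapping each criterion to a `status` and `feedback` entry, as in Pattern 1.

```python
import json


def parse_critique(raw: str) -> dict[str, dict[str, str]]:
    """Parse a critique reply into {criterion: {"status", "feedback"}}.

    Raises ValueError when the reply is unusable, so the caller can retry
    the evaluation or stop the loop instead of crashing mid-iteration.
    """
    # Models often surround the JSON with prose or a Markdown fence;
    # keep only the outermost {...} span before parsing.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end < start:
        raise ValueError("critique contains no JSON object")
    try:
        data = json.loads(raw[start : end + 1])
    except json.JSONDecodeError as exc:
        raise ValueError(f"critique is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("critique must be a JSON object")

    parsed: dict[str, dict[str, str]] = {}
    for criterion, entry in data.items():
        if not isinstance(entry, dict):
            raise ValueError(f"entry for {criterion!r} must be an object")
        # Normalize case so "pass" and "PASS" are treated alike.
        status = str(entry.get("status", "")).upper()
        if status not in {"PASS", "FAIL"}:
            raise ValueError(f"unexpected status for {criterion!r}: {status!r}")
        parsed[criterion] = {
            "status": status,
            "feedback": str(entry.get("feedback", "")),
        }
    return parsed
```

In the reflection loop, a call such as `parse_critique(critique)` could stand in for the bare `json.loads(critique)`, with a `ValueError` treated as a failed evaluation rather than an unhandled exception.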

Pattern 2: Evaluator-Optimizer

Separate generation and evaluation into distinct components for clearer responsibilities.

class EvaluatorOptimizer:
    def __init__(self, score_threshold: float = 0.8):
        self.score_threshold = score_threshold
    
    def generate(self, task: str) -> str:
        return llm(f"Complete: {task}")
    
    def evaluate(self, output: str, task: str) -> dict:
        return json.loads(llm(f"""
        Evaluate output for task: {task}
        Output: {output}
        Return JSON: {{"overall_score": 0-1, "dimensions": {{"accuracy": ..., "clarity": ...}}}}
        """))
    
    def optimize(self, output: str, feedback: dict) -> str:
        return llm(f"Improve based on feedback: {feedback}\nOutput: {output}")
    
    def run(self, task: str, max_iterations: int = 3) -> str:
        output = self.generate(task)
        for _ in range(max_iterations):
            evaluation = self.evaluate(output, task)
            if evaluation["overall_score"] >= self.score_threshold:
                break
            output = self.optimize(output, evaluation)
        return output

Pattern 3: Code-Specific Reflection

Test-driven refinement loop for code generation.

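# `run_tests` is assumed to be a user-supplied helper that executes the tests
# against the code and returns {"success": bool, "error": str}.
class CodeReflector:
    def reflect_and_fix(self, spec: str, max_iterations: int = 3) -> str:
        code = llm(f"Write Python code for: {spec}")
        tests = llm(f"Generate pytest tests for: {spec}\nCode: {code}")

        for _ in range(max_iterations):
            result = run_tests(code, tests)
            if result["success"]:
                return code
            # Feed the failure output back so the next attempt targets it
            code = llm(f"Fix error: {result['error']}\nCode: {code}")
        return code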
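The loop above leaves `run_tests` undefined. One possible shape for it is sketched below, under several assumptions that are not in the original skill: code and tests are written into a single temporary file so the tests can call the generated functions directly, pytest is installed and invoked through the `command` parameter (a hypothetical knob added here for illustration), and a timeout bounds runaway code. A subprocess with a timeout is not a sandbox; generated code should still run in an isolated environment without access to secrets.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_tests(
    code: str,
    tests: str,
    timeout: float = 30.0,
    command: tuple[str, ...] = (sys.executable, "-m", "pytest", "-q"),
) -> dict:
    """Run generated tests against generated code in a child process.

    Returns {"success": bool, "error": str}, where "error" carries the
    combined stdout/stderr of the run (or a timeout message).
    """
    with tempfile.TemporaryDirectory() as tmp:
        # The test_*.py name lets pytest collect the file; code and tests
        # share one module so no import path setup is needed.
        path = Path(tmp) / "test_generated.py"
        path.write_text(code + "\n\n" + tests, encoding="utf-8")
        try:
            proc = subprocess.run(
                [*command, str(path)],
                cwd=tmp,
                capture_output=True,
                text=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            # subprocess.run kills the child before raising.
            return {"success": False, "error": f"timed out after {timeout}s"}
        return {
            "success": proc.returncode == 0,
            "error": proc.stdout + proc.stderr,
        }
```

Passing `command=(sys.executable,)` runs the file as a plain script instead, which is enough when the generated tests are bare `assert` statements.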

Evaluation Strategies

Outcome-Based

Evaluate whether output achieves the expected result.

def evaluate_outcome(task: str, output: str, expected: str) -> str:
    return llm(f"Does output achieve expected outcome? Task: {task}, Expected: {expected}, Output: {output}")

LLM-as-Judge

Use LLM to compare and rank outputs.

def llm_judge(output_a: str, output_b: str, criteria: str) -> str:
    prompt = f"Compare outputs A and B for {criteria}. Which is better and why?\nA: {output_a}\nB: {output_b}"
    return llm(prompt)

Rubric-Based

Score outputs against weighted dimensions.

RUBRIC = {
    "accuracy": {"weight": 0.4},
    "clarity": {"weight": 0.3},
    "completeness": {"weight": 0.3}
}

def evaluate_with_rubric(output: str, rubric: dict) -> float:
    # Ratings are 1-5 per dimension; dividing by 5 scales the weighted sum to at most 1
    scores = json.loads(llm(f"Rate 1-5 for each dimension as JSON: {list(rubric.keys())}\nOutput: {output}"))
    return sum(scores[d] * rubric[d]["weight"] for d in rubric) / 5
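The weighted-sum arithmetic can be checked without calling a model. The sketch below is illustrative only: `weighted_score` is a hypothetical helper that takes already-parsed ratings, and the validation rules (weights summing to 1, ratings between 1 and `max_rating`) are assumptions added here rather than requirements stated by the skill.

```python
import math


def weighted_score(
    scores: dict[str, float],
    rubric: dict[str, dict[str, float]],
    max_rating: float = 5.0,
) -> float:
    """Combine per-dimension ratings into one normalized score."""
    missing = set(rubric) - set(scores)
    if missing:
        raise ValueError(f"missing ratings for: {sorted(missing)}")
    total_weight = sum(spec["weight"] for spec in rubric.values())
    if not math.isclose(total_weight, 1.0):
        raise ValueError(f"weights sum to {total_weight}, expected 1.0")
    for dimension in rubric:
        if not 1 <= scores[dimension] <= max_rating:
            raise ValueError(f"rating for {dimension!r} is out of range")
    # Weighted sum of ratings, divided by the top rating to normalize.
    weighted = sum(scores[d] * rubric[d]["weight"] for d in rubric)
    return weighted / max_rating


rubric = {
    "accuracy": {"weight": 0.4},
    "clarity": {"weight": 0.3},
    "completeness": {"weight": 0.3},
}
ratings = {"accuracy": 5, "clarity": 4, "completeness": 3}
# (5*0.4 + 4*0.3 + 3*0.3) / 5 = 4.1 / 5
print(round(weighted_score(ratings, rubric), 2))  # prints 0.82
```

Because the lowest rating is 1, the normalized score ranges from 0.2 to 1.0, which is worth keeping in mind when choosing a threshold.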

Best Practices

Practice	Rationale
Clear criteria	Define specific, measurable evaluation criteria upfront
Iteration limits	Set max iterations (3-5) to prevent infinite loops
Convergence check	Stop if output score isn't improving between iterations
Log history	Keep full trajectory for debugging and analysis
Structured output	Use JSON for reliable parsing of evaluation results
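The iteration limit, convergence check, and history log from the table can be combined in one loop. This is a rough sketch under stated assumptions: `refine_until_converged` and its `min_improvement` parameter are hypothetical names, and the generate, evaluate, and optimize steps are passed in as plain callables so the loop can be exercised without a model.

```python
from typing import Callable


def refine_until_converged(
    generate: Callable[[], str],
    evaluate: Callable[[str], float],
    optimize: Callable[[str, float], str],
    threshold: float = 0.8,
    max_iterations: int = 3,
    min_improvement: float = 0.01,
) -> tuple[str, list[dict]]:
    """Refine until good enough, stalled, or out of iterations.

    Returns the best-scoring output and the full trajectory.
    """
    output = generate()
    best_output, best_score = output, float("-inf")
    history: list[dict] = []

    for iteration in range(max_iterations):
        score = evaluate(output)
        # Log every attempt, including ones that score worse.
        history.append({"iteration": iteration, "score": score, "output": output})

        # Converged when the gain over the best score so far is too small.
        improved = score - best_score >= min_improvement
        if score > best_score:
            best_output, best_score = output, score
        if score >= threshold or not improved:
            break
        output = optimize(output, score)

    return best_output, history
```

Returning the best output rather than the last one guards against a refinement step that makes things worse, and the history list gives the trajectory to inspect when a run stalls.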

Quick Start Checklist

## Evaluation Implementation Checklist

### Setup
- [ ] Define evaluation criteria/rubric
- [ ] Set score threshold for "good enough"
- [ ] Configure max iterations (default: 3)

### Implementation
- [ ] Implement generate() function
- [ ] Implement evaluate() function with structured output
- [ ] Implement optimize() function
- [ ] Wire up the refinement loop

### Safety
- [ ] Add convergence detection
- [ ] Log all iterations for debugging
- [ ] Handle evaluation parse failures gracefully

Safe usage recommendations

This skill appears coherent and benign, but it prescribes executing generated code and running tests as part of refinement loops. Before deploying: (1) run generated code and tests inside a sandbox or isolated CI environment to avoid executing untrusted code; (2) enforce iteration limits, convergence checks, and parsing and validation of LLM JSON outputs to avoid infinite loops or malformed feedback; (3) treat LLM evaluations as fallible (an LLM-as-judge can hallucinate) and add human review for critical decisions; (4) ensure the agent or environment executing tests has no access to secrets or sensitive systems. If you want to restrict autonomous behavior, consider disabling automatic invocation or requiring human confirmation for execution steps.

Functional analysis

Type: OpenClaw Skill
Name: agentic-eval
Version: 1.0.0

The skill bundle 'agentic-eval' provides educational patterns and conceptual Python snippets for implementing AI agent self-improvement loops, such as reflection and evaluator-optimizer pipelines. The content in SKILL.md is purely instructional and lacks any indicators of data exfiltration, malicious execution, or prompt injection attacks.

Capability assessment

✓ Purpose & Capability

Name/description (agentic evaluation, reflection loops, evaluator-optimizer patterns) match the SKILL.md content. No unrelated binaries, env vars, or install steps are requested.

ℹ Instruction Scope

SKILL.md stays on-topic (generating, evaluating, critiquing, refining) and uses LLM calls and structured JSON. It also suggests running tests (run_tests) and executing generated code in a loop — this is expected for code-refinement patterns but implies executing generated code and test harnesses, which should be sandboxed and access-controlled.

✓ Install Mechanism

No install spec and no code files — instruction-only. This minimizes on-disk/third-party install risk.

✓ Credentials

Skill requires no environment variables, credentials, or config paths. Nothing disproportionate to the stated purpose is requested.

✓ Persistence & Privilege

always is false and model invocation is allowed (platform default). The skill does not request permanent presence or modify other skills/settings.

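How to use

Make sure OpenClaw is installed (locally or via Docker)
Enter the install command in the chat box: /install agentic-eval
Once installed, call the Skill by name or trigger it with /agentic-eval
Provide the inputs described in the Skill's parameter documentation to get structured output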

Version history

v1.0.0

Import from LeoYeAI/openclaw-master-skills on 2026-03-09

Metadata

Slug agentic-eval

Version 1.0.0

License MIT-0

Total installs 7

Current installs 6

Versions 1

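FAQ

What is agentic-eval?

Patterns and techniques for evaluating and improving AI agent outputs. Use this skill when: - Implementing self-critique and reflection loops - Building eval... It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 403 total downloads to date.

How do I install agentic-eval?

Run the command "/install agentic-eval" in the OpenClaw or Claude Code chat box to install it in one step; no additional configuration is required.

Is agentic-eval free?

Yes. agentic-eval is completely free and released under the MIT-0 license, so you can download, install, and use it freely.

Which platforms does agentic-eval support?

agentic-eval is cross-platform and can be used in any environment where OpenClaw / Claude Code is deployed.

Who developed agentic-eval?

It is developed and maintained by santian (@boleyn); the current version is v1.0.0.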