boleyn

agentic-eval

Author: santian · GitHub · v1.0.0 · MIT-0
cross-platform ✓ Security check passed
403 total downloads · 0 favorites · 6 current installs · 1 version
Install in OpenClaw
/install agentic-eval
Description
Patterns and techniques for evaluating and improving AI agent outputs. Use this skill when: - Implementing self-critique and reflection loops - Building eval...
Instructions (SKILL.md)

Agentic Evaluation Patterns

Patterns for self-improvement through iterative evaluation and refinement.

Overview

Evaluation patterns enable agents to assess and improve their own outputs, moving beyond single-shot generation to iterative refinement loops.

Generate → Evaluate → Critique → Refine → Output
    ↑                              │
    └──────────────────────────────┘

When to Use

  • Quality-critical generation: Code, reports, analysis requiring high accuracy
  • Tasks with clear evaluation criteria: Defined success metrics exist
  • Content requiring specific standards: Style guides, compliance, formatting

Pattern 1: Basic Reflection

Agent evaluates and improves its own output through self-critique.

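import json

# `llm(prompt) -> str` is a placeholder for your model client; it is not defined here.

def reflect_and_refine(task: str, criteria: list[str], max_iterations: int = 3) -> str:
    """Generate with reflection loop."""
    output = llm(f"Complete this task:\n{task}")

    for _ in range(max_iterations):
        # Self-critique: ask for one verdict per criterion in a fixed JSON shape
        critique = llm(f"""
        Evaluate this output against criteria: {criteria}
        Output: {output}
        Return JSON mapping each criterion to {{"status": "PASS" or "FAIL", "feedback": "..."}}.
        """)

        # Raises json.JSONDecodeError if the model returns anything but JSON
        critique_data = json.loads(critique)
        all_pass = all(c["status"] == "PASS" for c in critique_data.values())
        if all_pass:
            return output

        # Refine based on the feedback for the failed criteria only
        failed = {k: v["feedback"] for k, v in critique_data.items() if v["status"] == "FAIL"}
        output = llm(f"Improve to address: {failed}\nOriginal: {output}")

    # Iteration budget exhausted: return the last attempt
    return output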

Key insight: Use structured JSON output for reliable parsing of critique results.
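
Structured output only helps if the parser tolerates imperfect replies. The following is a minimal sketch of a defensive parser, assuming the critique shape used in Pattern 1 (each criterion maps to a "status" of PASS or FAIL plus optional "feedback"). The name parse_critique is hypothetical and not part of the skill.

```python
import json
import re
from typing import Optional


def parse_critique(raw: str) -> Optional[dict]:
    """Return {criterion: {"status": ..., "feedback": ...}}, or None if unusable."""
    # Models often wrap JSON in prose, so take the outermost {...} span.
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match is None:
        return None
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or not data:
        return None
    for verdict in data.values():
        # Every entry must carry a recognised status; feedback is optional.
        if not isinstance(verdict, dict) or verdict.get("status") not in ("PASS", "FAIL"):
            return None
    return data


reply = 'Here is my review: {"accuracy": {"status": "FAIL", "feedback": "off by one"}} Hope that helps.'
parsed = parse_critique(reply)
```

When the helper returns None, the loop could re-ask for the critique or stop and return the current output, rather than crash on json.loads.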


Pattern 2: Evaluator-Optimizer

Separate generation and evaluation into distinct components for clearer responsibilities.

class EvaluatorOptimizer:
    def __init__(self, score_threshold: float = 0.8):
        self.score_threshold = score_threshold
    
    def generate(self, task: str) -> str:
        return llm(f"Complete: {task}")
    
    def evaluate(self, output: str, task: str) -> dict:
        return json.loads(llm(f"""
        Evaluate output for task: {task}
        Output: {output}
        Return JSON: {{"overall_score": 0-1, "dimensions": {{"accuracy": ..., "clarity": ...}}}}
        """))
    
    def optimize(self, output: str, feedback: dict) -> str:
        return llm(f"Improve based on feedback: {feedback}\nOutput: {output}")
    
    def run(self, task: str, max_iterations: int = 3) -> str:
        output = self.generate(task)
        for _ in range(max_iterations):
            evaluation = self.evaluate(output, task)
            if evaluation["overall_score"] >= self.score_threshold:
                break
            output = self.optimize(output, evaluation)
        return output

Pattern 3: Code-Specific Reflection

Test-driven refinement loop for code generation.

class CodeReflector:
    def reflect_and_fix(self, spec: str, max_iterations: int = 3) -> str:
        # `llm` and `run_tests` are placeholders; run_tests is expected to
        # return {"success": bool, "error": str}.
        code = llm(f"Write Python code for: {spec}")
        tests = llm(f"Generate pytest tests for: {spec}\nCode: {code}")

        for _ in range(max_iterations):
            result = run_tests(code, tests)
            if result["success"]:
                return code
            # Feed the failure output back to the model and retry
            code = llm(f"Fix error: {result['error']}\nCode: {code}")
        return code
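
The snippet above leaves run_tests undefined. One possible shape for it is sketched below, under two assumptions that are not in the original: the generated tests import the code from a module named solution, and pytest is installed for the current interpreter. It runs pytest in a temporary directory with a timeout. This is not a sandbox, so untrusted code still needs real isolation.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_tests(code: str, tests: str, timeout: float = 30.0) -> dict:
    """Run pytest on generated code; return {"success": bool, "error": str}."""
    with tempfile.TemporaryDirectory() as tmp:
        # Assumed layout: the tests import the code as `solution`.
        Path(tmp, "solution.py").write_text(code, encoding="utf-8")
        Path(tmp, "test_solution.py").write_text(tests, encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, "-m", "pytest", "-q", "test_solution.py"],
                cwd=tmp,
                capture_output=True,
                text=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return {"success": False, "error": f"timed out after {timeout}s"}
    # pytest exits with 0 only when every collected test passed.
    passed = proc.returncode == 0
    output = (proc.stdout + proc.stderr).strip()
    return {"success": passed, "error": "" if passed else output}


result = run_tests(
    "def add(a, b):\n    return a + b\n",
    "from solution import add\n\ndef test_add():\n    assert add(1, 2) == 4\n",
)
```

Here the test is deliberately wrong, so result["success"] is False and result["error"] holds the output that the fix prompt would receive.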

Evaluation Strategies

Outcome-Based

Evaluate whether output achieves the expected result.

def evaluate_outcome(task: str, output: str, expected: str) -> str:
    return llm(f"Does output achieve expected outcome? Task: {task}, Expected: {expected}, Output: {output}")

LLM-as-Judge

Use LLM to compare and rank outputs.

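def llm_judge(output_a: str, output_b: str, criteria: str) -> str:
    # Both candidates must appear in the prompt, otherwise the judge has nothing to compare
    return llm(f"Compare outputs A and B for {criteria}. Which is better and why?\nA: {output_a}\nB: {output_b}")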

Rubric-Based

Score outputs against weighted dimensions.

RUBRIC = {
    "accuracy": {"weight": 0.4},
    "clarity": {"weight": 0.3},
    "completeness": {"weight": 0.3}
}

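def evaluate_with_rubric(output: str, rubric: dict) -> float:
    # Expects a JSON object mapping each dimension to an integer rating from 1 to 5
    scores = json.loads(llm(f"Rate 1-5 for each dimension: {list(rubric.keys())}. Return JSON.\nOutput: {output}"))
    # Weighted average, divided by the scale maximum to normalize to 0-1
    return sum(scores[d] * rubric[d]["weight"] for d in rubric) / 5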
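
To make the arithmetic concrete, here is a small worked sketch of the weighted sum with the LLM call factored out. The helper weighted_score and the sample ratings are illustrative assumptions. The weight and missing-rating checks are additions, not part of the original snippet.

```python
def weighted_score(scores: dict, rubric: dict, scale_max: int = 5) -> float:
    """Combine per-dimension ratings (1..scale_max) into a single 0-1 score."""
    total_weight = sum(spec["weight"] for spec in rubric.values())
    if abs(total_weight - 1.0) > 1e-9:
        raise ValueError(f"rubric weights sum to {total_weight}, expected 1.0")
    missing = [d for d in rubric if d not in scores]
    if missing:
        raise ValueError(f"missing ratings for: {missing}")
    return sum(scores[d] * rubric[d]["weight"] for d in rubric) / scale_max


RUBRIC = {
    "accuracy": {"weight": 0.4},
    "clarity": {"weight": 0.3},
    "completeness": {"weight": 0.3},
}

# Hypothetical ratings, as a model might return them
ratings = {"accuracy": 5, "clarity": 4, "completeness": 3}
score = weighted_score(ratings, RUBRIC)
# (5*0.4 + 4*0.3 + 3*0.3) / 5 = 4.1 / 5, i.e. about 0.82
```

Because ratings start at 1, the normalized score ranges from 0.2 to 1.0, not 0 to 1. That matters when choosing a threshold such as 0.8.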

Best Practices

| Practice | Rationale |
|---|---|
| Clear criteria | Define specific, measurable evaluation criteria upfront |
| Iteration limits | Set max iterations (3-5) to prevent infinite loops |
| Convergence check | Stop if output score isn't improving between iterations |
| Log history | Keep full trajectory for debugging and analysis |
| Structured output | Use JSON for reliable parsing of evaluation results |
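
The iteration limit, convergence check, and history log can be combined in one loop. The sketch below assumes the model calls are passed in as plain callables and that evaluation yields a single float score (a simplification of the feedback dict used in Pattern 2). The names refine_until_converged and min_gain are hypothetical.

```python
from typing import Callable


def refine_until_converged(
    task: str,
    generate: Callable[[str], str],
    evaluate: Callable[[str, str], float],
    optimize: Callable[[str, float], str],
    threshold: float = 0.8,
    max_iterations: int = 3,
    min_gain: float = 0.01,
):
    """Return (best_output, history); history holds one record per evaluation."""
    output = generate(task)
    best_output, best_score = output, float("-inf")
    history = []
    for i in range(max_iterations):
        score = evaluate(output, task)
        history.append({"iteration": i, "score": score, "output": output})
        # Convergence check: did this attempt beat the best so far by min_gain?
        improved = score >= best_score + min_gain
        if score > best_score:
            best_output, best_score = output, score
        if score >= threshold or not improved:
            break
        output = optimize(output, score)
    return best_output, history


# Stubbed run: the score plateaus at 0.7, so the loop stops before max_iterations.
stub_scores = iter([0.5, 0.7, 0.7])
best, history = refine_until_converged(
    "demo",
    generate=lambda task: "draft 0",
    evaluate=lambda output, task: next(stub_scores),
    optimize=lambda output, score: output + "+",
    max_iterations=5,
)
```

Returning the best-scoring output instead of the last one guards against a refinement step that makes things worse.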

Quick Start Checklist

## Evaluation Implementation Checklist

### Setup
- [ ] Define evaluation criteria/rubric
- [ ] Set score threshold for "good enough"
- [ ] Configure max iterations (default: 3)

### Implementation
- [ ] Implement generate() function
- [ ] Implement evaluate() function with structured output
- [ ] Implement optimize() function
- [ ] Wire up the refinement loop

### Safety
- [ ] Add convergence detection
- [ ] Log all iterations for debugging
- [ ] Handle evaluation parse failures gracefully
Security recommendations
This skill appears coherent and benign, but it prescribes executing generated code and running tests as part of refinement loops. Before deploying: (1) run generated code and tests inside a sandbox or isolated CI environment to avoid executing untrusted code; (2) enforce iteration limits, convergence checks, and parsing/validation of LLM JSON outputs to avoid infinite loops or malformed feedback; (3) treat LLM evaluations as fallible (LLM-as-judge can hallucinate) and add human review for critical decisions; (4) ensure the agent or environment executing tests has no access to secrets or sensitive systems. If you want to restrict autonomous behavior, consider disabling automatic invocation or requiring human confirmation for execution steps.
Functional analysis
Type: OpenClaw Skill Name: agentic-eval Version: 1.0.0 The skill bundle 'agentic-eval' provides educational patterns and conceptual Python snippets for implementing AI agent self-improvement loops, such as reflection and evaluator-optimizer pipelines. The content in SKILL.md is purely instructional and lacks any indicators of data exfiltration, malicious execution, or prompt injection attacks.
Capability assessment
Purpose & Capability
Name/description (agentic evaluation, reflection loops, evaluator-optimizer patterns) match the SKILL.md content. No unrelated binaries, env vars, or install steps are requested.
Instruction Scope
SKILL.md stays on-topic (generating, evaluating, critiquing, refining) and uses LLM calls and structured JSON. It also suggests running tests (run_tests) and executing generated code in a loop — this is expected for code-refinement patterns but implies executing generated code and test harnesses, which should be sandboxed and access-controlled.
Install Mechanism
No install spec and no code files — instruction-only. This minimizes on-disk/third-party install risk.
Credentials
Skill requires no environment variables, credentials, or config paths. Nothing disproportionate to the stated purpose is requested.
Persistence & Privilege
always is false and model invocation is allowed (platform default). The skill does not request permanent presence or modify other skills/settings.
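How to use
  1. Make sure OpenClaw is installed (locally or via Docker)
  2. Enter the install command in the chat box: /install agentic-eval
  3. Once installed, invoke the skill by name or trigger it with /agentic-eval
  4. Provide the inputs described in the skill's parameter documentation to get structured output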
Version history
v1.0.0
Imported from LeoYeAI/openclaw-master-skills on 2026-03-09
Metadata
Slug: agentic-eval
Version: 1.0.0
License: MIT-0
Total installs: 7
Current installs: 6
Versions: 1
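FAQ

What is agentic-eval?

Patterns and techniques for evaluating and improving AI agent outputs. Use this skill when: - Implementing self-critique and reflection loops - Building eval... It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 403 total downloads to date.

How do I install agentic-eval?

Run the command "/install agentic-eval" in the OpenClaw or Claude Code chat box to install it in one step; no extra configuration is needed.

Is agentic-eval free?

Yes. agentic-eval is completely free and released under the MIT-0 license, so you can download, install, and use it freely.

Which platforms does agentic-eval support?

agentic-eval is cross-platform and can be used in any environment where OpenClaw / Claude Code is deployed.

Who develops agentic-eval?

It is developed and maintained by santian (@boleyn); the current version is v1.0.0.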
