/install agentic-eval
Agentic Evaluation Patterns
Patterns for self-improvement through iterative evaluation and refinement.
Overview
Evaluation patterns enable agents to assess and improve their own outputs, moving beyond single-shot generation to iterative refinement loops.
Generate → Evaluate → Critique → Refine → Output
↑ │
└──────────────────────────────┘
When to Use
- Quality-critical generation: Code, reports, analysis requiring high accuracy
- Tasks with clear evaluation criteria: Defined success metrics exist
- Content requiring specific standards: Style guides, compliance, formatting
Pattern 1: Basic Reflection
Agent evaluates and improves its own output through self-critique.
def reflect_and_refine(task: str, criteria: list[str], max_iterations: int = 3) -> str:
"""Generate with reflection loop."""
output = llm(f"Complete this task:\
{task}")
for i in range(max_iterations):
# Self-critique
critique = llm(f"""
Evaluate this output against criteria: {criteria}
Output: {output}
Rate each: PASS/FAIL with feedback as JSON.
""")
critique_data = json.loads(critique)
all_pass = all(c["status"] == "PASS" for c in critique_data.values())
if all_pass:
return output
# Refine based on critique
failed = {k: v["feedback"] for k, v in critique_data.items() if v["status"] == "FAIL"}
output = llm(f"Improve to address: {failed}\
Original: {output}")
return output
Key insight: Use structured JSON output for reliable parsing of critique results.
Pattern 2: Evaluator-Optimizer
Separate generation and evaluation into distinct components for clearer responsibilities.
class EvaluatorOptimizer:
def __init__(self, score_threshold: float = 0.8):
self.score_threshold = score_threshold
def generate(self, task: str) -> str:
return llm(f"Complete: {task}")
def evaluate(self, output: str, task: str) -> dict:
return json.loads(llm(f"""
Evaluate output for task: {task}
Output: {output}
Return JSON: {{"overall_score": 0-1, "dimensions": {{"accuracy": ..., "clarity": ...}}}}
"""))
def optimize(self, output: str, feedback: dict) -> str:
return llm(f"Improve based on feedback: {feedback}\
Output: {output}")
def run(self, task: str, max_iterations: int = 3) -> str:
output = self.generate(task)
for _ in range(max_iterations):
evaluation = self.evaluate(output, task)
if evaluation["overall_score"] >= self.score_threshold:
break
output = self.optimize(output, evaluation)
return output
Pattern 3: Code-Specific Reflection
Test-driven refinement loop for code generation.
class CodeReflector:
def reflect_and_fix(self, spec: str, max_iterations: int = 3) -> str:
code = llm(f"Write Python code for: {spec}")
tests = llm(f"Generate pytest tests for: {spec}\
Code: {code}")
for _ in range(max_iterations):
result = run_tests(code, tests)
if result["success"]:
return code
code = llm(f"Fix error: {result['error']}\
Code: {code}")
return code
Evaluation Strategies
Outcome-Based
Evaluate whether output achieves the expected result.
def evaluate_outcome(task: str, output: str, expected: str) -> str:
return llm(f"Does output achieve expected outcome? Task: {task}, Expected: {expected}, Output: {output}")
LLM-as-Judge
Use LLM to compare and rank outputs.
def llm_judge(output_a: str, output_b: str, criteria: str) -> str:
return llm(f"Compare outputs A and B for {criteria}. Which is better and why?")
Rubric-Based
Score outputs against weighted dimensions.
RUBRIC = {
"accuracy": {"weight": 0.4},
"clarity": {"weight": 0.3},
"completeness": {"weight": 0.3}
}
def evaluate_with_rubric(output: str, rubric: dict) -> float:
scores = json.loads(llm(f"Rate 1-5 for each dimension: {list(rubric.keys())}\
Output: {output}"))
return sum(scores[d] * rubric[d]["weight"] for d in rubric) / 5
Best Practices
| Practice | Rationale |
|---|---|
| Clear criteria | Define specific, measurable evaluation criteria upfront |
| Iteration limits | Set max iterations (3-5) to prevent infinite loops |
| Convergence check | Stop if output score isn't improving between iterations |
| Log history | Keep full trajectory for debugging and analysis |
| Structured output | Use JSON for reliable parsing of evaluation results |
Quick Start Checklist
## Evaluation Implementation Checklist
### Setup
- [ ] Define evaluation criteria/rubric
- [ ] Set score threshold for "good enough"
- [ ] Configure max iterations (default: 3)
### Implementation
- [ ] Implement generate() function
- [ ] Implement evaluate() function with structured output
- [ ] Implement optimize() function
- [ ] Wire up the refinement loop
### Safety
- [ ] Add convergence detection
- [ ] Log all iterations for debugging
- [ ] Handle evaluation parse failures gracefully
- 确保已安装 OpenClaw(本地或 Docker 部署)
- 在对话框中输入安装命令:
/install agentic-eval - 安装完成后,直接呼叫该 Skill 的名称或使用
/agentic-eval触发 - 根据 Skill 的参数说明提供必要输入,即可获得结构化输出
agentic-eval 是什么?
Patterns and techniques for evaluating and improving AI agent outputs. Use this skill when: - Implementing self-critique and reflection loops - Building eval... 它是一个面向 Claude Code / OpenClaw 的 AI Agent Skill 插件,目前累计下载 403 次。
如何安装 agentic-eval?
在 OpenClaw 或 Claude Code 对话框中运行命令「/install agentic-eval」即可一键安装,无需额外配置。
agentic-eval 是免费的吗?
是的,agentic-eval 完全免费,采用 MIT-0 许可证,可自由下载、安装和使用。
agentic-eval 支持哪些平台?
agentic-eval 跨平台运行,可在任意部署了 OpenClaw / Claude Code 的环境中使用(cross-platform)。
谁开发了 agentic-eval?
由 santian(@boleyn)开发并维护,当前版本 v1.0.0。