
agentic-eval

Name: agentic-eval
Author: boleyn

by santian · GitHub ↗ · v1.0.0 · MIT-0

cross-platform ✓ Security Clean

Downloads: 403

Install in OpenClaw

/install agentic-eval

Description

Patterns and techniques for evaluating and improving AI agent outputs. Use this skill when: - Implementing self-critique and reflection loops - Building eval...

README (SKILL.md)

Agentic Evaluation Patterns

Patterns for self-improvement through iterative evaluation and refinement.

Overview

Evaluation patterns enable agents to assess and improve their own outputs, moving beyond single-shot generation to iterative refinement loops.

Generate → Evaluate → Critique → Refine → Output
    ↑                              │
    └──────────────────────────────┘

When to Use

Quality-critical generation: Code, reports, analysis requiring high accuracy
Tasks with clear evaluation criteria: Defined success metrics exist
Content requiring specific standards: Style guides, compliance, formatting

Pattern 1: Basic Reflection

Agent evaluates and improves its own output through self-critique.

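import json

# `llm` is assumed to be a caller-supplied function: prompt string in, completion string out.
def reflect_and_refine(task: str, criteria: list[str], max_iterations: int = 3) -> str:
    """Generate with reflection loop."""
    output = llm(f"Complete this task:\n{task}")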
    
    for i in range(max_iterations):
        # Self-critique
        critique = llm(f"""
        Evaluate this output against criteria: {criteria}
        Output: {output}
        Return only JSON mapping each criterion to
        {{"status": "PASS" or "FAIL", "feedback": "<one sentence>"}}.
        """)
        
        critique_data = json.loads(critique)
        all_pass = all(c["status"] == "PASS" for c in critique_data.values())
        if all_pass:
            return output
        
        # Refine based on critique
        failed = {k: v["feedback"] for k, v in critique_data.items() if v["status"] == "FAIL"}
        output = llm(f"Improve to address: {failed}\nOriginal: {output}")
    
    return output

Key insight: Use structured JSON output for reliable parsing of critique results.
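Model replies are not always clean JSON, so the parsing step is worth guarding. The following is a minimal sketch, assuming the critique schema used in Pattern 1 (each criterion maps to an object with a "status" and optional "feedback"); `parse_critique` is a hypothetical helper, not part of the skill.

```python
import json
import re
from typing import Optional


def parse_critique(raw: str) -> Optional[dict]:
    """Return {criterion: {"status": ..., "feedback": ...}}, or None if unusable.

    Hypothetical helper: the schema is the one Pattern 1 assumes.
    """
    # Models often wrap JSON in prose, so take the outermost {...} span.
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match is None:
        return None
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    # Reject an empty object: all() over no verdicts would count as a pass.
    if not isinstance(data, dict) or not data:
        return None
    for verdict in data.values():
        if not isinstance(verdict, dict) or verdict.get("status") not in ("PASS", "FAIL"):
            return None
        verdict.setdefault("feedback", "")
    return data


reply = 'Sure! {"accuracy": {"status": "PASS", "feedback": "ok"}, "style": {"status": "FAIL"}} Hope that helps.'
parsed = parse_critique(reply)
```

A caller can treat a `None` result as a failed evaluation and either retry the critique or stop the loop, rather than letting `json.loads` raise mid-iteration.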

Pattern 2: Evaluator-Optimizer

Separate generation and evaluation into distinct components for clearer responsibilities.

class EvaluatorOptimizer:
    def __init__(self, score_threshold: float = 0.8):
        self.score_threshold = score_threshold
    
    def generate(self, task: str) -> str:
        return llm(f"Complete: {task}")
    
    def evaluate(self, output: str, task: str) -> dict:
        return json.loads(llm(f"""
        Evaluate output for task: {task}
        Output: {output}
        Return JSON: {{"overall_score": 0-1, "dimensions": {{"accuracy": ..., "clarity": ...}}}}
        """))
    
    def optimize(self, output: str, feedback: dict) -> str:
        return llm(f"Improve based on feedback: {feedback}\nOutput: {output}")
    
    def run(self, task: str, max_iterations: int = 3) -> str:
        output = self.generate(task)
        for _ in range(max_iterations):
            evaluation = self.evaluate(output, task)
            if evaluation["overall_score"] >= self.score_threshold:
                break
            output = self.optimize(output, evaluation)
        return output

Pattern 3: Code-Specific Reflection

Test-driven refinement loop for code generation.

class CodeReflector:
    def reflect_and_fix(self, spec: str, max_iterations: int = 3) -> str:
        code = llm(f"Write Python code for: {spec}")
        tests = llm(f"Generate pytest tests for: {spec}\nCode: {code}")
        
        for _ in range(max_iterations):
            result = run_tests(code, tests)
            if result["success"]:
                return code
            code = llm(f"Fix error: {result['error']}\nCode: {code}")
        return code
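The snippet above calls `run_tests` without defining it. One possible shape is sketched below, assuming the generated tests import from a module named `solution` and that pytest is installed; the file names, the `runner` parameter, and the returned keys are illustrative assumptions. A subprocess with a timeout is not a sandbox, so this should still run inside an isolated environment.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_tests(
    code: str,
    tests: str,
    timeout: float = 30.0,
    runner: tuple[str, ...] = (sys.executable, "-m", "pytest", "-q"),
) -> dict:
    """Write code and tests to a temp dir, run them, return {"success", "error"}.

    Hypothetical helper: tests are assumed to `import solution`.
    """
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "solution.py").write_text(code)
        Path(tmp, "test_solution.py").write_text(tests)
        try:
            proc = subprocess.run(
                [*runner, "test_solution.py"],
                cwd=tmp,
                capture_output=True,
                text=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return {"success": False, "error": f"timed out after {timeout}s"}
    if proc.returncode == 0:
        return {"success": True, "error": ""}
    # Keep only the tail of the output so the follow-up prompt stays small.
    return {"success": False, "error": (proc.stdout + proc.stderr)[-2000:]}
```

The `runner` argument exists so the same function can run plain assert scripts (for example `runner=(sys.executable,)`) where pytest is unavailable.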

Evaluation Strategies

Outcome-Based

Evaluate whether output achieves the expected result.

def evaluate_outcome(task: str, output: str, expected: str) -> str:
    return llm(f"Does output achieve expected outcome? Task: {task}, Expected: {expected}, Output: {output}")

LLM-as-Judge

Use LLM to compare and rank outputs.

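def llm_judge(output_a: str, output_b: str, criteria: str) -> str:
    # Include both outputs in the prompt; the judge cannot compare text it never sees.
    return llm(
        f"Compare outputs A and B for {criteria}. Which is better and why?\n"
        f"Output A: {output_a}\n"
        f"Output B: {output_b}"
    )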
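Pairwise judges can favor whichever output appears first. A common mitigation, sketched here as an assumption rather than something the skill prescribes, is to ask twice with the order swapped and accept a winner only when both verdicts agree. `judge_both_orders` and its `judge` callable (prompt in, text out) are hypothetical names.

```python
from typing import Callable


def judge_both_orders(
    output_a: str,
    output_b: str,
    criteria: str,
    judge: Callable[[str], str],
) -> str:
    """Return "A", "B", or "TIE" after judging in both presentation orders."""

    def ask(first: str, second: str) -> str:
        prompt = (
            f"Compare the two outputs for {criteria}.\n"
            f"Output 1:\n{first}\n\nOutput 2:\n{second}\n\n"
            "Answer with exactly one character: 1 or 2."
        )
        return judge(prompt).strip()

    forward = ask(output_a, output_b)   # A is shown first
    backward = ask(output_b, output_a)  # B is shown first

    if forward == "1" and backward == "2":
        return "A"
    if forward == "2" and backward == "1":
        return "B"
    # Verdicts disagree or could not be parsed: do not declare a winner.
    return "TIE"
```

This doubles the number of judge calls, which is usually acceptable for ranking a handful of candidates but worth weighing inside a tight refinement loop.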

Rubric-Based

Score outputs against weighted dimensions.

RUBRIC = {
    "accuracy": {"weight": 0.4},
    "clarity": {"weight": 0.3},
    "completeness": {"weight": 0.3}
}

def evaluate_with_rubric(output: str, rubric: dict) -> float:
    scores = json.loads(llm(
        f"Rate 1-5 for each dimension: {list(rubric.keys())}. "
        f"Return JSON mapping each dimension to its rating.\nOutput: {output}"
    ))
    return sum(scores[d] * rubric[d]["weight"] for d in rubric) / 5
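To make the arithmetic concrete, the scoring step can be separated from the LLM call and checked on fixed ratings. This is a sketch under the same 1-5 scale and weights as above; `weighted_score` and the sample ratings are illustrative, and the division by total weight is an added safeguard for rubrics whose weights do not sum to 1.

```python
def weighted_score(scores: dict, rubric: dict, max_rating: int = 5) -> float:
    """Combine per-dimension ratings (1..max_rating) into a score in (0, 1]."""
    missing = [d for d in rubric if d not in scores]
    if missing:
        raise ValueError(f"missing ratings for: {missing}")
    for d in rubric:
        if not 1 <= scores[d] <= max_rating:
            raise ValueError(f"rating for {d!r} out of range: {scores[d]}")
    total_weight = sum(spec["weight"] for spec in rubric.values())
    weighted = sum(scores[d] * rubric[d]["weight"] for d in rubric)
    return weighted / (total_weight * max_rating)


rubric = {
    "accuracy": {"weight": 0.4},
    "clarity": {"weight": 0.3},
    "completeness": {"weight": 0.3},
}
ratings = {"accuracy": 4, "clarity": 5, "completeness": 3}  # made-up ratings

# (4 * 0.4 + 5 * 0.3 + 3 * 0.3) / 5 = 4.0 / 5, i.e. about 0.8
score = weighted_score(ratings, rubric)
```

Because the minimum rating is 1, the lowest possible score is 0.2 rather than 0, which matters when choosing a threshold.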

Best Practices

Practice	Rationale
Clear criteria	Define specific, measurable evaluation criteria upfront
Iteration limits	Set max iterations (3-5) to prevent infinite loops
Convergence check	Stop if output score isn't improving between iterations
Log history	Keep full trajectory for debugging and analysis
Structured output	Use JSON for reliable parsing of evaluation results
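The iteration-limit, convergence-check, and log-history practices can be combined in one loop. The sketch below assumes `evaluate` returns a single float score (a simplification of the dict used in Pattern 2); `refine_until_converged`, `min_gain`, and the callable signatures are hypothetical.

```python
from typing import Callable


def refine_until_converged(
    task: str,
    generate: Callable[[str], str],
    evaluate: Callable[[str, str], float],
    optimize: Callable[[str, float], str],
    threshold: float = 0.8,
    max_iterations: int = 5,
    min_gain: float = 0.01,
) -> tuple[str, list[dict]]:
    """Refine until good enough, stalled, or out of iterations.

    Returns the best-scoring output and the full trajectory.
    """
    output = generate(task)
    history: list[dict] = []
    best_output, best_score = output, float("-inf")

    for iteration in range(max_iterations):
        score = evaluate(output, task)
        history.append({"iteration": iteration, "score": score, "output": output})

        # Compare against the best score so far, before updating it.
        improved = score - best_score >= min_gain
        if score > best_score:
            best_output, best_score = output, score

        if score >= threshold or not improved:
            break
        output = optimize(output, score)

    return best_output, history
```

Returning the best output rather than the last one guards against a refinement step that makes things worse, and the history list gives the trajectory needed for debugging.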

Quick Start Checklist

## Evaluation Implementation Checklist

### Setup
- [ ] Define evaluation criteria/rubric
- [ ] Set score threshold for "good enough"
- [ ] Configure max iterations (default: 3)

### Implementation
- [ ] Implement generate() function
- [ ] Implement evaluate() function with structured output
- [ ] Implement optimize() function
- [ ] Wire up the refinement loop

### Safety
- [ ] Add convergence detection
- [ ] Log all iterations for debugging
- [ ] Handle evaluation parse failures gracefully

Usage Guidance

This skill appears coherent and benign, but it prescribes executing generated code and running tests as part of refinement loops. Before deploying: (1) run generated code and tests inside a sandbox or isolated CI environment to avoid executing untrusted code; (2) enforce iteration limits and convergence checks, and parse and validate LLM JSON output, to avoid infinite loops or malformed feedback; (3) treat LLM evaluations as fallible (an LLM judge can hallucinate) and add human review for critical decisions; (4) ensure the agent or environment executing tests has no access to secrets or sensitive systems. To restrict autonomous behavior, consider disabling automatic invocation or requiring human confirmation for execution steps.

Capability Analysis

Type: OpenClaw Skill
Name: agentic-eval
Version: 1.0.0

The skill bundle 'agentic-eval' provides educational patterns and conceptual Python snippets for implementing AI agent self-improvement loops, such as reflection and evaluator-optimizer pipelines. The content in SKILL.md is purely instructional and lacks any indicators of data exfiltration, malicious execution, or prompt injection attacks.

Capability Assessment

✓ Purpose & Capability

Name/description (agentic evaluation, reflection loops, evaluator-optimizer patterns) match the SKILL.md content. No unrelated binaries, env vars, or install steps are requested.

ℹ Instruction Scope

SKILL.md stays on-topic (generating, evaluating, critiquing, refining) and uses LLM calls and structured JSON. It also suggests running tests (run_tests) and executing generated code in a loop — this is expected for code-refinement patterns but implies executing generated code and test harnesses, which should be sandboxed and access-controlled.

✓ Install Mechanism

No install spec and no code files — instruction-only. This minimizes on-disk/third-party install risk.

✓ Credentials

Skill requires no environment variables, credentials, or config paths. Nothing disproportionate to the stated purpose is requested.

✓ Persistence & Privilege

The always setting is false and model invocation is allowed (platform default). The skill does not request permanent presence or modify other skills/settings.

How to Use

Make sure OpenClaw is installed (local or Docker)
Run the install command in chat: /install agentic-eval
After installation, invoke the skill by name or use /agentic-eval
Provide required inputs per the skill's parameter spec and get structured output

Version History

v1.0.0

Import from LeoYeAI/openclaw-master-skills on 2026-03-09

Metadata

Slug agentic-eval

Version 1.0.0

License MIT-0

All-time Installs 7

Active Installs 6

Total Versions 1

Frequently Asked Questions

What is agentic-eval?

Patterns and techniques for evaluating and improving AI agent outputs. Use this skill when: - Implementing self-critique and reflection loops - Building eval... It is an AI Agent Skill for Claude Code / OpenClaw, with 403 downloads so far.

How do I install agentic-eval?

Run "/install agentic-eval" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is agentic-eval free?

Yes, agentic-eval is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does agentic-eval support?

agentic-eval is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created agentic-eval?

It is built and maintained by santian (@boleyn); the current version is v1.0.0.
