zmy1006-sudo

DoubleAgent — Generator-Evaluator Dual Agent Pattern

by mingyuan · GitHub ↗ · v1.0.0 · MIT-0
cross-platform ✓ Security Clean
102 Downloads · 0 Stars · 0 Active Installs · 1 Version
Install in OpenClaw
/install double-agent
Description
This skill should be used when designing, implementing, or improving any AI system that requires quality assurance through separation of generation and evalu...
README (SKILL.md)

# DoubleAgent Skill

## Purpose

The DoubleAgent pattern solves a fundamental problem in AI-generated software: AI self-evaluation bias.

When a single AI agent both generates and evaluates its own output, it systematically overestimates quality, the same cognitive conflict that occurs when a student grades their own exam. The solution is to forcibly separate the two cognitive roles into independent agents with different prompts, goals, and evaluation criteria.

This skill provides:

1. Architecture templates for Generator-Evaluator agent pairs
2. Evaluator prompt templates calibrated with few-shot scoring examples
3. Iteration loop design for 5-15 round refinement cycles
4. Playwright integration patterns for real browser-based evaluation
5. Scoring rubric design to prevent score drift and grade inflation

---

## Core Architecture

```
User Goal / Spec
      ↓
 ┌─────────────┐
 │  Generator  │ ← Produces output (code, UI, content, data)
 └──────┬──────┘
        │ output artifact
        ↓
 ┌─────────────────────────────────────┐
 │            Evaluator                │
 │  • Reads spec (NOT generator output)│
 │  • Operates artifact via Playwright │
 │    (click, fill form, navigate)     │
 │  • Scores on rubric (0-100)         │
 │  • Writes structured feedback       │
 └────────────────┬────────────────────┘
                  │ score + feedback
                  ↓
         ┌─────────────────┐
         │ Score ≥ target? │
         │   YES → Done    │
         │   NO → Loop     │
         └────────┬────────┘
                  │
                  └──→ Generator (next iteration)
```

**Key principle**: The Evaluator takes its expectations from the **original spec**, not from the Generator's output. It evaluates the artifact independently, as if it were a real user encountering the product for the first time.

---

## When to Apply

| Scenario | Apply DoubleAgent? |
|----------|--------------------|
| AI-generated frontend UI with interactions | ✅ Yes |
| Multi-step workflow code (forms, flows) | ✅ Yes |
| API endpoint implementation + validation | ✅ Yes |
| Content generation (reports, copy, docs) | ✅ Yes (text-based evaluator) |
| Single-function refactoring | ⚠️ Optional |
| Simple config changes | ❌ Not needed |

---

## Implementation Steps

### Step 1: Define the Spec Contract

Write a clear spec that both agents will reference independently. The spec must be:
- Concrete (measurable outcomes, not vague goals)
- Observable (evaluable through interaction or inspection)
- Versioned (so both agents work from the same contract)

See `references/architecture.md` for the spec template.
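
The actual spec template lives in `references/architecture.md` and is not reproduced here. As a rough illustration only, a spec with the three properties above could be modeled as follows. The `Spec` and `Requirement` classes, their fields, and the login example are all hypothetical and not part of the skill.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Requirement:
    """One concrete, observable requirement (hypothetical structure)."""
    req_id: str          # stable identifier both agents can cite
    description: str     # measurable outcome, not a vague goal
    how_to_verify: str   # interaction or inspection that proves it


@dataclass(frozen=True)
class Spec:
    """Versioned contract shared by Generator and Evaluator (hypothetical)."""
    version: str
    goal: str
    requirements: tuple[Requirement, ...] = ()

    def validate(self) -> list[str]:
        """Return a list of problems; an empty list means the spec is usable."""
        problems = []
        if not self.version:
            problems.append("spec has no version")
        seen = set()
        for req in self.requirements:
            if req.req_id in seen:
                problems.append(f"duplicate requirement id: {req.req_id}")
            seen.add(req.req_id)
            if not req.how_to_verify.strip():
                problems.append(f"{req.req_id} is not observable (no verification step)")
        return problems


# Made-up example spec for a login form.
login_spec = Spec(
    version="1.0.0",
    goal="Login form for a demo app",
    requirements=(
        Requirement("R1", "Valid credentials redirect to /dashboard",
                    "Fill the form, submit, check the resulting URL"),
        Requirement("R2", "Empty password shows an inline error",
                    "Submit with an empty password, check visible text"),
    ),
)
```

Freezing the dataclasses is one way to keep both agents on the same contract within a run: a changed spec has to be a new object with a new version.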

### Step 2: Configure the Generator Agent

Assign the Generator a single role: **produce output that satisfies the spec**.

- Do NOT ask the Generator to self-evaluate
- Do NOT include evaluation criteria in the Generator's prompt
- Provide: spec + iteration history + previous evaluator feedback
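
A minimal sketch of how the Generator's input might be assembled under these rules. The function name `build_generator_prompt` and the prompt wording are assumptions; the history entries use the `round` / `score` / `feedback` keys from the Step 4 loop.

```python
def build_generator_prompt(spec_text: str, history: list[dict]) -> str:
    """Assemble the Generator's input: spec, score history, latest feedback.

    Deliberately leaves out the rubric and any request to self-assess.
    """
    lines = ["Produce output that satisfies the following spec.", "", "SPEC:", spec_text]
    if history:
        lines += ["", "PREVIOUS ROUNDS:"]
        for entry in history:
            lines.append(f"- round {entry['round']}: score {entry['score']}")
        # Only the most recent feedback is quoted in full to keep the prompt short.
        lines += ["", "LATEST EVALUATOR FEEDBACK:", history[-1]["feedback"]]
    return "\n".join(lines)
```

On the first round `history` is empty, so the prompt contains only the task line and the spec.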

### Step 3: Configure the Evaluator Agent

Assign the Evaluator a single role: **independently verify the spec is satisfied**.

- Load `references/evaluator-prompts.md` for calibrated prompt templates
- Use Playwright MCP for UI/web artifacts (real browser interaction)
- Use structured JSON output for scores to enable automated loop control
- Calibrate with few-shot examples BEFORE running (prevents grade inflation)
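
Structured JSON output only helps loop control if the reply is validated before use. The sketch below assumes a reply with a numeric `score` (0-100) and a `feedback` string; that schema and the names `Evaluation` and `parse_evaluation` are illustrative, since the real templates are in `references/evaluator-prompts.md`.

```python
import json
from dataclasses import dataclass


@dataclass
class Evaluation:
    score: int
    feedback: str


def parse_evaluation(raw: str) -> Evaluation:
    """Parse and validate the Evaluator's JSON reply (assumed schema)."""
    data = json.loads(raw)  # malformed JSON raises json.JSONDecodeError, a ValueError
    score = data["score"]   # a missing key raises KeyError
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        raise ValueError(f"score must be a number, got {score!r}")
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range 0-100: {score}")
    return Evaluation(score=int(round(score)), feedback=str(data.get("feedback", "")))
```

Rejecting out-of-range or non-numeric scores early keeps a malformed reply from silently ending or prolonging the loop.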

### Step 4: Design the Iteration Loop

```python
MAX_ROUNDS = 15
PASS_THRESHOLD = 80  # out of 100

# `generator`, `evaluator`, and `spec` are supplied by the integrator.
history = []  # one entry per round, fed back to the Generator

for round_num in range(MAX_ROUNDS):
    output = generator.run(spec, history)
    evaluation = evaluator.run(spec, output)  # Playwright-based

    history.append({"round": round_num, "score": evaluation.score, "feedback": evaluation.feedback})

    if evaluation.score >= PASS_THRESHOLD:
        break  # target reached

    # `score_trend` is assumed to be reported alongside the score
    if evaluation.score_trend == "plateauing":
        generator.switch_approach()  # Complete strategy reset
```

See `scripts/iteration_loop.py` for a complete implementation template.
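
The loop above reads `evaluation.score_trend` without showing how a plateau is detected. One possible way to derive it from the recorded history is sketched here. The function `score_trend`, its `window` and `min_gain` defaults, and the extra return values are assumptions, not the logic in `scripts/iteration_loop.py`.

```python
def score_trend(history: list[dict], window: int = 3, min_gain: int = 2) -> str:
    """Classify the recent score trajectory.

    Returns "plateauing" when the best score of the last `window` rounds
    beats the best earlier score by fewer than `min_gain` points.
    """
    scores = [entry["score"] for entry in history]
    if len(scores) <= window:
        return "insufficient_data"  # too few rounds to compare
    best_before = max(scores[:-window])
    best_recent = max(scores[-window:])
    return "plateauing" if best_recent - best_before < min_gain else "improving"
```

Comparing best scores rather than consecutive ones makes the check tolerant of a single noisy round.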

### Step 5: Calibrate the Evaluator

To prevent score drift, run the Evaluator on 3-5 known examples FIRST:
- 1 example at ~30/100 (clearly bad)
- 1 example at ~60/100 (mediocre)
- 1 example at ~85/100 (good)
- 1 example at ~95/100 (excellent)

If scores deviate >15 points from expected, adjust the Evaluator's prompt or rubric weights before the real run.
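
The 15-point check can be automated with a few lines. This is a sketch only, not the contents of `scripts/calibrate_evaluator.py`; the function name, the example labels, and the Evaluator scores shown are made up for illustration.

```python
MAX_DEVIATION = 15  # points, per the guidance above


def calibration_report(expected: dict[str, int], actual: dict[str, int]) -> dict[str, int]:
    """Return {example_name: signed deviation} for examples that drift too far."""
    drifted = {}
    for name, target in expected.items():
        deviation = actual[name] - target
        if abs(deviation) > MAX_DEVIATION:
            drifted[name] = deviation
    return drifted


expected_scores = {"bad": 30, "mediocre": 60, "good": 85, "excellent": 95}
evaluator_scores = {"bad": 52, "mediocre": 68, "good": 88, "excellent": 97}  # made-up numbers
drifted = calibration_report(expected_scores, evaluator_scores)
```

Here only the clearly bad example drifts (22 points too generous), which would suggest the Evaluator is inflating low-quality work.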

---

## Scoring Rubric Design

Effective rubrics for software systems:

| Dimension | Weight | What to Measure |
|-----------|--------|-----------------|
| Functional completeness | 30% | Does each spec requirement work end-to-end? |
| Interaction quality | 25% | Click/form/navigation behavior as a real user |
| Edge case handling | 20% | Error states, empty data, boundary inputs |
| Code/design quality | 15% | Consistency, readability, no obvious anti-patterns |
| Originality / craft | 10% | Avoids generic/template outputs when spec requires uniqueness |

Adjust weights based on the domain. For content systems, increase "originality". For data pipelines, increase "edge case handling".
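
As a worked example of how the weights combine, the sketch below computes a single 0-100 total from per-dimension scores. The dictionary keys and the function name are illustrative choices; the weights are the ones from the table.

```python
RUBRIC_WEIGHTS = {
    "functional_completeness": 0.30,
    "interaction_quality": 0.25,
    "edge_case_handling": 0.20,
    "code_design_quality": 0.15,
    "originality_craft": 0.10,
}


def weighted_score(dimension_scores: dict[str, float],
                   weights: dict[str, float] = RUBRIC_WEIGHTS) -> float:
    """Combine per-dimension scores (each 0-100) into one 0-100 total."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("rubric weights must sum to 1.0")
    missing = set(weights) - set(dimension_scores)
    if missing:
        raise ValueError(f"missing dimension scores: {sorted(missing)}")
    return round(sum(weights[d] * dimension_scores[d] for d in weights), 2)


total = weighted_score({
    "functional_completeness": 90,
    "interaction_quality": 80,
    "edge_case_handling": 60,
    "code_design_quality": 70,
    "originality_craft": 50,
})
# 27 + 20 + 12 + 10.5 + 5 = 74.5
```

When adjusting weights for a domain, the sum check guards against a rubric that no longer adds up to 100%.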

---

## Playwright Integration (for UI artifacts)

When evaluating web/H5/mini-program outputs, the Evaluator should:

1. **Navigate** to the deployed artifact URL
2. **Execute** each spec requirement as a user action sequence
3. **Observe** actual behavior (DOM state, network requests, visual output)
4. **Record** pass/fail per requirement with screenshots
5. **Report** structured JSON with score breakdown

Playwright MCP tool calls to use:
- `playwright_navigate` → open URL
- `playwright_click` → interact with elements
- `playwright_fill` → fill form inputs
- `playwright_screenshot` → capture evidence
- `playwright_get_visible_text` → verify content
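
The Record and Report steps can be kept independent of the browser tooling. The sketch below assumes the Evaluator has already run its checks and only needs to serialize the outcome. `RequirementResult`, `build_report`, the field names, and the screenshot paths are hypothetical, and the pass rate is a simple summary, not the rubric score.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class RequirementResult:
    req_id: str
    passed: bool
    evidence: str   # e.g. path to a screenshot captured during the check
    note: str = ""


def build_report(results: list[RequirementResult]) -> str:
    """Serialize per-requirement results plus a simple pass-rate summary."""
    passed = sum(r.passed for r in results)
    report = {
        "requirements": [asdict(r) for r in results],
        "passed": passed,
        "total": len(results),
        "pass_rate": round(100 * passed / len(results), 1) if results else 0.0,
    }
    return json.dumps(report, indent=2)


report_json = build_report([
    RequirementResult("R1", True, "shots/r1.png"),
    RequirementResult("R2", False, "shots/r2.png", "No inline error shown"),
    RequirementResult("R3", True, "shots/r3.png"),
])
```

Keeping one record per requirement makes it easy to hand the Generator targeted feedback for the items that failed.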

---

## Reference Files

- `references/architecture.md`: Detailed architecture patterns, spec templates, and design rationale
- `references/evaluator-prompts.md`: Ready-to-use Evaluator prompt templates for different artifact types

## Scripts

- `scripts/iteration_loop.py`: Complete iteration loop implementation template
- `scripts/calibrate_evaluator.py`: Evaluator calibration utility
Usage Guidance
This skill is a coherent pattern/template for running a Generator→Evaluator loop and appears benign, but it's a framework rather than a turnkey integration. Before installing/using it:
  1. Be prepared to provide and secure any Playwright/browser binaries and test environment access (the skill assumes browser automation but doesn't install it).
  2. Wire run_generator() and run_evaluator() yourself. Those functions currently raise NotImplementedError and contain commented examples referencing a WorkBuddy API; audit and restrict any subagent/session APIs you call.
  3. If you evaluate protected services/APIs, supply scoped test credentials (not broad production keys) and rotate them afterward.
  4. Be mindful of privacy: evaluator screenshots, logs, or artifact URLs may contain sensitive data, so store them securely or redact before uploading.
  5. If you allow autonomous invocation, review and control what target URLs the Evaluator will access to avoid accidental scanning of internal systems.
If you want extra assurance, ask the developer for a concrete integration example (how it will invoke your agents and where screenshots/logs are stored) and a manifest of required local tooling (Playwright, browsers) before running in production.
Capability Assessment
Purpose & Capability
The name/description (Generator–Evaluator dual-agent QA) matches the included SKILL.md, architecture references, prompt templates, and iteration/calibration scripts. One minor gap: the SKILL.md and templates assume Playwright-based browser evaluation and agent-subagent invocation (examples reference sessions_spawn / WorkBuddy), but the skill does not declare any runtime dependencies, binaries, or environment variables (e.g., Playwright, browsers, or subagent endpoint credentials). An integrator will need to supply those separately.
Instruction Scope
Runtime instructions focus on defining specs, running a generator, running an independent evaluator that performs real interactions (Playwright/http), scoring, and looping. There are no instructions that read unrelated system files, exfiltrate secrets, or contact unexpected external endpoints. The included templates require implementers to replace placeholders with their actual agent invocation code.
Install Mechanism
This is an instruction-only skill with template scripts; there is no install spec and no downloads or external installers. The code files are templates and raise NotImplementedError where integrators must implement their agent calls — nothing will be executed automatically by the skill as provided.
Credentials
The skill declares no required environment variables or credentials, which is proportionate to an instructional/template skill. However, some evaluator templates (API evaluation) mention passing auth headers/tokens and the Playwright flow will typically require network access and potentially credentials for protected test environments. Those are not requested by the skill and must be supplied by the user when integrating; make sure any tokens needed for target artifacts are scoped and managed appropriately.
Persistence & Privilege
always:false and default autonomous invocation are set (normal). The skill does not request persistent presence or modify other skills. Because the agent invocation hooks (run_generator/run_evaluator) are left for integrators to implement, the skill does not gain elevated privileges on its own.
How to Use
  1. Make sure OpenClaw is installed (local or Docker)
  2. Run the install command in chat: /install double-agent
  3. After installation, invoke the skill by name or use /double-agent
  4. Provide required inputs per the skill's parameter spec and get structured output
Version History
v1.0.0
Implements the Generator-Evaluator dual-agent architecture to eliminate AI self-evaluation bias. Evaluator uses Playwright to interact with the artifact like a real user, iterating 5-15 rounds to drive quality upward.
Metadata
Slug double-agent
Version 1.0.0
License MIT-0
All-time Installs 0
Active Installs 0
Total Versions 1
Frequently Asked Questions

What is DoubleAgent — Generator-Evaluator Dual Agent Pattern?

This skill should be used when designing, implementing, or improving any AI system that requires quality assurance through separation of generation and evalu... It is an AI Agent Skill for Claude Code / OpenClaw, with 102 downloads so far.

How do I install DoubleAgent — Generator-Evaluator Dual Agent Pattern?

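Run "/install double-agent" in the OpenClaw or Claude Code chat to install it in one step. The install itself needs no extra setup, but the skill does not install Playwright or browsers, so those must be provided separately.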

Is DoubleAgent — Generator-Evaluator Dual Agent Pattern free?

Yes, DoubleAgent — Generator-Evaluator Dual Agent Pattern is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does DoubleAgent — Generator-Evaluator Dual Agent Pattern support?

DoubleAgent — Generator-Evaluator Dual Agent Pattern is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created DoubleAgent — Generator-Evaluator Dual Agent Pattern?

It is built and maintained by mingyuan (@zmy1006-sudo); the current version is v1.0.0.
