# DoubleAgent Skill

## Purpose

The DoubleAgent pattern solves a fundamental problem in AI-generated software: AI self-evaluation bias.

When a single AI agent both generates and evaluates its own output, it systematically overestimates quality, the same cognitive conflict that occurs when a student grades their own exam. The solution is to forcibly separate the two cognitive roles into independent agents with different prompts, goals, and evaluation criteria.

This skill provides:
- Architecture templates for Generator-Evaluator agent pairs
- Evaluator prompt templates calibrated with few-shot scoring examples
- Iteration loop design for 5-15 round refinement cycles
- Playwright integration patterns for real browser-based evaluation
- Scoring rubric design to prevent score drift and grade inflation

---

## Core Architecture

```
User Goal / Spec
       ↓
┌─────────────┐
│  Generator  │ ← Produces output (code, UI, content, data)
└──────┬──────┘
       │ output artifact
       ↓
┌────────────────────────────────────┐
│ Evaluator                          │
│ • Reads spec (NOT generator output)│
│ • Operates artifact via Playwright │
│   (click, fill form, navigate)     │
│ • Scores on rubric (0-100)         │
│ • Writes structured feedback       │
└────────────────┬───────────────────┘
                 │ score + feedback
                 ↓
        ┌─────────────────┐
        │ Score ≥ target? │
        │ YES → Done      │
        │ NO → Loop       │
        └────────┬────────┘
                 │
                 └──→ Generator (next iteration)
```

**Key principle**: The Evaluator takes its expectations from the **original spec**, not from the Generator's own account of what it built. It exercises the artifact independently, as if it were a real user encountering the product for the first time.

---

## When to Apply

| Scenario | Apply DoubleAgent? |
|----------|--------------------|
| AI-generated frontend UI with interactions | ✅ Yes |
| Multi-step workflow code (forms, flows) | ✅ Yes |
| API endpoint implementation + validation | ✅ Yes |
| Content generation (reports, copy, docs) | ✅ Yes (text-based evaluator) |
| Single-function refactoring | ⚠️ Optional |
| Simple config changes | ❌ Not needed |

---

## Implementation Steps

### Step 1: Define the Spec Contract

Write a clear spec that both agents will reference independently. The spec must be:
- Concrete (measurable outcomes, not vague goals)
- Observable (evaluable through interaction or inspection)
- Versioned (so both agents work from the same contract)

See `references/architecture.md` for the spec template.
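
As a rough illustration of what a concrete, observable, versioned spec might look like in code, here is a minimal sketch. The `Spec` and `Requirement` classes and their field names are hypothetical and are not taken from `references/architecture.md`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Requirement:
    """One measurable requirement (hypothetical structure)."""
    req_id: str
    description: str  # concrete outcome, not a vague goal
    check: str        # how an evaluator can observe it


@dataclass(frozen=True)
class Spec:
    """Versioned contract handed unchanged to both agents."""
    version: str
    goal: str
    requirements: tuple = ()

    def render(self) -> str:
        """Render the spec as plain text suitable for either agent's prompt."""
        lines = [f"Spec v{self.version}: {self.goal}"]
        for r in self.requirements:
            lines.append(f"- [{r.req_id}] {r.description} (verify: {r.check})")
        return "\n".join(lines)


# Made-up example requirements for a login form.
spec = Spec(
    version="1.0",
    goal="Login form",
    requirements=(
        Requirement("R1", "Valid credentials redirect to /dashboard", "URL after submit"),
        Requirement("R2", "Empty password shows an inline error", "visible error text"),
    ),
)
print(spec.render())
```

Freezing the dataclasses is one way to make sure neither agent can mutate the contract mid-run; bumping `version` then becomes the only way to change it.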

### Step 2: Configure the Generator Agent

Assign the Generator a single role: **produce output that satisfies the spec**.

- Do NOT ask the Generator to self-evaluate
- Do NOT include evaluation criteria in the Generator's prompt
- Provide: spec + iteration history + previous evaluator feedback
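
The rules above could be enforced at the point where the Generator's prompt is assembled. The following is a minimal sketch, assuming the history entries carry `round`, `score`, and `feedback` keys as in the Step 4 loop; `build_generator_prompt` is a hypothetical helper, not part of this skill's scripts.

```python
def build_generator_prompt(spec_text: str, history: list[dict]) -> str:
    """Assemble the Generator's input from the spec and prior feedback only.

    Deliberately leaves out the scoring rubric and any request to self-assess.
    """
    parts = ["Produce output that satisfies this spec.", "", "SPEC:", spec_text]
    if history:
        parts += ["", "PREVIOUS ROUNDS:"]
        for entry in history:
            parts.append(
                f"Round {entry['round']} (score {entry['score']}): {entry['feedback']}"
            )
    return "\n".join(parts)


prompt = build_generator_prompt(
    "Login form with inline validation",
    [{"round": 0, "score": 55, "feedback": "Empty password shows no error"}],
)
print(prompt)
```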

### Step 3: Configure the Evaluator Agent

Assign the Evaluator a single role: **independently verify the spec is satisfied**.

- Load `references/evaluator-prompts.md` for calibrated prompt templates
- Use Playwright MCP for UI/web artifacts (real browser interaction)
- Use structured JSON output for scores to enable automated loop control
- Calibrate with few-shot examples BEFORE running (prevents grade inflation)
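
Structured JSON output only helps loop control if it is validated before use. The sketch below shows one way to parse the Evaluator's reply; the `score` and `feedback` field names are assumptions chosen to match the Step 4 loop, and the actual schema in `references/evaluator-prompts.md` may differ.

```python
import json
from dataclasses import dataclass


@dataclass
class Evaluation:
    score: float
    feedback: str


def parse_evaluation(raw: str) -> Evaluation:
    """Parse and validate the Evaluator's JSON reply (field names are assumed)."""
    data = json.loads(raw)
    score = data["score"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        raise ValueError(f"score must be a number, got {score!r}")
    if not 0 <= score <= 100:
        raise ValueError(f"score must be in 0..100, got {score!r}")
    feedback = data.get("feedback", "")
    if not isinstance(feedback, str):
        raise ValueError("feedback must be a string")
    return Evaluation(score=float(score), feedback=feedback)


evaluation = parse_evaluation('{"score": 72, "feedback": "Empty state is missing"}')
print(evaluation.score, evaluation.feedback)
```

Rejecting out-of-range or non-numeric scores early keeps a malformed reply from silently passing the threshold check.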

### Step 4: Design the Iteration Loop

```python
# `generator`, `evaluator`, and `spec` are assumed to be set up in Steps 1-3.
MAX_ROUNDS = 15
PASS_THRESHOLD = 80  # out of 100

history = []  # one entry per round, fed back to the Generator

for round_num in range(MAX_ROUNDS):
    output = generator.run(spec, history)
    evaluation = evaluator.run(spec, output)  # Playwright-based

    history.append({
        "round": round_num,
        "score": evaluation.score,
        "feedback": evaluation.feedback,
    })

    if evaluation.score >= PASS_THRESHOLD:
        break  # target reached

    if evaluation.score_trend == "plateauing":
        generator.switch_approach()  # Complete strategy reset
```

See `scripts/iteration_loop.py` for a complete implementation template.
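
The loop above checks a `score_trend` value without saying how it is computed. One possible heuristic, sketched here as an assumption rather than as what `scripts/iteration_loop.py` does, is to compare the best score of the last few rounds against the best score before them. The `window` and `min_gain` defaults are illustrative.

```python
def score_trend(scores: list[float], window: int = 3, min_gain: float = 2.0) -> str:
    """Classify the recent trend of evaluator scores.

    Returns "plateauing" when the best score in the last `window` rounds
    beats the best earlier score by less than `min_gain` points.
    """
    if len(scores) <= window:
        return "insufficient_data"
    best_before = max(scores[:-window])
    best_recent = max(scores[-window:])
    if best_recent - best_before < min_gain:
        return "plateauing"
    return "improving"


print(score_trend([40, 55, 62, 63, 62, 63]))  # plateauing
print(score_trend([40, 50, 60, 70]))          # improving
```

With the history structure from the loop, the input would be `[entry["score"] for entry in history]`.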

### Step 5: Calibrate the Evaluator

To prevent score drift, run the Evaluator on 3-5 known examples FIRST:
- 1 example at ~30/100 (clearly bad)
- 1 example at ~60/100 (mediocre)
- 1 example at ~85/100 (good)
- 1 example at ~95/100 (excellent)

If scores deviate >15 points from expected, adjust the Evaluator's prompt or rubric weights before the real run.
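
The 15-point check can be automated with a few lines. This is a minimal sketch with made-up evaluator scores; `calibration_report` is a hypothetical helper and is not the interface of `scripts/calibrate_evaluator.py`.

```python
MAX_DEVIATION = 15  # points, per the guidance above


def calibration_report(expected: dict[str, float], actual: dict[str, float]) -> dict[str, float]:
    """Return the signed deviation for each example that is off by more than MAX_DEVIATION."""
    return {
        name: actual[name] - expected[name]
        for name in expected
        if abs(actual[name] - expected[name]) > MAX_DEVIATION
    }


expected = {"bad": 30, "mediocre": 60, "good": 85, "excellent": 95}
actual = {"bad": 52, "mediocre": 66, "good": 88, "excellent": 97}  # made-up scores

off = calibration_report(expected, actual)
print(off)  # {'bad': 22}
```

A positive deviation on the low-quality example, as here, is the typical signature of grade inflation: the Evaluator is too generous with clearly bad output.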

---

## Scoring Rubric Design

Effective rubrics for software systems:

| Dimension | Weight | What to Measure |
|-----------|--------|-----------------|
| Functional completeness | 30% | Does each spec requirement work end-to-end? |
| Interaction quality | 25% | Click/form/navigation behavior as a real user |
| Edge case handling | 20% | Error states, empty data, boundary inputs |
| Code/design quality | 15% | Consistency, readability, no obvious anti-patterns |
| Originality / craft | 10% | Avoids generic/template outputs when spec requires uniqueness |

Adjust weights based on the domain. For content systems, increase "originality". For data pipelines, increase "edge case handling".
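
To make the weighting concrete, the overall score is the weighted sum of per-dimension scores. The sketch below uses the weights from the table; the dictionary keys and the example dimension scores are illustrative assumptions.

```python
WEIGHTS = {
    "functional_completeness": 0.30,
    "interaction_quality": 0.25,
    "edge_case_handling": 0.20,
    "code_design_quality": 0.15,
    "originality_craft": 0.10,
}


def weighted_score(dimension_scores: dict[str, float],
                   weights: dict[str, float] = WEIGHTS) -> float:
    """Combine per-dimension scores (0-100 each) into one 0-100 score."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1.0")
    missing = set(weights) - set(dimension_scores)
    if missing:
        raise ValueError(f"missing dimension scores: {sorted(missing)}")
    return round(sum(weights[d] * dimension_scores[d] for d in weights), 2)


total = weighted_score({
    "functional_completeness": 90,
    "interaction_quality": 80,
    "edge_case_handling": 60,
    "code_design_quality": 70,
    "originality_craft": 50,
})
print(total)  # 74.5
```

Here 0.30·90 + 0.25·80 + 0.20·60 + 0.15·70 + 0.10·50 = 74.5, which is below a pass threshold of 80 and would trigger another round. Checking that adjusted weights still sum to 1.0 guards against accidental score inflation when tuning for a domain.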

---

## Playwright Integration (for UI artifacts)

When evaluating web/H5/mini-program outputs, the Evaluator should:

1. **Navigate** to the deployed artifact URL
2. **Execute** each spec requirement as a user action sequence
3. **Observe** actual behavior (DOM state, network requests, visual output)
4. **Record** pass/fail per requirement with screenshots
5. **Report** structured JSON with score breakdown

Playwright MCP tool calls to use (exact tool names depend on the Playwright MCP server in use):
- `playwright_navigate` → open URL
- `playwright_click` → interact with elements
- `playwright_fill` → fill form inputs
- `playwright_screenshot` → capture evidence
- `playwright_get_visible_text` → verify content
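
Steps 4 and 5 above (record, then report) can be sketched independently of the browser tooling. The example below assumes the per-requirement outcomes have already been collected; `RequirementResult`, `build_report`, and the report field names are hypothetical, and the screenshot paths are placeholders.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class RequirementResult:
    req_id: str
    passed: bool
    screenshot: str  # path to the evidence screenshot
    note: str = ""


def build_report(results: list[RequirementResult]) -> str:
    """Serialize per-requirement pass/fail records into a JSON report."""
    passed = sum(r.passed for r in results)
    report = {
        "passed": passed,
        "total": len(results),
        "pass_rate": round(100 * passed / len(results), 1) if results else 0.0,
        "requirements": [asdict(r) for r in results],
    }
    return json.dumps(report, indent=2)


report_json = build_report([
    RequirementResult("R1", True, "evidence/r1.png"),
    RequirementResult("R2", False, "evidence/r2.png", "No inline error shown"),
])
print(report_json)
```

Keeping a screenshot path on every record means each failed requirement in the feedback can be traced back to concrete evidence.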

---

## Reference Files

- `references/architecture.md`: Detailed architecture patterns, spec templates, and design rationale
- `references/evaluator-prompts.md`: Ready-to-use Evaluator prompt templates for different artifact types

## Scripts

- `scripts/iteration_loop.py`: Complete iteration loop implementation template
- `scripts/calibrate_evaluator.py`: Evaluator calibration utility

## Installation

- Make sure OpenClaw is installed (local or Docker)
- Run the install command in chat: `/install double-agent`
- After installation, invoke the skill by name or use `/double-agent`
- Provide the required inputs per the skill's parameter spec and get structured output

## FAQ

**What is DoubleAgent (Generator-Evaluator Dual Agent Pattern)?**
This skill should be used when designing, implementing, or improving any AI system that requires quality assurance through separation of generation and evalu... It is an AI Agent Skill for Claude Code / OpenClaw, with 102 downloads so far.

**How do I install it?**
Run `/install double-agent` in the OpenClaw or Claude Code chat to install it in one step. No extra setup is required.

**Is it free?**
Yes. DoubleAgent is licensed under MIT-0, so you can download, install, and use it at no cost.

**Which platforms does it support?**
DoubleAgent is cross-platform and runs anywhere OpenClaw / Claude Code is available.

**Who created it?**
It is built and maintained by mingyuan (@zmy1006-sudo); the current version is v1.0.0.