Refinement Loop (Opus 4.8 Edition)
/install refinement-loop
Refinement Loop (Opus 4.8 Edition)
A refinement loop improves an output it can't get right in one shot: it generates a candidate, evaluates it against explicit criteria, revises based on that evaluation, and repeats until the output is good enough.
Before building a loop: ask whether a single Opus call with extended thinking already produces acceptable output. Opus 4.8's thinking blocks perform internal multi-step self-critique before committing to a response. One well-crafted prompt + thinking=on often outperforms a sloppy three-pass loop. Refine only when quality genuinely benefits from critique the generator couldn't apply to itself in one go.
When to use vs. not
Use a refinement loop when:
- Quality has a ceiling a single pass won't reach, and you can articulate what "better" means with a rubric.
- You have (or can write) concrete evaluation criteria — a checklist, test cases, objective checks.
- The improvement is worth the token cost. Opus 4.8 loops are expensive. Budget before you loop.
Don't use one when:
- A single Opus call with thinking=on already gets there. Test first.
- You can't define what good looks like. Without a real evaluation signal, the loop churns.
- The task is purely subjective and a human is the only meaningful judge — loop the human's feedback, not an AI critic's.
- Cost will exceed the value of the improvement.
Opus 4.8-Specific Patterns
Pattern 1 — Thinking-as-Critic (single model, one call per pass)
Opus 4.8 with thinking enabled performs internal deliberation before responding. That thinking block IS a critique pass. Structure your prompt so the thinking does the evaluative work:
System: You are an expert [domain] writer and critic.
Think through this step by step:
1. Draft a response to the requirements below.
2. Critique your draft against this rubric: [rubric].
3. Identify the top 2–3 specific failures.
4. Revise your draft to fix them.
5. Output only the final revised version.
Requirements: [requirements]
This collapses the generator and critic into one Opus call. Use this as pass 1. Only escalate to a multi-call loop if the output still falls short.
Pattern 2 — Model Routing (Sonnet critic, Opus generator)
When you need a true multi-call loop, don't run Opus on every step. Route by role:
| Role | Model | Rationale |
|---|---|---|
| Generator (pass 1) | Opus 4.8 + thinking | Best first draft |
| Critic (all passes) | Claude Sonnet | Fast, cheap, accurate at rubric evaluation |
| Reviser (passes 2+) | Sonnet or Opus | Sonnet if rubric-mechanical; Opus if creative/complex |
| Final pass | Opus 4.8 + thinking | Polish and coherence check |
This cuts loop cost by 60–80% vs. running Opus on every step.
Pattern 3 — Thinking Block as Critique Extractor
When using Opus as critic, instruct it to surface the critique in the thinking and output only a structured critique object. The thinking block will be far more honest and thorough than the visible response (Opus tends to soften visible criticism):
System: You are a strict critic. Do not produce the revised artifact.
Output only a JSON critique object:
{
"score": \x3C0-10>,
"failures": ["specific failure 1", "specific failure 2", ...],
"converged": \x3Ctrue if no meaningful improvements remain>
}
Rubric: [rubric]
Artifact: [artifact]
The Five Organs
Specify all five explicitly; a vague version of any one breaks the loop.
- Iteration step — one generate-or-revise pass producing a fresh candidate.
- State — original requirements + current best candidate + latest critique. Re-supply requirements every pass to prevent drift.
- Feedback signal — specific, actionable critique tied to rubric criteria. This is the engine. Weak critique = no improvement.
- Stopping condition — bar met, converged, or max iterations hit.
- Safeguard — keep-best tracking (not just last) + hard iteration cap + cost cap.
Control Structure
budget_tokens = 0
MAX_TOKENS = 50_000 # set before you start; abort if exceeded
MAX_ITERS = 4 # rarely need more; Opus is strong
best = opus_generate(requirements, thinking=True) # Pattern 1 first
score, critique = sonnet_evaluate(best, rubric) # cheap critic
budget_tokens += estimate_tokens(best, critique)
i = 0
while score \x3C BAR and i \x3C MAX_ITERS:
if budget_tokens > MAX_TOKENS:
break # cost abort — return best seen so far
candidate = sonnet_revise(best, critique, requirements)
new_score, new_critique = sonnet_evaluate(candidate, rubric)
budget_tokens += estimate_tokens(candidate, new_critique)
if new_critique.get("converged"):
break # model says no meaningful improvements remain
if semantic_similarity(candidate, best) > 0.97:
break # text stopped changing — convergence
if new_score > score:
best, score, critique = candidate, new_score, new_critique
i += 1
# Optional: final Opus polish pass if budget allows
if budget_tokens + OPUS_POLISH_COST \x3C MAX_TOKENS:
best = opus_polish(best, requirements, thinking=True)
return best
Convergence Detection (Specific)
- Critic says converged — ask the critic to set
"converged": truewhen the rubric has no remaining actionable failures. - Semantic similarity — embed both candidates and check cosine similarity. Stop when similarity > 0.97.
- Score delta — if
new_score - score \x3C 0.5(on a 10-point scale) for two consecutive passes, stop. - Hard cap — MAX_ITERS always fires. Never skip this.
Combine: stop when any one triggers.
The Evaluation Rubric
The critique must be specific and actionable, not a grade.
- Bad: "7/10, could be tighter."
- Good: "The second paragraph repeats the thesis from paragraph one; cut it. The claim about adoption rates has no source. The closing sentence is passive — make it a direct call to action."
Pass the full rubric to the critic every round. Where the artifact allows objective checks (code passes tests, JSON validates, under word limit), use those — far stronger than prose judgments.
Role Separation
Run generation and evaluation as separate roles — different prompts, different instructions, ideally different models (see Pattern 2). A critic operating in the same breath that just produced the text tends to rubber-stamp it.
With Opus's thinking-as-critic pattern (Pattern 1), the thinking block provides enough adversarial distance. When visible-output critique is too soft, switch to Pattern 3.
State and Drift Prevention
Carry three things between passes: original requirements, current best candidate, latest critique. Re-supply the original requirements every round.
With Opus 4.8's 200k context, pass the full history of all passes. This helps the model see the trajectory and avoid re-introducing earlier mistakes.
Cost Controls
- Set a
MAX_TOKENSbudget before the loop begins. Abort and return best if exceeded. - Use model routing (Pattern 2) — Sonnet does critique and revision; Opus only on pass 1 and optional final polish.
- Log token usage per pass. If a pass costs more than all prior passes combined, something is wrong.
- For prompt refinement loops, test on a small sample first before running the full eval set.
Temperature
- Generation pass (Opus): 1.0 with thinking enabled. Thinking provides diversity.
- Revision passes (Sonnet): 0.3–0.5. Deliberate changes, not random.
- Critic passes: 0.1. Deterministic evaluation.
Failure Modes and Mitigations
| Failure | Mitigation |
|---|---|
| Sycophantic critic | Separate critic role, concrete rubric, Pattern 3, objective checks |
| Drift from original goal | Re-supply requirements every pass |
| Over-correction | Critique against full rubric every round; keep-best |
| Mode collapse / blandness | Cap iterations; Opus final polish pass |
| Final ≠ best | Track and return highest-scoring, never blindly the last |
| Infinite churn | MAX_ITERS + convergence detection (all three signals) |
| Cost blowout | MAX_TOKENS budget cap + model routing |
| Looping when one prompt would do | Run Opus+thinking single call first |
| Vague convergence | Use all three convergence signals |
Worked Example A: Refine a Piece of Writing
Goal: tight 150-word product blurb. Rubric: under 150 words, leads with benefit, one concrete proof point, active voice, ends on CTA.
- Opus pass 1 (Pattern 1) → 148 words, benefit-first, active, CTA but no proof point. Score 7/10.
- Sonnet critic → "Missing proof point." Not converged.
- Sonnet revise → adds a stat, 147 words. Score 9/10. Converged. → 2 Opus calls, 2 Sonnet calls.
Worked Example B: Refine a Prompt
Goal: prompt that extracts {name, date, total} as JSON from invoices.
- Generate prompt → test on 5 invoices → prose leaks, dates not normalized.
- Sonnet critic → "3/5 fail JSON-only output. Add JSON-only instruction and date format spec."
- Sonnet revise → adds instruction, date format, one-shot example → 5/5 pass. Bar met. → 1 Opus call, 2 Sonnet calls.
Design Checklist
- Tried single Opus+thinking call first — confirmed it falls short
- Original requirements written down and re-supplied every pass
- Explicit rubric / criteria (objective checks where possible)
- MAX_TOKENS budget set before loop starts
- Model routing configured (Sonnet critic/reviser, Opus generator)
- Separate critic role with instructions to find specific, actionable faults
- Reviser anchored to original requirements, not just last critique
- Keep-best tracking (return highest-scoring, not last)
- Stopping condition: bar met OR converged (all 3 signals) OR max-iters OR cost cap
- MAX_ITERS cap (4 is usually enough)
- Temperature set per role (1.0 Opus generation, 0.3–0.5 revision, 0.1 critic)
- Token usage logged per pass
- Make sure OpenClaw is installed (local or Docker)
- Run the install command in chat:
/install refinement-loop - After installation, invoke the skill by name or use
/refinement-loop - Provide required inputs per the skill's parameter spec and get structured output
What is Refinement Loop (Opus 4.8 Edition)?
Design and run iterative generate→critique→revise loops optimized for Claude Opus 4.8, with thinking-as-critic, cost controls, and model routing. It is an AI Agent Skill for Claude Code / OpenClaw, with 48 downloads so far.
How do I install Refinement Loop (Opus 4.8 Edition)?
Run "/install refinement-loop" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.
Is Refinement Loop (Opus 4.8 Edition) free?
Yes, Refinement Loop (Opus 4.8 Edition) is completely free, licensed under MIT-0. You can download, install and use it at no cost.
Which platforms does Refinement Loop (Opus 4.8 Edition) support?
Refinement Loop (Opus 4.8 Edition) is cross-platform and runs anywhere OpenClaw / Claude Code is available (cross-platform).
Who created Refinement Loop (Opus 4.8 Edition)?
It is built and maintained by Prometheus-Prime (@prometheus-prime); the current version is v1.0.0.