Chapter 53: Reflexion: Self-Improvement Through Failure
Introduction
ReAct and Plan-and-Execute both assume the Agent can complete a task in one attempt or a small number of attempts. In practice, some tasks (especially code generation, mathematical reasoning, and complex writing) are hard to get right on the first try, even with an excellent LLM. Reflexion offers an elegant answer: let the Agent learn from failure the way a human expert does, improving through self-reflection until the task succeeds or the trial budget runs out. This chapter covers the Reflexion paper's principles and its implementation in Hermes, with complete code examples.
53.1 Reflexion's Core: Verbal Reinforcement Learning
What Is Verbal Reinforcement Learning
In 2023, Noah Shinn et al. published the Reflexion paper, proposing a "learning" mechanism that requires no model weight updates: replacing numerical reward signals in traditional RL with verbal feedback (reflections).
Problems with traditional RL:
- Requires many samples to converge
- Reward function design is difficult
- Cannot directly explain why something failed
Reflexion's innovations:
- Converts reward signals into verbal reflections
- Stores reflections in external episodic memory
- On the next attempt, reflections are injected as context to guide improvement
flowchart TD
subgraph Trial["Single Trial"]
T1[Task] --> T2[Agent Executes]
T2 --> T3{Task Succeeded?}
end
subgraph Reflect["Failure Reflection"]
T3 -->|Failed| R1[Evaluator\nAssesses failure]
R1 --> R2[Self-Reflection\nGenerates reflection text]
R2 --> R3[Store in Memory]
end
subgraph Retry["Next Attempt"]
R3 --> N1[New Agent instance]
N1 --> N2[Load prior reflections]
N2 --> N3[Execute task\navoiding known mistakes]
N3 --> T3
end
T3 -->|Succeeded| DONE[Done]
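The flow above can be sketched in a few lines without any LLM calls. This is a minimal illustration, not the paper's implementation: `reflexion_loop` and the three `toy_*` stand-ins are hypothetical names introduced here, and the "agent" is rigged to succeed only once it has a reflection to read.

```python
from typing import Callable


def reflexion_loop(
    task: str,
    act: Callable[[str, list[str]], str],
    evaluate: Callable[[str], bool],
    reflect: Callable[[str, str], str],
    max_trials: int = 3,
) -> tuple[bool, str, list[str]]:
    """Run trial -> evaluate -> reflect -> retry until success or the budget is spent."""
    reflections: list[str] = []  # episodic memory, kept outside the model
    output = ""
    for _ in range(max_trials):
        output = act(task, reflections)  # each attempt sees all prior reflections
        if evaluate(output):
            return True, output, reflections
        reflections.append(reflect(task, output))  # store the verbal feedback
    return False, output, reflections


# Toy stand-ins (hypothetical): a real system would call an LLM in each of these.
def toy_act(task: str, reflections: list[str]) -> str:
    return "4" if reflections else "5"


def toy_evaluate(output: str) -> bool:
    return output == "4"


def toy_reflect(task: str, output: str) -> str:
    return f"I answered {output} for '{task}'; recheck the arithmetic."


ok, answer, notes = reflexion_loop("What is 2 + 2?", toy_act, toy_evaluate, toy_reflect)
print(ok, answer, len(notes))
```

The model itself never changes between trials; only the `reflections` list that is passed back in does.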
Comparison with Traditional Methods
| Method | Learning Mechanism | Needs Labeled Data | Explains Failures | Compute Cost |
|---|---|---|---|---|
| Fine-tuning | Gradient descent | Yes (lots) | No | Very High |
| RLHF | Human feedback + RL | Yes (manual) | No | High |
| Reflexion | Verbal reflection (zero weight update) | No | Yes | Low (inference only) |
| RAG | External knowledge retrieval | Yes (knowledge base) | Partial | Medium |
53.2 The Fail → Reflect → Retry Loop
Phase 1: Trial Execution
The Agent executes the task using its current strategy, plus any reflections from earlier trials. Execution can use ReAct or any other Agent mode.
Phase 2: Evaluation and Reflection
After execution, the Evaluator judges whether the result meets requirements. If not, Self-Reflection triggers:
- Analyze root cause (logical error? insufficient information? misunderstood requirements?)
- Summarize lessons from this attempt
- Propose specific, actionable improvements for the next try
Phase 3: Memory Update
The reflection is stored in external memory and injected into context on the next attempt.
# reflexion_memory.py
import time
from dataclasses import dataclass, field


@dataclass
class TrialRecord:
    """One attempt at the task, with its evaluation and reflection."""
    trial_number: int
    task: str
    attempt: str
    result: str
    evaluation: str
    reflection: str
    success: bool
    timestamp: float = field(default_factory=time.time)

    def to_context_string(self) -> str:
        # Truncate the output so one long attempt cannot dominate the prompt.
        status = "SUCCESS" if self.success else "FAILED"
        return (
            f"\n[Trial {self.trial_number} - {status}]\n"
            f"Output: {self.result[:300]}\n"
            f"Evaluation: {self.evaluation}\n"
            f"Reflection: {self.reflection}\n"
        )


class ReflexionMemory:
    """Episodic memory: stores every trial and renders failed ones as prompt context."""

    def __init__(self, max_trials: int = 5):
        self.trials: list[TrialRecord] = []
        self.max_trials = max_trials

    def add_trial(self, record: TrialRecord) -> None:
        self.trials.append(record)

    def get_reflections_context(self) -> str:
        # Only failed trials carry lessons worth injecting into the next attempt.
        failed = [t for t in self.trials if not t.success]
        if not failed:
            return ""
        header = f"## Prior Failure Records ({len(failed)} total)\nLearn from these failures:\n\n"
        records = "\n---\n".join(t.to_context_string() for t in failed)
        footer = "\n## Improvement Guidance\nBased on the above, actively avoid identified error patterns."
        return header + records + footer

    def can_retry(self) -> bool:
        # The trial budget counts every recorded attempt, successful or not.
        return len(self.trials) < self.max_trials
53.3 Complete Reflexion Implementation
# reflexion_agent.py
import asyncio
import re
from typing import Callable, Optional

from openai import AsyncOpenAI

from reflexion_memory import ReflexionMemory, TrialRecord

ACTOR_SYSTEM = """You are an AI Agent that completes tasks carefully.
{reflection_context}
**Current Task**: Based on the failure records above (if any), improve your approach and complete the task. Avoid repeating known mistakes."""

EVALUATOR_SYSTEM = """You are a strict task evaluation expert.
Assess whether the Agent's output completely and correctly fulfills the task.
Output format:
- First: PASS or FAIL
- Then: evaluation rationale (1-3 sentences)
- If FAIL: identify the most critical problem"""

REFLECTION_SYSTEM = """You are a self-improvement expert.
Given a task, a failed attempt, and evaluation feedback, generate insightful reflection.
Requirements:
1. Analyze root cause (not surface symptoms)
2. Identify specific error patterns
3. Provide 2-3 actionable improvement suggestions
4. Be concise (3-5 sentences), focus on the most important issue"""


class Evaluator:
    """Judges an output with a custom callable if one is given, otherwise with an LLM."""

    def __init__(self, client: AsyncOpenAI, model: str,
                 custom_evaluator: Optional[Callable[[str], bool]] = None):
        self.client = client
        self.model = model
        self.custom_evaluator = custom_evaluator

    async def evaluate(self, task: str, output: str) -> tuple[bool, str]:
        if self.custom_evaluator:
            try:
                passed = bool(self.custom_evaluator(output))
            except Exception as e:
                return False, f"Evaluator exception: {e}"
            return passed, "Custom evaluator passed" if passed else "Custom evaluator failed"

        response = await self.client.chat.completions.create(
            model=self.model,
            messages=[
                {"role": "system", "content": EVALUATOR_SYSTEM},
                {"role": "user", "content": f"Task: {task}\n\nAgent output:\n{output}"}
            ],
            temperature=0.1,
            max_tokens=512
        )
        evaluation = response.choices[0].message.content or ""
        # Take the first standalone verdict word. A plain substring test would
        # treat "FAIL: the output does not pass ..." as a pass.
        verdict = re.search(r"\b(PASS|FAIL)\b", evaluation.upper())
        passed = verdict is not None and verdict.group(1) == "PASS"
        return passed, evaluation


class SelfReflector:
    """Turns a failed attempt and its evaluation into a short verbal reflection."""

    def __init__(self, client: AsyncOpenAI, model: str):
        self.client = client
        self.model = model

    async def reflect(self, task: str, attempt: str, result: str, evaluation: str) -> str:
        prompt = (
            f"Task: {task}\n"
            f"My attempt summary: {attempt[:500]}\n"
            f"Actual output: {result[:500]}\n"
            f"Evaluation feedback: {evaluation}\n"
            "Generate insightful self-reflection to help me avoid the same mistakes next time."
        )
        response = await self.client.chat.completions.create(
            model=self.model,
            messages=[
                {"role": "system", "content": REFLECTION_SYSTEM},
                {"role": "user", "content": prompt}
            ],
            temperature=0.3,
            max_tokens=512
        )
        return response.choices[0].message.content or ""


class ReflexionAgent:
    """
    Reflexion Agent: continuously improves through fail-reflect-retry loops.
    """

    def __init__(
        self,
        model: str = "NousResearch/Hermes-3-Llama-3.1-8B",
        base_url: str = "http://localhost:8000/v1",
        api_key: str = "not-needed",
        max_trials: int = 5,
        custom_evaluator: Optional[Callable[[str], bool]] = None
    ):
        self.client = AsyncOpenAI(base_url=base_url, api_key=api_key)
        self.model = model
        self.max_trials = max_trials
        self.memory = ReflexionMemory(max_trials=max_trials)
        self.evaluator = Evaluator(self.client, model, custom_evaluator)
        self.reflector = SelfReflector(self.client, model)

    async def _attempt_task(self, task: str, reflection_context: str) -> tuple[str, str]:
        """Run one Actor call and return (attempt summary, raw output)."""
        system = ACTOR_SYSTEM.format(reflection_context=reflection_context)
        response = await self.client.chat.completions.create(
            model=self.model,
            messages=[
                {"role": "system", "content": system},
                {"role": "user", "content": task}
            ],
            temperature=0.3,
            max_tokens=2048
        )
        output = response.choices[0].message.content or ""
        return f"Generated {len(output)} chars", output

    async def run(self, task: str) -> dict:
        # Start each task with an empty memory so earlier runs do not consume
        # this task's trial budget.
        self.memory = ReflexionMemory(max_trials=self.max_trials)
        print(f"\n{'='*60}\nReflexion Agent\nTask: {task[:80]}\nMax trials: {self.max_trials}\n{'='*60}")

        while self.memory.can_retry():
            trial_num = len(self.memory.trials) + 1
            print(f"\n--- Trial {trial_num} ---")

            reflection_context = self.memory.get_reflections_context()
            attempt_summary, output = await self._attempt_task(task, reflection_context)
            print(f"[Actor] Output ready ({len(output)} chars)")

            passed, evaluation = await self.evaluator.evaluate(task, output)
            print(f"[Evaluator] {'PASS' if passed else 'FAIL'}: {evaluation[:100]}")

            if passed:
                record = TrialRecord(trial_num, task, attempt_summary, output, evaluation, "", True)
                self.memory.add_trial(record)
                print(f"\n[Reflexion] Trial {trial_num} succeeded!")
                return {
                    "success": True,
                    "answer": output,
                    "trials": trial_num,
                    "reflections": [t.reflection for t in self.memory.trials if t.reflection]
                }

            reflection = await self.reflector.reflect(task, attempt_summary, output, evaluation)
            print(f"[Reflector] {reflection[:150]}...")
            record = TrialRecord(trial_num, task, attempt_summary, output, evaluation, reflection, False)
            self.memory.add_trial(record)

        # Every trial failed. Picking the longest output is only a crude
        # heuristic for "best attempt"; it says nothing about correctness.
        best = max(self.memory.trials, key=lambda t: len(t.result), default=None)
        return {
            "success": False,
            "answer": best.result if best else "",
            "trials": len(self.memory.trials),
            "reflections": [t.reflection for t in self.memory.trials if t.reflection],
            "error": f"Task incomplete after {self.max_trials} trials"
        }
# Code generation evaluator using unit tests
def make_code_evaluator(test_cases: list) -> Callable:
    def evaluate(code_output: str) -> bool:
        import re
        # Prefer the first fenced Python block; fall back to the raw output.
        code_match = re.search(r'```python\n(.*?)```', code_output, re.DOTALL)
        code = code_match.group(1) if code_match else code_output
        namespace = {}
        try:
            # WARNING: exec runs model-generated code inside this process.
            # Use it only in a throwaway or sandboxed environment.
            exec(code, namespace)
        except Exception as e:
            print(f"  Execution error: {e}")
            return False
        func = namespace.get('solution') or namespace.get('solve')
        if not func:
            print("  No solution/solve function found")
            return False
        for input_val, expected in test_cases:
            try:
                result = func(input_val)
                if result != expected:
                    print(f"  FAIL: solution({input_val}) = {result}, expected {expected}")
                    return False
                print(f"  PASS: solution({input_val}) = {result}")
            except Exception as e:
                print(f"  Test exception: {e}")
                return False
        return True
    return evaluate


async def main():
    # Expected values follow the task statement: only positive numbers are
    # summed, so [10, -3, 7] yields 17 (10 + 7), not 14.
    test_cases = [([1, 2, 3, 4, 5], 15), ([10, -3, 7], 17), ([], 0)]
    agent = ReflexionAgent(
        max_trials=4,
        custom_evaluator=make_code_evaluator(test_cases)
    )
    result = await agent.run(
        "Write a Python function `solution(nums: list) -> int` that returns "
        "the sum of all positive integers in the list (ignoring negatives and zero)."
    )
    print(f"\nResult: {'Success' if result['success'] else 'Incomplete'}")
    print(f"Trials: {result['trials']}")
    print(f"Final answer:\n{result['answer']}")


if __name__ == "__main__":
    asyncio.run(main())
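The unit-test evaluator above calls `exec` on model-generated code in the agent's own process, so an infinite loop or a destructive statement affects the agent itself. One possible mitigation, sketched here under assumptions, is to run each test in a child Python process with a timeout. `run_solution_isolated` is a hypothetical helper, not part of the chapter's code, and it assumes the argument and return value are JSON-serializable. A subprocess with a timeout limits hangs and crashes but is not a security sandbox.

```python
import json
import subprocess
import sys


def run_solution_isolated(code: str, arg, timeout_s: float = 5.0) -> tuple[bool, object]:
    """Call `solution(arg)` from `code` in a child Python process.

    Returns (True, result) on success, or (False, reason) on error or timeout.
    """
    # The argument is passed as a JSON string literal embedded in the program;
    # the child prints the result as JSON on its last output line.
    arg_literal = json.dumps(json.dumps(arg))
    program = (
        code
        + "\nimport json\n"
        + f"print(json.dumps(solution(json.loads({arg_literal}))))\n"
    )
    try:
        proc = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False, "timeout"
    if proc.returncode != 0:
        err_lines = proc.stderr.strip().splitlines()
        return False, err_lines[-1] if err_lines else "error"
    out_lines = proc.stdout.strip().splitlines()
    if not out_lines:
        return False, "no output"
    try:
        return True, json.loads(out_lines[-1])
    except json.JSONDecodeError:
        return False, "unparseable output"


good = "def solution(nums):\n    return sum(n for n in nums if n > 0)"
print(run_solution_isolated(good, [10, -3, 7]))

hang = "def solution(nums):\n    while True:\n        pass"
print(run_solution_isolated(hang, [1], timeout_s=1.0))
```

An evaluator built on this helper would compare the returned value with the expected one for each test case, and could pass the failure reason to the reflector as evaluation feedback.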
53.4 Applicable Scenarios
Best Fits for Reflexion
| Scenario | Why It Fits | Evaluator Type |
|---|---|---|
| Code generation | Clear unit tests; success/failure is unambiguous | Unit tests (automated) |
| Mathematical reasoning | Answer has right/wrong distinction | Mathematical verifier |
| Structured data extraction | Output format has strict requirements | Schema validator |
| Logic puzzles | Exactly one correct answer | Rule checker |
| SQL generation | Verifiable by execution result | Database validator |
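As an illustration of the "Schema validator" row, here is a minimal sketch of an evaluator for structured data extraction. `make_json_evaluator` is a hypothetical helper written for this example; it returns a `str -> bool` callable, which is the shape the `custom_evaluator` parameter in section 53.3 expects. It checks only top-level keys and Python types, so a production validator would likely need a fuller schema library.

```python
import json
from typing import Callable


def make_json_evaluator(required: dict[str, type]) -> Callable[[str], bool]:
    """Build an evaluator that passes only if the output is a JSON object
    containing every required key with a value of the expected type."""
    def evaluate(output: str) -> bool:
        try:
            data = json.loads(output)
        except json.JSONDecodeError:
            return False  # not valid JSON at all
        if not isinstance(data, dict):
            return False  # valid JSON, but not an object
        return all(
            key in data and isinstance(data[key], expected)
            for key, expected in required.items()
        )
    return evaluate


evaluator = make_json_evaluator({"name": str, "age": int})
print(evaluator('{"name": "Ada", "age": 36}'))    # well-formed, correct types
print(evaluator('{"name": "Ada", "age": "36"}'))  # age is a string, not an int
print(evaluator('Name: Ada, Age: 36'))            # not JSON
```

Because the check is deterministic, a failure here is a trustworthy signal for the reflection step, which is the property the scenarios in this table share.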
Poor Fits
- Creative writing (lacks objective evaluation criteria)
- Open-ended Q&A (answers are diverse, "success" hard to define)
- Simple single-step tasks (reflection overhead exceeds benefit)
Reflexion's Limitations
- Token cost: Each failure requires additional reflection calls; costs scale with trials
- Evaluator quality is critical: LLM evaluators themselves may misjudge
- Context window pressure: Many failures mean many reflections competing for context space
- Cannot exceed model capability: If the task exceeds model ability, reflection cannot compensate
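One simple way to ease the context window pressure listed above is to inject only the most recent reflections that fit a fixed budget. The sketch below is an assumption-laden illustration: `select_reflections` is a hypothetical helper, and it uses character count as a rough proxy for tokens.

```python
def select_reflections(reflections: list[str], max_chars: int = 1500) -> list[str]:
    """Keep the most recent reflections that fit within a character budget.

    Walks backwards from the newest reflection, stops once the budget would be
    exceeded, then restores chronological order.
    """
    kept: list[str] = []
    used = 0
    for text in reversed(reflections):
        if used + len(text) > max_chars:
            break
        kept.append(text)
        used += len(text)
    kept.reverse()
    return kept


history = ["a" * 800, "b" * 600, "c" * 700]
recent = select_reflections(history, max_chars=1500)
print([r[0] for r in recent])
```

Dropping the oldest reflections first assumes later reflections already build on earlier ones. Note that if the newest reflection alone exceeds the budget, this version returns an empty list; truncating or summarizing that reflection would be an alternative design.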
53.5 Relationship to Reinforcement Learning
Reflexion can be understood as "episodic" reinforcement learning:
- State: Current task + episodic memory of reflections
- Action: Agent's output
- Reward: Evaluator's PASS/FAIL signal
- Policy update: Adding verbal reflections to memory (not updating weights)
Key difference from standard RL: policy improvement happens at inference time, not training time. This means Reflexion requires zero training infrastructure and can be deployed immediately.
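The contrast can be written side by side. The notation is a simplification chosen for this chapter rather than the paper's own formalism: $\theta$ denotes the model weights, $m_t$ the reflection memory before trial $t$, $r_t$ the reflection generated after trial $t$, and the gradient step stands in for any weight-updating RL method.

```latex
\begin{aligned}
\text{Standard RL:}\quad & a_t \sim \pi_{\theta_t}(\cdot \mid s_t), & \theta_{t+1} &= \theta_t + \alpha \nabla_\theta J(\theta_t) \\
\text{Reflexion:}\quad & a_t \sim \pi_{\theta}(\cdot \mid s_t, m_t), & m_{t+1} &= m_t \cup \{ r_t \}
\end{aligned}
```

In the second line $\theta$ carries no trial index: the weights stay fixed, and all change in behavior comes from the growing memory $m_t$ in the conditioning context.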
Summary
This chapter provided an in-depth exploration of Reflexion's theory and practice:
- Verbal RL: Replaces numerical rewards with verbal reflections. No weight updates are needed, and the only added cost is extra inference calls.
- Three-phase loop: Trial → Evaluate → Reflect → Retry forms a self-improving closed loop.
- Code implementation: Complete Actor-Evaluator-Reflector three-component architecture supporting custom evaluators and unit test validation.
- Best scenarios: Code generation and mathematical reasoning are optimal because they have clear success/failure criteria.
- Limitations: Token cost, evaluator quality, and model capability ceiling are the three main constraints.
Review Questions
- If the evaluator itself is unreliable (20% misjudgment rate), how does it affect Reflexion's performance? How would you design a more robust evaluation mechanism?
- Reflexion stores reflections in context, which grows with each failure. How do you design a reflection compression strategy?
- Can Reflexion memory be shared across different tasks? For example, can "lessons from task A help solve task B"?
- What are the respective advantages of Reflexion and Fine-tuning? When should you switch from Reflexion to Fine-tuning?