Chapter 53

Reflexion: Self-Improvement Through Failure

Introduction

ReAct and Plan-and-Execute both assume the Agent can complete a task in one attempt or a small number of attempts. In practice, some tasks (especially code generation, mathematical reasoning, and complex writing) are hard to get right on the first try, even with an excellent LLM. Reflexion offers an elegant answer: let the Agent learn from failure the way a human expert does, improving through self-reflection until the task is complete or the retry budget runs out. This chapter covers the principles of the Reflexion paper, walks through an implementation with Hermes, and provides complete code examples.

53.1 Reflexion's Core: Verbal Reinforcement Learning

What Is Verbal Reinforcement Learning

In 2023, Noah Shinn et al. published the Reflexion paper ("Reflexion: Language Agents with Verbal Reinforcement Learning"), proposing a "learning" mechanism that requires no model weight updates: the numerical reward signals of traditional RL are replaced with verbal feedback (reflections).

Problems with traditional RL:

Requires many samples to converge
Reward function design is difficult
Cannot directly explain why something failed

Reflexion's innovations:

Converts reward signals into verbal reflections
Stores reflections in external episodic memory
On the next attempt, reflections are injected as context to guide improvement

flowchart TD
    subgraph Trial["Single Trial"]
        T1[Task] --> T2[Agent Executes]
        T2 --> T3{Task Succeeded?}
    end
    
    subgraph Reflect["Failure Reflection"]
        T3 -->|Failed| R1[Evaluator\nAssesses failure]
        R1 --> R2[Self-Reflection\nGenerates reflection text]
        R2 --> R3[Store in Memory]
    end
    
    subgraph Retry["Next Attempt"]
        R3 --> N1[New Agent instance]
        N1 --> N2[Load prior reflections]
        N2 --> N3[Execute task\navoiding known mistakes]
        N3 --> T3
    end
    
    T3 -->|Succeeded| DONE[Done]

Comparison with Traditional Methods

Method	Learning Mechanism	Needs Labeled Data	Explains Failures	Compute Cost
Fine-tuning	Gradient descent	Yes (lots)	No	Very High
RLHF	Human feedback + RL	Yes (manual)	No	High
Reflexion	Verbal reflection (zero weight update)	No	Yes	Low (inference only)
RAG	External knowledge retrieval	No (needs a knowledge base)	Partial	Medium

53.2 The Fail → Reflect → Retry Loop

Phase 1: Trial Execution

The Agent executes the task using its current strategy, plus any reflections from prior attempts. Execution can use ReAct or any other Agent mode.

Phase 2: Evaluation and Reflection

After execution, the Evaluator judges whether the result meets requirements. If not, Self-Reflection triggers:

Analyze root cause (logical error? insufficient information? misunderstood requirements?)
Summarize lessons from this attempt
Propose specific, actionable improvements for next try

Phase 3: Memory Update

The reflection is stored in external memory and injected into context on the next attempt.

# reflexion_memory.py
from dataclasses import dataclass, field
from typing import Optional
import time

@dataclass
class TrialRecord:
    trial_number: int
    task: str
    attempt: str
    result: str
    evaluation: str
    reflection: str
    success: bool
    timestamp: float = field(default_factory=time.time)
    
    def to_context_string(self) -> str:
        return f"""
[Trial {self.trial_number} - {'SUCCESS' if self.success else 'FAILED'}]
Output: {self.result[:300]}
Evaluation: {self.evaluation}
Reflection: {self.reflection}
"""


class ReflexionMemory:
    def __init__(self, max_trials: int = 5):
        self.trials: list = []
        self.max_trials = max_trials
    
    def add_trial(self, record: TrialRecord) -> None:
        self.trials.append(record)
    
    def get_reflections_context(self) -> str:
        failed = [t for t in self.trials if not t.success]
        if not failed:
            return ""
        header = f"## Prior Failure Records ({len(failed)} total)\nLearn from these failures:\n\n"
        records = "\n---\n".join([t.to_context_string() for t in failed])
        footer = "\n## Improvement Guidance\nBased on the above, actively avoid identified error patterns."
        return header + records + footer
    
    def can_retry(self) -> bool:
        return len(self.trials) < self.max_trials

53.3 Complete Reflexion Implementation

# reflexion_agent.py
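import asyncio
from typing import Optional, Callable
from openai import AsyncOpenAI

# ReflexionMemory and TrialRecord are defined in reflexion_memory.py (section 53.2)
from reflexion_memory import ReflexionMemory, TrialRecord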

ACTOR_SYSTEM = """You are an AI Agent that completes tasks carefully.

{reflection_context}

**Current Task**: Based on the failure records above (if any), improve your approach and complete the task. Avoid repeating known mistakes."""

EVALUATOR_SYSTEM = """You are a strict task evaluation expert.
Assess whether the Agent's output completely and correctly fulfills the task.

Output format:
- First: PASS or FAIL
- Then: evaluation rationale (1-3 sentences)
- If FAIL: identify the most critical problem"""

REFLECTION_SYSTEM = """You are a self-improvement expert.
Given a task, a failed attempt, and evaluation feedback, generate insightful reflection.

Requirements:
1. Analyze root cause (not surface symptoms)
2. Identify specific error patterns
3. Provide 2-3 actionable improvement suggestions
4. Be concise (3-5 sentences), focus on the most important issue"""


class Evaluator:
    def __init__(self, client: AsyncOpenAI, model: str, custom_evaluator: Optional[Callable] = None):
        self.client = client
        self.model = model
        self.custom_evaluator = custom_evaluator
    
    async def evaluate(self, task: str, output: str) -> tuple:
        if self.custom_evaluator:
            try:
                passed = self.custom_evaluator(output)
                return passed, "Custom evaluator passed" if passed else "Custom evaluator failed"
            except Exception as e:
                return False, f"Evaluator exception: {str(e)}"
        
        response = await self.client.chat.completions.create(
            model=self.model,
            messages=[
                {"role": "system", "content": EVALUATOR_SYSTEM},
                {"role": "user", "content": f"Task: {task}\n\nAgent output:\n{output}"}
            ],
            temperature=0.1,
            max_tokens=512
        )
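        evaluation = response.choices[0].message.content or ""
        # Decide by whichever verdict keyword appears first, so that a reply such as
        # "FAIL: the tests do not pass" is not mistaken for a PASS.
        head = evaluation[:50].upper()
        pass_pos, fail_pos = head.find("PASS"), head.find("FAIL")
        passed = pass_pos != -1 and (fail_pos == -1 or pass_pos < fail_pos)
        return passed, evaluation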


class SelfReflector:
    def __init__(self, client: AsyncOpenAI, model: str):
        self.client = client
        self.model = model
    
    async def reflect(self, task: str, attempt: str, result: str, evaluation: str) -> str:
        response = await self.client.chat.completions.create(
            model=self.model,
            messages=[
                {"role": "system", "content": REFLECTION_SYSTEM},
                {"role": "user", "content": f"""Task: {task}

My attempt summary: {attempt[:500]}
Actual output: {result[:500]}
Evaluation feedback: {evaluation}

Generate insightful self-reflection to help me avoid the same mistakes next time."""}
            ],
            temperature=0.3,
            max_tokens=512
        )
        return response.choices[0].message.content


class ReflexionAgent:
    """
    Reflexion Agent: continuously improves through fail-reflect-retry loops.
    """
    
    def __init__(
        self,
        model: str = "NousResearch/Hermes-3-Llama-3.1-8B",
        base_url: str = "http://localhost:8000/v1",
        api_key: str = "not-needed",
        max_trials: int = 5,
        custom_evaluator: Optional[Callable] = None
    ):
        self.client = AsyncOpenAI(base_url=base_url, api_key=api_key)
        self.model = model
        self.max_trials = max_trials
        self.memory = ReflexionMemory(max_trials=max_trials)
        self.evaluator = Evaluator(self.client, model, custom_evaluator)
        self.reflector = SelfReflector(self.client, model)
    
    async def _attempt_task(self, task: str, reflection_context: str) -> tuple:
        system = ACTOR_SYSTEM.format(reflection_context=reflection_context)
        response = await self.client.chat.completions.create(
            model=self.model,
            messages=[
                {"role": "system", "content": system},
                {"role": "user", "content": task}
            ],
            temperature=0.3,
            max_tokens=2048
        )
        output = response.choices[0].message.content or ""
        return f"Generated {len(output)} chars", output
    
    async def run(self, task: str) -> dict:
        print(f"\n{'='*60}\nReflexion Agent\nTask: {task[:80]}\nMax trials: {self.max_trials}\n{'='*60}")
        
        while self.memory.can_retry():
            trial_num = len(self.memory.trials) + 1
            print(f"\n--- Trial {trial_num} ---")
            
            reflection_context = self.memory.get_reflections_context()
            attempt_summary, output = await self._attempt_task(task, reflection_context)
            print(f"[Actor] Output ready ({len(output)} chars)")
            
            passed, evaluation = await self.evaluator.evaluate(task, output)
            print(f"[Evaluator] {'PASS' if passed else 'FAIL'}: {evaluation[:100]}")
            
            from reflexion_memory import TrialRecord
            
            if passed:
                record = TrialRecord(trial_num, task, attempt_summary, output, evaluation, "", True)
                self.memory.add_trial(record)
                print(f"\n[Reflexion] Trial {trial_num} succeeded!")
                return {
                    "success": True,
                    "answer": output,
                    "trials": trial_num,
                    "reflections": [t.reflection for t in self.memory.trials if t.reflection]
                }
            else:
                reflection = await self.reflector.reflect(task, attempt_summary, output, evaluation)
                print(f"[Reflector] {reflection[:150]}...")
                record = TrialRecord(trial_num, task, attempt_summary, output, evaluation, reflection, False)
                self.memory.add_trial(record)
        
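        # No trial passed. Output length is only a crude proxy for quality; it is used
        # here as a simple fallback for choosing which failed attempt to return.
        best = max(self.memory.trials, key=lambda t: len(t.result), default=None)
        return {
            "success": False,
            "answer": best.result if best else "",
            "trials": len(self.memory.trials),
            "reflections": [t.reflection for t in self.memory.trials if t.reflection],
            "error": f"Task incomplete after {self.max_trials} trials"
        }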


# Code generation evaluator using unit tests
def make_code_evaluator(test_cases: list) -> Callable:
    def evaluate(code_output: str) -> bool:
        import re
        code_match = re.search(r'```python\n(.*?)```', code_output, re.DOTALL)
        code = code_match.group(1) if code_match else code_output
        
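        # WARNING: exec runs model-generated code inside this process with no isolation.
        # Use it only in a disposable environment.
        namespace = {}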
        try:
            exec(code, namespace)
        except Exception as e:
            print(f"  Execution error: {e}")
            return False
        
        func = namespace.get('solution') or namespace.get('solve')
        if not func:
            print("  No solution/solve function found")
            return False
        
        for input_val, expected in test_cases:
            try:
                result = func(input_val)
                if result != expected:
                    print(f"  FAIL: solution({input_val}) = {result}, expected {expected}")
                    return False
                print(f"  PASS: solution({input_val}) = {result}")
            except Exception as e:
                print(f"  Test exception: {e}")
                return False
        return True
    return evaluate


async def main():
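    # Expected values follow the task statement: only positive numbers are summed,
    # so [10, -3, 7] gives 10 + 7 = 17 (a plain sum would give 14).
    test_cases = [([1, 2, 3, 4, 5], 15), ([10, -3, 7], 17), ([], 0)]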
    
    agent = ReflexionAgent(
        max_trials=4,
        custom_evaluator=make_code_evaluator(test_cases)
    )
    
    result = await agent.run(
        "Write a Python function `solution(nums: list) -> int` that returns "
        "the sum of all positive integers in the list (ignoring negatives and zero)."
    )
    
    print(f"\nResult: {'Success' if result['success'] else 'Incomplete'}")
    print(f"Trials: {result['trials']}")
    print(f"Final answer:\n{result['answer']}")


if __name__ == "__main__":
    asyncio.run(main())
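
The `make_code_evaluator` helper above calls `exec` in the agent's own process, so a generated infinite loop or destructive call affects the agent itself. One possible hardening, sketched here under assumptions, runs the candidate in a separate Python interpreter with a timeout. The name `make_subprocess_evaluator`, the 5-second default, and the exit-code protocol are illustrative choices, not part of the Reflexion paper or Hermes. Test cases must be JSON-serializable, and a subprocess limits hangs but is not a security sandbox.

```python
import json
import os
import re
import subprocess
import sys
import tempfile
from typing import Callable

FENCE = "`" * 3  # Markdown code fence, built this way to keep the listing readable

# Harness appended to the candidate code; it runs in the child interpreter.
# Exit code 0 = all tests passed, 1 = wrong answer or exception, 2 = no function found.
_HARNESS = """
import json, sys
_cases = json.loads(sys.argv[1])
_func = globals().get("solution") or globals().get("solve")
if _func is None:
    sys.exit(2)
for _arg, _expected in _cases:
    if _func(_arg) != _expected:
        sys.exit(1)
sys.exit(0)
"""


def make_subprocess_evaluator(test_cases: list, timeout_s: float = 5.0) -> Callable[[str], bool]:
    def evaluate(code_output: str) -> bool:
        # Extract a fenced Python block if present, else treat the whole output as code.
        match = re.search(FENCE + r"python\n(.*?)" + FENCE, code_output, re.DOTALL)
        code = match.group(1) if match else code_output

        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code + "\n" + _HARNESS)
            path = f.name
        try:
            proc = subprocess.run(
                [sys.executable, path, json.dumps(test_cases)],
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False  # a hang counts as a failed trial
        finally:
            os.unlink(path)
        return proc.returncode == 0

    return evaluate
```

The returned function takes a string and returns a `bool`, so it can be passed as `custom_evaluator` in place of `make_code_evaluator(test_cases)`.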

53.4 Applicable Scenarios

Best Fits for Reflexion

Scenario	Why It Fits	Evaluator Type
Code generation	Clear unit tests; success/failure is unambiguous	Unit tests (automated)
Mathematical reasoning	Answer has right/wrong distinction	Mathematical verifier
Structured data extraction	Output format has strict requirements	Schema validator
Logic puzzles	Exactly one correct answer	Rule checker
SQL generation	Verifiable by execution result	Database validator
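
For the structured data extraction row, an evaluator can be a plain function that parses the output and checks required fields. The sketch below uses only the standard library; `make_schema_evaluator` and the field names are hypothetical, and a real project might prefer a dedicated schema library. It returns a `bool`, which matches the shape of the `custom_evaluator` callable used in section 53.3.

```python
import json
from typing import Callable


def make_schema_evaluator(required: dict) -> Callable[[str], bool]:
    """Build an evaluator that accepts only a JSON object with typed fields.

    `required` maps a field name to its expected Python type, e.g. {"name": str}.
    """
    def evaluate(output: str) -> bool:
        try:
            data = json.loads(output)
        except json.JSONDecodeError:
            return False
        if not isinstance(data, dict):
            return False
        for key, expected_type in required.items():
            if key not in data:
                return False
            value = data[key]
            # bool is a subclass of int in Python; reject it for non-bool fields.
            if isinstance(value, bool) and expected_type is not bool:
                return False
            if not isinstance(value, expected_type):
                return False
        return True

    return evaluate


evaluator = make_schema_evaluator({"name": str, "age": int})
print(evaluator('{"name": "Ada", "age": 36}'))    # True
print(evaluator('{"name": "Ada", "age": "36"}'))  # False: age is a string
```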

Poor Fits

Creative writing (lacks objective evaluation criteria)
Open-ended Q&A (answers are diverse, "success" hard to define)
Simple single-step tasks (reflection overhead exceeds benefit)

Reflexion's Limitations

Token cost: Each failure requires additional reflection calls; costs scale with trials
Evaluator quality is critical: LLM evaluators themselves may misjudge
Context window pressure: Many failures mean many reflections competing for context space
Cannot exceed model capability: If the task exceeds model ability, reflection cannot compensate
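
One way to ease the context window pressure listed above is to cap what gets injected: keep only the most recent reflections and stop once a size budget is reached. The function below is a minimal sketch; `select_reflections`, `max_items`, and `max_chars` are hypothetical names and defaults, and character count is only a rough stand-in for tokens.

```python
def select_reflections(
    reflections: list[str], max_items: int = 3, max_chars: int = 1200
) -> list[str]:
    """Return the newest reflections that fit the budget, in chronological order."""
    selected: list[str] = []
    used = 0
    # Walk from newest to oldest so recent lessons win when space runs out.
    for text in reversed(reflections):
        if len(selected) >= max_items or used + len(text) > max_chars:
            break
        selected.append(text)
        used += len(text)
    selected.reverse()  # restore oldest-to-newest order for the prompt
    return selected


history = ["a" * 500, "b" * 500, "c" * 500, "d" * 500]
kept = select_reflections(history)
print(len(kept))  # 2: only the two newest fit within 1200 characters
```

Dropping older reflections loses information; summarizing them with another model call is an alternative, at the cost of extra tokens.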

53.5 Relationship to Reinforcement Learning

Reflexion can be understood as "episodic" reinforcement learning:

State: Current task + episodic memory of reflections
Action: Agent's output
Reward: Evaluator's PASS/FAIL signal
Policy update: Adding verbal reflections to memory (not updating weights)

Key difference from standard RL: policy improvement happens at inference time, not training time. This means Reflexion requires zero training infrastructure and can be deployed immediately.
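
The mapping above can be made concrete with a toy loop in which no model is called. Everything here is a stand-in: `toy_actor`, `toy_evaluator`, and `toy_reflector` are hypothetical stubs, and the only thing that changes between trials is the list of reflections, which plays the role of the policy update.

```python
from typing import Callable


def toy_actor(task: str, reflections: list[str]) -> Callable[[list[int]], int]:
    # "Policy": behaviour depends only on the task and the stored reflections.
    if any("positive" in r for r in reflections):
        return lambda nums: sum(n for n in nums if n > 0)
    return sum  # naive first attempt: sums everything


def toy_evaluator(action: Callable[[list[int]], int]) -> bool:
    # "Reward": a binary PASS/FAIL signal.
    return action([10, -3, 7]) == 17


def toy_reflector() -> str:
    # "Policy update": a verbal lesson stored in memory; no weights change.
    return "Only positive numbers should be summed; filter out negatives first."


def run_episode(task: str, max_trials: int = 3) -> tuple[bool, int, list[str]]:
    reflections: list[str] = []  # episodic memory, part of the state
    for trial in range(1, max_trials + 1):
        action = toy_actor(task, reflections)
        if toy_evaluator(action):
            return True, trial, reflections
        reflections.append(toy_reflector())
    return False, max_trials, reflections


success, trials, memory = run_episode("sum the positive numbers")
print(success, trials, len(memory))  # True 2 1
```

The first trial fails, one reflection is stored, and the second trial succeeds because the actor's behaviour is conditioned on that reflection.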

Summary

This chapter provided an in-depth exploration of Reflexion's theory and practice:

Verbal RL: Replaces numerical rewards with verbal reflections; no weight updates are needed, and the only cost is additional inference.
Three-phase loop: Trial → Evaluate and Reflect → Update Memory → Retry forms a self-improving closed loop.
Code implementation: Complete Actor-Evaluator-Reflector three-component architecture supporting custom evaluators and unit test validation.
Best scenarios: Code generation and mathematical reasoning are optimal because they have clear success/failure criteria.
Limitations: Token cost, evaluator quality, and model capability ceiling are the three main constraints.

Review Questions

If the evaluator itself is unreliable (20% misjudgment rate), how does it affect Reflexion's performance? How would you design a more robust evaluation mechanism?
Reflexion stores reflections in context, which grows with each failure. How do you design a reflection compression strategy?
Can Reflexion memory be shared across different tasks? For example, can "lessons from task A help solve task B"?
What are the respective advantages of Reflexion and Fine-tuning? When should you switch from Reflexion to Fine-tuning?
