Chapter 53

Reflexion: Self-Improvement Through Failure

Introduction

ReAct and Plan-and-Execute both assume the Agent can complete a task in one attempt or a small number of attempts. In practice, some tasks (especially code generation, mathematical reasoning, and complex writing) are hard to get right on the first try, even with an excellent LLM. Reflexion offers an elegant answer: let the Agent learn from its failures the way a human expert does, improving through self-reflection until the task is complete or the trial budget runs out. This chapter covers the principles of the Reflexion paper, its implementation in Hermes, and complete code examples.


53.1 Reflexion's Core: Verbal Reinforcement Learning

What Is Verbal Reinforcement Learning

In 2023, Noah Shinn et al. published the Reflexion paper, which proposes a "learning" mechanism that requires no model weight updates: it replaces the numerical reward signals of traditional RL with verbal feedback (reflections).

Problems with traditional RL:

Reflexion's innovations:

flowchart TD
    subgraph Trial["Single Trial"]
        T1[Task] --> T2[Agent Executes]
        T2 --> T3{Task Succeeded?}
    end
    
    subgraph Reflect["Failure Reflection"]
        T3 -->|Failed| R1[Evaluator\nAssesses failure]
        R1 --> R2[Self-Reflection\nGenerates reflection text]
        R2 --> R3[Store in Memory]
    end
    
    subgraph Retry["Next Attempt"]
        R3 --> N1[New Agent instance]
        N1 --> N2[Load prior reflections]
        N2 --> N3[Execute task\navoiding known mistakes]
        N3 --> T3
    end
    
    T3 -->|Succeeded| DONE[Done]

Comparison with Traditional Methods

| Method | Learning Mechanism | Needs Labeled Data | Explains Failures | Compute Cost |
|---|---|---|---|---|
| Fine-tuning | Gradient descent | Yes (lots) | No | Very High |
| RLHF | Human feedback + RL | Yes (manual) | No | High |
| Reflexion | Verbal reflection (zero weight update) | No | Yes | Low (inference only) |
| RAG | External knowledge retrieval | Yes (knowledge base) | Partial | Medium |

53.2 The Fail → Reflect → Retry Loop

Phase 1: Trial Execution

The Agent executes the task using current strategy (plus any prior reflections). Execution can use ReAct or any other Agent mode.

Phase 2: Evaluation and Reflection

After execution, the Evaluator judges whether the result meets requirements. If not, Self-Reflection triggers:

Phase 3: Memory Update

The reflection is stored in external memory and injected into context on the next attempt.

# reflexion_memory.py
from dataclasses import dataclass, field
from typing import Optional
import time

@dataclass
class TrialRecord:
    trial_number: int
    task: str
    attempt: str
    result: str
    evaluation: str
    reflection: str
    success: bool
    timestamp: float = field(default_factory=time.time)
    
    def to_context_string(self) -> str:
        return f"""
[Trial {self.trial_number} - {'SUCCESS' if self.success else 'FAILED'}]
Output: {self.result[:300]}
Evaluation: {self.evaluation}
Reflection: {self.reflection}
"""


class ReflexionMemory:
    def __init__(self, max_trials: int = 5):
        self.trials: list = []
        self.max_trials = max_trials
    
    def add_trial(self, record: TrialRecord) -> None:
        self.trials.append(record)
    
    def get_reflections_context(self) -> str:
        failed = [t for t in self.trials if not t.success]
        if not failed:
            return ""
        header = f"## Prior Failure Records ({len(failed)} total)\nLearn from these failures:\n\n"
        records = "\n---\n".join([t.to_context_string() for t in failed])
        footer = "\n## Improvement Guidance\nBased on the above, actively avoid identified error patterns."
        return header + records + footer
    
    def can_retry(self) -> bool:
        return len(self.trials) < self.max_trials

53.3 Complete Reflexion Implementation

# reflexion_agent.py
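import asyncio
from typing import Optional, Callable
from openai import AsyncOpenAI

# ReflexionMemory and TrialRecord are defined in reflexion_memory.py (section 53.2)
from reflexion_memory import ReflexionMemory, TrialRecord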

ACTOR_SYSTEM = """You are an AI Agent that completes tasks carefully.

{reflection_context}

**Current Task**: Based on the failure records above (if any), improve your approach and complete the task. Avoid repeating known mistakes."""

EVALUATOR_SYSTEM = """You are a strict task evaluation expert.
Assess whether the Agent's output completely and correctly fulfills the task.

Output format:
- First: PASS or FAIL
- Then: evaluation rationale (1-3 sentences)
- If FAIL: identify the most critical problem"""

REFLECTION_SYSTEM = """You are a self-improvement expert.
Given a task, a failed attempt, and evaluation feedback, generate insightful reflection.

Requirements:
1. Analyze root cause (not surface symptoms)
2. Identify specific error patterns
3. Provide 2-3 actionable improvement suggestions
4. Be concise (3-5 sentences), focus on the most important issue"""


class Evaluator:
    def __init__(self, client: AsyncOpenAI, model: str, custom_evaluator: Optional[Callable] = None):
        self.client = client
        self.model = model
        self.custom_evaluator = custom_evaluator
    
    async def evaluate(self, task: str, output: str) -> tuple:
        if self.custom_evaluator:
            try:
                passed = self.custom_evaluator(output)
                return passed, "Custom evaluator passed" if passed else "Custom evaluator failed"
            except Exception as e:
                return False, f"Evaluator exception: {str(e)}"
        
        response = await self.client.chat.completions.create(
            model=self.model,
            messages=[
                {"role": "system", "content": EVALUATOR_SYSTEM},
                {"role": "user", "content": f"Task: {task}\n\nAgent output:\n{output}"}
            ],
            temperature=0.1,
            max_tokens=512
        )
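        # content can be None on an empty completion, so fall back to ""
        evaluation = response.choices[0].message.content or ""
        # Read the verdict from the start of the reply; require PASS without FAIL
        # so phrases like "FAIL: tests did not PASS" are not counted as a pass.
        head = evaluation[:50].upper()
        passed = "PASS" in head and "FAIL" not in head
        return passed, evaluation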


class SelfReflector:
    def __init__(self, client: AsyncOpenAI, model: str):
        self.client = client
        self.model = model
    
    async def reflect(self, task: str, attempt: str, result: str, evaluation: str) -> str:
        response = await self.client.chat.completions.create(
            model=self.model,
            messages=[
                {"role": "system", "content": REFLECTION_SYSTEM},
                {"role": "user", "content": f"""Task: {task}

My attempt summary: {attempt[:500]}
Actual output: {result[:500]}
Evaluation feedback: {evaluation}

Generate insightful self-reflection to help me avoid the same mistakes next time."""}
            ],
            temperature=0.3,
            max_tokens=512
        )
        # content can be None; callers slice the reflection, so return a string
        return response.choices[0].message.content or ""


class ReflexionAgent:
    """
    Reflexion Agent: continuously improves through fail-reflect-retry loops.
    """
    
    def __init__(
        self,
        model: str = "NousResearch/Hermes-3-Llama-3.1-8B",
        base_url: str = "http://localhost:8000/v1",
        api_key: str = "not-needed",
        max_trials: int = 5,
        custom_evaluator: Optional[Callable] = None
    ):
        self.client = AsyncOpenAI(base_url=base_url, api_key=api_key)
        self.model = model
        self.max_trials = max_trials
        self.memory = ReflexionMemory(max_trials=max_trials)
        self.evaluator = Evaluator(self.client, model, custom_evaluator)
        self.reflector = SelfReflector(self.client, model)
    
    async def _attempt_task(self, task: str, reflection_context: str) -> tuple:
        system = ACTOR_SYSTEM.format(reflection_context=reflection_context)
        response = await self.client.chat.completions.create(
            model=self.model,
            messages=[
                {"role": "system", "content": system},
                {"role": "user", "content": task}
            ],
            temperature=0.3,
            max_tokens=2048
        )
        output = response.choices[0].message.content or ""
        return f"Generated {len(output)} chars", output
    
    async def run(self, task: str) -> dict:
        print(f"\n{'='*60}\nReflexion Agent\nTask: {task[:80]}\nMax trials: {self.max_trials}\n{'='*60}")
        
        while self.memory.can_retry():
            trial_num = len(self.memory.trials) + 1
            print(f"\n--- Trial {trial_num} ---")
            
            reflection_context = self.memory.get_reflections_context()
            attempt_summary, output = await self._attempt_task(task, reflection_context)
            print(f"[Actor] Output ready ({len(output)} chars)")
            
            passed, evaluation = await self.evaluator.evaluate(task, output)
            print(f"[Evaluator] {'PASS' if passed else 'FAIL'}: {evaluation[:100]}")
            
            from reflexion_memory import TrialRecord
            
            if passed:
                record = TrialRecord(trial_num, task, attempt_summary, output, evaluation, "", True)
                self.memory.add_trial(record)
                print(f"\n[Reflexion] Trial {trial_num} succeeded!")
                return {
                    "success": True,
                    "answer": output,
                    "trials": trial_num,
                    "reflections": [t.reflection for t in self.memory.trials if t.reflection]
                }
            else:
                reflection = await self.reflector.reflect(task, attempt_summary, output, evaluation)
                print(f"[Reflector] {reflection[:150]}...")
                record = TrialRecord(trial_num, task, attempt_summary, output, evaluation, reflection, False)
                self.memory.add_trial(record)
        
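        # No trial passed. Return the most recent attempt, which had every earlier
        # reflection available; output length is not a signal of quality.
        last = self.memory.trials[-1] if self.memory.trials else None
        return {
            "success": False,
            "answer": last.result if last else "",
            "trials": len(self.memory.trials),
            "reflections": [t.reflection for t in self.memory.trials if t.reflection],
            "error": f"Task incomplete after {self.max_trials} trials"
        }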


# Code generation evaluator using unit tests
def make_code_evaluator(test_cases: list) -> Callable:
    def evaluate(code_output: str) -> bool:
        import re
        code_match = re.search(r'```python\n(.*?)```', code_output, re.DOTALL)
        code = code_match.group(1) if code_match else code_output
        
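        namespace = {}
        try:
            # WARNING: exec runs model-generated code inside this process.
            # Acceptable for a local demo; use a sandbox or a separate process otherwise.
            exec(code, namespace)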
        except Exception as e:
            print(f"  Execution error: {e}")
            return False
        
        func = namespace.get('solution') or namespace.get('solve')
        if not func:
            print("  No solution/solve function found")
            return False
        
        for input_val, expected in test_cases:
            try:
                result = func(input_val)
                if result != expected:
                    print(f"  FAIL: solution({input_val}) = {result}, expected {expected}")
                    return False
                print(f"  PASS: solution({input_val}) = {result}")
            except Exception as e:
                print(f"  Test exception: {e}")
                return False
        return True
    return evaluate


async def main():
    # Expected values are sums of the positive numbers only: 10 + 7 = 17, not 14
    test_cases = [([1, 2, 3, 4, 5], 15), ([10, -3, 7], 17), ([], 0)]
    
    agent = ReflexionAgent(
        max_trials=4,
        custom_evaluator=make_code_evaluator(test_cases)
    )
    
    result = await agent.run(
        "Write a Python function `solution(nums: list) -> int` that returns "
        "the sum of all positive integers in the list (ignoring negatives and zero)."
    )
    
    print(f"\nResult: {'Success' if result['success'] else 'Incomplete'}")
    print(f"Trials: {result['trials']}")
    print(f"Final answer:\n{result['answer']}")


if __name__ == "__main__":
    asyncio.run(main())
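
The unit-test evaluator above calls exec on model output in the same process as the agent. As a hedged alternative, the sketch below runs the generated code and its tests in a child Python process with a timeout, so an infinite loop or a crash in the candidate code cannot take the agent down. The names `RUNNER` and `run_tests_isolated` are hypothetical and not part of the chapter's code, and a child process with a timeout is still not a security sandbox.

```python
import json
import subprocess
import sys

# Script executed in the child process. It reads the candidate code and the
# test cases as JSON on stdin and prints a JSON boolean as its last line.
RUNNER = """
import json, sys
payload = json.loads(sys.stdin.read())
namespace = {}
exec(payload["code"], namespace)
func = namespace.get("solution") or namespace.get("solve")
if func is None:
    sys.exit(1)
print(json.dumps(all(func(inp) == exp for inp, exp in payload["tests"])))
"""


def run_tests_isolated(code: str, test_cases: list, timeout_s: float = 5.0) -> bool:
    """Return True only if every test passes inside a separate interpreter."""
    payload = json.dumps({"code": code, "tests": test_cases})
    try:
        proc = subprocess.run(
            [sys.executable, "-c", RUNNER],
            input=payload,
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False  # candidate code hung, e.g. an infinite loop
    if proc.returncode != 0:
        return False  # syntax error, exception, or missing function
    lines = proc.stdout.strip().splitlines()
    if not lines:
        return False
    try:
        # The candidate may print its own output; the verdict is the last line.
        return json.loads(lines[-1]) is True
    except json.JSONDecodeError:
        return False
```

Because test cases travel as JSON, inputs and expected values must be JSON-serializable (tuples arrive as lists). To plug this into ReflexionAgent, the regex extraction from make_code_evaluator would still be needed before calling `run_tests_isolated`.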

53.4 Applicable Scenarios

Best Fits for Reflexion

| Scenario | Why It Fits | Evaluator Type |
|---|---|---|
| Code generation | Clear unit tests; success/failure is unambiguous | Unit tests (automated) |
| Mathematical reasoning | Answer has right/wrong distinction | Mathematical verifier |
| Structured data extraction | Output format has strict requirements | Schema validator |
| Logic puzzles | Exactly one correct answer | Rule checker |
| SQL generation | Verifiable by execution result | Database validator |
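
For the structured data extraction row, a schema-style check can be passed to the agent as a custom evaluator, since `custom_evaluator` only needs a callable that takes the output string and returns a bool. The sketch below is one minimal way to do that with the standard library; `make_json_schema_evaluator` and its `required_fields` argument are hypothetical names, and a real project might prefer a full JSON Schema validator.

```python
import json
from typing import Callable


def make_json_schema_evaluator(required_fields: dict) -> Callable[[str], bool]:
    """Build an evaluator that checks for a JSON object with typed fields.

    required_fields maps field name -> expected Python type,
    for example {"name": str, "age": int}.
    """
    def evaluate(output: str) -> bool:
        try:
            data = json.loads(output)
        except json.JSONDecodeError:
            return False  # not valid JSON at all
        if not isinstance(data, dict):
            return False  # valid JSON, but not an object
        return all(
            name in data and isinstance(data[name], expected_type)
            for name, expected_type in required_fields.items()
        )
    return evaluate


evaluator = make_json_schema_evaluator({"name": str, "age": int})
print(evaluator('{"name": "Ada", "age": 36}'))  # True
print(evaluator('{"name": "Ada"}'))             # False: "age" is missing
```

Like the unit-test evaluator, this gives an unambiguous pass/fail signal, which is the property that makes a task a good fit for Reflexion.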

Poor Fits

Reflexion's Limitations

  1. Token cost: Each failure requires additional reflection calls; costs scale with trials
  2. Evaluator quality is critical: LLM evaluators themselves may misjudge
  3. Context window pressure: Many failures mean many reflections competing for context space
  4. Cannot exceed model capability: If the task exceeds model ability, reflection cannot compensate
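
Limitation 3 can be softened by capping how much reflection text is injected per attempt. The sketch below is one simple policy, assumed for illustration rather than taken from the Reflexion paper: keep only the most recent reflections and stop at a character budget. `build_reflection_context`, `max_items`, and `max_chars` are hypothetical names.

```python
def build_reflection_context(
    reflections: list,
    max_items: int = 3,
    max_chars: int = 1200,
) -> str:
    """Join the newest reflections, oldest first, within a character budget."""
    kept = []
    used = 0
    # Walk from newest to oldest so the latest lessons get budget first.
    for text in reversed(reflections[-max_items:]):
        remaining = max_chars - used
        if remaining <= 0:
            break
        snippet = text[:remaining]  # truncate the entry that crosses the budget
        kept.append(snippet)
        used += len(snippet)
    kept.reverse()  # restore chronological order for the prompt
    return "\n---\n".join(kept)


history = ["a" * 10, "b" * 10, "c" * 10, "d" * 10]
print(build_reflection_context(history, max_items=2, max_chars=15))
```

With these arguments the two oldest entries are dropped, the newest is kept whole, and the one before it is cut to the five characters that still fit. Character counts are only a rough proxy for tokens; a tokenizer-based budget would be more precise.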

53.5 Relationship to Reinforcement Learning

Reflexion can be understood as "episodic" reinforcement learning:

Key difference from standard RL: policy improvement happens at inference time, not training time. This means Reflexion requires zero training infrastructure and can be deployed immediately.
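
To make the inference-time claim concrete, here is a network-free toy loop. It is a sketch under stated assumptions: the actor, evaluator, and reflector are hand-written stubs standing in for LLM calls, and all names (`reflexion_loop`, `stub_actor`, `stub_evaluator`, `stub_reflector`) are hypothetical. The only thing that changes between trials is the list of reflections; no parameters are updated.

```python
def reflexion_loop(actor, evaluator, reflector, max_trials: int = 3) -> dict:
    """Run trials until the evaluator passes or the trial budget is spent."""
    reflections = []
    output = None
    for trial in range(1, max_trials + 1):
        output = actor(reflections)              # behavior depends only on text memory
        passed, feedback = evaluator(output)     # plays the role of the reward signal
        if passed:
            return {"success": True, "trials": trial, "reflections": reflections}
        reflections.append(reflector(feedback))  # the "policy update" is a string
    return {"success": False, "trials": max_trials, "reflections": reflections}


def stub_actor(reflections):
    # Stand-in for a model: it repeats a bug until a reflection mentions negatives.
    if any("negative" in r for r in reflections):
        return lambda nums: sum(n for n in nums if n > 0)
    return lambda nums: sum(nums)


def stub_evaluator(func):
    ok = func([10, -3, 7]) == 17
    return ok, "ok" if ok else "expected 17 for [10, -3, 7]"


def stub_reflector(feedback):
    return f"I added negative numbers to the total; {feedback}."


result = reflexion_loop(stub_actor, stub_evaluator, stub_reflector)
print(result["success"], result["trials"])  # True 2
```

The first trial fails, the reflection is stored, and the second trial succeeds because the actor reads that reflection. That is the whole "learning" step, and it happens without any training infrastructure.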


Summary

This chapter provided an in-depth exploration of Reflexion's theory and practice:

  1. Verbal RL: Replaces numerical rewards with verbal reflections; no weight updates are needed, and the only cost is extra inference calls.
  2. Three-phase loop: Trial → Evaluate and Reflect → Memory Update, then Retry, forming a self-improving closed loop.
  3. Code implementation: Complete Actor-Evaluator-Reflector three-component architecture supporting custom evaluators and unit test validation.
  4. Best scenarios: Code generation and mathematical reasoning are optimal because they have clear success/failure criteria.
  5. Limitations: Token cost, evaluator quality, and model capability ceiling are the three main constraints.

Review Questions

  1. If the evaluator itself is unreliable (20% misjudgment rate), how does it affect Reflexion's performance? How would you design a more robust evaluation mechanism?
  2. Reflexion stores reflections in context, which grows with each failure. How do you design a reflection compression strategy?
  3. Can Reflexion memory be shared across different tasks? For example, can "lessons from task A help solve task B"?
  4. What are the respective advantages of Reflexion and Fine-tuning? When should you switch from Reflexion to Fine-tuning?