Chapter 15

Three-Tier Memory Architecture: Working, Episodic, and Semantic Memory

Memory is the foundation of intelligence. An AI agent without memory is like a person who wakes up with amnesia every day: unable to accumulate experience, build relationships, or grow. Hermes's three-tier memory architecture is the key design that gives the agent a "temporal dimension."


15.1 Cognitive Science Foundations

15.1.1 From Human Memory to AI Memory

Human memory systems, shaped by millions of years of evolution, have a sophisticated layered structure:

Human Memory Tier | Characteristics                                                            | AI Equivalent
------------------+----------------------------------------------------------------------------+----------------
Working Memory    | Active information within the current attention span; capacity ~7±2 chunks | Context Window
Episodic Memory   | Personal experience sequences: "when, where, what happened"                | Session History
Semantic Memory   | Decontextualized knowledge: "facts, concepts, skills"                      | Skill Library

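These tiers are not arbitrary: each solves a different problem. Working memory holds what the current task needs right now, episodic memory records what happened in past sessions, and semantic memory keeps the reusable knowledge distilled from those experiences.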

15.1.2 Three-Tier Interaction

┌──────────────────────────────────────────────────────────┐
│               Three-Tier Memory Interaction              │
│                                                          │
│  ┌────────────────────────────────────────────────────┐  │
│  │            Working Memory (Context Window)          │  │
│  │     ← Up to 32K tokens →                           │  │
│  │                                                     │  │
│  │  [System Prompt] [MEMORY.md] [Skills] [History]    │  │
│  │                                                     │  │
│  │  ← Sacred Zone →           ← Compressible Zone →   │  │
│  └────────────────────────────────────────────────────┘  │
│              ↑ inject                  ↑ archive         │
│              │                         │                  │
│  ┌───────────┴──────┐    ┌─────────────┴────────┐       │
│  │  Semantic Memory │    │   Episodic Memory     │       │
│  │  (Skill Library) │    │   (Session History)   │       │
│  │                  │    │                       │       │
│  │  Skill names     │    │  Full session records │       │
│  │  Code templates  │    │  Timestamps           │       │
│  │  Usage stats     │    │  Task descriptions    │       │
│  │  Trigger conds   │    │  Key results          │       │
│  │                  │    │                       │       │
│  │  Store: VectorDB │    │  Store: SQLite        │       │
│  │  Retrieve: ANN   │    │  Retrieve: BM25+vec   │       │
│  └──────────────────┘    └───────────────────────┘       │
└──────────────────────────────────────────────────────────┘

15.2 Working Memory

15.2.1 Essence: Context Window Management

Working memory is everything the model can currently "see" — the full content of the context window. Hermes's working memory design centers on one core question: How do you place the most important information within a limited context window?

class WorkingMemory:
    def __init__(self, max_tokens: int = 32768):
        self.max_tokens = max_tokens
        self.slots = {
            "system_prompt": None,       # Fixed position (Sacred Zone)
            "memory_injection": None,    # MEMORY.md + Skill injection
            "conversation_history": [],  # Dialogue history
            "tool_results": []           # Tool execution results
        }
    
    def get_token_budget(self) -> dict:
        """Split the context window into per-slot token allowances (shares sum to 100%)."""
        total = self.max_tokens
        return {
            "system_prompt":    int(total * 0.08),   # ~2.6K tokens
            "memory_injection": int(total * 0.12),   # ~3.9K tokens
            "conversation":     int(total * 0.30),   # ~9.8K tokens
            "tool_results":     int(total * 0.35),   # ~11.5K tokens
            "response_reserve": int(total * 0.15),   # ~4.9K tokens (model output)
        }

15.2.2 Context Window Token Distribution

┌──────────────────────────────────────────────────┐
│        32K Token Context Window Distribution     │
│                                                  │
│  ████ System prompt        (8%)   ≈ 2.6K tokens │
│  ████ Memory injection    (12%)   ≈ 3.9K tokens │
│  ████ Conversation history(30%)   ≈ 9.8K tokens │
│  ████ Tool results        (35%)  ≈ 11.5K tokens │
│  ░░░░ Output reserve      (15%)   ≈ 4.9K tokens │
│                                                  │
│  Effective utilization: ~85%                     │
└──────────────────────────────────────────────────┘
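The percentages in the distribution above can be checked with a few lines of arithmetic. This is a minimal sketch that restates the 15.2.1 budget as a standalone function; the names `BUDGET_SHARES` and `token_budget` are illustrative and not part of Hermes.

```python
# Illustrative restatement of the 15.2.1 budget; the names are assumptions.
BUDGET_SHARES = {
    "system_prompt": 0.08,
    "memory_injection": 0.12,
    "conversation": 0.30,
    "tool_results": 0.35,
    "response_reserve": 0.15,
}


def token_budget(max_tokens: int = 32768) -> dict:
    """Split a context window into per-slot token allowances."""
    return {slot: int(max_tokens * share) for slot, share in BUDGET_SHARES.items()}


budget = token_budget()
# Everything except the output reserve is available for input
input_tokens = sum(v for slot, v in budget.items() if slot != "response_reserve")
print(budget)
print(f"input share: {input_tokens / 32768:.2%}")
```

The "effective utilization" of about 85% is simply the window minus the 15% held back for the model's response.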

15.2.3 Key Strategies

Strategy 1: Prioritize Important Information at the Front

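Research on both human cognition and LLMs shows that content at the beginning and end of a sequence is recalled better than content in the middle ("position bias"; known as the serial-position effect in humans and often described as "lost in the middle" for LLMs). Hermes therefore places critical information at the start: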

# Correct: critical information at the front
context = [
    {"role": "system", "content": f"{SYSTEM_PROMPT}\n\n{INJECTED_SKILLS}"},
    {"role": "user", "content": user_message},
    # ... conversation history ...
]

# Wrong: critical information buried in the middle
context = [
    {"role": "user", "content": user_message},
    # ... very long tool results ...
    {"role": "system", "content": INJECTED_SKILLS},  # Too late — often "forgotten"
]
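One way to enforce this ordering is to assemble the message list in a single helper rather than by hand. The sketch below is an assumption about how that could look: `build_context` is a hypothetical function, and it places the newest user turn last so that both favored positions (start and end) carry the most important content.

```python
from typing import Dict, List

Message = Dict[str, str]


def build_context(system_prompt: str, injected: str,
                  history: List[Message], user_message: str) -> List[Message]:
    """Put the system prompt and injected memory first, the newest user turn last."""
    system_content = system_prompt if not injected else f"{system_prompt}\n\n{injected}"
    return (
        [{"role": "system", "content": system_content}]
        + list(history)                      # older turns sit in the middle
        + [{"role": "user", "content": user_message}]
    )


messages = build_context(
    "You are Hermes.",
    "## Skills\n- csv_data_cleaning",
    [{"role": "user", "content": "hi"}, {"role": "assistant", "content": "hello"}],
    "Clean sales.csv",
)
```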

Strategy 2: Tool Result Compression

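def compress_tool_result(result: str, max_tokens: int = 500) -> str:
    """Shrink a tool result to at most max_tokens, preserving structure where possible."""
    if count_tokens(result) <= max_tokens:
        return result
    
    if is_code_output(result):
        lines = result.split('\n')
        if len(lines) > 50:
            # Keep the first 25 and last 10 lines, but only if that actually fits the budget
            head_tail = '\n'.join(lines[:25]) + '\n... [truncated] ...\n' + '\n'.join(lines[-10:])
            if count_tokens(head_tail) <= max_tokens:
                return head_tail
    
    if is_tabular_data(result):
        return create_data_summary(result)
    
    # Fallback: hard truncation guarantees the budget is respected
    return truncate_to_tokens(result, max_tokens)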
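`compress_tool_result` relies on helpers that the chapter does not show. The sketch below fills in two of them under a loud assumption: tokens are approximated by whitespace-separated words, which a real deployment would replace with the model's tokenizer. `head_tail` is a hypothetical name for the head-and-tail cut used in the code-output branch.

```python
def count_tokens(text: str) -> int:
    """Crude token estimate: whitespace-separated words (assumption, not a real tokenizer)."""
    return len(text.split())


def truncate_to_tokens(text: str, max_tokens: int) -> str:
    """Keep the first max_tokens words and append a truncation marker."""
    words = text.split()
    if len(words) <= max_tokens:
        return text
    return " ".join(words[:max_tokens]) + " ... [truncated]"


def head_tail(text: str, head: int = 25, tail: int = 10) -> str:
    """Keep the first `head` and last `tail` lines and mark the gap."""
    lines = text.split("\n")
    if len(lines) <= head + tail:
        return text
    return "\n".join(lines[:head]) + "\n... [truncated] ...\n" + "\n".join(lines[-tail:])


log = "\n".join(f"line {i}" for i in range(100))
short = head_tail(log)
print(len(short.split("\n")))
```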

15.3 Episodic Memory

15.3.1 Data Model

from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional

@dataclass
class Episode:
    session_id: str
    created_at: datetime
    task_description: str          # Brief task description
    task_category: str             # Classified type
    steps_taken: int
    tools_used: List[str]
    execution_time_seconds: float
    success: bool
    key_result: str                # Key result summary (<200 chars)
    error_encountered: Optional[str]
    error_resolution: Optional[str]
    skills_applied: List[str]
    new_skill_created: Optional[str]
    embedding: Optional[List[float]] = None
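The list-valued fields and the embedding need a flat representation before an `Episode` can be written to a relational store. A minimal sketch, assuming JSON text for string lists and packed 32-bit floats for the embedding; the function names are hypothetical and the encodings are one reasonable choice rather than the documented Hermes format.

```python
import json
from array import array
from typing import List, Optional


def encode_list(values: List[str]) -> str:
    """Store a list of strings as a JSON array in a TEXT column."""
    return json.dumps(values)


def encode_embedding(vector: Optional[List[float]]) -> Optional[bytes]:
    """Pack an embedding as 32-bit floats for a BLOB column."""
    return None if vector is None else array("f", vector).tobytes()


def decode_embedding(blob: Optional[bytes]) -> Optional[List[float]]:
    """Inverse of encode_embedding."""
    if blob is None:
        return None
    values = array("f")
    values.frombytes(blob)
    return values.tolist()


tools_json = encode_list(["shell", "python"])
blob = encode_embedding([0.5, -1.0, 2.0])
```

Storing float32 rather than float64 halves the BLOB size at the cost of some precision, which is usually acceptable for similarity search.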

15.3.2 Persistence Implementation

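import sqlite3
from typing import List

class EpisodicMemoryStore:
    # self.db_path and self.embedding_model are assumed to be set in __init__ (not shown)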
    def _init_db(self):
        with sqlite3.connect(self.db_path) as conn:
            conn.execute("""
                CREATE TABLE IF NOT EXISTS episodes (
                    session_id TEXT PRIMARY KEY,
                    created_at TIMESTAMP,
                    task_description TEXT,
                    task_category TEXT,
                    steps_taken INTEGER,
                    tools_used TEXT,          -- JSON array
                    execution_time_seconds REAL,
                    success BOOLEAN,
                    key_result TEXT,
                    error_encountered TEXT,
                    error_resolution TEXT,
                    skills_applied TEXT,      -- JSON array
                    new_skill_created TEXT,
                    embedding BLOB
                )
            """)
            conn.execute("CREATE INDEX IF NOT EXISTS idx_created_at ON episodes(created_at)")
    
    async def search_similar(self, query: str, top_k: int = 5) -> List[Episode]:
        """Hybrid retrieval: BM25 keyword + vector similarity, fused with RRF"""
        # Declared async because MemoryInjector and SessionRestorer await this call
        # Over-fetch from each retriever so the fusion step has candidates to rank
        keyword_results = self._bm25_search(query, limit=top_k * 2)
        query_embedding = self.embedding_model.encode(query)
        vector_results = self._vector_search(query_embedding, limit=top_k * 2)
        return self._reciprocal_rank_fusion(keyword_results, vector_results, top_k)
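The body of `_reciprocal_rank_fusion` is not shown. The sketch below is the standard RRF formulation, in which each ranked list contributes 1 / (k + rank) to an item's score; the constant k = 60 is the conventional default, and the function name, signature, and session IDs are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], top_k: int, k: int = 60) -> List[str]:
    """Fuse ranked ID lists: each list adds 1 / (k + rank) to an item's score."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, item_id in enumerate(ranking, start=1):
            scores[item_id] += 1.0 / (k + rank)
    # Highest fused score first; ties fall back to the ID for determinism
    ordered = sorted(scores, key=lambda item_id: (-scores[item_id], item_id))
    return ordered[:top_k]


keyword_hits = ["s3", "s1", "s7"]   # hypothetical session IDs from BM25
vector_hits = ["s1", "s9", "s3"]    # hypothetical session IDs from vector search
fused = reciprocal_rank_fusion([keyword_hits, vector_hits], top_k=3)
print(fused)
```

Because RRF uses only ranks, it needs no score normalization between BM25 and cosine similarity, which is why it is a common choice for hybrid retrieval.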

15.3.3 Episodic Memory Injection

class MemoryInjector:
    async def inject_episodic_memory(self, task: str) -> str:
        relevant_episodes = await self.episodic_store.search_similar(query=task, top_k=3)
        
        # Only inject successful episodes or those with documented error resolutions
        valuable_episodes = [
            ep for ep in relevant_episodes
            if ep.success or ep.error_resolution is not None
        ]
        
        if not valuable_episodes:
            return ""
        
        inject_lines = ["## Relevant Historical Experience"]
        for ep in valuable_episodes[:2]:
            status = "successful" if ep.success else "failed but resolved"
            inject_lines.append(
                f"- **Similar task** ({status}, {ep.created_at.strftime('%b %d')}): {ep.key_result}"
            )
            if ep.error_resolution:
                inject_lines.append(f"  - **Pitfall**: {ep.error_encountered} → {ep.error_resolution}")
        
        return "\n".join(inject_lines)
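To see what the injected text looks like, the filter and layout can be exercised without a store. This sketch reuses the logic above on a trimmed-down stand-in for `Episode`; `EpisodeSummary`, `format_for_injection`, and the sample data are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class EpisodeSummary:
    """Stand-in for Episode with only the fields used for injection."""
    created_at: datetime
    success: bool
    key_result: str
    error_encountered: Optional[str] = None
    error_resolution: Optional[str] = None


def format_for_injection(episodes: List[EpisodeSummary], limit: int = 2) -> str:
    """Apply the same filter and layout as inject_episodic_memory, synchronously."""
    valuable = [ep for ep in episodes if ep.success or ep.error_resolution is not None]
    if not valuable:
        return ""
    lines = ["## Relevant Historical Experience"]
    for ep in valuable[:limit]:
        status = "successful" if ep.success else "failed but resolved"
        day = ep.created_at.strftime("%b %d")
        lines.append(f"- **Similar task** ({status}, {day}): {ep.key_result}")
        if ep.error_resolution:
            lines.append(f"  - **Pitfall**: {ep.error_encountered} → {ep.error_resolution}")
    return "\n".join(lines)


sample = [
    EpisodeSummary(datetime(2024, 3, 5), True, "Cleaned 3 CSV files"),
    EpisodeSummary(datetime(2024, 3, 6), False, "Import aborted"),  # unresolved: filtered out
    EpisodeSummary(datetime(2024, 3, 7), False, "Import fixed",
                   "UnicodeDecodeError", "re-read with encoding='latin-1'"),
]
text = format_for_injection(sample)
print(text)
```

The unresolved failure is dropped because it offers the model nothing actionable, while the resolved failure contributes a pitfall line.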

15.4 Semantic Memory: The Skill Library

15.4.1 Core Value

Semantic memory is the most economically valuable tier because it transforms one-off execution experience into reusable, composable skill units:

Experience → Distillation → Skill

One successful "data cleaning" task
    ↓ extracted
"csv_data_cleaning" Skill (with code template)
    ↓ reused
50th similar task: applied directly, 3× faster

15.4.2 Complete Skill Data Model

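from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List, Optional

@dataclass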
class Skill:
    id: str
    name: str                            # Short name, underscore-separated
    version: int = 1
    description: str = ""
    trigger_conditions: List[str] = field(default_factory=list)
    code_template: str = ""              # Executable template with {param} placeholders
    natural_language_steps: str = ""
    parameters: Dict[str, str] = field(default_factory=dict)
    required_parameters: List[str] = field(default_factory=list)
    dependencies: List[str] = field(default_factory=list)
    usage_count: int = 0
    success_rate: float = 1.0
    pitfalls: List[str] = field(default_factory=list)
    created_at: datetime = field(default_factory=datetime.now)
    last_used_at: Optional[datetime] = None
    health_status: str = "active"        # active | deprecated | needs_review
    embedding: Optional[List[float]] = None
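The `code_template` and `required_parameters` fields suggest a rendering step before a skill is executed. A minimal sketch of that step, assuming placeholders follow Python's `str.format` syntax; `render_template` and the sample template are hypothetical.

```python
from typing import Dict, List


def render_template(code_template: str, required: List[str], values: Dict[str, str]) -> str:
    """Fill {param} placeholders, refusing to render if a required parameter is missing."""
    missing = [name for name in required if name not in values]
    if missing:
        raise ValueError(f"missing required parameters: {', '.join(missing)}")
    return code_template.format(**values)


template = "df = pd.read_csv('{path}')\ndf = df.dropna(subset=['{column}'])"
code = render_template(template, ["path", "column"], {"path": "sales.csv", "column": "price"})
print(code)
```

One caveat with `str.format`: templates that contain literal braces (dict literals, f-strings) must double them as `{{` and `}}`, so a real implementation might prefer `string.Template` or a dedicated templating library.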

15.4.3 Skill Retrieval System

class SkillRetriever:
    async def retrieve_relevant_skills(self, task: str, top_k: int = 5) -> List[Skill]:
        """Hybrid retrieval strategy (recommended)"""
        semantic_results = await self.vector_store.search(task, top_k * 2)
        keyword_results = await self.sqlite_store.keyword_search(
            extract_technical_terms(task), top_k * 2
        )
        
        all_skills = self._deduplicate(semantic_results + keyword_results)
        scored_skills = [
            (skill, self._compute_retrieval_score(skill, task))
            for skill in all_skills
        ]
        scored_skills.sort(key=lambda x: x[1], reverse=True)
        return [skill for skill, _ in scored_skills[:top_k]]
    
    def _compute_retrieval_score(self, skill: Skill, task: str) -> float:
        """
        Score = semantic_similarity × quality_weight + frequency_bonus + health_penalty
        """
        semantic_score = self._semantic_similarity(skill, task)
        frequency_bonus = min(skill.usage_count / 100, 0.2)   # max +0.2
        quality_weight = skill.success_rate
        health_penalty = -0.3 if skill.health_status == "deprecated" else 0
        return semantic_score * quality_weight + frequency_bonus + health_penalty
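A worked example makes the trade-offs in this formula concrete. The sketch below copies the arithmetic into a standalone function that takes the similarity as an input; the three skills and their numbers are invented for illustration.

```python
def retrieval_score(semantic: float, success_rate: float,
                    usage_count: int, health_status: str) -> float:
    """Standalone copy of the scoring formula, with similarity passed in directly."""
    frequency_bonus = min(usage_count / 100, 0.2)   # saturates at 20 uses
    health_penalty = -0.3 if health_status == "deprecated" else 0.0
    return semantic * success_rate + frequency_bonus + health_penalty


# Hypothetical skills: (similarity, success rate, uses, status)
fresh = retrieval_score(0.90, 1.00, 0, "active")         # 0.90
proven = retrieval_score(0.80, 0.95, 40, "active")       # 0.76 + 0.20 = 0.96
retired = retrieval_score(0.95, 0.90, 40, "deprecated")  # 0.855 + 0.20 - 0.30 = 0.755
print(fresh, proven, retired)
```

A frequently used skill can outrank a slightly better semantic match, and a deprecated skill falls behind both even with the highest similarity. Note also that the frequency bonus reaches its cap after only 20 uses, so it separates new skills from established ones rather than ranking established skills against each other.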

15.5 Cross-Session Memory Persistence

15.5.1 Persistence Architecture

┌─────────────────────────────────────────────────────┐
│          Cross-Session Persistence Architecture     │
│                                                     │
│  Session A ends              Session B begins       │
│      │                           │                  │
│      ↓                           ↓                  │
│  ┌─────────────────┐       ┌──────────────────┐    │
│  │ Auto-archive    │       │ Auto-load         │    │
│  │ - Save episode  │       │ - Read MEMORY.md  │    │
│  │ - Extract skill │       │ - Retrieve skills │    │
│  │ - State snapshot│       │ - Retrieve episode│    │
│  └────────┬────────┘       └──────┬────────────┘    │
│           │                        ↑                 │
│           └──────── Persistent Storage ──────────────┘
│                │                                     │
│    ┌───────────┼───────────┐                         │
│    ↓           ↓           ↓                         │
│  SQLite    VectorDB    Filesystem                   │
│ (episodes) (skills)   (MEMORY.md)                  │
└─────────────────────────────────────────────────────┘

15.5.2 Context Restoration

class SessionRestorer:
    async def restore_context(self, new_session: Session) -> str:
        task = new_session.initial_task
        
        # 1. Load MEMORY.md (highest priority)
        memory_md = self._load_memory_md()
        
        # 2. Retrieve relevant skills
        relevant_skills = await self.skill_retriever.retrieve_relevant_skills(task, top_k=5)
        
        # 3. Retrieve relevant episodes
        relevant_episodes = await self.episodic_store.search_similar(query=task, top_k=3)
        
        # 4. Assemble restoration context
        context_parts = []
        if memory_md:
            context_parts.append(f"## User-Defined Memory\n{memory_md}")
        if relevant_skills:
            context_parts.append(f"## Available Skills (from past experience)\n{self._format_skills(relevant_skills)}")
        if relevant_episodes:
            context_parts.append(f"## Relevant Historical Experience\n{self._format_episodes(relevant_episodes)}")
        
        return "\n\n---\n\n".join(context_parts)

15.6 MEMORY.md Injection Mechanism

15.6.1 Design Philosophy

MEMORY.md is a distinctive Hermes mechanism that lets users manually supply persistent context to the model. The file is automatically injected into the system prompt at the start of every conversation.

# Typical MEMORY.md content example

## User Preferences
- Language preference: Python (primary) > JavaScript > Go
- Code style: PEP 8, type annotations, detailed comments
- Output format: Markdown preferred, syntax-highlighted code blocks

## Project Background
- Current main project: YiteAI blog platform (Next.js + SQLite)
- Repository path: /Users/hexin/code/yiteai/
- Production: Ubuntu 22.04 VPS, domain dev.yiteai.com

## Database Info
- Primary DB: SQLite, path /data/yiteai.db
- Schema: see /docs/schema.md

## Common Commands
- Start dev server: cd ~/code/yiteai && npm run dev
- Deploy: ./scripts/deploy.sh production

15.6.2 MEMORY.md Injection Flow

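import os

class MemoryMdInjector:
    def __init__(self, memory_md_path: str):
        # Constructor is an assumption; the original listing omits it
        self.memory_md_path = memory_md_path
        self._cache = None          # Cached file content
        self._cache_mtime = 0.0     # mtime of the cached version

    def load_with_cache(self) -> str:
        """Cache-based MEMORY.md loading (re-reads the file only when its mtime changes)"""
        if not os.path.exists(self.memory_md_path):
            return ""               # No MEMORY.md yet: nothing to inject
        current_mtime = os.path.getmtime(self.memory_md_path)
        if self._cache is None or current_mtime > self._cache_mtime:
            with open(self.memory_md_path, 'r', encoding='utf-8') as f:
                self._cache = f.read()
            self._cache_mtime = current_mtime
        return self._cache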
    
    def inject_into_system_prompt(self, base_system_prompt: str) -> str:
        memory_content = self.load_with_cache()
        if not memory_content.strip():
            return base_system_prompt
        
        return f"""{base_system_prompt}

---

## User Persistent Memory (MEMORY.md)

{memory_content}

---

The above is the user's persistent context information. Please reference this when processing tasks."""
    
    async def auto_update(self, session: Session, llm_client) -> bool:
        """After conversation ends, let LLM decide if MEMORY.md needs updating"""
        update_check_prompt = f"""
Current MEMORY.md content:
{self.load_with_cache()}

This conversation discovered the following new information:
{session.get_notable_discoveries()}

Decide if MEMORY.md should be updated to record important new information.
If yes, output the complete updated MEMORY.md content.
If no update needed, output "NO_UPDATE".
"""
        response = await llm_client.generate(update_check_prompt, max_tokens=2000)
        updated = response.strip()
        
        # Never overwrite MEMORY.md with an empty reply
        if updated and updated != "NO_UPDATE":
            with open(self.memory_md_path, 'w', encoding='utf-8') as f:
                f.write(updated)
            self._cache = None      # Force a re-read on the next load
            return True
        return False
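Because `auto_update` overwrites MEMORY.md in place, a bad LLM reply can destroy user-written content. One possible safeguard, sketched here as an assumption rather than as Hermes behavior, is to keep a backup and swap the new file in atomically; `safe_overwrite` is a hypothetical helper built only on the standard library.

```python
import os
import shutil
import tempfile


def safe_overwrite(path: str, new_content: str) -> None:
    """Replace a file atomically, keeping the previous version as <path>.bak."""
    if os.path.exists(path):
        shutil.copy2(path, path + ".bak")
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temp file in the same directory, then swap it into place
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write(new_content)
        os.replace(tmp_path, path)
    except Exception:
        os.unlink(tmp_path)
        raise


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "MEMORY.md")
    safe_overwrite(target, "v1")
    safe_overwrite(target, "v2")
    with open(target, encoding="utf-8") as f:
        current = f.read()
    with open(target + ".bak", encoding="utf-8") as f:
        backup = f.read()
```

With this in place, a reader never sees a half-written file, and the previous version can be restored by hand if an update turns out to be wrong.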

15.6.3 MEMORY.md Placement Strategy

MEMORY.md content resides in the Sacred Zone of the system prompt (see Chapter 16), ensuring:

  1. It is never removed by the compression mechanism
  2. It always appears at the front of the context window
  3. It has persistent guidance effect on model behavior
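The first two guarantees can be illustrated with a toy compression pass. The actual mechanism is the subject of Chapter 16; the sketch below only shows the invariant, under two assumptions: system-role messages stand for the Sacred Zone, and token counts are approximated by whitespace-separated words. `trim_history` is a hypothetical name.

```python
from typing import Dict, List

Message = Dict[str, str]


def trim_history(messages: List[Message], max_tokens: int) -> List[Message]:
    """Drop the oldest non-system messages until the estimated size fits max_tokens."""
    def size(msgs: List[Message]) -> int:
        # Whitespace word count as a rough token estimate (assumption)
        return sum(len(m["content"].split()) for m in msgs)

    sacred = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    while rest and size(sacred) + size(rest) > max_tokens:
        rest.pop(0)  # the oldest compressible message goes first
    return sacred + rest


messages = [
    {"role": "system", "content": "base prompt plus MEMORY.md content"},
    {"role": "user", "content": "first question about the deploy script"},
    {"role": "assistant", "content": "a long answer " * 5},
    {"role": "user", "content": "latest question"},
]
trimmed = trim_history(messages, max_tokens=12)
```

However tight the budget, the system message that carries MEMORY.md survives and stays at the front; only the conversation history shrinks.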

15.7 Complete Three-Tier Memory Coordination Example

async def complete_memory_workflow(agent: HermesAgent, task: str):
    """Shows how the three memory tiers work together"""
    session = agent.create_session()
    
    # ─── Phase 1: Session startup ─────────────────────────
    
    working_memory = agent.working_memory.initialize()
    
    # Inject MEMORY.md (user persistent memory)
    memory_md = agent.memory_md_injector.load_with_cache()
    working_memory.inject_section("user_memory", memory_md)
    
    # Retrieve and inject semantic memory (relevant skills)
    relevant_skills = await agent.skill_retriever.retrieve_relevant_skills(task, top_k=5)
    working_memory.inject_section("skills", format_skills(relevant_skills))
    
    # Retrieve and inject episodic memory (relevant history)
    relevant_episodes = await agent.episodic_store.search_similar(task, top_k=3)
    working_memory.inject_section("episodes", format_episodes(relevant_episodes))
    
    # ─── Phase 2: Task execution ────────────────────────────
    result = await agent.execute_task(task, session, working_memory)
    
    # ─── Phase 3: Session cleanup ─────────────────────────
    
    # Archive episodic memory
    episode = await agent.episodic_builder.build_episode(session)
    agent.episodic_store.save_episode(episode)
    
    # Extract and save semantic memory (new skill)
    new_skill = await agent.skill_extractor.extract(session)
    if new_skill:
        await agent.skill_store.save(new_skill)
    
    # Auto-update MEMORY.md if needed
    await agent.memory_md_injector.auto_update(session, agent.llm)
    
    return result

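Chapter Summary

Hermes organizes memory into three tiers. Working memory is the context window, managed with a fixed token budget, front-loaded critical information, and compressed tool results. Episodic memory stores session records in SQLite and retrieves them with hybrid BM25 and vector search fused by RRF. Semantic memory is the skill library, where successful experience is distilled into reusable skills ranked by similarity, success rate, usage, and health status. MEMORY.md adds user-controlled persistent context that lives in the Sacred Zone of the system prompt.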

Discussion Questions

  1. How might the working memory token budget (system prompt 8%, memory injection 12%, conversation 30%, tool results 35%, output reserve 15%) have been determined? For code-intensive tasks, should these ratios be adjusted?
  2. Episodic memory uses BM25 + vector hybrid retrieval (RRF fusion). In what scenarios is pure vector retrieval superior? In what scenarios is pure keyword retrieval superior?
  3. MEMORY.md allows user manual editing, but the LLM can also auto-update it. How do you prevent the LLM from accidentally deleting important information during auto-updates?
  4. What "forgetting strategy" should each memory tier implement? When episodic memory exceeds 10,000 entries, which episodes should be prioritized for deletion?