Chapter 15: Three-Tier Memory Architecture — Working, Episodic, and Semantic Memory
Memory is the foundation of intelligence. An AI Agent without memory is like a person who wakes up amnesiac every day — unable to accumulate experience, build relationships, or grow. Hermes's three-tier memory architecture is the key design that gives AI a "temporal dimension."
15.1 Cognitive Science Foundations
15.1.1 From Human Memory to AI Memory
Human memory systems, shaped by millions of years of evolution, have a sophisticated layered structure:
| Human Memory Tier | Characteristics | AI Equivalent |
|---|---|---|
| Working Memory | Active information within current attention span, capacity ~7±2 chunks | Context Window |
| Episodic Memory | Personal experience sequences — "when, where, what happened" | Session History |
| Semantic Memory | Decontextualized knowledge — "facts, concepts, skills" | Skill Library |
These layers aren't arbitrary — each solves a fundamentally different problem:
- Working memory addresses current processing capacity
- Episodic memory addresses sequential experience accumulation
- Semantic memory addresses generalizable knowledge
15.1.2 Three-Tier Interaction
┌──────────────────────────────────────────────────────────┐
│ Three-Tier Memory Interaction │
│ │
│ ┌────────────────────────────────────────────────────┐ │
│ │ Working Memory (Context Window) │ │
│ │ ← Up to 32K tokens → │ │
│ │ │ │
│ │ [System Prompt] [MEMORY.md] [Skills] [History] │ │
│ │ │ │
│ │ ← Sacred Zone → ← Compressible Zone → │ │
│ └────────────────────────────────────────────────────┘ │
│ ↑ inject ↑ archive │
│ │ │ │
│ ┌───────────┴──────┐ ┌─────────────┴────────┐ │
│ │ Semantic Memory │ │ Episodic Memory │ │
│ │ (Skill Library) │ │ (Session History) │ │
│ │ │ │ │ │
│ │ Skill names │ │ Full session records │ │
│ │ Code templates │ │ Timestamps │ │
│ │ Usage stats │ │ Task descriptions │ │
│ │ Trigger conds │ │ Key results │ │
│ │ │ │ │ │
│ │ Store: VectorDB │ │ Store: SQLite │ │
│ │ Retrieve: ANN │ │ Retrieve: BM25+vec │ │
│ └──────────────────┘ └───────────────────────┘ │
└──────────────────────────────────────────────────────────┘
15.2 Working Memory
15.2.1 Essence: Context Window Management
Working memory is everything the model can currently "see" — the full content of the context window. Hermes's working memory design centers on one core question: How do you place the most important information within a limited context window?
class WorkingMemory:
    def __init__(self, max_tokens: int = 32768):
        self.max_tokens = max_tokens
        self.slots = {
            "system_prompt": None,        # Fixed position (Sacred Zone)
            "memory_injection": None,     # MEMORY.md + Skill injection
            "conversation_history": [],   # Dialogue history
            "tool_results": [],           # Tool execution results
        }

    def get_token_budget(self) -> dict:
        total = self.max_tokens
        return {
            "system_prompt": int(total * 0.08),      # ~2.6K tokens
            "memory_injection": int(total * 0.12),   # ~3.9K tokens
            "conversation": int(total * 0.30),       # ~9.8K tokens
            "tool_results": int(total * 0.35),       # ~11.4K tokens
            "response_reserve": int(total * 0.15),   # ~4.9K tokens (model output)
        }
15.2.2 Context Window Token Distribution
┌──────────────────────────────────────────────────┐
│ 32K Token Context Window Distribution │
│ │
│ ████ System prompt (8%) ≈ 2.6K tokens │
│ ████ Memory injection (12%) ≈ 3.9K tokens │
│ ████ Conversation history(30%) ≈ 9.8K tokens │
│ ████ Tool results (35%) ≈ 11.4K tokens │
│ ░░░░ Output reserve (15%) ≈ 4.9K tokens │
│ │
│ Effective utilization: ~85% │
└──────────────────────────────────────────────────┘
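The percentages in the distribution above can be checked with a few lines of arithmetic. The sketch below reuses the shares from `get_token_budget`; the names `BUDGET_SHARES` and `allocate_budget` are hypothetical and exist only for this illustration.

```python
from typing import Dict

# Shares copied from get_token_budget(); the constant name is hypothetical.
BUDGET_SHARES: Dict[str, float] = {
    "system_prompt": 0.08,
    "memory_injection": 0.12,
    "conversation": 0.30,
    "tool_results": 0.35,
    "response_reserve": 0.15,
}


def allocate_budget(max_tokens: int = 32768) -> Dict[str, int]:
    """Split the context window into per-slot token budgets, rounding down."""
    return {slot: int(max_tokens * share) for slot, share in BUDGET_SHARES.items()}


budget = allocate_budget()

# Everything except the output reserve is available for input content.
input_tokens = sum(tokens for slot, tokens in budget.items() if slot != "response_reserve")

for slot, tokens in budget.items():
    print(f"{slot:<18}{tokens:>6}")
print(f"input share: {input_tokens / 32768:.1%}")
```

With the default 32,768-token window, the four input slots add up to 27,851 tokens, roughly 85% of the window, which is where the "effective utilization" figure comes from. Because each slot is rounded down, a couple of tokens remain unallocated.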
15.2.3 Key Strategies
Strategy 1: Prioritize Important Information at the Front
Research on both human cognition and LLMs shows better recall for content at the beginning and end of context ("position bias"). Hermes places critical information at the start:
# Correct: critical information at the front
context = [
    {"role": "system", "content": f"{SYSTEM_PROMPT}\n\n{INJECTED_SKILLS}"},
    {"role": "user", "content": user_message},
    # ... conversation history ...
]

# Wrong: critical information buried in the middle
context = [
    {"role": "user", "content": user_message},
    # ... very long tool results ...
    {"role": "system", "content": INJECTED_SKILLS},  # Too late: often "forgotten"
    # ... more conversation history ...
]
Strategy 2: Tool Result Compression
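def compress_tool_result(result: str, max_tokens: int = 500) -> str:
    """Shrink a tool result to roughly max_tokens before it enters the context."""
    # Helpers such as count_tokens and is_code_output are assumed to be defined elsewhere.
    if count_tokens(result) <= max_tokens:
        return result
    if is_code_output(result):
        lines = result.split('\n')
        if len(lines) > 50:
            # Keep the head (first 25 lines) and tail (last 10 lines), drop the middle
            return '\n'.join(lines[:25]) + '\n... [truncated] ...\n' + '\n'.join(lines[-10:])
    if is_tabular_data(result):
        return create_data_summary(result)
    # Fallback for everything else: hard truncation to the token budget
    return truncate_to_tokens(result, max_tokens)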
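`compress_tool_result` depends on helpers that the chapter does not show. The sketch below is one possible stand-in for two of them, `count_tokens` and `truncate_to_tokens`. It assumes a crude estimate of four characters per token; a real deployment would more likely call the tokenizer of the target model, so treat both bodies as illustrative only.

```python
CHARS_PER_TOKEN = 4  # Assumption: rough average, not a real tokenizer
TRUNCATION_MARKER = "\n... [truncated]"


def count_tokens(text: str) -> int:
    """Estimate the token count from the character count, rounding up."""
    return (len(text) + CHARS_PER_TOKEN - 1) // CHARS_PER_TOKEN


def truncate_to_tokens(text: str, max_tokens: int) -> str:
    """Cut text so that its estimated token count stays within max_tokens."""
    if count_tokens(text) <= max_tokens:
        return text
    budget_chars = max_tokens * CHARS_PER_TOKEN
    if budget_chars <= len(TRUNCATION_MARKER):
        # Budget too small to fit the marker: plain cut
        return text[:budget_chars]
    return text[: budget_chars - len(TRUNCATION_MARKER)] + TRUNCATION_MARKER


long_output = "x" * 10_000
compressed = truncate_to_tokens(long_output, max_tokens=500)
print(count_tokens(long_output), "->", count_tokens(compressed))
```

The marker is counted against the budget, so the truncated string never exceeds `max_tokens` under this estimate.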
15.3 Episodic Memory
15.3.1 Data Model
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class Episode:
    session_id: str
    created_at: datetime
    task_description: str             # Brief task description
    task_category: str                # Classified type
    steps_taken: int
    tools_used: List[str]
    execution_time_seconds: float
    success: bool
    key_result: str                   # Key result summary (<200 chars)
    error_encountered: Optional[str]
    error_resolution: Optional[str]
    skills_applied: List[str]
    new_skill_created: Optional[str]
    embedding: Optional[List[float]] = None
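SQLite has no native list or vector type, so fields such as `tools_used` and `embedding` need an encoding before they can be stored. The sketch below shows one plausible approach: JSON text for string lists and packed 32-bit floats for the embedding. The helpers `pack_embedding` and `unpack_embedding` and the three-column table are assumptions made for this example, not part of Hermes.

```python
import json
import sqlite3
from array import array
from typing import List


def pack_embedding(vector: List[float]) -> bytes:
    """Encode an embedding as raw 32-bit floats, suitable for a BLOB column."""
    return array("f", vector).tobytes()


def unpack_embedding(blob: bytes) -> List[float]:
    """Decode a BLOB written by pack_embedding back into a list of floats."""
    values = array("f")
    values.frombytes(blob)
    return values.tolist()


# Reduced table for illustration: only the columns that need encoding.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE episodes (session_id TEXT PRIMARY KEY, tools_used TEXT, embedding BLOB)"
)
conn.execute(
    "INSERT INTO episodes VALUES (?, ?, ?)",
    ("s1", json.dumps(["shell", "python"]), pack_embedding([0.5, -0.25, 1.0])),
)

row = conn.execute(
    "SELECT tools_used, embedding FROM episodes WHERE session_id = ?", ("s1",)
).fetchone()
tools = json.loads(row[0])
embedding = unpack_embedding(row[1])
conn.close()
```

Storing 32-bit floats halves the size compared with Python's native doubles, at the cost of some precision; whether that trade-off is acceptable depends on the embedding model.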
15.3.2 Persistence Implementation
import sqlite3
from typing import List


class EpisodicMemoryStore:
    # Constructor omitted in this excerpt; it is expected to set self.db_path
    # and self.embedding_model.

    def _init_db(self):
        with sqlite3.connect(self.db_path) as conn:
            conn.execute("""
                CREATE TABLE IF NOT EXISTS episodes (
                    session_id TEXT PRIMARY KEY,
                    created_at TIMESTAMP,
                    task_description TEXT,
                    task_category TEXT,
                    steps_taken INTEGER,
                    tools_used TEXT,              -- JSON array
                    execution_time_seconds REAL,
                    success BOOLEAN,
                    key_result TEXT,
                    error_encountered TEXT,
                    error_resolution TEXT,
                    skills_applied TEXT,          -- JSON array
                    new_skill_created TEXT,
                    embedding BLOB
                )
            """)
            conn.execute("CREATE INDEX IF NOT EXISTS idx_created_at ON episodes(created_at)")

    async def search_similar(self, query: str, top_k: int = 5) -> List[Episode]:
        """Hybrid retrieval: BM25 keyword + vector similarity, fused with RRF"""
        # Declared async because callers await it (see MemoryInjector and SessionRestorer).
        # Each retriever over-fetches (2x top_k) so the fusion step has candidates to merge.
        keyword_results = self._bm25_search(query, limit=top_k * 2)
        query_embedding = self.embedding_model.encode(query)
        vector_results = self._vector_search(query_embedding, limit=top_k * 2)
        return self._reciprocal_rank_fusion(keyword_results, vector_results, top_k)
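The body of `_reciprocal_rank_fusion` is not shown, so the following is a minimal sketch of the general technique rather than Hermes's implementation. Reciprocal rank fusion gives each item a score of 1 / (k + rank) in every list it appears in and sums those scores. The constant `k = 60` is a commonly used default; the value Hermes uses is not stated. The function operates on session IDs for simplicity.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[str]], top_k: int = 5, k: int = 60
) -> List[str]:
    """Fuse several ranked ID lists: score(id) = sum of 1 / (k + rank) over lists."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, item_id in enumerate(ranking, start=1):
            scores[item_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic
    ordered = sorted(scores, key=lambda item_id: (-scores[item_id], item_id))
    return ordered[:top_k]


keyword_hits = ["s3", "s1", "s7"]  # hypothetical BM25 ranking
vector_hits = ["s1", "s9", "s3"]   # hypothetical vector ranking
fused = reciprocal_rank_fusion([keyword_hits, vector_hits], top_k=3)
print(fused)  # ['s1', 's3', 's9']
```

Sessions that appear in both lists (`s1`, `s3`) rise above those found by only one retriever. Because only ranks are used, BM25 scores and cosine similarities never have to be normalized onto a common scale.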
15.3.3 Episodic Memory Injection
class MemoryInjector:
    async def inject_episodic_memory(self, task: str) -> str:
        relevant_episodes = await self.episodic_store.search_similar(query=task, top_k=3)

        # Only inject successful episodes or those with documented error resolutions
        valuable_episodes = [
            ep for ep in relevant_episodes
            if ep.success or ep.error_resolution is not None
        ]
        if not valuable_episodes:
            return ""

        inject_lines = ["## Relevant Historical Experience"]
        for ep in valuable_episodes[:2]:
            status = "successful" if ep.success else "failed but resolved"
            inject_lines.append(
                f"- **Similar task** ({status}, {ep.created_at.strftime('%b %d')}): {ep.key_result}"
            )
            if ep.error_resolution:
                inject_lines.append(f"  - **Pitfall**: {ep.error_encountered} → {ep.error_resolution}")
        return "\n".join(inject_lines)
15.4 Semantic Memory: The Skill Library
15.4.1 Core Value
Semantic memory is the most economically valuable tier — transforming temporary execution experience into reusable, composable skill units:
Experience → Distillation → Skill
One successful "data cleaning" task
↓ extracted
"csv_data_cleaning" Skill (with code template)
↓ reused
50th similar task: applied directly, 3× faster
15.4.2 Complete Skill Data Model
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List, Optional


@dataclass
class Skill:
    id: str
    name: str                          # Short name, underscore-separated
    version: int = 1
    description: str = ""
    trigger_conditions: List[str] = field(default_factory=list)
    code_template: str = ""            # Executable template with {param} placeholders
    natural_language_steps: str = ""
    parameters: Dict[str, str] = field(default_factory=dict)
    required_parameters: List[str] = field(default_factory=list)
    dependencies: List[str] = field(default_factory=list)
    usage_count: int = 0
    success_rate: float = 1.0
    pitfalls: List[str] = field(default_factory=list)
    created_at: datetime = field(default_factory=datetime.now)
    last_used_at: Optional[datetime] = None
    health_status: str = "active"      # active | deprecated | needs_review
    embedding: Optional[List[float]] = None
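The `code_template` field uses `{param}` placeholders and `required_parameters` lists the values that must be supplied. The sketch below shows one way such a template could be filled in, using Python's built-in `str.format`. The functions `template_fields` and `render_skill` and the sample template are hypothetical; the chapter does not specify how Hermes renders templates.

```python
from string import Formatter
from typing import Dict, List


def template_fields(code_template: str) -> List[str]:
    """Return the placeholder names that appear in a template, sorted."""
    return sorted({name for _, name, _, _ in Formatter().parse(code_template) if name})


def render_skill(code_template: str, required: List[str], values: Dict[str, str]) -> str:
    """Fill in a template after checking that every required parameter is present."""
    missing = [param for param in required if param not in values]
    if missing:
        raise ValueError(f"missing required parameters: {missing}")
    return code_template.format(**values)


# Hypothetical template for a "csv_data_cleaning" skill
TEMPLATE = "df = pd.read_csv('{path}')\ndf = df.dropna(subset=['{column}'])"

print(template_fields(TEMPLATE))
rendered = render_skill(
    TEMPLATE,
    required=["path", "column"],
    values={"path": "data.csv", "column": "price"},
)
print(rendered)
```

One caveat with `str.format`: templates containing literal braces, such as dict literals or f-strings, must escape them as `{{` and `}}`. A dedicated template engine may be a better fit for larger code templates.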
15.4.3 Skill Retrieval System
class SkillRetriever:
    async def retrieve_relevant_skills(self, task: str, top_k: int = 5) -> List[Skill]:
        """Hybrid retrieval strategy (recommended)"""
        semantic_results = await self.vector_store.search(task, top_k * 2)
        keyword_results = await self.sqlite_store.keyword_search(
            extract_technical_terms(task), top_k * 2
        )
        all_skills = self._deduplicate(semantic_results + keyword_results)
        scored_skills = [
            (skill, self._compute_retrieval_score(skill, task))
            for skill in all_skills
        ]
        scored_skills.sort(key=lambda x: x[1], reverse=True)
        return [skill for skill, _ in scored_skills[:top_k]]

    def _compute_retrieval_score(self, skill: Skill, task: str) -> float:
        """
        Score = semantic_similarity × quality_weight + frequency_bonus + health_penalty
        """
        semantic_score = self._semantic_similarity(skill, task)
        frequency_bonus = min(skill.usage_count / 100, 0.2)  # max +0.2
        quality_weight = skill.success_rate
        health_penalty = -0.3 if skill.health_status == "deprecated" else 0
        return semantic_score * quality_weight + frequency_bonus + health_penalty
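A worked example makes the scoring formula easier to follow. The standalone function below repeats the arithmetic of `_compute_retrieval_score` with the semantic similarity passed in directly. The three candidate skills and their statistics are invented for illustration.

```python
def retrieval_score(
    semantic: float, success_rate: float, usage_count: int, health_status: str
) -> float:
    """Same arithmetic as _compute_retrieval_score, with similarity as an input."""
    frequency_bonus = min(usage_count / 100, 0.2)  # saturates at 20 uses
    health_penalty = -0.3 if health_status == "deprecated" else 0.0
    return semantic * success_rate + frequency_bonus + health_penalty


# Hypothetical candidates: (similarity, success rate, usage count, health status)
candidates = {
    "csv_data_cleaning": retrieval_score(0.80, 0.95, 50, "active"),      # 0.76 + 0.20
    "generic_file_parser": retrieval_score(0.90, 0.60, 5, "active"),     # 0.54 + 0.05
    "legacy_csv_cleaner": retrieval_score(0.95, 1.00, 40, "deprecated"),  # 0.95 + 0.20 - 0.30
}

ranking = sorted(candidates, key=candidates.get, reverse=True)
for name in ranking:
    print(f"{name:<22}{candidates[name]:.2f}")
```

Two properties of the formula show up here. A reliable, frequently used skill can outrank one with higher raw similarity but a poor success rate. A deprecated skill with strong statistics can also still rank above an active one, so a system that must never apply deprecated skills would need a hard filter in addition to the penalty.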
15.5 Cross-Session Memory Persistence
15.5.1 Persistence Architecture
┌─────────────────────────────────────────────────────┐
│ Cross-Session Persistence Architecture │
│ │
│ Session A ends Session B begins │
│ │ │ │
│ ↓ ↓ │
│ ┌─────────────────┐ ┌──────────────────┐ │
│ │ Auto-archive │ │ Auto-load │ │
│ │ - Save episode │ │ - Read MEMORY.md │ │
│ │ - Extract skill │ │ - Retrieve skills │ │
│ │ - State snapshot│ │ - Retrieve episode│ │
│ └────────┬────────┘ └──────┬────────────┘ │
│ │ ↑ │
│ └──────── Persistent Storage ──────────────┘ │
│ │ │
│ ┌───────────┼───────────┐ │
│ ↓ ↓ ↓ │
│ SQLite VectorDB Filesystem │
│ (episodes) (skills) (MEMORY.md) │
└─────────────────────────────────────────────────────┘
15.5.2 Context Restoration
class SessionRestorer:
    async def restore_context(self, new_session: Session) -> str:
        task = new_session.initial_task

        # 1. Load MEMORY.md (highest priority)
        memory_md = self._load_memory_md()

        # 2. Retrieve relevant skills
        relevant_skills = await self.skill_retriever.retrieve_relevant_skills(task, top_k=5)

        # 3. Retrieve relevant episodes
        relevant_episodes = await self.episodic_store.search_similar(query=task, top_k=3)

        # 4. Assemble restoration context
        context_parts = []
        if memory_md:
            context_parts.append(f"## User-Defined Memory\n{memory_md}")
        if relevant_skills:
            context_parts.append(f"## Available Skills (from past experience)\n{self._format_skills(relevant_skills)}")
        if relevant_episodes:
            context_parts.append(f"## Relevant Historical Experience\n{self._format_episodes(relevant_episodes)}")
        return "\n\n---\n\n".join(context_parts)
15.6 MEMORY.md Injection Mechanism
15.6.1 Design Philosophy
MEMORY.md is a mechanism unique to Hermes that lets users manually supply persistent context to the model. The file is automatically injected into the system prompt at the start of every conversation.
# Typical MEMORY.md content example
## User Preferences
- Language preference: Python (primary) > JavaScript > Go
- Code style: PEP 8, type annotations, detailed comments
- Output format: Markdown preferred, syntax-highlighted code blocks
## Project Background
- Current main project: YiteAI blog platform (Next.js + SQLite)
- Repository path: /Users/hexin/code/yiteai/
- Production: Ubuntu 22.04 VPS, domain dev.yiteai.com
## Database Info
- Primary DB: SQLite, path /data/yiteai.db
- Schema: see /docs/schema.md
## Common Commands
- Start dev server: cd ~/code/yiteai && npm run dev
- Deploy: ./scripts/deploy.sh production
15.6.2 MEMORY.md Injection Flow
import os


class MemoryMdInjector:
    # Constructor omitted in this excerpt; it is expected to set self.memory_md_path
    # and to initialize self._cache = None and self._cache_mtime = 0.0.

    def load_with_cache(self) -> str:
        """Cache-based MEMORY.md loading (avoids re-reading the file on every request)"""
        if not os.path.exists(self.memory_md_path):
            return ""  # No MEMORY.md yet: nothing to inject
        current_mtime = os.path.getmtime(self.memory_md_path)
        # Re-read only when the cache is empty or the file changed on disk
        if self._cache is None or current_mtime > self._cache_mtime:
            with open(self.memory_md_path, 'r', encoding='utf-8') as f:
                self._cache = f.read()
            self._cache_mtime = current_mtime
        return self._cache

    def inject_into_system_prompt(self, base_system_prompt: str) -> str:
        memory_content = self.load_with_cache()
        if not memory_content.strip():
            return base_system_prompt
        return f"""{base_system_prompt}
---
## User Persistent Memory (MEMORY.md)
{memory_content}
---
The above is the user's persistent context information. Please reference this when processing tasks."""

    async def auto_update(self, session: Session, llm_client) -> bool:
        """After conversation ends, let LLM decide if MEMORY.md needs updating"""
        update_check_prompt = f"""
Current MEMORY.md content:
{self.load_with_cache()}
This conversation discovered the following new information:
{session.get_notable_discoveries()}
Decide if MEMORY.md should be updated to record important new information.
If yes, output the complete updated MEMORY.md content.
If no update needed, output "NO_UPDATE".
"""
        response = await llm_client.generate(update_check_prompt, max_tokens=2000)
        if response.strip() != "NO_UPDATE":
            # The file is overwritten with the model's full output
            with open(self.memory_md_path, 'w', encoding='utf-8') as f:
                f.write(response.strip())
            self._cache = None  # Invalidate the cache so the next load re-reads the file
            return True
        return False
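Because `auto_update` overwrites MEMORY.md with whatever the model returns, a truncated or badly formed response could wipe out user-written notes. The sketch below shows one possible safeguard, offered as an assumption rather than as Hermes behavior: keep a backup copy, reject updates that shrink the file drastically, and replace the file atomically. The function name `safe_write_memory_md` and the `min_ratio` threshold are hypothetical.

```python
import os
import shutil
import tempfile


def safe_write_memory_md(path: str, new_content: str, min_ratio: float = 0.5) -> bool:
    """Write new_content atomically and keep a .bak copy of the previous version.

    Returns False, leaving the file untouched, if the new content is shorter
    than min_ratio times the old content (a crude guard against lost notes).
    """
    old_content = ""
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            old_content = f.read()

    if old_content and len(new_content) < len(old_content) * min_ratio:
        return False

    if old_content:
        shutil.copyfile(path, path + ".bak")

    # Write to a temporary file in the same directory, then swap it into place
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write(new_content)
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return True
```

A length check is only a heuristic. A stricter variant might ask the model for a list of additions instead of a full rewrite, or require user confirmation before any deletion.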
15.6.3 MEMORY.md Placement Strategy
MEMORY.md content resides in the Sacred Zone of the system prompt (see Chapter 16), ensuring:
- It is never removed by the compression mechanism
- It always appears at the front of the context window
- It has persistent guidance effect on model behavior
15.7 Complete Three-Tier Memory Coordination Example
async def complete_memory_workflow(agent: HermesAgent, task: str):
    """Shows how the three memory tiers work together"""
    session = agent.create_session()

    # ─── Phase 1: Session startup ─────────────────────────
    working_memory = agent.working_memory.initialize()

    # Inject MEMORY.md (user persistent memory)
    memory_md = agent.memory_md_injector.load_with_cache()
    working_memory.inject_section("user_memory", memory_md)

    # Retrieve and inject semantic memory (relevant skills)
    relevant_skills = await agent.skill_retriever.retrieve_relevant_skills(task, top_k=5)
    working_memory.inject_section("skills", format_skills(relevant_skills))

    # Retrieve and inject episodic memory (relevant history)
    relevant_episodes = await agent.episodic_store.search_similar(task, top_k=3)
    working_memory.inject_section("episodes", format_episodes(relevant_episodes))

    # ─── Phase 2: Task execution ────────────────────────────
    result = await agent.execute_task(task, session, working_memory)

    # ─── Phase 3: Session cleanup ─────────────────────────
    # Archive episodic memory
    episode = await agent.episodic_builder.build_episode(session)
    agent.episodic_store.save_episode(episode)

    # Extract and save semantic memory (new skill)
    new_skill = await agent.skill_extractor.extract(session)
    if new_skill:
        await agent.skill_store.save(new_skill)

    # Auto-update MEMORY.md if needed
    await agent.memory_md_injector.auto_update(session, agent.llm)
    return result
Chapter Summary
- Three tiers solve different problems: working memory (current processing), episodic memory (sequential experience), semantic memory (generalizable knowledge)
- Working memory uses precise token budget allocation to prioritize the most important information in the limited context window
- Episodic memory uses BM25 + vector hybrid retrieval to find the most relevant historical experience
- Semantic memory (Skill library) uses multi-strategy retrieval with composite scoring to select skills best suited to the current task
- MEMORY.md is a user-controllable persistent memory injection mechanism located in the Sacred Zone, never compressed
- All three tiers inject cooperatively at session start and archive cooperatively at session end
Discussion Questions
- The working memory token budget (system prompt 8%, memory injection 12%, conversation 30%, tool results 35%, output reserve 15%): how was it determined? For code-intensive tasks, should these ratios be adjusted?
- Episodic memory uses BM25 + vector hybrid retrieval (RRF fusion). In what scenarios is pure vector retrieval superior? In what scenarios is pure keyword retrieval superior?
- MEMORY.md allows user manual editing, but the LLM can also auto-update it. How do you prevent the LLM from accidentally deleting important information during auto-updates?
- What "forgetting strategy" should each memory tier implement? When episodic memory exceeds 10,000 entries, which episodes should be prioritized for deletion?