Chapter 28: RAG Architecture with Claude: Engineering Practice of Retrieval-Augmented Generation
28.1 The Core Problem RAG Solves
Large language models have a fundamental tension at their core: their knowledge is frozen at training time, their context windows have hard token limits, yet the knowledge bases that organizations need to query grow without bound.
Retrieval-Augmented Generation (RAG) resolves this tension by dynamically retrieving the most relevant document fragments at query time and injecting only those into Claude's context. The model never needs to "know" your entire knowledge base — it only needs to reason over the relevant portions it receives.
RAG vs Fine-Tuning vs Long Context
| Approach | Best For | Strengths | Weaknesses |
|---|---|---|---|
| RAG | Frequently updated knowledge, factual Q&A | Real-time updates, explainable, cost-efficient | Quality depends on retrieval engineering |
| Fine-tuning | Fixed style, format, or domain behavior | Knowledge "internalized" in model weights | High update cost, no real-time refresh |
| Long-context injection | Small document sets (<10 docs) | Simple implementation | High token cost, diluted attention |
For most enterprise knowledge base scenarios — thousands to hundreds of thousands of documents — RAG is the most practical engineering choice.
28.2 RAG System Architecture
┌────────────────────────────────────────────────────────────────┐
│  Offline Indexing Pipeline                                     │
│                                                                │
│  Raw Docs → Load     → Chunk     → Embed   → VectorDB Store    │
│  (PDF/MD)   (Loader)   (Chunker)   (Model)   (Qdrant/etc.)     │
└────────────────────────────────────────────────────────────────┘
                                 │
                          Vector Database
                                 │
┌────────────────────────────────────────────────────────────────┐
│  Online Query Pipeline                                         │
│                                                                │
│  Query → Rewrite    → Embed → Retrieve → Rerank      → Context │
│          (optional)           (Top-K)    (Cross-enc)   Build   │
│                                │                               │
│                           Claude API                           │
│                                │                               │
│                          Final Answer                          │
└────────────────────────────────────────────────────────────────┘
28.3 Chunking Strategies
Chunking is the most overlooked yet highest-impact component of RAG. Poor chunking causes truncated semantics, redundant overlaps, and retrieval failures regardless of how good the embedding model is.
Strategy 1: Fixed-Size Chunking (Baseline)
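def fixed_size_chunk(text: str, chunk_size: int = 512,
                     overlap: int = 64) -> list[str]:
    """Simple sliding-window chunking (sizes are in characters, not tokens)"""
    if not 0 <= overlap < chunk_size:
        # Otherwise the window never advances and the loop below cannot terminate
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks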
Fast and deterministic, but ignores natural document structure.
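To make the overlap concrete, the sketch below re-implements the same sliding window under a hypothetical name (`sliding_window`, not part of the chapter's code) and checks that neighbouring chunks share exactly `overlap` characters. The sample text and sizes are illustrative assumptions.

```python
import math


def sliding_window(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Hypothetical helper mirroring the fixed-size strategy above."""
    step = chunk_size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than chunk_size")
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


# 1200 characters of filler text (illustrative only)
text = "".join(chr(ord("a") + i % 26) for i in range(1200))
chunks = sliding_window(text, chunk_size=512, overlap=64)

# The window advances by chunk_size - overlap = 448 characters per step,
# so the chunk count is ceil(len(text) / 448).
expected_count = math.ceil(len(text) / (512 - 64))
print(len(chunks), expected_count)

# The tail of one chunk is repeated at the head of the next.
print(chunks[0][-64:] == chunks[1][:64])
```

A larger overlap reduces the chance of cutting a sentence off from its context, but it also stores and embeds more duplicated text.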
Strategy 2: Recursive Chunking (Recommended Default)
def recursive_chunk(text: str, chunk_size: int = 1000,
                    overlap: int = 100,
                    separators: list[str] | None = None) -> list[str]:
    """
    Split on natural boundaries in priority order:
    paragraph > newline > sentence > word > character
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    if separators is None:
        separators = ["\n\n", "\n", ". ", "! ", "? ", " ", ""]

    def _split(text: str, seps: list[str]) -> list[str]:
        if not seps or len(text) <= chunk_size:
            return [text] if text.strip() else []
        sep, rest_seps = seps[0], seps[1:]
        if sep == "":
            # Last resort: character-level sliding window
            # (this is the only level at which overlap is applied)
            return [text[i:i+chunk_size]
                    for i in range(0, len(text), chunk_size - overlap)]
        parts = text.split(sep)
        chunks, current = [], ""
        for part in parts:
            # Greedily pack parts into the current chunk while they fit
            candidate = (current + sep + part) if current else part
            if len(candidate) <= chunk_size:
                current = candidate
            else:
                if current:
                    chunks.append(current)
                current = part if len(part) <= chunk_size else ""
                if len(part) > chunk_size:
                    # Oversized part: recurse with the next, finer separator
                    chunks.extend(_split(part, rest_seps))
        if current:
            chunks.append(current)
        return [c.strip() for c in chunks if c.strip()]

    return _split(text, separators)
Strategy 3: Semantic Chunking (Highest Quality)
import re

import numpy as np
from sentence_transformers import SentenceTransformer

def semantic_chunk(text: str, model_name: str = "BAAI/bge-m3",
                   breakpoint_threshold: float = 0.7,
                   min_size: int = 200, max_size: int = 2000) -> list[str]:
    """Split at semantic boundaries detected via embedding similarity"""
    # Loading the model is slow; in production, load it once and reuse it across calls
    model = SentenceTransformer(model_name)
    # Sentence split on ASCII and full-width CJK end punctuation
    sentences = re.split(r'(?<=[.!?。！？])\s+', text)
    sentences = [s.strip() for s in sentences if s.strip()]
    if len(sentences) <= 1:
        return [text]
    embeddings = model.encode(sentences, batch_size=32)
    # Cosine similarity between each pair of adjacent sentences
    similarities = [
        float(np.dot(embeddings[i], embeddings[i+1]) /
              (np.linalg.norm(embeddings[i]) * np.linalg.norm(embeddings[i+1])))
        for i in range(len(embeddings) - 1)
    ]
    chunks, current, cur_size = [], [sentences[0]], len(sentences[0])
    for sentence, sim in zip(sentences[1:], similarities):
        # A topic shift only triggers a break once the chunk has reached min_size;
        # exceeding max_size always triggers one, so max_size acts as a cap
        topic_shift = sim < breakpoint_threshold and cur_size >= min_size
        too_big = cur_size + len(sentence) > max_size
        if topic_shift or too_big:
            chunks.append(" ".join(current))
            current, cur_size = [sentence], len(sentence)
        else:
            current.append(sentence)
            cur_size += len(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks
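The breakpoint rule can be checked without downloading a model. The sketch below uses hand-made two-dimensional vectors as stand-ins for sentence embeddings (an assumption for illustration; real embeddings have hundreds of dimensions) and a hypothetical helper `find_breakpoints` that reports where adjacent similarity drops below the threshold.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def find_breakpoints(embeddings: list[list[float]],
                     threshold: float = 0.7) -> list[int]:
    """Return each index i where sentence i starts a new topic,
    i.e. similarity(sentence i-1, sentence i) < threshold."""
    return [
        i + 1
        for i in range(len(embeddings) - 1)
        if cosine(embeddings[i], embeddings[i + 1]) < threshold
    ]


# Toy "embeddings": sentences 0-1 point one way, sentences 2-3 another.
toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
breaks = find_breakpoints(toy)
print(breaks)  # [2]: the topic changes at sentence 2
```

In practice the threshold usually needs tuning per embedding model, since different models produce different similarity ranges for related sentences.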
Strategy 4: Markdown-Aware Chunking
import re

def markdown_chunk(text: str, max_size: int = 1500) -> list[dict]:
    """Structure-aware chunking that preserves heading hierarchy as metadata"""
    # Note: max_size is not enforced in this version; a section longer than
    # max_size can be split further with recursive_chunk.
    chunks = []
    # The capturing group keeps each heading line in the split output
    sections = re.split(r'^(#{1,4}\s+.+)$', text, flags=re.MULTILINE)
    header_stack = []  # (level, title) pairs for the current heading path
    current_content = []
    for item in sections:
        header_match = re.match(r'^(#{1,4})\s+(.+)$', item)
        if header_match:
            # Flush the previous section under its own breadcrumb
            if current_content:
                content = "\n".join(current_content).strip()
                if content:
                    chunks.append({
                        "content": content,
                        "breadcrumb": " > ".join(h[1] for h in header_stack),
                    })
            level = len(header_match.group(1))
            title = header_match.group(2)
            # Drop headings at the same or a deeper level before pushing the new one
            header_stack = [(lvl, t) for lvl, t in header_stack if lvl < level]
            header_stack.append((level, title))
            current_content = [item]
        else:
            current_content.append(item)
    if current_content:
        content = "\n".join(current_content).strip()
        if content:
            chunks.append({
                "content": content,
                "breadcrumb": " > ".join(h[1] for h in header_stack),
            })
    return chunks
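The heading stack is the part of this strategy that is easiest to get wrong, so it is worth tracing on its own. The following sketch isolates that logic in a hypothetical helper, `breadcrumbs`, which returns the breadcrumb produced at each heading of a small sample document (both the helper and the sample are illustrative assumptions).

```python
import re


def breadcrumbs(markdown: str) -> list[str]:
    """Return the heading path that is active at each heading line."""
    stack: list[tuple[int, str]] = []
    paths: list[str] = []
    for line in markdown.splitlines():
        match = re.match(r"^(#{1,4})\s+(.+)$", line)
        if not match:
            continue
        level = len(match.group(1))
        # Pop headings at the same or a deeper level, then push the new one
        stack = [(lvl, title) for lvl, title in stack if lvl < level]
        stack.append((level, match.group(2).strip()))
        paths.append(" > ".join(title for _, title in stack))
    return paths


sample = "# Guide\nintro\n## Install\n### Linux\napt ...\n## Usage\nrun it\n"
paths = breadcrumbs(sample)
for path in paths:
    print(path)
# Guide
# Guide > Install
# Guide > Install > Linux
# Guide > Usage
```

Storing the breadcrumb alongside each chunk lets the retrieval layer filter or display results by section, and gives the model a hint about where an excerpt came from.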
28.4 Embedding Models
| Model | Dimensions | Languages | Quality | Use Case |
|---|---|---|---|---|
| BAAI/bge-m3 | 1024 | Multilingual (incl. EN + ZH) | High | General bilingual |
| text-embedding-3-large | 3072 | Multilingual | Highest | High-quality production |
| text-embedding-3-small | 1536 | Multilingual | Medium | Cost-sensitive |
| nomic-embed-text | 768 | English | High | Open weights, self-hosted |
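Dimension count also drives index size. As a rough estimate, assuming uncompressed float32 vectors (4 bytes per value) and ignoring index overhead and payloads, raw vector storage is chunks × dimensions × 4 bytes. The helper name `raw_vector_bytes` and the 100,000-chunk corpus below are illustrative assumptions.

```python
def raw_vector_bytes(num_chunks: int, dims: int, bytes_per_value: int = 4) -> int:
    """Raw storage for dense vectors, excluding index structures and payloads."""
    return num_chunks * dims * bytes_per_value


models = [
    ("BAAI/bge-m3", 1024),
    ("text-embedding-3-large", 3072),
    ("text-embedding-3-small", 1536),
    ("nomic-embed-text", 768),
]
sizes_mb = {name: raw_vector_bytes(100_000, dims) / 1_000_000 for name, dims in models}
for name, mb in sizes_mb.items():
    print(f"{name}: {mb:.1f} MB")
# BAAI/bge-m3: 409.6 MB
# text-embedding-3-large: 1228.8 MB
# text-embedding-3-small: 614.4 MB
# nomic-embed-text: 307.2 MB
```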
from openai import OpenAI

def batch_embed(texts: list[str],
                model: str = "text-embedding-3-small") -> list[list[float]]:
    client = OpenAI()
    results = []
    # Send inputs in batches of 100 to keep each request small
    for i in range(0, len(texts), 100):
        batch = texts[i:i+100]
        resp = client.embeddings.create(model=model, input=batch)
        results.extend([r.embedding for r in resp.data])
    return results
28.5 Retrieval and Reranking
Vector Retrieval
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
import uuid

class RAGVectorStore:
    def __init__(self, collection: str = "knowledge_base", dim: int = 1024):
        self.client = QdrantClient(host="localhost", port=6333)
        self.collection = collection
        # upsert fails on a missing collection, so create it on first use.
        # dim must match the embedding model (1024 for BAAI/bge-m3).
        if not self.client.collection_exists(collection):
            self.client.create_collection(
                collection_name=collection,
                vectors_config=VectorParams(size=dim, distance=Distance.COSINE),
            )

    def index(self, chunks: list[dict | str], embeddings: list[list[float]]):
        points = []
        for c, emb in zip(chunks, embeddings):
            # Accept either plain strings or dicts with content/source/breadcrumb
            meta = c if isinstance(c, dict) else {"content": c}
            points.append(PointStruct(
                id=str(uuid.uuid4()),
                vector=emb,
                payload={"content": meta.get("content", ""),
                         "source": meta.get("source", ""),
                         "breadcrumb": meta.get("breadcrumb", "")},
            ))
        # Upsert in batches of 100 points
        for i in range(0, len(points), 100):
            self.client.upsert(collection_name=self.collection,
                               points=points[i:i+100])

    def retrieve(self, query_embedding: list[float],
                 top_k: int = 20) -> list[dict]:
        # query_points replaces the deprecated search() in recent qdrant-client releases
        response = self.client.query_points(
            collection_name=self.collection,
            query=query_embedding,
            limit=top_k,
            with_payload=True
        )
        return [
            {"id": str(r.id), "content": r.payload["content"],
             "source": r.payload.get("source", ""),
             "breadcrumb": r.payload.get("breadcrumb", ""),
             "score": r.score}
            for r in response.points
        ]
Cross-Encoder Reranking
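Vector search optimizes recall; cross-encoder reranking optimizes precision. The cross-encoder jointly encodes the query and each candidate, which typically yields more accurate relevance scores than comparing independently computed embeddings, at the cost of one model forward pass per candidate: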
from sentence_transformers import CrossEncoder

class Reranker:
    def __init__(self, model: str = "BAAI/bge-reranker-v2-m3"):
        self.model = CrossEncoder(model)

    def rerank(self, query: str, candidates: list[dict],
               top_k: int = 5) -> list[dict]:
        if not candidates:
            return []
        # Score every (query, candidate) pair jointly, then keep the best top_k
        scores = self.model.predict([(query, c["content"]) for c in candidates])
        ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
        return [{**c, "rerank_score": float(s)} for c, s in ranked[:top_k]]
Hybrid Search with BM25
Combine dense vector retrieval with sparse keyword matching for robust coverage:
from rank_bm25 import BM25Okapi

class HybridRetriever:
    def __init__(self, corpus: list[str]):
        self.corpus = corpus
        # Simple whitespace tokenization (replace with language-appropriate tokenizer)
        self.bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

    def bm25_retrieve(self, query: str, top_k: int = 20) -> list[tuple[int, float]]:
        scores = self.bm25.get_scores(query.lower().split())
        top = sorted(enumerate(scores), key=lambda x: x[1], reverse=True)[:top_k]
        return top  # (index, score) pairs

    @staticmethod
    def rrf_merge(vector_results: list[dict],
                  bm25_results: list[tuple[int, float]],
                  all_chunks: list[dict],
                  alpha: float = 0.6,
                  k: int = 60,
                  top_n: int = 10) -> list[dict]:
        """Weighted Reciprocal Rank Fusion: alpha weights dense ranks, (1 - alpha) BM25 ranks"""
        scores: dict[str, float] = {}
        id_to_chunk: dict[str, dict] = {r["id"]: r for r in vector_results}
        for rank, result in enumerate(vector_results):
            scores[result["id"]] = scores.get(result["id"], 0) + alpha / (k + rank + 1)
        for rank, (idx, _) in enumerate(bm25_results):
            if idx < len(all_chunks):
                # all_chunks must carry the same "id" values as the vector store,
                # otherwise the two result lists cannot be matched up
                cid = all_chunks[idx].get("id", str(idx))
                scores[cid] = scores.get(cid, 0) + (1 - alpha) / (k + rank + 1)
                # Register BM25-only hits so they are not dropped from the output
                id_to_chunk.setdefault(cid, {**all_chunks[idx], "id": cid})
        sorted_ids = sorted(scores, key=lambda x: scores[x], reverse=True)[:top_n]
        return [{**id_to_chunk[i], "hybrid_score": scores[i]} for i in sorted_ids]
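A small worked example shows what rank fusion does. The sketch below is an unweighted RRF (both lists get the same weight) over two hypothetical ranked ID lists; the function name `rrf_scores` and the sample IDs are assumptions for illustration. Each document scores 1 / (k + rank) per list it appears in, with ranks starting at 1.

```python
def rrf_scores(rankings: list[list[str]], k: int = 60) -> dict[str, float]:
    """Unweighted Reciprocal Rank Fusion over several ranked ID lists."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return scores


dense = ["a", "b", "c"]   # hypothetical vector-search order
sparse = ["c", "a", "d"]  # hypothetical BM25 order

fused = rrf_scores([dense, sparse])
order = sorted(fused, key=fused.get, reverse=True)
print(order)  # ['a', 'c', 'b', 'd']
```

Documents found by both retrievers ("a" and "c") rise to the top, while "d", which only BM25 found, still appears in the fused list. Because RRF uses ranks rather than raw scores, it avoids having to normalize cosine similarities against BM25 scores.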
28.6 Query Optimization
Multi-Query Rewriting
import anthropic

def rewrite_queries(client: anthropic.Anthropic, query: str,
                    history: list[dict] | None = None) -> list[str]:
    """Generate multiple retrieval-optimized query variants"""
    ctx = ""
    if history:
        # Use the last four turns, truncated, to resolve references in the query
        ctx = "\n".join(f"{m['role']}: {m['content'][:200]}" for m in history[-4:])
    resp = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=200,
        messages=[{"role": "user",
                   "content": f"""Given this conversation context (if any) and user query,
generate 3 retrieval-optimized variants. One per line, no numbering.
Context: {ctx}
Query: {query}
Requirements:
- Resolve pronouns to explicit nouns
- Expand abbreviations
- Create semantically diverse variants"""}]
    )
    variants = [l.strip() for l in resp.content[0].text.strip().split("\n") if l.strip()]
    # Always search with the original query as well as the variants
    return [query] + variants[:3]
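Model output does not always follow the "one per line, no numbering" instruction exactly, so it can help to normalize the lines before using them as queries. The sketch below is one possible post-processing step, not part of the chapter's pipeline; `parse_variants` and the sample output are hypothetical.

```python
import re

# Leading bullet ("-", "*", "•") or numbering ("1.", "2)") followed by whitespace
_LIST_PREFIX = re.compile(r"^\s*(?:[-*•]|\d+[.)])\s+")


def parse_variants(raw: str, original: str, limit: int = 3) -> list[str]:
    """Strip list markers, drop blanks and case-insensitive duplicates,
    and return the original query followed by at most `limit` variants."""
    seen = {original.strip().lower()}
    queries = [original]
    for line in raw.splitlines():
        cleaned = _LIST_PREFIX.sub("", line).strip()
        if cleaned and cleaned.lower() not in seen:
            seen.add(cleaned.lower())
            queries.append(cleaned)
    return queries[:limit + 1]


raw_output = (
    "1. How do I request a refund?\n"
    "- refund policy time limit\n"
    "\n"
    "How do I request a refund?"
)
result = parse_variants(raw_output, "What is your refund process?")
print(result)
```

Dropping duplicates here avoids embedding and searching the same query twice.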
28.7 Complete RAG Pipeline
import anthropic
from sentence_transformers import SentenceTransformer

# RAGVectorStore, Reranker, and rewrite_queries are the definitions from sections 28.5 and 28.6

class ClaudeRAGPipeline:
    def __init__(self, collection: str = "knowledge_base"):
        self.claude = anthropic.Anthropic()
        self.embedder = SentenceTransformer("BAAI/bge-m3")
        self.store = RAGVectorStore(collection)
        self.reranker = Reranker()

    def _format_context(self, chunks: list[dict]) -> str:
        parts = []
        for i, c in enumerate(chunks, 1):
            # "source > breadcrumb", with stray separators removed when either is empty
            loc = f"{c.get('source', '')} > {c.get('breadcrumb', '')}".strip(" >")
            parts.append(f"[Document {i}] Source: {loc}\n{c['content']}")
        return "\n\n---\n\n".join(parts)

    def query(self, question: str, history: list[dict] | None = None,
              top_k_retrieve: int = 20, top_k_rerank: int = 5) -> str:
        # Step 1: Multi-query rewriting
        queries = rewrite_queries(self.claude, question, history)
        # Step 2: Multi-query vector retrieval with deduplication
        # Split the retrieval budget across queries, fetching at least one hit each
        per_query = max(1, top_k_retrieve // len(queries))
        seen, all_results = set(), []
        for q in queries:
            emb = self.embedder.encode(q).tolist()
            for r in self.store.retrieve(emb, top_k=per_query):
                if r["id"] not in seen:
                    all_results.append(r)
                    seen.add(r["id"])
        # Step 3: Rerank against the original question
        reranked = self.reranker.rerank(question, all_results, top_k=top_k_rerank)
        # Step 4: Build context and call Claude
        context = self._format_context(reranked)
        messages = list(history or [])
        messages.append({
            "role": "user",
            "content": f"<retrieved_documents>\n{context}\n</retrieved_documents>\n\n{question}"
        })
        response = self.claude.messages.create(
            model="claude-opus-4-5",
            max_tokens=2048,
            system="""You are a knowledgeable assistant. Answer questions based solely
on the provided document excerpts.
Rules:
- Only use information from the provided documents
- If the documents don't contain the answer, say so explicitly
- Cite the source document when making specific claims (e.g., "According to Document 2...")
- Never fabricate information not present in the documents""",
            messages=messages
        )
        return response.content[0].text
# Usage
pipeline = ClaudeRAGPipeline("company_docs")

# Offline indexing (run once)
docs = [
    {"content": "Our refund policy allows returns within 30 days...",
     "source": "policy.md", "breadcrumb": "Refund Policy"},
    {"content": "Product features include real-time sync...",
     "source": "features.md", "breadcrumb": "Core Features"}
]
embeddings = [pipeline.embedder.encode(d["content"]).tolist() for d in docs]
pipeline.store.index(docs, embeddings)

# Online querying
answer = pipeline.query("What is your refund process?")
print(answer)
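The pipeline above passes every reranked chunk to the model. When chunks are large, it may be worth capping the context before formatting it. The sketch below is one simple way to do that, assuming a character budget as a rough proxy for tokens; `fit_to_budget` and the 8,000-character default are hypothetical choices, not part of the pipeline.

```python
def fit_to_budget(chunks: list[dict], max_chars: int = 8000) -> list[dict]:
    """Keep chunks in ranked order until the character budget would be exceeded."""
    kept: list[dict] = []
    used = 0
    for chunk in chunks:
        size = len(chunk["content"])
        if used + size > max_chars:
            # Stop at the first chunk that does not fit, so that a lower-ranked
            # chunk is never included ahead of a higher-ranked one
            break
        kept.append(chunk)
        used += size
    return kept


ranked = [
    {"id": "1", "content": "x" * 5000},
    {"id": "2", "content": "y" * 4000},
    {"id": "3", "content": "z" * 1000},
]
kept = fit_to_budget(ranked)
print([c["id"] for c in kept])  # ['1']
```

Stopping at the first overflow keeps the selection faithful to the reranker's order; skipping the oversized chunk and continuing would fill the budget more completely at the cost of that guarantee.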
Summary
RAG is the most practical architecture for grounding Claude in external knowledge. Answer quality depends on each stage of the pipeline:
- Chunking: Recursive chunking is the reliable default; semantic chunking maximizes quality; structure-aware chunking preserves document hierarchy as searchable metadata
- Embedding: BAAI/bge-m3 for bilingual EN/ZH; text-embedding-3-large for highest English quality
- Retrieval: Combine dense vector search with BM25 sparse retrieval via RRF fusion for higher recall
- Reranking: CrossEncoder rerankers improve precision over pure vector similarity
- Query optimization: Multi-query rewriting with pronoun resolution increases retrieval coverage
- Context injection: Structured format with source attribution enables Claude to generate grounded, citable answers
The next chapter enters Part 6, examining the Managed Agents ecosystem on Claude.ai — Projects, Artifacts, and Agent lifecycle management.