Chapter 28: RAG Architecture with Claude: Engineering Practice of Retrieval-Augmented Generation

28.1 The Core Problem RAG Solves

Large language models face a basic tension: their knowledge is frozen at training time and their context windows have hard token limits, yet the knowledge bases that organizations need to query keep growing.

Retrieval-Augmented Generation (RAG) resolves this tension by dynamically retrieving the most relevant document fragments at query time and injecting only those into Claude's context. The model never needs to "know" your entire knowledge base — it only needs to reason over the relevant portions it receives.
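
The retrieve-then-inject loop can be illustrated with a minimal sketch. It uses a bag-of-words cosine similarity as a toy stand-in for a real embedding model, and the helper names (bow, cosine, retrieve, build_prompt) and the three-sentence knowledge base are hypothetical, chosen only for illustration.

```python
import math
import re
from collections import Counter


def bow(text: str) -> Counter:
    """Lowercased bag-of-words vector (toy stand-in for an embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
        math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, fragments: list[str], top_k: int = 2) -> list[str]:
    """Rank fragments by similarity to the query and keep the best top_k."""
    q = bow(query)
    ranked = sorted(fragments, key=lambda f: cosine(q, bow(f)), reverse=True)
    return ranked[:top_k]


def build_prompt(query: str, fragments: list[str]) -> str:
    """Inject only the retrieved fragments, not the whole knowledge base."""
    context = "\n\n".join(fragments)
    return f"<retrieved_documents>\n{context}\n</retrieved_documents>\n\n{query}"


knowledge_base = [
    "Refunds are accepted within 30 days of purchase.",
    "The mobile app supports real-time sync across devices.",
    "Support is available by email on weekdays.",
]
question = "How many days do I have to request a refund?"
top = retrieve(question, knowledge_base, top_k=1)
prompt = build_prompt(question, top)
```

Only the refund sentence reaches the prompt; the other two fragments never consume context tokens.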

RAG vs Fine-Tuning vs Long Context

| Approach | Best For | Strengths | Weaknesses |
|---|---|---|---|
| RAG | Frequently updated knowledge, factual Q&A | Real-time updates, explainable, cost-efficient | Quality depends on retrieval engineering |
| Fine-tuning | Fixed style, format, or domain behavior | Knowledge "internalized" in model weights | High update cost, no real-time refresh |
| Long-context injection | Small document sets (<10 docs) | Simple implementation | High token cost, diluted attention |

For most enterprise knowledge base scenarios — thousands to hundreds of thousands of documents — RAG is the most practical engineering choice.
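
A back-of-the-envelope check makes the trade-off concrete. The sketch below assumes a 200,000-token context window, 8,000 tokens reserved for the system prompt and answer, and an average of 1,500 tokens per document; all three figures are illustrative assumptions, and fits_in_context is a hypothetical helper.

```python
def fits_in_context(num_docs: int, avg_tokens_per_doc: int,
                    context_window: int = 200_000,  # assumed window size
                    reserved: int = 8_000) -> bool:  # assumed prompt + answer budget
    """Return True if the whole corpus could be injected into a single request."""
    needed = num_docs * avg_tokens_per_doc
    return needed <= context_window - reserved


small_corpus = fits_in_context(num_docs=10, avg_tokens_per_doc=1_500)
large_corpus = fits_in_context(num_docs=5_000, avg_tokens_per_doc=1_500)
```

Under these assumptions ten documents need 15,000 tokens and fit easily, while 5,000 documents need 7.5 million tokens, which is where retrieval becomes necessary.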

28.2 RAG System Architecture

┌────────────────────────────────────────────────────────┐
│                    Offline Indexing Pipeline            │
│                                                        │
│  Raw Docs → Load → Chunk → Embed → VectorDB Store      │
│  (PDF/MD)  (Loader) (Chunker) (Model) (Qdrant/etc.)    │
└────────────────────────────────────────────────────────┘
                          │
                     Vector Database
                          │
┌────────────────────────────────────────────────────────┐
│                    Online Query Pipeline                 │
│                                                        │
│  Query → Rewrite → Embed → Retrieve → Rerank → Context │
│          (optional)        (Top-K)   (Cross-enc) Build  │
│                                            │            │
│                                       Claude API        │
│                                            │            │
│                                       Final Answer      │
└────────────────────────────────────────────────────────┘

28.3 Chunking Strategies

Chunking is the most overlooked yet highest-impact component of RAG. Poor chunking causes truncated semantics, redundant overlaps, and retrieval failures regardless of how good the embedding model is.

Strategy 1: Fixed-Size Chunking (Baseline)

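def fixed_size_chunk(text: str, chunk_size: int = 512,
                     overlap: int = 64) -> list[str]:
    """Simple sliding-window chunking (sizes are in characters, not tokens)"""
    if not 0 <= overlap < chunk_size:
        # overlap >= chunk_size would make the stride zero or negative (infinite loop)
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # final window reached the end; skip a redundant tail chunk
        start += chunk_size - overlap  # stride between window starts
    return chunks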

Fast and deterministic, but ignores natural document structure.
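
To see how the window moves, the following sketch computes only the (start, end) offsets of each chunk. chunk_spans is a hypothetical helper written for this illustration; it also rejects an overlap that is not smaller than the chunk size, since that would stall the window.

```python
def chunk_spans(text_len: int, chunk_size: int = 512,
                overlap: int = 64) -> list[tuple[int, int]]:
    """Return the (start, end) character offsets a sliding window would produce."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    stride = chunk_size - overlap  # distance between consecutive window starts
    spans = []
    start = 0
    while start < text_len:
        end = min(start + chunk_size, text_len)
        spans.append((start, end))
        if end == text_len:
            break  # the window already covers the end of the text
        start += stride
    return spans


spans = chunk_spans(1200)
```

For a 1,200-character text with the defaults, the stride is 448, so the windows start at 0, 448, and 896, and each shares 64 characters with its predecessor.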

Strategy 2: Recursive Chunking (Reliable Default)

def recursive_chunk(text: str, chunk_size: int = 1000,
                    overlap: int = 100,
                    separators: list[str] | None = None) -> list[str]:
    """
    Split on natural boundaries in priority order:
    paragraph > newline > sentence > word > character
    """
    if separators is None:
        separators = ["\n\n", "\n", ". ", "! ", "? ", " ", ""]

    def _split(text: str, seps: list[str]) -> list[str]:
        if not seps or len(text) <= chunk_size:
            return [text] if text.strip() else []

        sep, rest_seps = seps[0], seps[1:]
        if sep == "":
            return [text[i:i+chunk_size]
                    for i in range(0, len(text), chunk_size - overlap)]

        parts = text.split(sep)
        chunks, current = [], ""

        for part in parts:
            candidate = (current + sep + part) if current else part
            if len(candidate) <= chunk_size:
                current = candidate
            else:
                if current:
                    chunks.append(current)
                current = part if len(part) <= chunk_size else ""
                if len(part) > chunk_size:
                    chunks.extend(_split(part, rest_seps))

        if current:
            chunks.append(current)
        return [c.strip() for c in chunks if c.strip()]

    return _split(text, separators)
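
Note that recursive_chunk applies overlap only in its character-level fallback; chunks split on paragraph or sentence boundaries share no text. If overlap between neighbouring chunks is wanted, one option is a post-processing pass. The sketch below is one possible approach, and add_overlap is a hypothetical helper, not part of the listing above.

```python
def add_overlap(chunks: list[str], overlap: int = 100) -> list[str]:
    """Prefix each chunk (except the first) with the tail of the previous chunk."""
    if overlap <= 0 or not chunks:
        return list(chunks)
    out = [chunks[0]]
    for prev, cur in zip(chunks, chunks[1:]):
        tail = prev[-overlap:]
        # If the tail starts in the middle of a word, drop that partial word.
        if len(prev) > overlap and not prev[-overlap - 1].isspace():
            cut = tail.find(" ")
            if cut != -1:
                tail = tail[cut + 1:]
        out.append(f"{tail} {cur}")
    return out


overlapped = add_overlap(["alpha beta gamma delta", "epsilon zeta"], overlap=10)
```

The resulting chunks can exceed chunk_size by up to overlap characters, so the size limit should leave that headroom.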

Strategy 3: Semantic Chunking (Highest Quality)

import numpy as np
from sentence_transformers import SentenceTransformer
import re

def semantic_chunk(text: str, model_name: str = "BAAI/bge-m3",
                   breakpoint_threshold: float = 0.7,
                   min_size: int = 200, max_size: int = 2000) -> list[str]:
    """Split at semantic boundaries detected via embedding similarity"""
    model = SentenceTransformer(model_name)

    sentences = re.split(r'(?<=[.!?。！？])\s+', text)
    sentences = [s.strip() for s in sentences if s.strip()]
    if len(sentences) <= 1:
        return [text]

    embeddings = model.encode(sentences, batch_size=32)
    similarities = [
        float(np.dot(embeddings[i], embeddings[i+1]) /
              (np.linalg.norm(embeddings[i]) * np.linalg.norm(embeddings[i+1])))
        for i in range(len(embeddings) - 1)
    ]

    chunks, current, cur_size = [], [sentences[0]], len(sentences[0])
    for sentence, sim in zip(sentences[1:], similarities):
        break_here = (sim < breakpoint_threshold) or (cur_size + len(sentence) > max_size)
        if break_here and cur_size >= min_size:
            chunks.append(" ".join(current))
            current, cur_size = [sentence], len(sentence)
        else:
            current.append(sentence)
            cur_size += len(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks
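
The breakpoint rule at the heart of this strategy can be isolated in a few lines: a new chunk starts wherever the cosine similarity between adjacent sentence embeddings falls below the threshold. The sketch below uses hand-written 2-D vectors in place of real embeddings, and find_breakpoints is a hypothetical helper.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def find_breakpoints(embeddings: list[list[float]],
                     threshold: float = 0.7) -> list[int]:
    """Return each index i where sentence i should start a new chunk."""
    return [
        i + 1
        for i in range(len(embeddings) - 1)
        if cosine(embeddings[i], embeddings[i + 1]) < threshold
    ]


# Toy "embeddings": sentences 0-1 point one way, sentences 2-3 another.
toy = [[1.0, 0.0], [0.9, 0.1], [0.1, 0.9], [0.0, 1.0]]
breaks = find_breakpoints(toy)
```

Here the only similarity below 0.7 is between sentences 1 and 2, so the text splits into two chunks at index 2. The size limits in semantic_chunk then adjust these raw breakpoints.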

Strategy 4: Markdown-Aware Chunking

def markdown_chunk(text: str, max_size: int = 1500) -> list[dict]:
    """Structure-aware chunking that preserves heading hierarchy as metadata.

    Note: max_size is accepted but not enforced here; each section is emitted whole.
    """
    chunks = []
    sections = re.split(r'^(#{1,4}\s+.+)$', text, flags=re.MULTILINE)

    header_stack = []
    current_content = []

    for item in sections:
        header_match = re.match(r'^(#{1,4})\s+(.+)$', item)
        if header_match:
            if current_content:
                content = "\n".join(current_content).strip()
                if content:
                    chunks.append({
                        "content": content,
                        "breadcrumb": " > ".join(h[1] for h in header_stack),
                    })
            level = len(header_match.group(1))
            title = header_match.group(2)
            header_stack = [(l, t) for l, t in header_stack if l < level]
            header_stack.append((level, title))
            current_content = [item]
        else:
            current_content.append(item)

    if current_content:
        content = "\n".join(current_content).strip()
        if content:
            chunks.append({
                "content": content,
                "breadcrumb": " > ".join(h[1] for h in header_stack),
            })
    return chunks
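
markdown_chunk accepts max_size but never applies it, so a long section becomes one oversized chunk. One possible way to enforce the limit afterwards is to split oversized sections on blank lines while copying the breadcrumb to every piece. This is a sketch under that assumption, and enforce_max_size is a hypothetical helper.

```python
def enforce_max_size(chunks: list[dict], max_size: int = 1500) -> list[dict]:
    """Split oversized chunks on paragraph breaks, keeping their metadata."""
    out = []
    for chunk in chunks:
        if len(chunk["content"]) <= max_size:
            out.append(chunk)
            continue
        current = ""
        for para in chunk["content"].split("\n\n"):
            candidate = f"{current}\n\n{para}" if current else para
            # A single paragraph longer than max_size is kept whole, not cut.
            if len(candidate) <= max_size or not current:
                current = candidate
            else:
                out.append({**chunk, "content": current})
                current = para
        if current:
            out.append({**chunk, "content": current})
    return out


section = {"content": "aaaa\n\nbbbb\n\ncccc", "breadcrumb": "Guide > Setup"}
pieces = enforce_max_size([section], max_size=10)
```

Every piece keeps the "Guide > Setup" breadcrumb, so the heading context survives the split.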

28.4 Embedding Models

| Model | Dimensions | Languages | Quality | Use Case |
|---|---|---|---|---|
| BAAI/bge-m3 | 1024 | Multilingual (incl. EN + ZH) | High | General bilingual |
| text-embedding-3-large | 3072 | Multilingual | Highest | High-quality production |
| text-embedding-3-small | 1536 | Multilingual | Medium | Cost-sensitive |
| nomic-embed-text | 768 | English | High | Open weights, self-hosted |

from openai import OpenAI

def batch_embed(texts: list[str],
                model: str = "text-embedding-3-small") -> list[list[float]]:
    client = OpenAI()
    results = []
    for i in range(0, len(texts), 100):
        batch = texts[i:i+100]
        resp = client.embeddings.create(model=model, input=batch)
        results.extend([r.embedding for r in resp.data])
    return results
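
Re-embedding unchanged chunks on every indexing run wastes API calls. A content-hash cache is one way to avoid that. In the sketch below, embed_with_cache is a hypothetical helper, embed_fn stands for any batch embedding function (such as batch_embed above), and the cache is a plain dict; a real deployment would likely persist it and include the model name in the key.

```python
import hashlib
from typing import Callable


def embed_with_cache(texts: list[str],
                     embed_fn: Callable[[list[str]], list[list[float]]],
                     cache: dict[str, list[float]]) -> list[list[float]]:
    """Embed only texts whose SHA-256 digest is not already in the cache."""
    keys = [hashlib.sha256(t.encode("utf-8")).hexdigest() for t in texts]
    missing: dict[str, str] = {}
    for key, text in zip(keys, texts):
        if key not in cache and key not in missing:
            missing[key] = text
    if missing:
        vectors = embed_fn(list(missing.values()))
        cache.update(zip(missing.keys(), vectors))
    return [cache[key] for key in keys]


# Demo with a fake embedder that records the batches it receives.
calls: list[list[str]] = []


def fake_embed(batch: list[str]) -> list[list[float]]:
    calls.append(list(batch))
    return [[float(len(t))] for t in batch]


cache: dict[str, list[float]] = {}
out1 = embed_with_cache(["a", "bb", "a"], fake_embed, cache)
out2 = embed_with_cache(["bb", "ccc"], fake_embed, cache)
```

The second call sends only "ccc" to the embedder, because "bb" is served from the cache.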

28.5 Retrieval and Reranking

Vector Retrieval

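from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
import uuid

class RAGVectorStore:
    def __init__(self, collection: str = "knowledge_base", dim: int = 1024):
        self.client = QdrantClient(host="localhost", port=6333)
        self.collection = collection
        # Create the collection on first use. dim must match the embedding
        # model's output size (1024 for BAAI/bge-m3).
        if not self.client.collection_exists(collection):
            self.client.create_collection(
                collection_name=collection,
                vectors_config=VectorParams(size=dim, distance=Distance.COSINE),
            )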

    def index(self, chunks: list[dict], embeddings: list[list[float]]):
        points = [
            PointStruct(
                id=str(uuid.uuid4()),
                vector=emb,
                payload={"content": c.get("content", c) if isinstance(c, dict) else c,
                         "source": c.get("source", "") if isinstance(c, dict) else "",
                         "breadcrumb": c.get("breadcrumb", "") if isinstance(c, dict) else ""}
            )
            for c, emb in zip(chunks, embeddings)
        ]
        for i in range(0, len(points), 100):
            self.client.upsert(collection_name=self.collection,
                               points=points[i:i+100])

    def retrieve(self, query_embedding: list[float],
                 top_k: int = 20) -> list[dict]:
        # Note: recent qdrant-client releases deprecate search() in favor of query_points()
        results = self.client.search(
            collection_name=self.collection,
            query_vector=query_embedding,
            limit=top_k,
            with_payload=True
        )
        return [
            {"id": str(r.id), "content": r.payload["content"],
             "source": r.payload.get("source", ""),
             "breadcrumb": r.payload.get("breadcrumb", ""),
             "score": r.score}
            for r in results
        ]
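
Because index() assigns a fresh uuid4 to every point, running the indexing job twice over the same documents stores every chunk twice. Deriving the ID from the chunk itself makes upserts idempotent. The sketch below shows one way to do this with uuid5; chunk_point_id is a hypothetical helper, and hashing source plus content is an assumed key choice.

```python
import uuid


def chunk_point_id(source: str, content: str) -> str:
    """Deterministic UUID for a chunk: same source and content, same ID."""
    return str(uuid.uuid5(uuid.NAMESPACE_URL, f"{source}\n{content}"))


first = chunk_point_id("policy.md", "Our refund policy allows returns within 30 days...")
again = chunk_point_id("policy.md", "Our refund policy allows returns within 30 days...")
other = chunk_point_id("policy.md", "Product features include real-time sync...")
```

Passing such an ID to PointStruct in place of uuid.uuid4() means an unchanged chunk overwrites its earlier copy, while an edited chunk gets a new ID, so stale points still need separate cleanup.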

Cross-Encoder Reranking

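Vector search optimizes recall; cross-encoder reranking optimizes precision. The cross-encoder encodes the query and each candidate together, so it can model their interaction directly and typically scores relevance more accurately than comparing two independently computed embeddings. The cost is one model forward pass per candidate, which is why it runs only on the top-K retrieved results: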

from sentence_transformers import CrossEncoder

class Reranker:
    def __init__(self, model: str = "BAAI/bge-reranker-v2-m3"):
        self.model = CrossEncoder(model)

    def rerank(self, query: str, candidates: list[dict],
               top_k: int = 5) -> list[dict]:
        if not candidates:
            return []
        scores = self.model.predict([(query, c["content"]) for c in candidates])
        ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
        return [{**c, "rerank_score": float(s)} for c, s in ranked[:top_k]]

Hybrid Search with BM25

Combine dense vector retrieval with sparse keyword matching for robust coverage:

from rank_bm25 import BM25Okapi

class HybridRetriever:
    def __init__(self, corpus: list[str]):
        self.corpus = corpus
        # Simple whitespace tokenization (replace with language-appropriate tokenizer)
        self.bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

    def bm25_retrieve(self, query: str, top_k: int = 20) -> list[tuple[int, float]]:
        scores = self.bm25.get_scores(query.lower().split())
        top = sorted(enumerate(scores), key=lambda x: x[1], reverse=True)[:top_k]
        return top  # (index, score) pairs

    @staticmethod
    def rrf_merge(vector_results: list[dict],
                  bm25_results: list[tuple[int, float]],
                  all_chunks: list[dict],
                  alpha: float = 0.6,
                  k: int = 60,
                  top_n: int = 10) -> list[dict]:
        """Weighted Reciprocal Rank Fusion: alpha weights vector ranks, (1 - alpha) weights BM25 ranks"""
        scores: dict[str, float] = {}
        id_to_chunk: dict[str, dict] = {r["id"]: r for r in vector_results}

        for rank, result in enumerate(vector_results):
            scores[result["id"]] = scores.get(result["id"], 0) + alpha / (k + rank + 1)

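        for rank, (idx, _) in enumerate(bm25_results):
            if idx < len(all_chunks):
                cid = all_chunks[idx].get("id", str(idx))
                scores[cid] = scores.get(cid, 0) + (1 - alpha) / (k + rank + 1)
                # Register BM25-only hits too; otherwise the final filter drops them
                id_to_chunk.setdefault(cid, {**all_chunks[idx], "id": cid})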

        sorted_ids = sorted(scores, key=lambda x: scores[x], reverse=True)[:top_n]
        return [{**id_to_chunk[i], "hybrid_score": scores[i]}
                for i in sorted_ids if i in id_to_chunk]
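
A small worked example shows how the fusion arithmetic behaves. Each document scores weight / (k + rank) in every ranking it appears in, with ranks starting at 1, which matches the alpha / (k + rank + 1) form above for zero-based ranks. The function rrf and the document IDs a to d are hypothetical, for illustration only.

```python
def rrf(rankings: dict[str, list[str]], weights: dict[str, float],
        k: int = 60) -> list[tuple[str, float]]:
    """Weighted Reciprocal Rank Fusion over several ranked ID lists."""
    scores: dict[str, float] = {}
    for name, ranked_ids in rankings.items():
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weights[name] / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


fused = rrf(
    rankings={"vector": ["a", "b", "c"], "bm25": ["c", "a", "d"]},
    weights={"vector": 0.6, "bm25": 0.4},
)
order = [doc_id for doc_id, _ in fused]
```

Document a scores 0.6/61 + 0.4/62 and c scores 0.6/63 + 0.4/61, so both outrank b and d, which appear in only one list. The fused order is a, c, b, d; d, found only by BM25, is still returned.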

28.6 Query Optimization

Multi-Query Rewriting

import anthropic

def rewrite_queries(client: anthropic.Anthropic, query: str,
                    history: list[dict] | None = None) -> list[str]:
    """Generate multiple retrieval-optimized query variants"""
    ctx = ""
    if history:
        ctx = "\n".join(f"{m['role']}: {m['content'][:200]}" for m in history[-4:])

    resp = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=200,
        messages=[{"role": "user",
                   "content": f"""Given this conversation context (if any) and user query,
generate 3 retrieval-optimized variants. One per line, no numbering.

Context: {ctx}

Query: {query}

Requirements:
- Resolve pronouns to explicit nouns
- Expand abbreviations
- Create semantically diverse variants"""}]
    )
    variants = [l.strip() for l in resp.content[0].text.strip().split("\n") if l.strip()]
    return [query] + variants[:3]
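
Model output does not always follow the "one per line, no numbering" instruction exactly, so it can help to normalize the lines before using them. The sketch below strips list markers, removes blanks and duplicates, and caps the count. parse_variants is a hypothetical helper and the sample output is invented for the test.

```python
import re


def parse_variants(raw: str, original: str, limit: int = 3) -> list[str]:
    """Clean model output into unique query variants, original query first."""
    seen = {original.strip().lower()}
    variants = []
    for line in raw.splitlines():
        # Drop leading bullets ("-", "*") or numbering ("1.", "2)").
        cleaned = re.sub(r"^\s*(?:[-*•]\s*|\d+[.)]\s+)", "", line).strip().strip('"')
        key = cleaned.lower()
        if cleaned and key not in seen:
            seen.add(key)
            variants.append(cleaned)
    return [original] + variants[:limit]


raw_output = (
    "1. What is the refund policy?\n"
    "- how do I return an item\n"
    "\n"
    "What is the refund policy?\n"
    "3) refund deadline in days\n"
    "extra variant"
)
queries = parse_variants(raw_output, original="refund?")
```

Deduplication matters here because each variant triggers its own embedding and retrieval call in the pipeline.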

28.7 Complete RAG Pipeline

import anthropic
from sentence_transformers import SentenceTransformer

class ClaudeRAGPipeline:
    def __init__(self, collection: str = "knowledge_base"):
        self.claude = anthropic.Anthropic()
        self.embedder = SentenceTransformer("BAAI/bge-m3")
        self.store = RAGVectorStore(collection)
        self.reranker = Reranker()

    def _format_context(self, chunks: list[dict]) -> str:
        parts = []
        for i, c in enumerate(chunks, 1):
            loc = f"{c.get('source', '')} > {c.get('breadcrumb', '')}".strip(" >")
            parts.append(f"[Document {i}] Source: {loc}\n{c['content']}")
        return "\n\n---\n\n".join(parts)

    def query(self, question: str, history: list[dict] | None = None,
              top_k_retrieve: int = 20, top_k_rerank: int = 5) -> str:
        # Step 1: Multi-query rewriting
        queries = rewrite_queries(self.claude, question, history)

        # Step 2: Multi-query vector retrieval with deduplication
        seen, all_results = set(), []
        for q in queries:
            emb = self.embedder.encode(q).tolist()
            # Split the retrieval budget across variants, with at least one hit each
            for r in self.store.retrieve(emb, top_k=max(1, top_k_retrieve // len(queries))):
                if r["id"] not in seen:
                    all_results.append(r)
                    seen.add(r["id"])

        # Step 3: Rerank
        reranked = self.reranker.rerank(question, all_results, top_k=top_k_rerank)

        # Step 4: Build context and call Claude
        context = self._format_context(reranked)

        messages = list(history or [])
        messages.append({
            "role": "user",
            "content": f"<retrieved_documents>\n{context}\n</retrieved_documents>\n\n{question}"
        })

        response = self.claude.messages.create(
            model="claude-opus-4-5",
            max_tokens=2048,
            system="""You are a knowledgeable assistant. Answer questions based solely
on the provided document excerpts.

Rules:
- Only use information from the provided documents
- If the documents don't contain the answer, say so explicitly
- Cite the source document when making specific claims (e.g., "According to Document 2...")
- Never fabricate information not present in the documents""",
            messages=messages
        )
        return response.content[0].text


# Usage
pipeline = ClaudeRAGPipeline("company_docs")

# Offline indexing (run once)
docs = [
    {"content": "Our refund policy allows returns within 30 days...",
     "source": "policy.md", "breadcrumb": "Refund Policy"},
    {"content": "Product features include real-time sync...",
     "source": "features.md", "breadcrumb": "Core Features"}
]
embeddings = [pipeline.embedder.encode(d["content"]).tolist() for d in docs]
pipeline.store.index(docs, embeddings)

# Online querying
answer = pipeline.query("What is your refund process?")
print(answer)
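
The pipeline passes all reranked chunks to Claude regardless of their combined length. If chunks can be large, a budget check before formatting keeps the context bounded. The sketch below is one simple policy, assuming the chunks arrive sorted by rerank score and measuring size in characters rather than tokens; fit_to_budget and the 12,000-character default are hypothetical.

```python
def fit_to_budget(chunks: list[dict], max_chars: int = 12_000) -> list[dict]:
    """Keep the highest-ranked chunks whose combined content fits the budget."""
    kept, used = [], 0
    for chunk in chunks:  # assumed to be sorted best-first
        size = len(chunk["content"])
        if used + size > max_chars:
            break  # stop at the first chunk that does not fit, preserving rank order
        kept.append(chunk)
        used += size
    return kept


ranked = [
    {"content": "x" * 50, "source": "a.md"},
    {"content": "y" * 40, "source": "b.md"},
    {"content": "z" * 30, "source": "c.md"},
]
selected = fit_to_budget(ranked, max_chars=100)
```

With a 100-character budget the first two chunks (90 characters) are kept and the third is dropped. Stopping at the first overflow favors rank order over packing density; skipping the oversized chunk and continuing is the alternative.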

Summary

RAG is the most practical architecture for grounding Claude in external knowledge. Where the engineering effort pays off, stage by stage:

  1. Chunking: recursive chunking is the reliable default; semantic chunking maximizes quality; structure-aware chunking preserves document hierarchy as searchable metadata
  2. Embedding: BAAI/bge-m3 for bilingual EN/ZH content; text-embedding-3-large when quality matters most
  3. Retrieval: combine dense vector search with BM25 sparse retrieval via RRF so that both semantic matches and exact keyword matches surface
  4. Reranking: cross-encoder rerankers improve precision over pure vector similarity
  5. Query optimization: multi-query rewriting with pronoun resolution increases retrieval coverage
  6. Context injection: a structured format with source attribution lets Claude generate grounded, citable answers

The next chapter enters Part 6, examining the Managed Agents ecosystem on Claude.ai — Projects, Artifacts, and Agent lifecycle management.
