Chapter 52

RAG Pipeline: Retrieval-Augmented Generation

L1: Concept — Why RAG Exists

The Three Fundamental Limitations of LLMs

Large language models are a compressed representation of human knowledge, but this "compression" introduces three fundamental limitations that cannot be solved simply by making models larger.

Limitation 1: Knowledge Cutoff

LLM training data stops at a cutoff date. Claude doesn't know today's news, and GPT-4 doesn't know about API changes released after its training. This isn't a bug; it's an inherent property of any system whose knowledge comes from a fixed training set.

Limitation 2: Private Data Is Invisible

Your company may have tens of thousands of internal documents, customer service records, technical documentation, meeting notes. None of this data has ever appeared in any public training dataset. Even if you use the most powerful LLM, it knows nothing about this data.

Limitation 3: Hallucination

When an LLM is asked about something it's uncertain of, it tends to generate content that sounds plausible but is actually wrong — this is "hallucination." The root cause of hallucination is that an LLM's goal is to generate linguistically coherent, semantically plausible text, not to be strictly consistent with facts.

RAG's Core Idea: Retrieve First, Then Generate

RAG (Retrieval-Augmented Generation) is a strikingly direct idea:

Before asking the LLM a question, first retrieve the most relevant document fragments from a knowledge base, then send those fragments as context along with the question to the LLM, so the LLM can generate an answer based on that evidence.

This simple idea addresses all three limitations:

The knowledge base can be updated at any time (solves the knowledge cutoff problem)
Private data can be added to the knowledge base (solves the private data problem)
The LLM answers based on real evidence, dramatically reducing hallucinations

From an information retrieval perspective, RAG combines a traditional search engine (retrieve relevant documents) with an LLM (understand and generate text). Each part does what it does best.

RAG vs Fine-Tuning: Which Should You Choose?

Beginners are often confused: why use RAG instead of directly fine-tuning the LLM on private data?

Fine-tuning advantages:

The model can learn domain-specific language style and terminology
Faster inference (no retrieval step)
Good for teaching the model specific output formats or behaviors

Fine-tuning limitations:

High cost to update: every knowledge base update requires re-fine-tuning, with enormous time and compute costs
Cannot handle very large document corpora: fine-tuning only changes model "intuitions," it can't make a model "memorize" the specific contents of thousands of documents
Hallucinations persist: fine-tuning cannot guarantee the model answers strictly based on training data
Expensive: high-quality fine-tuning requires substantial GPU compute and professionally annotated data

RAG advantages:

Knowledge base can be updated in real time (add new documents, delete outdated content)
Answers are attributable (you can tell users "this answer comes from document X, page Y")
Controlled cost (only maintain a vector database, no model retraining needed)
More effective against hallucinations (the LLM is explicitly instructed to "answer based on the following")

Conclusion: Most enterprise AI applications should try RAG first. Only consider fine-tuning when RAG's performance is genuinely insufficient and you have adequate high-quality annotated data.

L2: Principles — The Two Stages of the RAG Pipeline

Stage 1: Ingestion Pipeline

The ingestion pipeline transforms raw documents into a searchable vector index. This process has four steps:

Step 1: Load

Load raw documents in various formats:

PDF    → text extraction (preserving chapter structure)
Word   → paragraph extraction
Markdown → use directly
HTML   → strip tags, preserve content
CSV/Excel → convert row contents to text
Database records → format as text

Document loading is a step that appears simple but is actually complex. The difficulties include:

PDFs may be scanned images (require OCR)
Documents may contain tables (how to preserve table semantics?)
Some documents have special sectional structure (like legal contracts, technical manuals)

Step 2: Chunk

Split long documents into smaller chunks. This step is one of the most critical determinants of RAG performance.

Why chunk?

Embedding models have token length limits (commonly 512 to 8,192 tokens; some long-context models such as voyage-3 accept up to 32,000)
Retrieval returns chunks; chunks that are too large introduce irrelevant content, while chunks that are too small lose context
Vector similarity works best on semantically cohesive text chunks

Three main chunking strategies:

Fixed-size chunking: Split into fixed token counts with overlap between adjacent chunks. Pros: simple to implement, good for documents without clear structure. Cons: may cut mid-sentence, disrupting semantic integrity.

Semantic chunking: Split at semantic boundaries (paragraphs, sections, sentences). Pros: preserves complete semantic units. Cons: uneven chunk sizes, complex to implement.

Recursive character chunking: Recursively split using a hierarchy of separators: ["\n\n", "\n", ". ", "! ", "? ", ", ", " ", ""]. Prefers paragraph boundaries, then sentence boundaries, then word boundaries. Balances fixed-size and semantic boundary approaches.
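The fixed-size strategy can be sketched in a few lines. This is a minimal illustration, not the splitter used later in this chapter: the function name `fixedSizeChunks` is hypothetical, and sizes are counted in runes as a stand-in for a real tokenizer.

```go
package main

import "fmt"

// fixedSizeChunks splits text into windows of at most size runes. Each window
// starts (size - overlap) runes after the previous one, so adjacent chunks
// share overlap runes. Sizes are runes here, an assumption standing in for tokens.
func fixedSizeChunks(text string, size, overlap int) []string {
    if size <= 0 || overlap < 0 || overlap >= size {
        return nil // invalid configuration: the window would never advance
    }
    runes := []rune(text)
    step := size - overlap
    var chunks []string
    for start := 0; start < len(runes); start += step {
        end := start + size
        if end > len(runes) {
            end = len(runes)
        }
        chunks = append(chunks, string(runes[start:end]))
        if end == len(runes) {
            break // the last window reached the end of the text
        }
    }
    return chunks
}

func main() {
    for i, c := range fixedSizeChunks("abcdefghij", 4, 1) {
        fmt.Printf("chunk %d: %q\n", i, c)
    }
}
```

With a size of 4 and an overlap of 1, the ten-character input yields three chunks, each repeating the last character of the previous one. Note how the cut points ignore word and sentence boundaries, which is exactly the weakness the other two strategies address.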

Step 3: Embed

Convert text chunks into dense vectors. This step uses an embedding model to map semantically similar text to nearby positions in vector space.

Comparison of popular embedding models:

Model	Dimensions	Max Tokens	Notes
text-embedding-3-small	1536	8191	OpenAI, cost-effective
text-embedding-3-large	3072	8191	OpenAI, highest accuracy
voyage-3	1024	32000	Anthropic recommended, great for long docs
nomic-embed-text	768	8192	Open source, locally deployable

Step 4: Store

Store vectors in a vector database. Comparison of popular vector databases:

Database	Type	Notes
pgvector	PostgreSQL extension	Seamless integration with existing PG stack, good for medium scale
Qdrant	Standalone service	Excellent performance, rich filtering
Chroma	Embedded	Simple to develop, good for local prototyping
Pinecone	Cloud service	Fully managed, good for large-scale production
Weaviate	Standalone service	Built-in hybrid search

Stage 2: Retrieval and Generation

Step 1: Embed the Query

The user's question is converted to a vector using the same embedding model. The key here is: the query vector and document vectors must use the same model to be compared in the same vector space.

Step 2: Approximate Nearest Neighbor Search (ANN)

Find the K document chunks most similar to the query vector in the vector database. Similarity measures:

Cosine Similarity: The most common metric. Not affected by vector magnitude. Ideal for text embeddings.

$$\text{sim}(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|}$$

Dot Product: Suitable for already-normalized vectors (equivalent to cosine similarity after normalization).

Euclidean Distance: Better suited for image embeddings and similar domains; less common for text.
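To make the three measures concrete, here is a small sketch that computes each one for two toy vectors pointing in the same direction. The helper names are illustrative, and the code assumes both vectors have the same length.

```go
package main

import (
    "fmt"
    "math"
)

// dot returns the dot product of two equal-length vectors.
func dot(a, b []float64) float64 {
    var s float64
    for i := range a {
        s += a[i] * b[i]
    }
    return s
}

// norm returns the Euclidean length of a vector.
func norm(a []float64) float64 { return math.Sqrt(dot(a, a)) }

// cosine returns the cosine similarity, or 0 if either vector is all zeros.
func cosine(a, b []float64) float64 {
    na, nb := norm(a), norm(b)
    if na == 0 || nb == 0 {
        return 0
    }
    return dot(a, b) / (na * nb)
}

// euclidean returns the straight-line distance between two vectors.
func euclidean(a, b []float64) float64 {
    var s float64
    for i := range a {
        d := a[i] - b[i]
        s += d * d
    }
    return math.Sqrt(s)
}

// normalize scales a vector to unit length (a zero vector stays zero).
func normalize(a []float64) []float64 {
    out := make([]float64, len(a))
    n := norm(a)
    if n == 0 {
        return out
    }
    for i, v := range a {
        out[i] = v / n
    }
    return out
}

func main() {
    a := []float64{3, 4}
    b := []float64{6, 8} // same direction as a, twice the magnitude

    fmt.Printf("cosine:            %.3f\n", cosine(a, b))
    fmt.Printf("dot product:       %.3f\n", dot(a, b))
    fmt.Printf("euclidean:         %.3f\n", euclidean(a, b))
    fmt.Printf("dot of normalized: %.3f\n", dot(normalize(a), normalize(b)))
}
```

Cosine similarity treats the two vectors as identical because it ignores magnitude, while the raw dot product and the Euclidean distance both change with it. After normalization the dot product matches the cosine value, which is why many systems store unit-length embeddings and use the cheaper dot product.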

Step 3: Rerank

Initial retrieval (using vector similarity) has limited precision. The reranking step uses a cross-encoder model to precisely re-rank retrieval results:

Bi-encoder (vector similarity): fast, but limited precision
Cross-encoder (reranking): slow, but higher precision

Workflow: use bi-encoder to quickly recall Top-50, then use cross-encoder to re-rank to Top-5

Popular reranking services: Cohere Rerank, Jina Reranker, BGE Reranker (open source).
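The recall-then-rerank workflow can be sketched with two scoring functions. Both scorers below are toy stand-ins chosen so the example runs without any model: `wordOverlap` plays the role of the fast bi-encoder and `phraseMatch` the role of the slower cross-encoder. All names are hypothetical.

```go
package main

import (
    "fmt"
    "sort"
    "strings"
)

type candidate struct {
    ID    string
    Text  string
    Score float64
}

// scorer assigns a relevance score to a (query, document) pair.
type scorer func(query, doc string) float64

// wordOverlap counts query words present in the document (cheap stage, toy stand-in).
func wordOverlap(query, doc string) float64 {
    docWords := map[string]bool{}
    for _, w := range strings.Fields(strings.ToLower(doc)) {
        docWords[w] = true
    }
    var n float64
    for _, w := range strings.Fields(strings.ToLower(query)) {
        if docWords[w] {
            n++
        }
    }
    return n
}

// phraseMatch rewards documents containing the exact query phrase (precise stage, toy stand-in).
func phraseMatch(query, doc string) float64 {
    if strings.Contains(strings.ToLower(doc), strings.ToLower(query)) {
        return 1
    }
    return wordOverlap(query, doc) / 10
}

// topK scores every document and returns the k best, highest score first.
func topK(query string, docs []candidate, score scorer, k int) []candidate {
    ranked := make([]candidate, len(docs))
    copy(ranked, docs)
    for i := range ranked {
        ranked[i].Score = score(query, ranked[i].Text)
    }
    sort.SliceStable(ranked, func(i, j int) bool { return ranked[i].Score > ranked[j].Score })
    if k < 0 {
        k = 0
    }
    if k < len(ranked) {
        ranked = ranked[:k]
    }
    return ranked
}

// retrieveAndRerank recalls recallK candidates cheaply, then rescores only those.
func retrieveAndRerank(query string, corpus []candidate, cheap, precise scorer, recallK, finalK int) []candidate {
    return topK(query, topK(query, corpus, cheap, recallK), precise, finalK)
}

func demoCorpus() []candidate {
    return []candidate{
        {ID: "d4", Text: "index of vector graphics"},
        {ID: "d3", Text: "cooking pasta at home"},
        {ID: "d1", Text: "vector index tuning guide"},
        {ID: "d2", Text: "how to tune a vector index"},
    }
}

func main() {
    for _, c := range retrieveAndRerank("vector index", demoCorpus(), wordOverlap, phraseMatch, 3, 2) {
        fmt.Printf("%s  %.1f  %s\n", c.ID, c.Score, c.Text)
    }
}
```

The cheap stage cannot tell "index of vector graphics" apart from the two relevant documents because all three share both query words. The precise stage, which only has to look at the three recalled candidates, pushes the irrelevant one out of the final top two.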

Step 4: Context Assembly and Generation

Assemble retrieved document chunks into context, then send to the LLM:

[System Prompt]
You are a professional assistant. Answer the question based ONLY on the reference materials below.
If the reference materials do not contain relevant information, clearly say so.

[Reference Materials]
Source: Document A, Chapter 3
Content: ...

Source: Document B, Page 7
Content: ...

[User Question]
Please explain...

L3: Code Practice — Building a Complete Go RAG Pipeline

Document Loader

// rag/loader.go
package rag

import (
    "fmt"
    "os"
    "path/filepath"
    "strings"
)

// Document represents a loaded document
type Document struct {
    ID       string
    Content  string
    Source   string
    Metadata map[string]string
}

// MarkdownLoader loads Markdown files
type MarkdownLoader struct{}

func (l *MarkdownLoader) Load(path string) ([]Document, error) {
    files, err := filepath.Glob(path)
    if err != nil {
        return nil, err
    }

    var docs []Document
    for _, file := range files {
        content, err := os.ReadFile(file)
        if err != nil {
            return nil, fmt.Errorf("read %s: %w", file, err)
        }
        docs = append(docs, Document{
            ID:      file,
            Content: string(content),
            Source:  file,
            Metadata: map[string]string{
                "filename": filepath.Base(file),
                "type":     "markdown",
            },
        })
    }
    return docs, nil
}

// DirectoryLoader recursively loads all documents in a directory
type DirectoryLoader struct {
    Extensions []string
}

func (l *DirectoryLoader) Load(dirPath string) ([]Document, error) {
    var docs []Document
    err := filepath.Walk(dirPath, func(path string, info os.FileInfo, err error) error {
        if err != nil || info.IsDir() {
            return err
        }
        ext := strings.ToLower(filepath.Ext(path))
        for _, supported := range l.Extensions {
            if ext == supported {
                content, err := os.ReadFile(path)
                if err != nil {
                    return err
                }
                docs = append(docs, Document{
                    ID:      path,
                    Content: string(content),
                    Source:  path,
                    Metadata: map[string]string{
                        "filename": info.Name(),
                        "type":     ext[1:],
                    },
                })
                break
            }
        }
        return nil
    })
    return docs, err
}

Recursive Text Splitter

// rag/splitter.go
package rag

import (
    "strings"
    "unicode/utf8"
)

// Chunk represents a text chunk
type Chunk struct {
    Content  string
    Source   string
    ChunkIdx int
    Metadata map[string]string
}

// RecursiveTextSplitter implements recursive character splitting
type RecursiveTextSplitter struct {
    ChunkSize    int
    ChunkOverlap int
    Separators   []string
}

func NewRecursiveTextSplitter(chunkSize, overlap int) *RecursiveTextSplitter {
    return &RecursiveTextSplitter{
        ChunkSize:    chunkSize,
        ChunkOverlap: overlap,
        Separators:   []string{"\n\n", "\n", ". ", "! ", "? ", ", ", " ", ""},
    }
}

func (s *RecursiveTextSplitter) Split(doc Document) []Chunk {
    texts := s.splitText(doc.Content, s.Separators)
    var chunks []Chunk
    for i, text := range texts {
        if strings.TrimSpace(text) == "" {
            continue
        }
        chunks = append(chunks, Chunk{
            Content:  text,
            Source:   doc.Source,
            ChunkIdx: i,
            Metadata: doc.Metadata,
        })
    }
    return chunks
}

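// splitText splits text on the coarsest separator it contains, merges the pieces
// into chunks of at most maxChars runes, and recurses with finer separators on
// any chunk that is still too large.
func (s *RecursiveTextSplitter) splitText(text string, separators []string) []string {
    // ChunkSize is a token budget; approximate it as 4 characters per token.
    // All lengths below are counted in runes so the limit is applied consistently.
    maxChars := s.ChunkSize * 4
    if len(separators) == 0 || utf8.RuneCountInString(text) <= maxChars {
        return []string{text}
    }

    // Pick the first separator present; "" (split per character) always matches.
    sepIdx := len(separators) - 1
    for i, sep := range separators {
        if strings.Contains(text, sep) {
            sepIdx = i
            break
        }
    }
    separator := separators[sepIdx]
    sepLen := utf8.RuneCountInString(separator)

    var goodSplits []string
    var currentChunk strings.Builder
    currentLen := 0 // length of currentChunk in runes
    for _, split := range strings.Split(text, separator) {
        splitLen := utf8.RuneCountInString(split)
        if currentLen > 0 && currentLen+sepLen+splitLen > maxChars {
            // Flush the full chunk and seed the next one with its tail as overlap.
            goodSplits = append(goodSplits, currentChunk.String())
            overlap := s.getOverlap(currentChunk.String())
            currentChunk.Reset()
            currentChunk.WriteString(overlap)
            currentLen = utf8.RuneCountInString(overlap)
        }
        if currentLen > 0 {
            currentChunk.WriteString(separator)
            currentLen += sepLen
        }
        currentChunk.WriteString(split)
        currentLen += splitLen
    }
    if currentLen > 0 {
        goodSplits = append(goodSplits, currentChunk.String())
    }

    // Recurse only with separators finer than the one just used.
    var result []string
    for _, gs := range goodSplits {
        if utf8.RuneCountInString(gs) > maxChars {
            result = append(result, s.splitText(gs, separators[sepIdx+1:])...)
        } else {
            result = append(result, gs)
        }
    }
    return result
}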

func (s *RecursiveTextSplitter) getOverlap(text string) string {
    runes := []rune(text)
    overlapChars := s.ChunkOverlap * 4
    if len(runes) <= overlapChars {
        return text
    }
    return string(runes[len(runes)-overlapChars:])
}

Embedding API Client

// rag/embedder.go
package rag

import (
    "bytes"
    "context"
    "encoding/json"
    "fmt"
    "net/http"
    "os"
    "time"
)

// Embedder converts text into vectors
type Embedder interface {
    Embed(ctx context.Context, texts []string) ([][]float32, error)
}

// OpenAIEmbedder uses the OpenAI Embeddings API
type OpenAIEmbedder struct {
    apiKey string
    model  string
    client *http.Client
}

func NewOpenAIEmbedder(model string) *OpenAIEmbedder {
    return &OpenAIEmbedder{
        apiKey: os.Getenv("OPENAI_API_KEY"),
        model:  model,
        client: &http.Client{Timeout: 30 * time.Second},
    }
}

func (e *OpenAIEmbedder) Embed(ctx context.Context, texts []string) ([][]float32, error) {
    body, _ := json.Marshal(map[string]interface{}{
        "model": e.model,
        "input": texts,
    })

    req, err := http.NewRequestWithContext(ctx, "POST",
        "https://api.openai.com/v1/embeddings", bytes.NewReader(body))
    if err != nil {
        return nil, err
    }
    req.Header.Set("Authorization", "Bearer "+e.apiKey)
    req.Header.Set("Content-Type", "application/json")

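    resp, err := e.client.Do(req)
    if err != nil {
        return nil, fmt.Errorf("embedding request: %w", err)
    }
    defer resp.Body.Close()

    // A non-200 response carries an error payload, not embeddings. Without this
    // check it would decode into an empty result and be returned as success.
    if resp.StatusCode != http.StatusOK {
        return nil, fmt.Errorf("embedding request: unexpected status %s", resp.Status)
    }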

    var result struct {
        Data []struct {
            Embedding []float32 `json:"embedding"`
        } `json:"data"`
    }
    if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
        return nil, err
    }

    embeddings := make([][]float32, len(result.Data))
    for i, d := range result.Data {
        embeddings[i] = d.Embedding
    }
    return embeddings, nil
}

// VoyageEmbedder uses the Voyage AI Embeddings API (recommended by Anthropic)
type VoyageEmbedder struct {
    apiKey string
    model  string
    client *http.Client
}

func NewVoyageEmbedder(model string) *VoyageEmbedder {
    return &VoyageEmbedder{
        apiKey: os.Getenv("VOYAGE_API_KEY"),
        model:  model,
        client: &http.Client{Timeout: 30 * time.Second},
    }
}

func (e *VoyageEmbedder) Embed(ctx context.Context, texts []string) ([][]float32, error) {
    body, _ := json.Marshal(map[string]interface{}{
        "model": e.model,
        "input": texts,
    })

    req, err := http.NewRequestWithContext(ctx, "POST",
        "https://api.voyageai.com/v1/embeddings", bytes.NewReader(body))
    if err != nil {
        return nil, err
    }
    req.Header.Set("Authorization", "Bearer "+e.apiKey)
    req.Header.Set("Content-Type", "application/json")

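    resp, err := e.client.Do(req)
    if err != nil {
        return nil, fmt.Errorf("embedding request: %w", err)
    }
    defer resp.Body.Close()

    // A non-200 response carries an error payload, not embeddings. Without this
    // check it would decode into an empty result and be returned as success.
    if resp.StatusCode != http.StatusOK {
        return nil, fmt.Errorf("embedding request: unexpected status %s", resp.Status)
    }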

    var result struct {
        Data []struct {
            Embedding []float32 `json:"embedding"`
        } `json:"data"`
    }
    if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
        return nil, err
    }

    embeddings := make([][]float32, len(result.Data))
    for i, d := range result.Data {
        embeddings[i] = d.Embedding
    }
    return embeddings, nil
}

pgvector Store

// rag/store.go
package rag

import (
    "context"
    "database/sql"
    "fmt"
    "strings"

    _ "github.com/lib/pq"
)

// SearchResult represents a retrieval result
type SearchResult struct {
    Chunk      Chunk
    Similarity float64
}

// PgVectorStore uses pgvector for vector storage and retrieval
type PgVectorStore struct {
    db        *sql.DB
    tableName string
    dimension int
}

func NewPgVectorStore(dsn, tableName string, dimension int) (*PgVectorStore, error) {
    db, err := sql.Open("postgres", dsn)
    if err != nil {
        return nil, err
    }
    store := &PgVectorStore{db: db, tableName: tableName, dimension: dimension}
    if err := store.initialize(context.Background()); err != nil {
        db.Close() // don't leak the connection pool when setup fails
        return nil, err
    }
    return store, nil
}

func (s *PgVectorStore) initialize(ctx context.Context) error {
    queries := []string{
        "CREATE EXTENSION IF NOT EXISTS vector",
        fmt.Sprintf(`CREATE TABLE IF NOT EXISTS %s (
            id BIGSERIAL PRIMARY KEY,
            content TEXT NOT NULL,
            source TEXT,
            chunk_idx INTEGER,
            metadata JSONB DEFAULT '{}',
            embedding vector(%d)
        )`, s.tableName, s.dimension),
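        // ivfflat chooses its cluster centers when the index is built, so for good
        // recall rebuild it (REINDEX) after bulk loading. pgvector can index the
        // vector type only up to 2,000 dimensions, so a 3072-dimension model such
        // as text-embedding-3-large needs reduced dimensions or no index.
        fmt.Sprintf(`CREATE INDEX IF NOT EXISTS %s_embedding_idx
            ON %s USING ivfflat (embedding vector_cosine_ops)
            WITH (lists = 100)`, s.tableName, s.tableName),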
    }
    for _, q := range queries {
        if _, err := s.db.ExecContext(ctx, q); err != nil {
            return fmt.Errorf("initialize: %w", err)
        }
    }
    return nil
}

// Insert batch-inserts document chunks with their embeddings
func (s *PgVectorStore) Insert(ctx context.Context, chunks []Chunk, embeddings [][]float32) error {
    if len(chunks) != len(embeddings) {
        return fmt.Errorf("chunks and embeddings length mismatch")
    }

    tx, err := s.db.BeginTx(ctx, nil)
    if err != nil {
        return err
    }
    defer tx.Rollback()

    stmt, err := tx.PrepareContext(ctx, fmt.Sprintf(
        `INSERT INTO %s (content, source, chunk_idx, embedding) VALUES ($1, $2, $3, $4)`,
        s.tableName))
    if err != nil {
        return err
    }
    defer stmt.Close()

    for i, chunk := range chunks {
        embStr := float32SliceToString(embeddings[i])
        if _, err := stmt.ExecContext(ctx, chunk.Content, chunk.Source, chunk.ChunkIdx, embStr); err != nil {
            return fmt.Errorf("insert chunk %d: %w", i, err)
        }
    }

    return tx.Commit()
}

// Search retrieves the K most similar chunks using cosine similarity
func (s *PgVectorStore) Search(ctx context.Context, queryEmbedding []float32, k int) ([]SearchResult, error) {
    embStr := float32SliceToString(queryEmbedding)

    rows, err := s.db.QueryContext(ctx, fmt.Sprintf(
        `SELECT content, source, chunk_idx,
                1 - (embedding <=> $1::vector) AS similarity
         FROM %s
         ORDER BY embedding <=> $1::vector
         LIMIT $2`, s.tableName),
        embStr, k)
    if err != nil {
        return nil, fmt.Errorf("search: %w", err)
    }
    defer rows.Close()

    var results []SearchResult
    for rows.Next() {
        var r SearchResult
        if err := rows.Scan(&r.Chunk.Content, &r.Chunk.Source,
            &r.Chunk.ChunkIdx, &r.Similarity); err != nil {
            return nil, err
        }
        results = append(results, r)
    }
    return results, rows.Err()
}

// HybridSearch fuses vector search and PostgreSQL full-text search (ts_rank, used
// here as a keyword-ranking stand-in for BM25) with Reciprocal Rank Fusion (RRF).
// The Similarity field of each result holds the RRF score, not a cosine similarity.
func (s *PgVectorStore) HybridSearch(ctx context.Context, query string, queryEmbedding []float32, k int) ([]SearchResult, error) {
    embStr := float32SliceToString(queryEmbedding)

    // Ranks are computed inside each CTE. Ranking after the FULL OUTER JOIN would
    // also number the rows missing from one side (NULL scores sort first under
    // DESC), so those rows would wrongly earn a rank contribution.
    rows, err := s.db.QueryContext(ctx, fmt.Sprintf(
        `WITH vector_search AS (
            SELECT id, ROW_NUMBER() OVER (ORDER BY distance) AS vec_rank
            FROM (SELECT id, embedding <=> $1::vector AS distance
                  FROM %s ORDER BY embedding <=> $1::vector LIMIT 50) nearest
         ),
         text_search AS (
            SELECT id, ROW_NUMBER() OVER (ORDER BY text_score DESC) AS text_rank
            FROM (SELECT id, ts_rank(to_tsvector('english', content), plainto_tsquery('english', $2)) AS text_score
                  FROM %s WHERE to_tsvector('english', content) @@ plainto_tsquery('english', $2)
                  ORDER BY text_score DESC LIMIT 50) matched
         ),
         rrf AS (
            SELECT COALESCE(v.id, t.id) AS id,
                   COALESCE(1.0/(60+v.vec_rank), 0) +
                   COALESCE(1.0/(60+t.text_rank), 0) AS rrf_score
            FROM vector_search v FULL OUTER JOIN text_search t ON v.id = t.id
         )
         SELECT d.content, d.source, d.chunk_idx, r.rrf_score
         FROM rrf r JOIN %s d ON r.id = d.id
         ORDER BY r.rrf_score DESC LIMIT $3`,
        s.tableName, s.tableName, s.tableName),
        embStr, query, k)
    if err != nil {
        return nil, err
    }
    defer rows.Close()

    var results []SearchResult
    for rows.Next() {
        var r SearchResult
        if err := rows.Scan(&r.Chunk.Content, &r.Chunk.Source,
            &r.Chunk.ChunkIdx, &r.Similarity); err != nil {
            return nil, err
        }
        results = append(results, r)
    }
    return results, rows.Err()
}

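// float32SliceToString renders a vector in pgvector's text format, e.g. [0.1,0.2].
func float32SliceToString(v []float32) string {
    parts := make([]string, len(v))
    for i, f := range v {
        // %g keeps full float32 precision; %f would round to six decimal places.
        parts[i] = fmt.Sprintf("%g", f)
    }
    return "[" + strings.Join(parts, ",") + "]"
}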
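Reciprocal Rank Fusion, which HybridSearch applies in SQL, is easier to follow in plain Go. The sketch below is an in-memory illustration under stated assumptions: the function name `rrfFuse` and the sample IDs are hypothetical, and the constant 60 mirrors the value used in the query above.

```go
package main

import (
    "fmt"
    "sort"
)

// rrfFuse merges several ranked ID lists with Reciprocal Rank Fusion:
// score(id) = sum over lists of 1 / (k + rank), with rank starting at 1.
// An ID missing from a list contributes nothing for that list.
func rrfFuse(k float64, rankings ...[]string) []string {
    scores := map[string]float64{}
    for _, ranking := range rankings {
        for i, id := range ranking {
            scores[id] += 1.0 / (k + float64(i+1))
        }
    }

    ids := make([]string, 0, len(scores))
    for id := range scores {
        ids = append(ids, id)
    }
    sort.Slice(ids, func(i, j int) bool {
        if scores[ids[i]] != scores[ids[j]] {
            return scores[ids[i]] > scores[ids[j]]
        }
        return ids[i] < ids[j] // deterministic tie-break
    })
    return ids
}

func main() {
    vectorRanking := []string{"a", "b", "c"}  // best match first
    keywordRanking := []string{"c", "a", "d"} // best match first
    fmt.Println(rrfFuse(60, vectorRanking, keywordRanking)) // prints [a c b d]
}
```

Because RRF uses only ranks, it never has to reconcile a cosine similarity with a keyword score, which live on unrelated scales. Documents found by both searches ("a" and "c") rise above documents found by only one.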

Complete RAG Pipeline

// rag/pipeline.go
package rag

import (
    "context"
    "fmt"
    "strings"
)

// Pipeline is the complete RAG pipeline
type Pipeline struct {
    splitter *RecursiveTextSplitter
    embedder Embedder
    store    *PgVectorStore
    reranker Reranker
    llm      LLMClient
}

type Reranker interface {
    Rerank(ctx context.Context, query string, docs []SearchResult, topK int) ([]SearchResult, error)
}

type LLMClient interface {
    Complete(ctx context.Context, systemPrompt, userMessage string) (string, error)
    StreamComplete(ctx context.Context, systemPrompt, userMessage string, out chan<- string) error
}

func NewPipeline(embedder Embedder, store *PgVectorStore, reranker Reranker, llm LLMClient) *Pipeline {
    return &Pipeline{
        splitter: NewRecursiveTextSplitter(1000, 200),
        embedder: embedder,
        store:    store,
        reranker: reranker,
        llm:      llm,
    }
}

// Ingest ingests documents into the knowledge base
func (p *Pipeline) Ingest(ctx context.Context, docs []Document) error {
    var allChunks []Chunk
    for _, doc := range docs {
        chunks := p.splitter.Split(doc)
        allChunks = append(allChunks, chunks...)
    }

    batchSize := 100
    for i := 0; i < len(allChunks); i += batchSize {
        end := i + batchSize
        if end > len(allChunks) {
            end = len(allChunks)
        }
        batch := allChunks[i:end]

        texts := make([]string, len(batch))
        for j, c := range batch {
            texts[j] = c.Content
        }

        embeddings, err := p.embedder.Embed(ctx, texts)
        if err != nil {
            return fmt.Errorf("embed batch %d: %w", i/batchSize, err)
        }

        if err := p.store.Insert(ctx, batch, embeddings); err != nil {
            return fmt.Errorf("insert batch %d: %w", i/batchSize, err)
        }

        fmt.Printf("Ingested %d/%d chunks\n", end, len(allChunks))
    }
    return nil
}

// Query handles a user query and returns the LLM-generated answer
func (p *Pipeline) Query(ctx context.Context, question string) (string, error) {
    embeddings, err := p.embedder.Embed(ctx, []string{question})
    if err != nil {
        return "", fmt.Errorf("embed query: %w", err)
    }
    if len(embeddings) == 0 {
        return "", fmt.Errorf("embed query: embedder returned no vectors")
    }

    results, err := p.store.HybridSearch(ctx, question, embeddings[0], 20)
    if err != nil {
        return "", fmt.Errorf("search: %w", err)
    }

    if len(results) == 0 {
        return "I couldn't find relevant information in the knowledge base to answer your question.", nil
    }

    if p.reranker != nil {
        // Keep the search results if reranking fails: Rerank's return value is
        // not usable on error, so it must not overwrite them.
        if reranked, rerr := p.reranker.Rerank(ctx, question, results, 5); rerr == nil {
            results = reranked
        }
    }
    if len(results) > 5 {
        results = results[:5]
    }

    // Named contextText to avoid shadowing the context package.
    contextText := p.assembleContext(results)

    systemPrompt := `You are a helpful assistant. Answer the user's question based ONLY on the provided reference materials.
If the reference materials don't contain enough information, clearly say so.
Always cite the source when referencing information.`

    userMessage := fmt.Sprintf("Reference materials:\n\n%s\n\nQuestion: %s", contextText, question)
    return p.llm.Complete(ctx, systemPrompt, userMessage)
}

func (p *Pipeline) assembleContext(results []SearchResult) string {
    var sb strings.Builder
    for i, r := range results {
        sb.WriteString(fmt.Sprintf("[Source %d: %s]\n%s\n\n", i+1, r.Chunk.Source, r.Chunk.Content))
    }
    return sb.String()
}

L4: Advanced — Late Chunking, HyDE, RAGAS Evaluation, and Streaming RAG

Late Chunking

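Traditional chunking happens before embedding, so each chunk is embedded in isolation and loses its connection to the surrounding text. Late chunking reverses the order: it first runs the complete document through a long-context embedding model to get one embedding per token, then pools the token embeddings that fall inside each chunk's boundaries: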

// LateChunker implements late chunking
type LateChunker struct {
    // Use a long-context embedding model (e.g., jina-embeddings-v2)
    embedder LongContextEmbedder
}

// EmbedWithContext embeds the full document first, then chunks at the token level
func (c *LateChunker) EmbedWithContext(ctx context.Context, doc Document) ([]ChunkEmbedding, error) {
    // 1. Get token-level embeddings for the full document
    tokenEmbeddings, err := c.embedder.EmbedTokens(ctx, doc.Content)
    if err != nil {
        return nil, err
    }

    // 2. Determine chunk boundaries at semantic breaks. The boundaries index into
    //    tokenEmbeddings, so they must be token offsets, not character offsets.
    boundaries := findSemanticBoundaries(doc.Content)

    // 3. For each chunk, mean-pool its token embeddings
    var result []ChunkEmbedding
    for i := 0; i < len(boundaries)-1; i++ {
        start, end := boundaries[i], boundaries[i+1]
        chunkTokens := tokenEmbeddings[start:end]
        pooled := meanPool(chunkTokens)
        result = append(result, ChunkEmbedding{
            Content:   extractTextRange(doc.Content, start, end),
            Embedding: pooled,
        })
    }

    return result, nil
}

func meanPool(vectors [][]float32) []float32 {
    if len(vectors) == 0 {
        return nil
    }
    dim := len(vectors[0])
    result := make([]float32, dim)
    for _, v := range vectors {
        for i, f := range v {
            result[i] += f
        }
    }
    n := float32(len(vectors))
    for i := range result {
        result[i] /= n
    }
    return result
}
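The pooling step is the heart of late chunking, so here it is in isolation on toy data. This is a sketch under assumptions: the token embeddings are hard-coded instead of coming from a model, `poolSpans` is a hypothetical helper, and the boundaries are token indices.

```go
package main

import "fmt"

// meanPool averages a set of equal-length vectors component by component.
func meanPool(vectors [][]float32) []float32 {
    if len(vectors) == 0 {
        return nil
    }
    result := make([]float32, len(vectors[0]))
    for _, v := range vectors {
        for i, f := range v {
            result[i] += f
        }
    }
    n := float32(len(vectors))
    for i := range result {
        result[i] /= n
    }
    return result
}

// poolSpans mean-pools the token embeddings between consecutive boundaries.
// boundaries are ascending token indices, so {0, 2, 5} describes the two
// chunks [0, 2) and [2, 5).
func poolSpans(tokenEmbeddings [][]float32, boundaries []int) ([][]float32, error) {
    var pooled [][]float32
    for i := 0; i+1 < len(boundaries); i++ {
        start, end := boundaries[i], boundaries[i+1]
        if start < 0 || end > len(tokenEmbeddings) || start >= end {
            return nil, fmt.Errorf("invalid span [%d, %d) for %d tokens", start, end, len(tokenEmbeddings))
        }
        pooled = append(pooled, meanPool(tokenEmbeddings[start:end]))
    }
    return pooled, nil
}

func main() {
    // Five token embeddings of dimension 2, as a long-context model might
    // return for one document (values are made up for illustration).
    tokens := [][]float32{{1, 0}, {3, 0}, {0, 2}, {0, 4}, {0, 6}}

    chunks, err := poolSpans(tokens, []int{0, 2, 5})
    if err != nil {
        fmt.Println("error:", err)
        return
    }
    for i, c := range chunks {
        fmt.Printf("chunk %d embedding: %v\n", i, c)
    }
}
```

Each token embedding was produced with the whole document in view, so the pooled chunk vectors carry document-level context even though they cover only a few tokens each.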

HyDE (Hypothetical Document Embeddings)

HyDE is an elegant technique: first ask the LLM to generate a "hypothetical document" based on the question, then use that hypothetical document's vector to search the real knowledge base.

The principle: a question's vector and an answer's vector may not be close in semantic space (e.g., "What is HNSW?" vs "HNSW is..."), but two answer-type texts are much more likely to be semantically near each other.

// HyDERetriever implements the HyDE retrieval strategy
type HyDERetriever struct {
    llm      LLMClient
    embedder Embedder
    store    *PgVectorStore
}

func (r *HyDERetriever) Search(ctx context.Context, question string, k int) ([]SearchResult, error) {
    // 1. Ask the LLM to generate a hypothetical document
    hypotheticalDoc, err := r.llm.Complete(ctx,
        "Generate a short, factual paragraph that would answer the following question. Write as if from a technical document.",
        question)
    if err != nil {
        return nil, fmt.Errorf("generate hypothetical doc: %w", err)
    }

    // 2. Embed the hypothetical document
    embeddings, err := r.embedder.Embed(ctx, []string{hypotheticalDoc})
    if err != nil {
        return nil, err
    }

    // 3. Search the real knowledge base with the hypothetical document's vector
    return r.store.Search(ctx, embeddings[0], k)
}

RAGAS Evaluation Framework

RAGAS (RAG Assessment) provides systematic metrics for evaluating RAG pipeline quality:

// RAGASEvaluator evaluates RAG pipeline quality
type RAGASEvaluator struct {
    llm      LLMClient
    embedder Embedder
}

// EvaluateFaithfulness evaluates whether the answer is faithful to retrieved content (0-1)
// Core question: can every claim in the answer be inferred from the context?
// The parameter is named retrievedContext so it does not shadow the context package.
func (e *RAGASEvaluator) EvaluateFaithfulness(ctx context.Context, question, answer, retrievedContext string) (float64, error) {
    prompt := fmt.Sprintf(`Given the following context and answer, evaluate whether each statement in the answer is supported by the context.

Context: %s

Answer: %s

For each statement in the answer, determine if it is:
1. Fully supported by the context
2. Partially supported
3. Not supported (hallucination)

Return a JSON object with:
- "statements": list of statements found in the answer
- "supported": list of booleans indicating support
- "score": ratio of fully supported statements (0.0 to 1.0)`, retrievedContext, answer)

    response, err := e.llm.Complete(ctx, "You are an expert at evaluating RAG systems.", prompt)
    if err != nil {
        return 0, err
    }

    var result struct {
        Score float64 `json:"score"`
    }
    if err := extractJSON(response, &result); err != nil {
        return 0, err
    }
    return result.Score, nil
}

// EvaluateAnswerRelevancy evaluates how relevant the answer is to the question (0-1)
func (e *RAGASEvaluator) EvaluateAnswerRelevancy(ctx context.Context, question, answer string) (float64, error) {
    // Ask the LLM to reverse-generate questions from the answer, then measure similarity
    genQuestionsPrompt := fmt.Sprintf(`Generate 3 different questions that this answer could be responding to.
Answer: %s
Return just the questions, one per line.`, answer)

    response, err := e.llm.Complete(ctx, "Generate questions based on the given answer.", genQuestionsPrompt)
    if err != nil {
        return 0, err
    }

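    // Keep only non-empty lines so blank lines are neither embedded nor counted.
    var generatedQuestions []string
    for _, line := range strings.Split(response, "\n") {
        if q := strings.TrimSpace(line); q != "" {
            generatedQuestions = append(generatedQuestions, q)
        }
    }
    if len(generatedQuestions) == 0 {
        return 0, fmt.Errorf("answer relevancy: no questions generated")
    }
    texts := append([]string{question}, generatedQuestions...)

    embeddings, err := e.embedder.Embed(ctx, texts)
    if err != nil {
        return 0, err
    }
    if len(embeddings) != len(texts) {
        return 0, fmt.Errorf("answer relevancy: got %d embeddings for %d texts", len(embeddings), len(texts))
    }

    questionEmb := embeddings[0]
    var totalSim float64
    for _, genEmb := range embeddings[1:] {
        totalSim += cosineSimilarity(questionEmb, genEmb)
    }
    return totalSim / float64(len(generatedQuestions)), nil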
}

// cosineSimilarity assumes a and b have the same length; it needs the standard "math" package.
func cosineSimilarity(a, b []float32) float64 {
    var dot, normA, normB float64
    for i := range a {
        dot += float64(a[i]) * float64(b[i])
        normA += float64(a[i]) * float64(a[i])
        normB += float64(b[i]) * float64(b[i])
    }
    if normA == 0 || normB == 0 {
        return 0
    }
    return dot / (math.Sqrt(normA) * math.Sqrt(normB))
}
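EvaluateFaithfulness calls an `extractJSON` helper that this chapter does not define. One possible implementation is sketched below; it is an assumption about what that helper does, not the original code. It decodes the first JSON object in the reply and ignores any prose after it.

```go
package main

import (
    "encoding/json"
    "fmt"
    "strings"
)

// extractJSON decodes the first JSON object found in an LLM reply into v.
// Hypothetical implementation: it assumes the first '{' in the reply opens
// the JSON object, so prose containing a stray '{' before it will fail.
func extractJSON(response string, v interface{}) error {
    start := strings.Index(response, "{")
    if start < 0 {
        return fmt.Errorf("no JSON object found in response")
    }
    // A Decoder stops after one complete value, so trailing text is ignored.
    dec := json.NewDecoder(strings.NewReader(response[start:]))
    if err := dec.Decode(v); err != nil {
        return fmt.Errorf("decode JSON: %w", err)
    }
    return nil
}

func main() {
    reply := `Here is my evaluation: {"statements": ["a", "b"], "supported": [true, false], "score": 0.5} Let me know if you need more detail.`

    var result struct {
        Score float64 `json:"score"`
    }
    if err := extractJSON(reply, &result); err != nil {
        fmt.Println("error:", err)
        return
    }
    fmt.Println("score:", result.Score) // prints score: 0.5
}
```

Models often wrap structured output in explanatory text, so decoding the raw reply directly tends to fail. Returning an error instead of a zero score also keeps a malformed reply from being mistaken for a faithfulness of 0.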

Streaming RAG Responses

For long answers, streaming output significantly improves user experience:

// StreamingQuery supports streaming output for RAG queries
func (p *Pipeline) StreamingQuery(ctx context.Context, question string, out chan<- string) error {
    defer close(out)

    // Retrieval step (synchronous)
    embeddings, err := p.embedder.Embed(ctx, []string{question})
    if err != nil {
        return err
    }
    results, err := p.store.Search(ctx, embeddings[0], 5)
    if err != nil {
        return err
    }
    contextText := p.assembleContext(results)

    // Stream generation
    return p.llm.StreamComplete(ctx,
        "Answer based on the provided context.",
        fmt.Sprintf("Context:\n%s\n\nQuestion: %s", contextText, question),
        out)
}
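On the calling side, a function shaped like StreamingQuery is typically run in its own goroutine while the caller drains the channel. The sketch below shows that pattern with a stand-in producer so it runs without a database or an LLM; `fakeStream`, `collect`, and `demo` are hypothetical names.

```go
package main

import (
    "context"
    "fmt"
    "strings"
)

// streamFn mirrors the shape of StreamingQuery: it sends fragments to out,
// closes out when it is done, and returns an error.
type streamFn func(ctx context.Context, question string, out chan<- string) error

// fakeStream is a stand-in producer that emits a fixed answer in fragments.
func fakeStream(ctx context.Context, question string, out chan<- string) error {
    defer close(out)
    for _, fragment := range []string{"RAG ", "retrieves ", "then ", "generates."} {
        select {
        case out <- fragment:
        case <-ctx.Done():
            return ctx.Err() // stop early if the caller gives up
        }
    }
    return nil
}

// collect runs stream in a goroutine and gathers fragments as they arrive.
func collect(ctx context.Context, stream streamFn, question string) (string, error) {
    out := make(chan string)
    errCh := make(chan error, 1) // buffered so the producer never blocks on it
    go func() { errCh <- stream(ctx, question, out) }()

    var sb strings.Builder
    for fragment := range out {
        // A web handler would write and flush each fragment to the client here.
        sb.WriteString(fragment)
    }
    return sb.String(), <-errCh
}

// demo runs the stand-in producer through collect.
func demo() (string, error) {
    return collect(context.Background(), fakeStream, "What is RAG?")
}

func main() {
    answer, err := demo()
    if err != nil {
        fmt.Println("error:", err)
        return
    }
    fmt.Println(answer)
}
```

The producer closes the channel, which ends the `range` loop, and only then is the error read. Reading the error first would deadlock on an unbuffered `out`, because the producer would be blocked waiting for a reader.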

Summary

RAG pipeline performance is influenced by multiple factors. Recommended optimization order:

Optimize chunking strategy first: This is usually the most impactful factor. Poor chunking means retrieval returns incomplete context, which no later stage can repair.
Choose the right embedding model: voyage-3 is a strong candidate for long technical documentation, but compare models on your own data.
Add hybrid search: Keyword search (BM25-style) is better for exact terms (names, model numbers, code snippets); vector search is better for semantic matches.
Add reranking: A cross-encoder reranker such as Cohere Rerank can significantly improve precision at controlled cost.
Then optimize the prompt: How context is assembled and the system prompt both have measurable impact on final quality.
Evaluate with RAGAS throughout: Build an evaluation dataset and quantify the improvement from each change.
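Quantifying each change needs at least one number that can be tracked over time. As a simple complement to the RAGAS metrics, the sketch below computes recall@k over a small hand-labeled evaluation set. The `evalCase` type, the function names, and the sample data are all hypothetical.

```go
package main

import "fmt"

// evalCase is one labeled question in a hypothetical evaluation set.
type evalCase struct {
    Question  string
    Relevant  map[string]bool // chunk IDs a reviewer marked as relevant
    Retrieved []string        // chunk IDs the pipeline returned, best first
}

// recallAtK returns the fraction of relevant chunks found in the top k results.
func recallAtK(c evalCase, k int) float64 {
    if len(c.Relevant) == 0 || k <= 0 {
        return 0
    }
    if k > len(c.Retrieved) {
        k = len(c.Retrieved)
    }
    seen := map[string]bool{}
    hits := 0
    for _, id := range c.Retrieved[:k] {
        if c.Relevant[id] && !seen[id] {
            hits++
            seen[id] = true // count duplicate results only once
        }
    }
    return float64(hits) / float64(len(c.Relevant))
}

// meanRecallAtK averages recall@k over all cases.
func meanRecallAtK(cases []evalCase, k int) float64 {
    if len(cases) == 0 {
        return 0
    }
    var total float64
    for _, c := range cases {
        total += recallAtK(c, k)
    }
    return total / float64(len(cases))
}

func demoCases() []evalCase {
    return []evalCase{
        {
            Question:  "How do I rotate API keys?",
            Relevant:  map[string]bool{"a": true, "b": true},
            Retrieved: []string{"a", "x", "b"},
        },
        {
            Question:  "What is the refund window?",
            Relevant:  map[string]bool{"c": true},
            Retrieved: []string{"c", "y"},
        },
    }
}

func main() {
    fmt.Printf("recall@2 = %.2f\n", meanRecallAtK(demoCases(), 2)) // prints recall@2 = 0.75
}
```

Running the same evaluation set before and after a change to chunking, embedding model, or search strategy shows whether retrieval actually improved, independent of how the LLM phrases its answer.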

RAG is not a one-time engineering project — it's a system that requires continuous iterative optimization.
