RAG Pipeline: Retrieval-Augmented Generation
L1: Concept - Why RAG Exists
The Three Fundamental Limitations of LLMs
Large language models are a compressed representation of human knowledge, but this "compression" introduces three fundamental limitations that cannot be solved simply by making models larger.
Limitation 1: Knowledge Cutoff
LLM training data has a cutoff point in time. Claude doesn't know what happened in today's news, GPT-4 doesn't know about the latest API changes, and no model can access your company's internal business data. This isn't a bug; it's an inherent property of any system based on training data.
Limitation 2: Private Data Is Invisible
Your company may have tens of thousands of internal documents, customer service records, technical documentation, meeting notes. None of this data has ever appeared in any public training dataset. Even if you use the most powerful LLM, it knows nothing about this data.
Limitation 3: Hallucination
When an LLM is asked about something it's uncertain of, it tends to generate content that sounds plausible but is actually wrong. This is "hallucination." The root cause of hallucination is that an LLM's goal is to generate linguistically coherent, semantically plausible text, not to be strictly consistent with facts.
RAG's Core Idea: Retrieve First, Then Generate
RAG (Retrieval-Augmented Generation) is a strikingly direct idea:
Before asking the LLM a question, first retrieve the most relevant document fragments from a knowledge base, then send those fragments as context along with the question to the LLM, so the LLM can generate an answer based on that evidence.
This simple idea addresses all three limitations:
- The knowledge base can be updated at any time (addresses the knowledge cutoff problem)
- Private data can be added to the knowledge base (addresses the private data problem)
- The LLM answers based on retrieved evidence, which reduces (but does not eliminate) hallucinations
From an information retrieval perspective, RAG combines a traditional search engine (retrieve relevant documents) with an LLM (understand and generate text). Each part does what it does best.
RAG vs Fine-Tuning: Which Should You Choose?
Beginners are often confused: why use RAG instead of directly fine-tuning the LLM on private data?
Fine-tuning advantages:
- The model can learn domain-specific language style and terminology
- Faster inference (no retrieval step)
- Good for teaching the model specific output formats or behaviors
Fine-tuning limitations:
- High cost to update: every knowledge base update requires re-fine-tuning, with enormous time and compute costs
- Cannot handle very large document corpora: fine-tuning only changes model "intuitions," it can't make a model "memorize" the specific contents of thousands of documents
- Hallucinations persist: fine-tuning cannot guarantee the model answers strictly based on training data
- Expensive: high-quality fine-tuning requires substantial GPU compute and professionally annotated data
RAG advantages:
- Knowledge base can be updated in real time (add new documents, delete outdated content)
- Answers are attributable (you can tell users "this answer comes from document X, page Y")
- Controlled cost (only maintain a vector database, no model retraining needed)
- More effective against hallucinations (the LLM is explicitly instructed to "answer based on the following")
Conclusion: Most enterprise AI applications should try RAG first. Only consider fine-tuning when RAG's performance is genuinely insufficient and you have adequate high-quality annotated data.
L2: Principles - The Two Stages of the RAG Pipeline
Stage 1: Ingestion Pipeline
The ingestion pipeline transforms raw documents into a searchable vector index. This process has four steps:
Step 1: Load
Load raw documents in various formats:
PDF → text extraction (preserving chapter structure)
Word → paragraph extraction
Markdown → use directly
HTML → strip tags, preserve content
CSV/Excel → convert row contents to text
Database records → format as text
Document loading looks simple but is often complex in practice. The difficulties include:
- PDFs may be scanned images (require OCR)
- Documents may contain tables (how to preserve table semantics?)
- Some documents have special sectional structure (like legal contracts, technical manuals)
Step 2: Chunk
Split long documents into smaller chunks. This step is one of the most critical determinants of RAG performance.
Why chunk?
- Embedding models have token length limits (typically 512 to 8192 tokens, though some models accept more, such as voyage-3 in the comparison table under Step 3)
- Retrieval returns chunks; chunks that are too large introduce irrelevant content, while chunks that are too small lose context
- Vector similarity works best on semantically cohesive text chunks
Three main chunking strategies:
Fixed-size chunking: Split into fixed token counts with overlap between adjacent chunks. Pros: simple to implement, good for documents without clear structure. Cons: may cut mid-sentence, disrupting semantic integrity.
Semantic chunking: Split at semantic boundaries (paragraphs, sections, sentences). Pros: preserves complete semantic units. Cons: uneven chunk sizes, complex to implement.
Recursive character chunking: Recursively split using a hierarchy of separators: ["\n\n", "\n", ". ", "! ", "? ", ", ", " ", ""]. Prefers paragraph boundaries, then sentence boundaries, then word boundaries. Balances fixed-size and semantic boundary approaches.
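To make the fixed-size strategy concrete, here is a minimal sketch. The function name `fixedSizeChunks` is hypothetical, and sizes are counted in runes rather than tokens to keep the example dependency-free; a real pipeline would plug in a tokenizer.

```go
package main

import "fmt"

// fixedSizeChunks splits text into windows of `size` runes. Each window
// starts `size - overlap` runes after the previous one, so adjacent
// windows share `overlap` runes.
// Assumption: sizes are in runes here; production code would count tokens.
func fixedSizeChunks(text string, size, overlap int) []string {
	if size <= 0 || overlap < 0 || overlap >= size {
		return nil // invalid configuration: the window would never advance
	}
	runes := []rune(text)
	step := size - overlap
	var chunks []string
	for start := 0; start < len(runes); start += step {
		end := start + size
		if end > len(runes) {
			end = len(runes)
		}
		chunks = append(chunks, string(runes[start:end]))
		if end == len(runes) {
			break // the last window reached the end of the text
		}
	}
	return chunks
}

func main() {
	for i, c := range fixedSizeChunks("abcdefghij", 4, 1) {
		fmt.Printf("chunk %d: %q\n", i, c)
	}
}
```

With a size of 4 and an overlap of 1, the ten-letter input yields "abcd", "defg", and "ghij". The mid-word cuts are exactly the drawback that the semantic and recursive strategies try to avoid.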
Step 3: Embed
Convert text chunks into dense vectors. This step uses an embedding model to map semantically similar text to nearby positions in vector space.
Comparison of popular embedding models:
| Model | Dimensions | Max Tokens | Notes |
|---|---|---|---|
| text-embedding-3-small | 1536 | 8191 | OpenAI, cost-effective |
| text-embedding-3-large | 3072 | 8191 | OpenAI, highest accuracy |
| voyage-3 | 1024 | 32000 | Voyage AI (recommended by Anthropic), great for long docs |
| nomic-embed-text | 768 | 8192 | Open source, locally deployable |
Step 4: Store
Store vectors in a vector database. Comparison of popular vector databases:
| Database | Type | Notes |
|---|---|---|
| pgvector | PostgreSQL extension | Seamless integration with existing PG stack, good for medium scale |
| Qdrant | Standalone service | Excellent performance, rich filtering |
| Chroma | Embedded | Simple to develop, good for local prototyping |
| Pinecone | Cloud service | Fully managed, good for large-scale production |
| Weaviate | Standalone service | Built-in hybrid search |
Stage 2: Retrieval and Generation
Step 1: Embed the Query
The user's question is converted to a vector using the same embedding model that was used at ingestion. Vectors produced by different models live in different vector spaces, so their similarities are not meaningful.
Step 2: Approximate Nearest Neighbor Search (ANN)
Find the K document chunks most similar to the query vector in the vector database. Similarity measures:
Cosine Similarity: The most common metric. Not affected by vector magnitude. Ideal for text embeddings.
$$\text{sim}(A, B) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}$$
Dot Product: Suitable for already-normalized vectors (equivalent to cosine similarity after normalization).
Euclidean Distance: Better suited for image embeddings and similar domains; less common for text.
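The three measures above can be compared on a small example. This is an illustrative sketch using `float64` slices and hypothetical helper names; it assumes both vectors have the same length.

```go
package main

import (
	"fmt"
	"math"
)

// dot returns the dot product of two equal-length vectors.
func dot(a, b []float64) float64 {
	var s float64
	for i := range a {
		s += a[i] * b[i]
	}
	return s
}

// norm returns the Euclidean length of a vector.
func norm(a []float64) float64 { return math.Sqrt(dot(a, a)) }

// cosine returns the cosine similarity, or 0 if either vector is all zeros.
func cosine(a, b []float64) float64 {
	na, nb := norm(a), norm(b)
	if na == 0 || nb == 0 {
		return 0
	}
	return dot(a, b) / (na * nb)
}

// euclidean returns the straight-line distance between two vectors.
func euclidean(a, b []float64) float64 {
	var s float64
	for i := range a {
		d := a[i] - b[i]
		s += d * d
	}
	return math.Sqrt(s)
}

// normalize scales a vector to unit length (zero vectors stay zero).
func normalize(a []float64) []float64 {
	out := make([]float64, len(a))
	n := norm(a)
	if n == 0 {
		return out
	}
	for i, v := range a {
		out[i] = v / n
	}
	return out
}

func main() {
	a := []float64{3, 4}
	b := []float64{4, 3}
	fmt.Printf("cosine(a, b)         = %.4f\n", cosine(a, b))
	fmt.Printf("dot(a, b)            = %.4f\n", dot(a, b))
	fmt.Printf("dot(unit a, unit b)  = %.4f\n", dot(normalize(a), normalize(b)))
	fmt.Printf("euclidean(a, b)      = %.4f\n", euclidean(a, b))
}
```

Here the raw dot product is 24 while the cosine similarity is 0.96; after normalizing both vectors, the dot product also comes out as 0.96, which is why the two metrics rank results identically on normalized embeddings.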
Step 3: Rerank
Initial retrieval (using vector similarity) has limited precision. The reranking step uses a cross-encoder model to precisely re-rank retrieval results:
Bi-encoder (vector similarity): fast, but limited precision
Cross-encoder (reranking): slow, but higher precision
Workflow: use bi-encoder to quickly recall Top-50, then use cross-encoder to re-rank to Top-5
Popular reranking services: Cohere Rerank, Jina Reranker, BGE Reranker (open source).
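The recall-then-rerank workflow can be sketched independently of any particular service. In the example below, `scoreFunc` is a hypothetical stand-in for a cross-encoder call, and `wordOverlap` is a toy scorer used only so the sketch runs offline; neither reflects the API of the services listed above.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// scoreFunc stands in for a cross-encoder: it scores the query and one
// document together. (Hypothetical signature, not a real service API.)
type scoreFunc func(query, doc string) float64

// rerank re-scores first-stage candidates and keeps the topK best.
func rerank(query string, candidates []string, score scoreFunc, topK int) []string {
	type scored struct {
		doc string
		s   float64
	}
	items := make([]scored, len(candidates))
	for i, d := range candidates {
		items[i] = scored{doc: d, s: score(query, d)}
	}
	// Stable sort keeps the first-stage order for equal scores.
	sort.SliceStable(items, func(i, j int) bool { return items[i].s > items[j].s })
	if topK > len(items) {
		topK = len(items)
	}
	if topK < 0 {
		topK = 0
	}
	out := make([]string, topK)
	for i := range out {
		out[i] = items[i].doc
	}
	return out
}

// wordOverlap is a toy scorer: the fraction of query words found in the doc.
func wordOverlap(query, doc string) float64 {
	qWords := strings.Fields(strings.ToLower(query))
	if len(qWords) == 0 {
		return 0
	}
	docWords := map[string]bool{}
	for _, w := range strings.Fields(strings.ToLower(doc)) {
		docWords[w] = true
	}
	hits := 0
	for _, w := range qWords {
		if docWords[w] {
			hits++
		}
	}
	return float64(hits) / float64(len(qWords))
}

func main() {
	candidates := []string{
		"pgvector adds vector search to postgres",
		"how to reset a password",
		"hnsw index tuning for vector search",
	}
	for _, d := range rerank("vector search index", candidates, wordOverlap, 2) {
		fmt.Println(d)
	}
}
```

The design point is the interface: the first stage only needs to be cheap and high-recall, because the second stage sees far fewer candidates and can afford a slower, more precise scorer.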
Step 4: Context Assembly and Generation
Assemble retrieved document chunks into context, then send to the LLM:
[System Prompt]
You are a professional assistant. Answer the question based ONLY on the reference materials below.
If the reference materials do not contain relevant information, clearly say so.
[Reference Materials]
Source: Document A, Chapter 3
Content: ...
Source: Document B, Page 7
Content: ...
[User Question]
Please explain...
L3: Code Practice - Building a Complete Go RAG Pipeline
Document Loader
// rag/loader.go
package rag
import (
"fmt"
"os"
"path/filepath"
"strings"
)
// Document represents a loaded document
type Document struct {
ID string
Content string
Source string
Metadata map[string]string
}
// MarkdownLoader loads Markdown files
type MarkdownLoader struct{}
func (l *MarkdownLoader) Load(path string) ([]Document, error) {
files, err := filepath.Glob(path)
if err != nil {
return nil, err
}
var docs []Document
for _, file := range files {
content, err := os.ReadFile(file)
if err != nil {
return nil, fmt.Errorf("read %s: %w", file, err)
}
docs = append(docs, Document{
ID: file,
Content: string(content),
Source: file,
Metadata: map[string]string{
"filename": filepath.Base(file),
"type": "markdown",
},
})
}
return docs, nil
}
// DirectoryLoader recursively loads all documents in a directory
type DirectoryLoader struct {
Extensions []string
}
func (l *DirectoryLoader) Load(dirPath string) ([]Document, error) {
var docs []Document
err := filepath.Walk(dirPath, func(path string, info os.FileInfo, err error) error {
if err != nil || info.IsDir() {
return err
}
ext := strings.ToLower(filepath.Ext(path))
for _, supported := range l.Extensions {
if ext == supported {
content, err := os.ReadFile(path)
if err != nil {
return err
}
docs = append(docs, Document{
ID: path,
Content: string(content),
Source: path,
Metadata: map[string]string{
"filename": info.Name(),
"type": ext[1:],
},
})
break
}
}
return nil
})
return docs, err
}
Recursive Text Splitter
// rag/splitter.go
package rag
import (
"strings"
"unicode/utf8"
)
// Chunk represents a text chunk
type Chunk struct {
Content string
Source string
ChunkIdx int
Metadata map[string]string
}
// RecursiveTextSplitter implements recursive character splitting
type RecursiveTextSplitter struct {
ChunkSize int
ChunkOverlap int
Separators []string
}
func NewRecursiveTextSplitter(chunkSize, overlap int) *RecursiveTextSplitter {
return &RecursiveTextSplitter{
ChunkSize: chunkSize,
ChunkOverlap: overlap,
Separators: []string{"\n\n", "\n", ". ", "! ", "? ", ", ", " ", ""},
}
}
func (s *RecursiveTextSplitter) Split(doc Document) []Chunk {
texts := s.splitText(doc.Content, s.Separators)
var chunks []Chunk
for i, text := range texts {
if strings.TrimSpace(text) == "" {
continue
}
chunks = append(chunks, Chunk{
Content: text,
Source: doc.Source,
ChunkIdx: i,
Metadata: doc.Metadata,
})
}
return chunks
}
// splitText recursively splits text, preferring the coarsest separator present.
// Lengths are measured in runes, with ChunkSize*4 as a rough characters-per-token budget.
func (s *RecursiveTextSplitter) splitText(text string, separators []string) []string {
maxLen := s.ChunkSize * 4
if len(separators) == 0 || utf8.RuneCountInString(text) <= maxLen {
return []string{text}
}
// Pick the first separator that occurs in the text ("" always matches)
sepIdx := len(separators) - 1
for i, sep := range separators {
if strings.Contains(text, sep) {
sepIdx = i
break
}
}
separator := separators[sepIdx]
sepLen := utf8.RuneCountInString(separator)
var goodSplits []string
var currentChunk strings.Builder
currentLen := 0 // rune length of currentChunk
for _, split := range strings.Split(text, separator) {
splitLen := utf8.RuneCountInString(split)
if currentLen > 0 && currentLen+sepLen+splitLen > maxLen {
// Emit the full chunk and seed the next one with its trailing overlap
goodSplits = append(goodSplits, currentChunk.String())
overlap := s.getOverlap(currentChunk.String())
currentChunk.Reset()
currentChunk.WriteString(overlap)
currentLen = utf8.RuneCountInString(overlap)
}
if currentLen > 0 {
currentChunk.WriteString(separator)
currentLen += sepLen
}
currentChunk.WriteString(split)
currentLen += splitLen
}
if currentLen > 0 {
goodSplits = append(goodSplits, currentChunk.String())
}
var result []string
for _, gs := range goodSplits {
if utf8.RuneCountInString(gs) > maxLen {
// Still too long: retry with the separators finer than the one just used
result = append(result, s.splitText(gs, separators[sepIdx+1:])...)
} else {
result = append(result, gs)
}
}
return result
}
func (s *RecursiveTextSplitter) getOverlap(text string) string {
runes := []rune(text)
overlapChars := s.ChunkOverlap * 4
if len(runes) <= overlapChars {
return text
}
return string(runes[len(runes)-overlapChars:])
}
Embedding API Client
// rag/embedder.go
package rag
import (
"bytes"
"context"
"encoding/json"
"fmt"
"net/http"
"os"
"time"
)
// Embedder converts text into vectors
type Embedder interface {
Embed(ctx context.Context, texts []string) ([][]float32, error)
}
// OpenAIEmbedder uses the OpenAI Embeddings API
type OpenAIEmbedder struct {
apiKey string
model string
client *http.Client
}
func NewOpenAIEmbedder(model string) *OpenAIEmbedder {
return &OpenAIEmbedder{
apiKey: os.Getenv("OPENAI_API_KEY"),
model: model,
client: &http.Client{Timeout: 30 * time.Second},
}
}
func (e *OpenAIEmbedder) Embed(ctx context.Context, texts []string) ([][]float32, error) {
body, _ := json.Marshal(map[string]interface{}{
"model": e.model,
"input": texts,
})
req, err := http.NewRequestWithContext(ctx, "POST",
"https://api.openai.com/v1/embeddings", bytes.NewReader(body))
if err != nil {
return nil, err
}
req.Header.Set("Authorization", "Bearer "+e.apiKey)
req.Header.Set("Content-Type", "application/json")
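resp, err := e.client.Do(req)
if err != nil {
return nil, fmt.Errorf("embedding request: %w", err)
}
defer resp.Body.Close()
// A non-200 response would otherwise decode into an empty result
if resp.StatusCode != http.StatusOK {
return nil, fmt.Errorf("embedding request: unexpected status %s", resp.Status)
}
var result struct {
Data []struct {
Embedding []float32 `json:"embedding"`
} `json:"data"`
}
if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
return nil, fmt.Errorf("decode embedding response: %w", err)
}
// Callers index the result by input position, so the counts must match
if len(result.Data) != len(texts) {
return nil, fmt.Errorf("embedding response: got %d vectors for %d inputs", len(result.Data), len(texts))
}
embeddings := make([][]float32, len(result.Data))
for i, d := range result.Data {
embeddings[i] = d.Embedding
}
return embeddings, nil
}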
// VoyageEmbedder uses the Voyage AI Embeddings API (recommended by Anthropic)
type VoyageEmbedder struct {
apiKey string
model string
client *http.Client
}
func NewVoyageEmbedder(model string) *VoyageEmbedder {
return &VoyageEmbedder{
apiKey: os.Getenv("VOYAGE_API_KEY"),
model: model,
client: &http.Client{Timeout: 30 * time.Second},
}
}
func (e *VoyageEmbedder) Embed(ctx context.Context, texts []string) ([][]float32, error) {
body, _ := json.Marshal(map[string]interface{}{
"model": e.model,
"input": texts,
})
req, err := http.NewRequestWithContext(ctx, "POST",
"https://api.voyageai.com/v1/embeddings", bytes.NewReader(body))
if err != nil {
return nil, err
}
req.Header.Set("Authorization", "Bearer "+e.apiKey)
req.Header.Set("Content-Type", "application/json")
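resp, err := e.client.Do(req)
if err != nil {
return nil, fmt.Errorf("embedding request: %w", err)
}
defer resp.Body.Close()
// A non-200 response would otherwise decode into an empty result
if resp.StatusCode != http.StatusOK {
return nil, fmt.Errorf("embedding request: unexpected status %s", resp.Status)
}
var result struct {
Data []struct {
Embedding []float32 `json:"embedding"`
} `json:"data"`
}
if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
return nil, fmt.Errorf("decode embedding response: %w", err)
}
// Callers index the result by input position, so the counts must match
if len(result.Data) != len(texts) {
return nil, fmt.Errorf("embedding response: got %d vectors for %d inputs", len(result.Data), len(texts))
}
embeddings := make([][]float32, len(result.Data))
for i, d := range result.Data {
embeddings[i] = d.Embedding
}
return embeddings, nil
}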
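For tests and local experiments it can help to have an embedder that needs no API key. The sketch below is a hypothetical `HashEmbedder` that satisfies the same `Embed` method signature as the `Embedder` interface above by hashing words into a fixed number of buckets. It is a stand-in for wiring and testing only; its vectors capture word overlap, not semantics.

```go
package main

import (
	"context"
	"fmt"
	"hash/fnv"
	"math"
	"strings"
)

// hashVector builds a bag-of-words vector: each lowercase word is hashed
// into one of dim buckets, then the vector is L2-normalized.
func hashVector(text string, dim int) []float32 {
	vec := make([]float32, dim)
	for _, w := range strings.Fields(strings.ToLower(text)) {
		h := fnv.New32a()
		h.Write([]byte(w))
		vec[h.Sum32()%uint32(dim)]++
	}
	var sumSq float64
	for _, v := range vec {
		sumSq += float64(v) * float64(v)
	}
	if n := float32(math.Sqrt(sumSq)); n > 0 {
		for i := range vec {
			vec[i] /= n
		}
	}
	return vec
}

// HashEmbedder is a hypothetical offline stand-in for a real embedding API.
type HashEmbedder struct {
	Dim int
}

// Embed returns one deterministic vector per input text.
func (e HashEmbedder) Embed(_ context.Context, texts []string) ([][]float32, error) {
	if e.Dim <= 0 {
		return nil, fmt.Errorf("dimension must be positive, got %d", e.Dim)
	}
	out := make([][]float32, len(texts))
	for i, t := range texts {
		out[i] = hashVector(t, e.Dim)
	}
	return out, nil
}

func main() {
	e := HashEmbedder{Dim: 8}
	vecs, err := e.Embed(context.Background(), []string{"hello world", "vector search"})
	if err != nil {
		panic(err)
	}
	fmt.Println(len(vecs), len(vecs[0]))
}
```

Because the output is deterministic, the same text always maps to the same vector, which makes ingestion and retrieval code straightforward to unit test before switching to a real model.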
pgvector Store
// rag/store.go
package rag
import (
"context"
"database/sql"
"fmt"
"strings"
_ "github.com/lib/pq"
)
// SearchResult represents a retrieval result
type SearchResult struct {
Chunk Chunk
Similarity float64
}
// PgVectorStore uses pgvector for vector storage and retrieval
type PgVectorStore struct {
db *sql.DB
tableName string
dimension int
}
func NewPgVectorStore(dsn, tableName string, dimension int) (*PgVectorStore, error) {
db, err := sql.Open("postgres", dsn)
if err != nil {
return nil, err
}
store := &PgVectorStore{db: db, tableName: tableName, dimension: dimension}
if err := store.initialize(context.Background()); err != nil {
return nil, err
}
return store, nil
}
func (s *PgVectorStore) initialize(ctx context.Context) error {
queries := []string{
"CREATE EXTENSION IF NOT EXISTS vector",
fmt.Sprintf(`CREATE TABLE IF NOT EXISTS %s (
id BIGSERIAL PRIMARY KEY,
content TEXT NOT NULL,
source TEXT,
chunk_idx INTEGER,
metadata JSONB DEFAULT '{}',
embedding vector(%d)
)`, s.tableName, s.dimension),
fmt.Sprintf(`CREATE INDEX IF NOT EXISTS %s_embedding_idx
ON %s USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100)`, s.tableName, s.tableName),
}
for _, q := range queries {
if _, err := s.db.ExecContext(ctx, q); err != nil {
return fmt.Errorf("initialize: %w", err)
}
}
return nil
}
// Insert batch-inserts document chunks with their embeddings
func (s *PgVectorStore) Insert(ctx context.Context, chunks []Chunk, embeddings [][]float32) error {
if len(chunks) != len(embeddings) {
return fmt.Errorf("chunks and embeddings length mismatch")
}
tx, err := s.db.BeginTx(ctx, nil)
if err != nil {
return err
}
defer tx.Rollback()
stmt, err := tx.PrepareContext(ctx, fmt.Sprintf(
`INSERT INTO %s (content, source, chunk_idx, embedding) VALUES ($1, $2, $3, $4)`,
s.tableName))
if err != nil {
return err
}
defer stmt.Close()
for i, chunk := range chunks {
embStr := float32SliceToString(embeddings[i])
if _, err := stmt.ExecContext(ctx, chunk.Content, chunk.Source, chunk.ChunkIdx, embStr); err != nil {
return fmt.Errorf("insert chunk %d: %w", i, err)
}
}
return tx.Commit()
}
// Search retrieves the K most similar chunks using cosine similarity
func (s *PgVectorStore) Search(ctx context.Context, queryEmbedding []float32, k int) ([]SearchResult, error) {
embStr := float32SliceToString(queryEmbedding)
rows, err := s.db.QueryContext(ctx, fmt.Sprintf(
`SELECT content, source, chunk_idx,
1 - (embedding <=> $1::vector) AS similarity
FROM %s
ORDER BY embedding <=> $1::vector
LIMIT $2`, s.tableName),
embStr, k)
if err != nil {
return nil, fmt.Errorf("search: %w", err)
}
defer rows.Close()
var results []SearchResult
for rows.Next() {
var r SearchResult
if err := rows.Scan(&r.Chunk.Content, &r.Chunk.Source,
&r.Chunk.ChunkIdx, &r.Similarity); err != nil {
return nil, err
}
results = append(results, r)
}
return results, rows.Err()
}
// HybridSearch combines vector search and PostgreSQL full-text search (ts_rank, standing in for BM25) using RRF.
// Ranks are computed inside each CTE so a chunk missing from one list contributes 0 for that list.
func (s *PgVectorStore) HybridSearch(ctx context.Context, query string, queryEmbedding []float32, k int) ([]SearchResult, error) {
embStr := float32SliceToString(queryEmbedding)
rows, err := s.db.QueryContext(ctx, fmt.Sprintf(
`WITH vector_search AS (
SELECT id, ROW_NUMBER() OVER (ORDER BY dist) AS rnk
FROM (SELECT id, embedding <=> $1::vector AS dist
FROM %s ORDER BY embedding <=> $1::vector LIMIT 50) nearest
),
text_search AS (
SELECT id, ROW_NUMBER() OVER (ORDER BY ts_rank(to_tsvector('english', content), plainto_tsquery('english', $2)) DESC) AS rnk
FROM %s WHERE to_tsvector('english', content) @@ plainto_tsquery('english', $2)
ORDER BY rnk LIMIT 50
),
rrf AS (
SELECT COALESCE(v.id, t.id) AS id,
COALESCE(1.0/(60+v.rnk), 0) + COALESCE(1.0/(60+t.rnk), 0) AS rrf_score
FROM vector_search v FULL OUTER JOIN text_search t ON v.id = t.id
)
SELECT d.content, d.source, d.chunk_idx, r.rrf_score
FROM rrf r JOIN %s d ON r.id = d.id
ORDER BY r.rrf_score DESC LIMIT $3`,
s.tableName, s.tableName, s.tableName),
embStr, query, k)
if err != nil {
return nil, err
}
defer rows.Close()
var results []SearchResult
for rows.Next() {
var r SearchResult
if err := rows.Scan(&r.Chunk.Content, &r.Chunk.Source,
&r.Chunk.ChunkIdx, &r.Similarity); err != nil {
return nil, err
}
results = append(results, r)
}
return results, rows.Err()
}
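// float32SliceToString renders a vector in pgvector's text format, e.g. [0.1,0.2].
// %g keeps the shortest exact float32 representation; %f would round small values to 6 decimals.
func float32SliceToString(v []float32) string {
parts := make([]string, len(v))
for i, f := range v {
parts[i] = fmt.Sprintf("%g", f)
}
return "[" + strings.Join(parts, ",") + "]"
}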
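The Reciprocal Rank Fusion (RRF) scoring that HybridSearch performs in SQL can be checked in isolation. The sketch below is an in-memory version with a hypothetical `rrfFuse` helper: each document scores the sum of 1/(k + rank) over the lists it appears in, with ranks starting at 1 and k = 60 as in the query above.

```go
package main

import (
	"fmt"
	"sort"
)

// rrfFuse merges several ranked ID lists with Reciprocal Rank Fusion.
// An ID that is absent from a list simply gets no contribution from it.
func rrfFuse(rankings [][]string, k float64) []string {
	scores := map[string]float64{}
	for _, ranking := range rankings {
		for i, id := range ranking {
			scores[id] += 1.0 / (k + float64(i+1)) // rank is 1-based
		}
	}
	ids := make([]string, 0, len(scores))
	for id := range scores {
		ids = append(ids, id)
	}
	sort.Slice(ids, func(a, b int) bool {
		if scores[ids[a]] != scores[ids[b]] {
			return scores[ids[a]] > scores[ids[b]]
		}
		return ids[a] < ids[b] // deterministic tie-break by ID
	})
	return ids
}

func main() {
	vectorHits := []string{"a", "b", "c"} // best vector match first
	textHits := []string{"b", "d"}        // best keyword match first
	fmt.Println(rrfFuse([][]string{vectorHits, textHits}, 60))
	// prints [b a d c]
}
```

Document "b" wins because it appears in both lists (1/62 + 1/61), even though it is not first in the vector ranking. RRF only uses ranks, so the two raw score scales never need to be made comparable.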
Complete RAG Pipeline
// rag/pipeline.go
package rag
import (
"context"
"fmt"
"strings"
)
// Pipeline is the complete RAG pipeline
type Pipeline struct {
splitter *RecursiveTextSplitter
embedder Embedder
store *PgVectorStore
reranker Reranker
llm LLMClient
}
type Reranker interface {
Rerank(ctx context.Context, query string, docs []SearchResult, topK int) ([]SearchResult, error)
}
type LLMClient interface {
Complete(ctx context.Context, systemPrompt, userMessage string) (string, error)
StreamComplete(ctx context.Context, systemPrompt, userMessage string, out chan<- string) error
}
func NewPipeline(embedder Embedder, store *PgVectorStore, reranker Reranker, llm LLMClient) *Pipeline {
return &Pipeline{
splitter: NewRecursiveTextSplitter(1000, 200),
embedder: embedder,
store: store,
reranker: reranker,
llm: llm,
}
}
// Ingest ingests documents into the knowledge base
func (p *Pipeline) Ingest(ctx context.Context, docs []Document) error {
var allChunks []Chunk
for _, doc := range docs {
chunks := p.splitter.Split(doc)
allChunks = append(allChunks, chunks...)
}
batchSize := 100
for i := 0; i < len(allChunks); i += batchSize {
end := i + batchSize
if end > len(allChunks) {
end = len(allChunks)
}
batch := allChunks[i:end]
texts := make([]string, len(batch))
for j, c := range batch {
texts[j] = c.Content
}
embeddings, err := p.embedder.Embed(ctx, texts)
if err != nil {
return fmt.Errorf("embed batch %d: %w", i/batchSize, err)
}
if err := p.store.Insert(ctx, batch, embeddings); err != nil {
return fmt.Errorf("insert batch %d: %w", i/batchSize, err)
}
fmt.Printf("Ingested %d/%d chunks\n", end, len(allChunks))
}
return nil
}
// Query handles a user query and returns the LLM-generated answer
func (p *Pipeline) Query(ctx context.Context, question string) (string, error) {
embeddings, err := p.embedder.Embed(ctx, []string{question})
if err != nil {
return "", fmt.Errorf("embed query: %w", err)
}
results, err := p.store.HybridSearch(ctx, question, embeddings[0], 20)
if err != nil {
return "", fmt.Errorf("search: %w", err)
}
if len(results) == 0 {
return "I couldn't find relevant information in the knowledge base to answer your question.", nil
}
if p.reranker != nil {
// Keep the first-stage results if reranking fails, instead of overwriting them
reranked, rerankErr := p.reranker.Rerank(ctx, question, results, 5)
if rerankErr == nil {
results = reranked
}
}
if len(results) > 5 {
results = results[:5]
}
// Named contextText to avoid shadowing the context package
contextText := p.assembleContext(results)
systemPrompt := `You are a helpful assistant. Answer the user's question based ONLY on the provided reference materials.
If the reference materials don't contain enough information, clearly say so.
Always cite the source when referencing information.`
userMessage := fmt.Sprintf("Reference materials:\n\n%s\n\nQuestion: %s", contextText, question)
return p.llm.Complete(ctx, systemPrompt, userMessage)
}
func (p *Pipeline) assembleContext(results []SearchResult) string {
var sb strings.Builder
for i, r := range results {
sb.WriteString(fmt.Sprintf("[Source %d: %s]\n%s\n\n", i+1, r.Chunk.Source, r.Chunk.Content))
}
return sb.String()
}
L4: Advanced - Late Chunking, HyDE, RAGAS Evaluation, and Streaming RAG
Late Chunking
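Traditional chunking happens before embedding, so each chunk is embedded without the context of its neighbors. Late chunking reverses the order: it first runs the complete document through a long-context embedding model to get token-level embeddings, then pools those token embeddings per chunk: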
// LateChunker implements late chunking
type LateChunker struct {
// Use a long-context embedding model (e.g., jina-embeddings-v2)
embedder LongContextEmbedder
}
// EmbedWithContext embeds the full document first, then chunks at the token level
func (c *LateChunker) EmbedWithContext(ctx context.Context, doc Document) ([]ChunkEmbedding, error) {
// 1. Get token-level embeddings for the full document
tokenEmbeddings, err := c.embedder.EmbedTokens(ctx, doc.Content)
if err != nil {
return nil, err
}
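// 2. Determine chunk boundaries (use semantic boundaries).
// findSemanticBoundaries is assumed to return token indices, since they index tokenEmbeddings below
boundaries := findSemanticBoundaries(doc.Content)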
// 3. For each chunk, mean-pool its token embeddings
var result []ChunkEmbedding
for i := 0; i < len(boundaries)-1; i++ {
start, end := boundaries[i], boundaries[i+1]
chunkTokens := tokenEmbeddings[start:end]
pooled := meanPool(chunkTokens)
result = append(result, ChunkEmbedding{
Content: extractTextRange(doc.Content, start, end),
Embedding: pooled,
})
}
return result, nil
}
func meanPool(vectors [][]float32) []float32 {
if len(vectors) == 0 {
return nil
}
dim := len(vectors[0])
result := make([]float32, dim)
for _, v := range vectors {
for i, f := range v {
result[i] += f
}
}
n := float32(len(vectors))
for i := range result {
result[i] /= n
}
return result
}
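The pooling step can be traced on toy data. The sketch below uses made-up two-dimensional "token embeddings" and a hypothetical `poolByBoundaries` helper; it assumes boundaries are token indices and that all token vectors have the same length.

```go
package main

import "fmt"

// meanPool averages equal-length vectors element-wise.
func meanPool(vectors [][]float32) []float32 {
	if len(vectors) == 0 {
		return nil
	}
	result := make([]float32, len(vectors[0]))
	for _, v := range vectors {
		for i, f := range v {
			result[i] += f
		}
	}
	n := float32(len(vectors))
	for i := range result {
		result[i] /= n
	}
	return result
}

// poolByBoundaries produces one pooled vector per chunk, where chunk i
// covers tokens [boundaries[i], boundaries[i+1]). Invalid ranges are skipped.
func poolByBoundaries(tokenEmbeddings [][]float32, boundaries []int) [][]float32 {
	var chunks [][]float32
	for i := 0; i+1 < len(boundaries); i++ {
		start, end := boundaries[i], boundaries[i+1]
		if start < 0 || end > len(tokenEmbeddings) || start >= end {
			continue
		}
		chunks = append(chunks, meanPool(tokenEmbeddings[start:end]))
	}
	return chunks
}

func main() {
	// Five tokens: the first two belong to chunk 0, the last three to chunk 1.
	tokens := [][]float32{{1, 0}, {3, 0}, {0, 2}, {0, 4}, {0, 6}}
	fmt.Println(poolByBoundaries(tokens, []int{0, 2, 5}))
	// prints [[2 0] [0 4]]
}
```

In a real late-chunking setup each token vector already reflects the whole document, because the model attended over the full text before pooling; the toy vectors here only illustrate the slicing and averaging.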
HyDE (Hypothetical Document Embeddings)
HyDE is an elegant technique: first ask the LLM to generate a "hypothetical document" based on the question, then use that hypothetical document's vector to search the real knowledge base.
The principle: a question's vector and an answer's vector may not be close in semantic space (e.g., "What is HNSW?" vs "HNSW is..."), but two answer-type texts are much more likely to be semantically near each other.
// HyDERetriever implements the HyDE retrieval strategy
type HyDERetriever struct {
llm LLMClient
embedder Embedder
store *PgVectorStore
}
func (r *HyDERetriever) Search(ctx context.Context, question string, k int) ([]SearchResult, error) {
// 1. Ask the LLM to generate a hypothetical document
hypotheticalDoc, err := r.llm.Complete(ctx,
"Generate a short, factual paragraph that would answer the following question. Write as if from a technical document.",
question)
if err != nil {
return nil, fmt.Errorf("generate hypothetical doc: %w", err)
}
// 2. Embed the hypothetical document
embeddings, err := r.embedder.Embed(ctx, []string{hypotheticalDoc})
if err != nil {
return nil, err
}
// 3. Search the real knowledge base with the hypothetical document's vector
return r.store.Search(ctx, embeddings[0], k)
}
RAGAS Evaluation Framework
RAGAS (RAG Assessment) provides systematic metrics for evaluating RAG pipeline quality:
// RAGASEvaluator evaluates RAG pipeline quality
type RAGASEvaluator struct {
llm LLMClient
embedder Embedder
}
// EvaluateFaithfulness evaluates whether the answer is faithful to retrieved content (0-1)
// Core question: can every claim in the answer be inferred from the context?
// The parameter is named contextText to avoid shadowing the context package.
func (e *RAGASEvaluator) EvaluateFaithfulness(ctx context.Context, question, answer, contextText string) (float64, error) {
prompt := fmt.Sprintf(`Given the following context and answer, evaluate whether each statement in the answer is supported by the context.
Context: %s
Answer: %s
For each statement in the answer, determine if it is:
1. Fully supported by the context
2. Partially supported
3. Not supported (hallucination)
Return a JSON object with:
- "statements": list of statements found in the answer
- "supported": list of booleans indicating support
- "score": ratio of fully supported statements (0.0 to 1.0)`, contextText, answer)
response, err := e.llm.Complete(ctx, "You are an expert at evaluating RAG systems.", prompt)
if err != nil {
return 0, err
}
var result struct {
Score float64 `json:"score"`
}
if err := extractJSON(response, &result); err != nil {
return 0, err
}
return result.Score, nil
}
// EvaluateAnswerRelevancy evaluates how relevant the answer is to the question (0-1)
func (e *RAGASEvaluator) EvaluateAnswerRelevancy(ctx context.Context, question, answer string) (float64, error) {
// Ask the LLM to reverse-generate questions from the answer, then measure similarity
genQuestionsPrompt := fmt.Sprintf(`Generate 3 different questions that this answer could be responding to.
Answer: %s
Return just the questions, one per line.`, answer)
response, err := e.llm.Complete(ctx, "Generate questions based on the given answer.", genQuestionsPrompt)
if err != nil {
return 0, err
}
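// Drop blank lines so they are neither embedded nor counted in the average
var generatedQuestions []string
for _, line := range strings.Split(response, "\n") {
if q := strings.TrimSpace(line); q != "" {
generatedQuestions = append(generatedQuestions, q)
}
}
if len(generatedQuestions) == 0 {
return 0, fmt.Errorf("no questions generated from answer")
}
texts := append([]string{question}, generatedQuestions...)
embeddings, err := e.embedder.Embed(ctx, texts)
if err != nil {
return 0, err
}
if len(embeddings) != len(texts) {
return 0, fmt.Errorf("got %d embeddings for %d texts", len(embeddings), len(texts))
}
questionEmb := embeddings[0]
var totalSim float64
for _, genEmb := range embeddings[1:] {
totalSim += cosineSimilarity(questionEmb, genEmb)
}
return totalSim / float64(len(generatedQuestions)), nil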
}
func cosineSimilarity(a, b []float32) float64 {
var dot, normA, normB float64
for i := range a {
dot += float64(a[i]) * float64(b[i])
normA += float64(a[i]) * float64(a[i])
normB += float64(b[i]) * float64(b[i])
}
if normA == 0 || normB == 0 {
return 0
}
return dot / (math.Sqrt(normA) * math.Sqrt(normB))
}
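EvaluateFaithfulness calls an `extractJSON` helper that is not shown in this section. One possible implementation, offered as an assumption rather than the original helper, takes the text between the first `{` and the last `}` and unmarshals it, which tolerates prose or code-fence markers around the JSON.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// extractJSON unmarshals the text between the first '{' and the last '}'
// of an LLM response into v. This is a simple heuristic: it assumes the
// response contains exactly one JSON object.
func extractJSON(response string, v interface{}) error {
	start := strings.Index(response, "{")
	end := strings.LastIndex(response, "}")
	if start < 0 || end < start {
		return fmt.Errorf("no JSON object found in response")
	}
	if err := json.Unmarshal([]byte(response[start:end+1]), v); err != nil {
		return fmt.Errorf("decode JSON: %w", err)
	}
	return nil
}

func main() {
	var result struct {
		Score float64 `json:"score"`
	}
	response := "Here is the evaluation:\n{\"statements\": [\"a\"], \"supported\": [true], \"score\": 0.75}\nDone."
	if err := extractJSON(response, &result); err != nil {
		panic(err)
	}
	fmt.Println(result.Score)
	// prints 0.75
}
```

If the model sometimes returns several objects or malformed JSON, a stricter approach such as requesting a structured output mode from the LLM provider would be more robust than this heuristic.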
Streaming RAG Responses
For long answers, streaming output significantly improves user experience:
// StreamingQuery supports streaming output for RAG queries
func (p *Pipeline) StreamingQuery(ctx context.Context, question string, out chan<- string) error {
defer close(out)
// Retrieval step (synchronous)
embeddings, err := p.embedder.Embed(ctx, []string{question})
if err != nil {
return err
}
results, err := p.store.Search(ctx, embeddings[0], 5)
if err != nil {
return err
}
contextText := p.assembleContext(results)
// Stream generation
return p.llm.StreamComplete(ctx,
"Answer based on the provided context.",
fmt.Sprintf("Context:\n%s\n\nQuestion: %s", contextText, question),
out)
}
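Because StreamingQuery closes the channel when it returns, a caller would typically run it in a goroutine and range over the channel until it is closed. The sketch below shows that consumer pattern with a hypothetical `fakeStream` producer standing in for the pipeline, so it runs without an LLM.

```go
package main

import (
	"context"
	"fmt"
	"strings"
)

// fakeStream stands in for Pipeline.StreamingQuery: it sends tokens on out
// and closes the channel when it is done or the context is cancelled.
func fakeStream(ctx context.Context, tokens []string, out chan<- string) error {
	defer close(out)
	for _, tok := range tokens {
		select {
		case out <- tok:
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return nil
}

// collect runs the producer in a goroutine and drains the channel until
// it is closed, then returns the joined text and the producer's error.
func collect(ctx context.Context, tokens []string) (string, error) {
	out := make(chan string)
	errCh := make(chan error, 1) // buffered so the producer never blocks on it
	go func() { errCh <- fakeStream(ctx, tokens, out) }()

	var sb strings.Builder
	for tok := range out {
		sb.WriteString(tok) // a real UI would flush each token to the client here
	}
	return sb.String(), <-errCh
}

// runDemo wires the pieces together with a background context.
func runDemo() (string, error) {
	return collect(context.Background(), []string{"RAG ", "retrieves ", "first."})
}

func main() {
	answer, err := runDemo()
	if err != nil {
		panic(err)
	}
	fmt.Println(answer)
	// prints RAG retrieves first.
}
```

Reading the error only after the channel is drained avoids a deadlock: the producer can always finish sending, and the consumer always sees the close.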
Summary
RAG pipeline performance is influenced by multiple factors. Recommended optimization order:
- Optimize chunking strategy first: This is the single most impactful factor. Wrong chunking leads to retrieving incomplete context.
- Choose the right embedding model: benchmark candidates on your own corpus; voyage-3, for example, may perform better on long technical documentation.
- Add hybrid search: BM25 is better for keyword search (names, model numbers, code snippets); vector search is better for semantics.
- Add reranking: Cohere Rerank can significantly improve precision at controlled cost.
- Finally optimize the prompt: How context is assembled and the system prompt both have measurable impact on final quality.
- Evaluate with RAGAS: Build an evaluation dataset and quantify the improvement from each change.
RAG is not a one-time engineering project; it's a system that requires continuous iterative optimization.