Chapter 5

RAG Deep Dive: Vector Search vs Full-Text vs Hybrid Retrieval


Knowing what RAG is isn't enough. Understanding why different retrieval methods perform so differently across scenarios is what enables you to build a truly high-quality knowledge Q&A system.

Chapter Overview

"Why did the AI give a wrong answer to such an obvious question?" — This is the most common confusion after launching a knowledge base application. In most cases, the problem isn't the model's reasoning ability but the retrieval step: either the wrong document chunks were retrieved, or the right ones were missed.

RAG (Retrieval-Augmented Generation) is the technical foundation for Dify's knowledge base functionality. To tune RAG well, you need to truly understand the underlying retrieval principles: How does vector retrieval work? Why is exact keyword matching sometimes better? How does hybrid retrieval combine the advantages of both?

This chapter explains the principles and appropriate use cases for three retrieval methods at the algorithmic level, combined with Dify's configuration parameters to provide actionable tuning recommendations.

By the end of this chapter, you will be able to:


Level 1: Foundational Understanding (1-3 Years Experience)

What Is RAG?

The core idea of RAG (Retrieval-Augmented Generation) is simple yet powerful: Before asking an LLM to generate an answer, retrieve document chunks relevant to the question from a knowledge base, then pass both these chunks and the question to the LLM, letting it answer based on this "reference material".

Without RAG, an LLM can only rely on knowledge learned during training (which has time limits and coverage gaps). With RAG, the LLM is like a student who can consult reference books during an exam — it doesn't need to "memorize" all knowledge, just "understand" the retrieved chunks and give correct answers.

The basic RAG pipeline:

User question: What is the data export limit for the product?
         ↓
    [Retrieval Phase]
    1. Vectorize the question (embedding)
    2. Search the vector database for most similar document chunks
    3. Return Top-K most relevant chunks
         ↓
    Retrieval results:
    "[Chunk 1] Maximum 100,000 records per export..."
    "[Chunk 2] Export jobs run asynchronously in the background..."
         ↓
    [Generation Phase]
    Build Prompt:
    "Answer the user's question based on the following documentation:
    [Retrieved chunk 1, chunk 2]
    User question: What is the data export limit?"
         ↓
    LLM generates answer:
    "According to the product documentation, a maximum of 100,000 records
    can be exported at once. When exceeding this limit, export by time range..."
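The generation phase of the pipeline above can be sketched as a small prompt-building helper. This is a minimal illustration, not Dify's actual prompt template: the function name `build_rag_prompt` and the exact template wording are assumptions made for this example.

```python
def build_rag_prompt(question: str, chunks: list[str]) -> str:
    """Assemble a RAG prompt from retrieved chunks (hypothetical helper)."""
    # Number each chunk so the model and a human reader can refer to it.
    context = "\n".join(
        f"[Chunk {i}] {text}" for i, text in enumerate(chunks, start=1)
    )
    return (
        "Answer the user's question based on the following documentation:\n"
        f"{context}\n"
        f"User question: {question}"
    )


prompt = build_rag_prompt(
    "What is the data export limit?",
    [
        "Maximum 100,000 records per export...",
        "Export jobs run asynchronously in the background...",
    ],
)
print(prompt)
```

The LLM only ever sees this assembled string, which is why a wrong or missing chunk in the retrieval phase translates directly into a wrong answer.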

Intuitive Understanding of Three Retrieval Methods

Vector Retrieval (Semantic Search)

Analogy: Imagine each piece of text as a point in high-dimensional space, where semantically similar pieces of text sit close together and unrelated pieces sit far apart. Retrieval means finding the points nearest to the query point.

Advantage: Understands semantics. "Apple phone" can match "iPhone".
Disadvantage: Poor at exact matching. "Contract SH-2024-001" may not find the right document.

Best use cases:

Full-Text Retrieval (Keyword Search)

Analogy: Like Ctrl+F search, but smarter. With stemming enabled, it knows "run" and "running" share the same root, and it ranks matches by how informative the matched terms are.

Advantage: Exact keyword matching, great for codes, numbers, and proper nouns.
Disadvantage: Cannot understand semantics. Synonyms won't match unless a synonym dictionary is configured.

Best use cases:

Hybrid Retrieval

Both methods run simultaneously, and a fusion algorithm (such as RRF) merges the results. In many real-world scenarios, hybrid retrieval outperforms either method alone.

Why hybrid is better: Real user questions often have both semantic understanding needs (describing concepts with different words) and exact matching needs (referencing specific numbers and names). Hybrid retrieval handles both.

Overview of Retrieval Configuration in Dify

In the Dify knowledge base settings, you'll see these configuration options:

Retrieval Settings
├── Retrieval Mode
│   ├── Semantic Search (Vector)
│   ├── Full-text Search
│   └── Hybrid Search ← Recommended
├── Top K: 5 (recall top N chunks)
├── Score Threshold: 0.5 (discard chunks below this)
└── Reranking
    ├── Enable/Disable
    └── Select Rerank model
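How Top K and Score Threshold interact can be sketched in a few lines. This is an illustration of the two settings as described above, not Dify's internal code; the function name `select_chunks` and the sample scores are assumptions.

```python
def select_chunks(
    scored_chunks: list[tuple[str, float]],
    top_k: int = 5,
    score_threshold: float = 0.5,
) -> list[tuple[str, float]]:
    """Keep at most top_k chunks whose score is at least score_threshold."""
    # Discard chunks below the threshold, then keep the best top_k of the rest.
    kept = [item for item in scored_chunks if item[1] >= score_threshold]
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept[:top_k]


candidates = [("chunk-a", 0.82), ("chunk-b", 0.47), ("chunk-c", 0.66), ("chunk-d", 0.31)]
selected = select_chunks(candidates, top_k=5, score_threshold=0.5)
print(selected)  # [('chunk-a', 0.82), ('chunk-c', 0.66)]
```

Note that Top K is an upper bound: with a threshold in place, fewer than K chunks (possibly none) may reach the LLM.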

Beginner recommendations:


Level 2: Mechanism Deep Dive (3-5 Years Experience)

The Mathematical Principles of Vector Retrieval

Text Vectorization (Embedding)

Embedding models transform text into fixed-dimension float vectors, such that semantically similar text is close together in vector space.

Using text-embedding-3-small as an example, it outputs 1536-dimensional vectors:

from openai import OpenAI
client = OpenAI()

# Convert text to a vector
def get_embedding(text: str) -> list[float]:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding  # Returns a 1536-dimensional vector

# Example
vector_1 = get_embedding("What formats does data export support?")
vector_2 = get_embedding("What formats can I use to export data?")
vector_3 = get_embedding("What is the company registration process?")

# vector_1 and vector_2 are semantically similar — their vector distance is small
# vector_1 and vector_3 are semantically unrelated — their vector distance is large

Similarity Calculation: Cosine Similarity

Vector retrieval uses Cosine Similarity to measure the semantic similarity between two vectors:

import numpy as np

def cosine_similarity(vec_a: list[float], vec_b: list[float]) -> float:
    """
    Calculate cosine similarity between two vectors
    Returns value in [-1, 1], closer to 1 means more similar
    """
    a = np.array(vec_a)
    b = np.array(vec_b)
    
    dot_product = np.dot(a, b)
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    
    if norm_a == 0 or norm_b == 0:
        return 0.0
    
    return dot_product / (norm_a * norm_b)

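# Actual test data (approximate values); each phrase is embedded first, then compared:
# cosine_similarity(get_embedding("data export format"), get_embedding("what format to export data")) ≈ 0.92
# cosine_similarity(get_embedding("data export format"), get_embedding("company registration process")) ≈ 0.21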

Vector Indexing: Approximate Nearest Neighbor (ANN)

Document libraries typically have tens of thousands to millions of vectors. Computing similarity between the query vector and every document vector (brute-force search) would be too slow.
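For reference, brute-force search looks like the sketch below: one similarity computation per stored vector, so the cost per query grows linearly with the size of the library. The tiny 3-dimensional vectors and the function names are made up for illustration; real embeddings have hundreds or thousands of dimensions.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def brute_force_top_k(
    query: list[float], doc_vectors: list[list[float]], k: int = 3
) -> list[tuple[int, float]]:
    """Exact nearest-neighbor search: score every document, then sort."""
    scored = [(idx, cosine(query, vec)) for idx, vec in enumerate(doc_vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


doc_vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
results = brute_force_top_k([1.0, 0.0, 0.0], doc_vectors, k=2)
print(results)  # document 0 first, then document 1
```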

In practice, Approximate Nearest Neighbor (ANN) algorithms are used — sacrificing a small amount of precision for dramatically faster search speeds:

Main ANN Algorithm Comparison:

HNSW (Hierarchical Navigable Small World) — Weaviate default
  Advantages: Fast search, high precision
  Disadvantages: Slow index building, high memory consumption
  Best for: Document libraries under ~1 million entries

IVF (Inverted File Index) — Common in Milvus
  Advantages: Low memory consumption
  Disadvantages: Requires a training phase
  Best for: Document libraries over 10 million entries

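FAISS-HNSW — HNSW as implemented in FAISS, the similarity search library from Meta
  Advantages: Engineering-grade optimization
  Disadvantages: FAISS is a library, not a database, so storage, filtering, and serving must be built around it
  Best for: Massive-scale retrieval scenarios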

Dify uses Weaviate by default, which uses HNSW indexes internally. For most enterprise knowledge bases (< 1 million document chunks), this is a reasonable choice.

BM25 Full-Text Retrieval Algorithm

BM25 (Best Match 25) is a classic information retrieval algorithm and the default relevance algorithm in Elasticsearch.

BM25 core formula:

score(D, Q) = Σ IDF(qi) × TF(qi, D) × (k1 + 1) / (TF(qi, D) + k1 × (1 - b + b × |D|/avgdl))

Where:
- D: document
- Q: query terms (q1, q2, ...)
- TF(qi, D): frequency of term qi in document D
- IDF(qi): Inverse Document Frequency — measures term "rarity" (rarer terms get higher weight)
- |D|: document length
- avgdl: average document length across all documents
- k1, b: tuning parameters (k1 typically 1.2-2.0, b typically 0.75)

Intuitive understanding of IDF (Inverse Document Frequency):

If the word "the" appears in 99% of documents, its IDF is very low (carries almost no information). If "SH-2024-001" appears in only 1 document, its IDF is very high (highly discriminative). This is why BM25 performs well at matching specific proper nouns — these terms have high IDF values and thus high ranking weight.
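To make the formula concrete, here is a minimal from-scratch sketch of BM25 scoring. The formula above leaves the exact IDF definition open; this sketch assumes the smoothed variant log(1 + (N - n + 0.5) / (n + 0.5)), which never goes negative. The documents and the function name `bm25_scores` are made up for illustration.

```python
import math
from collections import Counter


def bm25_scores(
    query_terms: list[str],
    tokenized_docs: list[list[str]],
    k1: float = 1.5,
    b: float = 0.75,
) -> list[float]:
    """Score every document against the query with the BM25 formula."""
    n_docs = len(tokenized_docs)
    avgdl = sum(len(doc) for doc in tokenized_docs) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))

    scores = []
    for doc in tokenized_docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_terms:
            tf = term_freq[term]
            if tf == 0:
                continue
            n = doc_freq[term]
            idf = math.log(1 + (n_docs - n + 0.5) / (n + 0.5))  # assumed IDF variant
            length_norm = 1 - b + b * len(doc) / avgdl
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


docs = [
    "the export limit is 100000 records".split(),
    "the contract sh-2024-001 covers the export service".split(),
    "the registration process uses the email address".split(),
]
scores = bm25_scores(["sh-2024-001", "the"], docs)
print(scores)
```

Every document contains "the", so that term contributes little; the second document wins because it is the only one containing the rare term "sh-2024-001".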

Using BM25 in Python:

import re

from rank_bm25 import BM25Okapi

# Document collection
documents = [
    "Data export supports three formats: CSV, Excel, and JSON",
    "Maximum 100,000 records per single export",
    "Export jobs run asynchronously in the background, email notification on completion",
    "Account registration with email or phone number",
]

def tokenize(text: str) -> list[str]:
    # Lowercase and drop punctuation, so "formats:" becomes "formats"
    return re.findall(r"\w+", text.lower())

tokenized_docs = [tokenize(doc) for doc in documents]

# Build BM25 index
bm25 = BM25Okapi(tokenized_docs)

# Query
query = "data export format"
scores = bm25.get_scores(tokenize(query))

# Results (higher score = more relevant):
# Document 0 (contains "data" and "export"): Highest score
# Documents 1 and 2 (contain only "export"): Lower scores
# Document 3 (no query terms): Score of 0
# Note: "format" does not match "formats" because this tokenizer does no stemming

Hybrid Search: The RRF Algorithm

Hybrid search needs to merge vector search results and full-text search results. Dify uses RRF (Reciprocal Rank Fusion):

from dataclasses import dataclass, field

@dataclass
class Document:
    """Minimal stand-in for a retrieved chunk (used only in this example)."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def reciprocal_rank_fusion(
    results_list: list[list[Document]],
    k: int = 60
) -> list[Document]:
    """
    RRF algorithm: merge multiple ranked lists
    
    k: smoothing parameter (typically 60, prevents first-rank from being too dominant)
    """
    
    # Track each document's RRF score
    rrf_scores: dict[str, float] = {}
    
    for ranked_list in results_list:
        for rank, doc in enumerate(ranked_list, start=1):
            doc_id = doc.metadata["chunk_id"]
            if doc_id not in rrf_scores:
                rrf_scores[doc_id] = 0.0
            # RRF formula: 1 / (k + rank)
            rrf_scores[doc_id] += 1.0 / (k + rank)
    
    # Sort by RRF score
    sorted_docs = sorted(rrf_scores.items(), key=lambda x: x[1], reverse=True)
    
    # Return re-ranked document list
    doc_map = {doc.metadata["chunk_id"]: doc for rl in results_list for doc in rl}
    return [doc_map[doc_id] for doc_id, _ in sorted_docs]

# Example:
doc_A, doc_B, doc_C, doc_D, doc_E = (
    Document(page_content=f"chunk {name}", metadata={"chunk_id": name})
    for name in "ABCDE"
)
vector_results = [doc_A, doc_B, doc_C, doc_D]   # Vector retrieval, sorted by similarity
keyword_results = [doc_C, doc_A, doc_E, doc_B]   # Keyword retrieval, sorted by BM25 score

# RRF calculation:
# doc_A: 1/(60+1) + 1/(60+2) ≈ 0.0325 (high in both lists)
# doc_C: 1/(60+3) + 1/(60+1) ≈ 0.0323 (high in both lists)
# doc_B: 1/(60+2) + 1/(60+4) ≈ 0.0318
# doc_D: 1/(60+4) + 0 ≈ 0.0156 (only in vector list)
# doc_E: 0 + 1/(60+3) ≈ 0.0159 (only in keyword list)

merged_results = reciprocal_rank_fusion([vector_results, keyword_results])
# Final order: doc_A, doc_C, doc_B, doc_E, doc_D

Why RRF is better than weighted averaging:

The problem with weighted averaging: you need to manually set weights (e.g., 70% vector + 30% keyword), and this weight is hard to determine — the optimal weight varies by query type.

RRF advantages: It considers only rank, not the raw scores from each retrieval method. Different methods return scores on very different scales (cosine similarity never exceeds 1, while BM25 scores are unbounded positive numbers). RRF removes this inconsistency by reducing every result to its rank.
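For contrast, the weighted-averaging approach can be sketched as follows. Because the two score scales differ, the scores must be normalized before they can be mixed; min-max normalization is assumed here, and the scores and function names are made up for illustration.

```python
def min_max_normalize(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1] so different scales can be combined."""
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - low) / (high - low) for doc_id, s in scores.items()}


def weighted_fusion(
    vector_scores: dict[str, float],
    keyword_scores: dict[str, float],
    vector_weight: float = 0.7,
) -> list[str]:
    """Rank documents by a weighted sum of normalized scores."""
    v = min_max_normalize(vector_scores)
    kw = min_max_normalize(keyword_scores)
    fused = {
        doc_id: vector_weight * v.get(doc_id, 0.0)
        + (1 - vector_weight) * kw.get(doc_id, 0.0)
        for doc_id in set(v) | set(kw)
    }
    return sorted(fused, key=fused.get, reverse=True)


vector_scores = {"doc_A": 0.91, "doc_B": 0.88, "doc_C": 0.85}   # cosine similarity
keyword_scores = {"doc_C": 14.2, "doc_A": 9.7, "doc_E": 3.1}    # BM25

print(weighted_fusion(vector_scores, keyword_scores, vector_weight=0.7))
# ['doc_A', 'doc_B', 'doc_C', 'doc_E']
print(weighted_fusion(vector_scores, keyword_scores, vector_weight=0.3))
# ['doc_A', 'doc_C', 'doc_B', 'doc_E']
```

The same inputs produce a different order for doc_B and doc_C depending on the weight, which is the tuning burden that rank-based fusion avoids.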

Reranker: The Value of Precision Ranking

The retrieval pipeline typically has two phases:

Coarse Retrieval (Recall): Quickly recall candidate set from many documents (vector/keyword/hybrid)
       ↓
Fine Ranking (Rerank): Re-sort candidates with high precision

A Reranker is a model specifically trained to judge "how relevant a document is to a query." Its core difference from Embedding models:

Embedding Model: Independently encodes query and document, measures relevance via vector distance
              Query → [Vector A]
              Doc   → [Vector B]
              Similarity = cosine(A, B)

Reranker Model: Concatenates query and document, directly judges relevance (Cross-Encoder)
              [Query + Doc] → [Relevance score 0-1]
              
              Advantage: Can capture interaction information between query and document, higher precision
              Disadvantage: Requires separate inference for each candidate document — slow
                           (not suitable for large-scale coarse retrieval)
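The two-phase structure can be sketched as below. A real Reranker is a trained cross-encoder model; the `toy_overlap_scorer` here is only a stand-in (an assumption for this example) so that the code runs without any model, and all function names are hypothetical.

```python
from typing import Callable


def retrieve_then_rerank(
    query: str,
    candidates: list[str],
    score_pair: Callable[[str, str], float],
    top_n: int = 3,
) -> list[tuple[str, float]]:
    """Re-sort a small recalled candidate set with a pairwise scorer."""
    # One scorer call per (query, candidate) pair. With a real cross-encoder
    # each call is a model inference, which is why this runs only on the
    # candidates returned by coarse retrieval, never on the whole library.
    scored = [(doc, score_pair(query, doc)) for doc in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


def toy_overlap_scorer(query: str, doc: str) -> float:
    """Stand-in for a cross-encoder: fraction of query words found in the doc."""
    query_words = set(query.lower().split())
    doc_words = set(doc.lower().split())
    return len(query_words & doc_words) / len(query_words)


candidates = [
    "export jobs run asynchronously in the background",
    "maximum 100,000 records per single export",
    "account registration with email or phone number",
]
reranked = retrieve_then_rerank(
    "maximum records per export", candidates, toy_overlap_scorer, top_n=2
)
print(reranked)
```

Passing the scorer in as a function keeps the pipeline unchanged when the toy scorer is swapped for a real Reranker model.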

Reranker models supported in Dify:

Model               Provider          Characteristics                 Use Cases
bge-reranker-v2-m3  Local deployment  Strong multilingual support     First choice
cohere-rerank-3     Cohere            Multilingual, commercial-grade  Commercial projects
Jina Reranker v2    Jina AI           Multilingual                    Alternative

Actual Reranker effect (Recall@3 metric in knowledge Q&A scenarios):

Vector search only:                    78.3%
Vector search + Reranker:              85.1%
Hybrid search:                         83.7%
Hybrid search + Reranker:              89.4%  ← Usually optimal

Level 3: Source Code and Principles (5+ Years Experience)

Weaviate Vector Storage and Retrieval

Dify uses Weaviate as the default vector database. Each knowledge base in Weaviate corresponds to a "Collection," with document chunks stored as objects:

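# Core logic for Dify storing document chunks to Weaviate (simplified)
# Note: this uses the weaviate-client v3 API (weaviate.Client); the v4 client has a different interface
import weaviate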

client = weaviate.Client("http://localhost:8080")

# Each knowledge base corresponds to a Weaviate class
collection_name = f"Dataset_{dataset_id.replace('-', '_')}"

# Store a document chunk
client.data_object.create(
    data_object={
        "text": chunk_text,          # Original text
        "doc_id": document_id,       # Document ID
        "doc_hash": document_hash,   # Document content hash (for deduplication)
        "dataset_id": dataset_id,    # Knowledge base ID
        "index_node_id": chunk_id,   # Chunk unique ID
        "index_node_hash": chunk_hash,
    },
    class_name=collection_name,
    vector=embedding_vector,  # Vector (generated by Embedding model)
    uuid=chunk_uuid
)

# Vector retrieval
results = client.query.get(
    collection_name,
    ["text", "doc_id", "index_node_id"]
).with_near_vector({
    "vector": query_vector
}).with_limit(top_k).with_additional(
    ["certainty", "distance"]  # Return similarity information
).do()

Weaviate HNSW index configuration (configurable at knowledge base creation):

# HNSW parameters and their performance impact
collection_config = {
    "class": collection_name,
    "vectorizer": "none",  # Use custom vectors, not Weaviate's built-in vectorizer
    "vectorIndexConfig": {
        "distance": "cosine",
        "ef": 64,             # Candidates scanned during search (larger = more accurate but slower)
        "efConstruction": 128, # Candidates during index building (affects index quality)
        "maxConnections": 64,  # Max connections per node (balances memory vs. speed)
        "dynamicEfMin": 25,
        "dynamicEfMax": 500,
    }
}

Key parameter tradeoffs:

Full-text Retrieval Implementation: Dify + PostgreSQL

Dify's full-text retrieval is based on PostgreSQL's tsvector type and GIN indexes:

-- Document segments table in Dify database (simplified)
CREATE TABLE document_segments (
    id UUID PRIMARY KEY,
    document_id UUID NOT NULL,
    dataset_id UUID NOT NULL,
    content TEXT NOT NULL,
    
    -- Full-text retrieval: tsvector is PostgreSQL's document format for full-text search
    -- to_tsvector processes and normalizes tokens
    full_text_search_vector TSVECTOR GENERATED ALWAYS AS (
        to_tsvector('simple', content)  -- 'simple' keeps original terms without stemming
    ) STORED,
    
    created_at TIMESTAMP DEFAULT NOW()
);

-- GIN index speeds up full-text search significantly
CREATE INDEX ON document_segments USING GIN (full_text_search_vector);

-- Full-text search query example
SELECT id, content, 
    ts_rank_cd(full_text_search_vector, query) AS score
FROM document_segments,
    to_tsquery('simple', 'data & export') query  -- AND query
WHERE full_text_search_vector @@ query
    AND dataset_id = 'your-dataset-id'
ORDER BY score DESC
LIMIT 5;

Chinese full-text retrieval challenges:

PostgreSQL doesn't natively support Chinese word segmentation. Dify handles Chinese full-text retrieval by:

  1. Splitting Chinese text character by character (each character as a token)
  2. Or using the pg_jieba extension for Chinese word segmentation

This is the fundamental reason why Chinese full-text retrieval has limited effectiveness without proper word segmentation.

Impact of Chunking Strategy on Retrieval Quality

Chunking is a severely underappreciated part of RAG. The chunking strategy directly affects retrieval quality.

Main chunking strategies compared:

import re

# Strategy 1: Fixed-size chunking (Dify default)
def fixed_size_chunking(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):
            break  # Last chunk reached; avoid a tail chunk that only repeats the overlap
        start = end - overlap  # Overlap ensures semantic continuity
    return chunks

# Strategy 2: Semantic chunking (split at sentence/paragraph boundaries)
def semantic_chunking(text: str, max_chars: int = 500) -> list[str]:
    # Split by paragraph (blank line) and drop empty paragraphs
    paragraphs = [p.strip() for p in re.split(r'\n\s*\n', text) if p.strip()]
    
    chunks = []
    current_chunk = ""
    
    for para in paragraphs:
        if not current_chunk:
            current_chunk = para
        elif len(current_chunk) + len(para) + 2 <= max_chars:
            current_chunk += "\n\n" + para  # Merge short paragraphs into one chunk
        else:
            chunks.append(current_chunk)
            current_chunk = para  # A paragraph longer than max_chars is kept whole
    
    if current_chunk:
        chunks.append(current_chunk)
    
    return chunks

# Strategy 3: Hierarchical chunking (parent + child documents)
# Large chunks (1000 chars) for context, small chunks (200 chars) for retrieval
# Available as the "Parent-Child" chunking option in Dify (introduced in v0.15)
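Strategy 3 can be sketched as follows. This is a simplified character-based illustration of the parent-child idea, not Dify's implementation; the function name and the record layout are assumptions.

```python
def parent_child_chunks(
    text: str, parent_size: int = 1000, child_size: int = 200
) -> list[dict]:
    """Split text into parent chunks, then split each parent into children."""
    records = []
    for parent_id, p_start in enumerate(range(0, len(text), parent_size)):
        parent = text[p_start:p_start + parent_size]
        for c_start in range(0, len(parent), child_size):
            records.append({
                "parent_id": parent_id,
                # The small child is what gets embedded and matched.
                "child_text": parent[c_start:c_start + child_size],
                # The large parent is what gets passed to the LLM as context.
                "parent_text": parent,
            })
    return records


records = parent_child_chunks("x" * 2300)
print(len(records))                           # 12 child chunks
print(len({r["parent_id"] for r in records}))  # 3 parent chunks
```

Small children make the match precise, while returning the parent gives the LLM enough surrounding context to answer.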

Measured effectiveness of chunking strategies (1000-page technical document, Q&A accuracy):

Fixed size (500 chars, no overlap):    72%
Fixed size (500 chars, 50 overlap):    78%  ← Dify default
Semantic chunking (by paragraph):     82%
Hierarchical chunking (parent-child): 85%
No chunking (for inherently short docs): 88% (only applicable when docs are already short)

Conclusion: If your knowledge base performance is unsatisfactory, first consider improving your chunking strategy rather than blindly adjusting retrieval parameters.


Level 4: Production Pitfalls and Decision Making (Expert Perspective)

Pitfall 1: The Score Threshold Dilemma

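Setting too low (e.g., 0.3): irrelevant chunks pass the filter and enter the LLM's context as noise, which raises the risk of off-topic or fabricated answers.

Setting too high (e.g., 0.8): relevant chunks are discarded, so the LLM may receive no usable context even when the documentation covers the question.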

How to find the right threshold:

# Diagnostic method: analyze retrieval score distribution
# (`retrieve` is a placeholder for your own retrieval call against the knowledge base)
def analyze_retrieval_scores(knowledge_base_id: str, test_queries: list[str]):
    for query in test_queries:
        results = retrieve(knowledge_base_id, query, top_k=10, score_threshold=0)
        
        print(f"\nQuery: {query}")
        for result in results:
            relevance = input(f"Is chunk '{result.text[:50]}...' relevant? (y/n): ")
            if relevance == 'y':
                print(f"  Relevant chunk score: {result.score:.3f}")
            else:
                print(f"  Irrelevant chunk score: {result.score:.3f}")
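Once chunks have been labeled as relevant or irrelevant, a threshold can be chosen from the data rather than guessed. The sketch below picks the threshold that maximizes F1 over the labeled scores; using F1 as the criterion is an assumption of this example, and the labeled scores are made up.

```python
def best_threshold(labeled_scores: list[tuple[float, bool]]) -> tuple[float, float]:
    """Return (threshold, f1) for the candidate threshold with the highest F1."""
    total_relevant = sum(1 for _, relevant in labeled_scores if relevant)
    best = (0.0, 0.0)
    # Every observed score is a candidate threshold.
    for threshold in sorted({score for score, _ in labeled_scores}):
        kept = [(s, r) for s, r in labeled_scores if s >= threshold]
        true_pos = sum(1 for _, relevant in kept if relevant)
        if true_pos == 0:
            continue
        precision = true_pos / len(kept)
        recall = true_pos / total_relevant
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best[1]:
            best = (threshold, f1)
    return best


# (score, is_relevant) pairs collected from manual labeling (illustrative data)
labeled = [
    (0.82, True), (0.74, True), (0.61, True), (0.58, False),
    (0.49, True), (0.44, False), (0.37, False), (0.29, False),
]
threshold, f1 = best_threshold(labeled)
print(threshold, round(f1, 3))  # 0.49 0.889
```

If missing a relevant chunk is costlier than admitting noise in your application, weight recall more heavily than this balanced F1 does.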

Recommended threshold ranges in practice:

Pitfall 2: The Root Cause of Knowledge Base "Hallucinated Retrieval"

Why does vector retrieval sometimes return chunks even when the documentation contains nothing relevant?

The reason: Vector retrieval is a relative ranking, not an absolute judgment. Even when a query is completely unrelated to all document chunks, vector retrieval will still return "the most relevant" chunks (which all just have low scores). Without a score threshold, these low-score chunks get passed into the LLM's context, and the LLM may "hallucinate" answers based on them.

Solutions:

  1. Set a reasonable similarity threshold (required)
  2. Explicitly instruct in the system prompt: if retrieved content is irrelevant, don't answer
  3. Add a "retrieval quality validation" node in the workflow

Retrieval quality validation in workflow:

[Knowledge Retrieval Node] → Get retrieval results
      ↓
[LLM Node: Relevance Judgment]
  Prompt: Determine if the following retrieved results are relevant to the user's question
  Input: {{user_question}} + {{retrieval_results}}
  Output: {"is_relevant": true/false, "reason": "..."}
      ↓
[IF/ELSE Branch]
  ├── is_relevant == true  → [LLM Generate Answer]
  └── is_relevant == false → Directly return "No relevant information in documentation"
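The branching step of this workflow can be sketched in code. Only the routing logic is shown; the judge's JSON output is assumed to follow the `{"is_relevant": ..., "reason": ...}` shape above, and the function name `route_on_relevance` is hypothetical.

```python
import json


def route_on_relevance(judge_output: str) -> str:
    """Return 'generate' or 'fallback' based on the judge LLM's JSON output."""
    try:
        verdict = json.loads(judge_output)
    except json.JSONDecodeError:
        # Fail closed: unparseable judge output is treated as not relevant.
        return "fallback"
    if isinstance(verdict, dict) and verdict.get("is_relevant") is True:
        return "generate"
    return "fallback"


print(route_on_relevance('{"is_relevant": true, "reason": "chunk covers export limits"}'))
# generate
print(route_on_relevance('{"is_relevant": false, "reason": "chunks are about billing"}'))
# fallback
print(route_on_relevance("Sorry, I cannot decide."))
# fallback
```

Routing to the fallback on malformed output trades a few unnecessary refusals for protection against answers built on unverified context.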

Pitfall 3: The Cost of Switching Embedding Models

If you want to switch Embedding models after the knowledge base is established (e.g., from text-embedding-ada-002 to text-embedding-3-small), you face a serious problem: all documents must be reprocessed.

Reason: Different Embedding models have different vector spaces that can't be mixed. Vectors generated by Model A cannot be directly compared with query vectors generated by Model B.

Cost assessment:

A knowledge base with 1 million chunks, cost of switching Embedding models:

Embedding cost (text-embedding-3-small at $0.02/1M tokens):
  Assuming average 100 tokens per chunk:
  1M × 100 tokens / 1M × $0.02 = $2

Time cost (batch processing speed ~1,000 chunks/minute):
  1M / 1,000 = 1,000 minutes ≈ 17 hours

Vector database rebuild time: ~2-4 hours

Total downtime: Requires maintenance window or dual-write strategy
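The arithmetic above can be wrapped in a small estimator so you can plug in your own numbers. The default values are the ones used in this example; the function name is hypothetical, and actual prices and throughput depend on your provider and setup.

```python
def reembedding_estimate(
    num_chunks: int,
    avg_tokens_per_chunk: int = 100,
    price_per_million_tokens: float = 0.02,
    chunks_per_minute: int = 1000,
) -> dict[str, float]:
    """Estimate the API cost and processing time of re-embedding a knowledge base."""
    total_tokens = num_chunks * avg_tokens_per_chunk
    return {
        "embedding_cost_usd": total_tokens / 1_000_000 * price_per_million_tokens,
        "embedding_hours": num_chunks / chunks_per_minute / 60,
    }


estimate = reembedding_estimate(1_000_000)
print(f"Cost: ${estimate['embedding_cost_usd']:.2f}")    # Cost: $2.00
print(f"Time: {estimate['embedding_hours']:.1f} hours")  # Time: 16.7 hours
```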

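Recommendation: Choose your Embedding model carefully at project start. In this example the API cost of re-embedding is small; the real cost of switching in production is the reprocessing time, the index rebuild, and the downtime or dual-write complexity.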

RAG Quality Evaluation Framework

Before going live, evaluate RAG quality using these metrics:

# Core RAGAS evaluation framework metrics
class RAGEvaluation:
    
    def faithfulness(self, answer: str, contexts: list[str]) -> float:
        """
        Faithfulness: Can every statement in the answer be found in the retrieved contexts?
        Score: 0-1, higher is better
        Method: Have LLM check if each sentence has contextual support
        """
        pass
    
    def answer_relevancy(self, answer: str, question: str) -> float:
        """
        Answer relevancy: Does the answer actually address the question?
        Method: Have LLM generate a question based on the answer, check similarity to original
        """
        pass
    
    def context_precision(self, contexts: list[str], question: str) -> float:
        """
        Context precision: Of the retrieved document chunks, how many are truly relevant?
        (Measures noise level)
        """
        pass
    
    def context_recall(self, contexts: list[str], ground_truth: str) -> float:
        """
        Context recall: Of the information needed to answer correctly, how much was retrieved?
        (Measures omission level)
        """
        pass

# Using the RAGAS library:
# pip install ragas
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

results = evaluate(
    dataset=your_test_dataset,
    metrics=[faithfulness, answer_relevancy, context_precision],
    llm=your_llm,
    embeddings=your_embeddings
)
print(results)

Practical metrics for production (quick evaluation without RAGAS):

Metric 1: Hit Rate
  = Proportion of cases where top-K retrieval results contain the correct answer
  Target: > 85%

Metric 2: MRR (Mean Reciprocal Rank)
  = Average over all test queries of 1 / rank of the first correct chunk (0 if it is not retrieved)
  Target: > 0.7 (the correct chunk usually appears at rank 1 or 2)

Metric 3: User satisfaction (if feedback mechanism exists)
  = User upvotes / total conversations
  Target: > 80%

Metric 4: Correct rejection rate (for out-of-scope questions)
  = Proportion correctly identified as "no relevant information in documentation"
  Target: > 90%
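Metrics 1 and 2 are simple to compute once you have a test set of questions with known correct chunks. The sketch below assumes one correct chunk ID per query; the function name and the sample data are made up for illustration.

```python
def hit_rate_and_mrr(
    ranked_results: list[list[str]], correct_ids: list[str], k: int = 5
) -> tuple[float, float]:
    """Compute Hit Rate@k and MRR@k over a set of test queries."""
    hits = 0
    reciprocal_ranks = []
    for retrieved, correct in zip(ranked_results, correct_ids):
        top_k = retrieved[:k]
        if correct in top_k:
            hits += 1
            reciprocal_ranks.append(1.0 / (top_k.index(correct) + 1))
        else:
            reciprocal_ranks.append(0.0)  # A miss contributes zero
    n = len(correct_ids)
    return hits / n, sum(reciprocal_ranks) / n


# Retrieved chunk IDs per query (best first) and the correct chunk for each query
ranked = [
    ["c1", "c7", "c3"],   # correct chunk at rank 1
    ["c4", "c2", "c9"],   # correct chunk at rank 2
    ["c5", "c6", "c8"],   # correct chunk not retrieved
    ["c2", "c1", "c3"],   # correct chunk at rank 3
]
correct = ["c1", "c2", "c0", "c3"]

hit_rate, mrr = hit_rate_and_mrr(ranked, correct, k=3)
print(f"Hit Rate: {hit_rate:.2f}, MRR: {mrr:.3f}")  # Hit Rate: 0.75, MRR: 0.458
```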

Chapter Summary

RAG isn't a "configure and forget" system — it requires continuous optimization for specific business scenarios. Retrieval quality depends on three key factors: document quality and chunking strategy, retrieval method and parameter configuration, and Reranker usage.

Key Takeaways:

  1. Vector retrieval understands semantics; full-text retrieval matches keywords: Their advantages are complementary — hybrid retrieval is the optimal choice for most scenarios
  2. RRF is a robust default for merging retrieval results: Unlike weighted averaging, RRF doesn't require manually tuning weights or normalizing scores
  3. Reranker is the key to precision improvement: Hybrid retrieval + Reranker typically outperforms pure hybrid retrieval by 5-10 percentage points
  4. Chunking strategy is severely underestimated: If retrieval quality is poor, look at chunking first, then tune parameters
  5. Score threshold needs data-driven calibration: Different document types have different reasonable threshold ranges
  6. Embedding model is hard to change once chosen: Make a careful selection at project start, considering multilingual capability and cost

The next chapter moves into hands-on knowledge base construction: specific configuration of document processing and chunking strategies, and production knowledge base management best practices.
