Chapter 7

Advanced RAG Tuning: Recall Rate, Relevance Scoring and Reranking

Building a RAG system is only the first step. Making it work reliably in production requires systematic tuning, and this chapter lays out a complete methodology, from metric definition to Rerank refinement.

Chapter Overview

Many teams build RAG knowledge-base Q&A systems only to find persistent problems: a user asks a question clearly answerable from the documents, yet the system responds "no relevant information found." Or the system retrieves a pile of irrelevant chunks, causing the model to hallucinate. These two failure modes — insufficient recall and poor precision — are the most common ways RAG systems fail in production.

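This chapter provides a systematic RAG tuning methodology, organized into four levels: fundamentals, retrieval mechanisms, source code and principles, and production pitfalls.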


Level 1: Fundamentals (1–3 Years Experience)

1.1 Three Dimensions of RAG Quality

To understand RAG quality issues, first distinguish three separate dimensions:

Recall: Did the system retrieve the document chunks containing the answer? This is foundational. If relevant content is never retrieved, everything downstream is futile.

Precision: Of the retrieved content, how much is truly relevant? If you feed the model 20 chunks with 18 being noise, it will likely be misled.

Generation Quality: Given adequate recall and precision, can the model correctly understand and generate an accurate answer?

These are sequential dependencies. Recall is the prerequisite, precision is the filter, generation is the output. Always debug in this order — don't optimize prompts when recall isn't guaranteed.

1.2 Using Dify's Evaluation Logs to Diagnose Problems

Dify's "Logs & Annotations" feature is the most direct debugging tool. For every user query, you can inspect the retrieval results, including which chunks were retrieved and the score each one received.

Quick diagnosis workflow:

  1. Ask a question you know is answerable from the documents
  2. Open the logs and examine retrieval results
  3. If relevant chunks have low scores (< 0.5) or were not retrieved at all — this is a recall problem
  4. If relevant chunks were retrieved but drowned out by irrelevant ones — this is a precision problem
  5. If both above look fine but the answer is still wrong — this is a generation quality problem
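The workflow above can be sketched as a small triage function. This is only an illustration: the `(chunk_id, score)` log format, the function name, and the `max_noise_ratio` cutoff are assumptions made for the example, not part of Dify's log schema. The 0.5 score cutoff mirrors step 3.

```python
def diagnose(retrieved, relevant_ids, answer_correct,
             min_score=0.5, max_noise_ratio=0.5):
    """Classify a test query following the log-inspection steps above.

    retrieved: list of (chunk_id, score) pairs copied from the retrieval log
               (hypothetical format for this sketch).
    relevant_ids: set of chunk ids known to contain the answer.
    answer_correct: whether the final answer was judged correct.
    """
    hits = [(cid, score) for cid, score in retrieved if cid in relevant_ids]

    # Step 3: relevant chunks missing, or all of them scored below the cutoff
    if not hits or all(score < min_score for _, score in hits):
        return "recall"

    # Step 4: relevant chunks are present but outnumbered by irrelevant ones
    noise_ratio = 1 - len(hits) / len(retrieved)
    if noise_ratio > max_noise_ratio:
        return "precision"

    # Step 5: retrieval looks fine, so a wrong answer points at generation
    return "ok" if answer_correct else "generation"


log = [("c1", 0.82), ("c2", 0.31), ("c3", 0.28)]
verdict = diagnose(log, relevant_ids={"c1"}, answer_correct=False)
```

Here `verdict` is `"precision"`, because the one relevant chunk is outnumbered by two irrelevant ones.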

1.3 How Vector Search Works (Intuitive Understanding)

Vector search converts each text segment into a point in high-dimensional space (a vector). Text with similar meaning occupies nearby positions in this space. At query time, the user's question is also converted to a point, and the K nearest text chunks are found.

Analogy: Imagine all documents as cities on a map, where "semantic similarity" means "geographic proximity." The user's question is a GPS coordinate, and the system finds the nearest cities.
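To make the "nearest points" idea concrete, here is a minimal brute-force sketch using cosine similarity. The three-dimensional vectors and chunk names are made up for illustration. Real embeddings have hundreds or thousands of dimensions, and production systems use an approximate nearest-neighbor index rather than scanning every chunk.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k):
    """Return the k chunks most similar to the query, best first."""
    scored = [(cid, cosine_similarity(query_vec, vec))
              for cid, vec in chunk_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy "embeddings" (illustrative values only)
chunks = {
    "refund_policy": [0.9, 0.1, 0.0],
    "vip_program": [0.2, 0.9, 0.1],
    "shipping_times": [0.0, 0.2, 0.9],
}
query = [0.8, 0.3, 0.0]

nearest = top_k(query, chunks, k=2)
```

With these toy values, `nearest` lists `refund_policy` first and `vip_program` second.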

The problem: proximity does not guarantee true relevance. "Apple phone" and "apple price" might both be close to "apple" in vector space, but one discusses iPhone while the other discusses fruit prices — neither answers a question about "Apple Inc. stock price."

This is exactly why Rerank is needed — it applies a more precise model as a second-pass filter.

1.4 Three Retrieval Modes in Dify

In knowledge base settings, Dify provides three retrieval modes:

Vector Search

Full-Text Search (BM25)

Hybrid Search

Recommendation: In production, default to hybrid search, then enable Rerank.

1.5 Setting Sensible Top-K and Score Thresholds

Top-K: Controls how many document chunks are retrieved. Too small misses information; too large introduces noise.

Score threshold: Chunks below this score are filtered out.

These parameters are adjustable in Dify under Knowledge Base → Retrieval Settings.
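The interaction between the two parameters can be sketched as follows. The function name and the order of operations (threshold first, then Top-K) are assumptions made for this illustration, not a description of Dify's internals.

```python
def select_chunks(scored_chunks, top_k, score_threshold=None):
    """Keep at most top_k chunks, dropping any below the score threshold.

    scored_chunks: list of (chunk_id, score) pairs.
    score_threshold: None disables threshold filtering.
    """
    ranked = sorted(scored_chunks, key=lambda pair: pair[1], reverse=True)
    if score_threshold is not None:
        ranked = [pair for pair in ranked if pair[1] >= score_threshold]
    return ranked[:top_k]


candidates = [("a", 0.91), ("b", 0.64), ("c", 0.42), ("d", 0.18)]

loose = select_chunks(candidates, top_k=3, score_threshold=0.3)
strict = select_chunks(candidates, top_k=3, score_threshold=0.5)
```

With the looser threshold, chunks `a`, `b`, and `c` survive. Raising the threshold to 0.5 drops `c`, even though Top-K would still allow a third chunk.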


Level 2: Mechanisms in Depth (3–5 Years Experience)

2.1 Building a RAG Evaluation Benchmark

Systematic tuning requires quantifiable evaluation data. Steps to build a benchmark:

Step 1: Construct QA Pairs

Manually create 50–100 QA pairs from your documents, covering several question categories (the example below tags each pair as "direct" or "multi_hop"):

# Example evaluation dataset format
qa_pairs = [
    {
        "question": "What is the company's refund policy?",
        "expected_answer": "7-day no-questions-asked return",
        "relevant_chunk_ids": ["doc_001_chunk_05", "doc_001_chunk_06"],
        "category": "direct"
    },
    {
        "question": "How does the VIP refund policy differ from standard?",
        "expected_answer": "VIP users get 30-day returns",
        "relevant_chunk_ids": ["doc_001_chunk_05", "doc_002_chunk_12"],
        "category": "multi_hop"
    }
]

Step 2: Define Evaluation Metrics

Core metrics:
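As a sketch of the two retrieval metrics named in the chapter summary, Recall@K and MRR, the functions below score a benchmark run. The input format (one list of retrieved chunk IDs plus one list of relevant chunk IDs per question) is an assumption chosen to match the `relevant_chunk_ids` field in the dataset example.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant chunks that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    top = set(retrieved_ids[:k])
    return len(top & set(relevant_ids)) / len(relevant_ids)


def reciprocal_rank(retrieved_ids, relevant_ids):
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    relevant = set(relevant_ids)
    for rank, chunk_id in enumerate(retrieved_ids, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate_retrieval(runs, k=5):
    """runs: list of (retrieved_ids, relevant_ids), one per benchmark question."""
    n = len(runs)
    return {
        f"recall@{k}": sum(recall_at_k(r, rel, k) for r, rel in runs) / n,
        "mrr": sum(reciprocal_rank(r, rel) for r, rel in runs) / n,
    }


runs = [
    (["doc_001_chunk_05", "doc_003_chunk_01", "doc_001_chunk_06"],
     ["doc_001_chunk_05", "doc_001_chunk_06"]),
    (["doc_004_chunk_02", "doc_002_chunk_12", "doc_009_chunk_03"],
     ["doc_001_chunk_05", "doc_002_chunk_12"]),
]
metrics = evaluate_retrieval(runs, k=3)
```

In this toy run the first question retrieves both relevant chunks and the second retrieves one of two, so Recall@3 averages 0.75. The first relevant chunk appears at rank 1 and rank 2 respectively, so MRR is also 0.75.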

For generation quality:

2.2 Hybrid Search and RRF Fusion

RRF (Reciprocal Rank Fusion) is the core of Dify's hybrid search. The formula:

RRF_score(d) = sum of  1 / (k + rank_i(d))  across all retrieval paths i

Where k is typically 60, and rank_i(d) is document d's rank in retrieval path i.

Why use RRF instead of direct weighted scoring?

Vector search cosine similarity and BM25 scores have completely different distributions, so they cannot be added directly. Cosine similarity is bounded in [-1, 1] (and usually falls in [0, 1] for text embeddings), while BM25 scores are unbounded and can exceed 20. Direct weighting would let BM25 dominate.

RRF converts both result sets into "rankings," eliminating unit mismatch. Research shows that even without careful tuning, RRF outperforms single-path retrieval in most scenarios.
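The formula can be worked through in a few lines. This is a minimal sketch of RRF as defined above, with k = 60. It is not Dify's implementation, and the document IDs are placeholders.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60, top_n=None):
    """Fuse several rankings: score(d) = sum over lists of 1 / (k + rank(d)).

    ranked_lists: list of rankings, each a list of doc ids, best first.
    Documents missing from a list simply contribute nothing for that list.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    fused = sorted(scores.items(), key=lambda pair: pair[1], reverse=True)
    return fused if top_n is None else fused[:top_n]


vector_ranking = ["d1", "d2", "d3"]   # from vector search
keyword_ranking = ["d3", "d1", "d4"]  # from BM25

fused = reciprocal_rank_fusion([vector_ranking, keyword_ranking])
```

Here d1 scores 1/61 + 1/62 and d3 scores 1/63 + 1/61, so documents found by both paths rise above d2 and d4, which each appear in only one ranking. No raw similarity or BM25 score is needed.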

Configuration example (Dify API mode):

{
  "retrieval_model": {
    "search_method": "hybrid_search",
    "reranking_enable": true,
    "reranking_model": {
      "reranking_provider_name": "cohere",
      "reranking_model_name": "rerank-multilingual-v3.0"
    },
    "top_k": 10,
    "score_threshold_enabled": true,
    "score_threshold": 0.3
  }
}

2.3 How Rerank Models Work

Rerank is a cross-encoder model specialized in judging "query-document" relevance.

Fundamental difference from vector search:

| Feature | Vector Search (Bi-encoder) | Rerank (Cross-encoder) |
| --- | --- | --- |
| Encoding | Query and document encoded separately | Query + document concatenated, encoded jointly |
| Relevance understanding | Overall semantic similarity | Fine-grained word-level interaction |
| Speed | Very fast (pre-computed vectors) | Slow (computed per pair in real time) |
| Accuracy | Medium | High |
| Typical use | Coarse recall (retrieve 100 candidates) | Fine ranking (select Top-5 from 100) |

Rerank models handle complex relevance judgments that bi-encoders struggle with:

In Dify, you can integrate these Rerank providers:

Cohere Rerank

Jina Rerank

Local Deployment (Recommended): BAAI/bge-reranker-v2-m3

# Deploy BGE Reranker via Xinference
xinference launch \
  --model-name bge-reranker-v2-m3 \
  --model-type rerank \
  --device cuda

In Dify: Settings → Model Providers → Add local Xinference endpoint to use local Rerank.

2.5 Impact of Chunking Strategy on Recall Quality

Chunking is severely underestimated in the RAG pipeline. The chunking approach directly determines the ceiling of recall quality.

Fixed-size Chunking
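A minimal sketch of fixed-size chunking, assuming a character-based window with overlap between neighboring chunks. The parameter names and defaults are illustrative. Real pipelines usually count tokens rather than characters.

```python
def fixed_size_chunks(text, chunk_size=500, overlap=50):
    """Split text into windows of chunk_size characters that overlap by `overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the window has reached the end, to avoid a redundant tail chunk
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


pieces = fixed_size_chunks("abcdefghij", chunk_size=4, overlap=1)
```

The overlap means a sentence cut at a chunk boundary still appears whole in at least one neighbor, at the cost of some duplicated text in the index.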

Semantic Chunking

Structural Chunking

Parent-Child Chunking in Practice:

Document structure:
  Chapter 3: Refund Policy [Parent chunk = full chapter]
    3.1 Standard User Refunds [Child chunk = small paragraph]
    3.2 VIP User Refunds [Child chunk = small paragraph]
    3.3 Special Product Rules [Child chunk = small paragraph]

Retrieval: use fine-grained child chunks for vector search (higher precision)
Model input: after finding child chunk, pass its parent chunk (preserves full context)

In Dify: Knowledge Base → Document → Segmentation → Select "Parent-Child Segmentation Mode."
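The retrieve-child, return-parent idea can be sketched as below. The `ChildChunk` class, the `expand_to_parents` helper, and the `max_parents` limit are hypothetical names for this illustration and do not reflect how Dify stores segments.

```python
from dataclasses import dataclass


@dataclass
class ChildChunk:
    chunk_id: str
    parent_id: str
    text: str


def expand_to_parents(child_hits, parents, max_parents=3):
    """Map ranked child hits to parent texts, keeping rank order and dropping duplicates."""
    seen = set()
    contexts = []
    for child in child_hits:
        if child.parent_id in seen:
            continue  # this parent is already in the model input
        seen.add(child.parent_id)
        contexts.append(parents[child.parent_id])
        if len(contexts) == max_parents:
            break
    return contexts


parents = {"ch3": "Chapter 3: Refund Policy (full chapter text)"}
hits = [
    ChildChunk("3.2", "ch3", "VIP User Refunds"),
    ChildChunk("3.1", "ch3", "Standard User Refunds"),
]
contexts = expand_to_parents(hits, parents)
```

Both child hits belong to the same chapter, so the model receives that parent chunk once rather than twice.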


Level 3: Source Code and Principles (5+ Years Experience)

3.1 Dify RAG Pipeline Source Code Analysis

Dify's RAG retrieval pipeline lives in api/core/rag/. Core flow:

DatasetRetrieval
  ├── retrieve()
  │   ├── _single_retrieve()     # Single knowledge base retrieval
  │   └── _multi_retrieve()      # Multi knowledge base retrieval
  │       └── Concurrent retrieval across datasets
  │
  ├── Vector retrieval path
  │   └── VectorIndex.search()
  │       ├── embed_query()       # Vectorize query
  │       └── vector_store.search() # ANN search
  │
  ├── Full-text retrieval path
  │   └── KeywordIndex.search()
  │       └── BM25 implementation
  │
  └── Rerank path
      └── RerankRunner.run()
          ├── Call Rerank API
          └── Re-sort by Rerank scores

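Key code path (simplified from api/core/rag/datasource/retrieval_service.py):

from concurrent.futures import ThreadPoolExecutor
from typing import Optional

# Document, RetrievalMethod and RerankRunner are Dify-internal classes

class RetrievalService:
    @classmethod
    def retrieve(cls, retrieval_method: str, dataset_id: str,
                 query: str, top_k: int, score_threshold: float,
                 reranking_model: Optional[dict] = None) -> list[Document]:

        if retrieval_method == RetrievalMethod.HYBRID_SEARCH.value:
            # Execute vector and keyword search concurrently,
            # over-fetching (top_k * 2) so fusion has enough candidates
            with ThreadPoolExecutor() as executor:
                vector_future = executor.submit(
                    cls._vector_search, dataset_id, query, top_k * 2
                )
                keyword_future = executor.submit(
                    cls._keyword_search, dataset_id, query, top_k * 2
                )
                vector_results = vector_future.result()
                keyword_results = keyword_future.result()

            # RRF fusion
            results = cls._reciprocal_rank_fusion(
                [vector_results, keyword_results], top_k
            )
        elif retrieval_method == RetrievalMethod.FULL_TEXT_SEARCH.value:
            results = cls._keyword_search(dataset_id, query, top_k)
        else:
            # Remaining case in this simplified excerpt: pure vector search
            results = cls._vector_search(dataset_id, query, top_k)

        # Rerank refinement
        if reranking_model and len(results) > 0:
            results = RerankRunner(reranking_model).run(
                query, results, score_threshold, top_k
            )

        return results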

3.2 Vector Index Internals: pgvector vs Qdrant vs Weaviate

Dify supports multiple vector databases; their implementation differences significantly affect performance:

pgvector (PostgreSQL extension)

-- Dify's pgvector index creation
CREATE INDEX ON embeddings USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

-- ef_search at query time (higher = more accurate but slower)
SET hnsw.ef_search = 100;

Qdrant

Weaviate

Performance benchmark (1M vectors, dim=1536, Top-10):

| Database | Latency P50 | Latency P99 | QPS |
| --- | --- | --- | --- |
| pgvector (HNSW) | 12ms | 45ms | 500 |
| Qdrant | 5ms | 18ms | 2,000 |
| Weaviate | 8ms | 30ms | 1,200 |

3.3 RAG Evaluation Framework: RAGAS Integration

RAGAS (RAG Assessment) is an evaluation framework specifically for RAG systems, automating computation of core metrics:

from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision
)
from datasets import Dataset

# Prepare evaluation data
eval_data = {
    "question": ["What is the refund policy?", "How do I become a VIP?"],
    "answer": ["7-day no-questions-asked return", "Spend $1000/year to qualify"],
    "contexts": [
        ["Refund policy: within 7 days of purchase...", "Exclusions apply..."],
        ["VIP qualification: annual spending of $1000..."]
    ],
    "ground_truth": [
        "7-day return policy",
        "Annual spending of $1000 triggers automatic VIP upgrade"
    ]
}

dataset = Dataset.from_dict(eval_data)

result = evaluate(
    dataset=dataset,
    metrics=[
        faithfulness,         # Is the answer grounded in documents?
        answer_relevancy,     # Does the answer address the question?
        context_recall,       # Were relevant contexts retrieved?
        context_precision,    # How precise is the retrieved context?
    ]
)

print(result)
# {'faithfulness': 0.91, 'answer_relevancy': 0.88,
#  'context_recall': 0.79, 'context_precision': 0.85}

Integrate RAGAS into CI/CD pipelines to automatically run evaluations on every knowledge base configuration change, preventing quality regressions.
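One way to wire this into a pipeline is a small gate that compares the evaluation scores against minimum thresholds and returns a non-zero exit code on regression. This is a sketch. The threshold values and function names are hypothetical and should be calibrated against your own baseline runs. It takes a plain dict of metric scores like the one printed above.

```python
# Hypothetical minimum scores; tune them against your own baseline.
THRESHOLDS = {
    "faithfulness": 0.85,
    "answer_relevancy": 0.80,
    "context_recall": 0.75,
    "context_precision": 0.80,
}


def find_regressions(scores, thresholds=THRESHOLDS):
    """Return one message per metric that is missing or below its minimum."""
    failures = []
    for name, minimum in thresholds.items():
        value = scores.get(name)
        if value is None or value < minimum:
            failures.append(f"{name}: {value} < {minimum}")
    return failures


def gate(scores):
    """Print regressions and return a process exit code (0 = pass, 1 = fail)."""
    failures = find_regressions(scores)
    for line in failures:
        print("REGRESSION", line)
    return 1 if failures else 0


example_scores = {
    "faithfulness": 0.91,
    "answer_relevancy": 0.88,
    "context_recall": 0.79,
    "context_precision": 0.85,
}
exit_code = gate(example_scores)
```

In a CI script, the return value would be passed to `sys.exit()` so that a failing metric blocks the configuration change.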


Level 4: Production Pitfalls and Decision-Making (Expert Perspective)

4.1 Pitfall 1: Recall Failure from Ignoring Query Rewriting

Problem: User phrasing and document wording may be entirely different.

User asks: "How much does this thing cost?" (colloquial, missing context)
Document says: "Pricing plans: Basic Plan $299/month..."

Vector search performs poorly here because the semantic vectors diverge significantly.

Solution: Query Rewriting

Before retrieval, use an LLM to rewrite and expand the user's query:

# Literal JSON braces are doubled so that
# QUERY_REWRITE_PROMPT.format(query=...) substitutes only {query}
QUERY_REWRITE_PROMPT = """
You are a search query optimization expert. Rewrite the user's question
into a form better suited for document retrieval.

Requirements:
1. Fill in missing context (resolve pronouns to explicit terms)
2. Generate 2-3 semantically similar query variants
3. Output in JSON format

User question: {query}

Output format:
{{
  "rewritten": "primary rewritten query",
  "variants": ["variant 1", "variant 2"]
}}
"""

Implement in a Dify workflow: LLM node (query rewriting) then Knowledge base nodes (parallel retrieval with multiple variants) then deduplicate and merge results.
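The deduplicate-and-merge step can be sketched as follows. The `retrieve` callable stands in for a knowledge base retrieval node and is an assumption of this example. Keeping the highest score per chunk is one reasonable merge policy. Rank-based fusion such as RRF would be another.

```python
def retrieve_with_variants(queries, retrieve, top_k=5):
    """Run every query variant and keep the best score seen for each chunk.

    queries: the rewritten query plus its variants.
    retrieve: callable(query) -> list of (chunk_id, score); a stand-in
              for whatever retrieval call your workflow uses.
    """
    best = {}
    for query in queries:
        for chunk_id, score in retrieve(query):
            if chunk_id not in best or score > best[chunk_id]:
                best[chunk_id] = score
    merged = sorted(best.items(), key=lambda pair: pair[1], reverse=True)
    return merged[:top_k]


# Stand-in retriever with canned results (illustrative data only)
fake_index = {
    "How much does the product cost?": [("pricing_01", 0.81), ("faq_07", 0.40)],
    "pricing plans": [("pricing_01", 0.88), ("pricing_02", 0.75)],
}
merged = retrieve_with_variants(list(fake_index), lambda q: fake_index[q], top_k=3)
```

The chunk `pricing_01` is returned by both variants but appears only once in `merged`, with the higher of its two scores.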

This approach typically improves Recall@5 by 20–40% on colloquial questions.

4.2 Pitfall 2: Rerank Computation Explosion

Rerank computation scales as O(queries × candidates), because the cross-encoder runs once per query-document pair. With Top-K set to 50, Rerank must compute 50 relevance scores for every query, which drives latency up sharply.

Wrong configuration:

Top-K = 50 → Rerank 50 documents → P99 latency > 2 seconds

Correct approach — two-stage retrieval:

Stage 1 (vector search): Top-K = 30 (broad recall)
Stage 2 (Rerank): Input 30 documents, output Top-5
Final to model: 5 high-quality documents

Cohere Rerank v3 latency is approximately 150ms for 30 documents vs 50ms for 5 — a significant difference that compounds under load.
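The two-stage pattern can be sketched with pluggable stages. `recall_fn` and `rerank_fn` are placeholders for your vector search and Rerank calls. The toy implementations below only make the example runnable and are not real retrieval or reranking.

```python
def two_stage_retrieve(query, recall_fn, rerank_fn, recall_k=30, final_k=5):
    """Broad recall first, then rerank only those candidates and keep the best few.

    recall_fn(query, k) -> list of candidate texts (cheap, approximate).
    rerank_fn(query, candidates) -> one relevance score per candidate (expensive).
    """
    candidates = recall_fn(query, recall_k)
    scores = rerank_fn(query, candidates)
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [text for text, _ in ranked[:final_k]]


# Toy stand-ins (assumptions for illustration only)
corpus = [
    "Refunds are accepted within 7 days",
    "VIP users get 30-day returns",
    "Shipping takes 3 days",
]


def toy_recall(query, k):
    return corpus[:k]


def toy_rerank(query, docs):
    # Score by word overlap; a real reranker would be a cross-encoder
    query_words = set(query.lower().split())
    return [len(query_words & set(doc.lower().split())) for doc in docs]


top = two_stage_retrieve("VIP returns", toy_recall, toy_rerank, recall_k=3, final_k=1)
```

The expensive scorer sees at most `recall_k` candidates per query, which bounds Rerank latency, while the model only receives the `final_k` best.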

4.3 Pitfall 3: Stale Vectors After Document Updates

Symptom: You updated a document's content, but the system still retrieves the old version.

Root cause: Dify does not automatically re-index existing documents. After modifying a document, you must manually trigger re-indexing.

Production solution:

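# Detect documents needing re-indexing via Dify API
import hashlib
import os
import requests

# Base URL of your Dify API, read from the environment
DIFY_BASE_URL = os.environ["DIFY_BASE_URL"]

def check_and_reindex(dataset_id, document_path, api_key):
    # Hash the local file so content changes can be detected
    with open(document_path, 'rb') as f:
        current_hash = hashlib.md5(f.read()).hexdigest()

    # Fetch document info from Dify (first page only; paginate for large datasets)
    response = requests.get(
        f"{DIFY_BASE_URL}/datasets/{dataset_id}/documents",
        headers={"Authorization": f"Bearer {api_key}"},
        timeout=30,
    )
    response.raise_for_status()
    doc_info = response.json()

    for doc in doc_info['data']:
        if doc['name'] == os.path.basename(document_path):
            # Assumes the hash was stored under custom_metadata['md5'] at upload time
            stored_hash = (doc.get('custom_metadata') or {}).get('md5')
            if stored_hash != current_hash:
                # trigger_reindex and update_doc_metadata are helpers you
                # implement against your Dify deployment (not shown here)
                trigger_reindex(dataset_id, doc['id'], api_key)
                update_doc_metadata(dataset_id, doc['id'], current_hash, api_key)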

Establish document hash tracking with a daily scheduled job to keep the knowledge base synchronized with source files.

4.4 Pitfall 4: Data Contamination in Multi-tenant Scenarios

In SaaS scenarios, different customers' data must be strictly isolated — Customer A's queries must never retrieve Customer B's documents.

Dify's isolation approach: Create a separate Dataset per customer; at query time, specify only that customer's Dataset ID. This is the safest isolation method.

But when customer count reaches hundreds, maintaining many Datasets becomes operationally expensive.

Alternative: Metadata Filtering (supported by Qdrant and Weaviate):

# Qdrant payload filtering for multi-tenancy
from qdrant_client.http.models import Filter, FieldCondition, MatchValue

search_result = qdrant_client.search(
    collection_name="all_documents",
    query_vector=query_embedding,
    query_filter=Filter(
        must=[
            FieldCondition(
                key="tenant_id",
                match=MatchValue(value=current_tenant_id)
            )
        ]
    ),
    limit=10
)

Note: Metadata filtering requires additional application-layer encapsulation in Dify — more complex than Dataset isolation, but with lower long-term operational overhead at scale.
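As a sketch of what that application-layer encapsulation might look like, the wrapper below forces every search to carry a tenant ID and re-checks the results before returning them. The class name and the `search_fn` signature are hypothetical. `search_fn` would wrap a filtered vector-store query like the Qdrant call above.

```python
class TenantScopedRetriever:
    """Wraps a filtered search so no caller can query without a tenant id."""

    def __init__(self, search_fn):
        # search_fn(query, tenant_id, limit) -> list of dicts that each
        # carry a "tenant_id" key (hypothetical interface for this sketch)
        self._search_fn = search_fn

    def search(self, query, tenant_id, limit=10):
        if not tenant_id:
            raise ValueError("tenant_id is required for every query")
        results = self._search_fn(query, tenant_id, limit)

        # Defense in depth: verify the filter actually held before returning
        leaked = [r for r in results if r.get("tenant_id") != tenant_id]
        if leaked:
            raise RuntimeError(f"{len(leaked)} result(s) belong to another tenant")
        return results


def fake_search(query, tenant_id, limit):
    # Stand-in for a vector-store call that applies the tenant filter
    return [{"tenant_id": tenant_id, "text": "Pricing plans..."}][:limit]


retriever = TenantScopedRetriever(fake_search)
docs = retriever.search("pricing", tenant_id="tenant_a")
```

Failing closed on a mismatch means a misconfigured filter shows up as an error, not as a silent cross-tenant leak.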

4.5 Decision Framework: Choosing the Right RAG Configuration

What is your scenario?
│
├── Documents < 100K tokens + limited budget
│   → pgvector + hybrid search + BGE-Reranker (local deployment)
│
├── Documents > 1M tokens + high concurrency
│   → Qdrant + hybrid search + Cohere Rerank
│
├── Multilingual documents (mixed Chinese/English)
│   → bge-m3 embedding + Cohere multilingual Rerank
│
├── Need exact keyword matching (product codes, etc.)
│   → Hybrid search (increase BM25 weight) + Rerank
│
└── Extreme latency requirements (P99 < 500ms)
    → Pure vector search (skip Rerank) + lower Top-K

Chapter Summary

RAG tuning is a systems engineering effort with no single silver bullet. Key takeaways:

Metric-driven approach: Build an evaluation benchmark first, quantify current quality with Recall@K and MRR, then optimize deliberately — not by intuition.

Retrieval strategy: Default to hybrid search (vector + BM25) in production — on average 15% higher Recall@5 than single-path retrieval.

Rerank is essential: Adding Rerank on top of hybrid search typically improves Precision@5 by another 20–30%; local BGE Reranker is the best cost-efficiency option.

Upstream quality sets the ceiling: Embedding model selection and chunking strategy determine the recall ceiling — get these right before optimizing downstream stages.

Production checklist:
