Chapter 7

Advanced RAG Tuning: Recall Rate, Relevance Scoring and Reranking

Building a RAG system is only the first step. Making it work reliably in production requires systematic tuning, and this chapter lays out a complete methodology, from metric definition to Rerank refinement.

Chapter Overview

Many teams build RAG knowledge-base Q&A systems only to find persistent problems: a user asks a question clearly answerable from the documents, yet the system responds "no relevant information found." Or the system retrieves a pile of irrelevant chunks, causing the model to hallucinate. These two failure modes — insufficient recall and poor precision — are the most common ways RAG systems fail in production.

This chapter provides a systematic RAG tuning methodology. You will learn:

How to quantify recall rate and relevance to establish a measurable quality baseline
How to choose and configure Dify's retrieval strategies (vector, full-text, hybrid)
How Rerank models work and how to integrate them in Dify
Upstream factors that affect recall quality: chunking strategy and embedding model selection
Monitoring and continuous improvement methods for production environments

Level 1: Fundamentals (1–3 Years Experience)

1.1 Three Dimensions of RAG Quality

To understand RAG quality issues, first distinguish three separate dimensions:

Recall: Did the system retrieve the document chunks containing the answer? This is foundational. If relevant content is never retrieved, everything downstream is futile.

Precision: Of the retrieved content, how much is truly relevant? If you feed the model 20 chunks with 18 being noise, it will likely be misled.

Generation Quality: Given adequate recall and precision, can the model correctly understand and generate an accurate answer?

These three dimensions depend on one another in sequence: recall is the prerequisite, precision is the filter, and generation is the output. Always debug in this order; there is no point optimizing prompts while recall is still unreliable.

1.2 Using Dify's Evaluation Logs to Diagnose Problems

Dify's "Logs & Annotations" feature is the most direct debugging tool. For every user query, you can see:

Retrieved document chunks: which chunks were recalled and their similarity scores
Top-K used: how many chunks were actually sent to the model
Score distribution: gap between highest and lowest scores

Quick diagnosis workflow:

Ask a question you know is answerable from the documents
Open the logs and examine retrieval results
If relevant chunks have low scores (< 0.5) or were not retrieved at all — this is a recall problem
If relevant chunks were retrieved but drowned out by irrelevant ones — this is a precision problem
If both above look fine but the answer is still wrong — this is a generation quality problem
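
The triage above can be sketched as a small function. This is only an illustration of the decision order, not a Dify feature: the RetrievedChunk type, the 0.5 score cutoff taken from the workflow, and the max_noise_ratio parameter are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class RetrievedChunk:
    chunk_id: str
    score: float


def diagnose_failure(retrieved, relevant_ids, low_score=0.5, max_noise_ratio=0.5):
    """Classify a wrong answer as a recall, precision, or generation problem."""
    relevant_ids = set(relevant_ids)
    hits = [c for c in retrieved if c.chunk_id in relevant_ids]

    # Recall problem: relevant chunks are missing or all scored below the cutoff.
    if not hits or all(c.score < low_score for c in hits):
        return "recall"

    # Precision problem: relevant chunks are present but outnumbered by noise.
    noise_ratio = 1 - len(hits) / len(retrieved)
    if noise_ratio > max_noise_ratio:
        return "precision"

    # Retrieval looks healthy, so the fault lies downstream.
    return "generation"


chunks = [
    RetrievedChunk("doc_001_chunk_05", 0.82),
    RetrievedChunk("doc_004_chunk_01", 0.40),
    RetrievedChunk("doc_009_chunk_03", 0.35),
]
print(diagnose_failure(chunks, {"doc_001_chunk_05"}))
```

Running this over every failed query in a test set gives a rough count of failures per category, which tells you where to spend tuning effort first.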

1.3 How Vector Search Works (Intuitive Understanding)

Vector search converts each text segment into a point in high-dimensional space (a vector). Text with similar meaning occupies nearby positions in this space. At query time, the user's question is also converted to a point, and the K nearest text chunks are found.
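
To make the idea concrete, here is a minimal brute-force sketch of nearest-neighbour lookup by cosine similarity. The three-dimensional vectors and chunk names are made up for illustration; real embeddings have hundreds or thousands of dimensions, and production systems use an approximate index instead of scanning every chunk.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=3):
    """Return the k (chunk_id, similarity) pairs closest to the query."""
    scored = [(cid, cosine_similarity(query_vec, vec)) for cid, vec in chunk_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy "embeddings" standing in for the output of a real embedding model.
chunk_vecs = {
    "refund_policy": [0.9, 0.1, 0.0],
    "vip_benefits": [0.6, 0.7, 0.1],
    "office_hours": [0.0, 0.2, 0.9],
}
query_vec = [1.0, 0.2, 0.0]

for chunk_id, score in top_k(query_vec, chunk_vecs, k=2):
    print(f"{chunk_id}: {score:.3f}")
```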

Analogy: Imagine all documents as cities on a map, where "semantic similarity" means "geographic proximity." The user's question is a GPS coordinate, and the system finds the nearest cities.

The problem: proximity does not guarantee true relevance. "Apple phone" and "apple price" might both be close to "apple" in vector space, but one discusses iPhone while the other discusses fruit prices — neither answers a question about "Apple Inc. stock price."

This is exactly why Rerank is needed — it applies a more precise model as a second-pass filter.

1.4 Three Retrieval Modes in Dify

In knowledge base settings, Dify provides three retrieval modes:

Vector Search

Pure semantic matching
Best for: conceptual questions, queries with different wording but similar meaning
Weakness: poor performance on exact terms (product codes, names, dates)

Full-Text Search (BM25)

Traditional keyword frequency-based retrieval
Best for: exact terms (product codes, names, dates)
Weakness: no semantic understanding; "buy" and "purchase" may be treated differently
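
The keyword-only behaviour is easy to see in a toy BM25 scorer. The sketch below is a textbook Okapi BM25 with whitespace tokenization, written for illustration; it is not Dify's implementation, and the k1 and b values are the commonly used defaults rather than anything taken from Dify.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25."""
    if not docs:
        return []
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue  # no lexical match means no contribution at all
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / (term_freq[term] + length_norm)
        scores.append(score)
    return scores


docs = [
    "how to purchase a license for product SKU-4417",
    "how to buy a gift card",
    "office opening hours",
]
print(bm25_scores("SKU-4417", docs))   # only the first document matches
print(bm25_scores("purchase", docs))   # the "buy" document scores zero
```

The exact product code is matched reliably, while the synonym pair "buy" and "purchase" shares no score at all, which is the gap vector search fills.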

Hybrid Search

Combines vector and full-text scores with weighted fusion
Best for: most production scenarios
Dify uses RRF (Reciprocal Rank Fusion) to merge both result sets

Recommendation: In production, default to hybrid search, then enable Rerank.

1.5 Setting Sensible Top-K and Score Thresholds

Top-K: Controls how many document chunks are retrieved. Too small misses information; too large introduces noise.

Simple Q&A: Top-K = 3–5
Complex analysis: Top-K = 6–10
Note context window limits: 10 chunks × 512 tokens = 5,120 tokens

Score threshold: Chunks below this score are filtered out.

Cosine similarity for vector search: recommended threshold 0.5–0.7
Excessively high thresholds cause recall to drop sharply
With Rerank enabled, threshold applies to Rerank scores (recommend 0.3–0.5)

These parameters are adjustable in Dify under Knowledge Base → Retrieval Settings.
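
How Top-K, the score threshold, and the context budget interact can be sketched in a few lines. The function below is a hypothetical post-filter written for illustration, not Dify code; the max_context_tokens parameter and the (chunk_id, score, token_count) tuple layout are assumptions.

```python
def select_chunks(scored_chunks, top_k=5, score_threshold=0.5, max_context_tokens=4000):
    """Keep the best chunks that pass the threshold and fit the token budget.

    scored_chunks: iterable of (chunk_id, score, token_count) tuples.
    Returns (kept_chunk_ids, tokens_used).
    """
    kept = []
    used_tokens = 0
    ranked = sorted(scored_chunks, key=lambda chunk: chunk[1], reverse=True)
    for chunk_id, score, token_count in ranked:
        # Sorted best-first, so the first failure ends the selection.
        if score < score_threshold or len(kept) == top_k:
            break
        if used_tokens + token_count > max_context_tokens:
            break
        kept.append(chunk_id)
        used_tokens += token_count
    return kept, used_tokens


candidates = [("a", 0.90, 512), ("b", 0.70, 512), ("c", 0.45, 512), ("d", 0.80, 512)]
print(select_chunks(candidates))            # threshold removes "c"
print(select_chunks(candidates, top_k=2))   # Top-K caps the result at two chunks
```

Raising score_threshold to 0.85 in this example leaves a single chunk, which illustrates how quickly an aggressive threshold erodes recall.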

Level 2: Mechanisms in Depth (3–5 Years Experience)

2.1 Building a RAG Evaluation Benchmark

Systematic tuning requires quantifiable evaluation data. Steps to build a benchmark:

Step 1: Construct QA Pairs

Manually create 50–100 QA pairs from your documents, covering:

Direct-answer type: answer is explicitly in one passage
Multi-hop reasoning type: requires combining multiple passages
Negative samples: questions with no answer in the documents (tests the system's ability to decline)

# Example evaluation dataset format
qa_pairs = [
    {
        "question": "What is the company's refund policy?",
        "expected_answer": "7-day no-questions-asked return",
        "relevant_chunk_ids": ["doc_001_chunk_05", "doc_001_chunk_06"],
        "category": "direct"
    },
    {
        "question": "How does the VIP refund policy differ from standard?",
        "expected_answer": "VIP users get 30-day returns",
        "relevant_chunk_ids": ["doc_001_chunk_05", "doc_002_chunk_12"],
        "category": "multi_hop"
    }
]

Step 2: Define Evaluation Metrics

Core metrics:

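Recall@K: What fraction of the relevant chunks appear among the top K retrieved chunks? Formula: Recall@K = |top-K retrieved ∩ relevant| / |relevant|
MRR (Mean Reciprocal Rank): The average, over all queries, of 1 / (rank of the first relevant chunk). The earlier the first relevant chunk appears, the higher the score
NDCG (Normalized Discounted Cumulative Gain): Composite metric that rewards relevant chunks more the closer they are to the top of the ranking, normalized against the ideal ordering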
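
The three retrieval metrics are short enough to implement directly. The sketch below assumes binary relevance (a chunk is either relevant or not) and uses chunk IDs in the style of the evaluation dataset; the retrieved list is invented for the example.

```python
import math


def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant chunks found in the top k results."""
    relevant = set(relevant)
    if not relevant:
        return 0.0  # negative samples have no relevant chunks to recall
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    relevant = set(relevant)
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """runs: list of (retrieved, relevant) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ret, rel) for ret, rel in runs) / len(runs)


def ndcg_at_k(retrieved, relevant, k):
    """NDCG with binary relevance and a log2 position discount."""
    relevant = set(relevant)
    dcg = sum(
        1.0 / math.log2(position + 2)
        for position, chunk_id in enumerate(retrieved[:k])
        if chunk_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(position + 2) for position in range(ideal_hits))
    return dcg / idcg if idcg else 0.0


retrieved = ["doc_001_chunk_02", "doc_001_chunk_05", "doc_003_chunk_01",
             "doc_001_chunk_06", "doc_002_chunk_12"]
relevant = ["doc_001_chunk_05", "doc_001_chunk_06"]

print(recall_at_k(retrieved, relevant, 3))   # one of two relevant chunks in top 3
print(reciprocal_rank(retrieved, relevant))  # first relevant chunk is at rank 2
print(round(ndcg_at_k(retrieved, relevant, 5), 3))
```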

For generation quality:

Faithfulness: Is the answer grounded in the documents (anti-hallucination)?
Answer Relevancy: Does the answer actually address the question?

2.2 Deep Dive: RRF Algorithm for Hybrid Search

RRF (Reciprocal Rank Fusion) is the core of Dify's hybrid search. The formula:

RRF_score(d) = sum of  1 / (k + rank_i(d))  across all retrieval paths i

Where k is typically 60, and rank_i(d) is document d's rank in retrieval path i.
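
A minimal sketch of the formula, written from the definition above and not copied from Dify's source. Each input is a list of document IDs ordered best-first, and ranks start at 1.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_n=None):
    """Fuse several best-first rankings into one list of (doc_id, rrf_score)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    fused = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return fused if top_n is None else fused[:top_n]


vector_ranking = ["A", "B", "C"]    # from vector search
keyword_ranking = ["B", "D", "A"]   # from full-text search

for doc_id, score in reciprocal_rank_fusion([vector_ranking, keyword_ranking]):
    print(f"{doc_id}: {score:.6f}")
```

Document B wins with 1/62 + 1/61 because it ranks well in both lists, A follows with 1/61 + 1/63, and D and C, each found by only one path, trail with a single term. Note that raw similarity scores never enter the calculation.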

Why use RRF instead of direct weighted scoring?

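Vector search cosine similarity and BM25 scores have completely different distributions, so they cannot be added directly. Cosine similarity is bounded in [-1, 1] (and in practice usually falls between 0 and 1 for text embeddings), while BM25 scores are unbounded and can exceed 20. Direct weighting would let BM25 dominate unless both scores are normalized first.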

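RRF converts both result sets into rankings, which eliminates the scale mismatch. In the original RRF study (Cormack, Clarke and Büttcher, SIGIR 2009), the method with k = 60 and no per-collection tuning outperformed the individual rankers it combined, which is why it is a common default for fusing retrieval paths.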

Configuration example (Dify API mode):

{
  "retrieval_model": {
    "search_method": "hybrid_search",
    "reranking_enable": true,
    "reranking_model": {
      "reranking_provider_name": "cohere",
      "reranking_model_name": "rerank-multilingual-v3.0"
    },
    "top_k": 10,
    "score_threshold_enabled": true,
    "score_threshold": 0.3
  }
}

2.3 How Rerank Models Work

Rerank is a cross-encoder model specialized in judging "query-document" relevance.

Fundamental difference from vector search:

Feature	Vector Search (Bi-encoder)	Rerank (Cross-encoder)
Encoding	Query and document encoded separately	Query + document concatenated, encoded jointly
Relevance understanding	Overall semantic similarity	Fine-grained word-level interaction
Speed	Very fast (pre-computed vectors)	Slow (computed per pair in real time)
Accuracy	Medium	High
Typical use	Coarse recall (retrieve 100 candidates)	Fine ranking (select Top-5 from 100)

Rerank models handle complex relevance judgments that bi-encoders struggle with:

Understanding negation semantics ("does not support credit card" vs "credit card payment")
Disambiguating polysemous words
Understanding implicit intent in questions
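
The cross-encoder interface can be sketched as follows. The rerank function only needs something that scores (query, document) pairs; the loader shows one way to obtain such a scorer with the sentence-transformers CrossEncoder class, assuming that package is installed and the model can be downloaded. The word-overlap scorer is a stand-in so the example runs offline, and it obviously has none of a real cross-encoder's understanding.

```python
def rerank(query, documents, score_pairs, top_n=5):
    """Score every (query, document) pair jointly and keep the top_n documents."""
    if not documents:
        return []
    pairs = [(query, doc) for doc in documents]
    scores = score_pairs(pairs)  # one model pass per pair: accurate but slow
    ranked = sorted(zip(documents, scores), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]


def load_cross_encoder_scorer(model_name="BAAI/bge-reranker-v2-m3"):
    """Build a pair scorer backed by a real cross-encoder (optional dependency)."""
    from sentence_transformers import CrossEncoder  # imported lazily on purpose

    model = CrossEncoder(model_name)

    def score_pairs(pairs):
        return [float(score) for score in model.predict(pairs)]

    return score_pairs


def overlap_scorer(pairs):
    """Offline stand-in: counts shared lowercase words. Not a real reranker."""
    return [len(set(q.lower().split()) & set(d.lower().split())) for q, d in pairs]


documents = [
    "credit card payment is supported",
    "opening hours",
    "we accept credit card and paypal",
]
print(rerank("credit card payment", documents, overlap_scorer, top_n=2))
```

Swapping overlap_scorer for load_cross_encoder_scorer() keeps the calling code unchanged, which is convenient when comparing rerank models against the same candidate set.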

2.4 Comparing Popular Rerank Models

In Dify, you can integrate these Rerank providers:

Cohere Rerank

Model: rerank-multilingual-v3.0
Pros: Excellent multilingual support, outstanding Chinese performance, simple API
Latency: 50–200ms per request (depending on document count)
Cost: approximately $1 per 1,000 searches (one search = one query scored against N documents; confirm against Cohere's current pricing before budgeting)

Jina Rerank

Model: jina-reranker-v2-base-multilingual
Pros: Generous free tier, supports long documents (8,192 tokens)
Best for: budget-sensitive scenarios with longer documents

Local Deployment (Recommended): BAAI/bge-reranker-v2-m3

Deploy via Xinference or Ollama
Multilingual quality close to Cohere, completely free
Hardware: 16GB RAM sufficient to run; GPU dramatically speeds up inference

# Deploy BGE Reranker via Xinference
xinference launch \
  --model-name bge-reranker-v2-m3 \
  --model-type rerank \
  --device cuda

In Dify: Settings → Model Providers → Add local Xinference endpoint to use local Rerank.

2.5 Impact of Chunking Strategy on Recall Quality

Chunking is severely underestimated in the RAG pipeline. The chunking approach directly determines the ceiling of recall quality.

Fixed-size Chunking

Config: chunk_size=512 tokens, overlap=50 tokens
Problem: sentences may be split mid-way, losing context
Best for: well-structured documents (API docs, FAQs)
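
Fixed-size chunking with overlap is a sliding window over the token sequence. The sketch below uses whitespace-separated words as a stand-in for real tokenizer output, which is an assumption made to keep the example dependency-free; Dify's own splitter is not reproduced here.

```python
def chunk_tokens(tokens, chunk_size=512, overlap=50):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the document
    return chunks


text = "Refunds are accepted within seven days of purchase for all standard users"
tokens = text.split()
for chunk in chunk_tokens(tokens, chunk_size=6, overlap=2):
    print(" ".join(chunk))
```

The overlap repeats the tail of each chunk at the head of the next one, so a sentence cut at a boundary still appears whole in at least one chunk when it is shorter than the overlap.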

Semantic Chunking

Splits based on semantic cohesion; same topic stays together
Supported in Dify 0.10+
Effect: approximately 15–25% Recall@5 improvement at equal Top-K

Structural Chunking

Splits by document structure: headings, paragraphs, lists
Best for documents with clear hierarchy (technical manuals, legal documents)
In Dify: enable "Parent-Child Chunking" — retrieve small chunks but send large chunks to the model

Parent-Child Chunking in Practice:

Document structure:
  Chapter 3: Refund Policy [Parent chunk = full chapter]
    3.1 Standard User Refunds [Child chunk = small paragraph]
    3.2 VIP User Refunds [Child chunk = small paragraph]
    3.3 Special Product Rules [Child chunk = small paragraph]

Retrieval: use fine-grained child chunks for vector search (higher precision)
Model input: after finding child chunk, pass its parent chunk (preserves full context)
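
The retrieve-small, send-large step can be sketched as a lookup from child hits to their parents. The data structures here (a child-to-parent mapping and a dictionary of parent texts) and the max_parents parameter are assumptions for illustration and do not mirror Dify's internal storage.

```python
def expand_to_parents(child_hits, child_to_parent, parents, max_parents=3):
    """Map best-first child hits to de-duplicated parent chunks.

    child_hits: list of (child_id, score), already sorted best-first.
    """
    seen = set()
    context = []
    for child_id, _score in child_hits:
        parent_id = child_to_parent[child_id]
        if parent_id in seen:
            continue  # several children of one parent should add it only once
        seen.add(parent_id)
        context.append(parents[parent_id])
        if len(context) == max_parents:
            break
    return context


parents = {
    "ch3": "Chapter 3: Refund Policy (full chapter text)",
    "ch4": "Chapter 4: Shipping (full chapter text)",
}
child_to_parent = {
    "3.1": "ch3",  # Standard User Refunds
    "3.2": "ch3",  # VIP User Refunds
    "4.1": "ch4",
}
child_hits = [("3.2", 0.88), ("3.1", 0.81), ("4.1", 0.52)]
print(expand_to_parents(child_hits, child_to_parent, parents, max_parents=2))
```

De-duplication matters in practice: without it, two matching children from the same chapter would send that chapter to the model twice and waste context budget.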

In Dify: Knowledge Base → Document → Segmentation → Select "Parent-Child Segmentation Mode."

Level 3: Source Code and Principles (5+ Years Experience)

3.1 Dify RAG Pipeline Source Code Analysis

Dify's RAG retrieval pipeline lives in api/core/rag/. Core flow:

DatasetRetrieval
  ├── retrieve()
  │   ├── _single_retrieve()     # Single knowledge base retrieval
  │   └── _multi_retrieve()      # Multi knowledge base retrieval
  │       └── Concurrent retrieval across datasets
  │
  ├── Vector retrieval path
  │   └── VectorIndex.search()
  │       ├── embed_query()       # Vectorize query
  │       └── vector_store.search() # ANN search
  │
  ├── Full-text retrieval path
  │   └── KeywordIndex.search()
  │       └── BM25 implementation
  │
  └── Rerank path
      └── RerankRunner.run()
          ├── Call Rerank API
          └── Re-sort by Rerank scores

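Key code path, simplified for readability (api/core/rag/datasource/retrieval_service.py):

# Simplified sketch: Document, RetrievalMethod and RerankRunner are Dify-internal
# imports omitted here; the non-hybrid branches are condensed for illustration.
from concurrent.futures import ThreadPoolExecutor
from typing import Optional

class RetrievalService:
    @classmethod
    def retrieve(cls, retrieval_method: str, dataset_id: str,
                 query: str, top_k: int, score_threshold: float,
                 reranking_model: Optional[dict] = None) -> list[Document]:

        if retrieval_method == RetrievalMethod.HYBRID_SEARCH.value:
            # Execute vector and keyword search concurrently; each path
            # over-fetches (top_k * 2) so fusion has enough candidates
            with ThreadPoolExecutor() as executor:
                vector_future = executor.submit(
                    cls._vector_search, dataset_id, query, top_k * 2
                )
                keyword_future = executor.submit(
                    cls._keyword_search, dataset_id, query, top_k * 2
                )
                vector_results = vector_future.result()
                keyword_results = keyword_future.result()

            # RRF fusion
            results = cls._reciprocal_rank_fusion(
                [vector_results, keyword_results], top_k
            )
        elif retrieval_method == RetrievalMethod.FULL_TEXT_SEARCH.value:
            # Keyword-only retrieval
            results = cls._keyword_search(dataset_id, query, top_k)
        else:
            # Vector-only (semantic) retrieval
            results = cls._vector_search(dataset_id, query, top_k)

        # Rerank refinement: re-score candidates, apply the threshold, keep top_k
        if reranking_model and results:
            results = RerankRunner(reranking_model).run(
                query, results, score_threshold, top_k
            )

        return results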

3.2 Vector Index Internals: pgvector vs Qdrant vs Weaviate

Dify supports multiple vector databases; their implementation differences significantly affect performance:

pgvector (PostgreSQL extension)

Index type: HNSW (Hierarchical Navigable Small World)
Parameters: m=16, ef_construction=64 (graph density and build-time search width)
Character: co-located with Dify's main database, simple operations; sufficient for up to about 1M vectors
Search complexity: O(log N) approximate

-- Dify's pgvector index creation
CREATE INDEX ON embeddings USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

-- ef_search at query time (higher = more accurate but slower)
SET hnsw.ef_search = 100;

Qdrant

Index type: HNSW + payload filtering
Feature: Filter by metadata during vector search (e.g., "only search 2024 documents")
Performance: Clear advantage above 10M vectors
Dify config: set VECTOR_STORE=qdrant and configure QDRANT_URL

Weaviate

Unique feature: native BM25 + vector hybrid search (Dify does not need to implement RRF separately)
GraphQL query interface, suitable for complex filtering scenarios
Mature multi-tenancy support, ideal for SaaS deployments

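Performance benchmark (1M vectors, dim=1536, Top-10; indicative figures only, since results depend heavily on hardware and index settings, so re-measure on your own deployment):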

Database	Latency P50	Latency P99	QPS
pgvector (HNSW)	12ms	45ms	500
Qdrant	5ms	18ms	2,000
Weaviate	8ms	30ms	1,200

3.3 RAG Evaluation Framework: RAGAS Integration

RAGAS (RAG Assessment) is an evaluation framework specifically for RAG systems, automating computation of core metrics:

from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision
)
from datasets import Dataset

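# Prepare evaluation data. Column names follow the RAGAS 0.1-style schema;
# newer releases rename them, so check the version you have installed.
# The metrics are LLM-judged: an LLM backend (by default an OpenAI API key) must be configured.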
eval_data = {
    "question": ["What is the refund policy?", "How do I become a VIP?"],
    "answer": ["7-day no-questions-asked return", "Spend $1000/year to qualify"],
    "contexts": [
        ["Refund policy: within 7 days of purchase...", "Exclusions apply..."],
        ["VIP qualification: annual spending of $1000..."]
    ],
    "ground_truth": [
        "7-day return policy",
        "Annual spending of $1000 triggers automatic VIP upgrade"
    ]
}

dataset = Dataset.from_dict(eval_data)

result = evaluate(
    dataset=dataset,
    metrics=[
        faithfulness,         # Is the answer grounded in documents?
        answer_relevancy,     # Does the answer address the question?
        context_recall,       # Were relevant contexts retrieved?
        context_precision,    # How precise is the retrieved context?
    ]
)

print(result)
# {'faithfulness': 0.91, 'answer_relevancy': 0.88,
#  'context_recall': 0.79, 'context_precision': 0.85}

Integrate RAGAS into CI/CD pipelines to automatically run evaluations on every knowledge base configuration change, preventing quality regressions.
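
A regression gate for CI can be as small as a comparison against stored baseline scores. The sketch below is framework-agnostic and assumes you have already extracted the metric scores into a plain dictionary; the baseline values and the 0.02 tolerance are hypothetical numbers chosen for the example.

```python
import sys


def check_regression(current, baseline, tolerance=0.02):
    """Return a list of messages for metrics that dropped below baseline - tolerance."""
    failures = []
    for metric, base_value in baseline.items():
        value = current.get(metric)
        if value is None:
            failures.append(f"{metric}: missing from current results")
        elif value < base_value - tolerance:
            failures.append(f"{metric}: {value:.2f} is below baseline {base_value:.2f}")
    return failures


def ci_gate(current, baseline):
    """Exit non-zero so the CI job fails when any metric regresses."""
    failures = check_regression(current, baseline)
    for message in failures:
        print("REGRESSION", message)
    if failures:
        sys.exit(1)


# Hypothetical baseline, e.g. loaded from a JSON file committed to the repository.
baseline = {"faithfulness": 0.90, "context_recall": 0.85}
current = {"faithfulness": 0.91, "answer_relevancy": 0.88,
           "context_recall": 0.79, "context_precision": 0.85}
print(check_regression(current, baseline))
```

A small tolerance avoids failing the build on the run-to-run noise that LLM-judged metrics tend to show.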

Level 4: Production Pitfalls and Decision-Making (Expert Perspective)

4.1 Pitfall 1: Recall Failure from Ignoring Query Rewriting

Problem: User phrasing and document wording may be entirely different.

User asks: "How much does this thing cost?" (colloquial, missing context)
Document says: "Pricing plans: Basic Plan $299/month..."

Vector search performs poorly here because the semantic vectors diverge significantly.

Solution: Query Rewriting

Before retrieval, use an LLM to rewrite and expand the user's query:

# Literal JSON braces are doubled so that QUERY_REWRITE_PROMPT.format(query=...)
# substitutes only the {query} placeholder.
QUERY_REWRITE_PROMPT = """
You are a search query optimization expert. Rewrite the user's question
into a form better suited for document retrieval.

Requirements:
1. Fill in missing context (resolve pronouns to explicit terms)
2. Generate 2-3 semantically similar query variants
3. Output in JSON format

User question: {query}

Output format:
{{
  "rewritten": "primary rewritten query",
  "variants": ["variant 1", "variant 2"]
}}
"""

Implement in a Dify workflow: LLM node (query rewriting) then Knowledge base nodes (parallel retrieval with multiple variants) then deduplicate and merge results.

This approach typically improves Recall@5 by 20–40% on colloquial questions.
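
Outside a Dify workflow, the same rewrite, retrieve, and merge steps can be sketched in plain Python. The retrieve callable, the chunk IDs, and the choice to merge by best score per chunk are assumptions for illustration; the JSON shape matches the prompt's output format.

```python
import json


def parse_rewrite(llm_output, original_query):
    """Extract the rewritten query and variants, always keeping the original."""
    try:
        data = json.loads(llm_output)
        queries = [data["rewritten"], *data.get("variants", [])]
    except (json.JSONDecodeError, KeyError, TypeError):
        queries = []  # malformed LLM output: fall back to the original query only
    queries.append(original_query)
    # De-duplicate while preserving order; drop empty or non-string entries.
    return list(dict.fromkeys(q for q in queries if isinstance(q, str) and q.strip()))


def retrieve_with_variants(queries, retrieve, top_k=5):
    """Run every query variant and merge hits, keeping each chunk's best score."""
    best = {}
    for query in queries:
        for chunk_id, score in retrieve(query):
            if score > best.get(chunk_id, float("-inf")):
                best[chunk_id] = score
    merged = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return merged[:top_k]


llm_output = ('{"rewritten": "What is the monthly price of the Basic Plan?", '
              '"variants": ["Basic Plan pricing", "subscription cost per month"]}')
queries = parse_rewrite(llm_output, "How much does this thing cost?")

# Stand-in retriever backed by a fixed table; a real one would query the knowledge base.
fake_index = {
    "Basic Plan pricing": [("pricing_01", 0.81)],
    "How much does this thing cost?": [("faq_07", 0.42), ("pricing_01", 0.55)],
}
print(retrieve_with_variants(queries, lambda q: fake_index.get(q, [])))
```

Keeping the original query in the list is a cheap safeguard: if the rewrite drifts away from the user's intent, the unmodified question still contributes candidates.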

4.2 Pitfall 2: Rerank Computation Explosion

Rerank cost grows linearly with the number of candidates: every query needs one cross-encoder pass per candidate, so total work is O(queries × candidates). With Top-K set to 50, Rerank must compute 50 relevance scores per query, and latency rises accordingly.

Wrong configuration:

Top-K = 50 → Rerank 50 documents → P99 latency > 2 seconds

Correct approach — two-stage retrieval:

Stage 1 (vector search): Top-K = 30 (broad recall)
Stage 2 (Rerank): Input 30 documents, output Top-5
Final to model: 5 high-quality documents

Cohere Rerank v3 latency is approximately 150ms for 30 documents vs 50ms for 5 — a significant difference that compounds under load.

4.3 Pitfall 3: Stale Vectors After Document Updates

Symptom: You updated a document's content, but the system still retrieves the old version.

Root cause: Dify does not automatically re-index existing documents. After modifying a document, you must manually trigger re-indexing.

Production solution:

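# Detect documents needing re-indexing via the Dify API
import hashlib
import os

import requests

# Base URL of your Dify API, read from the environment (e.g. "https://api.dify.ai/v1")
DIFY_BASE_URL = os.environ["DIFY_BASE_URL"]

def check_and_reindex(dataset_id, document_path, api_key):
    # MD5 serves only as a change-detection fingerprint here, not for security
    with open(document_path, 'rb') as f:
        current_hash = hashlib.md5(f.read()).hexdigest()

    # Fetch document info from Dify (first page only; paginate for large datasets)
    response = requests.get(
        f"{DIFY_BASE_URL}/datasets/{dataset_id}/documents",
        headers={"Authorization": f"Bearer {api_key}"},
        timeout=30,
    )
    response.raise_for_status()
    doc_info = response.json()

    for doc in doc_info['data']:
        if doc['name'] == os.path.basename(document_path):
            # Assumes the hash was stored under 'custom_metadata' at upload time
            stored_hash = (doc.get('custom_metadata') or {}).get('md5')
            if stored_hash != current_hash:
                # trigger_reindex and update_doc_metadata are placeholders: implement
                # them against the document-update API of your Dify version
                trigger_reindex(dataset_id, doc['id'], api_key)
                update_doc_metadata(dataset_id, doc['id'], current_hash, api_key)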

Establish document hash tracking with a daily scheduled job to keep the knowledge base synchronized with source files.

4.4 Pitfall 4: Data Contamination in Multi-tenant Scenarios

In SaaS scenarios, different customers' data must be strictly isolated — Customer A's queries must never retrieve Customer B's documents.

Dify's isolation approach: Create a separate Dataset per customer; at query time, specify only that customer's Dataset ID. This is the safest isolation method.

But when customer count reaches hundreds, maintaining many Datasets becomes operationally expensive.

Alternative: Metadata Filtering (supported by Qdrant and Weaviate):

# Qdrant payload filtering for multi-tenancy
from qdrant_client.http.models import Filter, FieldCondition, MatchValue

search_result = qdrant_client.search(
    collection_name="all_documents",
    query_vector=query_embedding,
    query_filter=Filter(
        must=[
            FieldCondition(
                key="tenant_id",
                match=MatchValue(value=current_tenant_id)
            )
        ]
    ),
    limit=10
)

Note: Metadata filtering requires additional application-layer encapsulation in Dify — more complex than Dataset isolation, but with lower long-term operational overhead at scale.
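
One possible shape for that application-layer encapsulation is a wrapper that refuses to search without a tenant and double-checks every hit. This is a sketch under assumptions: search_fn stands for whatever function performs the filtered vector search, and hits are assumed to be dictionaries carrying the tenant_id from the stored payload.

```python
def tenant_scoped_search(search_fn, tenant_id, query_vector, limit=10):
    """Run a tenant-filtered search and verify that no foreign hits leak through."""
    if not tenant_id:
        raise ValueError("tenant_id is required for every search")

    hits = search_fn(query_vector, tenant_id, limit)

    # Defence in depth: fail loudly if the filter was dropped or misconfigured.
    leaked = [hit for hit in hits if hit.get("tenant_id") != tenant_id]
    if leaked:
        raise RuntimeError(f"{len(leaked)} hit(s) belong to another tenant")
    return hits


# In-memory stand-in for the vector store, used only to exercise the wrapper.
STORE = [
    {"id": 1, "tenant_id": "tenant_a", "text": "A's refund policy"},
    {"id": 2, "tenant_id": "tenant_b", "text": "B's refund policy"},
]


def filtered_search(query_vector, tenant_id, limit):
    return [doc for doc in STORE if doc["tenant_id"] == tenant_id][:limit]


print(tenant_scoped_search(filtered_search, "tenant_a", [0.1, 0.2]))
```

Routing every query through one such function means the tenant filter cannot be forgotten at individual call sites, which is the main risk of sharing a single collection.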

4.5 Decision Framework: Choosing the Right RAG Configuration

What is your scenario?
│
├── Documents < 100K tokens + limited budget
│   → pgvector + hybrid search + BGE-Reranker (local deployment)
│
├── Documents > 1M tokens + high concurrency
│   → Qdrant + hybrid search + Cohere Rerank
│
├── Multilingual documents (mixed Chinese/English)
│   → bge-m3 embedding + Cohere multilingual Rerank
│
├── Need exact keyword matching (product codes, etc.)
│   → Hybrid search (increase BM25 weight) + Rerank
│
└── Extreme latency requirements (P99 < 500ms)
    → Pure vector search (skip Rerank) + lower Top-K

Chapter Summary

RAG tuning is a systems engineering effort with no single silver bullet. Key takeaways:

Metric-driven approach: Build an evaluation benchmark first, quantify current quality with Recall@K and MRR, then optimize deliberately — not by intuition.

Retrieval strategy: Default to hybrid search (vector + BM25) in production — on average 15% higher Recall@5 than single-path retrieval.

Rerank is essential: Adding Rerank on top of hybrid search typically improves Precision@5 by another 20–30%; local BGE Reranker is the best cost-efficiency option.

Upstream quality sets the ceiling: Embedding model selection and chunking strategy determine the recall ceiling — get these right before optimizing downstream stages.

Production checklist:

Hybrid search enabled
Rerank model configured (recommend bge-reranker-v2-m3 deployed locally, or Cohere)
Top-K calibrated (20–30 for initial retrieval, 5–10 after Rerank)
Score threshold tested (avoid filtering out genuinely relevant content)
Chunking strategy matches document type
Evaluation benchmark established and run regularly
Document update monitoring deployed
