Advanced RAG Tuning: Recall Rate, Relevance Scoring and Reranking
Chapter 7: Deep RAG Tuning: Recall Rate, Relevance Evaluation, and Rerank Refinement
Building a RAG system is only the first step; making it reliably work in production requires systematic tuning. This chapter delivers a complete methodology from metric definition to Rerank refinement.
Chapter Overview
Many teams build RAG knowledge-base Q&A systems only to find persistent problems: a user asks a question clearly answerable from the documents, yet the system responds "no relevant information found." Or the system retrieves a pile of irrelevant chunks, causing the model to hallucinate. These two failure modes, insufficient recall and poor precision, are the most common ways RAG systems fail in production.
This chapter provides a systematic RAG tuning methodology. You will learn:
- How to quantify recall rate and relevance to establish a measurable quality baseline
- How to choose and configure Dify's retrieval strategies (vector, full-text, hybrid)
- How Rerank models work and how to integrate them in Dify
- Upstream factors that affect recall quality: chunking strategy and embedding model selection
- Monitoring and continuous improvement methods for production environments
Level 1: Fundamentals (1-3 Years Experience)
1.1 Three Dimensions of RAG Quality
To understand RAG quality issues, first distinguish three separate dimensions:
Recall: Did the system retrieve the document chunks containing the answer? This is foundational. If relevant content is never retrieved, everything downstream is futile.
Precision: Of the retrieved content, how much is truly relevant? If you feed the model 20 chunks with 18 being noise, it will likely be misled.
Generation Quality: Given adequate recall and precision, can the model correctly understand and generate an accurate answer?
These are sequential dependencies. Recall is the prerequisite, precision is the filter, generation is the output. Always debug in this order: don't optimize prompts when recall isn't guaranteed.
1.2 Using Dify's Evaluation Logs to Diagnose Problems
Dify's "Logs & Annotations" feature is the most direct debugging tool. For every user query, you can see:
- Retrieved document chunks: which chunks were recalled and their similarity scores
- Top-K used: how many chunks were actually sent to the model
- Score distribution: gap between highest and lowest scores
Quick diagnosis workflow:
- Ask a question you know is answerable from the documents
- Open the logs and examine retrieval results
- If relevant chunks have low scores (< 0.5) or were not retrieved at all → this is a recall problem
- If relevant chunks were retrieved but drowned out by irrelevant ones → this is a precision problem
- If both above look fine but the answer is still wrong → this is a generation quality problem
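The workflow above can be sketched as a small triage function for logged queries. This is only an illustration: the function name, the `noise_ratio` cutoff, and the input shape (a list of `(chunk_id, score)` pairs copied from the logs) are assumptions, not part of Dify.

```python
def diagnose(retrieved, relevant_ids, answer_correct, low_score=0.5, noise_ratio=0.5):
    """Classify one logged query as a recall, precision, or generation problem.

    retrieved: list of (chunk_id, score) pairs as shown in the logs.
    relevant_ids: chunk ids you know contain the answer.
    low_score / noise_ratio: assumed cutoffs, tune them for your data.
    """
    scores = {chunk_id: score for chunk_id, score in retrieved}
    hits = [cid for cid in relevant_ids if cid in scores]
    # Recall problem: relevant chunks are missing or all scored below the cutoff.
    if not hits or all(scores[cid] < low_score for cid in hits):
        return "recall"
    # Precision problem: more than noise_ratio of the retrieved chunks are irrelevant.
    noise = sum(1 for cid in scores if cid not in relevant_ids)
    if noise / len(scores) > noise_ratio:
        return "precision"
    # Retrieval looks fine, so any remaining error is on the generation side.
    return "ok" if answer_correct else "generation"
```

Running this over a batch of logged queries gives a rough count of which dimension fails most often, which tells you where to spend tuning effort first.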
1.3 How Vector Search Works (Intuitive Understanding)
Vector search converts each text segment into a point in high-dimensional space (a vector). Text with similar meaning occupies nearby positions in this space. At query time, the user's question is also converted to a point, and the K nearest text chunks are found.
Analogy: Imagine all documents as cities on a map, where "semantic similarity" means "geographic proximity." The user's question is a GPS coordinate, and the system finds the nearest cities.
The problem: proximity does not guarantee true relevance. "Apple phone" and "apple price" might both be close to "apple" in vector space, but one discusses iPhone while the other discusses fruit prices, and neither answers a question about "Apple Inc. stock price."
This is exactly why Rerank is needed: it applies a more precise model as a second-pass filter.
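To make the nearest-neighbor idea in this section concrete, here is a minimal sketch using cosine similarity over hand-made 3-dimensional vectors. The vectors and chunk names are invented for illustration; real embedding models produce hundreds or thousands of dimensions, and production systems use an approximate index rather than a full scan.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, chunk_vecs, k=2):
    """Brute-force scan: score every chunk, return the k most similar."""
    scored = [(chunk_id, cosine(query_vec, vec)) for chunk_id, vec in chunk_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy "embeddings" (hypothetical values chosen for the example).
chunks = {
    "iphone_review": [0.9, 0.1, 0.0],
    "fruit_prices": [0.1, 0.9, 0.2],
    "stock_report": [0.6, 0.2, 0.7],
}
query = [0.8, 0.2, 0.3]  # stand-in for the embedding of "Apple Inc. stock price"
nearest = top_k(query, chunks, k=2)
```

With these toy values the iPhone chunk ranks above the stock report even though the stock report is the better answer, which is the proximity-versus-relevance gap described above.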
1.4 Three Retrieval Modes in Dify
In knowledge base settings, Dify provides three retrieval modes:
Vector Search
- Pure semantic matching
- Best for: conceptual questions, queries with different wording but similar meaning
- Weakness: poor performance on exact terms (product codes, names, dates)
Full-Text Search (BM25)
- Traditional keyword frequency-based retrieval
- Best for: exact terms (product codes, names, dates)
- Weakness: no semantic understanding; "buy" and "purchase" may be treated differently
Hybrid Search
- Combines vector and full-text scores with weighted fusion
- Best for: most production scenarios
- Dify uses RRF (Reciprocal Rank Fusion) to merge both result sets
Recommendation: In production, default to hybrid search, then enable Rerank.
1.5 Setting Sensible Top-K and Score Thresholds
Top-K: Controls how many document chunks are retrieved. Too small misses information; too large introduces noise.
- Simple Q&A: Top-K = 3-5
- Complex analysis: Top-K = 6-10
- Note context window limits: 10 chunks × 512 tokens = 5,120 tokens
Score threshold: Chunks below this score are filtered out.
- Cosine similarity for vector search: recommended threshold 0.5-0.7
- Excessively high thresholds cause recall to drop sharply
- With Rerank enabled, threshold applies to Rerank scores (recommend 0.3-0.5)
These parameters are adjustable in Dify under Knowledge Base → Retrieval Settings.
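The interaction between the two settings can be sketched as follows. This is an illustrative model of "filter by threshold, then keep the best K", not Dify's internal code; the function names are assumptions.

```python
def select_chunks(scored_chunks, top_k=5, score_threshold=0.5):
    """Keep chunks at or above the threshold, then return the best top_k.

    scored_chunks: list of (chunk_id, score) pairs.
    """
    kept = [chunk for chunk in scored_chunks if chunk[1] >= score_threshold]
    kept.sort(key=lambda chunk: chunk[1], reverse=True)
    return kept[:top_k]

def context_tokens(num_chunks, tokens_per_chunk=512):
    """Rough context budget consumed by the retrieved chunks."""
    return num_chunks * tokens_per_chunk

candidates = [("a", 0.9), ("b", 0.4), ("c", 0.6), ("d", 0.7)]
selected = select_chunks(candidates, top_k=2, score_threshold=0.5)
budget = context_tokens(10)  # 10 chunks of 512 tokens each
```

Raising `score_threshold` toward 1.0 in this sketch quickly returns an empty list, which mirrors the sharp recall drop noted above.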
Level 2: Mechanisms in Depth (3-5 Years Experience)
2.1 Building a RAG Evaluation Benchmark
Systematic tuning requires quantifiable evaluation data. Steps to build a benchmark:
Step 1: Construct QA Pairs
Manually create 50-100 QA pairs from your documents, covering:
- Direct-answer type: answer is explicitly in one passage
- Multi-hop reasoning type: requires combining multiple passages
- Negative samples: questions with no answer in the documents (tests the system's ability to decline)
# Example evaluation dataset format
qa_pairs = [
    {
        "question": "What is the company's refund policy?",
        "expected_answer": "7-day no-questions-asked return",
        "relevant_chunk_ids": ["doc_001_chunk_05", "doc_001_chunk_06"],
        "category": "direct"
    },
    {
        "question": "How does the VIP refund policy differ from standard?",
        "expected_answer": "VIP users get 30-day returns",
        "relevant_chunk_ids": ["doc_001_chunk_05", "doc_002_chunk_12"],
        "category": "multi_hop"
    }
]
Step 2: Define Evaluation Metrics
Core metrics:
- Recall@K: What fraction of the relevant chunks appear among the K retrieved chunks? Formula:
Recall@K = |retrieved ∩ relevant| / |relevant|
- MRR (Mean Reciprocal Rank): At what position does the first relevant chunk appear? The earlier it appears, the higher the score
- NDCG: Composite metric considering both relevance and rank position
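The three retrieval metrics can be computed with a few lines of plain Python. The sketch below assumes binary relevance (a chunk is either relevant or not) and takes `retrieved` as a ranked list of chunk ids; the function names are illustrative.

```python
import math

def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant chunks found in the top k results."""
    top = set(retrieved[:k])
    return len(top & set(relevant)) / len(relevant) if relevant else 0.0

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant chunk, or 0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """runs: list of (retrieved, relevant) pairs, one per query."""
    return sum(reciprocal_rank(ret, rel) for ret, rel in runs) / len(runs)

def ndcg_at_k(retrieved, relevant, k):
    """NDCG with binary gains: discounted hits divided by the ideal ordering."""
    dcg = sum(
        1.0 / math.log2(i + 1)
        for i, chunk_id in enumerate(retrieved[:k], start=1)
        if chunk_id in relevant
    )
    ideal = sum(1.0 / math.log2(i + 1) for i in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0

retrieved = ["c1", "doc_001_chunk_05", "c3", "doc_001_chunk_06"]
relevant = {"doc_001_chunk_05", "doc_001_chunk_06"}
r3 = recall_at_k(retrieved, relevant, 3)   # one of two relevant chunks in the top 3
rr = reciprocal_rank(retrieved, relevant)  # first relevant chunk sits at rank 2
```

Averaging these per-query values over the whole benchmark gives the baseline numbers to compare against after each configuration change.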
For generation quality:
- Faithfulness: Is the answer grounded in the documents (anti-hallucination)?
- Answer Relevancy: Does the answer actually address the question?
2.2 Deep Dive: RRF Algorithm for Hybrid Search
RRF (Reciprocal Rank Fusion) is the core of Dify's hybrid search. The formula:
RRF_score(d) = sum of 1 / (k + rank_i(d)) across all retrieval paths i
Where k is typically 60, and rank_i(d) is document d's rank in retrieval path i.
Why use RRF instead of direct weighted scoring?
Vector search cosine similarity and BM25 scores have completely different distributions, so they cannot be directly added. Cosine similarity is bounded (formally [-1, 1], and typically within [0, 1] for text embeddings), while BM25 scores are unbounded and can exceed 20. Direct weighting would let BM25 dominate.
RRF converts both result sets into "rankings," eliminating unit mismatch. Research shows that even without careful tuning, RRF outperforms single-path retrieval in most scenarios.
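The RRF formula above translates directly into code. The following is a minimal sketch, not Dify's implementation; the function name and the `top_n` parameter are assumptions, and each ranking is simply a list of document ids ordered best first.

```python
def reciprocal_rank_fusion(rankings, k=60, top_n=None):
    """Fuse several ranked lists of document ids into one list of (id, score)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each retrieval path contributes 1 / (k + rank) for this document.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    fused = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return fused[:top_n] if top_n else fused

vector_ranking = ["d1", "d2", "d3"]   # ordered by cosine similarity
bm25_ranking = ["d3", "d1", "d4"]     # ordered by BM25 score
fused = reciprocal_rank_fusion([vector_ranking, bm25_ranking])
```

Here `d1` (ranks 1 and 2) edges out `d3` (ranks 3 and 1), and both beat documents that appear in only one list. Note that the raw cosine and BM25 values never enter the computation, only the rank positions.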
Configuration example (Dify API mode):
{
  "retrieval_model": {
    "search_method": "hybrid_search",
    "reranking_enable": true,
    "reranking_model": {
      "reranking_provider_name": "cohere",
      "reranking_model_name": "rerank-multilingual-v3.0"
    },
    "top_k": 10,
    "score_threshold_enabled": true,
    "score_threshold": 0.3
  }
}
2.3 How Rerank Models Work
Rerank is a cross-encoder model specialized in judging "query-document" relevance.
Fundamental difference from vector search:
| Feature | Vector Search (Bi-encoder) | Rerank (Cross-encoder) |
|---|---|---|
| Encoding | Query and document encoded separately | Query + document concatenated, encoded jointly |
| Relevance understanding | Overall semantic similarity | Fine-grained word-level interaction |
| Speed | Very fast (pre-computed vectors) | Slow (computed per pair in real time) |
| Accuracy | Medium | High |
| Typical use | Coarse recall (retrieve 100 candidates) | Fine ranking (select Top-5 from 100) |
Rerank models handle complex relevance judgments that bi-encoders struggle with:
- Understanding negation semantics ("does not support credit card" vs "credit card payment")
- Disambiguating polysemous words
- Understanding implicit intent in questions
2.4 Comparing Popular Rerank Models
In Dify, you can integrate these Rerank providers:
Cohere Rerank
- Model: rerank-multilingual-v3.0
- Pros: Excellent multilingual support, outstanding Chinese performance, simple API
- Latency: 50-200ms per request (depending on document count)
- Cost: approximately $1 per 1,000 searches (1 search = 1 query x N documents)
Jina Rerank
- Model: jina-reranker-v2-base-multilingual
- Pros: Generous free tier, supports long documents (8,192 tokens)
- Best for: budget-sensitive scenarios with longer documents
Local Deployment (Recommended): BAAI/bge-reranker-v2-m3
- Deploy via Xinference or Ollama
- Multilingual quality close to Cohere, completely free
- Hardware: 16GB RAM sufficient to run; GPU dramatically speeds up inference
# Deploy BGE Reranker via Xinference
xinference launch \
--model-name bge-reranker-v2-m3 \
--model-type rerank \
--device cuda
In Dify: Settings → Model Providers → Add local Xinference endpoint to use local Rerank.
2.5 Impact of Chunking Strategy on Recall Quality
Chunking is severely underestimated in the RAG pipeline. The chunking approach directly determines the ceiling of recall quality.
Fixed-size Chunking
- Config: chunk_size=512 tokens, overlap=50 tokens
- Problem: sentences may be split mid-way, losing context
- Best for: well-structured documents (API docs, FAQs)
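A minimal sketch of fixed-size chunking with overlap, using the same parameters as above. It operates on a list of tokens; the whitespace split in the usage lines is only a stand-in for a real tokenizer, and the function name is an assumption.

```python
def chunk_fixed(tokens, chunk_size=512, overlap=50):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once this window has reached the end of the document.
        if start + chunk_size >= len(tokens):
            break
    return chunks

text = "word " * 1000
tokens = text.split()  # stand-in tokenizer: one token per whitespace-separated word
chunks = chunk_fixed(tokens, chunk_size=512, overlap=50)
```

Because the boundaries are purely positional, a window can end in the middle of a sentence; the overlap softens this but does not remove it, which is the weakness listed above.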
Semantic Chunking
- Splits based on semantic cohesion; same topic stays together
- Supported in Dify 0.10+
- Effect: approximately 15-25% Recall@5 improvement at equal Top-K
Structural Chunking
- Splits by document structure: headings, paragraphs, lists
- Best for documents with clear hierarchy (technical manuals, legal documents)
- In Dify: enable "Parent-Child Chunking" to retrieve small chunks but send large chunks to the model
Parent-Child Chunking in Practice:
Document structure:
    Chapter 3: Refund Policy        [Parent chunk = full chapter]
        3.1 Standard User Refunds   [Child chunk = small paragraph]
        3.2 VIP User Refunds        [Child chunk = small paragraph]
        3.3 Special Product Rules   [Child chunk = small paragraph]
Retrieval: use fine-grained child chunks for vector search (higher precision)
Model input: after finding child chunk, pass its parent chunk (preserves full context)
In Dify: Knowledge Base → Document → Segmentation → Select "Parent-Child Segmentation Mode."
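The "search on children, answer with parents" step can be sketched as a simple lookup. The data structures here (a child-to-parent map and a parent text store) are assumptions made for illustration and do not reflect how Dify stores segments internally.

```python
def expand_to_parents(child_hits, child_to_parent, parents):
    """Map ranked child-chunk hits to parent chunks, deduplicated in rank order."""
    seen = set()
    context = []
    for child_id in child_hits:
        parent_id = child_to_parent[child_id]
        # Several children often share one parent; send that parent only once.
        if parent_id not in seen:
            seen.add(parent_id)
            context.append(parents[parent_id])
    return context

parents = {"ch3": "Chapter 3: Refund Policy (full chapter text)"}
child_to_parent = {"3.1": "ch3", "3.2": "ch3", "3.3": "ch3"}
# Vector search matched two child chunks from the same chapter.
context = expand_to_parents(["3.2", "3.1"], child_to_parent, parents)
```

Deduplication matters for the token budget: without it, two hits in the same chapter would send the full chapter to the model twice.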
Level 3: Source Code and Principles (5+ Years Experience)
3.1 Dify RAG Pipeline Source Code Analysis
Dify's RAG retrieval pipeline lives in api/core/rag/. Core flow:
DatasetRetrieval
├── retrieve()
│   ├── _single_retrieve()          # Single knowledge base retrieval
│   └── _multi_retrieve()           # Multi knowledge base retrieval
│       └── Concurrent retrieval across datasets
│
├── Vector retrieval path
│   └── VectorIndex.search()
│       ├── embed_query()           # Vectorize query
│       └── vector_store.search()   # ANN search
│
├── Full-text retrieval path
│   └── KeywordIndex.search()
│       └── BM25 implementation
│
└── Rerank path
    └── RerankRunner.run()
        ├── Call Rerank API
        └── Re-sort by Rerank scores
Key code path (api/core/rag/datasource/retrieval_service.py):
class RetrievalService:
    @classmethod
    def retrieve(cls, retrieval_method: str, dataset_id: str,
                 query: str, top_k: int, score_threshold: float,
                 reranking_model: dict = None) -> list[Document]:
        if retrieval_method == RetrievalMethod.HYBRID_SEARCH.value:
            # Execute vector and keyword search concurrently
            with ThreadPoolExecutor() as executor:
                vector_future = executor.submit(
                    cls._vector_search, dataset_id, query, top_k * 2
                )
                keyword_future = executor.submit(
                    cls._keyword_search, dataset_id, query, top_k * 2
                )
                vector_results = vector_future.result()
                keyword_results = keyword_future.result()
            # RRF fusion
            results = cls._reciprocal_rank_fusion(
                [vector_results, keyword_results], top_k
            )
            # Rerank refinement
            if reranking_model and len(results) > 0:
                results = RerankRunner(reranking_model).run(
                    query, results, score_threshold, top_k
                )
            return results
3.2 Vector Index Internals: pgvector vs Qdrant vs Weaviate
Dify supports multiple vector databases; their implementation differences significantly affect performance:
pgvector (PostgreSQL extension)
- Index type: HNSW (Hierarchical Navigable Small World)
- Parameters: m=16, ef_construction=64 (graph density and build-time search width)
- Character: co-located with Dify's main database, simple operations; sufficient for up to about 1M vectors
- Search complexity: O(log N) approximate
-- Dify's pgvector index creation
CREATE INDEX ON embeddings USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
-- ef_search at query time (higher = more accurate but slower)
SET hnsw.ef_search = 100;
Qdrant
- Index type: HNSW + payload filtering
- Feature: Filter by metadata during vector search (e.g., "only search 2024 documents")
- Performance: Clear advantage above 10M vectors
- Dify config: set VECTOR_STORE=qdrant and configure QDRANT_URL
Weaviate
- Unique feature: native BM25 + vector hybrid search (Dify does not need to implement RRF separately)
- GraphQL query interface, suitable for complex filtering scenarios
- Mature multi-tenancy support, ideal for SaaS deployments
Performance benchmark (1M vectors, dim=1536, Top-10):
| Database | Latency P50 | Latency P99 | QPS |
|---|---|---|---|
| pgvector (HNSW) | 12ms | 45ms | 500 |
| Qdrant | 5ms | 18ms | 2,000 |
| Weaviate | 8ms | 30ms | 1,200 |
3.3 RAG Evaluation Framework: RAGAS Integration
RAGAS (RAG Assessment) is an evaluation framework specifically for RAG systems, automating computation of core metrics:
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision
)
from datasets import Dataset

# Prepare evaluation data
eval_data = {
    "question": ["What is the refund policy?", "How do I become a VIP?"],
    "answer": ["7-day no-questions-asked return", "Spend $1000/year to qualify"],
    "contexts": [
        ["Refund policy: within 7 days of purchase...", "Exclusions apply..."],
        ["VIP qualification: annual spending of $1000..."]
    ],
    "ground_truth": [
        "7-day return policy",
        "Annual spending of $1000 triggers automatic VIP upgrade"
    ]
}
dataset = Dataset.from_dict(eval_data)
result = evaluate(
    dataset=dataset,
    metrics=[
        faithfulness,       # Is the answer grounded in documents?
        answer_relevancy,   # Does the answer address the question?
        context_recall,     # Were relevant contexts retrieved?
        context_precision,  # How precise is the retrieved context?
    ]
)
print(result)
# {'faithfulness': 0.91, 'answer_relevancy': 0.88,
#  'context_recall': 0.79, 'context_precision': 0.85}
Integrate RAGAS into CI/CD pipelines to automatically run evaluations on every knowledge base configuration change, preventing quality regressions.
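One way to turn the evaluation scores into a CI gate is to compare them against minimum values and fail the build on any shortfall. The sketch below is framework-agnostic: it takes a plain dict of metric scores, and the threshold values and function name are assumptions you would replace with your own baseline.

```python
def check_quality_gate(scores, minimums):
    """Return a list of human-readable failures; an empty list means the gate passes."""
    failures = []
    for metric, floor in minimums.items():
        value = scores.get(metric)
        if value is None:
            failures.append(f"{metric}: missing from results")
        elif value < floor:
            failures.append(f"{metric}: {value:.2f} is below the minimum {floor:.2f}")
    return failures

# Assumed minimums, e.g. derived from the last known-good evaluation run.
MINIMUMS = {"faithfulness": 0.85, "context_recall": 0.75}

scores = {"faithfulness": 0.91, "answer_relevancy": 0.88,
          "context_recall": 0.79, "context_precision": 0.85}
failures = check_quality_gate(scores, MINIMUMS)
```

In a pipeline step, a non-empty `failures` list would be printed and followed by a non-zero exit code so that the configuration change is blocked.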
Level 4: Production Pitfalls and Decision-Making (Expert Perspective)
4.1 Pitfall 1: Recall Failure from Ignoring Query Rewriting
Problem: User phrasing and document wording may be entirely different.
User asks: "How much does this thing cost?" (colloquial, missing context)
Document says: "Pricing plans: Basic Plan $299/month..."
Vector search performs poorly here because the semantic vectors diverge significantly.
Solution: Query Rewriting
Before retrieval, use an LLM to rewrite and expand the user's query:
QUERY_REWRITE_PROMPT = """
You are a search query optimization expert. Rewrite the user's question
into a form better suited for document retrieval.
Requirements:
1. Fill in missing context (resolve pronouns to explicit terms)
2. Generate 2-3 semantically similar query variants
3. Output in JSON format
User question: {query}
Output format:
{
"rewritten": "primary rewritten query",
"variants": ["variant 1", "variant 2"]
}
"""
Implement in a Dify workflow: LLM node (query rewriting) then Knowledge base nodes (parallel retrieval with multiple variants) then deduplicate and merge results.
This approach typically improves Recall@5 by 20-40% on colloquial questions.
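The "deduplicate and merge" step of that workflow can be sketched in plain Python, for example inside a code node. The function names are illustrative; the sketch assumes the LLM returns the JSON shape requested by the prompt above and that each retrieval returns `(chunk_id, score)` pairs.

```python
import json

def parse_rewrite(llm_output, original_query):
    """Extract the rewritten query and variants; fall back to the original on bad output."""
    try:
        data = json.loads(llm_output)
        queries = [data["rewritten"], *data.get("variants", [])]
    except (json.JSONDecodeError, KeyError, TypeError):
        return [original_query]
    cleaned = [q for q in queries if isinstance(q, str) and q.strip()]
    return cleaned or [original_query]

def merge_results(result_lists, top_k=5):
    """Deduplicate chunks across query variants, keeping each chunk's best score."""
    best = {}
    for results in result_lists:
        for chunk_id, score in results:
            if score > best.get(chunk_id, float("-inf")):
                best[chunk_id] = score
    return sorted(best.items(), key=lambda item: item[1], reverse=True)[:top_k]

llm_output = (
    '{"rewritten": "What is the monthly price of the Basic Plan?", '
    '"variants": ["Basic Plan pricing", "pricing plans cost"]}'
)
queries = parse_rewrite(llm_output, "How much does this thing cost?")
merged = merge_results([[("a", 0.6), ("b", 0.4)], [("a", 0.8), ("c", 0.5)]], top_k=2)
```

Falling back to the original query when the JSON cannot be parsed keeps retrieval working even when the rewriting model misbehaves.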
4.2 Pitfall 2: Rerank Computation Explosion
Rerank computation scales as O(queries multiplied by candidates). With Top-K set to 50, Rerank must compute 50 relevance scores per query, causing latency to spike dramatically.
Wrong configuration:
Top-K = 50 → Rerank 50 documents → P99 latency > 2 seconds
Correct approach: two-stage retrieval
Stage 1 (vector search): Top-K = 30 (broad recall)
Stage 2 (Rerank): Input 30 documents, output Top-5
Final to model: 5 high-quality documents
Cohere Rerank v3 latency is approximately 150ms for 30 documents vs 50ms for 5, a significant difference that compounds under load.
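The two-stage pattern can be sketched with a pluggable scoring function. In the example the scorer is a toy word-overlap function standing in for a real cross-encoder or Rerank API call; the function names and defaults are assumptions chosen to match the numbers above.

```python
def two_stage_retrieve(query, candidates, rerank_score, recall_k=30, final_k=5):
    """candidates: chunk texts already ordered by first-stage score, best first."""
    shortlist = candidates[:recall_k]  # stage 1: cap what the reranker sees
    # Stage 2: one scoring call per (query, chunk) pair, so cost grows with recall_k.
    scored = [(doc, rerank_score(query, doc)) for doc in shortlist]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:final_k]

def overlap_score(query, doc):
    """Toy stand-in for a cross-encoder: fraction of query words found in the doc."""
    q_words = set(query.lower().split())
    if not q_words:
        return 0.0
    return len(q_words & set(doc.lower().split())) / len(q_words)

docs = ["refund policy for vip users"] + [f"filler {i}" for i in range(49)]
top = two_stage_retrieve("vip refund policy", docs, overlap_score)
```

The number of scorer calls equals `recall_k`, not the size of the corpus, which is why bounding the first stage bounds Rerank latency.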
4.3 Pitfall 3: Stale Vectors After Document Updates
Symptom: You updated a document's content, but the system still retrieves the old version.
Root cause: Dify does not automatically re-index existing documents. After modifying a document, you must manually trigger re-indexing.
Production solution:
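# Detect documents needing re-indexing via the Dify API
import hashlib
import os

import requests

# Assumption: the Dify API base URL is supplied through the environment
DIFY_BASE_URL = os.environ["DIFY_BASE_URL"]

def check_and_reindex(dataset_id, document_path, api_key):
    # Hash the local source file
    with open(document_path, 'rb') as f:
        current_hash = hashlib.md5(f.read()).hexdigest()
    # Fetch document info from Dify
    response = requests.get(
        f"{DIFY_BASE_URL}/datasets/{dataset_id}/documents",
        headers={"Authorization": f"Bearer {api_key}"},
        timeout=30,
    )
    response.raise_for_status()
    doc_info = response.json()
    for doc in doc_info['data']:
        if doc['name'] == os.path.basename(document_path):
            # Storing the hash under custom_metadata['md5'] is this example's convention
            stored_hash = (doc.get('custom_metadata') or {}).get('md5')
            if stored_hash != current_hash:
                # trigger_reindex and update_doc_metadata are placeholders that
                # you must implement for your Dify version; not defined here
                trigger_reindex(dataset_id, doc['id'], api_key)
                update_doc_metadata(dataset_id, doc['id'], current_hash, api_key)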
Establish document hash tracking with a daily scheduled job to keep the knowledge base synchronized with source files.
4.4 Pitfall 4: Data Contamination in Multi-tenant Scenarios
In SaaS scenarios, different customers' data must be strictly isolated: Customer A's queries must never retrieve Customer B's documents.
Dify's isolation approach: Create a separate Dataset per customer; at query time, specify only that customer's Dataset ID. This is the safest isolation method.
But when customer count reaches hundreds, maintaining many Datasets becomes operationally expensive.
Alternative: Metadata Filtering (supported by Qdrant and Weaviate):
# Qdrant payload filtering for multi-tenancy
from qdrant_client.http.models import Filter, FieldCondition, MatchValue

# qdrant_client, query_embedding and current_tenant_id are assumed to be defined elsewhere
search_result = qdrant_client.search(
    collection_name="all_documents",
    query_vector=query_embedding,
    query_filter=Filter(
        must=[
            FieldCondition(
                key="tenant_id",
                match=MatchValue(value=current_tenant_id)
            )
        ]
    ),
    limit=10
)
Note: Metadata filtering requires additional application-layer encapsulation in Dify. It is more complex than Dataset isolation, but has lower long-term operational overhead at scale.
4.5 Decision Framework: Choosing the Right RAG Configuration
What is your scenario?
│
├── Documents < 100K tokens + limited budget
│   → pgvector + hybrid search + BGE-Reranker (local deployment)
│
├── Documents > 1M tokens + high concurrency
│   → Qdrant + hybrid search + Cohere Rerank
│
├── Multilingual documents (mixed Chinese/English)
│   → bge-m3 embedding + Cohere multilingual Rerank
│
├── Need exact keyword matching (product codes, etc.)
│   → Hybrid search (increase BM25 weight) + Rerank
│
└── Extreme latency requirements (P99 < 500ms)
    → Pure vector search (skip Rerank) + lower Top-K
Chapter Summary
RAG tuning is a systems engineering effort with no single silver bullet. Key takeaways:
Metric-driven approach: Build an evaluation benchmark first, quantify current quality with Recall@K and MRR, then optimize deliberately rather than by intuition.
Retrieval strategy: Default to hybrid search (vector + BM25) in production, which on average yields 15% higher Recall@5 than single-path retrieval.
Rerank is essential: Adding Rerank on top of hybrid search typically improves Precision@5 by another 20-30%; local BGE Reranker is the best cost-efficiency option.
Upstream quality sets the ceiling: Embedding model selection and chunking strategy determine the recall ceiling, so get these right before optimizing downstream stages.
Production checklist:
- Hybrid search enabled
- Rerank model configured (recommend local bge-reranker-v2-m3 or Cohere)
- Top-K calibrated (20-30 for initial retrieval, 5-10 after Rerank)
- Score threshold tested (avoid filtering out genuinely relevant content)
- Chunking strategy matches document type
- Evaluation benchmark established and run regularly
- Document update monitoring deployed