Building Knowledge Bases: Document Processing, Chunking and Index Optimization
Chapter 6: Building Knowledge Bases - Document Processing, Chunking Strategies, and Index Optimization
Knowledge base quality determines the ceiling of RAG applications. The right document processing and chunking strategy can improve the same AI model's answer quality by 30-50%.
Chapter Overview
Many developers treat Dify's knowledge base as a "file upload" feature: upload documents, wait for processing to complete, then start using it. This approach gets the system running, but rarely gets it performing well.
Building the knowledge base is the decisive step for RAG application quality. Document quality, chunking strategy, Embedding model selection, and index parameter configuration: every step significantly affects retrieval quality, which in turn affects the accuracy of AI answers.
This chapter systematically explains how to build a high-quality knowledge base from a practical standpoint. We'll cover the complete pipeline from document preparation to production operation, including details that aren't prominently highlighted in the Dify interface but are critically important.
By the end of this chapter, you will be able to:
- Choose appropriate preprocessing strategies for different document types
- Understand the principles behind chunking strategies and make sound parameter choices
- Select the right Embedding model for different scenarios
- Optimize Weaviate's index configuration for best performance
- Establish version management and continuous maintenance mechanisms for knowledge bases
- Diagnose and solve common knowledge base quality problems
Level 1: Foundational Understanding (1-3 Years Experience)
The Complete Knowledge Base Building Pipeline
Building a high-quality knowledge base requires going through these phases:
Phase 1: Document Preparation
  → Collect, clean, and format raw documents
Phase 2: Document Upload and Pre-processing
  → Dify parses document format, extracts plain text
Phase 3: Text Chunking
  → Split long documents into chunks suitable for retrieval
Phase 4: Vectorization
  → Embedding model converts text to vectors
Phase 5: Index Building
  → Vectors stored in vector database, retrieval index built
Phase 6: Validation and Tuning
  → Test retrieval performance, adjust parameters
Phase 7: Continuous Maintenance
  → Document updates, additions, deletions
Decisions at each phase affect the final result. Let's dive into each one.
Document Preparation: Quality Determines the Ceiling
The quality of documents uploaded to the knowledge base is the quality ceiling for the entire RAG system. Garbage in, garbage out.
Common document quality problems and solutions:
| Problem | Symptom | Solution |
|---|---|---|
| Scanned PDF OCR errors | Garbled text, wrong numbers | Use high-quality OCR tools or manually correct |
| Formatting issues | Table data parsed as scrambled text | Convert to Markdown or structured text |
| Redundant content | Headers, footers, copyright notices consuming many tokens | Remove during preprocessing |
| Duplicate content | Multiple document versions coexisting | Establish version management, keep only latest |
| Low information density | Large amounts of whitespace, images, decorative content | Extract core text content |
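The "Duplicate content" row can be partly automated before upload. The following is a minimal sketch, assuming plain-text inputs and exact duplicates only (ignoring case and whitespace); `find_duplicate_texts` is a hypothetical helper written for this example and is not part of Dify.

```python
import hashlib
import re
from collections import defaultdict


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return re.sub(r"\s+", " ", text).strip().lower()


def find_duplicate_texts(docs: dict[str, str]) -> list[list[str]]:
    """Group document names whose normalized content is identical.

    `docs` maps a document name to its text. Only groups with 2+ members are returned.
    """
    groups: dict[str, list[str]] = defaultdict(list)
    for name, text in docs.items():
        digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        groups[digest].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]


docs = {
    "manual_v1.txt": "Reset your password  from the login page.",
    "manual_copy.txt": "reset your password from the login page.",
    "faq.txt": "Exports are available as CSV.",
}
print(find_duplicate_texts(docs))
```

Near-duplicates (for example, two versions that differ by one paragraph) are not caught by hashing; those still need the version management described in the table.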
Recommended document formats (best to worst):
1. Markdown (.md): Best
- Clear structure (headings, lists, code blocks)
- Excellent chunking results (can split by heading level)
- High text density, no format noise
2. Plain text (.txt)
- No format noise
- Lacks structural information, chunking relies on content patterns
3. Word (.docx)
- Dify can parse basic formatting
- Complex elements like tables and images have limited parsing accuracy
4. PDF
- Native PDF (digitally created): Good parsing results
- Scanned PDF: Requires OCR; quality depends on OCR engine quality
5. HTML
- Dify removes HTML tags, extracts plain text
- Navigation bars, ads, and other noise need preprocessing removal
Document preprocessing script example:
import re
from pathlib import Path


def preprocess_document(input_path: str, output_path: str):
    """
    Generic document preprocessing: remove common noise content
    """
    with open(input_path, 'r', encoding='utf-8') as f:
        content = f.read()

    # 1. Remove consecutive blank lines (replace 3+ blank lines with 1)
    content = re.sub(r'\n{3,}', '\n\n', content)

    # 2. Remove common header/footer patterns (adjust for your documents)
    patterns_to_remove = [
        r'Page \d+ of \d+',                     # "Page X of Y"
        r'Copyright ©.*?All rights reserved',   # Copyright notices
        r'www\.[a-zA-Z0-9-]+\.[a-zA-Z]{2,}',    # URLs
        r'confidential|internal use only',      # Confidentiality markers
    ]
    for pattern in patterns_to_remove:
        content = re.sub(pattern, '', content, flags=re.IGNORECASE)

    # 3. Normalize whitespace
    content = re.sub(r' {2,}', ' ', content)  # Multiple spaces to single

    # 4. Strip leading/trailing whitespace
    content = content.strip()

    with open(output_path, 'w', encoding='utf-8') as f:
        f.write(content)

    print(f"Processed: {Path(input_path).name} -> {Path(output_path).name}")
    print(f"Original size: {Path(input_path).stat().st_size} bytes")
    print(f"Processed size: {Path(output_path).stat().st_size} bytes")
Choosing a Chunking Strategy
Chunking is the process of splitting long documents into small segments. The goal of chunking: each segment should be independently complete, sufficient to answer a specific type of question.
Dify's built-in chunking options:
When creating a knowledge base, choose "Automatic" or "Custom" chunking:
Automatic chunking (recommended for beginners):
- Dify automatically identifies document structure
- Splits at paragraph and semantic boundaries
- Works well for most cases
Custom chunking (recommended for advanced users):
Parameters:
├── Chunk size: maximum characters per chunk (default 500)
├── Chunk overlap: repeated characters between adjacent chunks (default 50)
└── Separator: what delimiter to use for splitting (e.g., \n\n for paragraphs)
Chunk size selection guide:
| Document Type | Recommended Chunk Size | Recommended Overlap | Reason |
|---|---|---|---|
| Technical docs (long-form) | 800-1000 chars | 100 chars | Technical concepts need sufficient context |
| Product manuals | 500-600 chars | 50 chars | Standard size, stable performance |
| FAQ Q&A | 200-300 chars | 0-30 chars | Each QA pair is inherently independent |
| Legal text | 600-800 chars | 100 chars | Legal clauses need complete citation |
| News articles | 400-500 chars | 50 chars | News paragraphs are usually self-contained |
An important practical principle: Chunk size should match the expected complexity of questions. Users asking simple questions ("What is the price?") need small chunks; users asking complex questions ("What is the technical architecture of this product?") need larger chunks.
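The guide table can be encoded as a small lookup so that ingestion scripts pick parameters consistently. This is a sketch only: the category keys are hypothetical names chosen for this example, and each range from the table is collapsed to its midpoint, which is a simplification rather than a recommendation from Dify.

```python
# (chunk_size, chunk_overlap) in characters, derived from the guide table.
# Ranges are collapsed to midpoints; that choice is an assumption of this sketch.
CHUNK_GUIDE: dict[str, tuple[int, int]] = {
    "technical_docs": (900, 100),
    "product_manual": (550, 50),
    "faq": (250, 15),
    "legal": (700, 100),
    "news": (450, 50),
}

# Fallback: the custom-chunking defaults mentioned earlier (500 / 50).
DEFAULT_PARAMS = (500, 50)


def recommend_chunk_params(doc_type: str) -> dict[str, int]:
    """Return chunking parameters for a document type, or the defaults if unknown."""
    size, overlap = CHUNK_GUIDE.get(doc_type, DEFAULT_PARAMS)
    return {"chunk_size": size, "chunk_overlap": overlap}


print(recommend_chunk_params("faq"))
print(recommend_chunk_params("meeting_notes"))
```

Treat the returned values as starting points and confirm them with retrieval tests on your own documents.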
Complete Walkthrough: Creating a Knowledge Base in Dify
Step 1: Enter the Knowledge Base module
Click "Knowledge" in the Dify left navigation → "Create Knowledge"
Step 2: Choose a data source
Data source options:
├── Upload files (focus of this chapter)
├── Sync via URL (web page content)
└── Connect Notion (enterprise users)
Step 3: Choose chunking and cleaning strategy
In "Data Processing Method" select:
- Automatic segmentation and cleaning (simple scenarios)
- Custom segmentation rules (advanced scenarios)
Custom segmentation configuration example:
Segment identifier: \n\n (split by blank line)
Maximum segment length: 600 characters
Segment overlap: 60 characters
Text preprocessing rules:
  ✓ Remove all URLs and email addresses
  ✓ Remove all HTML tags
  ✓ Replace consecutive spaces with single space
Step 4: Choose indexing mode
High Quality:
- Uses an Embedding model to vectorize each segment (an optional Q&A mode additionally uses an LLM to generate QA pairs)
- Supports vector, full-text, and hybrid retrieval
- Best results, consumes tokens
Economical:
- Does not call an Embedding model; extracts keywords from each segment instead
- Uses inverted index only (keywords)
- Low cost, weaker retrieval accuracy
Recommendation: Choose "High Quality" mode for the vast majority of scenarios. Token consumption is acceptable (processing 100 pages consumes roughly 20,000-50,000 tokens, costing less than $1).
Step 5: Choose Embedding model
Recommended choices:
├── text-embedding-3-small (OpenAI): Low cost, good quality, multilingual
├── bge-m3 (local deployment): Best multilingual performance, no API fee
└── text-embedding-3-large (OpenAI): Best quality, higher cost
Level 2: Mechanism Deep Dive (3-5 Years Experience)
Understanding Dify's Document Processing Pipeline
After you upload a document, Dify executes this processing pipeline in the background:
# Dify document processing pipeline (pseudocode)
class DocumentProcessingPipeline:
    def process(self, file: UploadedFile, dataset: Dataset, config: IndexingConfig):
        # Step 1: Document parsing (convert binary file to plain text)
        parser = self.get_parser(file.extension)  # PDF parser, Word parser, etc.
        raw_text = parser.parse(file.content)

        # Step 2: Text cleaning
        cleaner = TextCleaner(config.cleaning_rules)
        cleaned_text = cleaner.clean(raw_text)

        # Step 3: Text chunking
        splitter = TextSplitter(
            chunk_size=config.segment_max_tokens,
            chunk_overlap=config.segment_overlap,
            separator=config.separator
        )
        chunks = splitter.split(cleaned_text)

        # Step 4: Vectorization (batch processing for efficiency)
        embedding_model = self.get_embedding_model(dataset.embedding_model_id)
        embeddings = []
        for batch in self.batch_chunks(chunks, batch_size=100):
            batch_texts = [chunk.text for chunk in batch]
            batch_embeddings = embedding_model.encode(batch_texts)
            embeddings.extend(batch_embeddings)

        # Step 5: Storage
        for chunk, embedding in zip(chunks, embeddings):
            # Store in vector database
            self.vector_db.upsert(chunk, embedding, dataset.id)
            # Store in relational database (for management and keyword retrieval)
            self.relational_db.insert(chunk, dataset.id)

        # Step 6: Update document status
        self.update_document_status(file.id, status=DocumentStatus.COMPLETED)
Document format parsing quality:
PDF parsing quality depends on PDF type:
Native PDF (digitally created):
- Text extraction accuracy: > 99%
- Tables: Mostly extractable, but formatting may be lost
- Mathematical formulas: Usually unrecognizable
Scanned PDF (image-based):
- Relies on OCR; accuracy: 70-95% (depends on scan quality and OCR engine)
- Dify doesn't do OCR by default; it must be enabled in configuration
- Enabling OCR requires configuring an OCR service (e.g., Azure Form Recognizer)
Improving PDF parsing quality:
1. Use PyMuPDF (fitz) for text extraction
2. Use pdfplumber for table extraction
3. Use Adobe PDF Services API for high-quality OCR
Parent-Child Chunking Strategy
Dify v0.15+ introduced parent-child chunking, an advanced strategy that can significantly improve retrieval quality.
Core idea:
- Child chunks: Small segments (200-300 chars) for precise retrieval (vector similarity is more accurate on smaller chunks)
- Parent chunks: Large segments (1000-2000 chars) passed to LLM as context (contain more complete information)
Retrieval process:
- Use user's question to retrieve child chunks (precise localization)
- Once child chunk is found, retrieve its corresponding parent chunk
- Pass parent chunk content to LLM (complete context)
Effectiveness comparison (using technical documentation as example):
Traditional chunking (500-char chunks):
Retrieved: 500-char segment, possibly cut in the middle of a sentence
Problem: Incomplete context, LLM cannot give a complete answer
Parent-child chunking (200-char child + 1500-char parent):
Child chunk retrieval precisely locates relevant paragraph
Parent chunk provides complete section/chapter context
LLM gives accurate answer based on complete information
Measured results:
Traditional chunking Recall@5: 78%
Parent-child chunking Recall@5: 87% (+9 percentage points, ~12% relative improvement)
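The retrieval process described above can be illustrated with a toy example. This is a minimal sketch, not Dify's implementation: it scores child chunks by word overlap instead of vector similarity so that it runs without an Embedding model, and all function names are hypothetical.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercased word set; a stand-in for an embedding in this sketch."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def build_parent_child_index(parents: list[str], child_size: int) -> list[tuple[str, int]]:
    """Return (child_text, parent_index) pairs; each child is a slice of its parent."""
    index = []
    for p_idx, parent in enumerate(parents):
        for start in range(0, len(parent), child_size):
            index.append((parent[start:start + child_size], p_idx))
    return index


def retrieve_parent(query: str, parents: list[str], index: list[tuple[str, int]]) -> str:
    """Find the best-matching child chunk, then return its full parent chunk."""
    q = tokens(query)
    _, best_parent = max(index, key=lambda item: len(q & tokens(item[0])))
    return parents[best_parent]


parents = [
    "Exporting data. Open the Reports page and choose Export. Supported formats are CSV and JSON.",
    "Resetting a password. Click Forgot Password on the login page. The reset link expires after 24 hours.",
]
index = build_parent_child_index(parents, child_size=40)
print(retrieve_parent("how long is the reset link valid", parents, index))
```

The match happens on a 40-character child, but the LLM would receive the whole parent, including the "24 hours" detail that the matching child itself does not contain.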
Configuring parent-child chunking in Dify:
Knowledge Base Settings → Segmentation Method → Select "Parent-Child Segmentation":
Parent chunk settings:
Chunk size: 1500 characters
Separator: \n\n (by paragraph)
Child chunk settings:
Chunk size: 200 characters
Overlap: 20 characters
The Importance of Metadata
Adding metadata to document chunks in the knowledge base can greatly improve retrieval precision and usability.
Metadata types supported by Dify:
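# Adding metadata when uploading documents via Dify API
import json
import requests

DIFY_API_URL = "https://api.dify.ai/v1"  # or your self-hosted base URL
API_KEY = "your-dataset-api-key"         # placeholder
dataset_id = "your-dataset-id"           # placeholder

metadata = {
    "document_type": "product_manual",  # Document type
    "version": "2.0",                   # Version number
    "effective_date": "2024-01-01",     # Effective date
    "department": "engineering",        # Department
    "confidentiality": "public",        # Confidentiality level
    "language": "en",                   # Language
}

# Upload document. Settings are sent as a JSON string in the "data" form field.
# Metadata field names differ between Dify versions; check the API reference
# for your deployment before relying on "doc_metadata".
with open("product_manual_v2.pdf", "rb") as f:
    response = requests.post(
        f"{DIFY_API_URL}/datasets/{dataset_id}/document/create-by-file",
        headers={"Authorization": f"Bearer {API_KEY}"},
        files={"file": f},
        data={"data": json.dumps({
            "indexing_technique": "high_quality",
            "process_rule": {"mode": "automatic"},
            "doc_metadata": metadata,
        })},
    )
response.raise_for_status()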
Metadata-based retrieval filtering (in a workflow):
Knowledge Retrieval node configuration:
Retrieval query: {{user_question}}
Filter conditions:
- document_type == "product_manual" (only retrieve product manuals)
- version == "2.0" (only retrieve latest version)
- language == "en" (only retrieve English documents)
This approach enables precise knowledge base isolation, which is particularly suited to enterprise knowledge base management across multiple products and versions.
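Conceptually, the filter conditions above are an equality match applied to each chunk's metadata before (or alongside) similarity ranking. The following sketch shows that logic on in-memory data; the chunk structure and function names are assumptions made for illustration and do not mirror Dify's internal types.

```python
def matches(metadata: dict[str, str], filters: dict[str, str]) -> bool:
    """True when every filter key exists in the metadata with an equal value."""
    return all(metadata.get(key) == value for key, value in filters.items())


def filter_chunks(chunks: list[dict], filters: dict[str, str]) -> list[dict]:
    """Keep only the chunks whose metadata satisfies all filter conditions."""
    return [chunk for chunk in chunks if matches(chunk["metadata"], filters)]


chunks = [
    {"text": "Export as CSV or JSON.",
     "metadata": {"document_type": "product_manual", "version": "2.0", "language": "en"}},
    {"text": "Export as CSV only.",
     "metadata": {"document_type": "product_manual", "version": "1.0", "language": "en"}},
    {"text": "Q3 sales playbook.",
     "metadata": {"document_type": "sales_guide", "version": "2.0", "language": "en"}},
]
filters = {"document_type": "product_manual", "version": "2.0", "language": "en"}
print([c["text"] for c in filter_chunks(chunks, filters)])
```

Filtering on version removes the outdated "CSV only" chunk before the LLM ever sees it, which is the main benefit when several document versions coexist.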
Best Practices for Document Updates
A universal challenge in enterprise knowledge bases: documents need continuous updates. How do you update the knowledge base without disrupting service?
Option 1: Direct replacement (simple scenarios)
# Update document via Dify API
# Step 1: Delete old document
curl -X DELETE \
  -H "Authorization: Bearer $API_KEY" \
  "https://api.dify.ai/v1/datasets/$DATASET_ID/documents/$OLD_DOC_ID"

# Step 2: Upload new document (settings go in the "data" form field as JSON)
curl -X POST \
  -H "Authorization: Bearer $API_KEY" \
  --form 'data="{\"indexing_technique\":\"high_quality\",\"process_rule\":{\"mode\":\"automatic\"}}";type=text/plain' \
  --form "file=@new_document.pdf" \
  "https://api.dify.ai/v1/datasets/$DATASET_ID/document/create-by-file"
Note: There's a time window between deletion and re-upload during which the document content isn't in the knowledge base. For critical documents, this may cause query failures.
Option 2: Blue-Green Knowledge Base (zero-downtime updates)
class KnowledgeBaseUpdater:
    """
    Blue-green knowledge base update strategy:
    Maintain two knowledge bases (blue: current production, green: pending update)
    Operate on green during updates, then switch traffic to green when done

    Note: wait_for_indexing_complete and switch_traffic are deployment-specific
    helpers that are not shown here.
    """

    def __init__(self, dify_client):
        self.client = dify_client
        self.blue_dataset_id = "dataset-blue-xxx"    # Current production KB
        self.green_dataset_id = "dataset-green-xxx"  # Pending update KB
        self.active = "blue"                         # Currently active KB

    def update_knowledge_base(self, new_documents: list[str]):
        # 1. Update documents in the inactive KB (green)
        inactive = self.green_dataset_id if self.active == "blue" else self.blue_dataset_id

        # Clear old content
        self.client.clear_dataset(inactive)

        # Upload new documents
        for doc_path in new_documents:
            self.client.upload_document(inactive, doc_path)

        # Wait for indexing to complete
        self.wait_for_indexing_complete(inactive)

        # Validate new KB (run test queries)
        if self.validate_knowledge_base(inactive):
            # Switch traffic: update application's KB ID configuration
            self.switch_traffic(inactive)
            self.active = "green" if self.active == "blue" else "blue"
            print(f"Successfully switched to knowledge base: {inactive}")
        else:
            print("Validation failed, keeping original knowledge base")

    def validate_knowledge_base(self, dataset_id: str) -> bool:
        """Run validation queries to ensure new KB works correctly"""
        test_queries = [
            "export data formats",
            "how to register account",
            "password reset steps",
        ]
        for query in test_queries:
            results = self.client.retrieve(dataset_id, query, top_k=3)
            if not results or results[0].score < 0.5:
                print(f"Validation failed: query '{query}' didn't get high-quality results")
                return False
        return True
Level 3: Source Code and Principles (5+ Years Experience)
Dify Document Chunking Implementation Details
Dify uses LangChain's RecursiveCharacterTextSplitter as the base chunking implementation (though progressively migrating to its own implementation, the core algorithm is the same):
# Dify text chunking implementation, simplified (related logic in api/core/indexing_runner.py)
class FixedRecursiveCharacterTextSplitter:
    """
    Recursive character text splitter:
    Tries separators in priority order, using the highest-priority separator first.
    If chunks are still too large, falls back to the next separator.
    """
    DEFAULT_SEPARATORS = [
        "\n\n",  # First try splitting by paragraph
        "\n",    # Then by newline
        ". ",    # Then by period + space (sentence boundary)
        ".",     # Then by period
        ";",     # Then by semicolon
        " ",     # Then by space
        "",      # Finally, force-split by character count
    ]

    def __init__(self, chunk_size: int, chunk_overlap: int, separators: list[str] | None = None):
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        self.separators = separators or self.DEFAULT_SEPARATORS

    def split_text(self, text: str) -> list[str]:
        chunks = self._split_text(text, self.separators)
        # Handle overlap once, after recursion, so nested splits are not overlapped twice
        if self.chunk_overlap > 0:
            chunks = self._add_overlap(chunks)
        return chunks

    def _split_text(self, text: str, separators: list[str]) -> list[str]:
        # Try the current priority separator
        separator = separators[0]
        if separator == "":
            # str.split("") raises ValueError, so force-split by character count
            return [text[i:i + self.chunk_size] for i in range(0, len(text), self.chunk_size)]
        splits = text.split(separator)
        chunks = []
        current_chunk = ""
        for split in splits:
            if len(current_chunk) + len(split) + len(separator) <= self.chunk_size:
                current_chunk += split + separator
            else:
                if current_chunk.strip():
                    chunks.append(current_chunk.strip())
                # If a single piece is still too large, use next separator recursively
                if len(split) > self.chunk_size and len(separators) > 1:
                    sub_chunks = self._split_text(split, separators[1:])
                    chunks.extend(sub_chunks[:-1])
                    current_chunk = sub_chunks[-1] + separator if sub_chunks else ""
                else:
                    current_chunk = split + separator
        if current_chunk.strip():
            chunks.append(current_chunk.strip())
        return chunks

    def _add_overlap(self, chunks: list[str]) -> list[str]:
        """Add overlap content between adjacent chunks (result may exceed chunk_size by the overlap)"""
        if len(chunks) <= 1:
            return chunks
        overlapped_chunks = [chunks[0]]
        for i in range(1, len(chunks)):
            prev_chunk = chunks[i - 1]
            current_chunk = chunks[i]
            # Take the last `overlap` characters of previous chunk as prefix
            overlap_text = prev_chunk[-self.chunk_overlap:]
            overlapped_chunks.append(overlap_text + current_chunk)
        return overlapped_chunks
Understanding why recursion is needed:
Consider this situation: a paragraph has 2,000 characters, exceeding the 500-character chunk limit. Splitting by \n\n leaves this paragraph intact because it contains no blank line, so it is still too large. The algorithm then falls back to \n, then to periods, and so on, until it can produce appropriately sized chunks. This recursive strategy keeps chunks at or near chunk_size (the overlap prefix is added on top of that limit) while preserving semantic integrity as much as possible.
Optimizing Embedding Batch Processing
Dify uses batch requests during document vectorization to reduce API calls and latency:
import time

# RateLimitError comes from your provider's SDK; the OpenAI SDK is shown as an example
from openai import RateLimitError


class EmbeddingBatchProcessor:
    """
    Batch processing for document vectorization, optimizing API call efficiency
    """

    def __init__(self, model, batch_size: int = 100):
        self.model = model
        self.batch_size = batch_size

    def process_chunks(self, chunks: list["DocumentChunk"]) -> list["DocumentChunk"]:
        """
        Batch vectorize document chunks
        OpenAI text-embedding-3-small limits:
        - Max 2,048 texts per request
        - Max 8,191 tokens per text
        - Rate limit: 1,000,000 TPM (varies by account tier)
        """
        processed_chunks = []
        for i in range(0, len(chunks), self.batch_size):
            batch = chunks[i:i + self.batch_size]
            batch_texts = [chunk.content for chunk in batch]
            try:
                embeddings = self.model.embed_documents(batch_texts)
            except RateLimitError:
                time.sleep(60)  # Wait, then retry once on rate limit
                embeddings = self.model.embed_documents(batch_texts)
            for chunk, embedding in zip(batch, embeddings):
                chunk.embedding = embedding
                processed_chunks.append(chunk)
            progress = (i + len(batch)) / len(chunks) * 100
            print(f"Vectorization progress: {progress:.1f}% ({i + len(batch)}/{len(chunks)})")
        return processed_chunks
Speed reference (text-embedding-3-small, 1,000 document chunks, 100 tokens each):
Small batch (batch_size=10):
API calls: 100
Total time: ~30 seconds (heavily affected by network latency)
Large batch (batch_size=100):
API calls: 10
Total time: ~8 seconds (network latency amortized)
Maximum batch (batch_size=2048):
API calls: 1 (theoretically)
Total time: ~3-5 seconds (limited by API processing time)
Note: Larger batches aren't always better; very large batches may hit token limits
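One way to respect both limits is to cap each batch by item count and by an estimated token total. The sketch below assumes a crude four-characters-per-token estimate and caller-supplied limits; `batch_by_budget` and `rough_token_count` are hypothetical helpers, and a real tokenizer should replace the estimate in production.

```python
from typing import Callable, Iterator


def rough_token_count(text: str) -> int:
    """Crude estimate (about 4 characters per token); an assumption of this sketch."""
    return max(1, len(text) // 4)


def batch_by_budget(
    texts: list[str],
    max_items: int,
    max_tokens: int,
    count_tokens: Callable[[str], int] = rough_token_count,
) -> Iterator[list[str]]:
    """Yield batches that respect both an item limit and a total-token budget."""
    batch: list[str] = []
    used = 0
    for text in texts:
        cost = count_tokens(text)
        # Close the current batch if adding this text would break either limit
        if batch and (len(batch) >= max_items or used + cost > max_tokens):
            yield batch
            batch, used = [], 0
        batch.append(text)
        used += cost
    if batch:
        yield batch


texts = ["a" * 400] * 5  # five texts of ~100 estimated tokens each
batches = list(batch_by_budget(texts, max_items=100, max_tokens=250))
print([len(b) for b in batches])
```

A single text larger than the budget still gets its own batch here; per-text limits need a separate check before batching.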
Weaviate Data Model and Query Optimization
Understanding Weaviate's data model helps with performance tuning when needed:
# Full configuration of Weaviate collection (used when Dify creates knowledge base)
collection_config = {
    "class": "DatasetXxx",
    "description": "Dify dataset collection",
    # Vector index configuration
    "vectorIndexConfig": {
        "distance": "cosine",              # Use cosine similarity
        "ef": 64,                          # Dynamic candidate list size
        "efConstruction": 128,             # Candidate list size during index building
        "maxConnections": 64,              # Max connections per node
        "vectorCacheMaxObjects": 1000000,  # Vector cache object count
        "dynamicEfFactor": 8,
        "dynamicEfMin": 25,
        "dynamicEfMax": 500,
    },
    # Data property definitions
    "properties": [
        {
            "name": "text",
            "dataType": ["text"],
            "indexInverted": True,   # Enable keyword index (for hybrid search)
        },
        {
            "name": "doc_id",
            "dataType": ["string"],
            "indexInverted": False,  # Used for filtering only, no keyword index needed
        },
        # ... other metadata fields
    ],
    # Inverted index configuration (for BM25 full-text retrieval)
    "invertedIndexConfig": {
        "bm25": {
            "b": 0.75,   # BM25 b parameter (document length normalization factor)
            "k1": 1.2,   # BM25 k1 parameter (term frequency saturation)
        },
        "stopwords": {
            "preset": "en",  # English stopwords
        },
        "cleanupIntervalSeconds": 60,
    }
}
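To build intuition for the `b` and `k1` values in the inverted index configuration, the following sketch computes the contribution of a single query term under the standard Okapi BM25 formula. It is illustrative only: Weaviate's exact IDF variant may differ, and the document statistics used here are made up for the example.

```python
import math


def bm25_term_score(
    tf: int,
    doc_len: int,
    avg_doc_len: float,
    n_docs: int,
    doc_freq: int,
    k1: float = 1.2,
    b: float = 0.75,
) -> float:
    """Score contribution of one query term for one document (Okapi BM25)."""
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    length_norm = 1 - b + b * (doc_len / avg_doc_len)
    return idf * (tf * (k1 + 1)) / (tf + k1 * length_norm)


# Same term frequency, different chunk lengths: b penalizes the longer chunk.
short = bm25_term_score(tf=2, doc_len=100, avg_doc_len=200, n_docs=1000, doc_freq=50)
long_ = bm25_term_score(tf=2, doc_len=400, avg_doc_len=200, n_docs=1000, doc_freq=50)
print(short > long_)

# k1 controls saturation: ten occurrences score far less than ten times one occurrence.
once = bm25_term_score(tf=1, doc_len=200, avg_doc_len=200, n_docs=1000, doc_freq=50)
ten = bm25_term_score(tf=10, doc_len=200, avg_doc_len=200, n_docs=1000, doc_freq=50)
print(ten < 10 * once)
```

Because chunk lengths in a knowledge base are fairly uniform, `b` usually matters less here than in whole-document search; widely varying chunk sizes are the case where tuning it is worth testing.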
Level 4: Production Pitfalls and Decision Making (Expert Perspective)
Pitfall 1: Knowledge Base "Overload" Problem
A very common mistake: putting all content into a single knowledge base, expecting AI to find answers to any question from it.
Problem: As a rough rule of thumb, once a knowledge base grows beyond about 10,000 document chunks, retrieval precision tends to decline. Reasons:
- More chunks in vector space means more "noise"
- The same word may have different meanings in different contexts, interfering with retrieval
- In BM25 retrieval, IDF values for common words get pulled down
Solution: Layered knowledge base architecture
Recommended architecture:
Layer 1: General knowledge base (company-wide shared)
- Company overview, product overview, policies and regulations
- Foundational content that updates infrequently
- Size: < 1,000 document chunks
Layer 2: Product knowledge bases (split by product line)
- Product A manual knowledge base
- Product B manual knowledge base
- Each independent, reducing cross-interference
Layer 3: Department-specific knowledge bases
- Technical documentation library (for engineers)
- Sales knowledge base (for sales team)
- Customer service scripts (for support team)
Multi-knowledge-base routing in Dify (using Workflow):
Workflow nodes:
[Start] → Receive question
    ↓
[LLM: Question Classification] → Determine question type
    {product_a, product_b, general, technical}
    ↓
[IF/ELSE Branch]
├── product_a → [Knowledge Retrieval: Product A KB]
├── product_b → [Knowledge Retrieval: Product B KB]
├── technical → [Knowledge Retrieval: Technical Docs KB]
└── general → [Knowledge Retrieval: General KB]
    ↓
[LLM: Generate Answer]
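The routing logic of this workflow can be sketched outside Dify as a classify-then-lookup step. In the sketch below a keyword matcher stands in for the LLM classification node, and the dataset IDs and keyword lists are placeholders invented for the example.

```python
# Hypothetical dataset IDs; replace with the IDs of your own knowledge bases.
KB_BY_CATEGORY = {
    "product_a": "kb-product-a",
    "product_b": "kb-product-b",
    "technical": "kb-technical-docs",
    "general": "kb-general",
}

# Keyword stand-in for the LLM classifier; checked in order, first match wins.
KEYWORDS = {
    "product_a": ["product a"],
    "product_b": ["product b"],
    "technical": ["api", "sdk", "error code"],
}


def classify(question: str) -> str:
    """Return a category label, falling back to 'general' when nothing matches."""
    q = question.lower()
    for category, words in KEYWORDS.items():
        if any(word in q for word in words):
            return category
    return "general"


def route(question: str) -> str:
    """Map a question to the knowledge base that should be searched."""
    return KB_BY_CATEGORY[classify(question)]


print(route("How do I install Product A?"))
print(route("Which API endpoints support pagination?"))
print(route("What is your refund policy?"))
```

Keeping a "general" fallback matters: a misclassified question that lands in the wrong product knowledge base usually retrieves nothing useful, whereas the general base at least gives the LLM company-wide context.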
Pitfall 2: Silent Failure of Document Parsing
When Dify processes documents, if a document fails to parse, it silently marks it as "error" status without stopping other documents from processing, and without actively notifying you.
How to discover parsing failures:
# Check document status in knowledge base via API
curl -s -H "Authorization: Bearer $API_KEY" \
  "https://api.dify.ai/v1/datasets/$DATASET_ID/documents?page=1&limit=50" \
  | python3 -c "
import sys, json
data = json.load(sys.stdin)
for doc in data.get('data', []):
    # Anything other than 'completed' is either still processing or failed
    if doc.get('indexing_status') != 'completed':
        print(f\"NOT COMPLETED: {doc.get('name')} - Status: {doc.get('indexing_status')}\")
"
Common parsing failure causes and solutions:
| Error Cause | Symptom | Solution |
|---|---|---|
| Scanned PDF without OCR enabled | indexing_status: error | Enable OCR config or manually extract text |
| File exceeds size limit | Upload fails immediately | Split large files or compress |
| Special character encoding issues | Garbled text | Convert to UTF-8 encoding |
| Password-protected PDF | Cannot parse | Remove PDF password protection |
| Corrupted Office file | indexing_status: error | Repair or re-export file |
Build indexing monitoring:
import requests
import time


def monitor_indexing_status(dataset_id: str, api_key: str, expected_count: int):
    """
    Monitor knowledge base document indexing status until all documents are processed
    """
    headers = {"Authorization": f"Bearer {api_key}"}
    while True:
        response = requests.get(
            f"https://api.dify.ai/v1/datasets/{dataset_id}/documents",
            headers=headers,
            params={"page": 1, "limit": 100},  # First page only; paginate for larger datasets
            timeout=30,
        )
        response.raise_for_status()
        documents = response.json().get("data", [])
        completed = [d for d in documents if d["indexing_status"] == "completed"]
        failed = [d for d in documents if d["indexing_status"] == "error"]
        processing = [d for d in documents if d["indexing_status"] in
                      ["waiting", "parsing", "cleaning", "splitting", "indexing"]]
        print(f"Progress: {len(completed)}/{expected_count} complete, "
              f"{len(failed)} failed, {len(processing)} processing")
        if failed:
            for doc in failed:
                print(f"  [ERROR] {doc['name']}: {doc.get('error', 'Unknown error')}")
        if len(completed) + len(failed) >= expected_count:
            print("All documents processed!")
            break
        time.sleep(30)  # Check every 30 seconds
Pitfall 3: Knowledge Base "Memory Drift"
After running in production for a while, the knowledge base may exhibit "memory drift":
- The same question gets different answers today compared to three months ago
- Even though documents were updated, AI still provides old version information
Root cause investigation checklist:
□ Verify document update succeeded
  → Check document updated_at timestamp via API
□ Verify old documents were deleted
  → If new documents were uploaded without deleting old ones, old content still exists
□ Check for orphaned data in vector database
  → Sometimes Dify metadata deletes succeed but vector data isn't synchronized
□ Verify Embedding model wasn't changed
  → After changing Embedding model, old vectors are invalid; the index must be rebuilt
□ Check knowledge base cache
  → Dify has short-term retrieval result cache (typically a few minutes)
  → Recently updated content may still be cached
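The orphaned-data check in the list above reduces to comparing two sets of segment IDs: those Dify's relational database still knows about and those present in the vector store. The sketch below shows only that comparison; how the two ID sets are fetched depends on your deployment and is deliberately left out, and `find_orphans` is a hypothetical helper.

```python
def find_orphans(relational_ids: set[str], vector_ids: set[str]) -> dict[str, set[str]]:
    """Compare segment IDs from the relational database with those in the vector store.

    orphaned_vectors: vectors whose segment record no longer exists (stale content
                      that can still be retrieved)
    missing_vectors:  segment records with no vector (content that cannot be found
                      by vector search)
    """
    return {
        "orphaned_vectors": vector_ids - relational_ids,
        "missing_vectors": relational_ids - vector_ids,
    }


# Example data: seg-3 was deleted in Dify but its vector was left behind,
# and seg-4 was never vectorized.
relational_ids = {"seg-1", "seg-2", "seg-4"}
vector_ids = {"seg-1", "seg-2", "seg-3"}
report = find_orphans(relational_ids, vector_ids)
print(report)
```

Orphaned vectors are the more likely cause of "old version" answers, since they keep matching queries after the document has been removed from the management view.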
Build a scheduled knowledge base health check:
import schedule
import time

# get_document_count, retrieve, alert, DATASET_ID and EXPECTED_MIN_DOCS are
# placeholders to be implemented or configured for your own deployment.


def knowledge_base_health_check():
    """Daily health check for knowledge base at 2 AM"""
    # 1. Check document count meets expectations
    doc_count = get_document_count(DATASET_ID)
    if doc_count < EXPECTED_MIN_DOCS:
        alert(f"Knowledge base document count below expected: {doc_count} < {EXPECTED_MIN_DOCS}")

    # 2. Run standard test queries
    test_cases = [
        ("export formats", "CSV"),  # query, expected keyword in results
        ("register account", "email"),
        ("password reset", "24 hours"),
    ]
    for query, expected_keyword in test_cases:
        results = retrieve(DATASET_ID, query, top_k=3)
        if not any(expected_keyword.lower() in r.text.lower() for r in results):
            alert(f"Retrieval quality degraded: query '{query}' "
                  f"didn't find chunks containing '{expected_keyword}'")

    # 3. Check average retrieval score
    scores = [r.score for r in retrieve(DATASET_ID, "test query", top_k=5)]
    avg_score = sum(scores) / len(scores) if scores else 0
    if avg_score < 0.5:
        alert(f"Average retrieval score too low: {avg_score:.3f}")


# Set up scheduled task
schedule.every().day.at("02:00").do(knowledge_base_health_check)
while True:
    schedule.run_pending()
    time.sleep(60)
Knowledge Base Performance Benchmarks and Budget Planning
Processing time benchmarks (different knowledge base sizes, Dify default config):
Scale: 1,000 document chunks (~100 pages)
Vectorization time: ~1-3 minutes
Total processing time (including parsing, storage): ~3-8 minutes
Embedding cost (text-embedding-3-small): ~$0.002
Scale: 10,000 document chunks (~1,000 pages)
Vectorization time: ~10-30 minutes
Total processing time: ~20-60 minutes
Embedding cost: ~$0.02
Scale: 100,000 document chunks (~10,000 pages)
Vectorization time: ~1-3 hours
Total processing time: ~2-6 hours
Embedding cost: ~$0.2
Scale: 1,000,000 document chunks (~100,000 pages)
May need to evaluate Weaviate capacity (default config may need scaling)
Vectorization time: ~10-30 hours
Embedding cost: ~$2
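The embedding cost figures above scale linearly with chunk count, so they can be reproduced with one line of arithmetic. The sketch assumes about 100 tokens per chunk (the figure used in the speed reference earlier) and a price of $0.02 per million tokens for text-embedding-3-small; both are assumptions to verify against your own data and current provider pricing.

```python
def embedding_cost_usd(
    num_chunks: int,
    tokens_per_chunk: int = 100,           # assumption: matches the speed reference
    usd_per_million_tokens: float = 0.02,  # assumption: check current pricing
) -> float:
    """Estimated one-time embedding cost for indexing a knowledge base."""
    total_tokens = num_chunks * tokens_per_chunk
    return total_tokens * usd_per_million_tokens / 1_000_000


for scale in (1_000, 10_000, 100_000, 1_000_000):
    print(f"{scale:>9,} chunks: ~${embedding_cost_usd(scale):.3f}")
```

Longer chunks raise the cost proportionally: at 500 tokens per chunk every figure is five times higher, which is still small compared with LLM generation costs at query time.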
Vector storage space estimation:
Storage space per document chunk:
Vector (1536 dims × 4 bytes): 6,144 bytes ≈ 6KB
Text (avg 500 chars): ~1KB
Metadata: ~0.5KB
Total: ~7.5KB per chunk
Storage requirement for 100,000 chunks:
Vector database (Weaviate): ~750MB
PostgreSQL (metadata): ~100MB
Total: ~1GB (with index overhead ~1.5-2GB)
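The same estimate can be parameterized so it is easy to rerun for a different Embedding model or chunk size. This is a sketch of the arithmetic above only; it uses decimal megabytes and ignores index overhead, and the text and metadata sizes are the averages assumed in this section.

```python
def chunk_bytes(
    dims: int = 1536,
    bytes_per_float: int = 4,
    text_bytes: int = 1000,     # ~1KB of text per chunk, as assumed above
    metadata_bytes: int = 500,  # ~0.5KB of metadata per chunk, as assumed above
) -> int:
    """Raw storage per chunk: vector + text + metadata."""
    return dims * bytes_per_float + text_bytes + metadata_bytes


def storage_mb(num_chunks: int, **kwargs: int) -> float:
    """Raw storage in decimal megabytes, before index overhead."""
    return num_chunks * chunk_bytes(**kwargs) / 1_000_000


print(chunk_bytes())                    # bytes per chunk at 1536 dimensions
print(storage_mb(100_000))              # MB for 100,000 chunks
print(storage_mb(100_000, dims=3072))   # same, with a 3072-dimension model
```

Doubling the vector dimension (for example, moving from text-embedding-3-small to text-embedding-3-large) nearly doubles total storage, because the vector dominates the per-chunk footprint.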
Chapter Summary
A high-quality knowledge base is not "upload documents and you're done"; it requires careful design at multiple steps: document preparation, chunking strategy, Embedding selection, index configuration, and continuous maintenance.
Key Takeaways:
- Document quality is the ceiling: Garbage in, garbage out; time spent on document preprocessing yields immediate improvement
- Chunking strategy must match the business: Small chunks for FAQ, large chunks for technical docs, parent-child chunking for complex scenarios
- Metadata is the key to precise routing: Add version, type, and other metadata to documents; filter during retrieval for higher precision
- Monitoring and maintenance are ongoing work: Knowledge bases aren't a one-time build; they need continuous updating and optimization as business evolves
- Knowledge bases need layered design: Don't stuff all content into one knowledge base; manage by business domain, product, and department
This chapter concludes the knowledge base section of the Dify book. At this point, you have mastered the complete knowledge system for building a high-quality knowledge Q&A system from scratch: from Dify's foundational concepts (Chapters 1-2), to quick start (Chapter 3), model selection (Chapter 4), RAG principles (Chapter 5), and knowledge base construction (Chapter 6).
Upcoming chapters will go deeper into Dify's advanced features: complex workflow orchestration, Agent production deployment, and deep integration with business systems.