Chapter 6

Building Knowledge Bases: Document Processing, Chunking and Index Optimization

Knowledge base quality determines the ceiling of RAG applications. The right document processing and chunking strategy can improve the same AI model's answer quality by 30-50%.

Chapter Overview

Many developers treat Dify's knowledge base as a "file upload" feature: upload documents, wait for processing to complete, then start using it. This approach gets the system running, but rarely gets it performing well.

Building the knowledge base is the decisive step for RAG application quality. Document quality, chunking strategy, Embedding model selection, and index parameter configuration — every step significantly affects retrieval quality, which in turn affects the accuracy of AI answers.

This chapter systematically explains how to build a high-quality knowledge base from a practical standpoint. We'll cover the complete pipeline from document preparation to production operation, including details that aren't prominently highlighted in the Dify interface but are critically important.

By the end of this chapter, you will be able to:

Choose appropriate preprocessing strategies for different document types
Understand the principles behind chunking strategies and make sound parameter choices
Select the right Embedding model for different scenarios
Optimize Weaviate's index configuration for best performance
Establish version management and continuous maintenance mechanisms for knowledge bases
Diagnose and solve common knowledge base quality problems

Level 1: Foundational Understanding (1-3 Years Experience)

The Complete Knowledge Base Building Pipeline

Building a high-quality knowledge base requires going through these phases:

Phase 1: Document Preparation
  ↓ Collect, clean, and format raw documents

Phase 2: Document Upload and Pre-processing
  ↓ Dify parses document format, extracts plain text

Phase 3: Text Chunking
  ↓ Split long documents into chunks suitable for retrieval

Phase 4: Vectorization
  ↓ Embedding model converts text to vectors

Phase 5: Index Building
  ↓ Vectors stored in vector database, retrieval index built

Phase 6: Validation and Tuning
  ↓ Test retrieval performance, adjust parameters

Phase 7: Continuous Maintenance
  ↓ Document updates, additions, deletions

Decisions at each phase affect the final result. Let's dive into each one.

Document Preparation: Quality Determines the Ceiling

The quality of documents uploaded to the knowledge base is the quality ceiling for the entire RAG system. Garbage in, garbage out.

Common document quality problems and solutions:

Problem	Symptom	Solution
Scanned PDF OCR errors	Garbled text, wrong numbers	Use high-quality OCR tools or manually correct
Formatting issues	Table data parsed as scrambled text	Convert to Markdown or structured text
Redundant content	Headers, footers, copyright notices consuming many tokens	Remove during preprocessing
Duplicate content	Multiple document versions coexisting	Establish version management, keep only latest
Low information density	Large amounts of whitespace, images, decorative content	Extract core text content

Recommended document formats (best to worst):

1. Markdown (.md) — Best
   - Clear structure (headings, lists, code blocks)
   - Excellent chunking results (can split by heading level)
   - High text density, no format noise

2. Plain text (.txt)
   - No format noise
   - Lacks structural information, chunking relies on content patterns

3. Word (.docx)
   - Dify can parse basic formatting
   - Complex elements like tables and images have limited parsing accuracy

4. PDF
   - Native PDF (digitally created): Good parsing results
   - Scanned PDF: Requires OCR; quality depends on OCR engine quality

5. HTML
   - Dify removes HTML tags, extracts plain text
   - Navigation bars, ads, and other noise need preprocessing removal
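For HTML sources, the noise removal mentioned above can be done before upload. The following is a minimal sketch using only the Python standard library; the set of tags treated as noise (`NOISE_TAGS`) and the helper names are assumptions for illustration, and real pages usually need site-specific rules.

```python
from html.parser import HTMLParser

# Tags whose entire contents are treated as noise (an assumption; adjust per site).
NOISE_TAGS = {"script", "style", "nav", "header", "footer", "aside"}


class MainTextExtractor(HTMLParser):
    """Collect visible text while skipping everything inside noise tags."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while the parser is inside a noise tag
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside noise tags
        if self._skip_depth == 0 and data.strip():
            self._parts.append(data.strip())

    def text(self) -> str:
        return "\n".join(self._parts)


def html_to_text(html: str) -> str:
    """Return the page text with navigation, footer, and script content removed."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = (
    "<html><body><nav><a href='/'>Home</a></nav>"
    "<h1>Export Guide</h1><p>Data can be exported as CSV.</p>"
    "<footer>Copyright 2024</footer></body></html>"
)
print(html_to_text(sample))
```

Saving the result as .txt or .md before upload keeps the navigation and footer text out of every chunk.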

Document preprocessing script example:

import re
from pathlib import Path

def preprocess_document(input_path: str, output_path: str):
    """
    Generic document preprocessing: remove common noise content
    """
    with open(input_path, 'r', encoding='utf-8') as f:
        content = f.read()
    
    # 1. Remove common header/footer patterns (adjust for your documents)
    patterns_to_remove = [
        r'Page \d+ of \d+',                        # "Page X of Y"
        r'Copyright ©.*?All rights reserved\.?',   # Copyright notices
        r'(?:https?://|www\.)\S+',                 # URLs
        # Confidentiality markers: only lines that consist of the marker alone,
        # so the word "confidential" inside a normal sentence is kept
        r'^[ \t]*(?:confidential|internal use only)[ \t]*$',
    ]

    for pattern in patterns_to_remove:
        content = re.sub(pattern, '', content, flags=re.IGNORECASE | re.MULTILINE)

    # 2. Normalize whitespace: collapse runs of spaces/tabs, trim line ends
    content = re.sub(r'[ \t]{2,}', ' ', content)
    content = re.sub(r'[ \t]+$', '', content, flags=re.MULTILINE)

    # 3. Collapse runs of blank lines into a single blank line
    #    (done after the removals above, which can leave empty lines behind)
    content = re.sub(r'\n{3,}', '\n\n', content)

    # 4. Strip leading/trailing whitespace
    content = content.strip()
    
    with open(output_path, 'w', encoding='utf-8') as f:
        f.write(content)
    
    print(f"Processed: {Path(input_path).name} → {Path(output_path).name}")
    print(f"Original size: {Path(input_path).stat().st_size} bytes")
    print(f"Processed size: {Path(output_path).stat().st_size} bytes")

Choosing a Chunking Strategy

Chunking is the process of splitting long documents into small segments. The goal is for each segment to be self-contained: it should carry enough context to answer a specific type of question on its own.

Dify's built-in chunking options:

When creating a knowledge base, choose "Automatic" or "Custom" chunking:

Automatic chunking (recommended for beginners):
  - Dify automatically identifies document structure
  - Splits at paragraph and semantic boundaries
  - Works well for most cases

Custom chunking (recommended for advanced users):
  Parameters:
  ├── Chunk size: maximum characters per chunk (default 500)
  ├── Chunk overlap: repeated characters between adjacent chunks (default 50)
  └── Separator: what delimiter to use for splitting (e.g., \n\n for paragraphs)

Chunk size selection guide:

Document Type	Recommended Chunk Size	Recommended Overlap	Reason
Technical docs (long-form)	800-1000 chars	100 chars	Technical concepts need sufficient context
Product manuals	500-600 chars	50 chars	Standard size, stable performance
FAQ Q&A	200-300 chars	0-30 chars	Each QA pair is inherently independent
Legal text	600-800 chars	100 chars	Legal clauses need complete citation
News articles	400-500 chars	50 chars	News paragraphs are usually self-contained

An important practical principle: Chunk size should match the expected complexity of questions. Users asking simple questions ("What is the price?") need small chunks; users asking complex questions ("What is the technical architecture of this product?") need larger chunks.
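To see how chunk size and overlap interact, here is a minimal sketch of fixed-size chunking with a sliding window. It is not Dify's splitter (which also respects separators); the function name `sliding_window_chunks` is hypothetical and the sketch only illustrates what the two parameters do.

```python
def sliding_window_chunks(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into chunks of up to chunk_size characters; each chunk
    starts `overlap` characters before the previous chunk ended."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


print(sliding_window_chunks("abcdefghij", chunk_size=4, overlap=1))

# A 6,000-character document with the 600/60 settings used later in this chapter
document = "x" * 6000
print(len(sliding_window_chunks(document, chunk_size=600, overlap=60)))
```

Larger overlap reduces the risk of splitting an answer across two chunks, at the price of more chunks to embed and store.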

Complete Walkthrough: Creating a Knowledge Base in Dify

Step 1: Enter the Knowledge Base module

Click "Knowledge" in the Dify left navigation → "Create Knowledge"

Step 2: Choose a data source

Data source options:
├── Upload files (focus of this chapter)
├── Sync via URL (web page content)
└── Connect Notion (enterprise users)

Step 3: Choose chunking and cleaning strategy

In "Data Processing Method" select:

Automatic segmentation and cleaning (simple scenarios)
Custom segmentation rules (advanced scenarios)

Custom segmentation configuration example:

Segment identifier: \n\n  (split by blank line)
Maximum segment length: 600 characters
Segment overlap: 60 characters
Text preprocessing rules:
  ☑ Remove all URLs and email addresses
  ☑ Remove all HTML tags
  ☑ Replace consecutive spaces with single space

Step 4: Choose indexing mode

High Quality:
  - Calls the Embedding model to vectorize each segment
  - Supports vector, full-text, and hybrid retrieval
  - Optional Q&A mode additionally uses an LLM to generate QA pairs per segment
  - Best results, consumes tokens

Economical:
  - Does not call an Embedding model
  - Uses inverted index only (keywords)
  - Low cost, weaker retrieval accuracy

Recommendation: Choose "High Quality" mode for the vast majority of scenarios. Token consumption is acceptable (processing 100 pages consumes roughly 20,000-50,000 tokens, costing less than $1).

Step 5: Choose Embedding model

Recommended choices:
├── text-embedding-3-small (OpenAI) — Low cost, good quality, multilingual
├── bge-m3 (local deployment) — Best multilingual performance, no API fee
└── text-embedding-3-large (OpenAI) — Best quality, higher cost

Level 2: Mechanism Deep Dive (3-5 Years Experience)

Understanding Dify's Document Processing Pipeline

After you upload a document, Dify executes this processing pipeline in the background:

# Dify document processing pipeline (pseudocode)
class DocumentProcessingPipeline:
    def process(self, file: UploadedFile, dataset: Dataset, config: IndexingConfig):
        
        # Step 1: Document parsing (convert binary file to plain text)
        parser = self.get_parser(file.extension)  # PDF parser, Word parser, etc.
        raw_text = parser.parse(file.content)
        
        # Step 2: Text cleaning
        cleaner = TextCleaner(config.cleaning_rules)
        cleaned_text = cleaner.clean(raw_text)
        
        # Step 3: Text chunking
        splitter = TextSplitter(
            chunk_size=config.segment_max_tokens,
            chunk_overlap=config.segment_overlap,
            separator=config.separator
        )
        chunks = splitter.split(cleaned_text)
        
        # Step 4: Vectorization (batch processing for efficiency)
        embedding_model = self.get_embedding_model(dataset.embedding_model_id)
        embeddings = []
        
        for batch in self.batch_chunks(chunks, batch_size=100):
            batch_texts = [chunk.text for chunk in batch]
            batch_embeddings = embedding_model.encode(batch_texts)
            embeddings.extend(batch_embeddings)
        
        # Step 5: Storage
        for chunk, embedding in zip(chunks, embeddings):
            # Store in vector database
            self.vector_db.upsert(chunk, embedding, dataset.id)
            # Store in relational database (for management and keyword retrieval)
            self.relational_db.insert(chunk, dataset.id)
        
        # Step 6: Update document status
        self.update_document_status(file.id, status=DocumentStatus.COMPLETED)

Document format parsing quality:

PDF parsing quality depends on PDF type:

Native PDF (digitally created):
  - Text extraction accuracy: > 99%
  - Tables: Mostly extractable, but formatting may be lost
  - Mathematical formulas: Usually unrecognizable

Scanned PDF (image-based):
  - Relies on OCR; accuracy: 70-95% (depends on scan quality and OCR engine)
  - Dify doesn't do OCR by default — must enable in configuration
  - Enabling OCR requires configuring an OCR service (e.g., Azure Form Recognizer)

Improving PDF parsing quality:
1. Use PyMuPDF (fitz) for text extraction
2. Use pdfplumber for table extraction
3. Use Adobe PDF Services API for high-quality OCR

Parent-Child Chunking Strategy

Dify v0.15.0 introduced parent-child chunking (called "parent-child retrieval" in the release notes), an advanced strategy that can significantly improve retrieval quality.

Core idea:

Child chunks: Small segments (200-300 chars) for precise retrieval (vector similarity is more accurate on smaller chunks)
Parent chunks: Large segments (1000-2000 chars) passed to LLM as context (contain more complete information)

Retrieval process:

Use user's question to retrieve child chunks (precise localization)
Once child chunk is found, retrieve its corresponding parent chunk
Pass parent chunk content to LLM (complete context)
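The three retrieval steps above can be sketched in a few lines. This is a toy illustration, not Dify's implementation: word overlap stands in for vector similarity, child chunks are cut by character count, and all names (`ChildChunk`, `build_children`, `retrieve_parent`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ChildChunk:
    text: str
    parent_id: int  # index of the parent chunk this child was cut from


def build_children(parents: list[str], child_size: int) -> list[ChildChunk]:
    """Cut every parent chunk into fixed-size child chunks."""
    children = []
    for parent_id, parent in enumerate(parents):
        for start in range(0, len(parent), child_size):
            children.append(ChildChunk(parent[start:start + child_size], parent_id))
    return children


def overlap_score(query: str, text: str) -> float:
    """Stand-in for vector similarity: fraction of query words found in the text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words) if query_words else 0.0


def retrieve_parent(query: str, parents: list[str], children: list[ChildChunk]) -> str:
    """Match the query against child chunks, then return the matching parent."""
    best_child = max(children, key=lambda child: overlap_score(query, child.text))
    return parents[best_child.parent_id]


parents = [
    "Passwords can be reset from the login page. The reset link expires after 24 hours.",
    "Data can be exported as CSV or JSON. Exports are limited to 10,000 rows.",
]
children = build_children(parents, child_size=40)
print(retrieve_parent("reset link expires", parents, children))
```

The query is matched against a 40-character child, but the LLM receives the whole parent, including the sentence the child chunk cut off.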

Effectiveness comparison (using technical documentation as example):

Traditional chunking (500-char chunks):
  Retrieved: 500-char segment, possibly cut in the middle of a sentence
  Problem: Incomplete context, LLM cannot give a complete answer

Parent-child chunking (200-char child + 1500-char parent):
  Child chunk retrieval precisely locates relevant paragraph
  Parent chunk provides complete section/chapter context
  LLM gives accurate answer based on complete information

Measured results:
  Traditional chunking Recall@5: 78%
  Parent-child chunking Recall@5: 87% (+9 percentage points, about 12% relative improvement)

Configuring parent-child chunking in Dify:

Knowledge Base Settings → Segmentation Method → Select "Parent-Child Segmentation":

Parent chunk settings:
  Chunk size: 1500 characters
  Separator: \n\n (by paragraph)

Child chunk settings:
  Chunk size: 200 characters
  Overlap: 20 characters

The Importance of Metadata

Adding metadata to document chunks in the knowledge base can greatly improve retrieval precision and usability.

Metadata types supported by Dify:

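# Adding metadata when uploading documents via Dify API
import json
import requests

DIFY_API_URL = "https://api.dify.ai/v1"
API_KEY = "your-dataset-api-key"   # Placeholder
dataset_id = "your-dataset-id"     # Placeholder

metadata = {
    "document_type": "product_manual",  # Document type
    "version": "2.0",                   # Version number
    "effective_date": "2024-01-01",     # Effective date
    "department": "engineering",        # Department
    "confidentiality": "public",        # Confidentiality level
    "language": "en",                   # Language
}

# The file travels in the multipart "file" part; all other settings go into a
# JSON string in the "data" part. Whether doc_metadata is accepted, and the
# exact endpoint path, depend on your Dify version: check its API reference.
payload = {
    "indexing_technique": "high_quality",
    "process_rule": {"mode": "automatic"},
    "doc_metadata": metadata,
}

# Upload document (the context manager closes the file handle afterwards)
with open("product_manual_v2.pdf", "rb") as f:
    response = requests.post(
        f"{DIFY_API_URL}/datasets/{dataset_id}/document/create-by-file",
        headers={"Authorization": f"Bearer {API_KEY}"},
        files={"file": f},
        data={"data": json.dumps(payload)},
        timeout=60,
    )
response.raise_for_status()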

Metadata-based retrieval filtering (in a workflow):

Knowledge Retrieval node configuration:
  Retrieval query: {{user_question}}
  Filter conditions:
    - document_type == "product_manual"  (only retrieve product manuals)
    - version == "2.0"                    (only retrieve latest version)
    - language == "en"                    (only retrieve English documents)

This approach enables precise knowledge base isolation — particularly suited for enterprise knowledge base management across multiple products and versions.
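Conceptually, the filter conditions above are equality checks applied to each chunk's metadata before (or alongside) similarity ranking. A minimal sketch of that logic, assuming chunks are plain dictionaries with a `metadata` key (a hypothetical structure, not Dify's internal one):

```python
def matches(metadata: dict, conditions: dict) -> bool:
    """True when every condition key is present with exactly the expected value."""
    return all(metadata.get(key) == value for key, value in conditions.items())


def filter_chunks(chunks: list[dict], conditions: dict) -> list[dict]:
    """Keep only chunks whose metadata satisfies all conditions."""
    return [chunk for chunk in chunks if matches(chunk["metadata"], conditions)]


chunks = [
    {"text": "Export as CSV or JSON.",
     "metadata": {"document_type": "product_manual", "version": "2.0", "language": "en"}},
    {"text": "Export as CSV only.",
     "metadata": {"document_type": "product_manual", "version": "1.0", "language": "en"}},
    {"text": "Release announcement.",
     "metadata": {"document_type": "blog", "version": "2.0", "language": "en"}},
]

conditions = {"document_type": "product_manual", "version": "2.0", "language": "en"}
for chunk in filter_chunks(chunks, conditions):
    print(chunk["text"])
```

Only the version 2.0 manual chunk survives, so the outdated "CSV only" statement can never reach the LLM.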

Best Practices for Document Updates

A universal challenge in enterprise knowledge bases: documents need continuous updates. How do you update the knowledge base without disrupting service?

Option 1: Direct replacement (simple scenarios)

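# Update document via Dify API
# Step 1: Delete old document
curl -X DELETE \
  -H "Authorization: Bearer $API_KEY" \
  "https://api.dify.ai/v1/datasets/$DATASET_ID/documents/$OLD_DOC_ID"

# Step 2: Upload new document
# Settings are sent as a JSON string in the "data" form field; verify the
# endpoint path and fields against the API reference of your Dify version
curl -X POST \
  -H "Authorization: Bearer $API_KEY" \
  -F 'data="{\"indexing_technique\":\"high_quality\",\"process_rule\":{\"mode\":\"automatic\"}}";type=text/plain' \
  -F "file=@new_document.pdf" \
  "https://api.dify.ai/v1/datasets/$DATASET_ID/document/create-by-file"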

Note: There's a time window between deletion and re-upload during which the document content isn't in the knowledge base. For critical documents, this may cause query failures.

Option 2: Blue-Green Knowledge Base (zero-downtime updates)

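class KnowledgeBaseUpdater:
    """
    Blue-green knowledge base update strategy:
    Maintain two knowledge bases (blue: current production, green: pending update)
    Operate on the inactive one during updates, then switch traffic to it when done

    Note: dify_client is a hypothetical wrapper around the Dify API
    (clear_dataset, upload_document, retrieve). wait_for_indexing_complete and
    switch_traffic are left for you to implement for your deployment.
    """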
    
    def __init__(self, dify_client):
        self.client = dify_client
        self.blue_dataset_id = "dataset-blue-xxx"   # Current production KB
        self.green_dataset_id = "dataset-green-xxx"  # Pending update KB
        self.active = "blue"  # Currently active KB
    
    def update_knowledge_base(self, new_documents: list[str]):
        # 1. Update documents in the inactive KB (green)
        inactive = self.green_dataset_id if self.active == "blue" else self.blue_dataset_id
        
        # Clear old content
        self.client.clear_dataset(inactive)
        
        # Upload new documents
        for doc_path in new_documents:
            self.client.upload_document(inactive, doc_path)
        
        # Wait for indexing to complete
        self.wait_for_indexing_complete(inactive)
        
        # Validate new KB (run test queries)
        if self.validate_knowledge_base(inactive):
            # Switch traffic: update application's KB ID configuration
            self.switch_traffic(inactive)
            self.active = "green" if self.active == "blue" else "blue"
            print(f"Successfully switched to knowledge base: {inactive}")
        else:
            print("Validation failed, keeping original knowledge base")
    
    def validate_knowledge_base(self, dataset_id: str) -> bool:
        """Run validation queries to ensure new KB works correctly"""
        test_queries = [
            "export data formats",
            "how to register account",
            "password reset steps",
        ]
        
        for query in test_queries:
            results = self.client.retrieve(dataset_id, query, top_k=3)
            if not results or results[0].score < 0.5:
                print(f"Validation failed: query '{query}' didn't get high-quality results")
                return False
        
        return True

Level 3: Source Code and Principles (5+ Years Experience)

Dify Document Chunking Implementation Details

Dify uses LangChain's RecursiveCharacterTextSplitter as the base chunking implementation (though progressively migrating to its own implementation, the core algorithm is the same):

# Simplified sketch of Dify's text chunking (related logic in api/core/indexing_runner.py)
class FixedRecursiveCharacterTextSplitter:
    """
    Recursive character text splitter:
    Tries separators in priority order, using the highest-priority separator first.
    If a piece is still too large, falls back to the next separator.
    """

    DEFAULT_SEPARATORS = [
        "\n\n",   # First try splitting by paragraph
        "\n",     # Then by newline
        ". ",     # Then by period + space (sentence boundary)
        ".",      # Then by period
        ";",      # Then by semicolon
        " ",      # Then by space
        "",       # Finally, force-split by character count
    ]

    def __init__(self, chunk_size: int, chunk_overlap: int, separators: list[str] | None = None):
        if chunk_overlap >= chunk_size:
            raise ValueError("chunk_overlap must be smaller than chunk_size")
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        self.separators = separators or self.DEFAULT_SEPARATORS

    def split_text(self, text: str) -> list[str]:
        chunks = self._split_text(text, self.separators)
        # Apply overlap once, after all recursion has finished
        # (applying it inside the recursion would stack overlaps on sub-chunks)
        if self.chunk_overlap > 0:
            chunks = self._add_overlap(chunks)
        return chunks

    def _split_text(self, text: str, separators: list[str]) -> list[str]:
        # Try the current priority separator
        separator = separators[0]
        if separator == "":
            # str.split("") raises ValueError, so force-split by character count
            pieces = [text[i:i + self.chunk_size] for i in range(0, len(text), self.chunk_size)]
            return [p.strip() for p in pieces if p.strip()]

        chunks = []
        current_chunk = ""

        for split in text.split(separator):
            # Greedily merge pieces while the merged chunk still fits
            candidate = current_chunk + separator + split if current_chunk else split
            if len(candidate) <= self.chunk_size:
                current_chunk = candidate
                continue

            if current_chunk:
                chunks.append(current_chunk)

            # If a single piece is still too large, use next separator recursively
            if len(split) > self.chunk_size and len(separators) > 1:
                sub_chunks = self._split_text(split, separators[1:])
                chunks.extend(sub_chunks[:-1])
                current_chunk = sub_chunks[-1] if sub_chunks else ""
            else:
                current_chunk = split

        if current_chunk:
            chunks.append(current_chunk)

        return [c.strip() for c in chunks if c.strip()]

    def _add_overlap(self, chunks: list[str]) -> list[str]:
        """Add overlap content between adjacent chunks"""
        if len(chunks) <= 1:
            return chunks

        overlapped_chunks = [chunks[0]]

        for i in range(1, len(chunks)):
            prev_chunk = chunks[i - 1]
            current_chunk = chunks[i]

            # Take the last `overlap` characters of previous chunk as prefix
            overlap_text = prev_chunk[-self.chunk_overlap:]
            overlapped_chunks.append(overlap_text + current_chunk)

        return overlapped_chunks

Understanding why recursion is needed:

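Consider this situation: a paragraph has 2,000 characters, exceeding the 500-character chunk limit. Splitting by \n\n cannot cut inside this paragraph, so the algorithm falls back to \n, then to periods, and so on, until it can produce appropriately sized chunks. As long as the separator list ends with the empty string (force-split by character count), every chunk is bounded by chunk_size before overlap is applied, while semantic boundaries are preserved wherever possible. Note that in this simplified version the overlap prefix is added afterwards, so a final chunk can be up to chunk_size + chunk_overlap characters long.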

Optimizing Embedding Batch Processing

Dify uses batch requests during document vectorization to reduce API calls and latency:

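import time
from dataclasses import dataclass, field

from openai import RateLimitError  # Assumes the openai>=1.0 SDK; use your provider's error type


@dataclass
class DocumentChunk:
    content: str
    embedding: list[float] = field(default_factory=list)


class EmbeddingBatchProcessor:
    """
    Batch processing for document vectorization, optimizing API call efficiency
    """

    def __init__(self, model, batch_size: int = 100, max_retries: int = 3):
        self.model = model  # Any object exposing embed_documents(list[str])
        self.batch_size = batch_size
        self.max_retries = max_retries

    def _embed_with_retry(self, texts: list[str]) -> list[list[float]]:
        """Retry on rate limits with exponential backoff instead of a single retry"""
        for attempt in range(self.max_retries + 1):
            try:
                return self.model.embed_documents(texts)
            except RateLimitError:
                if attempt == self.max_retries:
                    raise
                time.sleep(min(60, 5 * 2 ** attempt))  # 5s, 10s, 20s, ... capped at 60s

    def process_chunks(self, chunks: list[DocumentChunk]) -> list[DocumentChunk]:
        """
        Batch vectorize document chunks

        OpenAI text-embedding-3-small limits:
        - Max 2,048 texts per request
        - Max 8,191 tokens per text
        - Rate limit: 1,000,000 TPM (varies by account tier)
        """
        processed_chunks = []

        for i in range(0, len(chunks), self.batch_size):
            batch = chunks[i:i + self.batch_size]
            batch_texts = [chunk.content for chunk in batch]

            embeddings = self._embed_with_retry(batch_texts)

            for chunk, embedding in zip(batch, embeddings):
                chunk.embedding = embedding
                processed_chunks.append(chunk)

            progress = (i + len(batch)) / len(chunks) * 100
            print(f"Vectorization progress: {progress:.1f}% ({i + len(batch)}/{len(chunks)})")

        return processed_chunks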

Speed reference (text-embedding-3-small, 1,000 document chunks, 100 tokens each):

Small batch (batch_size=10):
  API calls: 100
  Total time: ~30 seconds (heavily affected by network latency)

Large batch (batch_size=100):
  API calls: 10
  Total time: ~8 seconds (network latency amortized)

Maximum batch (batch_size=2048):
  API calls: 1 (theoretically)
  Total time: ~3-5 seconds (limited by API processing time)
  
  Note: Larger batches aren't always better — very large batches may hit token limits
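The API call counts above follow directly from the batch size, and the token-limit caveat can be handled by capping each batch on both item count and total tokens. A minimal sketch follows; the function name `make_batches` and the 50,000-token cap in the last example are assumptions for illustration, not documented limits.

```python
def make_batches(token_counts: list[int], max_items: int, max_tokens: int) -> list[list[int]]:
    """Group chunk indices into batches that respect both limits.

    token_counts[i] is the token count of chunk i.
    """
    batches = []
    current = []
    current_tokens = 0

    for index, tokens in enumerate(token_counts):
        if tokens > max_tokens:
            raise ValueError(f"chunk {index} alone exceeds max_tokens")
        # Start a new batch when adding this chunk would break either limit
        if current and (len(current) >= max_items or current_tokens + tokens > max_tokens):
            batches.append(current)
            current = []
            current_tokens = 0
        current.append(index)
        current_tokens += tokens

    if current:
        batches.append(current)
    return batches


# 1,000 chunks of 100 tokens each, as in the speed reference above
token_counts = [100] * 1000

print(len(make_batches(token_counts, max_items=10, max_tokens=10**9)))    # item limit only
print(len(make_batches(token_counts, max_items=100, max_tokens=10**9)))   # item limit only
print(len(make_batches(token_counts, max_items=2048, max_tokens=50_000))) # token cap binds first
```

With a token cap in place, the "maximum batch" case no longer fits in a single request, which is why the item limit alone is not a safe upper bound.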

Weaviate Data Model and Query Optimization

Understanding Weaviate's data model helps with performance tuning when needed:

# Full configuration of Weaviate collection (used when Dify creates knowledge base)
collection_config = {
    "class": "DatasetXxx",
    "description": "Dify dataset collection",
    
    # Vector index configuration (HNSW)
    "vectorIndexConfig": {
        "distance": "cosine",          # Use cosine distance
        "ef": 64,                       # Fixed search-time candidate list size
        "efConstruction": 128,          # Candidate list size during index building
        "maxConnections": 64,           # Max connections per node
        "vectorCacheMaxObjects": 1000000,  # Vector cache object count
        # The dynamicEf* values only take effect when ef is set to -1
        "dynamicEfFactor": 8,
        "dynamicEfMin": 25,
        "dynamicEfMax": 500,
    },
    
    # Data property definitions (indexFilterable / indexSearchable replace the
    # deprecated indexInverted flag in Weaviate 1.19+)
    "properties": [
        {
            "name": "text",
            "dataType": ["text"],
            "indexSearchable": True,   # Enable BM25 keyword index (for hybrid search)
            "indexFilterable": True,
        },
        {
            "name": "doc_id",
            "dataType": ["text"],
            "tokenization": "field",   # Treat the whole ID as a single token
            "indexFilterable": True,   # Required for where-filters on doc_id
            "indexSearchable": False,  # No BM25 keyword index needed
        },
        # ... other metadata fields
    ],
    
    # Inverted index configuration (for BM25 full-text retrieval)
    "invertedIndexConfig": {
        "bm25": {
            "b": 0.75,    # BM25 b parameter (document length normalization factor)
            "k1": 1.2,    # BM25 k1 parameter (term frequency saturation)
        },
        "stopwords": {
            "preset": "en",  # English stopwords
        },
        "cleanupIntervalSeconds": 60,
    }
}
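To make the `k1` and `b` parameters in the inverted index configuration concrete, here is a sketch of the term-frequency part of the standard BM25 formula. It omits the IDF factor and is the textbook formula rather than Weaviate's internal code; the function name is hypothetical.

```python
def bm25_tf_weight(tf: float, doc_len: float, avg_doc_len: float,
                   k1: float = 1.2, b: float = 0.75) -> float:
    """Term-frequency component of BM25 (IDF factor omitted).

    k1 controls how quickly repeated occurrences stop adding score.
    b controls how strongly long documents are penalized (0 = not at all).
    """
    length_norm = 1 - b + b * (doc_len / avg_doc_len)
    return tf * (k1 + 1) / (tf + k1 * length_norm)


# Saturation: 100 occurrences score only about twice as much as 1 occurrence
print(bm25_tf_weight(tf=1, doc_len=100, avg_doc_len=100))
print(bm25_tf_weight(tf=100, doc_len=100, avg_doc_len=100))

# Length normalization: the same single hit counts less in a chunk twice as long
print(bm25_tf_weight(tf=1, doc_len=200, avg_doc_len=100))
print(bm25_tf_weight(tf=1, doc_len=200, avg_doc_len=100, b=0.0))
```

Because chunking already makes chunk lengths fairly uniform, `b` usually matters less in a knowledge base than it does for whole documents of very different lengths.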

Level 4: Production Pitfalls and Decision Making (Expert Perspective)

Pitfall 1: Knowledge Base "Overload" Problem

A very common mistake: putting all content into a single knowledge base, expecting AI to find answers to any question from it.

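Problem: Once a knowledge base grows past roughly 10,000 document chunks, retrieval precision tends to decline (the exact threshold depends on how homogeneous the content is). Reasons:

More chunks in vector space means more near-duplicate and loosely related neighbors ("noise") competing for the top-k slots
The same word may have different meanings in different contexts, interfering with retrieval
In BM25 retrieval, a term that is distinctive within one product's documents becomes common across the merged corpus, so its IDF value drops and keyword scores discriminate less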
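The IDF effect can be checked with a quick calculation. The sketch below uses the common BM25 IDF formula with the +1 inside the logarithm; the chunk counts are invented purely for illustration.

```python
import math


def bm25_idf(total_chunks: int, chunks_with_term: int) -> float:
    """BM25 inverse document frequency: lower when the term is more widespread."""
    return math.log(1 + (total_chunks - chunks_with_term + 0.5) / (chunks_with_term + 0.5))


# Hypothetical numbers: the term "export" in a dedicated Product A knowledge base
# versus the same term in one merged knowledge base for all products
idf_dedicated = bm25_idf(total_chunks=1_000, chunks_with_term=50)
idf_merged = bm25_idf(total_chunks=10_000, chunks_with_term=3_000)

print(f"dedicated KB: {idf_dedicated:.2f}")
print(f"merged KB:    {idf_merged:.2f}")
```

In the merged case the term appears in 30% of all chunks instead of 5%, so a keyword match on it contributes far less to the ranking.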

Solution: Layered knowledge base architecture

Recommended architecture:

Layer 1: General knowledge base (company-wide shared)
  - Company overview, product overview, policies and regulations
  - Foundational content that updates infrequently
  - Size: < 1,000 document chunks

Layer 2: Product knowledge bases (split by product line)
  - Product A manual knowledge base
  - Product B manual knowledge base
  - Each independent, reducing cross-interference

Layer 3: Department-specific knowledge bases
  - Technical documentation library (for engineers)
  - Sales knowledge base (for sales team)
  - Customer service scripts (for support team)

Multi-knowledge-base routing in Dify (using Workflow):

Workflow nodes:
[Start] → Receive question
  ↓
[LLM: Question Classification] → Determine question type
  {product_a, product_b, general, technical}
  ↓
[IF/ELSE Branch]
  ├── product_a → [Knowledge Retrieval: Product A KB]
  ├── product_b → [Knowledge Retrieval: Product B KB]
  ├── technical → [Knowledge Retrieval: Technical Docs KB]
  └── general → [Knowledge Retrieval: General KB]
  ↓
[LLM: Generate Answer]
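The routing logic of this workflow boils down to "classify, then look up a knowledge base". The sketch below shows that shape outside Dify; a keyword matcher stands in for the LLM classification node, and the dataset IDs and keyword lists are hypothetical.

```python
import re

# Hypothetical dataset IDs, one per branch of the workflow above
DATASET_BY_CATEGORY = {
    "product_a": "dataset-product-a",
    "product_b": "dataset-product-b",
    "technical": "dataset-technical-docs",
    "general": "dataset-general",
}

# Stand-in for the LLM classifier: first rule with a matching phrase wins
KEYWORD_RULES = [
    ("product_a", ("product a",)),
    ("product_b", ("product b",)),
    ("technical", ("api", "sdk", "deploy", "error")),
]


def classify_question(question: str) -> str:
    """Return the question category, falling back to 'general'."""
    # Normalize to lowercase words so phrases match on whole-word boundaries
    normalized = " " + " ".join(re.findall(r"[a-z0-9]+", question.lower())) + " "
    for category, phrases in KEYWORD_RULES:
        if any(f" {phrase} " in normalized for phrase in phrases):
            return category
    return "general"


def route(question: str) -> str:
    """Return the ID of the knowledge base that should serve this question."""
    return DATASET_BY_CATEGORY[classify_question(question)]


for q in ("How do I export data in Product A?",
          "The SDK returns an error on deploy",
          "What is the vacation policy?"):
    print(f"{q!r} -> {route(q)}")
```

Keeping a "general" fallback matters: a misclassified question should still reach some knowledge base rather than return nothing.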

Pitfall 2: Silent Failure of Document Parsing

When Dify processes documents, if a document fails to parse, it silently marks it as "error" status without stopping other documents from processing, and without actively notifying you.

How to discover parsing failures:

# Check document status in knowledge base via API
curl -H "Authorization: Bearer $API_KEY" \
  "https://api.dify.ai/v1/datasets/$DATASET_ID/documents?page=1&limit=50" \
  | python3 -c "
import sys, json
data = json.load(sys.stdin)
for doc in data.get('data', []):
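    status = doc.get('indexing_status')
    if status != 'completed':
        # Only 'error' is a real failure; other states are still in progress
        label = 'FAILED' if status == 'error' else 'PENDING'
        print(f\"{label}: {doc.get('name')} - Status: {status}\")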
"

Common parsing failure causes and solutions:

Error Cause	Symptom	Solution
Scanned PDF without OCR enabled	indexing_status: error	Enable OCR config or manually extract text
File exceeds size limit	Upload fails immediately	Split large files or compress
Special character encoding issues	Garbled text	Convert to UTF-8 encoding
Password-protected PDF	Cannot parse	Remove PDF password protection
Corrupted Office file	indexing_status: error	Repair or re-export file

Build indexing monitoring:

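import requests
import time

DOCUMENTS_URL = "https://api.dify.ai/v1/datasets/{dataset_id}/documents"
IN_PROGRESS = {"waiting", "parsing", "cleaning", "splitting", "indexing"}

def fetch_all_documents(dataset_id: str, headers: dict) -> list:
    """Fetch every page of the document list, not just the first 100 entries"""
    documents, page = [], 1
    while True:
        response = requests.get(
            DOCUMENTS_URL.format(dataset_id=dataset_id),
            headers=headers,
            params={"page": page, "limit": 100},
            timeout=30,
        )
        response.raise_for_status()
        body = response.json()
        documents.extend(body.get("data", []))
        if not body.get("has_more"):
            return documents
        page += 1

def monitor_indexing_status(dataset_id: str, api_key: str, expected_count: int,
                            max_wait_seconds: int = 3600) -> bool:
    """
    Monitor knowledge base document indexing status until all documents are
    processed or the timeout expires. Returns True only if nothing failed.
    """
    headers = {"Authorization": f"Bearer {api_key}"}
    deadline = time.monotonic() + max_wait_seconds
    
    while time.monotonic() < deadline:
        documents = fetch_all_documents(dataset_id, headers)
        
        completed = [d for d in documents if d.get("indexing_status") == "completed"]
        failed = [d for d in documents if d.get("indexing_status") == "error"]
        processing = [d for d in documents if d.get("indexing_status") in IN_PROGRESS]
        
        print(f"Progress: {len(completed)}/{expected_count} complete, "
              f"{len(failed)} failed, {len(processing)} processing")
        
        for doc in failed:
            print(f"  [ERROR] {doc['name']}: {doc.get('error', 'Unknown error')}")
        
        if len(completed) + len(failed) >= expected_count:
            print("All documents processed!")
            return not failed
        
        time.sleep(30)  # Check every 30 seconds
    
    print("Timed out waiting for indexing to finish")
    return False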

Pitfall 3: Knowledge Base "Memory Drift"

After running in production for a while, the knowledge base may exhibit "memory drift":

The same question gets different answers today compared to three months ago
Even though documents were updated, AI still provides old version information

Root cause investigation checklist:

□ Verify document update succeeded
  → Check document updated_at timestamp via API
  
□ Verify old documents were deleted
  → If new documents were uploaded without deleting old ones, old content still exists

□ Check for orphaned data in vector database
  → Sometimes Dify metadata deletes succeed but vector data isn't synchronized

□ Verify Embedding model wasn't changed
  → After changing Embedding model, old vectors are invalid — must rebuild index

□ Check knowledge base cache
  → Dify has short-term retrieval result cache (typically a few minutes)
  → Recently updated content may still be cached

Build a scheduled knowledge base health check:

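import schedule  # Third-party package: pip install schedule
import time

# DATASET_ID, EXPECTED_MIN_DOCS, get_document_count(), retrieve(), and alert()
# are placeholders: wire them to your Dify API client and alerting channel.

def knowledge_base_health_check():
    """Health check for the knowledge base (scheduled below to run daily at 2 AM)"""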
    
    # 1. Check document count meets expectations
    doc_count = get_document_count(DATASET_ID)
    if doc_count < EXPECTED_MIN_DOCS:
        alert(f"Knowledge base document count below expected: {doc_count} < {EXPECTED_MIN_DOCS}")
    
    # 2. Run standard test queries
    test_cases = [
        ("export formats", "CSV"),      # query, expected keyword in results
        ("register account", "email"),
        ("password reset", "24 hours"),
    ]
    
    for query, expected_keyword in test_cases:
        results = retrieve(DATASET_ID, query, top_k=3)
        if not any(expected_keyword.lower() in r.text.lower() for r in results):
            alert(f"Retrieval quality degraded: query '{query}' "
                  f"didn't find chunks containing '{expected_keyword}'")
    
    # 3. Check average retrieval score
    scores = [r.score for r in retrieve(DATASET_ID, "test query", top_k=5)]
    avg_score = sum(scores) / len(scores) if scores else 0
    if avg_score < 0.5:
        alert(f"Average retrieval score too low: {avg_score:.3f}")

# Set up scheduled task
schedule.every().day.at("02:00").do(knowledge_base_health_check)

while True:
    schedule.run_pending()
    time.sleep(60)

Knowledge Base Performance Benchmarks and Budget Planning

Processing time benchmarks (different knowledge base sizes, Dify default config):

Scale: 1,000 document chunks (~100 pages)
  Vectorization time: ~1-3 minutes
  Total processing time (including parsing, storage): ~3-8 minutes
  Embedding cost (text-embedding-3-small): ~$0.002

Scale: 10,000 document chunks (~1,000 pages)
  Vectorization time: ~10-30 minutes
  Total processing time: ~20-60 minutes
  Embedding cost: ~$0.02

Scale: 100,000 document chunks (~10,000 pages)
  Vectorization time: ~1-3 hours
  Total processing time: ~2-6 hours
  Embedding cost: ~$0.2

Scale: 1,000,000 document chunks (~100,000 pages)
  May need to evaluate Weaviate capacity (default config may need scaling)
  Vectorization time: ~10-30 hours
  Embedding cost: ~$2
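The embedding cost figures above can be reproduced with simple arithmetic. The sketch below assumes about 100 tokens per chunk (as in the earlier speed reference) and a list price of $0.02 per million tokens for text-embedding-3-small; both are assumptions, so check current pricing and your real token counts before budgeting.

```python
def embedding_cost_usd(num_chunks: int,
                       tokens_per_chunk: int = 100,
                       price_per_million_tokens: float = 0.02) -> float:
    """Estimated one-time embedding cost for a knowledge base, in US dollars."""
    total_tokens = num_chunks * tokens_per_chunk
    return total_tokens / 1_000_000 * price_per_million_tokens


for chunks in (1_000, 10_000, 100_000, 1_000_000):
    print(f"{chunks:>9,} chunks: ~${embedding_cost_usd(chunks):.3f}")
```

Embedding is a small part of the budget at these prices; re-embedding after a model change costs the same amount again, which is usually negligible next to the processing time.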

Vector storage space estimation:

Storage space per document chunk:
  Vector (1536 dims × 4 bytes): 6,144 bytes ≈ 6KB
  Text (avg 500 chars): ~1KB
  Metadata: ~0.5KB
  Total: ~7.5KB per chunk

Storage requirement for 100,000 chunks:
  Vector database (Weaviate): ~750MB
  PostgreSQL (metadata): ~100MB
  Total: ~1GB (with index overhead ~1.5-2GB)
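The same estimate can be parameterized so it is easy to redo for a different Embedding dimension or chunk size. This is a sketch under the assumptions stated above (float32 vectors, about 1 KB of text and 0.5 KB of metadata per chunk); it ignores index overhead, and the function names are hypothetical.

```python
def bytes_per_chunk(dimensions: int = 1536,
                    text_bytes: int = 1024,
                    metadata_bytes: int = 512) -> int:
    """Raw storage for one chunk: float32 vector + text + metadata."""
    vector_bytes = dimensions * 4  # 4 bytes per float32 component
    return vector_bytes + text_bytes + metadata_bytes


def raw_storage_bytes(num_chunks: int, **kwargs) -> int:
    """Raw storage for the whole knowledge base, before index overhead."""
    return num_chunks * bytes_per_chunk(**kwargs)


per_chunk = bytes_per_chunk()
total = raw_storage_bytes(100_000)
print(f"per chunk: {per_chunk} bytes ({per_chunk / 1024:.1f} KB)")
print(f"100,000 chunks: {total / 1_000_000:.0f} MB before index overhead")

# A 3072-dimension model doubles the vector part of every chunk
print(f"3072 dims: {bytes_per_chunk(dimensions=3072)} bytes per chunk")
```

The vector dominates the per-chunk footprint, so the choice of Embedding dimension affects storage far more than chunk text length does.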

Chapter Summary

A high-quality knowledge base is not "upload documents and you're done" — it requires careful design at multiple steps: document preparation, chunking strategy, Embedding selection, index configuration, and continuous maintenance.

Key Takeaways:

Document quality is the ceiling: Garbage in, garbage out; time spent on document preprocessing yields immediate improvement
Chunking strategy must match the business: Small chunks for FAQ, large chunks for technical docs, parent-child chunking for complex scenarios
Metadata is the key to precise routing: Add version, type, and other metadata to documents; filter during retrieval for higher precision
Monitoring and maintenance are ongoing work: Knowledge bases aren't a one-time build — they need continuous updating and optimization as business evolves
Knowledge bases need layered design: Don't stuff all content into one knowledge base; manage by business domain, product, and department

This chapter concludes the knowledge base section of the Dify book. At this point, you have mastered the complete knowledge system for building a high-quality knowledge Q&A system from scratch: from Dify's foundational concepts (Chapters 1-2), to quick start (Chapter 3), model selection (Chapter 4), RAG principles (Chapter 5), and knowledge base construction (Chapter 6).

Upcoming chapters will go deeper into Dify's advanced features: complex workflow orchestration, Agent production deployment, and deep integration with business systems.
