Chapter 8

Multi-Knowledge-Base Queries and Enterprise Document Permission Management

A single knowledge base cannot meet the complex information architecture needs of enterprises; this chapter explains how to search across multiple knowledge bases collaboratively while ensuring data security and permission isolation.

Chapter Overview

When enterprises deploy AI knowledge bases, they almost inevitably encounter this situation: the company has dozens of departments, each with its own document system. HR policy documents, financial reports, technical API docs, and product manuals must be isolated from each other — yet sometimes cross-department joint search is needed.

Even more complex is the permissions problem: regular employees can only access the product manual, HR specialists can view salary bands, and only managers can see strategic planning documents. How do you implement such a permission system in Dify while maintaining retrieval quality?

This chapter systematically covers:

Multi-knowledge-base architecture design in Dify
How to implement cross-knowledge-base joint queries
Enterprise document permission management strategies
Operational challenges and solutions for large-scale knowledge bases

Level 1: Fundamentals (1–3 Years Experience)

1.1 Dify Knowledge Base Hierarchy

A knowledge base (Dataset) in Dify is an independent namespace containing:

Documents: uploaded source files (PDF, Word, TXT, etc.)
Segments/Chunks: the smallest retrieval units after document splitting
Vector index: vector representations of each chunk, stored in the vector database
Keyword index: inverted index used for full-text search

A single Dify instance can host multiple Datasets, and each application can bind to one or more Datasets.

Key concept: Datasets are the basic unit of permission isolation. Data across different Datasets is completely independent and cannot interfere with each other.

1.2 Binding Multiple Knowledge Bases to a Single Application

The simplest way to use multiple knowledge bases is to bind multiple Datasets to one application.

Steps:

Open the Dify application orchestration interface
Click the "Knowledge Base" panel
Search and add multiple knowledge bases
Set retrieval parameters for each (independently configurable)

How it works: When a user asks a question, Dify concurrently queries all bound knowledge bases, then merges the retrieval results and sends them together to the model.

Use cases:

Product Q&A chatbot: simultaneously query product manual + FAQ + troubleshooting knowledge base
Enterprise assistant: simultaneously query company policy + department rules + industry standards

Limitations:

All users can access all bound knowledge bases (no permission control)
Result quality degrades when too many knowledge bases are bound
Recommendation: bind no more than 5 knowledge bases per application

1.3 Conditional Retrieval via Workflow

For scenarios that require selecting different knowledge bases based on user identity, workflows provide the most flexible approach.

Basic workflow design:

User input
    ↓
Intent classification node (LLM)
  Determines which domain the question belongs to
    ↓
    ├── "HR Policy" → Query HR knowledge base
    ├── "Technical Support" → Query tech docs knowledge base
    ├── "Finance" → Query finance knowledge base
    └── Default → Query general knowledge base
    ↓
Merge results + model generates answer

Intent classification prompt:

Based on the user's question, determine which category it belongs to
(output only the category name):

- HR_POLICY: employee benefits, leave, salary, performance, etc.
- TECH_SUPPORT: product features, API, technical issues, etc.
- FINANCE: expense reimbursement, budget, financial processes, etc.
- GENERAL: other questions

User question: {{query}}
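The classifier's raw output still has to be mapped onto a knowledge base. The following is a minimal sketch of that step, assuming hypothetical Dataset IDs and assuming the model may return stray whitespace, quotes, different casing, or a label that is not in the list:

```python
# Hypothetical mapping from classifier labels to Dataset IDs (placeholders).
INTENT_TO_DATASET = {
    "HR_POLICY": "dataset_hr",
    "TECH_SUPPORT": "dataset_tech_docs",
    "FINANCE": "dataset_finance",
    "GENERAL": "dataset_general",
}


def route_by_intent(llm_output: str) -> str:
    """Map raw classifier output to a Dataset ID, falling back to GENERAL."""
    # Tolerate whitespace, surrounding quotes, a trailing period, and casing.
    label = llm_output.strip().strip('"\'.').upper()
    return INTENT_TO_DATASET.get(label, INTENT_TO_DATASET["GENERAL"])
```

Falling back to the general knowledge base on an unrecognized label keeps a classification mistake from exposing a restricted domain.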

1.4 Understanding Dify's Permission Model

Dify's native permission model is Workspace-based:

Owner: highest privilege, can manage all resources
Admin: can manage knowledge bases, applications, and members
Editor: can create and edit applications, but cannot manage members
Member: can only use published applications (chatbot UI)

This is a coarse-grained workspace-level control. For the document-level permissions enterprises need ("only HR can see salary documents"), additional implementation at the application layer is required.

Level 2: Mechanisms in Depth (3–5 Years Experience)

2.1 Performance Characteristics of Parallel Multi-Knowledge-Base Retrieval

When an application binds multiple knowledge bases, Dify uses a concurrent query strategy:

# Pseudocode: Dify multi-dataset retrieval logic (illustrative, not Dify source)
import asyncio

async def multi_dataset_retrieve(query: str, dataset_ids: list, top_k: int = 10) -> list:
    # single_dataset_retrieve is a placeholder for one knowledge base query
    tasks = [
        asyncio.create_task(single_dataset_retrieve(query, dataset_id))
        for dataset_id in dataset_ids
    ]

    # Wait for all knowledge base retrievals to complete
    results = await asyncio.gather(*tasks)

    # Flatten and merge
    merged = []
    for result_list in results:
        merged.extend(result_list)

    # Sort by score, highest first
    merged.sort(key=lambda x: x.score, reverse=True)

    # Keep only the overall top_k candidates
    return merged[:top_k]

Performance implications:

Retrieval latency ≈ max(individual knowledge base latencies) plus a small merge overhead, not the sum
Concurrent retrieval of 5 knowledge bases approaches the latency of a single one
But Top-K merging produces N × Top-K total candidates (requiring extra filtering)

Actual latency measurements (each knowledge base P50 = 50ms):

1 knowledge base: ~55ms
3 knowledge bases: ~65ms
10 knowledge bases: ~90ms (main overhead shifts to result merging and Rerank)
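The max-versus-sum behavior can be reproduced with a small simulation. This sketch replaces real retrieval with asyncio.sleep, so the latencies are made-up values and fake_retrieve is a stand-in, not a Dify function:

```python
import asyncio
import time


async def fake_retrieve(latency_s: float) -> float:
    """Stand-in for one knowledge base query; it only sleeps for latency_s."""
    await asyncio.sleep(latency_s)
    return latency_s


async def timed_gather(latencies: list[float]) -> float:
    """Run all simulated queries concurrently and return wall-clock seconds."""
    start = time.perf_counter()
    await asyncio.gather(*(fake_retrieve(lat) for lat in latencies))
    return time.perf_counter() - start


latencies = [0.10, 0.06, 0.08]
elapsed = asyncio.run(timed_gather(latencies))
print(f"sum={sum(latencies):.2f}s max={max(latencies):.2f}s elapsed={elapsed:.2f}s")
```

The elapsed time should land close to the slowest simulated query rather than the total of all three.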

2.2 Designing Enterprise-Grade Permission Architecture

Option 1: Dataset Isolation (Recommended — highest security)

User role → Authorized Dataset IDs

HR Specialist:
  - dataset_company_policy
  - dataset_hr_handbook
  - dataset_salary_bands (HR-exclusive)

Engineering Manager:
  - dataset_company_policy
  - dataset_tech_docs
  - dataset_engineering_decisions (manager-exclusive)

All Employees:
  - dataset_company_policy
  - dataset_product_handbook

Implementation:

Create a dedicated Dify application per permission level
Each application binds only the knowledge bases allowed for that permission level
Integrate with SSO/IdP to route users to the appropriate application
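The routing step can be sketched as a lookup from IdP groups to applications. The group names and application identifiers below are placeholders, and the order of the list is an assumption about which role should win when a user holds several:

```python
# Hypothetical IdP group -> Dify application mapping, checked in order.
# Application names are placeholders for real app IDs or URLs.
ROLE_APP_PRIORITY: list[tuple[str, str]] = [
    ("engineering_manager", "app_engineering_manager"),
    ("hr_specialist", "app_hr"),
]
DEFAULT_APP = "app_all_employees"


def resolve_app(idp_groups: set[str]) -> str:
    """Return the application for the first matching role, else the default."""
    for role, app in ROLE_APP_PRIORITY:
        if role in idp_groups:
            return app
    # Unknown or missing groups fall back to the least-privileged application.
    return DEFAULT_APP
```

Defaulting to the all-employees application means an unmapped group can never widen access.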

Option 2: API-layer Permission Proxy

from datetime import datetime, timezone
from typing import Optional

# DifyClient, PermissionDB, and AuditDB are hypothetical wrappers, not Dify SDK classes
class DifyPermissionProxy:
    def __init__(self, dify_base_url: str, dify_api_key: str):
        self.dify = DifyClient(dify_base_url, dify_api_key)
        self.permission_db = PermissionDB()
        self.audit_db = AuditDB()

    def query(self, user_id: str, query: str, conversation_id: Optional[str] = None):
        # 1. Get Dataset IDs the user is authorized to access
        allowed_datasets = self.permission_db.get_allowed_datasets(user_id)

        # 2. Dynamically build query request with only authorized knowledge bases
        #    (extra_context is a parameter of the hypothetical wrapper; the
        #    workflow behind it must actually restrict retrieval to this list)
        response = self.dify.chat(
            query=query,
            conversation_id=conversation_id,
            extra_context={"allowed_dataset_ids": allowed_datasets}
        )
        return response

    def audit_log(self, user_id: str, query: str, retrieved_docs: list):
        # Record which documents the user accessed (compliance requirement)
        self.audit_db.log(
            user_id=user_id,
            query=query,
            doc_ids=[doc.id for doc in retrieved_docs],
            timestamp=datetime.now(timezone.utc)
        )

2.3 Result Fusion Strategy for Cross-Knowledge-Base Queries

The biggest challenge with multiple knowledge bases is result heterogeneity: different knowledge bases may use different embedding models or chunking strategies, leading to inconsistent score distributions.

Example problem:

Knowledge Base A (technical docs): uses text-embedding-3-large, cosine similarity generally high (0.75–0.90)
Knowledge Base B (legal documents): uses bge-m3, cosine similarity generally lower (0.45–0.65)

If merged by raw score, Knowledge Base A results will systematically rank higher regardless of actual relevance.
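A small worked example makes the bias concrete. The scores are illustrative values picked from inside the two ranges quoted above, not measurements:

```python
# Illustrative scores only, chosen inside the ranges quoted above.
kb_a = [("A1", 0.88), ("A2", 0.80), ("A3", 0.76)]  # technical docs
kb_b = [("B1", 0.64), ("B2", 0.55), ("B3", 0.47)]  # legal documents

# Naive merge: sort everything by raw score, highest first.
merged = sorted(kb_a + kb_b, key=lambda item: item[1], reverse=True)

# First letter of each document ID identifies its knowledge base.
top3_sources = [doc_id[0] for doc_id, _ in merged[:3]]
print(top3_sources)  # ['A', 'A', 'A']
```

Every top-3 slot goes to Knowledge Base A, even though B1 is the best match Knowledge Base B has to offer.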

Solution: Normalized score fusion

def normalize_and_merge(results_by_dataset: dict) -> list:
    """
    Apply min-max normalization to each knowledge base's scores before merging
    """
    all_results = []

    for dataset_id, results in results_by_dataset.items():
        if not results:
            continue

        scores = [r.score for r in results]
        min_score = min(scores)
        max_score = max(scores)
        score_range = max_score - min_score

        for result in results:
            if score_range > 0:
                normalized = (result.score - min_score) / score_range
            else:
                normalized = 1.0

            all_results.append({
                "content": result.content,
                "source_dataset": dataset_id,
                "original_score": result.score,
                "normalized_score": normalized,
                "metadata": result.metadata
            })

    all_results.sort(key=lambda x: x["normalized_score"], reverse=True)
    return all_results

Even better: unified Rerank

Regardless of each knowledge base's raw scores, apply Rerank to the merged pool:

Knowledge Base A results (Top-20)  ─┐
Knowledge Base B results (Top-20)  ─┼→ Merge 60 candidates → Rerank → Top-10 final
Knowledge Base C results (Top-20)  ─┘

A rerank model scores each query-document pair on its content, independent of the original retrieval scores, which removes the score distribution bias between knowledge bases.
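The merge-then-rerank flow can be sketched with a pluggable scoring function. The token_overlap scorer below is only a toy stand-in for a real rerank model, and all names here are illustrative:

```python
from typing import Callable


def merge_and_rerank(
    query: str,
    results_by_dataset: dict[str, list[str]],
    score_fn: Callable[[str, str], float],
    top_n: int = 10,
) -> list[tuple[str, str, float]]:
    """Pool candidates from all datasets, rescore each pair, keep top_n."""
    pool = [
        (dataset_id, text)
        for dataset_id, texts in results_by_dataset.items()
        for text in texts
    ]
    # Original retrieval scores are ignored; only score_fn decides the order.
    scored = [(ds, text, score_fn(query, text)) for ds, text in pool]
    scored.sort(key=lambda item: item[2], reverse=True)
    return scored[:top_n]


def token_overlap(query: str, text: str) -> float:
    """Toy stand-in for a rerank model: share of query tokens found in text."""
    q_tokens = set(query.lower().split())
    return len(q_tokens & set(text.lower().split())) / max(len(q_tokens), 1)


ranked = merge_and_rerank(
    "annual leave policy",
    {"kb_a": ["API rate limits"], "kb_b": ["Annual leave policy for staff"]},
    token_overlap,
)
```

Because every candidate is scored by the same function, it no longer matters which embedding model produced it.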

2.4 Document Metadata Design for Permissions and Filtering

Setting metadata thoughtfully when uploading documents is the key to fine-grained permissions and filtering:

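import json
import os

import requests

# DIFY_BASE_URL and API_KEY are configuration constants defined elsewhere

def upload_document_with_metadata(
    dataset_id: str,
    file_path: str,
    metadata: dict
) -> dict:
    """
    Metadata design example:
    {
        "department": "hr",
        "classification": "confidential",  # public/internal/confidential/secret
        "allowed_roles": ["hr_specialist", "hr_manager", "ceo"],
        "valid_until": "2025-12-31",
        "owner": "user_123",
        "version": "v2.1",
        "language": "en-US"
    }
    """
    payload = {
        "name": os.path.basename(file_path),
        "indexing_technique": "high_quality",
        "process_rule": {"mode": "automatic"},
        # Illustrative field: check how your Dify version accepts document metadata
        "custom_metadata": metadata
    }
    # Open the file in a context manager so the handle is always closed
    with open(file_path, "rb") as f:
        response = requests.post(
            f"{DIFY_BASE_URL}/datasets/{dataset_id}/documents/create_by_file",
            headers={"Authorization": f"Bearer {API_KEY}"},
            files={"file": f},
            data={"data": json.dumps(payload)}
        )
    # Fail loudly on HTTP errors instead of returning an error body as data
    response.raise_for_status()
    return response.json()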

Using metadata filtering at retrieval time:

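def search_with_permission(dataset_id: str, query: str, user_role: str) -> list:
    """Permission-based retrieval using metadata filters"""
    # vector_db and embed are placeholders for your vector store client and
    # embedding function; the filter syntax depends on the store you use
    results = vector_db.search(
        collection=dataset_id,  # restrict the search to this knowledge base
        query_vector=embed(query),
        filter={"allowed_roles": {"contains": user_role}},
        limit=20
    )
    return results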
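Where the vector store cannot filter on metadata, or as a second line of defense, the same rule can be applied to results after retrieval. This sketch assumes each retrieved chunk is a dict carrying its document's metadata, using the allowed_roles and valid_until fields from the design example:

```python
from datetime import date


def is_visible(metadata: dict, user_role: str, today: date) -> bool:
    """True if the role is listed in allowed_roles and the document is not expired."""
    if user_role not in metadata.get("allowed_roles", []):
        return False  # fail closed: missing allowed_roles hides the chunk
    valid_until = metadata.get("valid_until")
    if valid_until and date.fromisoformat(valid_until) < today:
        return False
    return True


def filter_chunks(chunks: list[dict], user_role: str, today: date) -> list[dict]:
    """Post-filter retrieved chunks by the metadata attached to each one."""
    return [c for c in chunks if is_visible(c["metadata"], user_role, today)]
```

Filtering after retrieval can leave fewer results than requested, so filtering inside the vector store remains preferable when it is available.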

2.5 Knowledge Base Version Management and Update Strategy

In enterprise environments, frequent document updates are the norm.

Strategy 1: Full replacement (for small knowledge bases)

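# Delete old document, upload new one
# DIFY_API_KEY: environment variable holding the knowledge base API key
curl -X DELETE "http://dify/api/datasets/{id}/documents/{doc_id}" \
  -H "Authorization: Bearer $DIFY_API_KEY"
curl -X POST "http://dify/api/datasets/{id}/documents/create_by_file" \
  -H "Authorization: Bearer $DIFY_API_KEY" \
  -F 'data={"indexing_technique": "high_quality", "process_rule": {"mode": "automatic"}}' \
  -F "file=@new_document.pdf"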

Strategy 2: Incremental update (for large knowledge bases)

Maintain hash values for each document
Scheduled scanning — only update changed documents
Manage source files with Git; use diffs to determine the scope of changes
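The hash-based scan can be sketched as a comparison between current file contents and the hashes stored at the last sync. The function names and the in-memory dictionaries are illustrative; a real job would read files from disk and keep hashes in a database:

```python
import hashlib


def content_hash(data: bytes) -> str:
    """SHA-256 hex digest of a document's raw bytes."""
    return hashlib.sha256(data).hexdigest()


def plan_sync(
    current: dict[str, bytes], stored_hashes: dict[str, str]
) -> dict[str, list[str]]:
    """Compare current files with stored hashes; list what to add, update, delete."""
    plan: dict[str, list[str]] = {"add": [], "update": [], "delete": []}
    for name, data in current.items():
        if name not in stored_hashes:
            plan["add"].append(name)
        elif stored_hashes[name] != content_hash(data):
            plan["update"].append(name)
    # Anything indexed earlier but no longer present should be removed.
    plan["delete"] = [name for name in stored_hashes if name not in current]
    return plan
```

Only the documents listed under add and update need re-indexing, which is what keeps the scheduled scan cheap.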

Strategy 3: Blue-green knowledge bases (for mission-critical)

Production KB (Blue) ← All current traffic
    ↓
Prepare new version KB (Green): re-index all documents
    ↓
Run evaluation benchmark: Green quality >= Blue quality?
    ↓ Yes
Switch: bind application from Blue to Green
    ↓
Retain Blue for a period as rollback backup
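The switch and rollback steps can be sketched as a small pointer object. The KBPointer class and the benchmark scores are assumptions for illustration; how the application is actually rebound depends on your deployment:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class KBPointer:
    """Tracks which knowledge base serves traffic and which is kept for rollback."""
    active: str
    standby: Optional[str] = None


def promote_if_better(
    pointer: KBPointer, green_id: str, blue_score: float, green_score: float
) -> bool:
    """Switch to Green only when its benchmark score is at least Blue's."""
    if green_score < blue_score:
        return False  # keep serving Blue
    pointer.standby, pointer.active = pointer.active, green_id
    return True


def rollback(pointer: KBPointer) -> None:
    """Swap back to the retained knowledge base, if there is one."""
    if pointer.standby is not None:
        pointer.active, pointer.standby = pointer.standby, pointer.active
```

Keeping the previous knowledge base as standby is what makes the rollback a single swap.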

Level 3: Source Code and Principles (5+ Years Experience)

3.1 Dify Multi-Knowledge-Base Routing Source Code

The multi-knowledge-base routing logic in Dify resides in the workflow engine's knowledge retrieval node (KnowledgeRetrievalNode):

# api/core/workflow/nodes/knowledge_retrieval/knowledge_retrieval_node.py

class KnowledgeRetrievalNode(BaseNode):
    def _run(self, variable_pool: VariablePool) -> NodeRunResult:
        # Get all bound knowledge base IDs
        dataset_ids = self.node_data.dataset_ids

        # Retrieve query text from variable pool
        query = variable_pool.get_any(self.node_data.query_variable_selector)

        # Concurrent retrieval
        results = DatasetRetrieval().retrieve(
            model_config=self.model_config,
            config=self.node_data.retrieval_model,
            query=query,
            dataset_ids=dataset_ids,
            invoke_from=InvokeFrom.WORKFLOW,
            hit_callback=self._hit_callback
        )

        return NodeRunResult(
            status=WorkflowNodeExecutionStatus.SUCCEEDED,
            outputs={"result": [doc.to_dict() for doc in results]}
        )

Key design: DatasetRetrieval.retrieve() accepts a list of dataset_ids, queries them concurrently internally, then merges results. This design makes multi-knowledge-base queries completely transparent to upper layers.

3.2 Vector Database Multi-Tenancy Implementation Details

Using Qdrant as an example, here is how Dify achieves Dataset data isolation:

# api/core/rag/datasource/vdb/qdrant/qdrant_vector.py

class QdrantVector(BaseVector):
    # Dify creates a separate Qdrant Collection for each Dataset
    # Collection name = "dataset_" + dataset_id

    def __init__(self, dataset: Dataset, config: QdrantConfig):
        self._collection_name = Dataset.gen_collection_name_by_id(dataset.id)
        # → "dataset_550e8400-e29b-41d4-a716-446655440000"

    def search_by_vector(self, query_vector: list[float], **kwargs) -> list[Document]:
        return self._client.search(
            collection_name=self._collection_name,  # Each Dataset = isolated collection
            query_vector=query_vector,
            limit=kwargs.get('top_k', 4),
            with_payload=True,
            score_threshold=kwargs.get('score_threshold', 0)
        )

Isolation mechanism: Each Dataset maps to its own Collection in Qdrant, and every search names exactly one Collection, so a query against one Dataset cannot return vectors from another. The isolation is per Collection rather than per server: all Collections still share the same Qdrant instance and credentials.

Multi-tenant scaling: For very large deployments (100,000+ Datasets), one Collection per Dataset causes Collection count explosion. An alternative is a shared Collection with payload-based filtering:

# Logical isolation via payload filtering in a shared Collection
search_result = client.search(
    collection_name="shared_collection",
    query_vector=query_embedding,
    query_filter=Filter(
        must=[FieldCondition(key="dataset_id", match=MatchValue(value=dataset_id))]
    )
)

This approach requires modifying Dify source code — weigh the maintenance cost carefully.

3.3 Dify's BM25 Full-Text Search Implementation

Dify's full-text search is based on database-level full-text indexing rather than a dedicated search engine like Elasticsearch:

For pgvector/PostgreSQL:

-- Simplified Dify keyword index creation
CREATE INDEX ON dataset_keyword_tables
USING gin(to_tsvector('english', content));

-- Query ($1 = dataset ID, $2 = search text)
-- Note: ts_rank is PostgreSQL's built-in ranking function, not a BM25 implementation
SELECT *,
       ts_rank(to_tsvector('english', content), plainto_tsquery('english', $2)) AS rank
FROM dataset_keyword_tables
WHERE dataset_id = $1
  AND to_tsvector('english', content) @@ plainto_tsquery('english', $2)
ORDER BY rank DESC
LIMIT 20;

Important note: PostgreSQL full-text search for Chinese requires pg_jieba or zhparser extensions. Without these, Chinese keyword retrieval quality will be very poor.

Check and install pg_jieba:

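# Check installed extensions in the Dify Docker environment
docker exec dify-postgres psql -U postgres -c "\dx"

# Install pg_jieba if no Chinese tokenization extension is found.
# The package name below depends on your PostgreSQL version and distribution;
# if no package is available, build the extension from source.
apt-get install postgresql-14-jieba

# Then enable it in the database:
docker exec dify-postgres psql -U postgres -c "CREATE EXTENSION pg_jieba;"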

3.4 Enterprise Audit and Compliance

In regulated industries (finance, healthcare, legal), AI knowledge bases must meet audit requirements:

from dataclasses import asdict, dataclass
from datetime import datetime

@dataclass
class RAGAuditLog:
    """Audit record for each RAG query"""
    log_id: str
    timestamp: datetime
    user_id: str
    user_role: str
    application_id: str
    query: str
    # [{doc_id, chunk_id, score, dataset_id, classification}]
    retrieved_documents: list[dict]
    model_used: str
    response_summary: str  # Do not store full response (may contain sensitive info)
    session_id: str
    ip_address: str

class AuditLogger:
    def __init__(self, db, alert_security_team):
        # db and alert_security_team are injected placeholders: an audit store
        # client and a callable that notifies the security team
        self.db = db
        self.alert_security_team = alert_security_team

    def log_query(self, audit_log: RAGAuditLog):
        # Write to tamper-proof audit database
        self.db.insert(asdict(audit_log))

        # Detect sensitive document access
        sensitive_docs = [
            doc for doc in audit_log.retrieved_documents
            if doc.get("classification") == "secret"
        ]
        if sensitive_docs:
            self.alert_security_team(audit_log, sensitive_docs)

    def generate_compliance_report(
        self, start_date: datetime, end_date: datetime
    ) -> dict:
        """Generate compliance report: who accessed what documents, when"""
        return self.db.aggregate({
            "date_range": [start_date, end_date],
            "group_by": ["user_id", "dataset_id"],
            "metrics": ["query_count", "unique_docs_accessed"]
        })
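The audit code treats the tamper-proof store as a given. One common way to make a log at least tamper-evident is to chain each entry to the hash of the previous one. The sketch below illustrates that technique with an in-memory list; it is an assumption about how such a store could work, not something Dify provides:

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder hash for the first entry


def chain_hash(prev_hash: str, record: dict) -> str:
    """Hash the previous entry's hash together with this record's canonical JSON."""
    payload = prev_hash + json.dumps(record, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()


def append_entry(log: list[dict], record: dict) -> None:
    """Append a record, linking it to the hash of the previous entry."""
    prev_hash = log[-1]["hash"] if log else GENESIS_HASH
    log.append({"record": record, "hash": chain_hash(prev_hash, record)})


def verify_chain(log: list[dict]) -> bool:
    """Recompute every hash; an edited record, or one removed from the middle, fails."""
    prev_hash = GENESIS_HASH
    for entry in log:
        if entry["hash"] != chain_hash(prev_hash, entry["record"]):
            return False
        prev_hash = entry["hash"]
    return True
```

A chain like this detects edits but does not prevent them, so the latest hash should also be stored somewhere the log writer cannot modify.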

Level 4: Production Pitfalls and Decision-Making (Expert Perspective)

4.1 Pitfall 1: Performance Degradation from Knowledge Base Count Explosion

Symptom: As the business expands, the number of knowledge bases keeps growing. An application bound to 15+ knowledge bases sees retrieval latency jump from 200ms to 2s+.

Root causes:

More concurrent queries increase database connection pool pressure
Candidate count explosion (15 × Top-20 = 300 candidates) dramatically increases Rerank computation
Result quality actually degrades: noise from irrelevant knowledge bases accumulates

Solution: Knowledge base routing layer

Rather than binding all knowledge bases directly to an application, add a routing layer:

import numpy as np

def cosine_similarity(a, b) -> float:
    """Cosine similarity between two embedding vectors"""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

class KnowledgeBaseRouter:
    def __init__(self, all_datasets: dict, embedding_model):
        self.datasets = all_datasets
        # Keep the model: route() needs it to embed each incoming query
        self.embedding_model = embedding_model
        # Create vectors for each knowledge base's description
        self.dataset_embeddings = {
            ds_id: embedding_model.encode(ds_info["description"])
            for ds_id, ds_info in all_datasets.items()
        }

    def route(self, query: str, max_datasets: int = 3) -> list[str]:
        """Select the N most relevant knowledge bases"""
        query_embedding = self.embedding_model.encode(query)

        scores = {
            ds_id: cosine_similarity(query_embedding, ds_emb)
            for ds_id, ds_emb in self.dataset_embeddings.items()
        }

        sorted_datasets = sorted(
            scores.items(), key=lambda x: x[1], reverse=True
        )
        return [ds_id for ds_id, _ in sorted_datasets[:max_datasets]]

Regardless of how many knowledge bases exist in total, each query only accesses the most relevant 2–3, keeping performance predictable.
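To see the routing idea in action without an embedding model, the sketch below swaps in a toy bag-of-words vector. The embed function, the knowledge base names, and the descriptions are all made up for illustration; a real router would call an actual embedding model:

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real router would call an embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def route(query: str, descriptions: dict[str, str], max_datasets: int = 2) -> list[str]:
    """Rank knowledge bases by description similarity and keep the top few."""
    q = embed(query)
    ranked = sorted(
        descriptions,
        key=lambda ds: cosine(q, embed(descriptions[ds])),
        reverse=True,
    )
    return ranked[:max_datasets]


descriptions = {
    "kb_hr": "leave salary benefits performance policy",
    "kb_tech": "api sdk errors product features",
    "kb_finance": "expense reimbursement budget invoices",
}
selected = route("how do I file an expense reimbursement", descriptions, max_datasets=1)
print(selected)  # ['kb_finance']
```

Routing quality depends heavily on how well each description covers the vocabulary users actually ask with.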

4.2 Pitfall 2: Duplicate Documents Across Knowledge Bases

Symptom: The same document exists in multiple knowledge bases (e.g., a company policy is in both the HR knowledge base and the all-employees knowledge base), resulting in duplicate content in retrieval results and redundant information sent to the model.

Solution: Deduplication layer

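import hashlib

def deduplicate_results(results: list) -> list:
    """Deduplicate based on content hash"""
    seen_hashes = set()
    deduplicated = []

    for result in results:
        # Hash the first 200 characters of content (faster, avoids full comparison).
        # Caution: chunks that share an identical 200-character opening, such as
        # a templated header, are treated as duplicates.
        content_hash = hashlib.md5(
            result["content"][:200].encode()
        ).hexdigest()

        # Results are expected in rank order, so the first occurrence is kept
        if content_hash not in seen_hashes:
            seen_hashes.add(content_hash)
            deduplicated.append(result)

    return deduplicated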

Better approach: establish a "master document library" where other knowledge bases reference documents rather than copying them. Dify does not natively support this — it requires implementation at the application layer.

4.3 Pitfall 3: Permission Bypass Vulnerabilities

Problem: Dataset isolation does not help once a key leaks. With a leaked application API key, an attacker can call the Dify API directly and query everything that key can reach, bypassing any checks made at the proxy layer.

Protective measures:

Least-privilege API keys: Use separate API keys per application; restrict each key's scope wherever your Dify version and deployment allow it
API key rotation: Rotate API keys regularly; invalidate old keys immediately
Request signing: Add HMAC signatures at the proxy layer to ensure only authorized clients can call
Network isolation: Dify service is not exposed to the public internet; accessible only on internal network

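import hmac
import hashlib
import time

def verify_request_signature(
    request_body: str,
    signature: str,
    timestamp: int,
    secret: str,
    max_age_seconds: int = 300
) -> bool:
    """Verify request signature to prevent misuse after API key exposure"""
    # timestamp is the value the client sent with the request and signed.
    # Reject stale or future-dated requests to prevent replay attacks
    # (5-minute validity)
    if abs(int(time.time()) - timestamp) > max_age_seconds:
        return False

    # Recompute the signature over the client's timestamp and the body
    message = f"{timestamp}:{request_body}"
    expected = hmac.new(
        secret.encode(),
        message.encode(),
        hashlib.sha256
    ).hexdigest()

    # Constant-time comparison avoids leaking information through timing
    return hmac.compare_digest(expected, signature)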
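For the signature to verify, the client has to sign with a timestamp and send that same timestamp alongside the request. A minimal round-trip sketch follows, assuming a shared secret and hypothetical header names X-Timestamp and X-Signature:

```python
import hashlib
import hmac
import time


def sign_request(request_body: str, secret: str, timestamp: int) -> str:
    """HMAC-SHA256 over "<timestamp>:<body>", as a hex string."""
    message = f"{timestamp}:{request_body}"
    return hmac.new(secret.encode(), message.encode(), hashlib.sha256).hexdigest()


def is_fresh(timestamp: int, now: int, max_age_s: int = 300) -> bool:
    """Server side: reject timestamps outside the 5-minute window."""
    return abs(now - timestamp) <= max_age_s


# Client side: sign the body and attach timestamp and signature as headers.
secret = "demo-secret"  # placeholder; load real secrets from a secret store
body = '{"query": "leave policy"}'
ts = int(time.time())
headers = {"X-Timestamp": str(ts), "X-Signature": sign_request(body, secret, ts)}

# Server side: recompute the signature from the timestamp the client sent.
expected = sign_request(body, secret, int(headers["X-Timestamp"]))
accepted = is_fresh(ts, int(time.time())) and hmac.compare_digest(
    expected, headers["X-Signature"]
)
```

Because the timestamp is part of the signed message, an attacker who replays a captured request cannot refresh it without knowing the secret.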

4.4 Multi-Knowledge-Base Architecture Decision Tree

Do you need document-level permission control?
│
├── No → Single application binding multiple knowledge bases
│        (simple, good performance)
│
└── Yes
    │
    ├── Fewer than 10 user roles?
    │   └── Yes → One application per role, bind corresponding knowledge bases
    │
    └── More than 10 roles, or need fine-grained document permissions?
        │
        ├── Document count < 1M → pgvector + metadata filtering
        └── Document count > 1M → Qdrant/Weaviate + payload filtering
                                    + API proxy layer + audit logs
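The same tree can be written as a function, which makes the thresholds explicit. This is a sketch of the decision logic only; the tree does not say what happens at exactly 10 roles or exactly 1M documents, so the code assumes those boundary cases take the larger-scale branch:

```python
def choose_architecture(
    needs_doc_permissions: bool,
    role_count: int,
    fine_grained: bool,
    doc_count: int,
) -> str:
    """Encode the decision tree; the 10-role and 1M-document thresholds come from it."""
    if not needs_doc_permissions:
        return "single app, multiple knowledge bases"
    if role_count < 10 and not fine_grained:
        return "one app per role"
    if doc_count < 1_000_000:
        return "pgvector + metadata filtering"
    return "Qdrant/Weaviate + payload filtering + API proxy + audit logs"
```

For example, a deployment with 25 roles and 200,000 documents lands on pgvector with metadata filtering.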

Chapter Summary

Multi-knowledge-base architecture is an inevitable step in enterprise AI knowledge base deployment. The core challenge is balancing flexibility with security:

Architecture selection: For fewer than 10 user roles, prioritize Dataset isolation with the multi-application approach. For more than 10 permission types, introduce an API proxy layer and metadata filtering.

Performance optimization: The knowledge base routing layer is the key to managing knowledge base count explosion — ensure each query accesses only the most relevant 2–3 knowledge bases.

Compliance guarantee: Regulated industries must build comprehensive audit log systems recording who accessed which documents, when.

Ongoing maintenance: Knowledge base quality degrades over time. Establish monthly quality checks and quarterly rebuild cycles to maintain long-term quality.

Key checklist:

Dataset permission isolation boundaries clearly defined
Cross-knowledge-base result fusion strategy determined (normalization or Rerank)
Document metadata standards established (classification, allowed_roles, valid_until)
Knowledge base routing layer implemented (required when exceeding 5 knowledge bases)
Audit logging system deployed
Document update synchronization process established
Knowledge base quality baseline created and run regularly
