Multi-Knowledge-Base Queries and Enterprise Document Permission Management
Chapter 8: Multi-Knowledge-Base Joint Query and Enterprise Document Permission Management
A single knowledge base cannot meet the complex information architecture needs of enterprises; this chapter explains how to search across multiple knowledge bases collaboratively while ensuring data security and permission isolation.
Chapter Overview
When enterprises deploy AI knowledge bases, they almost inevitably encounter this situation: the company has dozens of departments, each with its own document system. HR policy documents, financial reports, technical API docs, and product manuals must be isolated from each other, yet sometimes cross-department joint search is needed.
Even more complex is the permissions problem: regular employees can only access the product manual, HR specialists can view salary bands, and only managers can see strategic planning documents. How do you implement such a permission system in Dify while maintaining retrieval quality?
This chapter systematically covers:
- Multi-knowledge-base architecture design in Dify
- How to implement cross-knowledge-base joint queries
- Enterprise document permission management strategies
- Operational challenges and solutions for large-scale knowledge bases
Level 1: Fundamentals (1–3 Years Experience)
1.1 Dify Knowledge Base Hierarchy
A knowledge base (Dataset) in Dify is an independent namespace containing:
- Documents: uploaded source files (PDF, Word, TXT, etc.)
- Segments/Chunks: the smallest retrieval units after document splitting
- Vector index: vector representations of each chunk, stored in the vector database
- Keyword index: inverted index used for full-text search
A single Dify instance can host multiple Datasets, and each application can bind to one or more Datasets.
Key concept: Datasets are the basic unit of permission isolation. Data across different Datasets is completely independent and cannot interfere with each other.
1.2 Binding Multiple Knowledge Bases to a Single Application
The simplest way to use multiple knowledge bases is to bind multiple Datasets to one application.
Steps:
- Open the Dify application orchestration interface
- Click the "Knowledge Base" panel
- Search and add multiple knowledge bases
- Set retrieval parameters for each (independently configurable)
How it works: When a user asks a question, Dify concurrently queries all bound knowledge bases, then merges the retrieval results and sends them together to the model.
Use cases:
- Product Q&A chatbot: simultaneously query product manual + FAQ + troubleshooting knowledge base
- Enterprise assistant: simultaneously query company policy + department rules + industry standards
Limitations:
- All users can access all bound knowledge bases (no permission control)
- Result quality degrades when too many knowledge bases are bound
- Recommendation: bind no more than 5 knowledge bases per application
1.3 Conditional Retrieval via Workflow
For scenarios that require selecting different knowledge bases based on user identity, workflows provide the most flexible approach.
Basic workflow design:
User input
    ↓
Intent classification node (LLM)
Determines which domain the question belongs to
    ↓
    ├── "HR Policy" → Query HR knowledge base
    ├── "Technical Support" → Query tech docs knowledge base
    ├── "Finance" → Query finance knowledge base
    └── Default → Query general knowledge base
    ↓
Merge results + model generates answer
Intent classification prompt:
Based on the user's question, determine which category it belongs to
(output only the category name):
- HR_POLICY: employee benefits, leave, salary, performance, etc.
- TECH_SUPPORT: product features, API, technical issues, etc.
- FINANCE: expense reimbursement, budget, financial processes, etc.
- GENERAL: other questions
User question: {{query}}
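The classifier's output still has to be mapped to a concrete knowledge base. The following is a minimal sketch of that step, assuming the LLM returns one of the four labels above; the dataset IDs and the `resolve_dataset` helper are hypothetical names used only for illustration. Inside a Dify workflow, equivalent logic would typically sit in a branching or code node placed after the classification node.

```python
# Hypothetical mapping from classifier labels to dataset IDs (illustrative names).
CATEGORY_TO_DATASET = {
    "HR_POLICY": "dataset_hr",
    "TECH_SUPPORT": "dataset_tech_docs",
    "FINANCE": "dataset_finance",
    "GENERAL": "dataset_general",
}


def resolve_dataset(llm_output: str) -> str:
    """Map raw classifier output to a dataset ID, falling back to the general KB."""
    # LLMs often add whitespace, quotes, or trailing punctuation; normalize first.
    label = llm_output.strip().strip("\"'.").upper()
    # Anything that is not a known label is routed to the general knowledge base.
    return CATEGORY_TO_DATASET.get(label, CATEGORY_TO_DATASET["GENERAL"])


print(resolve_dataset(" hr_policy\n"))
print(resolve_dataset("I am not sure"))
```

Normalizing the label and defaulting to the general knowledge base keeps the workflow from failing when the model ignores the "output only the category name" instruction.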
1.4 Understanding Dify's Permission Model
Dify's native permission model is Workspace-based:
- Owner: highest privilege, can manage all resources
- Admin: can manage knowledge bases, applications, and members
- Editor: can create and edit applications, but cannot manage members
- Member: can only use published applications (chatbot UI)
This is a coarse-grained workspace-level control. For the document-level permissions enterprises need ("only HR can see salary documents"), additional implementation at the application layer is required.
Level 2: Mechanisms in Depth (3–5 Years Experience)
2.1 Performance Characteristics of Parallel Multi-Knowledge-Base Retrieval
When an application binds multiple knowledge bases, Dify uses a concurrent query strategy:
# Pseudocode: Dify multi-dataset retrieval logic
import asyncio

async def multi_dataset_retrieve(query: str, dataset_ids: list[str], top_k: int = 10) -> list:
    # Launch one retrieval task per knowledge base so they run concurrently.
    # single_dataset_retrieve is a placeholder for the per-dataset search call.
    tasks = [
        asyncio.create_task(single_dataset_retrieve(query, dataset_id))
        for dataset_id in dataset_ids
    ]
    # Wait for all knowledge base retrievals to complete
    results = await asyncio.gather(*tasks)
    # Flatten and merge
    merged = [doc for result_list in results for doc in result_list]
    # Sort by score (highest first) and keep the global Top-K
    merged.sort(key=lambda x: x.score, reverse=True)
    return merged[:top_k]
Performance implications:
- Retrieval latency = max(individual knowledge base latencies), not sum
- Concurrent retrieval of 5 knowledge bases approaches the latency of a single one
- But Top-K merging produces N × Top-K total candidates (requiring extra filtering)
Actual latency measurements (each knowledge base P50 = 50ms):
- 1 knowledge base: ~55ms
- 3 knowledge bases: ~65ms
- 10 knowledge bases: ~90ms (main overhead shifts to result merging and Rerank)
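The "max, not sum" behavior can be checked with a small simulation. This sketch replaces real knowledge base calls with `asyncio.sleep`; the function names and the latency values are made up for the demonstration and say nothing about Dify's actual timings.

```python
import asyncio
import time


async def fake_retrieve(name: str, latency_s: float) -> str:
    # Stand-in for a knowledge base query that takes latency_s seconds.
    await asyncio.sleep(latency_s)
    return name


async def timed_gather(latencies: dict[str, float]) -> tuple[list[str], float]:
    # Run all simulated retrievals concurrently and measure wall-clock time.
    start = time.perf_counter()
    names = await asyncio.gather(
        *(fake_retrieve(name, lat) for name, lat in latencies.items())
    )
    return list(names), time.perf_counter() - start


latencies = {"kb_a": 0.05, "kb_b": 0.08, "kb_c": 0.03}
names, elapsed = asyncio.run(timed_gather(latencies))
# Elapsed time should sit near the slowest KB (80 ms), well below the 160 ms sum.
print(names, f"{elapsed * 1000:.0f} ms")
```

The same reasoning explains why the overhead at 10 knowledge bases comes from merging and Rerank rather than from retrieval itself.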
2.2 Designing Enterprise-Grade Permission Architecture
Option 1: Dataset Isolation (Recommended: highest security)
User role → Authorized Dataset IDs
HR Specialist:
  - dataset_company_policy
  - dataset_hr_handbook
  - dataset_salary_bands (HR-exclusive)
Engineering Manager:
  - dataset_company_policy
  - dataset_tech_docs
  - dataset_engineering_decisions (manager-exclusive)
All Employees:
  - dataset_company_policy
  - dataset_product_handbook
Implementation:
- Create a dedicated Dify application per permission level
- Each application binds only the knowledge bases allowed for that permission level
- Integrate with SSO/IdP to route users to the appropriate application
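The routing step after SSO login can be as small as a lookup table. The sketch below assumes the IdP returns a list of role names; the role names, application IDs, and the priority order are all hypothetical, and the order in which overlapping roles are resolved is a policy decision for each organization.

```python
# Hypothetical role-to-application table; names are illustrative only.
ROLE_TO_APP = {
    "hr_specialist": "app_hr",
    "engineering_manager": "app_eng_mgmt",
    "employee": "app_all_employees",
}
# Order in which roles are checked when a user holds several of them (assumed policy).
ROLE_PRIORITY = ["hr_specialist", "engineering_manager", "employee"]


def route_user_to_app(idp_roles: list[str]) -> str:
    """Pick the Dify application for a user from the roles asserted by the IdP."""
    for role in ROLE_PRIORITY:
        if role in idp_roles:
            return ROLE_TO_APP[role]
    # Deny by default: a user with no recognized role gets no application at all.
    raise PermissionError(f"no application configured for roles: {idp_roles}")


print(route_user_to_app(["employee", "hr_specialist"]))
```

Raising an error for unrecognized roles, rather than falling back to some default application, keeps the permission boundary explicit.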
Option 2: API-layer Permission Proxy
from datetime import datetime, timezone
from typing import Optional

# DifyClient, PermissionDB and AuditDB are placeholders for your own wrappers;
# they are not part of an official Dify SDK.
class DifyPermissionProxy:
    def __init__(self, dify_base_url: str, dify_api_key: str):
        self.dify = DifyClient(dify_base_url, dify_api_key)
        self.permission_db = PermissionDB()
        self.audit_db = AuditDB()  # audit sink used by audit_log()

    def query(self, user_id: str, query: str, conversation_id: Optional[str] = None):
        # 1. Get Dataset IDs the user is authorized to access
        allowed_datasets = self.permission_db.get_allowed_datasets(user_id)
        # 2. Dynamically build query request with only authorized knowledge bases
        response = self.dify.chat(
            query=query,
            conversation_id=conversation_id,
            extra_context={"allowed_dataset_ids": allowed_datasets},
        )
        return response

    def audit_log(self, user_id: str, query: str, retrieved_docs: list):
        # Record which documents the user accessed (compliance requirement)
        self.audit_db.log(
            user_id=user_id,
            query=query,
            doc_ids=[doc.id for doc in retrieved_docs],
            timestamp=datetime.now(timezone.utc),
        )
2.3 Result Fusion Strategy for Cross-Knowledge-Base Queries
The biggest challenge with multiple knowledge bases is result heterogeneity: different knowledge bases may use different embedding models or chunking strategies, leading to inconsistent score distributions.
Example problem:
- Knowledge Base A (technical docs): uses text-embedding-3-large, cosine similarity generally high (0.75–0.90)
- Knowledge Base B (legal documents): uses bge-m3, cosine similarity generally lower (0.45–0.65)
If merged by raw score, Knowledge Base A results will systematically rank higher regardless of actual relevance.
Solution: Normalized score fusion
def normalize_and_merge(results_by_dataset: dict) -> list:
    """
    Apply min-max normalization to each knowledge base's scores before merging
    """
    all_results = []
    for dataset_id, results in results_by_dataset.items():
        if not results:
            continue
        scores = [r.score for r in results]
        min_score = min(scores)
        max_score = max(scores)
        score_range = max_score - min_score
        for result in results:
            if score_range > 0:
                normalized = (result.score - min_score) / score_range
            else:
                # All scores identical: treat every result as a top hit
                normalized = 1.0
            all_results.append({
                "content": result.content,
                "source_dataset": dataset_id,
                "original_score": result.score,
                "normalized_score": normalized,
                "metadata": result.metadata
            })
    all_results.sort(key=lambda x: x["normalized_score"], reverse=True)
    return all_results
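A small worked example makes the effect of normalization visible. The scores below are invented to fall inside the ranges quoted for Knowledge Base A and Knowledge Base B; the `min_max` helper is a standalone illustration, not part of Dify.

```python
def min_max(scores: list[float]) -> list[float]:
    """Rescale scores to [0, 1] within one knowledge base."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [1.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


kb_a = {"a1": 0.90, "a2": 0.80, "a3": 0.75}  # technical docs, high raw scores
kb_b = {"b1": 0.65, "b2": 0.50, "b3": 0.45}  # legal documents, lower raw scores

# Merging on raw scores: every result from A outranks every result from B.
raw = sorted({**kb_a, **kb_b}.items(), key=lambda kv: kv[1], reverse=True)
raw_order = [doc for doc, _ in raw]
print(raw_order)

# Merging on per-KB normalized scores: the two knowledge bases interleave.
normalized = {}
for kb in (kb_a, kb_b):
    for doc, score in zip(kb, min_max(list(kb.values()))):
        normalized[doc] = score
fused = sorted(normalized.items(), key=lambda kv: kv[1], reverse=True)
fused_order = [doc for doc, _ in fused]
print(fused_order)
```

One caveat follows directly from the arithmetic: min-max scaling always maps each knowledge base's best hit to 1.0, even when that knowledge base has nothing relevant to the query.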
Even better: unified Rerank
Regardless of each knowledge base's raw scores, apply Rerank to the merged pool:
Knowledge Base A results (Top-20) ─┐
Knowledge Base B results (Top-20) ─┼→ Merge 60 candidates → Rerank → Top-10 final
Knowledge Base C results (Top-20) ─┘
Rerank scores each candidate on the query-document pair content itself, independent of the original retrieval scores, which effectively eliminates distribution bias.
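The merge-then-rerank flow can be sketched as follows. The `token_overlap` scorer is a deliberately crude stand-in for a real rerank model, and `merge_and_rerank` is a hypothetical helper; the point is only that one scorer is applied to the whole pooled candidate set.

```python
from typing import Callable


def token_overlap(query: str, text: str) -> float:
    # Toy stand-in for a rerank model: fraction of query tokens found in the text.
    q = set(query.lower().split())
    return len(q & set(text.lower().split())) / len(q) if q else 0.0


def merge_and_rerank(
    query: str,
    results_by_dataset: dict[str, list[str]],
    scorer: Callable[[str, str], float] = token_overlap,
    top_n: int = 10,
) -> list[tuple[str, str, float]]:
    # Pool candidates from every knowledge base, ignoring raw retrieval scores.
    pool = [(ds, text) for ds, texts in results_by_dataset.items() for text in texts]
    # Score every (query, document) pair with the same scorer so results are comparable.
    scored = [(ds, text, scorer(query, text)) for ds, text in pool]
    scored.sort(key=lambda item: item[2], reverse=True)
    return scored[:top_n]


candidates = {
    "kb_tech": ["api rate limit is 100 requests per minute", "sdk install guide"],
    "kb_legal": ["data retention policy for customer records"],
}
top = merge_and_rerank("customer data retention", candidates, top_n=2)
print(top[0])
```

Swapping `token_overlap` for a call to an actual rerank model leaves the rest of the function unchanged.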
2.4 Document Metadata Design for Permissions and Filtering
Setting metadata thoughtfully when uploading documents is the key to fine-grained permissions and filtering:
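import json
import os
import requests

# DIFY_BASE_URL and API_KEY are assumed to come from your configuration.
def upload_document_with_metadata(
    dataset_id: str,
    file_path: str,
    metadata: dict
) -> dict:
    """
    Metadata design example:
    {
        "department": "hr",
        "classification": "confidential",  # public/internal/confidential/secret
        "allowed_roles": ["hr_specialist", "hr_manager", "ceo"],
        "valid_until": "2025-12-31",
        "owner": "user_123",
        "version": "v2.1",
        "language": "en-US"
    }
    """
    payload = {
        "name": os.path.basename(file_path),
        "indexing_technique": "high_quality",
        "process_rule": {"mode": "automatic"},
        # custom_metadata follows this chapter's design; confirm that your Dify
        # version accepts it here, or attach metadata through a separate call
        "custom_metadata": metadata,
    }
    # Open the file in a context manager so the handle is always closed.
    # Verify the exact endpoint path against the API reference for your Dify version.
    with open(file_path, "rb") as f:
        response = requests.post(
            f"{DIFY_BASE_URL}/datasets/{dataset_id}/document/create-by-file",
            headers={"Authorization": f"Bearer {API_KEY}"},
            files={"file": f},
            data={"data": json.dumps(payload)},
        )
    response.raise_for_status()
    return response.json()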
Using metadata filtering at retrieval time:
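def search_with_permission(dataset_id: str, query: str, user_role: str) -> list:
    """Permission-based retrieval using metadata filters"""
    # vector_db and embed are placeholders for your vector store client and
    # embedding function; the filter syntax varies by vector database.
    results = vector_db.search(
        collection=dataset_id,  # restrict the search to this knowledge base
        query_vector=embed(query),
        filter={"allowed_roles": {"contains": user_role}},
        limit=20
    )
    return results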
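The permission rule itself is independent of any vector database. As a minimal sketch, assuming each result carries the metadata fields described above (`allowed_roles`, `valid_until`), the check can be written as a plain function; `is_visible` and `filter_results` are hypothetical helper names. A post-filter like this is useful as a second line of defense, while filtering inside the vector store avoids losing Top-K slots to results the user may not see.

```python
from datetime import date


def is_visible(metadata: dict, user_role: str, today: date) -> bool:
    """Return True if a chunk with this metadata may be shown to user_role."""
    if user_role not in metadata.get("allowed_roles", []):
        return False  # deny by default when the role is not listed
    valid_until = metadata.get("valid_until")
    if valid_until and date.fromisoformat(valid_until) < today:
        return False  # expired documents are hidden
    return True


def filter_results(results: list[dict], user_role: str, today: date) -> list[dict]:
    return [r for r in results if is_visible(r["metadata"], user_role, today)]


results = [
    {"content": "Salary bands 2025",
     "metadata": {"allowed_roles": ["hr_specialist"], "valid_until": "2025-12-31"}},
    {"content": "Old travel policy",
     "metadata": {"allowed_roles": ["hr_specialist", "employee"], "valid_until": "2023-01-01"}},
    {"content": "Product handbook",
     "metadata": {"allowed_roles": ["employee", "hr_specialist"]}},
]
visible = filter_results(results, "employee", today=date(2025, 6, 1))
print([r["content"] for r in visible])  # ['Product handbook']
```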
2.5 Knowledge Base Version Management and Update Strategy
In enterprise environments, frequent document updates are the norm.
Strategy 1: Full replacement (for small knowledge bases)
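# Delete old document, upload new one
# (verify endpoint paths against the API reference for your Dify version)
curl -X DELETE "http://dify/api/datasets/{id}/documents/{doc_id}" \
  -H "Authorization: Bearer $API_KEY"
curl -X POST "http://dify/api/datasets/{id}/document/create-by-file" \
  -H "Authorization: Bearer $API_KEY" \
  -F 'data={"indexing_technique":"high_quality","process_rule":{"mode":"automatic"}}' \
  -F "file=@new_document.pdf"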
Strategy 2: Incremental update (for large knowledge bases)
- Maintain hash values for each document
- Scheduled scanning: only update changed documents
- Manage source files with Git; use diffs to determine the scope of changes
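The hash-and-scan idea can be sketched with the standard library alone. This example assumes documents live in a local directory tree and that the previous scan's manifest (a mapping from relative path to SHA-256) has been stored somewhere; `scan`, `file_sha256`, and `diff_manifest` are hypothetical helper names, and pushing the resulting changes to Dify is left out.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path) -> str:
    # Hash the file in blocks so large documents are not loaded into memory at once.
    h = hashlib.sha256()
    with path.open("rb") as f:
        for block in iter(lambda: f.read(65536), b""):
            h.update(block)
    return h.hexdigest()


def scan(root: Path) -> dict[str, str]:
    """Build a {relative_path: sha256} manifest for every file under root."""
    return {
        str(p.relative_to(root)): file_sha256(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def diff_manifest(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """Compare two manifests and list what must be uploaded, re-indexed, or deleted."""
    return {
        "added": sorted(p for p in new if p not in old),
        "changed": sorted(p for p in new if p in old and new[p] != old[p]),
        "removed": sorted(p for p in old if p not in new),
    }


previous = {"hr/leave.pdf": "aaa", "hr/salary.pdf": "bbb", "it/vpn.md": "ccc"}
current = {"hr/leave.pdf": "aaa", "hr/salary.pdf": "ddd", "it/sso.md": "eee"}
changes = diff_manifest(previous, current)
print(changes)
```

Only the paths under "added" and "changed" need to be re-embedded, which is what keeps incremental updates cheap for large knowledge bases.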
Strategy 3: Blue-green knowledge bases (for mission-critical)
Production KB (Blue) ← All current traffic
    ↓
Prepare new version KB (Green): re-index all documents
    ↓
Run evaluation benchmark: Green quality >= Blue quality?
    ↓ Yes
Switch: bind application from Blue to Green
    ↓
Retain Blue for a period as rollback backup
Level 3: Source Code and Principles (5+ Years Experience)
3.1 Dify Multi-Knowledge-Base Routing Source Code
The multi-knowledge-base routing logic in Dify resides in the workflow engine's knowledge retrieval node (KnowledgeRetrievalNode):
# api/core/workflow/nodes/knowledge_retrieval/knowledge_retrieval_node.py
class KnowledgeRetrievalNode(BaseNode):
    def _run(self, variable_pool: VariablePool) -> NodeRunResult:
        # Get all bound knowledge base IDs
        dataset_ids = self.node_data.dataset_ids
        # Retrieve query text from variable pool
        query = variable_pool.get_any(self.node_data.query_variable_selector)
        # Concurrent retrieval
        results = DatasetRetrieval().retrieve(
            model_config=self.model_config,
            config=self.node_data.retrieval_model,
            query=query,
            dataset_ids=dataset_ids,
            invoke_from=InvokeFrom.WORKFLOW,
            hit_callback=self._hit_callback
        )
        return NodeRunResult(
            status=WorkflowNodeExecutionStatus.SUCCEEDED,
            outputs={"result": [doc.to_dict() for doc in results]}
        )
Key design: DatasetRetrieval.retrieve() accepts a list of dataset_ids, queries them concurrently internally, then merges results. This design makes multi-knowledge-base queries completely transparent to upper layers.
3.2 Vector Database Multi-Tenancy Implementation Details
Using Qdrant as an example, here is how Dify achieves Dataset data isolation:
# api/core/rag/datasource/vdb/qdrant/qdrant_vector.py (simplified)
class QdrantVector(BaseVector):
    # Dify creates a separate Qdrant Collection for each Dataset
    # Collection name = "Vector_index_" + dataset_id (hyphens replaced by underscores) + "_Node"
    def __init__(self, dataset: Dataset, config: QdrantConfig):
        self._collection_name = Dataset.gen_collection_name_by_id(dataset.id)
        # e.g. "Vector_index_550e8400_e29b_41d4_a716_446655440000_Node"

    def search_by_vector(self, query_vector: list[float], **kwargs) -> list[Document]:
        return self._client.search(
            collection_name=self._collection_name,  # Each Dataset = isolated collection
            query_vector=query_vector,
            limit=kwargs.get('top_k', 4),
            with_payload=True,
            score_threshold=kwargs.get('score_threshold', 0)
        )
Isolation mechanism: Each Dataset corresponds to an independent Collection in Qdrant: physically isolated, impossible to cross boundaries accidentally.
Multi-tenant scaling: For very large deployments (100,000+ Datasets), one Collection per Dataset causes Collection count explosion. An alternative is shared Collection with namespace filtering:
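# Logical isolation via payload filtering in a shared Collection
from qdrant_client.models import FieldCondition, Filter, MatchValue

# client is a QdrantClient instance; query_embedding and dataset_id come from the caller
search_result = client.search(
    collection_name="shared_collection",
    query_vector=query_embedding,
    query_filter=Filter(
        must=[FieldCondition(key="dataset_id", match=MatchValue(value=dataset_id))]
    )
)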
This approach requires modifying Dify source code; weigh the maintenance cost carefully.
3.3 Dify's BM25 Full-Text Search Implementation
Dify's full-text search is based on database-level full-text indexing rather than a dedicated search engine like Elasticsearch:
For pgvector/PostgreSQL:
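-- Simplified Dify keyword index creation
CREATE INDEX ON dataset_keyword_tables
    USING gin(to_tsvector('english', content));
-- Query: build the tsquery once so both the filter and ts_rank can use it.
-- Note: ts_rank is PostgreSQL's built-in frequency-based ranking, not a true BM25 score.
SELECT dataset_keyword_tables.*,
       ts_rank(to_tsvector('english', content), query) AS rank
FROM dataset_keyword_tables,
     plainto_tsquery('english', $2) AS query
WHERE dataset_id = $1
  AND to_tsvector('english', content) @@ query
ORDER BY rank DESC
LIMIT 20;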
Important note: PostgreSQL full-text search for Chinese requires pg_jieba or zhparser extensions. Without these, Chinese keyword retrieval quality will be very poor.
Check and install pg_jieba:
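# Check in Dify Docker environment
docker exec dify-postgres psql -U postgres -c "\dx"
# Install pg_jieba if no Chinese tokenization extension is found.
# Run this inside the PostgreSQL container; the package name depends on your
# PostgreSQL version and distribution, and pg_jieba may need to be built from source.
apt-get install postgresql-14-jieba
# Then in psql:
CREATE EXTENSION pg_jieba;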
3.4 Enterprise Audit and Compliance
In regulated industries (finance, healthcare, legal), AI knowledge bases must meet audit requirements:
from dataclasses import asdict, dataclass
from datetime import datetime

@dataclass
class RAGAuditLog:
    """Audit record for each RAG query"""
    log_id: str
    timestamp: datetime
    user_id: str
    user_role: str
    application_id: str
    query: str
    retrieved_documents: list[dict]  # [{doc_id, chunk_id, score, dataset_id, classification}]
    model_used: str
    response_summary: str  # Do not store full response (may contain sensitive info)
    session_id: str
    ip_address: str

class AuditLogger:
    def __init__(self, db, alert_security_team):
        # db: client for the audit store; alert_security_team: callback invoked
        # on sensitive hits. Both are placeholders supplied by the caller.
        self.db = db
        self.alert_security_team = alert_security_team

    def log_query(self, audit_log: RAGAuditLog):
        # Write to tamper-proof audit database
        self.db.insert(asdict(audit_log))
        # Detect sensitive document access
        sensitive_docs = [
            doc for doc in audit_log.retrieved_documents
            if doc.get("classification") == "secret"
        ]
        if sensitive_docs:
            self.alert_security_team(audit_log, sensitive_docs)

    def generate_compliance_report(
        self, start_date: datetime, end_date: datetime
    ) -> dict:
        """Generate compliance report: who accessed what documents, when"""
        return self.db.aggregate({
            "date_range": [start_date, end_date],
            "group_by": ["user_id", "dataset_id"],
            "metrics": ["query_count", "unique_docs_accessed"]
        })
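What the compliance aggregation computes can be shown without a database. The sketch below assumes audit records are plain dictionaries with `user_id` and `retrieved_documents` fields shaped like the dataclass above; `aggregate_access` is a hypothetical helper, and a production system would normally push this grouping down into the audit store.

```python
from collections import defaultdict


def aggregate_access(logs: list[dict]) -> dict[tuple[str, str], dict[str, int]]:
    """Group audit records by (user_id, dataset_id); count queries and unique documents."""
    query_counts: dict[tuple[str, str], int] = defaultdict(int)
    docs_seen: dict[tuple[str, str], set] = defaultdict(set)
    for log in logs:
        touched = set()
        for doc in log["retrieved_documents"]:
            key = (log["user_id"], doc["dataset_id"])
            docs_seen[key].add(doc["doc_id"])
            touched.add(key)
        # A query that hit several chunks of one dataset still counts once for it.
        for key in touched:
            query_counts[key] += 1
    return {
        key: {"query_count": query_counts[key], "unique_docs_accessed": len(docs_seen[key])}
        for key in docs_seen
    }


logs = [
    {"user_id": "u1", "retrieved_documents": [
        {"doc_id": "d1", "dataset_id": "hr"}, {"doc_id": "d2", "dataset_id": "hr"}]},
    {"user_id": "u1", "retrieved_documents": [
        {"doc_id": "d1", "dataset_id": "hr"}, {"doc_id": "d9", "dataset_id": "finance"}]},
]
report = aggregate_access(logs)
print(report)
```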
Level 4: Production Pitfalls and Decision-Making (Expert Perspective)
4.1 Pitfall 1: Performance Degradation from Knowledge Base Count Explosion
Symptom: As the business expands, the number of knowledge bases keeps growing. An application bound to 15+ knowledge bases sees retrieval latency jump from 200ms to 2s+.
Root causes:
- More concurrent queries increase database connection pool pressure
- Candidate count explosion (15 × Top-20 = 300 candidates) dramatically increases Rerank computation
- Result quality actually degrades: noise from irrelevant knowledge bases accumulates
Solution: Knowledge base routing layer
Rather than binding all knowledge bases directly to an application, add a routing layer:
import numpy as np

def cosine_similarity(a, b) -> float:
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

class KnowledgeBaseRouter:
    def __init__(self, all_datasets: dict, embedding_model):
        self.datasets = all_datasets
        # Keep the model so route() can embed incoming queries
        self.embedding_model = embedding_model
        # Create vectors for each knowledge base's description
        self.dataset_embeddings = {
            ds_id: embedding_model.encode(ds_info["description"])
            for ds_id, ds_info in all_datasets.items()
        }

    def route(self, query: str, max_datasets: int = 3) -> list[str]:
        """Select the N most relevant knowledge bases"""
        query_embedding = self.embedding_model.encode(query)
        scores = {
            ds_id: cosine_similarity(query_embedding, ds_emb)
            for ds_id, ds_emb in self.dataset_embeddings.items()
        }
        sorted_datasets = sorted(
            scores.items(), key=lambda x: x[1], reverse=True
        )
        return [ds_id for ds_id, _ in sorted_datasets[:max_datasets]]
Regardless of how many knowledge bases exist in total, each query only accesses the most relevant 2–3, keeping performance predictable.
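To see the routing idea end to end without an embedding service, the sketch below substitutes a toy bag-of-words vector for a real embedding model. The `embed`, `cosine`, and `route` functions and the dataset descriptions are illustrative assumptions; only the "score each description, keep the top N" structure carries over to a real deployment.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    # Toy bag-of-words vector; a real deployment would call an embedding model here.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def route(query: str, descriptions: dict[str, str], max_datasets: int = 2) -> list[str]:
    """Return the IDs of the knowledge bases whose descriptions best match the query."""
    q = embed(query)
    ranked = sorted(
        descriptions,
        key=lambda ds: cosine(q, embed(descriptions[ds])),
        reverse=True,
    )
    return ranked[:max_datasets]


descriptions = {
    "kb_hr": "leave vacation salary benefits performance review",
    "kb_tech": "api sdk error codes integration troubleshooting",
    "kb_finance": "expense reimbursement budget invoice approval",
}
print(route("how do I submit an expense reimbursement", descriptions, max_datasets=1))
```

Routing quality depends heavily on how well each knowledge base's description covers the vocabulary users actually type, so the descriptions deserve the same care as the documents.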
4.2 Pitfall 2: Duplicate Documents Across Knowledge Bases
Symptom: The same document exists in multiple knowledge bases (e.g., a company policy is in both the HR knowledge base and the all-employees knowledge base), resulting in duplicate content in retrieval results and redundant information sent to the model.
Solution: Deduplication layer
import hashlib

def deduplicate_results(results: list) -> list:
    """Deduplicate based on content hash"""
    seen_hashes = set()
    deduplicated = []
    for result in results:
        # Hash the first 200 characters of content (faster, avoids full comparison).
        # MD5 is used only as a fingerprint here, not for security.
        content_hash = hashlib.md5(
            result["content"][:200].encode()
        ).hexdigest()
        if content_hash not in seen_hashes:
            seen_hashes.add(content_hash)
            deduplicated.append(result)
    return deduplicated
Better approach: establish a "master document library" where other knowledge bases reference documents rather than copying them. Dify does not natively support this; it requires implementation at the application layer.
4.3 Pitfall 3: Permission Bypass Vulnerabilities
Problem: Even with Dataset isolation, if an application's API key is leaked, an attacker can call the Dify API directly to query any knowledge base.
Protective measures:
- Least-privilege API keys: Use separate API keys per application; enable API key scope restrictions in Dify
- API key rotation: Rotate API keys regularly; invalidate old keys immediately
- Request signing: Add HMAC signatures at the proxy layer to ensure only authorized clients can call
- Network isolation: Dify service is not exposed to the public internet; accessible only on internal network
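import hmac
import hashlib
import time

def verify_request_signature(
    request_body: str, timestamp: int, signature: str, secret: str,
    max_age_seconds: int = 300,
) -> bool:
    """Verify request signature to prevent misuse after API key exposure"""
    # The client sends the timestamp it signed. Reject anything outside the
    # 5-minute validity window to prevent replay attacks.
    if abs(int(time.time()) - timestamp) > max_age_seconds:
        return False
    # Recompute the signature over the same "<timestamp>:<body>" message
    message = f"{timestamp}:{request_body}"
    expected = hmac.new(
        secret.encode(),
        message.encode(),
        hashlib.sha256
    ).hexdigest()
    # Constant-time comparison avoids leaking information through timing
    return hmac.compare_digest(expected, signature)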
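For signing to work, the caller has to send the timestamp it signed so the proxy can rebuild the same message. The round trip below is a minimal sketch under that assumption; the header names `X-Timestamp` and `X-Signature`, the helper names, and the demo secret are hypothetical.

```python
import hashlib
import hmac
import time
from typing import Optional


def sign_request(request_body: str, secret: str, timestamp: Optional[int] = None) -> dict:
    """Client side: sign "<timestamp>:<body>" and return the headers to attach."""
    ts = int(time.time()) if timestamp is None else timestamp
    digest = hmac.new(secret.encode(), f"{ts}:{request_body}".encode(), hashlib.sha256).hexdigest()
    # Header names are hypothetical; use whatever your proxy expects.
    return {"X-Timestamp": str(ts), "X-Signature": digest}


def check_request(request_body: str, headers: dict, secret: str,
                  max_age_seconds: int = 300, now: Optional[int] = None) -> bool:
    """Proxy side: reject stale timestamps, then compare signatures in constant time."""
    current = int(time.time()) if now is None else now
    ts = int(headers["X-Timestamp"])
    if abs(current - ts) > max_age_seconds:
        return False
    expected = hmac.new(secret.encode(), f"{ts}:{request_body}".encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, headers["X-Signature"])


body = '{"query": "leave policy"}'
headers = sign_request(body, secret="demo-secret", timestamp=1_700_000_000)
print(check_request(body, headers, "demo-secret", now=1_700_000_100))
```

Because the timestamp is part of the signed message, an attacker who replays a captured request after the validity window cannot simply swap in a fresh timestamp without invalidating the signature.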
4.4 Multi-Knowledge-Base Architecture Decision Tree
Do you need document-level permission control?
│
├── No → Single application binding multiple knowledge bases
│        (simple, good performance)
│
└── Yes
    │
    ├── Fewer than 10 user roles?
    │   └── Yes → One application per role, bind corresponding knowledge bases
    │
    └── More than 10 roles, or need fine-grained document permissions?
        │
        ├── Document count < 1M → pgvector + metadata filtering
        └── Document count > 1M → Qdrant/Weaviate + payload filtering
                                  + API proxy layer + audit logs
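The decision tree can also be written as a small function, which is convenient for documenting the choice in an architecture review. This is one possible reading of the tree: the function name and return strings are hypothetical, and the treatment of the boundary cases (exactly 10 roles, exactly 1M documents) is an assumption because the tree does not specify them.

```python
def choose_architecture(
    needs_doc_permissions: bool,
    role_count: int = 0,
    fine_grained: bool = False,
    document_count: int = 0,
) -> str:
    """Walk the decision tree and return the suggested architecture."""
    if not needs_doc_permissions:
        return "single application bound to multiple knowledge bases"
    # Assumption: fine-grained document permissions override a small role count.
    if role_count < 10 and not fine_grained:
        return "one application per role"
    # Assumption: exactly 1M documents is treated as the large case.
    if document_count < 1_000_000:
        return "pgvector + metadata filtering"
    return "Qdrant/Weaviate + payload filtering + API proxy layer + audit logs"


print(choose_architecture(True, role_count=25, document_count=200_000))
```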
Chapter Summary
Multi-knowledge-base architecture is an inevitable step in enterprise AI knowledge base deployment. The core challenge is balancing flexibility with security:
Architecture selection: For fewer than 10 user roles, prioritize Dataset isolation with the multi-application approach. For more than 10 permission types, introduce an API proxy layer and metadata filtering.
Performance optimization: The knowledge base routing layer is the key to managing knowledge base count explosion: ensure each query accesses only the most relevant 2–3 knowledge bases.
Compliance guarantee: Regulated industries must build comprehensive audit log systems recording who accessed which documents, when.
Ongoing maintenance: Knowledge base quality degrades over time. Establish monthly quality checks and quarterly rebuild cycles to maintain long-term quality.
Key checklist:
- Dataset permission isolation boundaries clearly defined
- Cross-knowledge-base result fusion strategy determined (normalization or Rerank)
- Document metadata standards established (classification, allowed_roles, valid_until)
- Knowledge base routing layer implemented (required when exceeding 5 knowledge bases)
- Audit logging system deployed
- Document update synchronization process established
- Knowledge base quality baseline created and run regularly