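Chapter 19: Private Deployment with Docker Compose, K8s, and High-Availability Architecture
Move Dify from the cloud SaaS into your own data center, keeping data sovereignty while staying stable at a scale of ten thousand users. This chapter lays out a complete plan you can put into practice.
Chapter overview
After using Dify Cloud for a while, most enterprises hit the same roadblocks: data may not leave the country, compliance audits need the raw logs, and rate limits kick in as soon as concurrency rises. Private deployment is more than typing docker compose up once. It involves a full set of engineering decisions: network planning, storage selection, service orchestration, canary upgrades, and failure recovery.
This chapter works through three tiers:
- Single-host Docker Compose: for PoCs and small teams (< 50 people)
- Multi-node Docker Swarm / standalone deployment: for mid-sized teams (50-500 people)
- Kubernetes high-availability cluster: for large enterprises (> 500 people, SLA ≥ 99.9%)
After reading this chapter, you will be able to:
- Complete a production-grade private deployment of Dify on your own
- Choose an architecture that fits your scale
- Configure an Nginx reverse proxy, SSL termination, and session persistence
- Design failover and zero-downtime upgrade procedures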
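Level 1: Foundations (1-3 years of experience)
The Dify component landscape
Before deploying anything, you need to know which services Dify consists of. The official Dify Docker Compose file contains the following core services:
| Service | Role | Exposed port |
|---|---|---|
| api | Backend API service (Flask) | 5001 |
| worker | Celery async tasks (document indexing, etc.) | none |
| web | Next.js frontend application | 3000 |
| db | PostgreSQL database | 5432 |
| redis | Cache + message queue | 6379 |
| weaviate | Vector database (default) | 8080 |
| sandbox | Code execution sandbox | 8194 |
| nginx | Reverse proxy entry point | 80/443 |
An analogy: think of Dify as a restaurant. web is the front desk, api is the head chef, worker is the dishwasher (handling the time-consuming jobs), db is the pantry, redis is the serving counter (fast cache), weaviate is the recipe index, and nginx is the doorman.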
Single-host Docker Compose deployment
System requirements (production minimum):
- CPU: 4 cores
- Memory: 8 GB (16 GB recommended)
- Disk: SSD, 100 GB+
- Operating system: Ubuntu 22.04 LTS / Debian 12
Step 1: Install Docker
# Ubuntu 22.04
curl -fsSL https://get.docker.com | sh
sudo usermod -aG docker $USER
newgrp docker
# Verify the installation
docker --version # Docker version 24.x
docker compose version # Docker Compose version v2.x
Step 2: Clone and configure
git clone https://github.com/langgenius/dify.git
cd dify/docker
# Copy the environment variable template
cp .env.example .env
Step 3: Key environment variables
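# Core .env settings (must be changed for production)
# Secret key: use a random value, never the default.
# Compose does not run shell substitutions in .env, so $(...) is NOT expanded here.
# Generate a value first (for example: openssl rand -base64 42) and paste the result.
SECRET_KEY=paste-the-generated-value-here
# Database
DB_USERNAME=dify
DB_PASSWORD=replace-with-a-strong-password
DB_HOST=db
DB_PORT=5432
DB_DATABASE=dify
# Redis
REDIS_HOST=redis
REDIS_PORT=6379
REDIS_PASSWORD=replace-with-a-strong-redis-password
# Vector store (weaviate/qdrant/milvus/pgvector)
VECTOR_STORE=weaviate
# Storage backend (local/s3/azure-blob/google-storage)
STORAGE_TYPE=local
STORAGE_LOCAL_PATH=/app/api/storage
# Public URLs (must be set for production)
CONSOLE_WEB_URL=https://dify.yourcompany.com
APP_WEB_URL=https://dify.yourcompany.com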
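Because values in .env are taken literally, secrets have to be generated beforehand and then written into the file. The following is a minimal sketch of one way to do that, not an official Dify tool: the helper names gen_hex_secret and set_env_var are hypothetical, and the sketch assumes a POSIX system with /dev/urandom.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of Dify): generate random secrets and write them into .env.

# Print N random bytes as lowercase hex (2*N characters); defaults to 32 bytes.
gen_hex_secret() {
  local bytes="${1:-32}"
  head -c "$bytes" /dev/urandom | od -An -tx1 | tr -d ' \n'
}

# Set KEY=VALUE in FILE, replacing an existing KEY line or appending a new one.
set_env_var() {
  local file="$1" key="$2" value="$3" tmp
  tmp="$(mktemp)"
  # grep exits with 1 when no lines remain; that is not an error here.
  grep -v "^${key}=" "$file" > "$tmp" || true
  printf '%s=%s\n' "$key" "$value" >> "$tmp"
  mv "$tmp" "$file"
}

# Usage (run inside dify/docker after copying .env.example):
#   set_env_var .env SECRET_KEY     "$(gen_hex_secret 32)"
#   set_env_var .env DB_PASSWORD    "$(gen_hex_secret 16)"
#   set_env_var .env REDIS_PASSWORD "$(gen_hex_secret 16)"
```

Hex output avoids characters such as @ or / that would need escaping when the Redis password is embedded in a broker URL.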
Step 4: Start the services
# Start all services in the background
docker compose up -d
# Check service status
docker compose ps
# Follow the logs
docker compose logs -f api
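Containers reported as "running" are not necessarily ready to serve requests, so scripts that follow a start-up usually poll until a probe succeeds. Below is a small sketch of such a retry loop; wait_until is a hypothetical helper name, and the usage line assumes the api port is published on 127.0.0.1:5001 and answers on /health, the endpoint the health checks in this chapter probe.

```shell
#!/usr/bin/env bash
# Retry a command until it succeeds or a timeout (in seconds) expires.
# wait_until TIMEOUT INTERVAL COMMAND [ARGS...]
wait_until() {
  local timeout="$1" interval="$2"
  shift 2
  local deadline=$(( $(date +%s) + timeout ))
  until "$@" >/dev/null 2>&1; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      return 1   # timed out without a successful probe
    fi
    sleep "$interval"
  done
  return 0
}

# Usage (assumption: api is reachable on 127.0.0.1:5001):
#   wait_until 120 5 curl -fsS http://127.0.0.1:5001/health && echo "api is up"
```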
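Step 5: Initialize the administrator account
After the first start, open http://your-server-ip and complete the setup wizard to set the administrator email and password.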
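Nginx reverse proxy configuration
In production you must put an Nginx layer in front of Dify to handle SSL termination, static file caching, and access control. The configuration below assumes Nginx runs on the host and that the web and api container ports are published on 127.0.0.1 (3000 and 5001).
# /etc/nginx/sites-available/dify.conf
upstream dify_web {
    server 127.0.0.1:3000;
}
upstream dify_api {
    server 127.0.0.1:5001;
}
server {
    listen 80;
    server_name dify.yourcompany.com;
    return 301 https://$host$request_uri;
}
server {
    listen 443 ssl http2;
    server_name dify.yourcompany.com;
    ssl_certificate /etc/ssl/certs/dify.crt;
    ssl_certificate_key /etc/ssl/private/dify.key;
    ssl_protocols TLSv1.2 TLSv1.3;
    ssl_ciphers HIGH:!aNULL:!MD5;
    # Frontend routes
    location / {
        proxy_pass http://dify_web;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
    # API routes: besides /api, the console API (/console/api), the service API (/v1),
    # and file access (/files) are also served by the api backend
    location ~ ^/(console/api|api|v1|files)(/|$) {
        proxy_pass http://dify_api;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_read_timeout 300s; # LLM responses can be slow
        proxy_send_timeout 300s;
        # SSE streaming support
        proxy_buffering off;
        proxy_cache off;
    }
    # Upload size limit
    client_max_body_size 100m;
}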
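Level 2: Mechanisms in depth (3-5 years of experience)
Production-grade Docker Compose configuration
The official default docker-compose.yaml is fine for a quick trial, but production needs many adjustments. Below is a tuned production configuration: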
# docker-compose.prod.yaml
version: '3.8'
services:
  api:
    image: langgenius/dify-api:0.10.0
    restart: always
    environment:
      MODE: api
      LOG_LEVEL: INFO
      SECRET_KEY: ${SECRET_KEY}
      DB_USERNAME: ${DB_USERNAME}
      DB_PASSWORD: ${DB_PASSWORD}
      DB_HOST: db
      DB_PORT: 5432
      DB_DATABASE: ${DB_DATABASE}
      REDIS_HOST: redis
      REDIS_PORT: 6379
      REDIS_PASSWORD: ${REDIS_PASSWORD}
      CELERY_BROKER_URL: redis://:${REDIS_PASSWORD}@redis:6379/1
      VECTOR_STORE: ${VECTOR_STORE:-weaviate}
      WEAVIATE_ENDPOINT: http://weaviate:8080
      # The weaviate service below enforces API-key auth, so the client needs the same key
      WEAVIATE_API_KEY: ${WEAVIATE_API_KEY}
      # Point code execution at the sandbox service, using the sandbox's API key
      CODE_EXECUTION_ENDPOINT: http://sandbox:8194
      CODE_EXECUTION_API_KEY: ${SANDBOX_API_KEY}
      STORAGE_TYPE: ${STORAGE_TYPE:-local}
      STORAGE_LOCAL_PATH: /app/api/storage
      # Production security settings
      WEB_API_CORS_ALLOW_ORIGINS: ${CONSOLE_WEB_URL}
      CONSOLE_CORS_ALLOW_ORIGINS: ${CONSOLE_WEB_URL}
    volumes:
      - dify_storage:/app/api/storage
    depends_on:
      db:
        condition: service_healthy
      redis:
        condition: service_healthy
    deploy:
      resources:
        limits:
          memory: 2G
          cpus: '2'
        reservations:
          memory: 512M
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:5001/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 60s
    logging:
      driver: "json-file"
      options:
        max-size: "100m"
        max-file: "3"
  worker:
    image: langgenius/dify-api:0.10.0
    restart: always
    environment:
      MODE: worker
      LOG_LEVEL: INFO
      SECRET_KEY: ${SECRET_KEY}
      DB_USERNAME: ${DB_USERNAME}
      DB_PASSWORD: ${DB_PASSWORD}
      DB_HOST: db
      DB_PORT: 5432
      DB_DATABASE: ${DB_DATABASE}
      REDIS_HOST: redis
      REDIS_PORT: 6379
      REDIS_PASSWORD: ${REDIS_PASSWORD}
      CELERY_BROKER_URL: redis://:${REDIS_PASSWORD}@redis:6379/1
      VECTOR_STORE: ${VECTOR_STORE:-weaviate}
      WEAVIATE_ENDPOINT: http://weaviate:8080
      # The weaviate service enforces API-key auth, so the worker needs the key as well
      WEAVIATE_API_KEY: ${WEAVIATE_API_KEY}
      STORAGE_TYPE: ${STORAGE_TYPE:-local}
      STORAGE_LOCAL_PATH: /app/api/storage
    volumes:
      - dify_storage:/app/api/storage
    depends_on:
      db:
        condition: service_healthy
      redis:
        condition: service_healthy
    deploy:
      resources:
        limits:
          memory: 4G
          cpus: '4'
    logging:
      driver: "json-file"
      options:
        max-size: "100m"
        max-file: "3"
  web:
    image: langgenius/dify-web:0.10.0
    restart: always
    environment:
      CONSOLE_API_URL: ${CONSOLE_WEB_URL}
      APP_API_URL: ${APP_WEB_URL}
    deploy:
      resources:
        limits:
          memory: 512M
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3000/"]
      interval: 30s
      timeout: 10s
      retries: 3
  db:
    image: postgres:15-alpine
    restart: always
    environment:
      POSTGRES_USER: ${DB_USERNAME}
      POSTGRES_PASSWORD: ${DB_PASSWORD}
      POSTGRES_DB: ${DB_DATABASE}
      POSTGRES_INITDB_ARGS: "--encoding=UTF8"
    # Performance tuning
    command: >
      postgres
      -c shared_buffers=256MB
      -c max_connections=200
      -c work_mem=4MB
      -c maintenance_work_mem=64MB
      -c effective_cache_size=512MB
      -c wal_level=replica
    volumes:
      - postgres_data:/var/lib/postgresql/data
      - ./postgres/init:/docker-entrypoint-initdb.d
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U ${DB_USERNAME} -d ${DB_DATABASE}"]
      interval: 10s
      timeout: 5s
      retries: 5
    deploy:
      resources:
        limits:
          memory: 2G
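  redis:
    image: redis:7-alpine
    restart: always
    # Note: this instance is both cache and Celery broker. Under memory pressure,
    # allkeys-lru may evict queue keys; use noeviction if queued tasks must never be dropped.
    command: >
      redis-server
      --requirepass ${REDIS_PASSWORD}
      --maxmemory 512mb
      --maxmemory-policy allkeys-lru
      --save 60 1000
      --appendonly yes
    volumes:
      - redis_data:/data
    healthcheck:
      test: ["CMD", "redis-cli", "-a", "${REDIS_PASSWORD}", "ping"]
      interval: 10s
      timeout: 5s
      retries: 5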
  weaviate:
    image: semitechnologies/weaviate:1.24.1
    restart: always
    environment:
      QUERY_DEFAULTS_LIMIT: 25
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'false'
      AUTHENTICATION_APIKEY_ENABLED: 'true'
      AUTHENTICATION_APIKEY_ALLOWED_KEYS: ${WEAVIATE_API_KEY}
      AUTHENTICATION_APIKEY_USERS: dify
      PERSISTENCE_DATA_PATH: /var/lib/weaviate
      DEFAULT_VECTORIZER_MODULE: none
      ENABLE_MODULES: ''
      CLUSTER_HOSTNAME: node1
    volumes:
      - weaviate_data:/var/lib/weaviate
    deploy:
      resources:
        limits:
          memory: 4G
  sandbox:
    image: langgenius/dify-sandbox:0.2.10
    restart: always
    environment:
      API_KEY: ${SANDBOX_API_KEY}
      GIN_MODE: release
      WORKER_TIMEOUT: 15
    deploy:
      resources:
        limits:
          memory: 512M
          cpus: '1'
volumes:
  postgres_data:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: /data/dify/postgres
  redis_data:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: /data/dify/redis
  weaviate_data:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: /data/dify/weaviate
  dify_storage:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: /data/dify/storage
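Named volumes declared with type: none and o: bind point at host directories that Docker does not create for you, so the first docker compose up fails if the paths are missing. A minimal sketch for preparing them is shown below; prepare_data_dirs is a hypothetical helper, and the directory names simply mirror the device paths in the volumes section above.

```shell
#!/usr/bin/env bash
# Create the host directories that the bind-mounted named volumes point at.
# prepare_data_dirs BASE_DIR
prepare_data_dirs() {
  local base="$1" name
  for name in postgres redis weaviate storage; do
    mkdir -p "$base/$name"
  done
}

# Usage (needs write access to /data, typically as root):
#   prepare_data_dirs /data/dify
```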
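Data backup strategy
#!/bin/bash
# /opt/dify/backup.sh: daily backup script
set -euo pipefail
BACKUP_DIR="/backup/dify"
DATE=$(date +%Y%m%d_%H%M%S)
RETENTION_DAYS=30
# Defaults match the .env values used in this chapter; export the variables to override
DB_USERNAME="${DB_USERNAME:-dify}"
DB_DATABASE="${DB_DATABASE:-dify}"
mkdir -p "$BACKUP_DIR"
# 1. Back up PostgreSQL
# (the container name depends on the Compose project name; check it with docker compose ps)
docker exec dify-db-1 pg_dump \
  -U "$DB_USERNAME" \
  -d "$DB_DATABASE" \
  --format=custom \
  --no-acl \
  > "$BACKUP_DIR/postgres_${DATE}.dump"
# 2. Back up stored files (knowledge base documents, etc.)
tar -czf "$BACKUP_DIR/storage_${DATE}.tar.gz" \
  -C /data/dify storage/
# 3. Back up Weaviate data
# (files copied from a running Weaviate may be inconsistent; pause writes first for a clean snapshot)
tar -czf "$BACKUP_DIR/weaviate_${DATE}.tar.gz" \
  -C /data/dify weaviate/
# 4. Upload to remote storage (optional: OSS/S3)
# aws s3 sync "$BACKUP_DIR/" "s3://your-backup-bucket/dify/"
# 5. Remove expired backups
find "$BACKUP_DIR" -name "*.dump" -mtime +"$RETENTION_DAYS" -delete
find "$BACKUP_DIR" -name "*.tar.gz" -mtime +"$RETENTION_DAYS" -delete
echo "[$(date)] Backup completed: $DATE"
# Add a crontab entry
crontab -e
# Run the backup every day at 02:00
0 2 * * * /opt/dify/backup.sh >> /var/log/dify-backup.log 2>&1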
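A cron job that fails silently is a common way to discover, too late, that no usable backup exists. The sketch below checks that the newest dump is present, non-empty, and recent. Both function names are hypothetical, and the check relies on the timestamped file names produced by the backup script above, which sort chronologically.

```shell
#!/usr/bin/env bash
# Print the newest file in DIR whose name matches the glob PATTERN (empty if none).
# Works because names like postgres_YYYYmmdd_HHMMSS.dump sort chronologically.
latest_backup() {
  local dir="$1" pattern="$2"
  find "$dir" -maxdepth 1 -type f -name "$pattern" | sort | tail -n 1
}

# Succeed if FILE exists, is non-empty, and was modified within MAX_AGE_MIN minutes.
backup_is_fresh() {
  local file="$1" max_age_min="$2"
  [ -s "$file" ] && [ -n "$(find "$file" -mmin "-$max_age_min")" ]
}

# Usage (1500 minutes is roughly 25 hours, one daily run plus slack):
#   f="$(latest_backup /backup/dify 'postgres_*.dump')"
#   backup_is_fresh "$f" 1500 || echo "ALERT: no fresh PostgreSQL backup"
```

A freshness check only proves that a file was written; restoring a dump into a scratch database from time to time remains the real test.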
Near-zero-downtime upgrade procedure
#!/bin/bash
# /opt/dify/upgrade.sh
# Note: with one container per service, each recreate causes a brief interruption;
# true zero downtime needs multiple replicas behind a load balancer (see Level 3).
set -euo pipefail
cd "$(dirname "$0")"
NEW_VERSION="${1:-}"
if [ -z "$NEW_VERSION" ]; then
  echo "Usage: $0 <version>"
  exit 1
fi
compose() { docker compose -f docker-compose.prod.yaml "$@"; }
echo "=== Upgrading Dify to $NEW_VERSION ==="
# Step 1: pull the new images
docker pull "langgenius/dify-api:$NEW_VERSION"
docker pull "langgenius/dify-web:$NEW_VERSION"
# Step 2: back up the data
./backup.sh
# Step 3: update the image tags
sed -i "s/dify-api:[0-9.]*/dify-api:$NEW_VERSION/g" docker-compose.prod.yaml
sed -i "s/dify-web:[0-9.]*/dify-web:$NEW_VERSION/g" docker-compose.prod.yaml
# Step 4: recreate api on the new image
compose up -d --no-deps api
sleep 30
# Step 5: run the database migrations from the NEW api container
# (the old container only ships the old migration scripts, so running it there does nothing)
compose exec -T api flask db upgrade
# Step 6: recreate worker once the schema is up to date
compose up -d --no-deps worker
sleep 30
# Step 7: recreate web
compose up -d --no-deps web
echo "=== Upgrade complete ==="
compose ps
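Common pitfalls and fixes
Pitfall 1: Weaviate runs out of memory
Symptoms: slow vector search and frequent container restarts.
Cause: Weaviate's default configuration has no memory cap, so a large document set can exhaust the host's memory.
Fix:
# Add to docker-compose.yaml
weaviate:
  environment:
    # Let Weaviate detect and limit its own memory and thread usage
    LIMIT_RESOURCES: 'true'
    # Weaviate is written in Go, not Java, so JAVA_OPTS has no effect;
    # cap the Go heap with GOMEMLIMIT instead (keep it below the container limit)
    GOMEMLIMIT: 3GiB
  deploy:
    resources:
      limits:
        memory: 4G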
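Pitfall 2: PostgreSQL runs out of connections
Symptom: FATAL: remaining connection slots are reserved for non-replication superuser connections
Cause: the Dify API's multiple processes and the Celery workers all connect at the same time, and PostgreSQL's default of 100 connections is not enough.
Fix:
# Option 1: raise PostgreSQL's maximum connections
postgres -c max_connections=500
# Option 2 (recommended): add a PgBouncer connection pool
# See Level 3 for the detailed configuration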
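To judge whether raising max_connections is enough, it helps to estimate the worst case. The sketch below assumes every api or Celery process keeps its own SQLAlchemy pool that can grow to pool_size + max_overflow connections; the process counts and pool numbers are hypothetical inputs, not values read from a Dify installation.

```shell
#!/usr/bin/env bash
# Worst-case PostgreSQL connections = processes * (pool_size + max_overflow).
# estimate_pg_connections PROCESSES POOL_SIZE MAX_OVERFLOW
estimate_pg_connections() {
  local procs="$1" pool="$2" overflow="$3"
  echo $(( procs * (pool + overflow) ))
}

# Hypothetical example: 4 api processes + 2 Celery worker processes,
# each with a pool of 30 connections and 10 overflow connections.
needed="$(estimate_pg_connections 6 30 10)"
echo "worst case: $needed connections"   # prints "worst case: 240 connections"
```

With these inputs the worst case (240) already exceeds a max_connections of 200, which is why a pooler in front of PostgreSQL tends to scale better than repeatedly raising the limit.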
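Pitfall 3: SSE streaming responses are buffered by Nginx
Symptom: streaming chat shows no streaming effect on the client; the text only appears once the whole response has finished.
Fix:
location /api/ {
    proxy_buffering off;
    proxy_cache off;
    proxy_set_header Connection '';
    proxy_http_version 1.1;
    chunked_transfer_encoding on;
}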
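Level 3: Source code and internals (5+ years of experience)
Kubernetes high-availability deployment
For enterprises above 500 people, the single point of failure in a single-host Docker Compose setup is unacceptable, and a migration to Kubernetes is needed.
Cluster plan (example for a 1,000-person enterprise):
Control plane nodes (3 machines, highly available etcd)
- master-01: 8C/16G
- master-02: 8C/16G
- master-03: 8C/16G
Worker nodes (grouped by workload)
- app-nodes (3 machines): 16C/32G, run api / web / worker
- db-nodes (2 machines): 32C/128G NVMe, run PostgreSQL / Redis
- vector-nodes (2 machines): 32C/64G, run Weaviate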
Namespace plan:
# namespaces.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: dify-prod
  labels:
    environment: production
---
apiVersion: v1
kind: Namespace
metadata:
  name: dify-infra
  labels:
    environment: production
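Dify API Deployment:
# api-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dify-api
  namespace: dify-prod
  labels:
    app: dify-api
    version: "0.10.0"
spec:
  replicas: 3 # three replicas for high availability
  selector:
    matchLabels:
      app: dify-api
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0 # zero-downtime updates
  template:
    metadata:
      labels:
        app: dify-api
    spec:
      affinity:
        podAntiAffinity:
          # "preferred" instead of "required": with only 3 app nodes, a hard rule would leave
          # the surge Pod of a rolling update, and any HPA scale-out beyond 3 replicas, unschedulable
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: dify-api
                topologyKey: kubernetes.io/hostname # spread across nodes where possible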
      containers:
        - name: api
          image: langgenius/dify-api:0.10.0
          env:
            - name: MODE
              value: api
            - name: SECRET_KEY
              valueFrom:
                secretKeyRef:
                  name: dify-secrets
                  key: secret-key
            - name: DB_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: dify-secrets
                  key: db-password
          envFrom:
            - configMapRef:
                name: dify-api-config
          ports:
            - containerPort: 5001
          resources:
            requests:
              memory: "512Mi"
              cpu: "500m"
            limits:
              memory: "2Gi"
              cpu: "2000m"
          readinessProbe:
            httpGet:
              path: /health
              port: 5001
            initialDelaySeconds: 30
            periodSeconds: 10
            failureThreshold: 3
          livenessProbe:
            httpGet:
              path: /health
              port: 5001
            initialDelaySeconds: 60
            periodSeconds: 30
            failureThreshold: 5
          volumeMounts:
            - name: storage
              mountPath: /app/api/storage
      volumes:
        - name: storage
          persistentVolumeClaim:
            # The PVC must support ReadWriteMany, because the replicas run on different nodes
            claimName: dify-storage-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: dify-api-svc
  namespace: dify-prod
spec:
  selector:
    app: dify-api
  ports:
    - port: 5001
      targetPort: 5001
  type: ClusterIP
HPA autoscaling:
# api-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: dify-api-hpa
  namespace: dify-prod
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: dify-api
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 2
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300 # scale down conservatively to avoid flapping
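The replica count the autoscaler aims for follows the formula documented for the Kubernetes HPA: desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric). The sketch below works through that arithmetic with the 70% CPU target and the 3 to 10 replica bounds from the manifest. It is a simplification: the real controller also applies a tolerance band, the stabilization windows, and the scale-up policy (at most 2 Pods per 60 seconds here), so actual steps are smaller. hpa_desired is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# desired = ceil(current_replicas * current_utilization / target_utilization),
# clamped to [min_replicas, max_replicas]. Integer math: ceil(a/b) = (a + b - 1) / b.
# hpa_desired CURRENT_REPLICAS CURRENT_UTIL TARGET_UTIL MIN MAX
hpa_desired() {
  local current="$1" util="$2" target="$3" min="$4" max="$5" desired
  desired=$(( (current * util + target - 1) / target ))
  if [ "$desired" -lt "$min" ]; then desired="$min"; fi
  if [ "$desired" -gt "$max" ]; then desired="$max"; fi
  echo "$desired"
}

hpa_desired 3 90 70 3 10    # ceil(270/70) = 4 Pods
hpa_desired 3 300 70 3 10   # ceil(900/70) = 13, clamped to maxReplicas = 10
hpa_desired 5 20 70 3 10    # ceil(100/70) = 2, clamped to minReplicas = 3
```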
Ingress configuration (with SSL and rate limiting):
# ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: dify-ingress
  namespace: dify-prod
  annotations:
    kubernetes.io/ingress.class: nginx
    cert-manager.io/cluster-issuer: letsencrypt-prod
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-buffering: "off"
    nginx.ingress.kubernetes.io/proxy-body-size: "100m"
    # Rate limit: 60 requests per minute per IP
    # (limit-rpm counts per minute; limit-rps would mean 60 requests per second)
    nginx.ingress.kubernetes.io/limit-rpm: "60"
    nginx.ingress.kubernetes.io/limit-connections: "20"
spec:
  tls:
    - hosts:
        - dify.yourcompany.com
      secretName: dify-tls
  rules:
    - host: dify.yourcompany.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: dify-web-svc
                port:
                  number: 3000
          - path: /api
            pathType: Prefix
            backend:
              service:
                name: dify-api-svc
                port:
                  number: 5001
          - path: /v1
            pathType: Prefix
            backend:
              service:
                name: dify-api-svc
                port:
                  number: 5001
          - path: /console/api
            pathType: Prefix
            backend:
              service:
                name: dify-api-svc
                port:
                  number: 5001
PostgreSQL high availability: Patroni + HAProxy
A single-node PostgreSQL is the biggest single point of failure. For production, Patroni is the recommended way to get automatic failover:
# patroni.yaml (configuration file stored in a ConfigMap)
# Note: Patroni does not expand ${...} placeholders itself; render this file first (for example with envsubst)
scope: dify-postgres
namespace: /dify/
name: pg-node-1
restapi:
  listen: 0.0.0.0:8008
  connect_address: ${POD_IP}:8008
etcd3:
  hosts: etcd-0.etcd:2379,etcd-1.etcd:2379,etcd-2.etcd:2379
bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    # Patroni expects loop_wait + 2 * retry_timeout <= ttl, so 10 (not 30) fits a ttl of 30
    retry_timeout: 10
    maximum_lag_on_failover: 1048576 # 1MB
    postgresql:
      use_pg_rewind: true
      use_slots: true
      parameters:
        max_connections: 500
        shared_buffers: 4GB
        effective_cache_size: 12GB
        work_mem: 16MB
        maintenance_work_mem: 512MB
        wal_level: replica
        max_wal_senders: 5
        hot_standby: "on" # quoted so YAML does not turn it into a boolean
  initdb:
    - encoding: UTF8
    - data-checksums
  pg_hba:
    - host replication replicator 0.0.0.0/0 md5
    - host all all 0.0.0.0/0 md5
postgresql:
  listen: 0.0.0.0:5432
  connect_address: ${POD_IP}:5432
  data_dir: /var/lib/postgresql/data/pgdata
  authentication:
    replication:
      username: replicator
      password: ${REPLICATION_PASSWORD}
    superuser:
      username: postgres
      password: ${POSTGRES_PASSWORD}
PgBouncer connection pool
# pgbouncer.ini
# Note: PgBouncer only recognizes ; or # as a comment at the start of a line,
# so every comment sits on its own line instead of after a value.
[databases]
dify = host=pg-primary port=5432 dbname=dify
[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
# Pooling mode (transaction mode suits Dify best)
pool_mode = transaction
# Accept at most 1000 client connections
max_client_conn = 1000
# Keep up to 50 server connections per user/database pair
default_pool_size = 50
min_pool_size = 10
reserve_pool_size = 10
reserve_pool_timeout = 5
# Timeouts
server_idle_timeout = 600
client_idle_timeout = 0
query_timeout = 0
connect_timeout = 15
# Statistics and logging
stats_period = 60
log_connections = 0
log_disconnections = 0
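The auth_file referenced above needs one line per user. With auth_type = md5, PgBouncer accepts entries whose secret is the string md5 followed by the MD5 digest of the password concatenated with the user name. The sketch below builds such a line; pgbouncer_userlist_entry is a hypothetical helper, it assumes GNU md5sum is installed, and it only helps if the PostgreSQL server also stores MD5 password hashes (servers that use SCRAM verifiers need a different entry format).

```shell
#!/usr/bin/env bash
# Print a userlist.txt line for auth_type = md5.
# The stored secret is "md5" followed by md5(password + username).
# pgbouncer_userlist_entry USERNAME PASSWORD
pgbouncer_userlist_entry() {
  local user="$1" pass="$2" hash
  hash="$(printf '%s%s' "$pass" "$user" | md5sum | cut -d' ' -f1)"
  printf '"%s" "md5%s"\n' "$user" "$hash"
}

# Usage (the password shown is a placeholder):
#   pgbouncer_userlist_entry dify 'replace-with-db-password' >> /etc/pgbouncer/userlist.txt
```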
Weaviate clustering
# weaviate-statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: weaviate
  namespace: dify-infra
spec:
  serviceName: weaviate
  replicas: 3
  selector:
    matchLabels:
      app: weaviate
  template:
    metadata:
      labels:
        app: weaviate
    spec:
      containers:
        - name: weaviate
          image: semitechnologies/weaviate:1.24.1
          env:
            - name: CLUSTER_HOSTNAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: CLUSTER_GOSSIP_BIND_PORT
              value: "7000"
            - name: CLUSTER_DATA_BIND_PORT
              value: "7001"
            - name: CLUSTER_JOIN
              value: "weaviate-0.weaviate,weaviate-1.weaviate,weaviate-2.weaviate"
            - name: PERSISTENCE_DATA_PATH
              value: /var/lib/weaviate
            - name: AUTHENTICATION_APIKEY_ENABLED
              value: "true"
            - name: AUTHENTICATION_APIKEY_ALLOWED_KEYS
              valueFrom:
                secretKeyRef:
                  name: weaviate-secrets
                  key: api-key
          resources:
            requests:
              memory: "4Gi"
              cpu: "2"
            limits:
              memory: "8Gi"
              cpu: "4"
          volumeMounts:
            - name: weaviate-data
              mountPath: /var/lib/weaviate
  volumeClaimTemplates:
    - metadata:
        name: weaviate-data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: fast-ssd
        resources:
          requests:
            storage: 200Gi
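Level 4: Production pitfalls and decisions (expert view)
Architecture decision tree
In real projects, many teams get the "Docker Compose or K8s" question wrong. The following decision framework is based on real cases:
Case 1: a fintech company (300 people)
They started with K8s for the sake of "high availability". The outcome:
- Operational complexity exceeded the team's skills; a single deployment took 3 hours
- After etcd data corruption nobody could repair it, causing 6 hours of downtime
- Three months later they rolled back to Docker Compose with manual primary/standby switchover
The right decision: for teams below 300 people without a dedicated K8s operations engineer, Docker Compose plus regular backups plus a cold standby host is the more pragmatic choice.
Case 2: a manufacturing company (800 people)
They chose K8s but underestimated the complexity of stateful services (PostgreSQL, Weaviate) and ran the databases directly on K8s. The outcome:
- PVC storage performance fell short, with query p99 > 2s
- Moving the database during K8s node maintenance was high-risk
The right decision: let K8s run only the stateless api/web/worker, and run PostgreSQL and Weaviate on dedicated physical machines or managed cloud services (RDS / managed Weaviate).
High-availability reference architecture (real production)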
  ┌───────────────────────────────────┐
  │          DNS + CDN layer          │
  │ (Cloudflare / Alibaba Cloud CDN)  │
  └─────────────────┬─────────────────┘
                    │
  ┌─────────────────▼─────────────────┐
  │Load balancer layer (active-active)│
  │    LB-01 ←→ LB-02 (Keepalived)    │
  │          VRRP virtual IP          │
  └─────────────────┬─────────────────┘
      ┌─────────────┼─────────────┐
      ▼             ▼             ▼
┌──────────┐  ┌──────────┐  ┌──────────┐
│  app-01  │  │  app-02  │  │  app-03  │
│ api+web  │  │ api+web  │  │ api+web  │
└──────────┘  └──────────┘  └──────────┘
      │             │             │
  ┌───┴─────────────┴─────────────┴───┐
  │      Redis Sentinel cluster       │
  │ master + 2 replicas + 3 sentinels │
  └─────────────────┬─────────────────┘
                    │
  ┌─────────────────▼─────────────────┐
  │    PostgreSQL Patroni cluster     │
  │  primary + 2 standbys (sync rep)  │
  │         HAProxy front end         │
  └───────────────────────────────────┘
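Storage selection pitfalls
Pitfall: local storage does not work in K8s
By default Dify stores uploaded documents on the local file system, but with multiple replicas in K8s, different Pods see different files.
Options compared:
| Option | Pros | Cons | Best for |
|---|---|---|---|
| NFS | Simple, no extra cost | Poor performance, single point of failure | Small-scale testing |
| Ceph/GlusterFS | Highly available, high performance | Complex to operate | Self-managed K8s |
| Alibaba Cloud OSS/AWS S3 | Zero operations, highly available | Network latency, usage fees | Production in the cloud |
| MinIO | S3 compatible, self-hosted | Needs maintenance | Data must stay in-country |
Recommended configuration (MinIO + Dify):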
# Switch the storage backend in .env
STORAGE_TYPE=s3
S3_ENDPOINT=http://minio:9000
S3_BUCKET_NAME=dify-storage
S3_ACCESS_KEY=minio-access-key
S3_SECRET_KEY=minio-secret-key
# Any region value works in MinIO's S3-compatible mode
S3_REGION=us-east-1
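Danger zones in version upgrades
Every major Dify upgrade (for example 0.9.x → 0.10.x) may introduce breaking changes:
- Database schema changes: Flask-Migrate migration scripts can contain bugs, so always back up before upgrading and run the migration in a test environment first
- Vector database schema changes: when the Weaviate collection schema is not backward compatible, the index must be rebuilt, which can take hours
- Renamed environment variables: a new version may deprecate some env vars; an unset variable falls back to its default, so features can fail silently
Safe upgrade SOP:
1. Run the new version for ≥ 24h in a test environment with an identical configuration
2. Back up production (database + vector store + stored files)
3. Open a maintenance window and put up a maintenance page
4. Run the upgrade script
5. Smoke test (cover the core flows within 10 minutes)
6. Close the maintenance window and take down the maintenance page
7. Keep the old images for 7 days for rollback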
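One way to make the SOP harder to skip is to have the upgrade script refuse a risky version jump unless the operator confirms that a maintenance window is open. The sketch below detects such a jump by comparing the major.minor part of two version strings; is_minor_or_major_bump is a hypothetical helper and it assumes plain MAJOR.MINOR.PATCH tags such as 0.10.0.

```shell
#!/usr/bin/env bash
# Succeed (exit 0) when OLD and NEW differ in their major or minor component,
# e.g. 0.9.2 -> 0.10.0; fail (exit 1) for patch-only changes such as 0.10.0 -> 0.10.1.
# is_minor_or_major_bump OLD NEW
is_minor_or_major_bump() {
  local old="$1" new="$2"
  local old_mm="${old%.*}" new_mm="${new%.*}"   # strip the trailing ".PATCH"
  [ "$old_mm" != "$new_mm" ]
}

# Usage inside an upgrade script (CURRENT_VERSION, NEW_VERSION, and
# MAINTENANCE_WINDOW are hypothetical variables):
#   if is_minor_or_major_bump "$CURRENT_VERSION" "$NEW_VERSION" \
#      && [ "${MAINTENANCE_WINDOW:-no}" != "yes" ]; then
#     echo "Minor/major upgrade: follow the full SOP first" >&2
#     exit 1
#   fi
```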
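Security hardening checklist
# 1. Change the default ports (so port scans do not hit the services directly)
# 2. Restrict exposure of the Docker API
# Do not bind it to 0.0.0.0
dockerd --host=unix:///var/run/docker.sock
# 3. Isolate networks: only nginx exposes 80/443
# docker-compose.yaml
networks:
  frontend:
    # only nginx joins this network
  backend:
    internal: true # internal only, unreachable from outside
    # note: services that call external model APIs (api, worker) still need a network with outbound access
# 4. Scan images for vulnerabilities regularly
docker scout cves langgenius/dify-api:0.10.0
# 5. Restrict container privileges
security_opt:
  - no-new-privileges:true
cap_drop:
  - ALL
# read_only requires volumes or tmpfs mounts for every path the service writes to
read_only: true
# 6. Audit logging
# Enable in .env (confirm the variable exists in the .env.example of your Dify version)
ENABLE_AUDIT_LOG=true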
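Chapter summary
Key takeaways:
- Sizing rule: use Docker Compose below 50 people, consider a two-host hot standby for 50-500 people, and move to K8s above 500 people (but keep the databases separate).
- Mandatory changes: in production, SECRET_KEY, the database password, and the Redis password must be strong values; never use the defaults.
- The core of high availability: real high availability needs PostgreSQL primary/standby (Patroni) + Redis Sentinel + multiple API replicas.
- Storage must be shared: with multiple replicas in K8s, use shared storage (S3/MinIO/OSS); local storage leads to inconsistent data.
- Upgrade with care: before any Dify upgrade, back up the database and validate in a test environment before touching production.
- Security is not optional: network isolation, least privilege, and regular vulnerability scans are baseline requirements for production.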
Command cheat sheet:
# Show the status of all services
docker compose ps
# Show a service's memory usage
docker stats dify-api-1
# Open a shell in a container for debugging
docker compose exec api bash
# Count database connections
docker compose exec db psql -U dify -c "SELECT count(*) FROM pg_stat_activity;"
# Force-recreate a service
docker compose up -d --force-recreate --no-deps api
# K8s: watch Pod status
kubectl get pods -n dify-prod -w
# K8s: rolling restart
kubectl rollout restart deployment/dify-api -n dify-prod