Chapter 19

Private Deployment: Docker Compose / K8s / High-Availability Architecture

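Move Dify from the cloud SaaS into your own data center, keeping data sovereignty while staying stable at a scale of ten thousand users. This chapter gives a complete plan that can be put into practice directly.

Chapter overview

After using Dify Cloud for a while, most companies run into the same obstacles: data may not leave the country, compliance audits require the raw logs, and rate limits kick in as soon as concurrency rises. Private deployment is more than typing docker compose up once. It involves a whole set of engineering decisions: network planning, storage selection, service orchestration, canary upgrades, and failure recovery.

This chapter works through three levels in turn:

  1. Single-host Docker Compose: suited to PoCs and small teams (< 50 people)
  2. Multi-node Docker Swarm / standalone deployment: suited to mid-sized teams (50-500 people)
  3. Kubernetes high-availability cluster: suited to large enterprises (> 500 people, SLA ≥ 99.9%)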

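After reading this chapter, you will be able to deploy Dify at each of these three levels and to back up, upgrade, and harden the deployment.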


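Level 1: Fundamentals (1-3 years of experience)

The Dify component landscape

Before deploying anything, you need to know which services Dify consists of. The official Dify Docker Compose file includes the following core services:

Service  | Role                                         | Port
---------|----------------------------------------------|-------
api      | Backend API service (Flask)                  | 5001
worker   | Celery async tasks (document indexing, etc.) | none
web      | Next.js frontend application                 | 3000
db       | PostgreSQL database                          | 5432
redis    | Cache + message queue                        | 6379
weaviate | Vector database (default)                    | 8080
sandbox  | Code execution sandbox                       | 8194
nginx    | Reverse proxy entry point                    | 80/443

An analogy: think of Dify as a restaurant. web is the front-desk host, api is the head chef, worker is the dishwasher (handling the slow jobs), db is the ingredient storeroom, redis is the pass where dishes are handed over (fast cache), weaviate is the recipe index, and nginx is the doorman.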

Single-host Docker Compose deployment

System requirements (minimum production configuration):

Step 1: Install Docker

# Ubuntu 22.04
curl -fsSL https://get.docker.com | sh
sudo usermod -aG docker $USER
newgrp docker

# Verify the installation
docker --version   # Docker version 24.x
docker compose version  # Docker Compose version v2.x

Step 2: Clone and configure

git clone https://github.com/langgenius/dify.git
cd dify/docker

# Copy the environment variable template
cp .env.example .env

Step 3: Key environment variables

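# Core .env settings (must be changed for production)

# Secret key: use a random value, never the default.
# A .env file is not run by a shell, so $(...) is NOT expanded here.
# Run `openssl rand -hex 32` and paste the output as a literal value.
SECRET_KEY=replace-with-output-of-openssl-rand-hex-32

# Database
DB_USERNAME=dify
DB_PASSWORD=replace-with-a-strong-password
DB_HOST=db
DB_PORT=5432
DB_DATABASE=dify

# Redis
REDIS_HOST=redis
REDIS_PORT=6379
REDIS_PASSWORD=replace-with-a-strong-redis-password

# Vector database (weaviate/qdrant/milvus/pgvector)
VECTOR_STORE=weaviate

# Storage backend (local/s3/azure-blob/google-storage)
STORAGE_TYPE=local
STORAGE_LOCAL_PATH=/app/api/storage

# Public URLs (must be set in production)
CONSOLE_WEB_URL=https://dify.yourcompany.com
APP_WEB_URL=https://dify.yourcompany.com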
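
Because a .env file is read literally, the secrets have to be generated beforehand and written in as plain values. The sketch below shows one way to do that. The helper names gen_secret and set_env_var are illustrative and not part of Dify, and the usage lines assume an existing .env in the current directory.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for filling secrets into .env (not part of Dify).

# Print N random bytes as 2*N hex characters.
gen_secret() {
  if command -v openssl >/dev/null 2>&1; then
    openssl rand -hex "$1"
  else
    head -c "$1" /dev/urandom | od -An -tx1 | tr -d ' \n'
  fi
}

# Set KEY=VALUE in FILE: drop any existing line for KEY, then append the new one.
set_env_var() {
  local file="$1" key="$2" value="$3" tmp
  tmp=$(mktemp)
  grep -v "^${key}=" "$file" > "$tmp" || true
  printf '%s=%s\n' "$key" "$value" >> "$tmp"
  mv "$tmp" "$file"
}

# Usage, run once in dify/docker after `cp .env.example .env`:
#   set_env_var .env SECRET_KEY     "$(gen_secret 32)"
#   set_env_var .env DB_PASSWORD    "$(gen_secret 16)"
#   set_env_var .env REDIS_PASSWORD "$(gen_secret 16)"
```

Hex-only values also avoid quoting problems when a password is later embedded in a connection URL such as redis://:password@redis:6379/1.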

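Step 4: Start the services

# Start all services in the background
docker compose up -d

# Check service status
docker compose ps

# Follow the logs
docker compose logs -f api

Step 5: Create the administrator account

After the first start, open http://your-server-ip and complete the setup wizard to set the administrator's email and password.

Nginx reverse proxy configuration

A production deployment must put an Nginx layer in front of Dify to handle TLS termination, static file caching, and access control. The configuration below assumes that the web and api containers are reachable on 127.0.0.1:3000 and 127.0.0.1:5001.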

# /etc/nginx/sites-available/dify.conf

upstream dify_web {
    server 127.0.0.1:3000;
}

upstream dify_api {
    server 127.0.0.1:5001;
}

server {
    listen 80;
    server_name dify.yourcompany.com;
    return 301 https://$host$request_uri;
}

server {
    listen 443 ssl http2;
    server_name dify.yourcompany.com;

    ssl_certificate     /etc/ssl/certs/dify.crt;
    ssl_certificate_key /etc/ssl/private/dify.key;
    ssl_protocols       TLSv1.2 TLSv1.3;
    ssl_ciphers         HIGH:!aNULL:!MD5;

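    # Frontend routes
    location / {
        proxy_pass http://dify_web;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }

    # API routes: the console API (/console/api), the app API (/api),
    # the service API (/v1) and file access (/files) all go to the api container
    location ~ ^/(console/api|api|v1|files)/ {
        proxy_pass http://dify_api;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_read_timeout 300s;  # LLM responses can be slow
        proxy_send_timeout 300s;

        # Support SSE streaming responses
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_buffering off;
        proxy_cache off;
    }

    # Upload size limit
    client_max_body_size 100m;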
}

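Level 2: Mechanisms in depth (3-5 years of experience)

Production-grade Docker Compose configuration

The official default docker-compose.yaml is fine for a quick trial, but production needs substantial adjustments. Below is an optimized production-grade configuration: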

# docker-compose.prod.yaml
version: '3.8'

services:
  api:
    image: langgenius/dify-api:0.10.0
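    restart: always
    # Publish on loopback only, so that the host Nginx can reach the API
    ports:
      - "127.0.0.1:5001:5001"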
    environment:
      MODE: api
      LOG_LEVEL: INFO
      SECRET_KEY: ${SECRET_KEY}
      DB_USERNAME: ${DB_USERNAME}
      DB_PASSWORD: ${DB_PASSWORD}
      DB_HOST: db
      DB_PORT: 5432
      DB_DATABASE: ${DB_DATABASE}
      REDIS_HOST: redis
      REDIS_PORT: 6379
      REDIS_PASSWORD: ${REDIS_PASSWORD}
      CELERY_BROKER_URL: redis://:${REDIS_PASSWORD}@redis:6379/1
      VECTOR_STORE: ${VECTOR_STORE:-weaviate}
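      WEAVIATE_ENDPOINT: http://weaviate:8080
      # Must match AUTHENTICATION_APIKEY_ALLOWED_KEYS on the weaviate service
      WEAVIATE_API_KEY: ${WEAVIATE_API_KEY}
      STORAGE_TYPE: ${STORAGE_TYPE:-local}
      STORAGE_LOCAL_PATH: /app/api/storage
      # Production security settings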
      WEB_API_CORS_ALLOW_ORIGINS: ${CONSOLE_WEB_URL}
      CONSOLE_CORS_ALLOW_ORIGINS: ${CONSOLE_WEB_URL}
    volumes:
      - dify_storage:/app/api/storage
    depends_on:
      db:
        condition: service_healthy
      redis:
        condition: service_healthy
    deploy:
      resources:
        limits:
          memory: 2G
          cpus: '2'
        reservations:
          memory: 512M
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:5001/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 60s
    logging:
      driver: "json-file"
      options:
        max-size: "100m"
        max-file: "3"

  worker:
    image: langgenius/dify-api:0.10.0
    restart: always
    environment:
      MODE: worker
      LOG_LEVEL: INFO
      SECRET_KEY: ${SECRET_KEY}
      DB_USERNAME: ${DB_USERNAME}
      DB_PASSWORD: ${DB_PASSWORD}
      DB_HOST: db
      DB_PORT: 5432
      DB_DATABASE: ${DB_DATABASE}
      REDIS_HOST: redis
      REDIS_PORT: 6379
      REDIS_PASSWORD: ${REDIS_PASSWORD}
      CELERY_BROKER_URL: redis://:${REDIS_PASSWORD}@redis:6379/1
      VECTOR_STORE: ${VECTOR_STORE:-weaviate}
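      WEAVIATE_ENDPOINT: http://weaviate:8080
      # Must match AUTHENTICATION_APIKEY_ALLOWED_KEYS on the weaviate service
      WEAVIATE_API_KEY: ${WEAVIATE_API_KEY}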
      STORAGE_TYPE: ${STORAGE_TYPE:-local}
      STORAGE_LOCAL_PATH: /app/api/storage
    volumes:
      - dify_storage:/app/api/storage
    depends_on:
      db:
        condition: service_healthy
      redis:
        condition: service_healthy
    deploy:
      resources:
        limits:
          memory: 4G
          cpus: '4'
    logging:
      driver: "json-file"
      options:
        max-size: "100m"
        max-file: "3"

  web:
    image: langgenius/dify-web:0.10.0
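    restart: always
    # Publish on loopback only, so that the host Nginx can reach the frontend
    ports:
      - "127.0.0.1:3000:3000"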
    environment:
      CONSOLE_API_URL: ${CONSOLE_WEB_URL}
      APP_API_URL: ${APP_WEB_URL}
    deploy:
      resources:
        limits:
          memory: 512M
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3000/"]
      interval: 30s
      timeout: 10s
      retries: 3

  db:
    image: postgres:15-alpine
    restart: always
    environment:
      POSTGRES_USER: ${DB_USERNAME}
      POSTGRES_PASSWORD: ${DB_PASSWORD}
      POSTGRES_DB: ${DB_DATABASE}
      # Performance tuning
      POSTGRES_INITDB_ARGS: "--encoding=UTF8"
    command: >
      postgres
      -c shared_buffers=256MB
      -c max_connections=200
      -c work_mem=4MB
      -c maintenance_work_mem=64MB
      -c effective_cache_size=512MB
      -c wal_level=replica
    volumes:
      - postgres_data:/var/lib/postgresql/data
      - ./postgres/init:/docker-entrypoint-initdb.d
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U ${DB_USERNAME} -d ${DB_DATABASE}"]
      interval: 10s
      timeout: 5s
      retries: 5
    deploy:
      resources:
        limits:
          memory: 2G

  redis:
    image: redis:7-alpine
    restart: always
    command: >
      redis-server
      --requirepass ${REDIS_PASSWORD}
      --maxmemory 512mb
      --maxmemory-policy allkeys-lru
      --save 60 1000
      --appendonly yes
    volumes:
      - redis_data:/data
    healthcheck:
      test: ["CMD", "redis-cli", "-a", "${REDIS_PASSWORD}", "ping"]
      interval: 10s
      timeout: 5s
      retries: 5

  weaviate:
    image: semitechnologies/weaviate:1.24.1
    restart: always
    environment:
      QUERY_DEFAULTS_LIMIT: 25
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'false'
      AUTHENTICATION_APIKEY_ENABLED: 'true'
      AUTHENTICATION_APIKEY_ALLOWED_KEYS: ${WEAVIATE_API_KEY}
      AUTHENTICATION_APIKEY_USERS: dify
      PERSISTENCE_DATA_PATH: /var/lib/weaviate
      DEFAULT_VECTORIZER_MODULE: none
      ENABLE_MODULES: ''
      CLUSTER_HOSTNAME: node1
    volumes:
      - weaviate_data:/var/lib/weaviate
    deploy:
      resources:
        limits:
          memory: 4G

  sandbox:
    image: langgenius/dify-sandbox:0.2.10
    restart: always
    environment:
      API_KEY: ${SANDBOX_API_KEY}
      GIN_MODE: release
      WORKER_TIMEOUT: 15
    deploy:
      resources:
        limits:
          memory: 512M
          cpus: '1'

volumes:
  postgres_data:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: /data/dify/postgres
  redis_data:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: /data/dify/redis
  weaviate_data:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: /data/dify/weaviate
  dify_storage:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: /data/dify/storage
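
The named volumes above bind to host paths under /data/dify. With the local driver's bind options, Docker does not create a missing device path, so the directories have to exist before the first start. A minimal sketch, in which the helper name prepare_data_dirs is illustrative:

```shell
#!/usr/bin/env bash
# Create the host directories that the bind-backed volumes point at.
prepare_data_dirs() {
  local base="$1" name
  for name in postgres redis weaviate storage; do
    mkdir -p "$base/$name"
  done
}

# Usage (as root, before the first `docker compose up`):
#   prepare_data_dirs /data/dify
#   docker compose -f docker-compose.prod.yaml config --quiet   # validate the file
```

The function is idempotent, so it is safe to call from a provisioning script on every run.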

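Data backup strategy

#!/bin/bash
# /opt/dify/backup.sh: daily backup script
set -euo pipefail

# Read the database credentials from the deployment's .env
# (the path is an assumption; adjust it to your layout)
ENV_FILE="/opt/dify/.env"
DB_USERNAME=$(grep '^DB_USERNAME=' "$ENV_FILE" | cut -d= -f2-)
DB_DATABASE=$(grep '^DB_DATABASE=' "$ENV_FILE" | cut -d= -f2-)

BACKUP_DIR="/backup/dify"
DATE=$(date +%Y%m%d_%H%M%S)
RETENTION_DAYS=30
# The container name depends on the Compose project name; check `docker compose ps`
DB_CONTAINER="dify-db-1"

mkdir -p "$BACKUP_DIR"

# 1. Back up PostgreSQL
docker exec "$DB_CONTAINER" pg_dump \
  -U "$DB_USERNAME" \
  -d "$DB_DATABASE" \
  --format=custom \
  --no-acl \
  > "$BACKUP_DIR/postgres_${DATE}.dump"

# 2. Back up stored files (knowledge-base documents, etc.)
tar -czf "$BACKUP_DIR/storage_${DATE}.tar.gz" \
  -C /data/dify storage/

# 3. Back up Weaviate data (a file-level copy of a running instance may be
#    inconsistent; stop weaviate first if you need a guaranteed-clean copy)
tar -czf "$BACKUP_DIR/weaviate_${DATE}.tar.gz" \
  -C /data/dify weaviate/

# 4. Upload to remote storage (optional: OSS/S3)
# aws s3 sync "$BACKUP_DIR/" "s3://your-backup-bucket/dify/"

# 5. Remove expired backups
find "$BACKUP_DIR" -name "*.dump" -mtime +"$RETENTION_DAYS" -delete
find "$BACKUP_DIR" -name "*.tar.gz" -mtime +"$RETENTION_DAYS" -delete

echo "[$(date)] Backup completed: $DATE"

Schedule the script with cron:

# Add a crontab entry
crontab -e
# Run the backup every day at 02:00
0 2 * * * /opt/dify/backup.sh >> /var/log/dify-backup.log 2>&1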
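
A cron job that fails silently leaves you without a usable backup, so it helps to check that a recent, non-empty dump actually exists. The sketch below assumes the file naming used by backup.sh (postgres_<timestamp>.dump). The function names are illustrative, and the check only looks at size and age, not at whether the dump can be restored.

```shell
#!/usr/bin/env bash
# Print the newest PostgreSQL dump in DIR, or nothing if there is none.
latest_dump() {
  ls -1t "$1"/postgres_*.dump 2>/dev/null | head -n 1
}

# Succeed only if the newest dump exists, is non-empty,
# and was modified less than MAX_AGE_MIN minutes ago.
check_backup() {
  local dir="$1" max_age_min="$2" file
  file=$(latest_dump "$dir") || true
  if [ -z "$file" ]; then
    echo "no dump found in $dir" >&2
    return 1
  fi
  if [ ! -s "$file" ]; then
    echo "dump is empty: $file" >&2
    return 1
  fi
  if [ -z "$(find "$file" -mmin "-$max_age_min")" ]; then
    echo "dump is older than $max_age_min minutes: $file" >&2
    return 1
  fi
  echo "$file"
}

# Usage: alert if there is no good dump from the last 26 hours (1560 minutes)
#   check_backup /backup/dify 1560 || echo "Dify backup check failed"
```

Running the check from a separate cron entry, rather than from backup.sh itself, means it still fires when the backup script never started.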

Rolling upgrade procedure

With a single container per service, each restart below still causes a brief interruption, so run the script during a quiet period.

#!/bin/bash
# /opt/dify/upgrade.sh
set -euo pipefail
cd "$(dirname "$0")"

NEW_VERSION=${1:-}
if [ -z "$NEW_VERSION" ]; then
  echo "Usage: $0 <version>"
  exit 1
fi
COMPOSE="docker compose -f docker-compose.prod.yaml"

echo "=== Upgrading Dify to $NEW_VERSION ==="

# Step 1: Pull the new images
docker pull "langgenius/dify-api:$NEW_VERSION"
docker pull "langgenius/dify-web:$NEW_VERSION"

# Step 2: Back up the database
./backup.sh

# Step 3: Update the image tags
sed -i "s/dify-api:[0-9.]*/dify-api:$NEW_VERSION/g" docker-compose.prod.yaml
sed -i "s/dify-web:[0-9.]*/dify-web:$NEW_VERSION/g" docker-compose.prod.yaml

# Step 4: Recreate api first, so that the migration in step 5 runs from
# the new image (the old container does not contain the new migration scripts)
$COMPOSE up -d --no-deps api
sleep 30

# Step 5: Run the database migration
$COMPOSE exec -T api flask db upgrade

# Step 6: Recreate worker
$COMPOSE up -d --no-deps worker
sleep 30

# Step 7: Recreate web
$COMPOSE up -d --no-deps web

echo "=== Upgrade complete ==="
$COMPOSE ps
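
The fixed sleep 30 pauses in the upgrade script are a guess at how long a container takes to come up. A polling loop that waits for an actual health signal is usually more reliable. The sketch below is one way to write it. The names wait_until and api_healthy are illustrative, and the probe assumes curl is available inside the api container, as the Compose healthcheck above already does.

```shell
#!/usr/bin/env bash
# Run a command repeatedly until it succeeds or TIMEOUT seconds have passed.
wait_until() {
  local timeout="$1" interval="$2" waited=0
  shift 2
  until "$@" >/dev/null 2>&1; do
    if [ "$waited" -ge "$timeout" ]; then
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
  return 0
}

# Probe the API through the same /health endpoint the Compose healthcheck uses.
api_healthy() {
  docker compose -f docker-compose.prod.yaml exec -T api \
    curl -fsS http://localhost:5001/health
}

# Usage inside upgrade.sh, in place of `sleep 30`:
#   wait_until 120 5 api_healthy || { echo "api did not become healthy" >&2; exit 1; }
```

Failing the script when the timeout expires also stops the upgrade before the remaining services are recreated on top of a broken API.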

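Common pitfalls and fixes

Pitfall 1: Weaviate runs out of memory

Symptoms: vector search is slow and the container restarts frequently.

Cause: the default Weaviate configuration has no memory cap, so with many documents it can exhaust the host's memory.

Fix:

# Add to docker-compose.yaml
weaviate:
  environment:
    LIMIT_RESOURCES: 'true'
    # Weaviate is written in Go (there is no JVM): set the Go runtime's
    # soft memory limit below the container limit
    GOMEMLIMIT: 3GiB
  deploy:
    resources:
      limits:
        memory: 4G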

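Pitfall 2: PostgreSQL runs out of connections

Symptom: FATAL: remaining connection slots are reserved for non-replication superuser connections

Cause: the Dify API's multiple processes and the Celery workers all hold connections at the same time, and PostgreSQL's default limit of 100 connections is not enough.

Fix:

# Option 1: raise PostgreSQL's connection limit
postgres -c max_connections=500

# Option 2 (recommended): put a PgBouncer connection pool in front
# See Level 3 for the detailed configuration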
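
A rough way to see why the limit is reached: each API or Celery process keeps its own SQLAlchemy connection pool, so the worst case is the number of processes multiplied by pool size plus overflow. The sketch below does that arithmetic. The process counts and pool sizes in the example are made-up numbers for illustration, not Dify defaults, and the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Worst-case PostgreSQL connections: every process can fill its own pool.
estimate_pg_connections() {
  local processes="$1" pool_size="$2" max_overflow="$3"
  echo $(( processes * (pool_size + max_overflow) ))
}

# Check whether the estimate fits under max_connections while keeping
# some slots free for superuser and maintenance sessions.
fits_max_connections() {
  local needed="$1" max_connections="$2" reserved="$3"
  [ "$needed" -le $(( max_connections - reserved )) ]
}

# Example with illustrative numbers: 4 API processes + 2 Celery processes,
# pool size 30, overflow 10, checked against max_connections=200
needed=$(estimate_pg_connections 6 30 10)
echo "worst case: $needed connections"
if fits_max_connections "$needed" 200 3; then
  echo "fits under max_connections=200"
else
  echo "exceeds max_connections=200: lower the pool size or add PgBouncer"
fi
```

Raising max_connections only moves the ceiling, since every extra connection costs PostgreSQL memory. That is why a pooler in front is usually the better long-term fix.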

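Pitfall 3: SSE streaming responses are buffered by Nginx

Symptom: streamed conversations do not stream on the client; nothing appears until the whole response is complete.

Fix: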

location /api/ {
    proxy_buffering off;
    proxy_cache off;
    proxy_set_header Connection '';
    proxy_http_version 1.1;
    chunked_transfer_encoding on;
}

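Level 3: Source code and internals (5+ years of experience)

Kubernetes high-availability deployment

For companies with more than 500 people, the single point of failure of a single-host Docker Compose setup is unacceptable, and a move to Kubernetes is needed.

Cluster plan (for a 1,000-person company):

Control-plane nodes (3, with highly available etcd)
  - master-01: 8C/16G
  - master-02: 8C/16G
  - master-03: 8C/16G

Worker nodes (grouped by workload)
  - app-nodes (3): 16C/32G, running api / web / worker
  - db-nodes (2): 32C/128G NVMe, running PostgreSQL / Redis
  - vector-nodes (2): 32C/64G, running Weaviate

Namespace planning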

# namespaces.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: dify-prod
  labels:
    environment: production
---
apiVersion: v1
kind: Namespace
metadata:
  name: dify-infra
  labels:
    environment: production

Dify API Deployment

# api-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dify-api
  namespace: dify-prod
  labels:
    app: dify-api
    version: "0.10.0"
spec:
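  replicas: 3  # three replicas for high availability
  selector:
    matchLabels:
      app: dify-api
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0  # zero-downtime updates
  template:
    metadata:
      labels:
        app: dify-api
    spec:
      affinity:
        podAntiAffinity:
          # "preferred" rather than "required": with only 3 app nodes, a hard
          # rule would leave the surge Pod of a rolling update unschedulable
          # and would cap the HPA at 3 replicas
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: dify-api
              topologyKey: kubernetes.io/hostname  # spread across nodes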
      containers:
      - name: api
        image: langgenius/dify-api:0.10.0
        env:
        - name: MODE
          value: api
        - name: SECRET_KEY
          valueFrom:
            secretKeyRef:
              name: dify-secrets
              key: secret-key
        - name: DB_PASSWORD
          valueFrom:
            secretKeyRef:
              name: dify-secrets
              key: db-password
        envFrom:
        - configMapRef:
            name: dify-api-config
        ports:
        - containerPort: 5001
        resources:
          requests:
            memory: "512Mi"
            cpu: "500m"
          limits:
            memory: "2Gi"
            cpu: "2000m"
        readinessProbe:
          httpGet:
            path: /health
            port: 5001
          initialDelaySeconds: 30
          periodSeconds: 10
          failureThreshold: 3
        livenessProbe:
          httpGet:
            path: /health
            port: 5001
          initialDelaySeconds: 60
          periodSeconds: 30
          failureThreshold: 5
        volumeMounts:
        - name: storage
          mountPath: /app/api/storage
      volumes:
      - name: storage
        persistentVolumeClaim:
          claimName: dify-storage-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: dify-api-svc
  namespace: dify-prod
spec:
  selector:
    app: dify-api
  ports:
  - port: 5001
    targetPort: 5001
  type: ClusterIP

HPA autoscaling

# api-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: dify-api-hpa
  namespace: dify-prod
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: dify-api
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Pods
        value: 2
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300  # scale down conservatively to avoid flapping
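
The HPA computes its target as desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), and the scaleUp policy above then caps how many Pods may be added per period. The sketch below reproduces that arithmetic for the CPU target. It ignores the controller's tolerance band and the stabilization windows, and the function names are illustrative.

```shell
#!/usr/bin/env bash
# desired = ceil(current * metric / target), using integer arithmetic.
hpa_desired() {
  local current="$1" metric="$2" target="$3"
  echo $(( (current * metric + target - 1) / target ))
}

# Apply the limits from the manifest: at most MAX_ADD extra Pods per period,
# and never outside minReplicas..maxReplicas.
hpa_next_step() {
  local current="$1" desired="$2" min="$3" max="$4" max_add="$5"
  local next="$desired"
  if [ "$next" -gt $(( current + max_add )) ]; then next=$(( current + max_add )); fi
  if [ "$next" -gt "$max" ]; then next="$max"; fi
  if [ "$next" -lt "$min" ]; then next="$min"; fi
  echo "$next"
}

# 3 replicas averaging 90% CPU against the 70% target
hpa_desired 3 90 70                                 # prints 4
# 3 replicas at 200% CPU want 9, but only 2 may be added in the first minute
hpa_next_step 3 "$(hpa_desired 3 200 70)" 3 10 2    # prints 5
```

Utilization is measured against the Pod's resource requests, not its limits. With the 512Mi memory request in the Deployment, the 80% memory target corresponds to roughly 410Mi per Pod, so it is worth checking the API's steady-state memory before relying on that metric.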

Ingress configuration (with TLS and rate limiting)

# ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: dify-ingress
  namespace: dify-prod
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-buffering: "off"
    nginx.ingress.kubernetes.io/proxy-body-size: "100m"
    # Rate limit: 60 requests per minute per IP (limit-rps would be per second)
    nginx.ingress.kubernetes.io/limit-rpm: "60"
    nginx.ingress.kubernetes.io/limit-connections: "20"
spec:
  # Replaces the deprecated kubernetes.io/ingress.class annotation
  ingressClassName: nginx
  tls:
  - hosts:
    - dify.yourcompany.com
    secretName: dify-tls
  rules:
  - host: dify.yourcompany.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: dify-web-svc
            port:
              number: 3000
      - path: /api
        pathType: Prefix
        backend:
          service:
            name: dify-api-svc
            port:
              number: 5001
      - path: /v1
        pathType: Prefix
        backend:
          service:
            name: dify-api-svc
            port:
              number: 5001
      - path: /console/api
        pathType: Prefix
        backend:
          service:
            name: dify-api-svc
            port:
              number: 5001

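PostgreSQL high availability: Patroni + HAProxy

A single-node PostgreSQL is the biggest single point of failure. For production, Patroni is recommended for automatic failover:

# patroni.yaml (configuration file stored in a ConfigMap)
# Note: Patroni does not expand ${...} placeholders itself; render the file
# first (for example with envsubst) or use the PATRONI_* environment variables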
scope: dify-postgres
namespace: /dify/
name: pg-node-1

restapi:
  listen: 0.0.0.0:8008
  connect_address: ${POD_IP}:8008

etcd3:
  hosts: etcd-0.etcd:2379,etcd-1.etcd:2379,etcd-2.etcd:2379

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 30
    maximum_lag_on_failover: 1048576  # 1MB
    postgresql:
      use_pg_rewind: true
      use_slots: true
      parameters:
        max_connections: 500
        shared_buffers: 4GB
        effective_cache_size: 12GB
        work_mem: 16MB
        maintenance_work_mem: 512MB
        wal_level: replica
        max_wal_senders: 5
        hot_standby: on
        
  initdb:
  - encoding: UTF8
  - data-checksums

  pg_hba:
  - host replication replicator 0.0.0.0/0 md5
  - host all all 0.0.0.0/0 md5

postgresql:
  listen: 0.0.0.0:5432
  connect_address: ${POD_IP}:5432
  data_dir: /var/lib/postgresql/data/pgdata
  authentication:
    replication:
      username: replicator
      password: ${REPLICATION_PASSWORD}
    superuser:
      username: postgres
      password: ${POSTGRES_PASSWORD}

PgBouncer connection pool

# pgbouncer.ini
[databases]
dify = host=pg-primary port=5432 dbname=dify

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt

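# Pooling mode (transaction mode suits Dify best)
pool_mode = transaction
# Allow up to 1000 client connections
# (PgBouncer only treats # or ; at the start of a line as a comment,
# so comments must not trail a setting)
max_client_conn = 1000
# Keep up to 50 server connections per user/database pair
default_pool_size = 50
min_pool_size = 10
reserve_pool_size = 10
reserve_pool_timeout = 5

# Timeouts
server_idle_timeout = 600
client_idle_timeout = 0
query_timeout = 0
connect_timeout = 15

# Statistics and logging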
stats_period = 60
log_connections = 0
log_disconnections = 0
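
With auth_type = md5, each line of the auth_file holds a quoted user name followed by either the plain password or an MD5 hash of the form "md5" plus md5(password + username). The sketch below generates such a line. It assumes the md5sum utility is available, and the function name is illustrative.

```shell
#!/usr/bin/env bash
# Print a userlist.txt line: "user" "md5<md5 of password followed by user>"
pgbouncer_userlist_entry() {
  local user="$1" password="$2" hash
  hash=$(printf '%s%s' "$password" "$user" | md5sum | cut -d' ' -f1)
  printf '"%s" "md5%s"\n' "$user" "$hash"
}

# Usage (the password should match DB_PASSWORD in .env):
#   pgbouncer_userlist_entry dify 'the-db-password' > /etc/pgbouncer/userlist.txt
```

If the PostgreSQL server stores passwords as SCRAM secrets (the default password_encryption since PostgreSQL 14), an MD5 entry cannot be used to log in to the server. In that case store the plain password or a SCRAM secret instead, as described in the PgBouncer documentation.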

Weaviate clustering

# weaviate-statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: weaviate
  namespace: dify-infra
spec:
  serviceName: weaviate
  replicas: 3
  selector:
    matchLabels:
      app: weaviate
  template:
    metadata:
      labels:
        app: weaviate
    spec:
      containers:
      - name: weaviate
        image: semitechnologies/weaviate:1.24.1
        env:
        - name: CLUSTER_HOSTNAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: CLUSTER_GOSSIP_BIND_PORT
          value: "7000"
        - name: CLUSTER_DATA_BIND_PORT
          value: "7001"
        - name: CLUSTER_JOIN
          value: "weaviate-0.weaviate,weaviate-1.weaviate,weaviate-2.weaviate"
        - name: PERSISTENCE_DATA_PATH
          value: /var/lib/weaviate
        - name: AUTHENTICATION_APIKEY_ENABLED
          value: "true"
        - name: AUTHENTICATION_APIKEY_ALLOWED_KEYS
          valueFrom:
            secretKeyRef:
              name: weaviate-secrets
              key: api-key
        resources:
          requests:
            memory: "4Gi"
            cpu: "2"
          limits:
            memory: "8Gi"
            cpu: "4"
        volumeMounts:
        - name: weaviate-data
          mountPath: /var/lib/weaviate
  volumeClaimTemplates:
  - metadata:
      name: weaviate-data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: fast-ssd
      resources:
        requests:
          storage: 200Gi

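Level 4: Production pitfalls and decisions (expert view)

Architecture decision tree

In real projects, many teams choose wrongly when faced with the question "Docker Compose or K8s?". The decision framework below is based on real cases:

Case 1: a fintech company (300 people)

The team initially chose K8s on the grounds of "high availability". Result:

The right decision: for teams of up to 300 people without a dedicated K8s operations engineer, Docker Compose + regular backups + a cold-standby host is the more pragmatic choice.

Case 2: a manufacturing company (800 people)

The team chose K8s but overlooked the complexity of stateful services (PostgreSQL, Weaviate) and ran the databases directly on K8s. Result:

The right decision: run only the stateless api/web/worker on K8s, and run PostgreSQL and Weaviate on dedicated physical machines or managed cloud services (RDS / managed Weaviate).

High-availability reference architecture (real production)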

                        ┌──────────────────────────────────┐
                        │         DNS + CDN layer          │
                        │ (Cloudflare / Alibaba Cloud CDN) │
                        └─────────────────┬────────────────┘
                                          │
                        ┌─────────────────▼────────────────┐
                        │ Load-balancing layer (redundant) │
                        │   LB-01 ←→ LB-02 (Keepalived)    │
                        │         VRRP virtual IP          │
                        └─────────────────┬────────────────┘
                            ┌─────────────┼─────────────┐
                            ▼             ▼             ▼
                      ┌──────────┐  ┌──────────┐  ┌──────────┐
                      │ app-01   │  │ app-02   │  │ app-03   │
                      │ api+web  │  │ api+web  │  │ api+web  │
                      └──────────┘  └──────────┘  └──────────┘
                            │             │             │
                      ┌──────────────────────────────────────┐
                      │        Redis Sentinel cluster        │
                      │  master + 2 replicas + 3 sentinels   │
                      └──────────────────────────────────────┘
                                          │
                      ┌──────────────────────────────────────┐
                      │      PostgreSQL Patroni cluster      │
                      │   primary + 2 standbys (sync rep)    │
                      │           HAProxy frontend           │
                      └──────────────────────────────────────┘

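Storage selection pitfalls

Pitfall: local storage does not work on K8s

By default Dify stores uploaded documents on the local filesystem, but with multiple replicas on K8s, different Pods see different files.

Comparison of solutions:

Option                     | Pros                               | Cons                                      | Best for
---------------------------|------------------------------------|-------------------------------------------|--------------------------
NFS                        | Simple, no extra cost              | Poor performance, single point of failure | Small-scale testing
Ceph/GlusterFS             | Highly available, high performance | Complex to operate                        | Self-managed K8s
Alibaba Cloud OSS / AWS S3 | Zero operations, highly available  | Network latency, usage fees               | Production in the cloud
MinIO                      | S3-compatible, self-hosted         | Requires maintenance                      | Data must stay in-country

Recommended configuration (MinIO + Dify)

# Switch the storage backend in .env
STORAGE_TYPE=s3
S3_ENDPOINT=http://minio:9000
S3_BUCKET_NAME=dify-storage
S3_ACCESS_KEY=minio-access-key
S3_SECRET_KEY=minio-secret-key
# Any region value works in MinIO's S3-compatible mode
S3_REGION=us-east-1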

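Danger zones in version upgrades

Any major Dify upgrade (for example 0.9.x → 0.10.x) may include breaking changes:

  1. Database schema changes: Flask-Migrate migration scripts can have bugs, so always back up before upgrading and run the migration in a test environment first
  2. Vector database schema changes: when the Weaviate collection schema is not backward compatible, the indexes must be rebuilt, which can take hours
  3. Renamed environment variables: a new version may deprecate some variables; the old name is then ignored and the default value applies, so a feature can silently stop working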
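
One way to catch renamed or dropped variables before an upgrade is to compare the keys in your current .env with the keys in the new release's .env.example. A minimal sketch with illustrative function names. It compares variable names only, not values.

```shell
#!/usr/bin/env bash
# Print the sorted, unique variable names defined in an env file
# (comments and blank lines are skipped).
env_keys() {
  grep -E '^[A-Za-z_][A-Za-z0-9_]*=' "$1" | cut -d= -f1 | sort -u
}

# Keys present in the first file but missing from the second.
keys_only_in_first() {
  local a b
  a=$(mktemp)
  b=$(mktemp)
  env_keys "$1" > "$a"
  env_keys "$2" > "$b"
  comm -23 "$a" "$b"
  rm -f "$a" "$b"
}

# Usage, from dify/docker after checking out the new release:
#   keys_only_in_first .env .env.example   # set locally, unknown to the new version
#   keys_only_in_first .env.example .env   # new variables you have not set yet
```

Variables reported by the first command are candidates for renames and are worth checking against the release notes.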

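Safe upgrade SOP

1. Run the new version for ≥ 24h in a test environment with an identical configuration
2. Back up production (database + vector store + stored files)
3. Open a maintenance window and put up a maintenance page
4. Run the upgrade script
5. Smoke test (cover the core flows within 10 minutes)
6. Take down the maintenance page and close the maintenance window
7. Keep the old images for 7 days for rollback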

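Security hardening checklist

# 1. Change default ports (so port scans do not hit the services directly)
# 2. Restrict exposure of the Docker API
# Do not bind it to 0.0.0.0
dockerd --host=unix:///var/run/docker.sock

# 3. Isolate networks: only nginx exposes 80/443
# docker-compose.yaml
networks:
  frontend:
    # only nginx joins this network
  backend:
    internal: true  # fully internal, no external access; services that call
                    # external model APIs also need a network with outbound access

# 4. Scan images for vulnerabilities regularly
docker scout cves langgenius/dify-api:0.10.0

# 5. Restrict container privileges
security_opt:
  - no-new-privileges:true
cap_drop:
  - ALL
read_only: true

# 6. Audit logging
# Enable in .env (confirm that your Dify edition supports this variable)
ENABLE_AUDIT_LOG=true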

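Chapter summary

Key points

  1. Selection principle: use Docker Compose below 50 people, consider a two-host hot standby for 50-500 people, and move to K8s above 500 people (while keeping the databases separate).

  2. Must-change settings: in production, SECRET_KEY, the database password, and the Redis password must be strong values; never use the defaults.

  3. The core of high availability: real high availability requires PostgreSQL primary/standby (Patroni) + Redis Sentinel + multiple API replicas.

  4. Storage must be shared: with multiple replicas on K8s you must use shared storage (S3/MinIO/OSS); local storage leads to inconsistent data.

  5. Upgrade with care: before any Dify upgrade, back up the database and validate in a test environment before going to production.

  6. Do not skip security: network isolation, least privilege, and regular vulnerability scans are baseline requirements for production.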

Command quick reference

# Show the status of all services
docker compose ps

# Show a service's memory usage
docker stats dify-api-1

# Open a shell in a container for debugging
docker compose exec api bash

# Count database connections
docker compose exec db psql -U dify -c "SELECT count(*) FROM pg_stat_activity;"

# Force-recreate a service
docker compose up -d --force-recreate --no-deps api

# K8s: watch Pod status
kubectl get pods -n dify-prod -w

# K8s: rolling restart
kubectl rollout restart deployment/dify-api -n dify-prod