Chapter 48
Cluster Deployment: The Kubernetes Approach
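Introduction
A single-machine Docker setup solves the problem of running reliably; Kubernetes solves the problem of scaling elastically. When Hermes Agent has to serve traffic across several GPU servers, scale out automatically at peak load, and recover automatically from failures, Kubernetes is the standard infrastructure for the job. This chapter walks through a complete K8s deployment: GPU resource scheduling, horizontal autoscaling, cross-node load balancing, and a comparison of StatefulSet and Deployment for each component.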
48.1 K8s Deployment Architecture
┌──────────────────────────────────────────────────────────────────────────┐
│            Kubernetes cluster (Hermes production environment)            │
│                                                                          │
│  ┌────────────────────────────────────────────────────────────────────┐  │
│  │                         Ingress Controller                         │  │
│  │        (nginx-ingress / traefik): TLS termination + routing        │  │
│  └─────────────────────────────────┬──────────────────────────────────┘  │
│                                    │                                     │
│  ┌─────────────────────────────────▼──────────────────────────────────┐  │
│  │                  hermes-agent Service (ClusterIP)                  │  │
│  │                  load balancing → MCP Server Pods                  │  │
│  └──────┬─────────────────┬─────────────────┬─────────────────────────┘  │
│         │                 │                 │                            │
│  ┌──────▼──────┐   ┌──────▼──────┐   ┌──────▼──────┐                     │
│  │ hermes-agent│   │ hermes-agent│   │ hermes-agent│  ← Deployment Pods  │
│  │    Pod 0    │   │    Pod 1    │   │    Pod N    │    (stateless)      │
│  └──────┬──────┘   └──────┬──────┘   └──────┬──────┘                     │
│         │                 │                 │                            │
│         └─────────────────┼─────────────────┘                            │
│                           │ internal calls                               │
│  ┌────────────────────────▼───────────────────────────────────────────┐  │
│  │                  vllm-service Service (ClusterIP)                  │  │
│  │                     load balancing → vLLM Pods                     │  │
│  └──────┬─────────────────────────┬───────────────────────────────────┘  │
│         │                         │                                      │
│  ┌──────▼─────────────┐    ┌──────▼─────────────┐                        │
│  │ vLLM Pod 0         │    │ vLLM Pod 1         │  ← StatefulSet Pods    │
│  │ ┌────────────────┐ │    │ ┌────────────────┐ │    (one GPU each)      │
│  │ │ NVIDIA GPU 0   │ │    │ │ NVIDIA GPU 1   │ │                        │
│  │ │ (A100 80GB)    │ │    │ │ (A100 80GB)    │ │                        │
│  │ └────────────────┘ │    │ └────────────────┘ │                        │
│  └────────────────────┘    └────────────────────┘                        │
│                                                                          │
│  ┌────────────────────────────────────────────────────────────────────┐  │
│  │                      Persistent storage layer                      │  │
│  │    PVC: model-storage (ReadWriteMany, NFS/CephFS)                  │  │
│  │    PVC: hermes-data   (ReadWriteOnce, SSD)                         │  │
│  └────────────────────────────────────────────────────────────────────┘  │
│                                                                          │
│  ┌────────────────────────────────────────────────────────────────────┐  │
│  │   Monitoring stack: Prometheus + Grafana + NVIDIA DCGM Exporter    │  │
│  └────────────────────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────────────────────┘
48.2 Prerequisites: GPU Node Configuration
Installing the NVIDIA Device Plugin
# 1. Confirm that the node has GPUs and that the driver is installed
nvidia-smi   # must run successfully on every GPU node
# 2. Install the NVIDIA Container Toolkit
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
# 3. Configure the container runtime (Docker shown here; on nodes that run
#    containerd, use --runtime=containerd and restart containerd instead)
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
# 4. Install the NVIDIA Device Plugin (K8s)
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.5/nvidia-device-plugin.yml
# 5. Verify that GPU resources are visible to the scheduler
kubectl describe nodes | grep -A 10 "Allocatable"
# The output should include:
#   nvidia.com/gpu: 4   (number of GPUs on the node)
Node Labels (for GPU scheduling)
# Label the GPU nodes
kubectl label nodes gpu-node-01 gpu=a100 gpu-memory=80g
kubectl label nodes gpu-node-02 gpu=a100 gpu-memory=80g
# Verify
kubectl get nodes --show-labels | grep gpu=
48.3 ConfigMap Configuration
# configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: hermes-config
  namespace: hermes-prod
data:
  # Hermes Agent settings
  HERMES_BASE_URL: "http://vllm-service:8000"
  HERMES_MODEL: "NousResearch/Hermes-3-Llama-3.1-70B"
  MAX_TOKENS: "4096"
  TEMPERATURE: "0.1"
  CONTEXT_WINDOW: "65536"
  REQUEST_TIMEOUT: "120"
  LOG_LEVEL: "INFO"
  LOG_FORMAT: "json"
  # Feature flags
  ENABLE_RATE_LIMITING: "true"
  MAX_REQUESTS_PER_MINUTE: "60"
  PROMETHEUS_ENABLED: "true"
  # vLLM service settings
  vllm-config.yaml: |
    model: NousResearch/Hermes-3-Llama-3.1-70B
    dtype: bfloat16
    max_model_len: 65536
    gpu_memory_utilization: 0.90
    # A 70B model in bfloat16 needs roughly 140 GB for weights alone, which
    # exceeds a single 80 GB GPU; raise this (and the Pod's GPU count) or
    # use a quantized checkpoint.
    tensor_parallel_size: 1
    max_num_seqs: 256
    enable_prefix_caching: true
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-config
  namespace: hermes-prod
data:
  nginx.conf: |
    upstream hermes_backend {
      least_conn;
      server hermes-agent-service:8765;
      keepalive 32;
    }
    server {
      listen 80;
      location /health {
        access_log off;
        proxy_pass http://hermes_backend;
      }
      location / {
        proxy_pass http://hermes_backend;
        # Upstream keepalive only takes effect with HTTP/1.1 and an empty
        # Connection header
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_buffering off;
        proxy_read_timeout 300s;
      }
    }
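The vLLM settings in this ConfigMap can be sanity-checked with back-of-the-envelope arithmetic before anything is deployed. The sketch below is an illustration, not part of the Hermes tooling: the function names are hypothetical, the parameter count is rounded to 70e9, bfloat16 is taken as 2 bytes per parameter, and KV cache and activations are ignored, so the result is only a lower bound.

```python
import math


def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Approximate memory needed for the model weights alone, in GB."""
    return num_params * bytes_per_param / 1e9


def min_gpus_for_weights(weights_gb: float, gpu_gb: float, utilization: float) -> int:
    """Lower bound on the number of GPUs needed to hold the weights.

    `utilization` mirrors vLLM's gpu_memory_utilization setting. KV cache
    and activations are ignored, so real deployments need extra headroom.
    """
    usable_per_gpu = gpu_gb * utilization
    return math.ceil(weights_gb / usable_per_gpu)


# Assumed figures: ~70e9 parameters, bfloat16 (2 bytes each), 80 GB per GPU.
weights = weight_memory_gb(70e9, 2)
gpus = min_gpus_for_weights(weights, gpu_gb=80, utilization=0.90)
print(f"weights ~{weights:.0f} GB, at least {gpus} GPUs")
```

Under these assumptions the weights alone come to about 140 GB, so a 70B bfloat16 checkpoint does not fit on one 80 GB GPU. The usual options are a larger tensor-parallel size with a matching nvidia.com/gpu count per Pod, or a quantized checkpoint.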
48.4 Secret Configuration
# secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: hermes-secrets
  namespace: hermes-prod
type: Opaque
# Note: in production, use Vault / Sealed Secrets / External Secrets Operator.
# Do not commit plaintext values (base64 is an encoding, not encryption!).
stringData:
  api-key: "your-api-key-here"            # replace with the real value
  hf-token: "hf_your_huggingface_token"   # Hugging Face token for model downloads
# Alternative: create the Secret with kubectl so the values never appear in a YAML file
kubectl create secret generic hermes-secrets \
  --from-literal=api-key="$(cat secrets/api_key.txt)" \
  --from-literal=hf-token="$(cat secrets/hf_token.txt)" \
  -n hermes-prod
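The warning that base64 is not encryption is easy to demonstrate. The following sketch uses only the Python standard library and a placeholder value; it shows that whatever Kubernetes stores in a Secret's `data` field can be reversed by anyone who can read the object, with no key involved.

```python
import base64

# Placeholder value, for illustration only.
secret_value = "your-api-key-here"

# This is the form Kubernetes stores in the Secret's `data` field.
encoded = base64.b64encode(secret_value.encode("utf-8")).decode("ascii")

# Reversing it requires no key, only read access to the Secret object.
decoded = base64.b64decode(encoded).decode("utf-8")

print(encoded)
print(decoded)
```

This is why the section recommends an external secret manager or sealed secrets for production, and why RBAC on Secret objects matters.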
48.5 vLLM StatefulSet (GPU Inference Layer)
# vllm-statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: vllm-hermes
  namespace: hermes-prod
  labels:
    app: vllm-hermes
    tier: inference
spec:
  serviceName: "vllm-headless"   # must match the headless Service
  replicas: 2                    # one per GPU node
  selector:
    matchLabels:
      app: vllm-hermes
  template:
    metadata:
      labels:
        app: vllm-hermes
        tier: inference
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8000"   # vLLM serves /metrics on the API port
    spec:
      affinity:
        # Anti-affinity: place each Pod on a different physical node
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values: [vllm-hermes]
              topologyKey: "kubernetes.io/hostname"
        # Prefer nodes labelled as A100
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              preference:
                matchExpressions:
                  - key: gpu
                    operator: In
                    values: [a100]
      # Tolerate the taint on GPU nodes
      tolerations:
        - key: "nvidia.com/gpu"
          operator: "Exists"
          effect: "NoSchedule"
      volumes:
        # Shared memory (vLLM needs it)
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 8Gi
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-storage-pvc
      # Init container: download the model into the shared cache
      initContainers:
        - name: model-downloader
          image: python:3.11-slim
          command:
            - sh
            - -c
            - |
              pip install huggingface_hub -q
              # huggingface-cli filters files with --include / --exclude
              huggingface-cli download \
                NousResearch/Hermes-3-Llama-3.1-70B \
                --token "${HF_TOKEN}" \
                --local-dir /model-cache/hermes-70b \
                --exclude "*.pt"
          env:
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hermes-secrets
                  key: hf-token
          volumeMounts:
            - name: model-cache
              mountPath: /model-cache
          resources:
            requests:
              memory: "2Gi"
              cpu: "1"
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest   # pin a specific tag in production
          command:
            - python3
            - -m
            - vllm.entrypoints.openai.api_server
          args:
            - --model
            - /model-cache/hermes-70b
            - --dtype
            - bfloat16
            - --max-model-len
            - "65536"
            - --gpu-memory-utilization
            - "0.90"
            - --max-num-seqs
            - "256"
            - --enable-prefix-caching
            # 70B bfloat16 weights (~140 GB) do not fit on one 80 GB GPU;
            # raise this together with nvidia.com/gpu, or use a quantized model
            - --tensor-parallel-size
            - "1"
            - --port
            - "8000"
            - --host
            - "0.0.0.0"
          ports:
            - name: api
              containerPort: 8000   # also serves /metrics
          env:
            - name: CUDA_VISIBLE_DEVICES
              value: "0"
            - name: NCCL_SOCKET_IFNAME
              value: eth0
          resources:
            requests:
              memory: "16Gi"
              cpu: "4"
              nvidia.com/gpu: "1"   # one GPU per Pod
            limits:
              memory: "96Gi"        # host RAM limit; GPU memory is not counted here
              cpu: "16"
              nvidia.com/gpu: "1"
          volumeMounts:
            - name: dshm
              mountPath: /dev/shm   # shared memory (inter-process communication)
            - name: model-cache
              mountPath: /model-cache
          # Readiness probe: wait until the model has finished loading
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 300   # model loading takes 5+ minutes
            periodSeconds: 15
            timeoutSeconds: 10
            failureThreshold: 20
          # Liveness probe: starts at 360 s and restarts the container after
          # 3 consecutive failures, so loading must finish within roughly 7 minutes
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 360
            periodSeconds: 30
            timeoutSeconds: 10
            failureThreshold: 3
  # Volume claim templates (StatefulSet-specific)
  volumeClaimTemplates: []   # the shared PVC is used instead of per-Pod PVCs
---
# Headless Service (required by the StatefulSet)
apiVersion: v1
kind: Service
metadata:
  name: vllm-headless
  namespace: hermes-prod
spec:
  clusterIP: None
  selector:
    app: vllm-hermes
---
# ClusterIP Service used by the rest of the cluster
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
  namespace: hermes-prod
  labels:
    app: vllm-hermes
spec:
  type: ClusterIP
  selector:
    app: vllm-hermes
  ports:
    - name: api   # vLLM serves /metrics on this port as well
      port: 8000
      targetPort: 8000
  sessionAffinity: ClientIP   # sticky sessions per client IP; can skew load across Pods
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 300
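The interaction between the vLLM readiness and liveness probes is easier to see as arithmetic. The sketch below is an approximation under a stated assumption: the first probe fires at `initialDelaySeconds`, later probes follow every `periodSeconds`, and kubelet jitter is ignored. The `Probe` class is a hypothetical helper, not a Kubernetes API.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    initial_delay: int      # initialDelaySeconds
    period: int             # periodSeconds
    timeout: int            # timeoutSeconds
    failure_threshold: int  # failureThreshold

    def give_up_after(self) -> int:
        """Approximate seconds from container start until the probe has
        failed `failure_threshold` times in a row without ever succeeding."""
        return self.initial_delay + (self.failure_threshold - 1) * self.period + self.timeout


# Values taken from the vLLM StatefulSet in this section.
readiness = Probe(initial_delay=300, period=15, timeout=10, failure_threshold=20)
liveness = Probe(initial_delay=360, period=30, timeout=10, failure_threshold=3)

print("readiness gives up after ~", readiness.give_up_after(), "s")
print("liveness restarts after  ~", liveness.give_up_after(), "s")
```

With these values the liveness probe restarts the container after roughly seven minutes, well before the readiness probe would stop waiting at roughly ten. If model loading can exceed that window, a startupProbe is the usual Kubernetes mechanism: liveness and readiness checks are held back until it succeeds.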
48.6 Hermes Agent Deployment (Stateless Layer)
# hermes-agent-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hermes-agent
  namespace: hermes-prod
  labels:
    app: hermes-agent
    tier: api
spec:
  replicas: 3   # initial replica count (the HPA adjusts it dynamically)
  selector:
    matchLabels:
      app: hermes-agent
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1   # at most 1 Pod unavailable during an update
      maxSurge: 2         # at most 2 extra Pods above the desired count
  template:
    metadata:
      labels:
        app: hermes-agent
        tier: api
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9090"
    spec:
      serviceAccountName: hermes-agent-sa   # this ServiceAccount must already exist
      # Grace period for finishing in-flight requests on shutdown
      terminationGracePeriodSeconds: 60
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values: [hermes-agent]
                topologyKey: "kubernetes.io/hostname"
      containers:
        - name: hermes-agent
          image: your-registry/hermes-agent:1.0.0
          imagePullPolicy: Always
          ports:
            - name: mcp
              containerPort: 8765
            - name: metrics
              containerPort: 9090
          envFrom:
            - configMapRef:
                name: hermes-config
          env:
            - name: API_KEY
              valueFrom:
                secretKeyRef:
                  name: hermes-secrets
                  key: api-key
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: POD_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
          resources:
            requests:
              memory: "512Mi"
              cpu: "500m"
            limits:
              memory: "4Gi"
              cpu: "4000m"
          # Readiness probe
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8765
            initialDelaySeconds: 15
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3
          # Liveness probe
          livenessProbe:
            httpGet:
              path: /health
              port: 8765
            initialDelaySeconds: 30
            periodSeconds: 30
            timeoutSeconds: 5
            failureThreshold: 3
          # Lifecycle hook (graceful shutdown)
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 5"]   # give the load balancer time to update its routes
          volumeMounts:
            - name: logs
              mountPath: /app/logs
            - name: tmp
              mountPath: /tmp
      volumes:
        - name: logs
          emptyDir: {}
        - name: tmp
          emptyDir:
            medium: Memory
            sizeLimit: 512Mi
---
# Hermes Agent Service
apiVersion: v1
kind: Service
metadata:
  name: hermes-agent-service
  namespace: hermes-prod
spec:
  type: ClusterIP
  selector:
    app: hermes-agent
  ports:
    - name: mcp
      port: 8765
      targetPort: 8765
    - name: metrics
      port: 9090
      targetPort: 9090
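The rolling-update settings of the Deployment translate into two simple bounds that hold throughout an update. The helper below is a hypothetical illustration for absolute (non-percentage) values of `maxUnavailable` and `maxSurge`.

```python
from typing import Tuple


def rolling_update_bounds(replicas: int, max_unavailable: int, max_surge: int) -> Tuple[int, int]:
    """Return (minimum available Pods, maximum total Pods) during a rolling update."""
    min_available = replicas - max_unavailable
    max_total = replicas + max_surge
    return min_available, max_total


# Values from the hermes-agent Deployment: 3 replicas, maxUnavailable 1, maxSurge 2.
min_available, max_total = rolling_update_bounds(3, 1, 2)
print(f"at least {min_available} Pods serving, at most {max_total} Pods running")
```

With the chapter's values, at least 2 Pods keep serving while up to 5 exist at once, so the cluster needs spare capacity for two extra Pods during each rollout.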
48.7 Horizontal Pod Autoscaling (HPA)
# hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: hermes-agent-hpa
  namespace: hermes-prod
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: hermes-agent
  minReplicas: 2    # floor of 2 replicas for high availability
  maxReplicas: 20   # ceiling of 20 replicas
  metrics:
    # CPU utilization trigger
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out above 70% of the CPU request
    # Memory utilization trigger (measured against the 512Mi request, not the limit)
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
    # Custom metric trigger (request queue length); requires a custom-metrics
    # adapter such as prometheus-adapter
    - type: Pods
      pods:
        metric:
          name: hermes_request_queue_length   # custom Prometheus metric
        target:
          type: AverageValue
          averageValue: "10"   # scale out when the average queue per Pod exceeds 10
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30   # stabilization window for scale-up decisions (seconds)
      policies:
        - type: Pods
          value: 3     # add at most 3 Pods per period
          periodSeconds: 60
        - type: Percent
          value: 50    # or at most 50% more Pods per period
          periodSeconds: 60
      selectPolicy: Max   # apply whichever policy allows the larger change
    scaleDown:
      stabilizationWindowSeconds: 300   # wait 5 minutes before scaling in (prevents flapping)
      policies:
        - type: Pods
          value: 1     # remove at most 1 Pod per period
          periodSeconds: 120
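The replica count the HPA aims for follows the formula in the Kubernetes documentation: desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), with changes inside a tolerance band (0.1 by default) ignored. The sketch below applies that formula to a single metric and then shows how the two scaleUp policies cap one step. It is a simplified model with hypothetical function names; the real controller also takes the maximum across all metrics and applies stabilization windows.

```python
import math


def desired_replicas(current: int, metric: float, target: float,
                     min_replicas: int, max_replicas: int,
                     tolerance: float = 0.1) -> int:
    """Single-metric HPA formula, clamped to [min_replicas, max_replicas]."""
    ratio = metric / target
    if abs(ratio - 1.0) <= tolerance:
        return current  # within tolerance: no scaling action
    desired = math.ceil(current * ratio)
    return max(min_replicas, min(max_replicas, desired))


def scale_up_cap(current: int, pods: int, percent: int) -> int:
    """Largest replica count reachable in one period with selectPolicy: Max."""
    by_pods = current + pods
    by_percent = math.ceil(current * (1 + percent / 100))
    return max(by_pods, by_percent)


# 3 replicas running at 140% CPU against a 70% target.
want = desired_replicas(3, metric=140, target=70, min_replicas=2, max_replicas=20)
cap = scale_up_cap(3, pods=3, percent=50)
print(f"desired {want}, allowed this period {min(want, cap)}")
```

With 3 replicas the "3 Pods" policy is the more permissive one; at 10 replicas the "50%" policy takes over and allows growth to 15 in a single period.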
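Custom-Metric Scaling with KEDA
# keda-scaledobject.yaml
# KEDA offers a richer set of scaling triggers than the native HPA.
# KEDA creates and manages its own HPA, so use either hpa.yaml or this
# ScaledObject for the hermes-agent Deployment, not both.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: hermes-agent-scaler
  namespace: hermes-prod
spec:
  scaleTargetRef:
    name: hermes-agent
  minReplicaCount: 2
  maxReplicaCount: 20
  pollingInterval: 15   # check every 15 seconds
  cooldownPeriod: 300   # scale-in cooldown of 300 seconds
  triggers:
    # Prometheus-based trigger
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090
        metricName: hermes_request_queue_length
        threshold: "10"   # target queue length per replica
        # The query returns the total queue length; KEDA divides it by the
        # threshold to derive the replica count
        query: |
          sum(hermes_request_queue_length{namespace="hermes-prod"})
    # RabbitMQ queue-length trigger (if a message queue is used)
    # - type: rabbitmq
    #   metadata:
    #     host: amqp://rabbitmq:5672
    #     queueName: hermes-tasks
    #     queueLength: "50"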
48.8 Ingress Configuration
# ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: hermes-ingress
  namespace: hermes-prod
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-body-size: "10m"
    nginx.ingress.kubernetes.io/proxy-buffering: "off"   # buffering must be off for streaming responses
    cert-manager.io/cluster-issuer: "letsencrypt-prod"   # automatic TLS certificates
    # Rate limiting
    nginx.ingress.kubernetes.io/limit-rps: "30"
    nginx.ingress.kubernetes.io/limit-connections: "20"
spec:
  ingressClassName: nginx   # replaces the deprecated kubernetes.io/ingress.class annotation
  tls:
    - hosts:
        - hermes-api.your-domain.com
      secretName: hermes-tls
  rules:
    - host: hermes-api.your-domain.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: hermes-agent-service
                port:
                  number: 8765
48.9 Persistent Storage Configuration
# storage.yaml
# StorageClass (shared storage backed by NFS or Ceph)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: hermes-shared-storage
provisioner: nfs.csi.k8s.io
parameters:
  server: nfs-server.internal
  share: /data/hermes
reclaimPolicy: Retain
allowVolumeExpansion: true
---
# PVC: model storage (shared; written by the init container, read by the vLLM Pods)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-storage-pvc
  namespace: hermes-prod
spec:
  accessModes:
    - ReadWriteMany   # mountable read-write from multiple nodes at once
  storageClassName: hermes-shared-storage
  resources:
    requests:
      storage: 500Gi   # room for several models
---
# PVC: application data (not shared across nodes)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: hermes-data-pvc
  namespace: hermes-prod
spec:
  accessModes:
    - ReadWriteOnce   # mountable read-write by a single node
  storageClassName: fast-ssd   # an SSD-backed StorageClass, assumed to exist
  resources:
    requests:
      storage: 50Gi
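48.10 StatefulSet vs Deployment: Selection Guide
Decision Matrix
| Dimension | StatefulSet | Deployment |
|---|---|---|
| State | Stateful (each Pod has a stable identity) | Stateless |
| Network identity | Stable DNS name per Pod (pod-0.service) | Random Pod name and IP; reached through a Service |
| Storage | Per-Pod PVC that survives restarts | Shared or ephemeral storage |
| Scaling order | Ordered by default (0→1→2 up, 2→1→0 down) | Parallel, no ordering |
| Pod replacement | Replaced under the same name | Brand-new Pod |
| Typical workloads | Databases, caches, GPU inference nodes | API servers, workers |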
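The stable network identity in the matrix follows a fixed naming scheme: each StatefulSet Pod is named `<statefulset>-<ordinal>` and resolves at `<pod>.<headless-service>.<namespace>.svc.<cluster-domain>`. The sketch below builds those names for the chapter's vLLM StatefulSet; the helper name is hypothetical and the cluster domain is assumed to be the default `cluster.local`.

```python
from typing import List


def statefulset_pod_dns(name: str, service: str, namespace: str, replicas: int,
                        cluster_domain: str = "cluster.local") -> List[str]:
    """Return the stable DNS name of every Pod in a StatefulSet."""
    return [
        f"{name}-{ordinal}.{service}.{namespace}.svc.{cluster_domain}"
        for ordinal in range(replicas)
    ]


# Names used in this chapter: StatefulSet vllm-hermes, headless Service vllm-headless.
hosts = statefulset_pod_dns("vllm-hermes", "vllm-headless", "hermes-prod", replicas=2)
for host in hosts:
    print(host)
```

A Deployment offers no equivalent: its Pods get generated name suffixes, so clients address them only through the Service.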
Choices in the Hermes Architecture
Component-by-component choices in the Hermes cluster:
| Component | Type | Rationale |
|---|---|---|
| vLLM (GPU inference) | StatefulSet | Needs a stable Pod identity and a fixed GPU assignment; tensor parallelism across Pods needs fixed peers to communicate with |
| Hermes Agent (MCP Server) | Deployment | Stateless and freely scalable; no GPU requirement, so it scales out quickly |
| Redis (session cache) | StatefulSet | Holds persistent data and has a primary/replica relationship |
| Nginx (proxy) | Deployment | Stateless; replaceable at any time |
48.11 Monitoring Configuration
# prometheus-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: hermes-prod
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
    scrape_configs:
      # Hermes Agent metrics
      - job_name: 'hermes-agent'
        kubernetes_sd_configs:
          - role: pod
            namespaces:
              names: [hermes-prod]
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_label_app]
            regex: hermes-agent
            action: keep
          # Scrape the port named in the prometheus.io/port annotation
          - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
            action: replace
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
            target_label: __address__
      # vLLM metrics (served at /metrics on the API port)
      - job_name: 'vllm'
        static_configs:
          - targets: ['vllm-service:8000']
      # NVIDIA GPU metrics
      - job_name: 'nvidia-dcgm'
        static_configs:
          - targets: ['dcgm-exporter:9400']
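Annotation-driven scraping, as used by the `prometheus.io/port` annotation on the Pods, usually relies on a relabel rule that joins `__address__` and the annotation value with `;` and rewrites the port with the regex `([^:]+)(?::\d+)?;(\d+)`. The sketch below reproduces that rewrite in Python to show what the rule does; the function name and sample addresses are illustrative assumptions. Prometheus anchors relabel regexes at both ends, which `fullmatch` mirrors.

```python
import re

# Same pattern as the common Prometheus relabel rule for annotated ports.
ADDRESS_RE = re.compile(r"([^:]+)(?::\d+)?;(\d+)")


def rewrite_address(address: str, port_annotation: str) -> str:
    """Return host:annotated_port, or the address unchanged if the rule does not match."""
    joined = f"{address};{port_annotation}"  # source labels are joined with ';'
    match = ADDRESS_RE.fullmatch(joined)
    if match is None:
        return address  # no match: the relabel rule leaves the label alone
    return f"{match.group(1)}:{match.group(2)}"


print(rewrite_address("10.0.0.5:8765", "9090"))
print(rewrite_address("10.0.0.5:8765", ""))
```

The first call swaps the discovered container port for the annotated metrics port; the second shows that a Pod without the annotation keeps its original address.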
Key Alerting Rules
# alerting-rules.yaml
# PrometheusRule is a Prometheus Operator resource
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: hermes-alerts
  namespace: hermes-prod
spec:
  groups:
    - name: hermes.rules
      rules:
        # GPU memory alert. DCGM_FI_DEV_MEM_COPY_UTIL reports memory bandwidth
        # utilization, so framebuffer usage is used instead. The threshold sits
        # above 90 because vLLM preallocates about 90% of GPU memory
        # (gpu_memory_utilization: 0.90).
        - alert: GPUMemoryHigh
          expr: |
            DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) * 100 > 95
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "GPU memory usage too high on {{ $labels.instance }}"
        # vLLM request queue backlog
        - alert: VLLMQueueBacklog
          expr: |
            vllm:num_requests_waiting > 100
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "vLLM request queue backlog: {{ $value }} requests waiting"
        # Hermes Agent Pod restarts
        - alert: HermesPodRestarting
          expr: |
            increase(kube_pod_container_status_restarts_total{
              namespace="hermes-prod",
              pod=~"hermes-agent.*"
            }[15m]) > 3
          labels:
            severity: warning
          annotations:
            summary: "Hermes Agent pod {{ $labels.pod }} restarting frequently"
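Chapter Summary
Key architectural points for deploying Hermes on Kubernetes:
- Layered design:
  - GPU inference layer (vLLM) → StatefulSet (stateful)
  - MCP Server layer (Hermes Agent) → Deployment (stateless)
- GPU scheduling: the NVIDIA Device Plugin is the foundation; node labels plus node affinity steer Pods onto the right GPU nodes.
- Elastic scaling:
  - Hermes Agent (CPU-intensive) → HPA on CPU/memory metrics
  - vLLM (GPU-bound) → manual scaling, or KEDA with custom metrics
- Storage strategy:
  - Model files → ReadWriteMany PVC (shared)
  - Application data → ReadWriteOnce PVC (isolated)
- Observability: Prometheus, DCGM Exporter, and custom alerts cover GPU utilization, queue backlog, and Pod failures.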
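Questions for Reflection
- When vLLM is deployed as a StatefulSet and needs tensor parallelism (several Pods jointly serving one request), how does the K8s network model support high-speed communication between Pods? How are NVLink and InfiniBand configured in K8s?
- The HPA's scale-out decision is based on CPU utilization, but the bottleneck for Hermes Agent is usually waiting on vLLM responses (CPU idle while response time is high). How would you design scaling metrics that reflect the real service load?
- During a K8s rolling update, a new Hermes Agent image gradually replaces the old Pods. If the new version has a bug that takes 5 minutes to surface, can the existing readiness and liveness probes detect it? How would you design a more robust canary release strategy?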