Chapter 48: Cluster Deployment — Kubernetes Architecture
Introduction
Single-machine Docker handles "stable operation." Kubernetes handles "elastic scale." When Hermes Agent needs to span multiple GPU servers, auto-scale at peak load, and self-heal after failures, Kubernetes becomes non-negotiable infrastructure. This chapter delivers a complete K8s deployment blueprint: GPU resource scheduling, horizontal autoscaling, cross-node load balancing, and a principled analysis of when to use StatefulSet versus Deployment.
48.1 Cluster Architecture
┌─────────────────────────────────────────────────────────────────────────┐
│  Kubernetes Cluster: Hermes Production                                  │
│                                                                         │
│  ┌───────────────────────────────────────────────────────────────────┐  │
│  │  Ingress Controller (nginx-ingress / Traefik)                     │  │
│  │  TLS termination + routing                                        │  │
│  └─────────────────────────────────┬─────────────────────────────────┘  │
│                                    │                                    │
│  ┌─────────────────────────────────▼─────────────────────────────────┐  │
│  │  hermes-agent-service (ClusterIP)                                 │  │
│  │  Load balances → MCP Server Pods                                  │  │
│  └────────┬──────────────────┬──────────────────┬────────────────────┘  │
│           │                  │                  │                       │
│    ┌──────▼──────┐    ┌──────▼──────┐    ┌──────▼──────┐                │
│    │hermes-agent │    │hermes-agent │    │hermes-agent │  Deployment    │
│    │    Pod 0    │    │    Pod 1    │    │    Pod N    │  (stateless)   │
│    └──────┬──────┘    └──────┬──────┘    └──────┬──────┘                │
│           └──────────────────┼──────────────────┘                       │
│                              │ internal calls                           │
│  ┌───────────────────────────▼───────────────────────────────────────┐  │
│  │  vllm-service (ClusterIP)                                         │  │
│  │  Load balances → vLLM Pods (ClientIP session affinity)            │  │
│  └────────┬───────────────────────────┬──────────────────────────────┘  │
│           │                           │                                 │
│    ┌──────▼───────────────┐    ┌──────▼───────────────┐                 │
│    │      vLLM Pod 0      │    │      vLLM Pod 1      │  StatefulSet    │
│    │  ┌────────────────┐  │    │  ┌────────────────┐  │  (stateful,     │
│    │  │  nvidia GPU 0  │  │    │  │  nvidia GPU 1  │  │   GPU-bound)    │
│    │  │  (A100 80GB)   │  │    │  │  (A100 80GB)   │  │                 │
│    │  └────────────────┘  │    │  └────────────────┘  │                 │
│    └──────────────────────┘    └──────────────────────┘                 │
│                                                                         │
│  ┌───────────────────────────────────────────────────────────────────┐  │
│  │  Persistent Storage                                               │  │
│  │  PVC: model-storage-pvc (ReadWriteMany, NFS/CephFS)               │  │
│  │  PVC: hermes-data-pvc (ReadWriteOnce, SSD)                        │  │
│  └───────────────────────────────────────────────────────────────────┘  │
│                                                                         │
│  ┌───────────────────────────────────────────────────────────────────┐  │
│  │  Monitoring: Prometheus + Grafana + NVIDIA DCGM Exporter          │  │
│  └───────────────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────────────┘
48.2 Prerequisites: GPU Node Setup
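# Install NVIDIA Container Toolkit on each GPU node
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
# Register the apt repository (without it, apt cannot find the package)
curl -fsSL https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
# Configure the runtime the kubelet actually uses. Kubernetes 1.24+ ships without
# dockershim, so most clusters run containerd; use --runtime=docker and restart
# docker only on nodes that still run Docker.
sudo nvidia-ctk runtime configure --runtime=containerd --set-as-default
sudo systemctl restart containerd
# Install NVIDIA Device Plugin into K8s
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.5/nvidia-device-plugin.yml
# Verify GPU resources are visible
kubectl describe nodes | grep -A 5 "nvidia.com/gpu"
# On a node with four GPUs this should show: nvidia.com/gpu: 4
# Label GPU nodes for targeted scheduling
kubectl label nodes gpu-node-01 gpu=a100 gpu-memory=80g
kubectl label nodes gpu-node-02 gpu=a100 gpu-memory=80g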
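The grep-based check above is easy to misread on larger clusters. As a minimal sketch, the same verification can be done by parsing the output of `kubectl get nodes -o json`; the function name `allocatable_gpus` and the sample payload are illustrative assumptions, not part of any Hermes tooling.

```python
import json


def allocatable_gpus(nodes_json: str) -> dict[str, int]:
    """Map node name -> allocatable nvidia.com/gpu count.

    Expects the JSON text printed by `kubectl get nodes -o json`.
    Nodes without the device plugin resource are reported as 0.
    """
    result = {}
    for node in json.loads(nodes_json)["items"]:
        name = node["metadata"]["name"]
        allocatable = node.get("status", {}).get("allocatable", {})
        result[name] = int(allocatable.get("nvidia.com/gpu", "0"))
    return result


# Hypothetical payload shaped like the API response (values are made up).
sample = json.dumps({"items": [
    {"metadata": {"name": "gpu-node-01"},
     "status": {"allocatable": {"nvidia.com/gpu": "4"}}},
    {"metadata": {"name": "cpu-node-01"},
     "status": {"allocatable": {"cpu": "16"}}},
]})

gpu_counts = allocatable_gpus(sample)
print(gpu_counts)
```

In practice the input would come from something like `subprocess.run(["kubectl", "get", "nodes", "-o", "json"], capture_output=True, text=True).stdout`.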
48.3 ConfigMap
# configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: hermes-config
  namespace: hermes-prod
data:
  HERMES_BASE_URL: "http://vllm-service:8000"
  HERMES_MODEL: "NousResearch/Hermes-3-Llama-3.1-70B"
  MAX_TOKENS: "4096"
  TEMPERATURE: "0.1"
  CONTEXT_WINDOW: "65536"
  REQUEST_TIMEOUT: "120"
  LOG_LEVEL: "INFO"
  LOG_FORMAT: "json"
  ENABLE_RATE_LIMITING: "true"
  MAX_REQUESTS_PER_MINUTE: "60"
  PROMETHEUS_ENABLED: "true"
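ConfigMap values always reach the container as strings, so the application has to convert them. The sketch below shows one way an agent process might load these keys into typed settings; the `HermesConfig` class and `load_config` function are hypothetical names for illustration, not the actual Hermes Agent code.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class HermesConfig:
    base_url: str
    model: str
    max_tokens: int
    temperature: float
    context_window: int
    request_timeout: float
    rate_limiting: bool
    max_requests_per_minute: int


def load_config(env: Mapping[str, str]) -> HermesConfig:
    """Convert string-valued environment variables into typed settings."""
    return HermesConfig(
        base_url=env["HERMES_BASE_URL"],
        model=env["HERMES_MODEL"],
        max_tokens=int(env["MAX_TOKENS"]),
        temperature=float(env["TEMPERATURE"]),
        context_window=int(env["CONTEXT_WINDOW"]),
        request_timeout=float(env["REQUEST_TIMEOUT"]),
        # Kubernetes passes "true"/"false" as plain strings
        rate_limiting=env.get("ENABLE_RATE_LIMITING", "false").lower() == "true",
        max_requests_per_minute=int(env["MAX_REQUESTS_PER_MINUTE"]),
    )


# Same keys and values as the ConfigMap above.
sample_env = {
    "HERMES_BASE_URL": "http://vllm-service:8000",
    "HERMES_MODEL": "NousResearch/Hermes-3-Llama-3.1-70B",
    "MAX_TOKENS": "4096",
    "TEMPERATURE": "0.1",
    "CONTEXT_WINDOW": "65536",
    "REQUEST_TIMEOUT": "120",
    "ENABLE_RATE_LIMITING": "true",
    "MAX_REQUESTS_PER_MINUTE": "60",
}
config = load_config(sample_env)
print(config)
```

Inside a Pod the call would be `load_config(os.environ)`; failing fast on a missing or malformed key at startup is usually preferable to discovering it on the first request.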
48.4 Secrets
# Create the namespace once, before applying any namespaced resource
kubectl create namespace hermes-prod
# Create secrets from files (never store real values in YAML)
kubectl create secret generic hermes-secrets \
  --from-literal=api-key="$(cat secrets/api_key.txt)" \
  --from-literal=hf-token="$(cat secrets/hf_token.txt)" \
  -n hermes-prod
# Production: use Vault / Sealed Secrets / External Secrets Operator
# base64 in Secret YAML is NOT encryption
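To make the last point concrete: anyone who can read a Secret manifest can recover the value with one standard-library call. The key below is a made-up placeholder.

```python
import base64

# What a Secret manifest stores under `data:` (placeholder value, not a real key)
encoded = base64.b64encode(b"sk-example-not-a-real-key").decode()

# Recovering the plaintext needs no key or password at all
decoded = base64.b64decode(encoded).decode()

print(encoded)
print(decoded)
```

base64 is an encoding for transporting bytes, which is why encryption at rest or an external secret manager is still needed.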
48.5 vLLM StatefulSet (GPU Inference Layer)
# vllm-statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: vllm-hermes
  namespace: hermes-prod
spec:
  serviceName: "vllm-headless"
  replicas: 2
  selector:
    matchLabels:
      app: vllm-hermes
  template:
    metadata:
      labels:
        app: vllm-hermes
        tier: inference
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8000" # vLLM serves /metrics on the API port
    spec:
      # Spread across different physical nodes
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values: [vllm-hermes]
              topologyKey: "kubernetes.io/hostname"
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              preference:
                matchExpressions:
                  - key: gpu
                    operator: In
                    values: [a100]
      tolerations:
        - key: "nvidia.com/gpu"
          operator: "Exists"
          effect: "NoSchedule"
      volumes:
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 8Gi
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-storage-pvc
      initContainers:
        - name: model-downloader
          image: python:3.11-slim
          command:
            - sh
            - -c
            - |
              pip install huggingface_hub -q
              # The CLI flag is --exclude (ignore_patterns is the Python API name)
              huggingface-cli download \
                NousResearch/Hermes-3-Llama-3.1-70B \
                --token "${HF_TOKEN}" \
                --local-dir /model-cache/hermes-70b \
                --exclude "*.pt"
          env:
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hermes-secrets
                  key: hf-token
          volumeMounts:
            - name: model-cache
              mountPath: /model-cache
          resources:
            requests:
              memory: "2Gi"
              cpu: "1"
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest # pin an explicit version tag in production
          args:
            - --model
            - /model-cache/hermes-70b
            # Expose the model under the name clients send (HERMES_MODEL);
            # otherwise vLLM registers it under the local path
            - --served-model-name
            - NousResearch/Hermes-3-Llama-3.1-70B
            - --dtype
            - bfloat16
            - --max-model-len
            - "65536"
            - --gpu-memory-utilization
            - "0.90"
            - --max-num-seqs
            - "256"
            - --enable-prefix-caching
            # NOTE: 70B parameters in bfloat16 need about 140 GB for weights alone,
            # more than one 80 GB A100 holds. Use a quantized checkpoint, or raise
            # --tensor-parallel-size and the nvidia.com/gpu request together.
            - --tensor-parallel-size
            - "1"
            - --port
            - "8000"
            - --host
            - "0.0.0.0"
          ports:
            - name: api
              containerPort: 8000
          resources:
            requests:
              memory: "16Gi"
              cpu: "4"
              nvidia.com/gpu: "1"
            limits:
              memory: "96Gi"
              cpu: "16"
              nvidia.com/gpu: "1"
          volumeMounts:
            - name: dshm
              mountPath: /dev/shm
            - name: model-cache
              mountPath: /model-cache
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 300 # Model loading takes 5+ minutes
            periodSeconds: 15
            failureThreshold: 20
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            # Restarts the container if /health still fails roughly 450 s after
            # start (360 + 3 x 30); raise these values if loading takes longer
            initialDelaySeconds: 360
            periodSeconds: 30
            failureThreshold: 3
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-headless
  namespace: hermes-prod
spec:
  clusterIP: None
  selector:
    app: vllm-hermes
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
  namespace: hermes-prod
spec:
  type: ClusterIP
  selector:
    app: vllm-hermes
  ports:
    - name: api # also serves /metrics for Prometheus
      port: 8000
      targetPort: 8000
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 300
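Before settling on `--tensor-parallel-size` and the `nvidia.com/gpu` request, it helps to check whether the weights fit at all. The sketch below is a rough lower bound under stated assumptions: it counts weight memory only (parameters times bytes per parameter) and ignores KV cache, activations, and runtime overhead, so real deployments need more headroom. The function names are illustrative.

```python
import math

# Bytes per parameter for common weight formats
BYTES_PER_PARAM = {"float32": 4.0, "bfloat16": 2.0, "float16": 2.0,
                   "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billions: float, dtype: str) -> float:
    """Approximate weight size in GB (1e9 bytes): billions of params x bytes each."""
    return params_billions * BYTES_PER_PARAM[dtype]


def min_gpus_for_weights(params_billions: float, dtype: str,
                         gpu_gb: float = 80.0, utilization: float = 0.90) -> int:
    """Smallest GPU count whose usable memory holds the weights alone."""
    usable_per_gpu = gpu_gb * utilization
    return math.ceil(weight_memory_gb(params_billions, dtype) / usable_per_gpu)


bf16_gb = weight_memory_gb(70, "bfloat16")
print(bf16_gb, min_gpus_for_weights(70, "bfloat16"))
print(weight_memory_gb(70, "int4"), min_gpus_for_weights(70, "int4"))
```

For a 70B model in bfloat16 the weights alone exceed a single 80 GB card, which is why a one-GPU Pod needs a quantized checkpoint. vLLM also requires the tensor-parallel size to divide the model's attention head count, so in practice the choice is usually among 1, 2, 4, or 8.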
48.6 Hermes Agent Deployment (Stateless API Layer)
# hermes-agent-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hermes-agent
  namespace: hermes-prod
spec:
  replicas: 3 # HPA will adjust dynamically
  selector:
    matchLabels:
      app: hermes-agent
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 2
  template:
    metadata:
      labels:
        app: hermes-agent
        tier: api
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9090"
    spec:
      terminationGracePeriodSeconds: 60
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values: [hermes-agent]
                topologyKey: "kubernetes.io/hostname"
      containers:
        - name: hermes-agent
          image: your-registry/hermes-agent:1.0.0
          imagePullPolicy: Always
          ports:
            - name: mcp
              containerPort: 8765
            - name: metrics
              containerPort: 9090
          envFrom:
            - configMapRef:
                name: hermes-config
          env:
            - name: API_KEY
              valueFrom:
                secretKeyRef:
                  name: hermes-secrets
                  key: api-key
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
          resources:
            requests:
              memory: "512Mi"
              cpu: "500m"
            limits:
              memory: "4Gi"
              cpu: "4000m"
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8765
            initialDelaySeconds: 15
            periodSeconds: 10
            failureThreshold: 3
          livenessProbe:
            httpGet:
              path: /health
              port: 8765
            initialDelaySeconds: 30
            periodSeconds: 30
            failureThreshold: 3
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 5"] # Allow LB route drain
---
apiVersion: v1
kind: Service
metadata:
  name: hermes-agent-service
  namespace: hermes-prod
spec:
  type: ClusterIP
  selector:
    app: hermes-agent
  ports:
    - name: mcp
      port: 8765
      targetPort: 8765
    - name: metrics
      port: 9090
      targetPort: 9090
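The `maxUnavailable` and `maxSurge` settings bound how many Pods exist during a rollout. A small sketch of that arithmetic, assuming absolute values as in the manifest above (percentage values are rounded by Kubernetes and are not handled here); `rollout_bounds` is an illustrative name.

```python
def rollout_bounds(replicas: int, max_unavailable: int, max_surge: int) -> dict[str, int]:
    """Pod-count envelope during a RollingUpdate with absolute limits."""
    return {
        # At least this many Pods stay available while old ones are replaced
        "min_available": replicas - max_unavailable,
        # At most this many Pods (old + new) exist at any moment
        "max_total": replicas + max_surge,
    }


bounds = rollout_bounds(replicas=3, max_unavailable=1, max_surge=2)
print(bounds)
```

With three replicas, capacity never drops below two serving Pods, and the cluster needs room for up to five Agent Pods while the update is in flight.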
48.7 Horizontal Pod Autoscaler (HPA)
# hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: hermes-agent-hpa
  namespace: hermes-prod
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: hermes-agent
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
    # Custom metric: request queue depth
    # (requires a custom metrics API provider such as Prometheus Adapter)
    - type: Pods
      pods:
        metric:
          name: hermes_request_queue_length
        target:
          type: AverageValue
          averageValue: "10"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
        - type: Pods
          value: 3
          periodSeconds: 60
        - type: Percent
          value: 50
          periodSeconds: 60
      selectPolicy: Max
    scaleDown:
      stabilizationWindowSeconds: 300 # Wait 5 min before scale-down
      policies:
        - type: Pods
          value: 1
          periodSeconds: 120
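The HPA's behavior follows from one documented formula: desired replicas = ceil(current replicas x current metric / target), evaluated per metric, with the largest proposal winning. The sketch below is a simplified model of that calculation plus the scale-up policies above; it assumes the default 10% tolerance and ignores stabilization windows and Pods that are not yet ready. Function names are illustrative.

```python
import math


def desired_replicas(current: int, metric_value: float, target: float,
                     tolerance: float = 0.1) -> int:
    """Core HPA formula for one metric; no change inside the tolerance band."""
    ratio = metric_value / target
    if abs(ratio - 1.0) <= tolerance:
        return current
    return math.ceil(current * ratio)


def hpa_decision(current: int, metrics: list[tuple[float, float]],
                 min_replicas: int = 2, max_replicas: int = 20) -> int:
    """Take the largest per-metric proposal, clamped to the configured range."""
    proposal = max(desired_replicas(current, value, target)
                   for value, target in metrics)
    return max(min_replicas, min(max_replicas, proposal))


def scale_up_cap(current: int, pods: int = 3, percent: int = 50) -> int:
    """Upper bound for one period with selectPolicy: Max (the looser policy wins)."""
    return max(current + pods, math.ceil(current * (1 + percent / 100)))


# 4 replicas; CPU at 90% (target 70), memory at 60% (target 80),
# average queue length 25 per Pod (target 10)
wanted = hpa_decision(4, [(90, 70), (60, 80), (25, 10)])
allowed = min(wanted, scale_up_cap(4))
print(wanted, allowed)
```

In this made-up scenario the queue metric asks for 10 replicas, but the scale-up policies admit at most 7 in the first 60-second period. Note that CPU and memory utilization are measured against the container's requests (500m CPU, 512Mi), not its limits.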
KEDA for Advanced Scaling Triggers
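# keda-scaledobject.yaml
# Use this ScaledObject instead of a hand-written HPA for hermes-agent, not
# alongside one: KEDA creates and manages its own HPA for the target.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: hermes-agent-scaler
  namespace: hermes-prod
spec:
  scaleTargetRef:
    name: hermes-agent
  minReplicaCount: 2
  maxReplicaCount: 20
  pollingInterval: 15
  cooldownPeriod: 300
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090
        metricName: hermes_request_queue_length
        threshold: "10" # target queue length per replica
        # sum(), not avg(): the replica count is the query result divided by
        # the threshold, so the query must return the cluster-wide total
        query: |
          sum(hermes_request_queue_length{namespace="hermes-prod"})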
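How a trigger's query and threshold turn into a replica count is easy to get wrong. Assuming KEDA's default `AverageValue` metric type, the HPA it generates computes ceil(query result / threshold), so a query that returns a per-Pod average and one that returns a cluster-wide total produce very different targets. A minimal sketch with made-up queue lengths:

```python
import math


def keda_replicas(query_result: float, threshold: float,
                  min_replicas: int = 2, max_replicas: int = 20) -> int:
    """Replica target for an external metric with AverageValue semantics."""
    proposal = math.ceil(query_result / threshold)
    return max(min_replicas, min(max_replicas, proposal))


# Hypothetical per-Pod queue lengths on three running replicas
queue_lengths = [30, 25, 35]

from_sum = keda_replicas(sum(queue_lengths), threshold=10)
from_avg = keda_replicas(sum(queue_lengths) / len(queue_lengths), threshold=10)
print(from_sum, from_avg)
```

With a total of 90 queued requests and a threshold of 10 per replica, the summed query asks for 9 replicas, while the averaged query asks for 3 and therefore never scales the three existing Pods.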
48.8 Ingress Configuration
# ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: hermes-ingress
  namespace: hermes-prod
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-buffering: "off" # Required for streaming
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
    nginx.ingress.kubernetes.io/limit-rps: "30"
    nginx.ingress.kubernetes.io/limit-connections: "20"
spec:
  ingressClassName: nginx # replaces the deprecated kubernetes.io/ingress.class annotation
  tls:
    - hosts:
        - hermes-api.your-domain.com
      secretName: hermes-tls
  rules:
    - host: hermes-api.your-domain.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: hermes-agent-service
                port:
                  number: 8765
48.9 Persistent Storage
# storage.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: hermes-shared-storage
provisioner: nfs.csi.k8s.io
parameters:
  server: nfs-server.internal
  share: /data/hermes
reclaimPolicy: Retain
allowVolumeExpansion: true
---
# Shared model storage (ReadWriteMany: all GPU pods can read)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-storage-pvc
  namespace: hermes-prod
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: hermes-shared-storage
  resources:
    requests:
      storage: 500Gi
---
# Application data (ReadWriteOnce: mounted by a single node at a time)
# Assumes a StorageClass named fast-ssd already exists in the cluster
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: hermes-data-pvc
  namespace: hermes-prod
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: fast-ssd
  resources:
    requests:
      storage: 50Gi
48.10 StatefulSet vs Deployment Decision Guide
When to Use Each
| Dimension | StatefulSet | Deployment |
|---|---|---|
| State | Stateful (stable Pod identity) | Stateless |
| Network identity | Stable DNS (pod-0.service) | Random IP |
| Storage | Per-Pod PVC, survives restart | Shared or ephemeral |
| Scaling order | Ordered (0→1→2 up, 2→1→0 down) | Unordered, parallel |
| Pod replacement | Same name and PVC; the node may change | New Pod with a new name |
| Best for | Databases, GPU inference nodes | API servers, processors |
Hermes Architecture Layer Selection
| Component | Type | Reason |
|---|---|---|
| vLLM (GPU inference) | StatefulSet | Needs stable identity and ordered startup; stable DNS names are required if tensor parallelism spans Pods |
| Hermes Agent (MCP Server) | Deployment | Stateless, fast horizontal scale, no GPU |
| Redis (session cache) | StatefulSet | Persistent data and primary/replica relationships |
| Nginx (proxy) | Deployment | Stateless, replaceable at any time |
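The "stable DNS" row is what lets other components address one specific replica. Each StatefulSet Pod gets a name of the form `<statefulset>-<ordinal>` and, through the headless Service, a DNS record under it. The sketch below generates those names for the vLLM layer; it assumes the default cluster domain `cluster.local`, and the function name is illustrative.

```python
def statefulset_pod_dns(name: str, headless_service: str, namespace: str,
                        replicas: int, cluster_domain: str = "cluster.local") -> list[str]:
    """Fully qualified DNS names of StatefulSet Pods behind a headless Service."""
    return [
        f"{name}-{ordinal}.{headless_service}.{namespace}.svc.{cluster_domain}"
        for ordinal in range(replicas)
    ]


vllm_hosts = statefulset_pod_dns("vllm-hermes", "vllm-headless", "hermes-prod", replicas=2)
for host in vllm_hosts:
    print(host)
```

A Deployment offers no equivalent: its Pod names carry a random suffix and change on every replacement, so clients can only reach the set as a whole through a Service.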
48.11 Monitoring Configuration
# prometheus-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: hermes-alerts
  namespace: hermes-prod
spec:
  groups:
    - name: hermes.rules
      rules:
        - alert: GPUMemoryHigh
          # Framebuffer usage in percent. DCGM_FI_DEV_MEM_COPY_UTIL measures
          # memory bandwidth utilization, not how full GPU memory is.
          # vLLM preallocates up to --gpu-memory-utilization, so keep the
          # alert threshold above that fraction to avoid constant firing.
          expr: |
            100 * DCGM_FI_DEV_FB_USED
              / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) > 90
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "GPU memory >90% on {{ $labels.instance }}"
        - alert: VLLMQueueBacklog
          expr: vllm:num_requests_waiting > 100
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "vLLM queue: {{ $value }} requests waiting"
        - alert: HermesPodRestarting
          expr: |
            increase(kube_pod_container_status_restarts_total{
              namespace="hermes-prod",
              pod=~"hermes-agent.*"
            }[15m]) > 3
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.pod }} restarting frequently"
Chapter Summary
Kubernetes deployment of Hermes follows a clear two-layer architecture:
Inference layer (vLLM) → StatefulSet
- Dedicated GPU per Pod: each Pod requests one nvidia.com/gpu, which the device plugin allocates exclusively
- Stable network identity: lets clients address a specific replica, and is required if tensor parallelism spans Pods
- Ordered scaling: prevents race conditions during model loading
API layer (Hermes Agent) → Deployment
- Fully stateless: any Pod can handle any request
- Fast horizontal scaling via HPA
- Rolling updates with zero downtime
Key operational principles:
- GPU scheduling requires NVIDIA Device Plugin + node labels + tolerations
- Model files use a ReadWriteMany PVC (shared across all GPU Pods)
- HPA on CPU/memory metrics works for the Agent layer; custom Prometheus metrics work better for queue-based scaling
- sessionAffinity: ClientIP on the vLLM Service provides soft request affinity
Review Questions
- When vLLM uses tensor parallelism (multiple Pods jointly processing a single request), how does Kubernetes networking support the high-bandwidth inter-Pod communication required? How would you configure NVLink or InfiniBand within a K8s cluster?
- HPA scales based on CPU utilization, but Hermes Agent's bottleneck is typically waiting for vLLM responses (CPU is idle but latency is high). Design a more accurate scaling metric that reflects true service load rather than CPU usage.
- During a rolling update, the new Hermes Agent image gradually replaces old Pods. If the new version has a latent bug that only manifests after 5 minutes of traffic, can the existing readiness and liveness probes detect it? Design a more robust canary release strategy that would catch this class of failure.