Chapter 48

Cluster Deployment: Kubernetes Architecture

Introduction

Single-machine Docker handles "stable operation." Kubernetes handles "elastic scale." When Hermes Agent needs to span multiple GPU servers, auto-scale at peak load, and self-heal after failures, Kubernetes becomes non-negotiable infrastructure. This chapter delivers a complete K8s deployment blueprint: GPU resource scheduling, horizontal autoscaling, cross-node load balancing, and a principled analysis of when to use StatefulSet versus Deployment.

48.1 Cluster Architecture

┌─────────────────────────────────────────────────────────────────────────┐
│                  Kubernetes Cluster — Hermes Production                  │
│                                                                         │
│  ┌───────────────────────────────────────────────────────────────────┐  │
│  │           Ingress Controller (nginx-ingress / Traefik)            │  │
│  │           TLS termination + routing                               │  │
│  └───────────────────────────────┬───────────────────────────────────┘  │
│                                  │                                      │
│  ┌───────────────────────────────▼───────────────────────────────────┐  │
│  │           hermes-agent Service (ClusterIP)                        │  │
│  │           Load balances → MCP Server Pods                         │  │
│  └──────┬──────────────────┬──────────────────┬────────────────────┘  │
│         │                  │                  │                         │
│  ┌──────▼──────┐    ┌──────▼──────┐    ┌──────▼──────┐               │
│  │hermes-agent │    │hermes-agent │    │hermes-agent │  Deployment     │
│  │   Pod 0     │    │   Pod 1     │    │   Pod N     │  (stateless)    │
│  └──────┬──────┘    └──────┬──────┘    └──────┬──────┘               │
│         └──────────────────┼──────────────────┘                       │
│                            │ internal calls                            │
│  ┌─────────────────────────▼──────────────────────────────────────┐   │
│  │           vllm-service Service (ClusterIP)                      │   │
│  │           Load balances → vLLM Pods                             │   │
│  └──────┬──────────────────────────────┬──────────────────────────┘   │
│         │                              │                               │
│  ┌──────▼───────────────┐    ┌─────────▼──────────────┐              │
│  │    vLLM Pod 0        │    │    vLLM Pod 1          │  StatefulSet  │
│  │  ┌────────────────┐  │    │  ┌────────────────┐   │  (stateful,   │
│  │  │ nvidia GPU 0   │  │    │  │ nvidia GPU 1   │   │   GPU-bound)  │
│  │  │ (A100 80GB)    │  │    │  │ (A100 80GB)    │   │              │
│  │  └────────────────┘  │    │  └────────────────┘   │              │
│  └──────────────────────┘    └────────────────────────┘              │
│                                                                       │
│  ┌──────────────────────────────────────────────────────────────────┐ │
│  │  Persistent Storage                                              │ │
│  │  PVC: model-storage (ReadWriteMany, NFS/CephFS)                  │ │
│  │  PVC: hermes-data   (ReadWriteOnce, SSD)                         │ │
│  └──────────────────────────────────────────────────────────────────┘ │
│                                                                       │
│  ┌──────────────────────────────────────────────────────────────────┐ │
│  │  Monitoring: Prometheus + Grafana + NVIDIA DCGM Exporter         │ │
│  └──────────────────────────────────────────────────────────────────┘ │
└───────────────────────────────────────────────────────────────────────┘

48.2 Prerequisites: GPU Node Setup

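# Install NVIDIA Container Toolkit on each GPU node (Debian/Ubuntu)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
    sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
# Register the apt repository; without it the install below finds no package
curl -fsSL https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
# Configure the runtime the kubelet actually uses. Kubernetes 1.24+ removed
# dockershim, so most clusters run containerd; use --runtime=docker only if
# the nodes still run Docker through cri-dockerd.
sudo nvidia-ctk runtime configure --runtime=containerd
sudo systemctl restart containerd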

# Install NVIDIA Device Plugin into K8s
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.5/nvidia-device-plugin.yml

# Verify GPU resources are visible
kubectl describe nodes | grep -A 5 "nvidia.com/gpu"
# Expect nvidia.com/gpu under Capacity and Allocatable, e.g. nvidia.com/gpu: 4 on a 4-GPU node

# Label GPU nodes for targeted scheduling
kubectl label nodes gpu-node-01 gpu=a100 gpu-memory=80g
kubectl label nodes gpu-node-02 gpu=a100 gpu-memory=80g
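
Every manifest in the rest of this chapter targets the hermes-prod namespace, and it is worth proving that a Pod can actually reach a GPU before deploying vLLM. The sketch below creates the namespace and runs a throwaway Pod that requests one GPU and prints nvidia-smi output. The Pod name gpu-smoke-test and the CUDA image tag are assumptions for illustration; substitute any CUDA base image compatible with the installed driver.

```yaml
# gpu-smoke-test.yaml (illustrative; Pod name and image tag are assumptions)
apiVersion: v1
kind: Namespace
metadata:
  name: hermes-prod
---
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
  namespace: hermes-prod
spec:
  restartPolicy: Never
  nodeSelector:
    gpu: a100                  # Label applied to the GPU nodes above
  tolerations:
    - key: "nvidia.com/gpu"
      operator: "Exists"
      effect: "NoSchedule"
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04   # Assumed tag
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1    # Forces scheduling through the device plugin
```

Apply it with kubectl apply -f, read the result with kubectl logs gpu-smoke-test -n hermes-prod, then delete the Pod so the GPU is released.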

48.3 ConfigMap

# configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: hermes-config
  namespace: hermes-prod
data:
  HERMES_BASE_URL: "http://vllm-service:8000"
  HERMES_MODEL: "NousResearch/Hermes-3-Llama-3.1-70B"
  MAX_TOKENS: "4096"
  TEMPERATURE: "0.1"
  CONTEXT_WINDOW: "65536"
  REQUEST_TIMEOUT: "120"
  LOG_LEVEL: "INFO"
  LOG_FORMAT: "json"
  ENABLE_RATE_LIMITING: "true"
  MAX_REQUESTS_PER_MINUTE: "60"
  PROMETHEUS_ENABLED: "true"
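
The model and context window set here largely determine how many GPUs each vLLM Pod needs. A rough estimate follows, assuming bfloat16 weights and KV cache (2 bytes per value) and the published Llama 3.1 70B shape (80 layers, 8 key/value heads, head dimension 128). Treat the figures as a sizing sketch, not a measurement.

```latex
\begin{align*}
M_{\text{weights}} &\approx 70 \times 10^{9}\ \text{params} \times 2\ \text{B} = 140\ \text{GB} \\
M_{\text{KV/token}} &= 2 \times 80 \times 8 \times 128 \times 2\ \text{B} = 327{,}680\ \text{B} \\
M_{\text{KV}}(65{,}536\ \text{tokens}) &= 327{,}680\ \text{B} \times 65{,}536 = 20\ \text{GiB} \approx 21.5\ \text{GB}
\end{align*}
```

The weights alone exceed a single 80 GB A100, so a bfloat16 70B replica needs at least two such GPUs, and typically more once KV cache for long contexts is included. A quantized checkpoint or a smaller model is the alternative.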

48.4 Secrets

# Create secrets from files (never store real values in YAML)
kubectl create secret generic hermes-secrets \
    --from-literal=api-key="$(cat secrets/api_key.txt)" \
    --from-literal=hf-token="$(cat secrets/hf_token.txt)" \
    -n hermes-prod

# Production: use Vault / Sealed Secrets / External Secrets Operator
# base64 in Secret YAML is NOT encryption
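
As one illustration of the production route, External Secrets Operator can materialize the same hermes-secrets object from an external store, so no secret value ever appears in Git. This is a sketch under assumptions: the ClusterSecretStore name vault-backend and the remote path hermes/prod are hypothetical, and the apiVersion depends on the operator release installed (v1beta1 is shown here).

```yaml
# external-secret.yaml (illustrative; store name and remote path are hypothetical)
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: hermes-secrets
  namespace: hermes-prod
spec:
  refreshInterval: 1h            # Re-sync from the external store hourly
  secretStoreRef:
    kind: ClusterSecretStore
    name: vault-backend          # Hypothetical store
  target:
    name: hermes-secrets         # Kubernetes Secret the workloads reference
    creationPolicy: Owner
  data:
    - secretKey: api-key
      remoteRef:
        key: hermes/prod         # Hypothetical path
        property: api-key
    - secretKey: hf-token
      remoteRef:
        key: hermes/prod
        property: hf-token
```

Because the resulting Secret keeps the name and keys used by kubectl create secret above, the StatefulSet and Deployment manifests need no changes.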

48.5 vLLM StatefulSet (GPU Inference Layer)

# vllm-statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: vllm-hermes
  namespace: hermes-prod
spec:
  serviceName: "vllm-headless"
  replicas: 2

  selector:
    matchLabels:
      app: vllm-hermes

  template:
    metadata:
      labels:
        app: vllm-hermes
        tier: inference
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8000"    # vLLM serves /metrics on its API port

    spec:
      # Spread across different physical nodes
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values: [vllm-hermes]
              topologyKey: "kubernetes.io/hostname"

        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              preference:
                matchExpressions:
                  - key: gpu
                    operator: In
                    values: [a100]

      tolerations:
        - key: "nvidia.com/gpu"
          operator: "Exists"
          effect: "NoSchedule"

      volumes:
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 8Gi
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-storage-pvc

      initContainers:
        - name: model-downloader
          image: python:3.11-slim
          command:
            - sh
            - -c
            - |
              pip install huggingface_hub -q
              # The CLI filters files with --include/--exclude
              # (ignore_patterns is the Python API spelling)
              huggingface-cli download \
                NousResearch/Hermes-3-Llama-3.1-70B \
                --token "${HF_TOKEN}" \
                --local-dir /model-cache/hermes-70b \
                --exclude "*.pt"
          env:
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hermes-secrets
                  key: hf-token
          volumeMounts:
            - name: model-cache
              mountPath: /model-cache
          resources:
            requests:
              memory: "2Gi"
              cpu: "1"

      containers:
        - name: vllm
          image: vllm/vllm-openai:latest    # Pin a specific release tag in production

          args:
            - --model
            - /model-cache/hermes-70b
            - --dtype
            - bfloat16
            - --max-model-len
            - "65536"
            - --gpu-memory-utilization
            - "0.90"
            - --max-num-seqs
            - "256"
            - --enable-prefix-caching
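            # 70B parameters in bfloat16 are about 140 GB of weights, more than one
            # 80 GB A100. Raise this together with the nvidia.com/gpu request and
            # limit below, or serve a quantized checkpoint instead.
            - --tensor-parallel-size
            - "1"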
            - --port
            - "8000"
            - --host
            - "0.0.0.0"

          ports:
            - name: api
              containerPort: 8000
            # vLLM serves /metrics on the api port, so no separate metrics port is declared

          resources:
            requests:
              memory: "16Gi"
              cpu: "4"
              nvidia.com/gpu: "1"
            limits:
              memory: "96Gi"
              cpu: "16"
              nvidia.com/gpu: "1"

          volumeMounts:
            - name: dshm
              mountPath: /dev/shm
            - name: model-cache
              mountPath: /model-cache

          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 300    # Model loading takes 5+ minutes
            periodSeconds: 15
            failureThreshold: 20

          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 360
            periodSeconds: 30
            failureThreshold: 3

---
apiVersion: v1
kind: Service
metadata:
  name: vllm-headless
  namespace: hermes-prod
spec:
  clusterIP: None
  selector:
    app: vllm-hermes

---
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
  namespace: hermes-prod
spec:
  type: ClusterIP
  selector:
    app: vllm-hermes
  ports:
    - name: api
      port: 8000
      targetPort: 8000
    # Metrics are scraped from /metrics on the api port; vLLM opens no separate metrics listener
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 300

48.6 Hermes Agent Deployment (Stateless API Layer)

# hermes-agent-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hermes-agent
  namespace: hermes-prod
spec:
  replicas: 3    # HPA will adjust dynamically

  selector:
    matchLabels:
      app: hermes-agent

  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 2

  template:
    metadata:
      labels:
        app: hermes-agent
        tier: api
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9090"

    spec:
      terminationGracePeriodSeconds: 60

      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values: [hermes-agent]
                topologyKey: "kubernetes.io/hostname"

      containers:
        - name: hermes-agent
          image: your-registry/hermes-agent:1.0.0
          imagePullPolicy: Always

          ports:
            - name: mcp
              containerPort: 8765
            - name: metrics
              containerPort: 9090

          envFrom:
            - configMapRef:
                name: hermes-config

          env:
            - name: API_KEY
              valueFrom:
                secretKeyRef:
                  name: hermes-secrets
                  key: api-key
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name

          resources:
            requests:
              memory: "512Mi"
              cpu: "500m"
            limits:
              memory: "4Gi"
              cpu: "4000m"

          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8765
            initialDelaySeconds: 15
            periodSeconds: 10
            failureThreshold: 3

          livenessProbe:
            httpGet:
              path: /health
              port: 8765
            initialDelaySeconds: 30
            periodSeconds: 30
            failureThreshold: 3

          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 5"]  # Allow LB route drain

---
apiVersion: v1
kind: Service
metadata:
  name: hermes-agent-service
  namespace: hermes-prod
spec:
  type: ClusterIP
  selector:
    app: hermes-agent
  ports:
    - name: mcp
      port: 8765
      targetPort: 8765
    - name: metrics
      port: 9090
      targetPort: 9090
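
The rollout settings above bound how many Pods exist at any moment during an update. As a worked example, taking the manifest's 3 replicas (the HPA may have changed this number by the time a rollout runs):

```latex
\begin{align*}
\text{min available} &= \text{replicas} - \text{maxUnavailable} = 3 - 1 = 2 \\
\text{max total} &= \text{replicas} + \text{maxSurge} = 3 + 2 = 5
\end{align*}
```

The cluster therefore needs headroom for two extra Agent Pods during a rollout, and serving capacity can briefly drop to two Pods.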

48.7 Horizontal Pod Autoscaler (HPA)

# hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: hermes-agent-hpa
  namespace: hermes-prod
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: hermes-agent

  minReplicas: 2
  maxReplicas: 20

  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70

    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80

    # Custom metric: request queue depth (needs a custom metrics adapter such as prometheus-adapter)
    - type: Pods
      pods:
        metric:
          name: hermes_request_queue_length
        target:
          type: AverageValue
          averageValue: "10"

  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
        - type: Pods
          value: 3
          periodSeconds: 60
        - type: Percent
          value: 50
          periodSeconds: 60
      selectPolicy: Max

    scaleDown:
      stabilizationWindowSeconds: 300    # Wait 5 min before scale-down
      policies:
        - type: Pods
          value: 1
          periodSeconds: 120
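
For reference, the HPA controller derives a replica count for each metric from the ratio of observed to target value, then takes the largest result across metrics. A worked example with hypothetical numbers (4 current replicas, an observed average queue length of 25 against the target of 10):

```latex
\begin{align*}
\text{desired} &= \left\lceil \text{current} \times \frac{\text{observed}}{\text{target}} \right\rceil
               = \left\lceil 4 \times \frac{25}{10} \right\rceil = 10 \\
\text{scale-up cap (first 60 s)} &= \max\bigl(4 + 3,\ \lceil 4 \times 1.5 \rceil\bigr) = \max(7, 6) = 7
\end{align*}
```

With selectPolicy: Max, the scaleUp policies admit 7 replicas in the first minute, and the remaining 3 follow in the next period.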

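KEDA for Advanced Scaling Triggers

KEDA generates and manages its own HPA for the target workload, so apply the ScaledObject in place of hpa.yaml rather than alongside it; two autoscalers on one Deployment will fight over the replica count.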

# keda-scaledobject.yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: hermes-agent-scaler
  namespace: hermes-prod
spec:
  scaleTargetRef:
    name: hermes-agent
  minReplicaCount: 2
  maxReplicaCount: 20
  pollingInterval: 15
  cooldownPeriod: 300

  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090
        metricName: hermes_request_queue_length
        threshold: "10"
        # sum, not avg: the generated HPA divides this value by the threshold to get the replica count
        query: |
          sum(hermes_request_queue_length{namespace="hermes-prod"})

48.8 Ingress Configuration

# ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: hermes-ingress
  namespace: hermes-prod
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-buffering: "off"    # Required for streaming
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
    nginx.ingress.kubernetes.io/limit-rps: "30"
    nginx.ingress.kubernetes.io/limit-connections: "20"
spec:
  ingressClassName: nginx    # Replaces the deprecated kubernetes.io/ingress.class annotation
  tls:
    - hosts:
        - hermes-api.your-domain.com
      secretName: hermes-tls
  rules:
    - host: hermes-api.your-domain.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: hermes-agent-service
                port:
                  number: 8765
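
The cert-manager.io/cluster-issuer annotation assumes a ClusterIssuer named letsencrypt-prod already exists. A minimal sketch of one using the ACME HTTP-01 solver follows. The contact email and the account-key Secret name are placeholders, and the ingressClassName solver field requires a reasonably recent cert-manager (older releases use class instead).

```yaml
# cluster-issuer.yaml (illustrative; email and Secret name are placeholders)
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: ops@your-domain.com             # Placeholder contact address
    privateKeySecretRef:
      name: letsencrypt-prod-account-key   # Placeholder; stores the ACME account key
    solvers:
      - http01:
          ingress:
            ingressClassName: nginx        # Challenge traffic goes through the same controller
```

Once the issuer reports Ready, cert-manager populates the hermes-tls Secret referenced in the Ingress.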

48.9 Persistent Storage

# storage.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: hermes-shared-storage
provisioner: nfs.csi.k8s.io
parameters:
  server: nfs-server.internal
  share: /data/hermes
reclaimPolicy: Retain
allowVolumeExpansion: true

---
# Shared model storage (ReadWriteMany — all GPU pods can read)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-storage-pvc
  namespace: hermes-prod
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: hermes-shared-storage
  resources:
    requests:
      storage: 500Gi

---
# Application data (ReadWriteOnce — single Pod)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: hermes-data-pvc
  namespace: hermes-prod
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: fast-ssd    # Assumes an SSD-backed StorageClass with this name already exists
  resources:
    requests:
      storage: 50Gi

48.10 StatefulSet vs Deployment Decision Guide

When to Use Each

Dimension	StatefulSet	Deployment
State	Stateful (stable Pod identity)	Stateless
Network identity	Stable per-Pod DNS (e.g. vllm-hermes-0.vllm-headless)	Random Pod name and IP, reached through a Service
Storage	Per-Pod PVC, survives restart	Shared or ephemeral
Scaling order	Ordered (0→1→2 up, 2→1→0 down)	Random, parallel
Pod replacement	Same name and PVC; the node may change	Brand new Pod with a new name
Best for	Databases, GPU inference nodes	API servers, processors

Hermes Architecture Layer Selection

Component	Type	Reason
vLLM (GPU inference)	StatefulSet	Stable Pod names and DNS, plus ordered startup while each replica loads the model
Hermes Agent (MCP Server)	Deployment	Stateless, fast horizontal scale, no GPU
Redis (session cache)	StatefulSet	Persistent data and primary/replica relationships
Nginx (proxy)	Deployment	Stateless, replaceable at any time

48.11 Monitoring Configuration

# prometheus-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: hermes-alerts
  namespace: hermes-prod
spec:
  groups:
    - name: hermes.rules
      rules:
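        - alert: GPUMemoryHigh
          # DCGM_FI_DEV_MEM_COPY_UTIL measures memory bandwidth, not capacity, so use
          # the framebuffer gauges. vLLM preallocates about 90% by design
          # (--gpu-memory-utilization), hence a threshold above that level.
          expr: |
            100 * DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) > 95
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "GPU memory >95% on {{ $labels.instance }}"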

        - alert: VLLMQueueBacklog
          expr: vllm:num_requests_waiting > 100
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "vLLM queue: {{ $value }} requests waiting"

        - alert: HermesPodRestarting
          expr: |
            increase(kube_pod_container_status_restarts_total{
              namespace="hermes-prod",
              pod=~"hermes-agent.*"
            }[15m]) > 3
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.pod }} restarting frequently"
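
PrometheusRule is a Prometheus Operator resource, and by default the operator does not act on the prometheus.io/scrape annotations used earlier; it discovers targets through ServiceMonitor or PodMonitor objects. Assuming the operator is what runs Prometheus here, a sketch of a PodMonitor for the vLLM Pods could look like this. The resource name is hypothetical; the port refers to the api container port declared in 48.5, where vLLM serves /metrics.

```yaml
# vllm-podmonitor.yaml (illustrative; resource name is hypothetical)
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: vllm-hermes
  namespace: hermes-prod
spec:
  selector:
    matchLabels:
      app: vllm-hermes         # Same label the StatefulSet template sets
  podMetricsEndpoints:
    - port: api                # Named container port (8000)
      path: /metrics
      interval: 15s
```

Depending on how the Prometheus resource is configured, the PodMonitor may also need labels that match its podMonitorSelector.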

Chapter Summary

Kubernetes deployment of Hermes follows a clear two-layer architecture:

Inference layer (vLLM) → StatefulSet

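Dedicated GPU: each Pod requests its own nvidia.com/gpu, which the device plugin allocates at scheduling time
Stable network identity: predictable per-Pod DNS names, needed if tensor parallelism is extended across Pods
Ordered scaling: Pods start one at a time, which prevents races on the shared model volume during loading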

API layer (Hermes Agent) → Deployment

Fully stateless: any Pod can handle any request
Fast horizontal scaling via HPA
Rolling updates with zero downtime

Key operational principles:

GPU scheduling requires NVIDIA Device Plugin + node labels + tolerations
Model files use ReadWriteMany PVC (shared across all GPU Pods)
HPA on CPU/memory metrics works for the Agent layer; custom Prometheus metrics work better for queue-based scaling
sessionAffinity: ClientIP on the vLLM Service provides soft request affinity, so each Agent Pod tends to reach the same vLLM Pod and benefit from its prefix cache

Review Questions

When vLLM uses tensor parallelism (multiple Pods jointly processing a single request), how does Kubernetes networking support the high-bandwidth inter-Pod communication required? How would you configure NVLink or InfiniBand within a K8s cluster?
HPA scales based on CPU utilization, but Hermes Agent's bottleneck is typically waiting for vLLM responses (CPU is idle but latency is high). Design a more accurate scaling metric that reflects true service load rather than CPU usage.
During a rolling update, the new Hermes Agent image gradually replaces old Pods. If the new version has a latent bug that only manifests after 5 minutes of traffic, can the existing readiness and liveness probes detect it? Design a more robust canary release strategy that would catch this class of failure.
