Chapter 49

Load Balancing and Horizontal Scaling

Introduction

When Hermes Agent moves from a lab prototype to a production service, a single-instance deployment quickly becomes the bottleneck. As concurrent requests increase, response latency climbs, errors start to appear, and stability deteriorates. This chapter explains how to scale Hermes Agent into a highly available distributed service using Nginx or HAProxy reverse proxies, stateless design, and a Redis shared memory layer.

49.1 Why Scaling Agents Is Harder Than Scaling Web Services

Scaling traditional web services is relatively simple: stateless HTTP services can be arbitrarily replicated, and a load balancer distributes requests evenly. Hermes Agent introduces several critical differences:

1. Session State Persistence

During multi-step task execution, an Agent maintains substantial intermediate state:

Current execution step (which ReAct iteration)
Tool call history and results
Conversation context (message list)
User preferences and authorization tokens

If request A establishes a session on instance 1, and request B is forwarded to instance 2, instance 2 finds no context and the task either starts from scratch or fails outright.

2. Long Connections and Streaming Responses

Hermes Agent typically uses SSE (Server-Sent Events) or WebSocket to stream tokens back to the client. The load balancer must therefore keep one connection open for the entire duration of Agent execution, which can last several minutes.

3. Tool Call Side Effects

When an Agent calls tools (writing files, sending emails, querying databases), these operations cannot be naively retried. If an instance crashes and the task is rescheduled to another instance, side effects that already happened may be repeated.
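One common mitigation is to give every tool call a deterministic ID and record its result, so that a rescheduled task reuses the recorded result instead of repeating the side effect. The sketch below is illustrative only: `tool_call_id`, `IdempotentToolRunner`, and the in-memory `dict` store are hypothetical names, not part of Hermes Agent, and a real deployment would keep the results in shared storage so every instance sees them.

```python
import hashlib
import json
from typing import Any, Callable, Dict


def tool_call_id(session_id: str, step: int, tool: str, args: dict) -> str:
    """Derive a stable ID: the same call in the same step always hashes the same."""
    payload = json.dumps(
        {"session": session_id, "step": step, "tool": tool, "args": args},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class IdempotentToolRunner:
    """Runs a tool at most once per call ID (assumption: `store` is shared)."""

    def __init__(self, store: Dict[str, Any]):
        self.store = store

    def run(self, call_id: str, tool: Callable[..., Any], **kwargs: Any) -> Any:
        if call_id in self.store:
            # A previous attempt already produced a result: reuse it.
            return self.store[call_id]
        result = tool(**kwargs)
        self.store[call_id] = result
        return result


sent = []  # records real side effects so the demo can count them


def send_email(to: str, subject: str) -> dict:
    sent.append((to, subject))
    return {"status": "sent", "to": to}


shared_store: Dict[str, Any] = {}
args = {"to": "ops@example.com", "subject": "Report"}
call_id = tool_call_id("user1:abc", 3, "send_email", args)

# Two runner objects stand in for two Agent instances sharing one store.
first = IdempotentToolRunner(shared_store).run(call_id, send_email, **args)
retry = IdempotentToolRunner(shared_store).run(call_id, send_email, **args)
```

This narrows the problem but does not remove it. If an instance crashes after the side effect happens but before the result is stored, a retry will still repeat the operation, so tools with irreversible effects usually also need idempotency support on the receiving side.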

flowchart TD
    Client[Client] --> LB[Load Balancer]
    LB --> A1[Agent Instance 1]
    LB --> A2[Agent Instance 2]
    LB --> A3[Agent Instance 3]
    A1 --> Redis[(Redis Shared Memory)]
    A2 --> Redis
    A3 --> Redis
    A1 --> Tools[Tool Layer]
    A2 --> Tools
    A3 --> Tools
    Redis --> PG[(PostgreSQL Persistence)]

49.2 Nginx Reverse Proxy Configuration

Basic Configuration

# /etc/nginx/conf.d/hermes-agent.conf

# Send "Connection: upgrade" only on WebSocket handshakes. For plain HTTP and
# SSE requests the header is cleared so upstream keepalive connections are reused.
map $http_upgrade $connection_upgrade {
    default upgrade;
    ''      '';
}

upstream hermes_backend {
    # Session affinity via ip_hash. IPv4 clients are hashed on the first three
    # octets, so users behind the same NAT gateway share one instance.
    ip_hash;
    
    server 10.0.1.10:8000 weight=1 max_fails=3 fail_timeout=30s;
    server 10.0.1.11:8000 weight=1 max_fails=3 fail_timeout=30s;
    server 10.0.1.12:8000 weight=1 max_fails=3 fail_timeout=30s;
    
    keepalive 32;
}

server {
    listen 80;
    server_name agent.example.com;
    
    # Long-connection timeouts for SSE (maximum gap between two reads or writes)
    proxy_read_timeout 300s;
    proxy_send_timeout 300s;
    proxy_connect_timeout 10s;
    
    # Disable buffering (required for SSE)
    proxy_buffering off;
    proxy_cache off;
    
    location /api/agent/ {
        proxy_pass http://hermes_backend;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection $connection_upgrade;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header Cookie $http_cookie;
    }
    
    location /health {
        proxy_pass http://hermes_backend/health;
        access_log off;
    }
}
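Because `proxy_read_timeout` limits the gap between two successive reads from the backend, a stream that goes silent for longer than 300 seconds (for example during a slow tool call) is closed by Nginx. A common workaround is to emit SSE comment lines as heartbeats. The following is a minimal sketch under that assumption; `format_sse`, `sse_with_heartbeat`, and the 15-second default are illustrative names and values, not part of Hermes Agent.

```python
import asyncio
from typing import AsyncIterator, Optional

_DONE = object()  # sentinel marking the end of the token stream


def format_sse(data: str, event: Optional[str] = None) -> str:
    """Encode one SSE frame; multi-line payloads need one `data:` line each."""
    lines = [f"event: {event}"] if event else []
    lines += [f"data: {part}" for part in data.split("\n")]
    return "\n".join(lines) + "\n\n"


async def _next_or_done(iterator: AsyncIterator[str]):
    try:
        return await iterator.__anext__()
    except StopAsyncIteration:
        return _DONE


async def sse_with_heartbeat(
    tokens: AsyncIterator[str], interval: float = 15.0
) -> AsyncIterator[str]:
    """Yield SSE frames, inserting a comment line whenever the source is idle."""
    iterator = tokens.__aiter__()
    pending = asyncio.ensure_future(_next_or_done(iterator))
    try:
        while True:
            # asyncio.wait leaves the task running on timeout,
            # whereas asyncio.wait_for would cancel the token source.
            done, _ = await asyncio.wait({pending}, timeout=interval)
            if not done:
                yield ": keep-alive\n\n"  # lines starting with ":" are ignored by clients
                continue
            item = pending.result()
            if item is _DONE:
                break
            yield format_sse(item)
            pending = asyncio.ensure_future(_next_or_done(iterator))
    finally:
        pending.cancel()  # no-op if the task already finished


async def demo_tokens() -> AsyncIterator[str]:
    yield "Thinking"
    await asyncio.sleep(0.05)  # simulated slow tool call
    yield "Done"


async def collect() -> list:
    return [frame async for frame in sse_with_heartbeat(demo_tokens(), interval=0.01)]


frames = asyncio.run(collect())
```

The heartbeat interval only needs to be comfortably shorter than the smallest idle timeout on the path, including any proxies or cloud load balancers in front of Nginx.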

49.3 HAProxy Configuration (Recommended for Production)

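HAProxy offers more mature support for long connections and health checks than open-source Nginx, which only detects failures passively (max_fails / fail_timeout) and reserves active health checks for NGINX Plus: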

global
    maxconn 50000
    log /dev/log local0
    daemon

defaults
    log     global
    mode    http
    option  httplog
    timeout connect 5s
    timeout client  300s
    timeout server  300s
    timeout tunnel  1h

frontend hermes_frontend
    bind *:80
    bind *:443 ssl crt /etc/ssl/certs/agent.pem
    default_backend hermes_backend

backend hermes_backend
    balance leastconn
    cookie SERVERID insert indirect nocache
    
    option httpchk GET /health
    http-check expect status 200
    
    server agent1 10.0.1.10:8000 check cookie agent1 inter 5s rise 2 fall 3
    server agent2 10.0.1.11:8000 check cookie agent2 inter 5s rise 2 fall 3
    server agent3 10.0.1.12:8000 check cookie agent3 inter 5s rise 2 fall 3

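Why leastconn over roundrobin? Agent tasks vary widely in duration, so round-robin can stack several long-running tasks on one instance while the others sit nearly idle. leastconn routes each new connection to the instance with the fewest active connections, which evens out the number of in-flight tasks per instance. Note that the sticky cookie takes precedence: leastconn only picks a server for requests that arrive without a valid SERVERID cookie.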
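The difference can be illustrated with a toy simulation. This is a simplified model, not a reproduction of HAProxy's scheduler: one request arrives per time unit, there are no sticky cookies, and the workload (every third request is a long task) is chosen deliberately to show round-robin at its worst. The function name `peak_connections` is hypothetical.

```python
from typing import List


def peak_connections(durations: List[int], servers: int, policy: str) -> int:
    """Return the highest number of concurrent connections seen on any one server.

    One request arrives per time unit; request i stays open for durations[i] units.
    """
    end_times: List[List[int]] = [[] for _ in range(servers)]
    peak = 0
    for now, duration in enumerate(durations):
        # Drop connections that have already finished on each server.
        end_times = [[t for t in ends if t > now] for ends in end_times]
        if policy == "roundrobin":
            target = now % servers
        elif policy == "leastconn":
            # Fewest active connections wins; ties go to the lowest index.
            target = min(range(servers), key=lambda s: len(end_times[s]))
        else:
            raise ValueError(f"unknown policy: {policy}")
        end_times[target].append(now + duration)
        peak = max(peak, len(end_times[target]))
    return peak


# Every third request is a long Agent task (10 units); the rest are short (1 unit).
workload = [10, 1, 1] * 10

rr_peak = peak_connections(workload, servers=3, policy="roundrobin")
lc_peak = peak_connections(workload, servers=3, policy="leastconn")
```

With this workload, round-robin sends every long task to the same server, which peaks at 4 concurrent connections, while leastconn never exceeds 2 on any server.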

49.4 Stateless Design

The fundamental solution to session affinity problems is statelessness: move all state to shared storage so any instance can handle any request.

State Classification and Storage Strategy

State Type	Size	Update Frequency	Recommended Storage
Conversation history	~10KB	Every turn	Redis (TTL 24h)
Agent execution state	~1KB	Every step	Redis
Tool result cache	1KB–10MB	Write-once	Redis + S3
Long-term memory	MB–GB	Infrequent	PostgreSQL / Qdrant
User preferences	~1KB	Infrequent	PostgreSQL

Stateless Agent Implementation

# stateless_agent.py
import json
import uuid
from typing import Optional
import redis.asyncio as aioredis
from dataclasses import dataclass, asdict

@dataclass
class AgentSession:
    session_id: str
    messages: list
    current_step: int
    tool_results: dict
    metadata: dict
    
    def to_json(self) -> str:
        return json.dumps(asdict(self), ensure_ascii=False)
    
    @classmethod
    def from_json(cls, data: str) -> 'AgentSession':
        return cls(**json.loads(data))


class StatelessHermesAgent:
    """Stateless Hermes Agent — all state stored in Redis."""
    
    SESSION_TTL = 86400  # 24 hours
    SESSION_PREFIX = "hermes:session:"
    
    def __init__(self, redis_url: str = "redis://localhost:6379"):
        self.redis_url = redis_url
        self._redis: Optional[aioredis.Redis] = None
    
    async def _get_redis(self) -> aioredis.Redis:
        if self._redis is None:
            self._redis = await aioredis.from_url(
                self.redis_url, encoding="utf-8", decode_responses=True
            )
        return self._redis
    
    async def create_session(self, user_id: str) -> str:
        session_id = f"{user_id}:{uuid.uuid4().hex}"
        session = AgentSession(
            session_id=session_id,
            messages=[],
            current_step=0,
            tool_results={},
            metadata={"user_id": user_id}
        )
        redis = await self._get_redis()
        await redis.setex(
            f"{self.SESSION_PREFIX}{session_id}",
            self.SESSION_TTL,
            session.to_json()
        )
        return session_id
    
    async def load_session(self, session_id: str) -> Optional[AgentSession]:
        redis = await self._get_redis()
        data = await redis.get(f"{self.SESSION_PREFIX}{session_id}")
        if data is None:
            return None
        await redis.expire(f"{self.SESSION_PREFIX}{session_id}", self.SESSION_TTL)
        return AgentSession.from_json(data)
    
    async def save_session(self, session: AgentSession) -> None:
        redis = await self._get_redis()
        await redis.setex(
            f"{self.SESSION_PREFIX}{session.session_id}",
            self.SESSION_TTL,
            session.to_json()
        )
    
    async def run_step(self, session_id: str, user_message: str) -> dict:
        """
        Execute one Agent reasoning step.
        Any instance can handle any request by loading state from Redis.
        """
        session = await self.load_session(session_id)
        if session is None:
            raise ValueError(f"Session {session_id} not found or expired")
        
        # Append user message
        session.messages.append({"role": "user", "content": user_message})
        
        # Run Agent reasoning (stateless local call)
        from hermes import HermesAgent, AgentConfig
        agent = HermesAgent(AgentConfig())
        result = await agent.step(
            messages=session.messages,
            tool_results=session.tool_results,
            step_number=session.current_step
        )
        
        # Update state
        session.messages.append({"role": "assistant", "content": result.content})
        session.current_step += 1
        
        # Persist
        await self.save_session(session)
        
        return {
            "session_id": session_id,
            "step": session.current_step,
            "content": result.content,
            "is_final": result.is_final
        }
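The effect of externalizing state can be demonstrated without Redis. In the sketch below, `FakeSharedStore` is a hypothetical in-memory stand-in for the shared store, and `Instance.handle` is a simplified placeholder for a reasoning step (it does not call an LLM). Consecutive requests for the same session are deliberately sent to alternating instances.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List, Optional


@dataclass
class Session:
    session_id: str
    messages: List[dict] = field(default_factory=list)
    current_step: int = 0


class FakeSharedStore:
    """Hypothetical stand-in for Redis: JSON strings shared by all instances."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)

    def set(self, key: str, value: str) -> None:
        self._data[key] = value


class Instance:
    """An Agent instance that keeps no session state between requests."""

    def __init__(self, name: str, store: FakeSharedStore) -> None:
        self.name = name
        self.store = store

    def handle(self, session_id: str, user_message: str) -> dict:
        # Load: rebuild the session from shared storage on every request.
        raw = self.store.get(session_id)
        session = Session(**json.loads(raw)) if raw else Session(session_id)

        # Process: placeholder for the real reasoning step.
        session.messages.append({"role": "user", "content": user_message})
        reply = f"[{self.name}] step {session.current_step}: {user_message}"
        session.messages.append({"role": "assistant", "content": reply})
        session.current_step += 1

        # Save: write the updated state back before responding.
        self.store.set(session_id, json.dumps(asdict(session)))
        return {
            "handled_by": self.name,
            "step": session.current_step,
            "history": len(session.messages),
        }


store = FakeSharedStore()
instances = [Instance("agent1", store), Instance("agent2", store)]

# Alternate instances on every request, as a load balancer without affinity might.
responses = [
    instances[i % 2].handle("user1:abc", message)
    for i, message in enumerate(["plan", "execute", "summarize"])
]
```

Each response reports a growing step counter and history length even though no instance handles two consecutive requests, which is the property that makes session affinity optional.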

49.5 Redis Shared Memory Layer

# redis_memory.py
import json
import uuid
from typing import Any, Optional

import redis.asyncio as aioredis

# Lua script: delete the lock only if it still holds the caller's token, so an
# instance whose lock already expired cannot release another instance's lock.
RELEASE_LOCK_SCRIPT = """
if redis.call("get", KEYS[1]) == ARGV[1] then
    return redis.call("del", KEYS[1])
end
return 0
"""


class HermesRedisMemory:
    """Redis shared memory layer with distributed locking."""
    
    def __init__(self, redis_url: str):
        self.redis = aioredis.from_url(redis_url)
    
    async def append_message(self, session_id: str, message: dict) -> int:
        key = f"hermes:msg:{session_id}"
        length = await self.redis.rpush(key, json.dumps(message))
        await self.redis.expire(key, 86400)  # refresh the 24h TTL on every write
        return length
    
    async def get_messages(self, session_id: str) -> list:
        key = f"hermes:msg:{session_id}"
        items = await self.redis.lrange(key, 0, -1)
        return [json.loads(item) for item in items]
    
    async def cache_tool_result(self, tool_call_id: str, result: Any, ttl: int = 3600) -> None:
        """Cache tool result for idempotency protection."""
        key = f"hermes:tool:{tool_call_id}"
        await self.redis.setex(key, ttl, json.dumps(result))
    
    async def get_tool_result(self, tool_call_id: str) -> Optional[Any]:
        key = f"hermes:tool:{tool_call_id}"
        data = await self.redis.get(key)
        return json.loads(data) if data else None
    
    async def acquire_session_lock(self, session_id: str, timeout: int = 30) -> Optional[str]:
        """Distributed lock to prevent concurrent session processing.

        Returns a unique token if the lock was acquired, otherwise None.
        The lock expires after `timeout` seconds, so `timeout` should exceed
        the longest expected step duration.
        """
        key = f"hermes:lock:{session_id}"
        token = uuid.uuid4().hex
        acquired = await self.redis.set(key, token, nx=True, ex=timeout)
        return token if acquired else None
    
    async def release_session_lock(self, session_id: str, token: str) -> bool:
        """Release the lock only if this caller still owns it."""
        key = f"hermes:lock:{session_id}"
        released = await self.redis.eval(RELEASE_LOCK_SCRIPT, 1, key, token)
        return released == 1
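A session lock is only useful if it is always released and if a release can never remove a lock that another instance now holds. One way to get both properties is an async context manager around a token-based lock. The sketch below is an assumption-laden illustration: `InMemoryLockStore` is a hypothetical stand-in for the Redis calls (it has no expiry), and `SessionBusyError` and `session_lock` are names invented for this example.

```python
import asyncio
import uuid
from contextlib import asynccontextmanager
from typing import AsyncIterator, Dict, Optional


class SessionBusyError(RuntimeError):
    """Raised when another request is already processing the session."""


class InMemoryLockStore:
    """Hypothetical stand-in for Redis SET NX plus compare-and-delete (no TTL)."""

    def __init__(self) -> None:
        self._locks: Dict[str, str] = {}

    async def acquire(self, session_id: str) -> Optional[str]:
        if session_id in self._locks:
            return None
        token = uuid.uuid4().hex
        self._locks[session_id] = token
        return token

    async def release(self, session_id: str, token: str) -> bool:
        # Only the holder of the matching token may release the lock.
        if self._locks.get(session_id) != token:
            return False
        del self._locks[session_id]
        return True


@asynccontextmanager
async def session_lock(store: InMemoryLockStore, session_id: str) -> AsyncIterator[str]:
    token = await store.acquire(session_id)
    if token is None:
        raise SessionBusyError(f"session {session_id} is being processed")
    try:
        yield token
    finally:
        # Runs even if the protected block raises, so the lock is not leaked.
        await store.release(session_id, token)


async def demo() -> list:
    store = InMemoryLockStore()
    outcomes = []
    async with session_lock(store, "s1"):
        outcomes.append("first acquired")
        try:
            async with session_lock(store, "s1"):
                outcomes.append("second acquired")
        except SessionBusyError:
            outcomes.append("second rejected")
    async with session_lock(store, "s1"):
        outcomes.append("reacquired after release")
    return outcomes


outcomes = asyncio.run(demo())
```

In an HTTP handler, `SessionBusyError` would typically map to a 409 or 429 response so the client can retry, rather than letting two instances interleave writes to the same session.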

49.6 Bottleneck Analysis and Solutions

Common Bottlenecks

Bottleneck	Symptoms	Solution
LLM inference	High latency, GPU at 100%	Add GPU nodes, quantization (AWQ/GPTQ), request batching
Redis layer	High CPU, rising read/write latency	Redis Cluster sharding, read replicas
Network I/O	Slow tool calls, external API timeouts	Connection pooling, async concurrent calls, local cache
Context length	Inference slows on long tasks	Context compression, sliding window, external memory
Tool concurrency	Multiple tools executing serially	Parallel tool calls with `asyncio.gather`
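The last row can be sketched as follows. `fake_tool` is a hypothetical placeholder that simulates I/O latency with `asyncio.sleep`, and `run_tools_in_parallel` is an illustrative helper name. The approach assumes the tool calls are independent of one another; calls whose inputs depend on earlier results still have to run in sequence.

```python
import asyncio
import time
from typing import Any, Awaitable, Dict, List


async def fake_tool(name: str, delay: float, fail: bool = False) -> Dict[str, Any]:
    """Placeholder tool: waits for `delay` seconds, then succeeds or raises."""
    await asyncio.sleep(delay)
    if fail:
        raise RuntimeError(f"{name} failed")
    return {"tool": name, "ok": True}


async def run_tools_in_parallel(
    calls: List[Awaitable[Dict[str, Any]]],
) -> List[Dict[str, Any]]:
    # return_exceptions=True keeps one failing tool from discarding the others.
    # gather returns results in the same order as the input awaitables.
    results = await asyncio.gather(*calls, return_exceptions=True)
    return [
        {"ok": False, "error": str(r)} if isinstance(r, Exception) else r
        for r in results
    ]


async def main():
    start = time.perf_counter()
    results = await run_tools_in_parallel(
        [
            fake_tool("search", 0.1),
            fake_tool("weather", 0.1, fail=True),
            fake_tool("calendar", 0.1),
        ]
    )
    return results, time.perf_counter() - start


results, elapsed = asyncio.run(main())
```

Run serially, the three calls would take about 0.3 seconds; run concurrently, the total is close to the slowest single call.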

Kubernetes HPA for Dynamic Scaling

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: hermes-agent-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: hermes-agent
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Pods
    pods:
      metric:
        name: hermes_active_sessions
      target:
        type: AverageValue
        averageValue: "50"
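To reason about how this HPA reacts, it helps to reproduce the controller's core calculation: for each metric, desiredReplicas = ceil(currentReplicas × currentValue / targetValue); the largest proposal wins and is then clamped to minReplicas and maxReplicas. The sketch below models only that formula and the default 10% tolerance. It ignores stabilization windows, pod readiness, and missing metrics, and the function names are hypothetical. Note also that a Pods metric such as `hermes_active_sessions` is only available if a custom metrics adapter exposes it to the cluster.

```python
import math
from typing import Iterable, Tuple


def desired_for_metric(
    current_replicas: int,
    current_value: float,
    target_value: float,
    tolerance: float = 0.1,
) -> int:
    """Replica proposal for one metric, following the HPA ratio formula."""
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: no scaling for this metric.
        return current_replicas
    return math.ceil(current_replicas * ratio)


def desired_replicas(
    current_replicas: int,
    metrics: Iterable[Tuple[float, float]],
    min_replicas: int = 2,
    max_replicas: int = 20,
) -> int:
    """Take the largest proposal across (current, target) pairs, then clamp."""
    proposal = max(
        desired_for_metric(current_replicas, value, target)
        for value, target in metrics
    )
    return max(min_replicas, min(max_replicas, proposal))


# 4 pods at 90% CPU (target 70) and 80 active sessions per pod (target 50).
scale_up = desired_replicas(4, [(90.0, 70.0), (80.0, 50.0)])

# Both metrics within 10% of target: replica count is unchanged.
steady = desired_replicas(4, [(72.0, 70.0), (48.0, 50.0)])

# A large spike is capped at maxReplicas.
capped = desired_replicas(10, [(300.0, 50.0)])
```

In the first case CPU alone would ask for 6 replicas and the session metric for 7, so the HPA scales to 7. The session metric reacts to Agent load even when pods are mostly waiting on LLM or tool I/O and CPU stays low.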

Summary

Scaling Hermes Agent requires addressing challenges that don't exist in simple web services: persistent session state, long-lived streaming connections, and tool call side effects. The path forward is clear:

Session affinity via ip_hash or sticky cookies is a quick fix but not a long-term solution.
Stateless design by externalizing all state to Redis is the correct architectural approach.
Redis memory layer uses appropriate data structures (List for messages, String with TTL for cached tool results and locks) with TTL management.
LLM inference is usually the primary bottleneck, so measure it first and address it with quantization and batching.
Kubernetes HPA with custom metrics enables elastic scaling tied to actual Agent load.

Review Questions

When Redis fails, what happens to a stateless Agent cluster? How would you design a graceful degradation strategy?
For Agents requiring long-term cross-session memory, how should Redis TTL policies be designed?
If all tool calls are idempotent, can session affinity be completely eliminated? Which tool types are inherently non-idempotent?
Why is leastconn preferable to round-robin for long-running Agent tasks?
