Chapter 49: Load Balancing and Horizontal Scaling
Introduction
When Hermes Agent moves from a lab prototype to a production service, a single-instance deployment quickly becomes the bottleneck. As concurrent requests increase, response latency climbs, error rates rise, and stability deteriorates. This chapter explains how to scale Hermes Agent into a highly available distributed service using Nginx/HAProxy reverse proxies, stateless design, and a Redis shared memory layer.
49.1 Why Scaling Agents Is Harder Than Scaling Web Services
Scaling traditional web services is relatively simple: stateless HTTP services can be arbitrarily replicated, and a load balancer distributes requests evenly. Hermes Agent introduces several critical differences:
1. Session State Persistence
During multi-step task execution, an Agent maintains substantial intermediate state:
- Current execution step (which ReAct iteration)
- Tool call history and results
- Conversation context (message list)
- User preferences and authorization tokens
If request A establishes a session on instance 1, and request B is forwarded to instance 2, instance 2 finds no context and the task either starts from scratch or fails outright.
2. Long Connections and Streaming Responses
Hermes Agent typically uses SSE (Server-Sent Events) or WebSocket to stream tokens back to the client. The load balancer must therefore hold a stable connection open for the entire duration of Agent execution, which can be several minutes.
3. Tool Call Side Effects
When an Agent calls tools (write files, send emails, query databases), these operations cannot be naively retried. If an instance crashes and the task is rescheduled to another instance, a side effect that already happened, such as a sent email, may be repeated.
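One common mitigation is to derive a deterministic idempotency key for each tool call and store the result under that key, so that a rescheduled task replays the stored result instead of repeating the side effect. The following is a minimal sketch of that idea. `make_key`, `IdempotentToolRunner`, and `send_email` are hypothetical names, and the in-memory dict stands in for a shared store such as Redis.

```python
import hashlib
import json
from typing import Any, Callable, Dict, List, Tuple


def make_key(session_id: str, step: int, tool: str, args: dict) -> str:
    """Derive a deterministic key from the call's identity (hypothetical scheme)."""
    payload = json.dumps([session_id, step, tool, args], sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class IdempotentToolRunner:
    """Runs a tool at most once per key; the dict stands in for a shared store."""

    def __init__(self) -> None:
        self._results: Dict[str, Any] = {}

    def run(self, key: str, tool_fn: Callable[..., Any], **kwargs: Any) -> Any:
        if key in self._results:
            # A retry of the same call: replay the recorded result.
            return self._results[key]
        result = tool_fn(**kwargs)
        self._results[key] = result
        return result


sent: List[Tuple[str, str]] = []


def send_email(to: str, body: str) -> str:
    """Hypothetical tool with a visible side effect."""
    sent.append((to, body))
    return f"sent to {to}"


runner = IdempotentToolRunner()
args = {"to": "ops@example.com", "body": "report ready"}
key = make_key("sess-1", 3, "send_email", args)
first = runner.run(key, send_email, **args)
second = runner.run(key, send_email, **args)  # simulated retry after rescheduling
```

This narrows the window but does not close it: a crash after the side effect and before the result is recorded can still produce a duplicate, so externally visible tools benefit from idempotency support on the receiving side as well.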
```mermaid
flowchart TD
    Client[Client] --> LB[Load Balancer]
    LB --> A1[Agent Instance 1]
    LB --> A2[Agent Instance 2]
    LB --> A3[Agent Instance 3]
    A1 --> Redis[(Redis Shared Memory)]
    A2 --> Redis
    A3 --> Redis
    A1 --> Tools[Tool Layer]
    A2 --> Tools
    A3 --> Tools
    Redis --> PG[(PostgreSQL Persistence)]
```
49.2 Nginx Reverse Proxy Configuration
Basic Configuration
```nginx
# /etc/nginx/conf.d/hermes-agent.conf

# Send "Connection: upgrade" only for WebSocket handshakes; otherwise clear
# the header so that upstream keepalive connections can be reused.
map $http_upgrade $connection_upgrade {
    default upgrade;
    ''      '';
}

upstream hermes_backend {
    # Session affinity via ip_hash
    ip_hash;
    server 10.0.1.10:8000 weight=1 max_fails=3 fail_timeout=30s;
    server 10.0.1.11:8000 weight=1 max_fails=3 fail_timeout=30s;
    server 10.0.1.12:8000 weight=1 max_fails=3 fail_timeout=30s;
    keepalive 32;
}

server {
    listen 80;
    server_name agent.example.com;

    # Long-connection timeouts for SSE
    proxy_read_timeout 300s;
    proxy_send_timeout 300s;
    proxy_connect_timeout 10s;

    # Disable buffering (required for SSE)
    proxy_buffering off;
    proxy_cache off;

    location /api/agent/ {
        proxy_pass http://hermes_backend;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection $connection_upgrade;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        # Cookies are forwarded by default; set explicitly here for clarity
        proxy_set_header Cookie $http_cookie;
    }

    location /health {
        proxy_pass http://hermes_backend/health;
        access_log off;
    }
}
```
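It helps to see why ip_hash gives only coarse affinity. Nginx uses the first three octets of an IPv4 address (or the whole IPv6 address) as the hashing key, so every client in the same /24 network, for example an office behind one NAT gateway, lands on the same backend. The sketch below illustrates that keying. `affinity_key` and `pick_backend` are hypothetical helpers, and SHA-256 is a stand-in for illustration, not the hash function Nginx actually uses.

```python
import hashlib
import ipaddress
from typing import List

BACKENDS: List[str] = ["10.0.1.10:8000", "10.0.1.11:8000", "10.0.1.12:8000"]


def affinity_key(client_ip: str) -> bytes:
    """Return the hashing key: first three octets for IPv4, full address for IPv6."""
    addr = ipaddress.ip_address(client_ip)
    if addr.version == 4:
        return addr.packed[:3]
    return addr.packed


def pick_backend(client_ip: str, backends: List[str] = BACKENDS) -> str:
    """Map a client to a backend (illustrative hash, not Nginx's own)."""
    digest = hashlib.sha256(affinity_key(client_ip)).digest()
    index = int.from_bytes(digest[:8], "big") % len(backends)
    return backends[index]


# Two different clients in the same /24 always share a backend.
same_subnet = pick_backend("203.0.113.7") == pick_backend("203.0.113.200")
```

Because the mapping also depends on the number of backends, adding or removing a server can move existing clients to a different instance, which is one reason affinity alone is a fragile basis for session state.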
49.3 HAProxy Configuration (Recommended for Production)
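HAProxy offers more mature support for long connections and health checks than open-source Nginx, which performs only passive health checks (max_fails / fail_timeout) unless a commercial or third-party module is added: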
```haproxy
global
    maxconn 50000
    log /dev/log local0
    daemon

defaults
    log global
    mode http
    option httplog
    timeout connect 5s
    timeout client 300s
    timeout server 300s
    timeout tunnel 1h

frontend hermes_frontend
    bind *:80
    bind *:443 ssl crt /etc/ssl/certs/agent.pem
    use_backend hermes_backend

backend hermes_backend
    balance leastconn
    cookie SERVERID insert indirect nocache
    option httpchk GET /health
    http-check expect status 200
    server agent1 10.0.1.10:8000 check cookie agent1 inter 5s rise 2 fall 3
    server agent2 10.0.1.11:8000 check cookie agent2 inter 5s rise 2 fall 3
    server agent3 10.0.1.12:8000 check cookie agent3 inter 5s rise 2 fall 3
```
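Why leastconn over roundrobin? Agent tasks vary widely in duration, so an even split of requests does not produce an even split of work. leastconn routes each new connection to the instance with the fewest active connections, which steers new sessions away from instances that are still busy with long-running tasks. Clients that already carry the SERVERID cookie continue to return to their original instance.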
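The difference can be shown with a small deterministic simulation. This is a sketch under simplifying assumptions: the workload is contrived so that every third request is a long task, there is no cookie persistence, and `simulate` is a hypothetical helper rather than a model of HAProxy internals.

```python
from typing import List, Tuple

Task = Tuple[float, float]  # (arrival_time, duration) in seconds


def simulate(tasks: List[Task], n_servers: int, policy: str) -> int:
    """Return the peak number of concurrent connections on any one server."""
    finish_times: List[List[float]] = [[] for _ in range(n_servers)]
    peak = 0
    for i, (arrival, duration) in enumerate(tasks):
        # Drop connections that have already finished by this arrival.
        for s in range(n_servers):
            finish_times[s] = [t for t in finish_times[s] if t > arrival]
        if policy == "roundrobin":
            target = i % n_servers
        elif policy == "leastconn":
            target = min(range(n_servers), key=lambda s: len(finish_times[s]))
        else:
            raise ValueError(f"unknown policy: {policy}")
        finish_times[target].append(arrival + duration)
        peak = max(peak, len(finish_times[target]))
    return peak


# Every third request is a long Agent task (300 s); the rest are short (1 s).
tasks = [(float(i), 300.0 if i % 3 == 0 else 1.0) for i in range(12)]
rr_peak = simulate(tasks, 3, "roundrobin")  # all long tasks pile onto server 0
lc_peak = simulate(tasks, 3, "leastconn")   # long tasks spread across servers
```

With this workload round-robin stacks all four long tasks on one instance, while leastconn never places more than two connections on any instance. Real traffic is less regular, but the same effect appears whenever task durations are skewed.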
49.4 Stateless Design
The fundamental solution to session affinity problems is statelessness: move all state to shared storage so any instance can handle any request.
State Classification and Storage Strategy
| State Type | Size | Update Frequency | Recommended Storage |
|---|---|---|---|
| Conversation history | ~10KB | Every turn | Redis (TTL 24h) |
| Agent execution state | ~1KB | Every step | Redis |
| Tool result cache | 1KB–10MB | Write-once | Redis + S3 |
| Long-term memory | MB–GB | Infrequent | PostgreSQL / Qdrant |
| User preferences | ~1KB | Infrequent | PostgreSQL |
Stateless Agent Implementation
```python
# stateless_agent.py
import json
import uuid
from typing import Optional
import redis.asyncio as aioredis
from dataclasses import dataclass, asdict


@dataclass
class AgentSession:
    session_id: str
    messages: list
    current_step: int
    tool_results: dict
    metadata: dict

    def to_json(self) -> str:
        return json.dumps(asdict(self), ensure_ascii=False)

    @classmethod
    def from_json(cls, data: str) -> 'AgentSession':
        return cls(**json.loads(data))


class StatelessHermesAgent:
    """Stateless Hermes Agent: all state stored in Redis."""

    SESSION_TTL = 86400  # 24 hours
    SESSION_PREFIX = "hermes:session:"

    def __init__(self, redis_url: str = "redis://localhost:6379"):
        self.redis_url = redis_url
        self._redis: Optional[aioredis.Redis] = None

    async def _get_redis(self) -> aioredis.Redis:
        if self._redis is None:
            self._redis = await aioredis.from_url(
                self.redis_url, encoding="utf-8", decode_responses=True
            )
        return self._redis

    async def create_session(self, user_id: str) -> str:
        session_id = f"{user_id}:{uuid.uuid4().hex}"
        session = AgentSession(
            session_id=session_id,
            messages=[],
            current_step=0,
            tool_results={},
            metadata={"user_id": user_id}
        )
        redis = await self._get_redis()
        await redis.setex(
            f"{self.SESSION_PREFIX}{session_id}",
            self.SESSION_TTL,
            session.to_json()
        )
        return session_id

    async def load_session(self, session_id: str) -> Optional[AgentSession]:
        redis = await self._get_redis()
        data = await redis.get(f"{self.SESSION_PREFIX}{session_id}")
        if data is None:
            return None
        # Sliding expiry: each read extends the session's lifetime
        await redis.expire(f"{self.SESSION_PREFIX}{session_id}", self.SESSION_TTL)
        return AgentSession.from_json(data)

    async def save_session(self, session: AgentSession) -> None:
        redis = await self._get_redis()
        await redis.setex(
            f"{self.SESSION_PREFIX}{session.session_id}",
            self.SESSION_TTL,
            session.to_json()
        )

    async def run_step(self, session_id: str, user_message: str) -> dict:
        """
        Execute one Agent reasoning step.
        Any instance can handle any request by loading state from Redis.
        """
        session = await self.load_session(session_id)
        if session is None:
            raise ValueError(f"Session {session_id} not found or expired")

        # Append user message
        session.messages.append({"role": "user", "content": user_message})

        # Run Agent reasoning (stateless local call)
        from hermes import HermesAgent, AgentConfig
        agent = HermesAgent(AgentConfig())
        result = await agent.step(
            messages=session.messages,
            tool_results=session.tool_results,
            step_number=session.current_step
        )

        # Update state
        session.messages.append({"role": "assistant", "content": result.content})
        session.current_step += 1

        # Persist
        await self.save_session(session)

        return {
            "session_id": session_id,
            "step": session.current_step,
            "content": result.content,
            "is_final": result.is_final
        }
```
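The property this design buys, that any instance can serve any request, can be demonstrated without Redis or an LLM. The following sketch is a simplified simulation: `SharedStore` is an in-memory stand-in for Redis, and `Instance` is a hypothetical worker that only records which instance served each message.

```python
import itertools
import json
from typing import Dict, List, Optional


class SharedStore:
    """In-memory stand-in for Redis: one store shared by every instance."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)

    def set(self, key: str, value: str) -> None:
        self._data[key] = value


class Instance:
    """A stateless worker: it keeps no session data between calls."""

    def __init__(self, name: str, store: SharedStore) -> None:
        self.name = name
        self.store = store

    def handle(self, session_id: str, user_message: str) -> List[dict]:
        # Load, mutate, and persist the session on every request.
        raw = self.store.get(session_id)
        messages = json.loads(raw) if raw else []
        messages.append(
            {"role": "user", "content": user_message, "served_by": self.name}
        )
        self.store.set(session_id, json.dumps(messages))
        return messages


store = SharedStore()
instances = [Instance(f"agent{i}", store) for i in (1, 2, 3)]
router = itertools.cycle(instances)  # round-robin with no session affinity

history: List[dict] = []
for text in ["plan trip", "book hotel", "confirm"]:
    history = next(router).handle("sess-42", text)
```

Each of the three messages is served by a different instance, yet the final history contains all of them in order. The cost is one read and one write to the shared store per step, which is why the Redis layer appears in the bottleneck table later in the chapter.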
49.5 Redis Shared Memory Layer
```python
# redis_memory.py
import redis.asyncio as aioredis
import json
from typing import Any, Optional


class HermesRedisMemory:
    """Redis shared memory layer with distributed locking."""

    def __init__(self, redis_url: str):
        self.redis = aioredis.from_url(redis_url)

    async def append_message(self, session_id: str, message: dict) -> int:
        key = f"hermes:msg:{session_id}"
        length = await self.redis.rpush(key, json.dumps(message))
        await self.redis.expire(key, 86400)  # refresh the 24-hour TTL
        return length

    async def get_messages(self, session_id: str) -> list:
        key = f"hermes:msg:{session_id}"
        items = await self.redis.lrange(key, 0, -1)
        return [json.loads(item) for item in items]

    async def cache_tool_result(self, tool_call_id: str, result: Any, ttl: int = 3600) -> None:
        """Cache tool result for idempotency protection."""
        key = f"hermes:tool:{tool_call_id}"
        await self.redis.setex(key, ttl, json.dumps(result))

    async def get_tool_result(self, tool_call_id: str) -> Optional[Any]:
        key = f"hermes:tool:{tool_call_id}"
        data = await self.redis.get(key)
        return json.loads(data) if data else None

    async def acquire_session_lock(self, session_id: str, timeout: int = 30) -> bool:
        """Distributed lock to prevent concurrent session processing."""
        key = f"hermes:lock:{session_id}"
        # SET NX EX: succeeds only if the key does not exist; expires on its own
        result = await self.redis.set(key, "1", nx=True, ex=timeout)
        return result is True

    async def release_session_lock(self, session_id: str) -> None:
        # Note: this deletes unconditionally. If the lock has already expired
        # and another instance has acquired it, that instance's lock is removed.
        await self.redis.delete(f"hermes:lock:{session_id}")
```
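A lock whose value is a fixed string cannot tell its owner apart from anyone else, so a slow instance may release a lock that has since expired and been taken by another instance. A common refinement stores a random token as the lock value and releases with a compare-and-delete that runs atomically in a Lua script. The sketch below shows that pattern. `SessionLock` is a hypothetical class, and `FakeRedis` is a test double that mimics only the `set` and `eval` calls used here so the example runs without a server; with a real client, `redis.asyncio.Redis` would be passed instead.

```python
import asyncio
import uuid
from typing import Dict, Optional, Tuple

# Lua runs atomically inside Redis: delete the key only if we still own it.
RELEASE_SCRIPT = """
if redis.call("get", KEYS[1]) == ARGV[1] then
    return redis.call("del", KEYS[1])
end
return 0
"""


class FakeRedis:
    """Test double that mimics only the two calls used below (no expiry)."""

    def __init__(self) -> None:
        self.data: Dict[str, str] = {}

    async def set(self, key, value, nx=False, ex=None):
        if nx and key in self.data:
            return None
        self.data[key] = value
        return True

    async def eval(self, script, numkeys, key, token):
        # Emulates RELEASE_SCRIPT rather than interpreting Lua.
        if self.data.get(key) == token:
            del self.data[key]
            return 1
        return 0


class SessionLock:
    def __init__(self, client) -> None:
        self.client = client

    async def acquire(self, session_id: str, timeout: int = 30) -> Optional[str]:
        """Return an ownership token on success, or None if the lock is held."""
        token = uuid.uuid4().hex
        ok = await self.client.set(
            f"hermes:lock:{session_id}", token, nx=True, ex=timeout
        )
        return token if ok else None

    async def release(self, session_id: str, token: str) -> bool:
        """Release only if the stored token matches ours."""
        deleted = await self.client.eval(
            RELEASE_SCRIPT, 1, f"hermes:lock:{session_id}", token
        )
        return deleted == 1


async def demo() -> Tuple[bool, Optional[str], bool, bool]:
    lock = SessionLock(FakeRedis())
    token = await lock.acquire("sess-1")
    second = await lock.acquire("sess-1")          # already held, so this fails
    wrong = await lock.release("sess-1", "bogus")  # not the owner, no effect
    right = await lock.release("sess-1", token)    # owner releases
    return token is not None, second, wrong, right


result = asyncio.run(demo())
```

The lock timeout still needs to exceed the longest expected step, or be extended while the step runs; otherwise two instances can process the same session once the first lock expires.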
49.6 Bottleneck Analysis and Solutions
Common Bottlenecks
| Bottleneck | Symptoms | Solution |
|---|---|---|
| LLM inference | High latency, GPU at 100% | Add GPU nodes, quantization (AWQ/GPTQ), request batching |
| Redis layer | High CPU, rising read/write latency | Redis Cluster sharding, read replicas |
| Network I/O | Slow tool calls, external API timeouts | Connection pooling, async concurrent calls, local cache |
| Context length | Inference slows on long tasks | Context compression, sliding window, external memory |
| Tool concurrency | Multiple tools executing serially | Parallel tool calls with asyncio.gather |
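For the tool concurrency row, a minimal sketch of parallel execution with asyncio.gather follows. `fake_tool` is a hypothetical tool that sleeps to simulate network I/O, and `run_tools_parallel` is an illustrative helper, not a Hermes Agent API.

```python
import asyncio
import time
from typing import Any, Awaitable, Dict, List, Tuple


async def fake_tool(name: str, delay: float) -> Dict[str, Any]:
    """Hypothetical tool: sleeps to simulate a network call."""
    await asyncio.sleep(delay)
    return {"tool": name, "ok": True}


async def run_tools_parallel(calls: List[Awaitable[Any]]) -> List[Any]:
    """Run independent tool calls concurrently.

    With return_exceptions=True a failing tool yields its exception object
    in the result list instead of cancelling the whole batch.
    """
    return await asyncio.gather(*calls, return_exceptions=True)


async def main() -> Tuple[List[Any], float]:
    start = time.perf_counter()
    results = await run_tools_parallel([
        fake_tool("search", 0.1),
        fake_tool("weather", 0.1),
        fake_tool("calendar", 0.1),
    ])
    elapsed = time.perf_counter() - start
    return results, elapsed


results, elapsed = asyncio.run(main())
```

Results come back in the order the calls were passed, regardless of which finishes first, and the total wall time is close to the slowest call rather than the sum. This only applies to tools that are independent of each other; calls where one tool consumes another's output still have to run in sequence.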
Kubernetes HPA for Dynamic Scaling
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: hermes-agent-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: hermes-agent
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Pods
      pods:
        metric:
          name: hermes_active_sessions
        target:
          type: AverageValue
          averageValue: "50"
```
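To reason about how this policy behaves, it helps to work through the scaling rule the HPA applies: for each metric it computes ceil(currentReplicas × currentValue / targetValue), takes the largest proposal across metrics, and clamps it to the configured bounds. The sketch below is a simplified model of that rule. `desired_replicas` is a hypothetical helper, and it ignores the controller's tolerance band, stabilization windows, and pod readiness handling. Note also that a Pods metric such as hermes_active_sessions is only available if a custom metrics adapter exposes it to the cluster.

```python
import math
from typing import Dict, Tuple


def desired_replicas(
    current: int,
    metrics: Dict[str, Tuple[float, float]],
    min_replicas: int = 2,
    max_replicas: int = 20,
) -> int:
    """Simplified HPA rule; metrics maps name -> (current average, target)."""
    proposals = [
        math.ceil(current * value / target)
        for value, target in metrics.values()
    ]
    # The largest proposal wins, then the result is clamped to the bounds.
    return max(min_replicas, min(max_replicas, max(proposals)))


# 4 pods at 90% CPU (target 70) and 80 sessions per pod (target 50)
busy = desired_replicas(4, {"cpu": (90, 70), "hermes_active_sessions": (80, 50)})

# 4 pods, lightly loaded: both metrics propose scaling in, floor is 2
idle = desired_replicas(4, {"cpu": (35, 70), "hermes_active_sessions": (10, 50)})

# 10 pods at 200 sessions each would need 40 pods, capped at 20
capped = desired_replicas(10, {"hermes_active_sessions": (200, 50)})
```

In the first case CPU alone would propose 6 replicas, but the session metric proposes 7 and therefore decides the outcome, which is the point of adding an Agent-specific metric alongside CPU.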
Summary
Scaling Hermes Agent requires addressing challenges that don't exist in simple web services: persistent session state, long-lived streaming connections, and tool call side effects. The path forward is clear:
- Session affinity via ip_hash or sticky cookies is a quick fix but not a long-term solution.
- Stateless design by externalizing all state to Redis is the correct architectural approach.
- Redis memory layer uses appropriate data structures (List for messages, String keys for tool results and locks) with TTL management.
- LLM inference is usually the primary bottleneck; address it first with quantization and batching.
- Kubernetes HPA with custom metrics enables elastic scaling tied to actual Agent load.
Review Questions
- When Redis fails, what happens to a stateless Agent cluster? How would you design a graceful degradation strategy?
- For Agents requiring long-term cross-session memory, how should Redis TTL policies be designed?
- If all tool calls are idempotent, can session affinity be completely eliminated? Which tool types are inherently non-idempotent?
- Why is leastconn preferable to round-robin for long-running Agent tasks?