Chapter 50
Prometheus + Grafana Monitoring System
Introduction
"You can't manage what you can't measure." For Hermes Agent, this statement is especially true. Agent behavior is far more complex than traditional web services: every reasoning step consumes tokens, every tool call can fail, and every session has its own lifecycle. Without comprehensive monitoring, you're flying blind. This chapter builds a complete Prometheus + Grafana monitoring system, covering everything from basic metric collection to production-grade alerting.
50.1 Key Metrics to Monitor
Before designing the monitoring system, clarify what we care about. Hermes Agent metrics fall into five dimensions:
mindmap
  root((Hermes Metrics))
    Performance
      Request latency P50/P95/P99
      Step execution time
      Tool call duration
    Reliability
      Request success rate
      Tool call success rate
      Error rate by category
    Cost
      Token consumption
      API spend in USD
      GPU utilization
    Capacity
      Active sessions
      Queue depth
      Memory usage
    Business
      Task completion rate
      Avg steps per session
Core Metric Definitions
| Metric Name | Type | Description | Alert Threshold |
|---|---|---|---|
| hermes_request_duration_seconds | Histogram | End-to-end request latency | P95 > 30s |
| hermes_step_duration_seconds | Histogram | Single-step reasoning latency | P95 > 10s |
| hermes_tool_call_total | Counter | Tool calls by name and status | none |
| hermes_tool_call_errors_total | Counter | Tool call failures | Error rate > 5% |
| hermes_tokens_total | Counter | Tokens consumed (prompt/completion) | Daily > budget |
| hermes_active_sessions | Gauge | Currently active Agent sessions | > instance capacity |
| hermes_session_steps_total | Histogram | Steps per completed session | none |
| hermes_cost_usd_total | Counter | Cumulative API cost in USD | Daily > budget |
| hermes_llm_errors_total | Counter | LLM API error count | none |
50.2 Embedding Metrics in Hermes Agent
# hermes_metrics.py
import time
import functools
from typing import Callable

from prometheus_client import (
    Counter, Histogram, Gauge,
    generate_latest, CONTENT_TYPE_LATEST
)

# Metric definitions
REQUEST_DURATION = Histogram(
    'hermes_request_duration_seconds',
    'End-to-end request latency',
    ['method', 'endpoint', 'status'],
    buckets=[0.5, 1, 2, 5, 10, 30, 60, 120, 300]
)

STEP_DURATION = Histogram(
    'hermes_step_duration_seconds',
    'Duration of a single reasoning step',
    ['model', 'step_type'],
    buckets=[0.1, 0.5, 1, 2, 5, 10, 30]
)

TOOL_CALLS_TOTAL = Counter(
    'hermes_tool_call_total',
    'Total tool calls by name and status',
    ['tool_name', 'status']
)

TOOL_DURATION = Histogram(
    'hermes_tool_duration_seconds',
    'Tool execution duration',
    ['tool_name'],
    buckets=[0.01, 0.05, 0.1, 0.5, 1, 5, 10, 30]
)

TOKENS_TOTAL = Counter(
    'hermes_tokens_total',
    'Total tokens consumed',
    ['model', 'token_type']
)

ACTIVE_SESSIONS = Gauge(
    'hermes_active_sessions',
    'Number of currently active sessions',
    ['instance']
)

SESSION_STEPS = Histogram(
    'hermes_session_steps_total',
    'Steps per completed session',
    buckets=[1, 2, 5, 10, 20, 50, 100]
)

COST_TOTAL = Counter(
    'hermes_cost_usd_total',
    'Cumulative API cost in USD',
    ['model', 'provider']
)

LLM_ERRORS_TOTAL = Counter(
    'hermes_llm_errors_total',
    'LLM API errors by type',
    ['model', 'error_type']
)

def track_tool_call(tool_name: str):
    """Decorator: automatically track tool call success/failure/duration."""
    def decorator(func: Callable):
        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            start = time.monotonic()
            try:
                result = await func(*args, **kwargs)
                TOOL_CALLS_TOTAL.labels(tool_name=tool_name, status='success').inc()
                return result
            except TimeoutError:
                # On Python 3.11+ asyncio.TimeoutError is the built-in TimeoutError;
                # on older versions asyncio timeouts fall through to the generic branch.
                TOOL_CALLS_TOTAL.labels(tool_name=tool_name, status='timeout').inc()
                raise
            except Exception:
                TOOL_CALLS_TOTAL.labels(tool_name=tool_name, status='error').inc()
                raise
            finally:
                # Record duration for every outcome, including failures.
                TOOL_DURATION.labels(tool_name=tool_name).observe(time.monotonic() - start)
        return wrapper
    return decorator

class MonitoredHermesAgent:
    """Hermes Agent with full Prometheus instrumentation."""

    # USD per 1,000 tokens (the cost formula below divides token counts by 1000).
    PRICING = {
        "NousResearch/Hermes-3-Llama-3.1-8B": {"input": 0.0002, "output": 0.0002},
        "NousResearch/Hermes-3-Llama-3.1-70B": {"input": 0.0009, "output": 0.0009},
    }

    def __init__(self, model: str, instance_id: str = "default"):
        self.model = model
        self.instance_id = instance_id
        ACTIVE_SESSIONS.labels(instance=instance_id).set(0)

    async def run_session(self, session_id: str, task: str) -> dict:
        ACTIVE_SESSIONS.labels(instance=self.instance_id).inc()
        start = time.monotonic()
        steps = 0
        output = None  # stays None if the stream yields no steps
        try:
            from hermes import HermesAgent, AgentConfig
            agent = HermesAgent(AgentConfig(model=self.model))
            # Start timing before the first step is produced, so each observation
            # covers the time spent generating the step, not just the bookkeeping.
            step_start = time.monotonic()
            async for step_result in agent.run_stream(task):
                steps += 1
                STEP_DURATION.labels(
                    model=self.model,
                    step_type=getattr(step_result, 'step_type', 'unknown')
                ).observe(time.monotonic() - step_start)
                if hasattr(step_result, 'usage'):
                    pt = step_result.usage.prompt_tokens
                    ct = step_result.usage.completion_tokens
                    TOKENS_TOTAL.labels(model=self.model, token_type='prompt').inc(pt)
                    TOKENS_TOTAL.labels(model=self.model, token_type='completion').inc(ct)
                    if self.model in self.PRICING:
                        cost = (pt / 1000 * self.PRICING[self.model]["input"] +
                                ct / 1000 * self.PRICING[self.model]["output"])
                        COST_TOTAL.labels(model=self.model, provider='nous_research').inc(cost)
                output = step_result.content
                if step_result.is_final:
                    break
                step_start = time.monotonic()  # restart the clock for the next step
            SESSION_STEPS.observe(steps)
            REQUEST_DURATION.labels(method='POST', endpoint='/run', status='success').observe(
                time.monotonic() - start
            )
            return {"steps": steps, "output": output}
        except Exception as e:
            LLM_ERRORS_TOTAL.labels(model=self.model, error_type=self._classify(e)).inc()
            REQUEST_DURATION.labels(method='POST', endpoint='/run', status='error').observe(
                time.monotonic() - start
            )
            raise
        finally:
            ACTIVE_SESSIONS.labels(instance=self.instance_id).dec()

    def _classify(self, e: Exception) -> str:
        """Map an exception to a coarse error_type label based on its message."""
        s = str(e).lower()
        if 'rate limit' in s:
            return 'rate_limit'
        if 'context' in s:
            return 'context_too_long'
        if 'timeout' in s:
            return 'timeout'
        return 'api_error'
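The cost accounting in run_session comes down to simple arithmetic. The sketch below isolates it, assuming (as the division by 1000 in the code implies) that the PRICING values are USD per 1,000 tokens. The helper name `estimate_cost_usd` and the token counts are illustrative and not part of Hermes.

```python
# Prices copied from the PRICING table above; assumed to be USD per 1,000 tokens.
PRICING = {
    "NousResearch/Hermes-3-Llama-3.1-8B": {"input": 0.0002, "output": 0.0002},
    "NousResearch/Hermes-3-Llama-3.1-70B": {"input": 0.0009, "output": 0.0009},
}


def estimate_cost_usd(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Hypothetical helper mirroring the cost formula used in run_session."""
    price = PRICING.get(model)
    if price is None:
        # Unknown models contribute no cost, matching the `in self.PRICING` check.
        return 0.0
    return (prompt_tokens / 1000 * price["input"]
            + completion_tokens / 1000 * price["output"])


# Illustrative session: 12,000 prompt tokens and 3,000 completion tokens.
cost_8b = estimate_cost_usd("NousResearch/Hermes-3-Llama-3.1-8B", 12_000, 3_000)
cost_70b = estimate_cost_usd("NousResearch/Hermes-3-Llama-3.1-70B", 12_000, 3_000)
print(f"8B:  ${cost_8b:.4f}")
print(f"70B: ${cost_70b:.4f}")
```

Because hermes_cost_usd_total is a counter, these per-step amounts only ever accumulate; hourly or daily spend is derived later in PromQL with rate() or increase().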
50.3 Prometheus Configuration
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "hermes_alerts.yml"
  - "hermes_recording_rules.yml"

scrape_configs:
  - job_name: 'hermes-agent'
    scrape_interval: 10s
    static_configs:
      - targets: ['10.0.1.10:8000', '10.0.1.11:8000', '10.0.1.12:8000']
        labels:
          service: 'hermes-agent'
          environment: 'production'

  - job_name: 'redis'
    static_configs:
      - targets: ['redis:9121']

  - job_name: 'node'
    static_configs:
      - targets: ['10.0.1.10:9100', '10.0.1.11:9100', '10.0.1.12:9100']
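The hermes-agent job above assumes that each instance serves its metrics over HTTP on port 8000. A minimal sketch of how that payload could be produced with prometheus_client follows. The `demo_tool_call` metric, the private registry, and the `render_metrics` helper are illustrative assumptions; a real service would return this body from its own /metrics route, or call the library's `start_http_server(8000)` instead.

```python
from prometheus_client import (
    CollectorRegistry, Counter, generate_latest, CONTENT_TYPE_LATEST
)

# A private registry keeps this sketch isolated from any global metrics.
registry = CollectorRegistry()
DEMO_CALLS = Counter(
    'demo_tool_call',                # exposed as demo_tool_call_total
    'Demo tool calls by name and status',
    ['tool_name', 'status'],
    registry=registry,
)


def render_metrics(reg: CollectorRegistry = registry):
    """Return (body, content_type) for a /metrics HTTP response."""
    return generate_latest(reg), CONTENT_TYPE_LATEST


# Simulate one successful tool call, then render what Prometheus would scrape.
DEMO_CALLS.labels(tool_name='web_search', status='success').inc()
body, content_type = render_metrics()
print(content_type)
print(body.decode('utf-8'))
```

Whatever web framework is used, the handler only needs to return `body` with the `Content-Type` header set to `content_type`.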
Recording Rules (Reduce Query Load)
# hermes_recording_rules.yml
groups:
  - name: hermes_recording_rules
    interval: 30s
    rules:
      - record: hermes:tool_success_rate:5m
        expr: |
          sum(rate(hermes_tool_call_total{status="success"}[5m])) by (tool_name)
          / sum(rate(hermes_tool_call_total[5m])) by (tool_name)

      - record: hermes:request_duration_p95:5m
        expr: |
          histogram_quantile(0.95,
            sum(rate(hermes_request_duration_seconds_bucket[5m])) by (le, endpoint)
          )

      - record: hermes:cost_per_hour_usd
        expr: sum(rate(hermes_cost_usd_total[1h])) by (model) * 3600
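To read the P95 recording rule correctly, it helps to see how a quantile is estimated from cumulative buckets. The function below is a simplified sketch of the linear interpolation that histogram_quantile performs, not Prometheus's actual implementation. The bucket bounds are those of hermes_request_duration_seconds, and the counts are invented for illustration.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by upper_bound
    and ending with (math.inf, total_count).
    """
    if not 0 < q <= 1:
        raise ValueError("this sketch only handles 0 < q <= 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the target rank.
    for i, (upper, count) in enumerate(buckets):
        if count >= rank:
            break
    if math.isinf(upper):
        # The quantile lies beyond the last finite bound; report that bound.
        return buckets[-2][0]
    lower, prev = (0.0, 0) if i == 0 else buckets[i - 1]
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - prev) / (count - prev)


# Invented cumulative counts over the request-duration bucket bounds.
buckets = [(0.5, 10), (1, 30), (2, 60), (5, 90), (10, 100),
           (30, 100), (60, 100), (120, 100), (300, 100), (math.inf, 100)]
p95 = histogram_quantile(0.95, buckets)
p99 = histogram_quantile(0.99, buckets)
print(p95, p99)
```

The estimate can never be more precise than the bucket layout allows, which is one reason to keep a bucket bound at each alert threshold (30s here).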
50.4 Alert Rules
# hermes_alerts.yml
groups:
  - name: hermes_performance
    rules:
      - alert: HermesHighLatency
        expr: |
          histogram_quantile(0.95,
            sum(rate(hermes_request_duration_seconds_bucket[5m])) by (le)
          ) > 30
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Hermes Agent P95 latency exceeds 30s"
          description: "P95 latency: {{ $value | humanizeDuration }}"

      - alert: HermesCriticalLatency
        expr: |
          histogram_quantile(0.99,
            sum(rate(hermes_request_duration_seconds_bucket[5m])) by (le)
          ) > 120
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Hermes Agent P99 latency critically high"

  - name: hermes_reliability
    rules:
      - alert: HermesToolErrorRate
        expr: |
          (sum(rate(hermes_tool_call_total{status="error"}[5m])) by (tool_name)
          / sum(rate(hermes_tool_call_total[5m])) by (tool_name)) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Tool {{ $labels.tool_name }} error rate above 5%"

      - alert: HermesInstanceDown
        expr: up{job="hermes-agent"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Hermes Agent instance {{ $labels.instance }} is down"

  - name: hermes_cost
    rules:
      - alert: HermesCostOverrun
        expr: sum(hermes:cost_per_hour_usd) > 100
        for: 15m
        labels:
          severity: warning
          team: finance
        annotations:
          summary: "Hermes hourly API cost exceeds $100"
          description: "Current hourly cost: ${{ $value }}"

      - alert: HermesTokenSpike
        expr: |
          sum(rate(hermes_tokens_total[5m]))
          > 2 * avg_over_time(sum(rate(hermes_tokens_total[5m]))[1h:5m])
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Abnormal token consumption spike (2x 1h average)"
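Every rule above carries a `for:` clause, which means the condition must hold continuously for that long before the alert fires. The following sketch models that behaviour as a small state machine. It is a simplified illustration, not Prometheus's rule evaluator, and the function name, sample values, and one-minute evaluation interval are assumptions.

```python
def alert_states(samples, threshold, for_seconds):
    """Return 'inactive', 'pending', or 'firing' for each evaluation.

    samples: list of (timestamp_seconds, value) in evaluation order.
    """
    states = []
    active_since = None  # time at which the condition first became true
    for ts, value in samples:
        if value > threshold:
            if active_since is None:
                active_since = ts
            held_for = ts - active_since
            states.append('firing' if held_for >= for_seconds else 'pending')
        else:
            # Any evaluation below the threshold resets the pending timer.
            active_since = None
            states.append('inactive')
    return states


# P95 latency in seconds, evaluated once a minute against the
# HermesHighLatency rule (> 30 for 5m).
p95_values = [25, 35, 40, 28, 33, 36, 38, 41, 45, 50]
samples = [(i * 60, v) for i, v in enumerate(p95_values)]
states = alert_states(samples, threshold=30, for_seconds=300)
for (ts, value), state in zip(samples, states):
    print(f"t={ts:>3}s  p95={value:>2}s  {state}")
```

The brief dip to 28s at t=180 resets the timer, so the alert fires only at t=540, five minutes after the condition became true again at t=240. This is how `for:` filters out short spikes at the price of slower detection.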
50.5 Common Alert Scenarios
| Alert | Root Cause | Investigation Steps |
|---|---|---|
| High P95 latency | LLM inference bottleneck, tool timeout | Check GPU utilization, tool error logs |
| Tool error rate spike | External API down, rate limit hit | Check tool-specific logs, upstream status |
| Token spike | Prompt injection, infinite loop bug | Review recent sessions, check max_steps |
| Instance down | OOM kill, container crash | Check pod logs, kubectl describe pod |
| Cost overrun | High traffic, expensive tasks | Check active sessions, task complexity |
Summary
This chapter established a complete Prometheus + Grafana monitoring system for Hermes Agent:
- Five-dimension metrics: Performance (latency), reliability (success/error rates), cost (tokens/USD), capacity (sessions), and business (completion rate).
- Code-level instrumentation: A decorator (track_tool_call) and a wrapper class (MonitoredHermesAgent) embed metric collection around tool calls and sessions without changing their logic.
- Prometheus setup: Scrape configuration with static targets, plus recording rules that pre-compute expensive queries to reduce query load.
- Multi-level alerting: Warning → Critical severity, routed by team (platform/finance), integrated with Alertmanager + Slack + PagerDuty.
- Grafana Dashboard: Pre-built JSON for key metrics visualization enabling rapid problem diagnosis.
Review Questions
- Beyond per-token billing, what other cost dimensions should be tracked for LLM inference?
- If Hermes Agent is deployed in an air-gapped private environment, how would you redesign the monitoring architecture?
- How do you define "success" for a tool call? Is an HTTP 200 response sufficient to declare business success?
- Alert fatigue is a common failure mode of monitoring systems. How do you design inhibition rules to prevent alert storms?