Chapter 50

Prometheus + Grafana Monitoring System

Introduction

"You can't manage what you can't measure." For Hermes Agent, this statement is especially true. Agent behavior is far more complex than traditional web services: every reasoning step consumes tokens, every tool call can fail, and every session has its own lifecycle. Without comprehensive monitoring, you're flying blind. This chapter builds a complete Prometheus + Grafana monitoring system, covering everything from basic metric collection to production-grade alerting.


50.1 Key Metrics to Monitor

Before designing the monitoring system, clarify what we care about. Hermes Agent metrics fall into five dimensions:

mindmap
  root((Hermes Metrics))
    Performance
      Request latency P50/P95/P99
      Step execution time
      Tool call duration
    Reliability
      Request success rate
      Tool call success rate
      Error rate by category
    Cost
      Token consumption
      API spend in USD
      GPU utilization
    Capacity
      Active sessions
      Queue depth
      Memory usage
    Business
      Task completion rate
      Avg steps per session

Core Metric Definitions

| Metric Name | Type | Description | Alert Threshold |
| --- | --- | --- | --- |
| hermes_request_duration_seconds | Histogram | End-to-end request latency | P95 > 30s |
| hermes_step_duration_seconds | Histogram | Single-step reasoning latency | P95 > 10s |
| hermes_tool_call_total | Counter | Tool calls by name and status | n/a |
| hermes_tool_call_errors_total | Counter | Tool call failures | Error rate > 5% |
| hermes_tokens_total | Counter | Tokens consumed (prompt/completion) | Daily > budget |
| hermes_active_sessions | Gauge | Currently active Agent sessions | > instance capacity |
| hermes_session_steps_total | Histogram | Steps per completed session | n/a |
| hermes_cost_usd_total | Counter | Cumulative API cost in USD | Daily > budget |
| hermes_llm_errors_total | Counter | LLM API error count | n/a |

50.2 Embedding Metrics in Hermes Agent

# hermes_metrics.py
import time
import functools
from typing import Callable
from prometheus_client import (
    Counter, Histogram, Gauge,
    generate_latest, CONTENT_TYPE_LATEST
)

# Metric definitions
REQUEST_DURATION = Histogram(
    'hermes_request_duration_seconds',
    'End-to-end request latency',
    ['method', 'endpoint', 'status'],
    buckets=[0.5, 1, 2, 5, 10, 30, 60, 120, 300]
)

STEP_DURATION = Histogram(
    'hermes_step_duration_seconds',
    'Duration of a single reasoning step',
    ['model', 'step_type'],
    buckets=[0.1, 0.5, 1, 2, 5, 10, 30]
)

TOOL_CALLS_TOTAL = Counter(
    'hermes_tool_call_total',
    'Total tool calls by name and status',
    ['tool_name', 'status']
)

TOOL_DURATION = Histogram(
    'hermes_tool_duration_seconds',
    'Tool execution duration',
    ['tool_name'],
    buckets=[0.01, 0.05, 0.1, 0.5, 1, 5, 10, 30]
)

TOKENS_TOTAL = Counter(
    'hermes_tokens_total',
    'Total tokens consumed',
    ['model', 'token_type']
)

ACTIVE_SESSIONS = Gauge(
    'hermes_active_sessions',
    'Number of currently active sessions',
    ['instance']
)

SESSION_STEPS = Histogram(
    'hermes_session_steps_total',
    'Steps per completed session',
    buckets=[1, 2, 5, 10, 20, 50, 100]
)

COST_TOTAL = Counter(
    'hermes_cost_usd_total',
    'Cumulative API cost in USD',
    ['model', 'provider']
)

LLM_ERRORS_TOTAL = Counter(
    'hermes_llm_errors_total',
    'LLM API errors by type',
    ['model', 'error_type']
)


def track_tool_call(tool_name: str):
    """Decorator: automatically track tool call success/failure/duration."""
    def decorator(func: Callable):
        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            start = time.monotonic()
            try:
                result = await func(*args, **kwargs)
                TOOL_CALLS_TOTAL.labels(tool_name=tool_name, status='success').inc()
                return result
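            except TimeoutError:
                # On Python < 3.11, asyncio.TimeoutError is a separate class and
                # falls through to the generic branch below (status='error').
                TOOL_CALLS_TOTAL.labels(tool_name=tool_name, status='timeout').inc()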
                raise
            except Exception:
                TOOL_CALLS_TOTAL.labels(tool_name=tool_name, status='error').inc()
                raise
            finally:
                TOOL_DURATION.labels(tool_name=tool_name).observe(time.monotonic() - start)
        return wrapper
    return decorator


class MonitoredHermesAgent:
    """Hermes Agent with full Prometheus instrumentation."""
    
    # Prices in USD per 1,000 tokens (run_session divides token counts by 1000).
    PRICING = {
        "NousResearch/Hermes-3-Llama-3.1-8B": {"input": 0.0002, "output": 0.0002},
        "NousResearch/Hermes-3-Llama-3.1-70B": {"input": 0.0009, "output": 0.0009},
    }
    
    def __init__(self, model: str, instance_id: str = "default"):
        self.model = model
        self.instance_id = instance_id
        ACTIVE_SESSIONS.labels(instance=instance_id).set(0)
    
    async def run_session(self, session_id: str, task: str) -> dict:
        ACTIVE_SESSIONS.labels(instance=self.instance_id).inc()
        start = time.monotonic()
        steps = 0
        
        try:
            from hermes import HermesAgent, AgentConfig
            agent = HermesAgent(AgentConfig(model=self.model))
            
            # Start the clock before the first step is produced, so the observed
            # duration covers the step itself rather than only the bookkeeping below.
            step_start = time.monotonic()
            async for step_result in agent.run_stream(task):
                steps += 1
                
                if hasattr(step_result, 'usage'):
                    pt = step_result.usage.prompt_tokens
                    ct = step_result.usage.completion_tokens
                    
                    TOKENS_TOTAL.labels(model=self.model, token_type='prompt').inc(pt)
                    TOKENS_TOTAL.labels(model=self.model, token_type='completion').inc(ct)
                    
                    if self.model in self.PRICING:
                        cost = (pt / 1000 * self.PRICING[self.model]["input"] +
                                ct / 1000 * self.PRICING[self.model]["output"])
                        COST_TOTAL.labels(model=self.model, provider='nous_research').inc(cost)
                
                STEP_DURATION.labels(
                    model=self.model,
                    step_type=getattr(step_result, 'step_type', 'unknown')
                ).observe(time.monotonic() - step_start)
                
                if step_result.is_final:
                    break
                step_start = time.monotonic()  # the next step starts now
            
            SESSION_STEPS.observe(steps)
            REQUEST_DURATION.labels(method='POST', endpoint='/run', status='success').observe(
                time.monotonic() - start
            )
            return {"steps": steps, "output": step_result.content}
            
        except Exception as e:
            LLM_ERRORS_TOTAL.labels(model=self.model, error_type=self._classify(e)).inc()
            REQUEST_DURATION.labels(method='POST', endpoint='/run', status='error').observe(
                time.monotonic() - start
            )
            raise
        finally:
            ACTIVE_SESSIONS.labels(instance=self.instance_id).dec()
    
    def _classify(self, e: Exception) -> str:
        s = str(e).lower()
        if 'rate limit' in s: return 'rate_limit'
        if 'context' in s: return 'context_too_long'
        if 'timeout' in s: return 'timeout'
        return 'api_error'
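
The module above imports generate_latest and CONTENT_TYPE_LATEST but never shows how the metrics reach Prometheus, while the scrape configuration in 50.3 expects each instance to answer on port 8000. The following is a minimal sketch of one way to close that gap, assuming the agent runs as a standalone process and can use the small HTTP server bundled with prometheus_client. The demo_tool_call metric, the private registry, and the helper names render_metrics and serve_metrics are illustrative and not part of Hermes.

```python
from prometheus_client import (
    CollectorRegistry, Counter, generate_latest, start_http_server
)

# A private registry keeps this demo separate from the default global one.
registry = CollectorRegistry()

# Hypothetical demo metric; the real definitions live in hermes_metrics.py.
DEMO_TOOL_CALLS = Counter(
    'demo_tool_call',
    'Demo tool calls by name and status',
    ['tool_name', 'status'],
    registry=registry,
)


def render_metrics() -> str:
    """Return the current metrics in the Prometheus text exposition format."""
    return generate_latest(registry).decode('utf-8')


def serve_metrics(port: int = 8000) -> None:
    """Start a background HTTP server that answers scrapes on `port`."""
    start_http_server(port, registry=registry)


# Record one call and render what a scrape would return.
DEMO_TOOL_CALLS.labels(tool_name='web_search', status='success').inc()
exposition = render_metrics()
```

If the agent already runs inside a web framework, the same generate_latest output can be returned from a /metrics route with CONTENT_TYPE_LATEST as the Content-Type header instead of starting a second server.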

50.3 Prometheus Configuration

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "hermes_alerts.yml"
  - "hermes_recording_rules.yml"

scrape_configs:
  - job_name: 'hermes-agent'
    scrape_interval: 10s
    static_configs:
      - targets: ['10.0.1.10:8000', '10.0.1.11:8000', '10.0.1.12:8000']
        labels:
          service: 'hermes-agent'
          environment: 'production'

  - job_name: 'redis'
    static_configs:
      - targets: ['redis:9121']

  - job_name: 'node'
    static_configs:
      - targets: ['10.0.1.10:9100', '10.0.1.11:9100', '10.0.1.12:9100']
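
The static target lists above have to be edited by hand whenever an instance is added or removed. One alternative, assuming you switch the job to Prometheus's file-based service discovery (file_sd_configs), is to generate the target file from a script. The sketch below builds the JSON structure that file-based discovery reads; the helper names and the output path are hypothetical.

```python
import json
from pathlib import Path
from typing import Dict, List


def build_file_sd(hosts: List[str], port: int, labels: Dict[str, str]) -> List[dict]:
    """Build one target group in the JSON shape used by file_sd_configs."""
    return [{
        "targets": [f"{host}:{port}" for host in hosts],
        "labels": labels,
    }]


def write_file_sd(path: Path, groups: List[dict]) -> None:
    """Write via a temp file and rename, so a half-written file is never read."""
    tmp = path.with_suffix(path.suffix + ".tmp")
    tmp.write_text(json.dumps(groups, indent=2))
    tmp.replace(path)


groups = build_file_sd(
    hosts=["10.0.1.10", "10.0.1.11", "10.0.1.12"],
    port=8000,
    labels={"service": "hermes-agent", "environment": "production"},
)
# Example call (the path is an assumption):
# write_file_sd(Path("targets/hermes.json"), groups)
```

To use the generated file, the hermes-agent job would list it under file_sd_configs in place of static_configs; Prometheus picks up changes to the file without a restart.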

Recording Rules (Reduce Query Load)

# hermes_recording_rules.yml
groups:
  - name: hermes_recording_rules
    interval: 30s
    rules:
      - record: hermes:tool_success_rate:5m
        expr: |
          sum(rate(hermes_tool_call_total{status="success"}[5m])) by (tool_name)
          / sum(rate(hermes_tool_call_total[5m])) by (tool_name)
      
      - record: hermes:request_duration_p95:5m
        expr: |
          histogram_quantile(0.95,
            sum(rate(hermes_request_duration_seconds_bucket[5m])) by (le, endpoint)
          )
      
      - record: hermes:cost_per_hour_usd
        expr: sum(rate(hermes_cost_usd_total[1h])) by (model) * 3600
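
The P95 recording rule relies on histogram_quantile, which does not see individual request durations. It only sees cumulative bucket counts and interpolates linearly inside the bucket that contains the target rank. The sketch below is a simplified Python rendering of that idea, not Prometheus's actual implementation, and the bucket counts are made up for illustration. It uses the bucket boundaries defined for hermes_request_duration_seconds.

```python
import math
from typing import Sequence, Tuple


def histogram_quantile(q: float, buckets: Sequence[Tuple[float, float]]) -> float:
    """Estimate quantile q from cumulative (upper_bound, count) pairs.

    Buckets must be sorted by upper bound and end with (+Inf, total).
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # The quantile lies beyond the last finite bucket:
                # report the highest finite boundary.
                return lower_bound
            in_bucket = count - lower_count
            if in_bucket == 0:
                return lower_bound
            # Linear interpolation inside the bucket.
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical cumulative counts for 100 requests.
request_buckets = [
    (0.5, 10), (1, 30), (2, 60), (5, 80), (10, 90),
    (30, 96), (60, 99), (120, 100), (300, 100), (math.inf, 100),
]
p95 = histogram_quantile(0.95, request_buckets)  # about 26.67 seconds
```

Here rank 95 falls in the 10s to 30s bucket, so the estimate is 10 + 20 * (95 - 90) / (96 - 90). The result can never be more precise than the bucket it lands in, which is one reason to place a bucket boundary at each alert threshold, as the 30s boundary does for the P95 alert.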

50.4 Alert Rules

# hermes_alerts.yml
groups:
  - name: hermes_performance
    rules:
      - alert: HermesHighLatency
        expr: |
          histogram_quantile(0.95,
            sum(rate(hermes_request_duration_seconds_bucket[5m])) by (le)
          ) > 30
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Hermes Agent P95 latency exceeds 30s"
          description: "P95 latency: {{ $value | humanizeDuration }}"
      
      - alert: HermesCriticalLatency
        expr: |
          histogram_quantile(0.99,
            sum(rate(hermes_request_duration_seconds_bucket[5m])) by (le)
          ) > 120
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Hermes Agent P99 latency critically high"

  - name: hermes_reliability
    rules:
      - alert: HermesToolErrorRate
        # Timeouts count as failures too: track_tool_call records them under status="timeout".
        expr: |
          (sum(rate(hermes_tool_call_total{status!="success"}[5m])) by (tool_name)
          / sum(rate(hermes_tool_call_total[5m])) by (tool_name)) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Tool {{ $labels.tool_name }} error rate above 5%"
      
      - alert: HermesInstanceDown
        expr: up{job="hermes-agent"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Hermes Agent instance {{ $labels.instance }} is down"

  - name: hermes_cost
    rules:
      - alert: HermesCostOverrun
        expr: sum(hermes:cost_per_hour_usd) > 100
        for: 15m
        labels:
          severity: warning
          team: finance
        annotations:
          summary: "Hermes hourly API cost exceeds $100"
          description: "Current hourly cost: ${{ $value }}"
      
      - alert: HermesTokenSpike
        expr: |
          sum(rate(hermes_tokens_total[5m]))
          > 2 * avg_over_time(sum(rate(hermes_tokens_total[5m]))[1h:5m])
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Abnormal token consumption spike (2x 1h average)"
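
The HermesTokenSpike rule compares the current token rate with twice its own one-hour average, where the average is taken over the 5-minute rate sampled every 5 minutes. The sketch below mirrors that comparison in plain Python to show how it behaves. It is a rough model, assuming about twelve evenly spaced samples with the most recent one standing in for the current rate; the function name and sample values are illustrative.

```python
from statistics import mean
from typing import Sequence


def is_token_spike(rate_samples: Sequence[float], factor: float = 2.0) -> bool:
    """Return True if the newest rate exceeds `factor` times the window average.

    rate_samples: tokens/second rates over the last hour, oldest first.
    """
    if not rate_samples:
        return False
    baseline = mean(rate_samples)  # the baseline includes the newest sample
    return rate_samples[-1] > factor * baseline


# A sudden jump at the end of a quiet hour trips the condition.
sudden = [100.0] * 11 + [400.0]
# The same rate sustained for half an hour has already lifted the baseline.
sustained = [100.0] * 6 + [400.0] * 6

sudden_fires = is_token_spike(sudden)
sustained_fires = is_token_spike(sustained)
```

In the first case the baseline is 125 and 400 exceeds 250; in the second the baseline is 250 and 400 does not exceed 500. A relative rule like this catches abrupt changes but goes quiet once elevated usage becomes the new normal, so it complements the absolute cost alert rather than replacing it.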

50.5 Common Alert Scenarios

| Alert | Root Cause | Investigation Steps |
| --- | --- | --- |
| High P95 latency | LLM inference bottleneck, tool timeout | Check GPU utilization, tool error logs |
| Tool error rate spike | External API down, rate limit hit | Check tool-specific logs, upstream status |
| Token spike | Prompt injection, infinite loop bug | Review recent sessions, check max_steps |
| Instance down | OOM kill, container crash | Check pod logs, kubectl describe pod |
| Cost overrun | High traffic, expensive tasks | Check active sessions, task complexity |

Summary

This chapter established a complete Prometheus + Grafana monitoring system for Hermes Agent:

  1. Five-dimension metrics: Performance (latency), reliability (success/error rates), cost (tokens/USD), capacity (sessions), and business (completion rate).
  2. Code-level instrumentation: A decorator and try/finally blocks embed metric collection into the Agent's core code paths without changing their behavior.
  3. Prometheus setup: Includes scrape configuration, recording rules (pre-computation reduces query load), and service discovery patterns.
  4. Multi-level alerting: Warning → Critical severity, routed by team (platform/finance), integrated with Alertmanager + Slack + PagerDuty.
  5. Grafana Dashboard: Pre-built JSON for key metrics visualization enabling rapid problem diagnosis.

Review Questions

  1. Beyond per-token billing, what other cost dimensions should be tracked for LLM inference?
  2. If Hermes Agent is deployed in an air-gapped private environment, how would you redesign the monitoring architecture?
  3. How do you define "success" for a tool call? Is an HTTP 200 response sufficient to declare business success?
  4. Alert fatigue is a common failure mode of monitoring systems. How do you design inhibition rules to prevent alert storms?