Chapter 41

Monitoring and Observability: Structured Logging, Health Checks, and Alert Configuration

41.1 The Observability Challenge for AI Agents

Traditional software behaves predictably: given the same input, the output is stable and consistent. AI Agents work fundamentally differently — their decisions are driven by large language models, meaning that the same instruction can take completely different tool-calling paths at different times and in different contexts. This non-determinism renders conventional monitoring approaches severely inadequate for Agent workloads.

41.1.1 Why AI Agents Are Harder to Monitor

Path explosion: A simple instruction like "clean up my inbox" might trigger anywhere from 3 to 20 tool calls, depending on the current state of the mailbox, model version, and context window length. You can't monitor a fixed call chain the way you would with a traditional API.

Semantic ambiguity of errors: When a tool returns HTTP 200 but the content says "no relevant data found," the Agent might handle it gracefully — or it might enter an infinite retry loop. Traditional error-rate metrics can't capture these semantic-level failures.

Temporal scale variation: A single Session might last 30 seconds (a simple query) or 4 hours (a complex refactoring task). Fixed time-window monitoring metrics have limited value in this context.

Unpredictable resource consumption: LLM token usage and Context Compaction frequency directly impact cost and latency, yet both are difficult to estimate before a task begins.

Irreversible tool side effects: Agent actions like deleting files, sending emails, or deploying services cannot be undone once executed. Monitoring must alert before problems occur, not merely record them afterward.

41.1.2 OpenClaw's Observability Design Philosophy

OpenClaw adopts a structured-log-first design: all runtime events are emitted in JSON format, with each field carrying a fixed semantic meaning that makes machine parsing and aggregation straightforward. Health check endpoints provide real-time status snapshots rather than relying on the latency of log aggregation. Alert rules are built on business semantics (consecutive failures, abnormal Sessions) rather than simple technical thresholds (CPU > 80%).

These three layers together form a complete observability system for AI Agents: logs (what happened), health checks (what the current state is), and alerts (what requires immediate action).


41.2 Logging Configuration in Detail

OpenClaw's logging behavior is controlled via the logging section of openclaw.json.

41.2.1 Complete Configuration Structure

{
  "logging": {
    "level": "info",
    "format": "json",
    "output": {
      "console": {
        "enabled": true,
        "colorize": false
      },
      "file": {
        "enabled": true,
        "path": "~/.openclaw/logs/openclaw.log",
        "rotation": {
          "maxSize": "100MB",
          "maxFiles": 7,
          "compress": true
        }
      },
      "syslog": {
        "enabled": false,
        "facility": "local0",
        "host": "127.0.0.1",
        "port": 514
      }
    },
    "fields": {
      "service": "openclaw",
      "environment": "production",
      "version": "${OPENCLAW_VERSION}"
    },
    "sampling": {
      "enabled": false,
      "rate": 0.1,
      "excludeLevels": ["error", "warn"]
    }
  }
}
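
A quick way to confirm the section still parses after editing (the config path is an assumption; adjust to wherever your openclaw.json lives):

# Pretty-print just the logging subtree; jq fails loudly on invalid JSON
jq '.logging' ~/.openclaw/openclaw.json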

41.2.2 Choosing the Right Log Level

| Level | Recommended Use                            | Approximate Volume                      |
|-------|--------------------------------------------|-----------------------------------------|
| debug | Local development, active troubleshooting  | Very high (10–50 entries per tool call) |
| info  | Production default, records key events     | Moderate (3–8 entries per tool call)    |
| warn  | Only surface potential issues              | Low (exceptional conditions only)       |
| error | Only record failures                       | Very low (alert-only use)               |

Production recommendation: Use info by default. When debugging a specific Session, you can switch levels dynamically without restarting the Gateway:

# Temporarily upgrade to debug for full tool-call details
openclaw config set logging.level debug

# Enable debug only for a specific Agent
openclaw config set logging.agentFilter "my-agent-id:debug"

# Restore default after debugging
openclaw config set logging.level info

41.2.3 Log Output Path Planning

Recommended production directory structure:

~/.openclaw/logs/
├── openclaw.log           # Active log file (rolling writes)
├── openclaw.2026-04-25.gz # Compressed archive (auto-generated)
├── openclaw.2026-04-24.gz
└── gateway-access.log     # Gateway HTTP access log (separate)

Log rotation parameter notes:

- maxSize (100MB): the active file rotates once it reaches this size; combined with maxFiles, this caps pre-compression disk usage at roughly 700 MB.
- maxFiles (7): number of rotated archives kept; older ones are deleted, giving about a week of history at one rotation per day.
- compress (true): rotated archives are gzipped (the .gz files above); JSON logs typically compress by roughly an order of magnitude.
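
To sanity-check rotation in practice, look at the compressed archives; a rough sketch:

# Average compressed archive size per day (rough daily-volume estimate)
ls -l ~/.openclaw/logs/openclaw.*.gz | \
  awk '{sum += $5; n++} END {if (n) printf "%.1f MB avg/day across %d archives\n", sum/n/1048576, n}'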

41.2.4 Structured Log Field Examples

A typical completed tool-call log entry:

{
  "timestamp": "2026-04-26T09:23:41.882Z",
  "level": "info",
  "event": "tool_call_completed",
  "agentId": "agent_7f3k2m",
  "sessionId": "sess_9a1b3c",
  "tool": "bash",
  "duration": 342,
  "tokenUsage": {
    "input": 1204,
    "output": 387
  },
  "exitCode": 0,
  "error": null,
  "service": "openclaw",
  "environment": "production",
  "version": "2026.4.22"
}

A Session start log entry:

{
  "timestamp": "2026-04-26T09:23:40.001Z",
  "level": "info",
  "event": "session_started",
  "agentId": "agent_7f3k2m",
  "sessionId": "sess_9a1b3c",
  "trigger": "cron",
  "model": "claude-opus-4-5",
  "thinkingMode": "standard",
  "maxTokenBudget": 50000,
  "parentSessionId": null,
  "service": "openclaw"
}

A tool failure log entry:

{
  "timestamp": "2026-04-26T09:24:11.543Z",
  "level": "error",
  "event": "tool_call_failed",
  "agentId": "agent_7f3k2m",
  "sessionId": "sess_9a1b3c",
  "tool": "web_fetch",
  "duration": 30001,
  "error": {
    "code": "TIMEOUT",
    "message": "Request exceeded 30s timeout",
    "retryCount": 2,
    "willRetry": false
  },
  "service": "openclaw"
}

41.3 Health Check Endpoint Analysis

The Gateway exposes a local health check endpoint for fast, point-in-time status assessment.

41.3.1 Endpoint Address

http://127.0.0.1:18789/health

By default this listens only on the loopback interface and is not exposed externally. To enable remote access for CI/CD or monitoring systems, configure explicitly:

{
  "gateway": {
    "health": {
      "host": "0.0.0.0",
      "port": 18789,
      "auth": {
        "enabled": true,
        "token": "${HEALTH_CHECK_TOKEN}"
      }
    }
  }
}
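
With auth enabled, remote probes must present the token. A minimal sketch, assuming the Gateway accepts a standard Bearer header (the header name is an assumption; verify against your deployment):

# Remote health probe with token auth (Bearer header assumed)
curl -sf -H "Authorization: Bearer ${HEALTH_CHECK_TOKEN}" \
  http://gateway-host:18789/health | jq '.status'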

41.3.2 Full Response Field Analysis

{
  "status": "healthy",
  "version": "2026.4.22",
  "uptime": 86432,
  "gateway": {
    "status": "running",
    "pid": 12847,
    "port": 18789,
    "connections": {
      "active": 3,
      "idle": 12,
      "total": 58204
    }
  },
  "sessions": {
    "active": 2,
    "queued": 0,
    "completed24h": 147,
    "failed24h": 3,
    "successRate24h": 0.9796
  },
  "models": {
    "current": "claude-opus-4-5",
    "available": ["claude-opus-4-5", "claude-sonnet-4-5", "claude-haiku-4-5"],
    "lastModelCallMs": 1843
  },
  "resources": {
    "memoryUsedMB": 412,
    "memoryLimitMB": 2048,
    "cpuPercent": 12.4,
    "diskAvailableGB": 48.3
  },
  "tools": {
    "registered": 28,
    "errorsLastHour": 1,
    "errorRate1h": 0.012
  },
  "compaction": {
    "count24h": 7,
    "lastCompactionAt": "2026-04-26T08:41:12Z"
  },
  "timestamp": "2026-04-26T09:30:00Z"
}

Field monitoring value notes:

- sessions.successRate24h: the core business health signal; Section 41.5 sets warning below 0.95 and critical below 0.90.
- tools.errorRate1h: tool-layer stability; pair with the log queries in 41.4.2 to find the offending tool.
- resources.memoryUsedMB vs. memoryLimitMB: capacity planning; alert above 90% utilization.
- compaction.count24h: token budget fit; high values risk context loss (see 41.5.2).
- models.lastModelCallMs: LLM API health; one slow call is noise, a sustained rise is not.

41.3.3 Using curl for Health Checks

# Basic status check
curl -s http://127.0.0.1:18789/health | jq '.status'

# Check Session success rate
curl -s http://127.0.0.1:18789/health | jq '.sessions.successRate24h'

# Check memory utilization percentage
curl -s http://127.0.0.1:18789/health | \
  jq '(.resources.memoryUsedMB / .resources.memoryLimitMB * 100 | round | tostring) + "% memory used"'

# Kubernetes readiness probe usage
curl -sf http://127.0.0.1:18789/health && echo "READY" || echo "NOT READY"
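
On hosts without a full monitoring stack, a plain shell watchdog can record status over time; a minimal sketch:

# Poll the endpoint every 30s and append a timestamped status line
while sleep 30; do
  status=$(curl -sf --max-time 5 http://127.0.0.1:18789/health | jq -r '.status' 2>/dev/null)
  echo "$(date -u +%FT%TZ) ${status:-unreachable}" >> ~/.openclaw/logs/health-watch.log
done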

41.4 Monitoring Value of Key Log Fields

41.4.1 Field Index and Meaning

| Field             | Type            | Monitoring Value                                                                 |
|-------------------|-----------------|----------------------------------------------------------------------------------|
| timestamp         | ISO 8601 string | Millisecond precision; used for latency calculation and time-series correlation  |
| level             | Enum            | Filter severe events; error level demands immediate attention                     |
| agentId           | String          | Track the behavioral pattern of a single Agent across Sessions                    |
| sessionId         | String          | Groups all logs for a single task; the core dimension for root-cause analysis     |
| tool              | String          | Identifies high-error-rate tools; pinpoints configuration or permission issues    |
| duration          | Integer (ms)    | P99 latency analysis; detects tool performance degradation                        |
| error.code        | String          | Error classification (TIMEOUT / AUTH_FAIL / RATE_LIMIT, etc.)                     |
| error.retryCount  | Integer         | High retry counts indicate upstream instability                                   |
| tokenUsage.input  | Integer         | Context growth rate; predicts when Compaction will trigger                        |
| tokenUsage.output | Integer         | Model output volume; correlates with quality assessment                           |
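
The duration field supports the P99 analysis mentioned above. A jq sketch over the NDJSON log (nearest-rank approximation):

# Approximate P99 duration (ms) per tool from completed calls
jq -s 'map(select(.event == "tool_call_completed"))
  | group_by(.tool)
  | map({tool: .[0].tool,
         p99: (map(.duration) | sort | .[((length - 1) * 0.99 | floor)])})' \
  ~/.openclaw/logs/openclaw.log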

41.4.2 Combined Field Query Examples

Using jq to extract analytical value from log files:

# Count errors by tool
cat ~/.openclaw/logs/openclaw.log | \
  jq -r 'select(.event == "tool_call_failed") | .tool' | \
  sort | uniq -c | sort -rn

# Find the slowest tool calls (over 10 seconds)
cat ~/.openclaw/logs/openclaw.log | \
  jq 'select(.duration > 10000) | {tool, duration, sessionId}' | \
  jq -s 'sort_by(-.duration) | .[0:10]'

# Calculate total token consumption for a specific Session
SESSION_ID="sess_9a1b3c"
cat ~/.openclaw/logs/openclaw.log | \
  jq --arg sid "$SESSION_ID" \
  'select(.sessionId == $sid and .tokenUsage != null) | .tokenUsage.input + .tokenUsage.output' | \
  paste -sd+ - | bc   # the trailing "-" makes BSD/macOS paste read stdin

# Find Sessions that failed repeatedly
cat ~/.openclaw/logs/openclaw.log | \
  jq -r 'select(.event == "session_failed") | .sessionId' | \
  awk 'seen[$0]++ {print $0 " appeared " seen[$0] " times"}' | head -10

41.5 10 Key Metrics and Reference Thresholds

41.5.1 Metric Definitions

| Metric                | Calculation                      | Warning Threshold | Critical Threshold | Notes                         |
|-----------------------|----------------------------------|-------------------|--------------------|-------------------------------|
| Session success rate  | successes / total (24h rolling)  | < 0.95            | < 0.90             | Core business health          |
| Tool error rate       | failed tool calls / total (1h)   | > 0.03            | > 0.08             | Tool-layer stability          |
| P99 tool latency      | 99th percentile duration value   | > 5000ms          | > 15000ms          | Tool call performance         |
| Compaction frequency  | count/24h per Agent              | > 10              | > 25               | Token budget fit              |
| LLM response time     | last modelCallMs                 | > 3000ms          | > 8000ms           | API health status             |
| Tokens per Session    | average tokenUsage.input + output| > 40000           | > 80000            | Cost control                  |
| Memory utilization    | usedMB / limitMB                 | > 0.75            | > 0.90             | Capacity planning             |
| Gateway restarts      | event=gateway_restart (24h)      | > 1               | > 3                | Stability indicator           |
| Active connections    | connections.active               | > 50              | > 100              | Concurrency capacity          |
| High-retry tool calls | ratio with retryCount > 2        | > 0.05            | > 0.15             | Upstream dependency stability |
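
The critical thresholds above can also be enforced without Prometheus; a shell sketch that exits nonzero when any is crossed (handy in cron or CI):

# Exit 2 if the Gateway is unreachable, 1 if a critical threshold is breached
h=$(curl -sf http://127.0.0.1:18789/health) || { echo "gateway unreachable"; exit 2; }
echo "$h" | jq -e '
  (.sessions.successRate24h >= 0.90)
  and (.tools.errorRate1h <= 0.08)
  and ((.resources.memoryUsedMB / .resources.memoryLimitMB) <= 0.90)
' > /dev/null || { echo "critical threshold breached"; exit 1; }
echo "all critical thresholds OK"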

41.5.2 Business Interpretation of Metrics

Session success rate is the most important business indicator. Dropping below 95% typically means a commonly used tool has a configuration problem, or the LLM is unable to correctly interpret the instructions in SKILL.md. Prioritize investigating what changed recently.

Compaction frequency is a hidden performance signal. Every Compaction truncates the context, which can cause the Agent to forget earlier operation results, leading to duplicated work or logical errors. High frequency indicates that task design didn't adequately account for token budgets.

Average tokens per Session directly correlates with LLM API cost. Sessions exceeding 80,000 tokens may indicate inefficient context inflation — for example, injecting entire large files into context rather than using tools to read them on demand.


41.6 Integration with Prometheus / Grafana

OpenClaw's structured logs provide a natural integration point with existing monitoring stacks. The recommended approach uses prometheus-json-exporter to bridge the health check endpoint, and Loki or Vector to process structured logs.

41.6.1 prometheus-json-exporter Configuration

Installation:

# macOS
brew install prometheus-json-exporter

# Or using Docker
docker run -d -p 7979:7979 \
  -v $(pwd)/json_exporter.yml:/etc/json_exporter/config.yml \
  prometheuscommunity/json-exporter \
  --config.file /etc/json_exporter/config.yml

json_exporter.yml configuration (scraping the OpenClaw health check endpoint):

modules:
  openclaw:
    metrics:
      - name: openclaw_session_success_rate
        type: gauge
        help: "Session success rate over last 24 hours"
        path: "{ .sessions.successRate24h }"

      - name: openclaw_session_active
        type: gauge
        help: "Currently active sessions"
        path: "{ .sessions.active }"

      - name: openclaw_tool_error_rate_1h
        type: gauge
        help: "Tool error rate in the last hour"
        path: "{ .tools.errorRate1h }"

      - name: openclaw_memory_used_mb
        type: gauge
        help: "Gateway memory used in MB"
        path: "{ .resources.memoryUsedMB }"

      - name: openclaw_memory_limit_mb
        type: gauge
        help: "Gateway memory limit in MB"
        path: "{ .resources.memoryLimitMB }"

      - name: openclaw_compaction_count_24h
        type: gauge
        help: "Context compaction events in last 24 hours"
        path: "{ .compaction.count24h }"

      - name: openclaw_last_model_call_ms
        type: gauge
        help: "Last LLM API call latency in milliseconds"
        path: "{ .models.lastModelCallMs }"

      - name: openclaw_gateway_uptime_seconds
        type: counter
        help: "Gateway uptime in seconds"
        path: "{ .uptime }"
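
Before pointing Prometheus at the exporter, verify it can probe the endpoint directly:

# Manual probe: should return openclaw_* metrics in Prometheus text format
curl -s "http://localhost:7979/probe?module=openclaw&target=http://127.0.0.1:18789/health"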

Add to Prometheus scrape_configs:

scrape_configs:
  - job_name: 'openclaw'
    metrics_path: /probe
    params:
      module: [openclaw]
    static_configs:
      # The target is the JSON endpoint to probe; the relabel rules below move
      # it into the ?target= query parameter and point the scrape at the exporter.
      - targets: ['http://127.0.0.1:18789/health']
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - target_label: __address__
        replacement: 'localhost:7979'

41.6.2 Vector Log Collection Configuration

Use Vector to ship OpenClaw's JSON logs to Loki or Elasticsearch:

# vector.toml

[sources.openclaw_logs]
type = "file"
# The file source does not expand "~"; use an absolute path
# (Vector interpolates ${HOME} from the environment)
include = ["${HOME}/.openclaw/logs/openclaw.log"]
read_from = "beginning"

[transforms.parse_openclaw]
type = "remap"
inputs = ["openclaw_logs"]
source = '''
  . = parse_json!(string!(.message))
  .source = "openclaw"
'''

[transforms.filter_errors]
type = "filter"
inputs = ["parse_openclaw"]
condition = '.level == "error" || .level == "warn"'

[sinks.loki]
type = "loki"
inputs = ["parse_openclaw"]
endpoint = "http://localhost:3100"
labels.job = "openclaw"
labels.level = "{{ level }}"
labels.agent_id = "{{ agentId }}"
encoding.codec = "json"

[sinks.error_alerts]
type = "http"
inputs = ["filter_errors"]
# Alertmanager's v1 API has been removed; v2 expects a JSON array of alert
# objects, so in practice add a remap step that shapes the payload first.
uri = "http://alertmanager:9093/api/v2/alerts"
encoding.codec = "json"
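
Vector can lint this file before deployment:

# Validate config syntax and component wiring
vector validate vector.toml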

41.6.3 Grafana Dashboard Core Panels

Recommended 4-row Dashboard layout:

Row 1: Overview    (Session success rate / Active Sessions / Tool error rate / Gateway uptime)
Row 2: Performance (LLM response latency trend / P99 tool latency / Token consumption heatmap)
Row 3: Resources   (Memory utilization / Compaction frequency trend / Connection count)
Row 4: Error Detail (Error log stream / Error distribution by tool / High-retry calls)

Key PromQL queries:

# Session success rate (highlight when below 0.95)
openclaw_session_success_rate < 0.95

# Memory utilization ratio
openclaw_memory_used_mb / openclaw_memory_limit_mb

# Tool error rate change trend (5-minute average)
avg_over_time(openclaw_tool_error_rate_1h[5m])

# Is Gateway online (1 = online, 0 = offline)
up{job="openclaw"}

41.7 Alert Rule Design

41.7.1 Prometheus AlertManager Rules

# openclaw_alerts.yml
groups:
  - name: openclaw_critical
    rules:
      - alert: OpenClawGatewayDown
        # up only tracks the exporter; absent() catches the case where the
        # exporter is fine but its probe of the Gateway returns nothing.
        expr: up{job="openclaw"} == 0 or absent(openclaw_gateway_uptime_seconds)
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "OpenClaw Gateway is unreachable"
          description: "No health metrics for 1 minute. Check the Gateway process immediately."

      - alert: OpenClawSessionSuccessRateLow
        expr: openclaw_session_success_rate < 0.90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Session success rate critically low (current: {{ $value | humanize }})"
          description: "24-hour Session success rate has fallen below 90%, indicating possible systemic failure."

      - alert: OpenClawMemoryOverLimit
        expr: (openclaw_memory_used_mb / openclaw_memory_limit_mb) > 0.90
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "Gateway memory usage exceeds 90%"
          description: "Current memory usage is {{ $value | humanizePercentage }}. OOM risk is high. Restart or scale up immediately."

  - name: openclaw_warning
    rules:
      - alert: OpenClawToolErrorRateHigh
        expr: openclaw_tool_error_rate_1h > 0.05
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Tool error rate too high ({{ $value | humanizePercentage }})"
          description: "Tool call error rate exceeded 5% in the last hour. Check tool configuration and permissions."

      - alert: OpenClawCompactionFrequencyHigh
        expr: openclaw_compaction_count_24h > 20
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "Context Compaction frequency too high ({{ $value }} times/day)"
          description: "More than 20 Compaction events per day may cause context loss and logical errors. Optimize token budget configuration."

      - alert: OpenClawLLMLatencyHigh
        expr: openclaw_last_model_call_ms > 5000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "LLM API response latency too high ({{ $value }}ms)"
          description: "Last LLM call took over 5 seconds. Possible API throttling or network issue."

      - alert: OpenClawSessionSuccessRateDegraded
        expr: openclaw_session_success_rate < 0.95
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Session success rate declining (current: {{ $value | humanize }})"
          description: "Success rate below 95%. Review recent failed Session logs."
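
Lint the rule file before loading it into Prometheus:

# promtool ships with Prometheus and checks rule syntax and PromQL expressions
promtool check rules openclaw_alerts.yml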

41.7.2 Alert Notification Channel Configuration

# alertmanager.yml
route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      repeat_interval: 30m
    - match:
        severity: warning
      receiver: 'slack-warnings'

receivers:
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - routing_key: '${PAGERDUTY_KEY}'
        description: '{{ .CommonAnnotations.summary }}'

  - name: 'slack-warnings'
    slack_configs:
      - api_url: '${SLACK_WEBHOOK_URL}'
        channel: '#openclaw-alerts'
        title: '[{{ .Status | toUpper }}] {{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'
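
amtool performs the equivalent check for the routing configuration:

# Validate routing tree and receiver configuration
amtool check-config alertmanager.yml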

41.8 The openclaw doctor Command

openclaw doctor is the built-in diagnostic tool and should be the first command you run when an alert fires or when you suspect a configuration issue.

41.8.1 Doctor Check Items

openclaw gateway doctor

Typical output:

OpenClaw Diagnostic Report — 2026-04-26 09:30:00

[✓] Gateway process running (PID: 12847)
[✓] Gateway port 18789 accessible
[✓] Health check responding (status: healthy)
[✓] Configuration file valid (openclaw.json)
[✓] Model API reachable (claude-opus-4-5, latency: 1843ms)
[✓] Log directory writable (~/.openclaw/logs)
[✗] Disk space low: 2.1 GB available (threshold: 5 GB)
[✓] All registered tools (28) responding to ping
[!] Tool 'web_fetch' error rate 8.2% in last hour (threshold: 5%)
[✓] No active sessions in error state
[✓] Memory usage 20.1% (412/2048 MB)
[✓] Context compaction count: 7/day (threshold: 20)

Summary: 1 error, 1 warning
Action required: Free disk space; investigate web_fetch errors

41.8.2 Common doctor Subcommands

# Check only Gateway connectivity
openclaw gateway doctor --check connectivity

# Check configuration completeness of all tools
openclaw gateway doctor --check tools

# Output machine-readable JSON format
openclaw gateway doctor --format json

# Verbose mode (prints raw data for each check)
openclaw gateway doctor --verbose

# Check configuration for a specific Agent
openclaw gateway doctor --agent my-agent-id
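
The JSON output lends itself to scheduled runs. A cron sketch, assuming the report exposes its issue list under .issues as used in Section 41.10 (the exact schema is an assumption):

# Nightly doctor run; keep the report and surface any issues found
OUT=/tmp/doctor-$(date +%F).json
openclaw gateway doctor --format json > "$OUT"
jq -e '.issues | length == 0' "$OUT" > /dev/null || \
  echo "doctor found issues; see $OUT" >&2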

41.9 Debugging Techniques

41.9.1 Dynamic Log Level Adjustment

Change log levels dynamically without restarting the Gateway:

# Switch to debug to capture full tool-call detail
openclaw config set logging.level debug

# Stream and pretty-print logs in real time
tail -f ~/.openclaw/logs/openclaw.log | jq '.'

# Filter to a specific Session
tail -f ~/.openclaw/logs/openclaw.log | \
  jq 'select(.sessionId == "sess_9a1b3c")'

# Show only errors and warnings
tail -f ~/.openclaw/logs/openclaw.log | \
  jq 'select(.level == "error" or .level == "warn")'

# Restore info level after debugging
openclaw config set logging.level info

41.9.2 Manual RPC Testing

The Gateway exposes RPC endpoints for manual tool testing:

# List registered tools
curl -X POST http://127.0.0.1:18789/rpc \
  -H "Content-Type: application/json" \
  -d '{"method": "tools/list", "id": 1}'

# Manually invoke a tool (testing the bash tool)
curl -X POST http://127.0.0.1:18789/rpc \
  -H "Content-Type: application/json" \
  -d '{
    "method": "tools/call",
    "params": {
      "name": "bash",
      "arguments": {"command": "echo hello"}
    },
    "id": 2
  }'

# Check current Session states
curl -s http://127.0.0.1:18789/sessions | jq '.'

41.9.3 Sandbox Mode Debugging

Use openclaw sandbox explain to analyze tool-call paths in a sandboxed environment without triggering real side effects:

# Run an Agent in sandbox mode and explain its decision-making
openclaw sandbox explain \
  --agent my-agent-id \
  --message "sort today's emails" \
  --show-tool-calls \
  --show-reasoning

41.10 Production Problem Diagnosis Workflow

41.10.1 Standard Diagnostic SOP

When an alert fires, follow this sequence:

Step 1: Confirm Gateway is online

curl -sf http://127.0.0.1:18789/health | jq '.status'
# No response? Check the process:   openclaw gateway status
# Process stopped? Review history:  openclaw gateway dashboard

Step 2: Run doctor

OUT=/tmp/doctor_$(date +%s).json
openclaw gateway doctor --format json > "$OUT"
jq '.issues' "$OUT"

Step 3: Review failed Session logs

# Find the most recently failed Session IDs
cat ~/.openclaw/logs/openclaw.log | \
  jq 'select(.event == "session_failed") | {sessionId, timestamp, error}' | \
  tail -5

# Extract the complete log for a specific Session
SESSION_ID="sess_xxxx"
cat ~/.openclaw/logs/openclaw.log | \
  jq --arg sid "$SESSION_ID" 'select(.sessionId == $sid)'

Step 4: Identify the root-cause tool

# Find which tool has the most errors
cat ~/.openclaw/logs/openclaw.log | \
  jq -r 'select(.event == "tool_call_failed") | .tool' | \
  sort | uniq -c | sort -rn | head -5

Step 5: Validate the fix

# Manually test the problematic tool
openclaw sandbox explain --message "test <tool-name> operation" --show-tool-calls

# Re-run doctor to confirm resolution
openclaw gateway doctor
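
Steps 1 through 4 can be bundled into a single triage script for on-call use; a sketch:

#!/usr/bin/env bash
# Triage sketch: gateway status, doctor issues, recent failures, worst tools
set -u
LOG="$HOME/.openclaw/logs/openclaw.log"

echo "== 1. gateway status =="
status=$(curl -sf http://127.0.0.1:18789/health | jq -r '.status')
echo "${status:-UNREACHABLE}"

echo "== 2. doctor issues =="
openclaw gateway doctor --format json | jq '.issues'

echo "== 3. last failed sessions =="
jq -r 'select(.event == "session_failed") | "\(.timestamp) \(.sessionId)"' "$LOG" | tail -5

echo "== 4. top failing tools =="
jq -r 'select(.event == "tool_call_failed") | .tool' "$LOG" | sort | uniq -c | sort -rn | head -5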

41.10.2 Common Issue Quick Reference

| Symptom                              | Most Likely Cause                             | Quick Verification Command              |
|--------------------------------------|-----------------------------------------------|-----------------------------------------|
| Gateway won't start                  | Port conflict or config file error            | openclaw gateway doctor --check config  |
| All Sessions timing out              | LLM API key invalid or network issue          | openclaw models list                    |
| Tool calls getting permission errors | Incomplete permission declaration in SKILL.md | openclaw security audit                 |
| Memory growing continuously          | Large file content injected into Context      | Review tokenUsage trend                 |
| Frequent Compaction                  | maxTokenBudget set too low                    | Adjust maxTokenBudget parameter         |

The next chapter presents 10 real-world deployment scenarios, showing how OpenClaw is applied across different business contexts.

💬 Comments