Chapter 41. Monitoring and Observability: Structured Logging, Health Checks, and Alert Configuration
41.1 The Observability Challenge for AI Agents
Traditional software behaves predictably: given the same input, the output is stable and consistent. AI Agents work fundamentally differently: their decisions are driven by large language models, meaning that the same instruction can take completely different tool-calling paths at different times and in different contexts. This non-determinism renders conventional monitoring approaches severely inadequate for Agent workloads.
41.1.1 Why AI Agents Are Harder to Monitor
Path explosion: A simple instruction like "clean up my inbox" might trigger anywhere from 3 to 20 tool calls, depending on the current state of the mailbox, model version, and context window length. You can't monitor a fixed call chain the way you would with a traditional API.
Semantic ambiguity of errors: When a tool returns HTTP 200 but the content says "no relevant data found," the Agent might handle it gracefully, or it might enter an infinite retry loop. Traditional error-rate metrics can't capture these semantic-level failures.
Temporal scale variation: A single Session might last 30 seconds (a simple query) or 4 hours (a complex refactoring task). Fixed time-window monitoring metrics have limited value in this context.
Unpredictable resource consumption: LLM token usage and Context Compaction frequency directly impact cost and latency, yet both are difficult to estimate before a task begins.
Irreversible tool side effects: Agent actions like deleting files, sending emails, or deploying services are one-directional. Monitoring must alert before problems occur, not just record them afterward.
41.1.2 OpenClaw's Observability Design Philosophy
OpenClaw adopts a structured-log-first design: all runtime events are emitted in JSON format, with each field carrying a fixed semantic meaning that makes machine parsing and aggregation straightforward. Health check endpoints provide real-time status snapshots rather than relying on the latency of log aggregation. Alert rules are built on business semantics (consecutive failures, abnormal Sessions) rather than simple technical thresholds (CPU > 80%).
These three layers together form a complete observability system for AI Agents: logs (what happened), health checks (what the current state is), and alerts (what requires immediate action).
41.2 Logging Configuration in Detail
OpenClaw's logging behavior is controlled via the logging section of openclaw.json.
41.2.1 Complete Configuration Structure
{
"logging": {
"level": "info",
"format": "json",
"output": {
"console": {
"enabled": true,
"colorize": false
},
"file": {
"enabled": true,
"path": "~/.openclaw/logs/openclaw.log",
"rotation": {
"maxSize": "100MB",
"maxFiles": 7,
"compress": true
}
},
"syslog": {
"enabled": false,
"facility": "local0",
"host": "127.0.0.1",
"port": 514
}
},
"fields": {
"service": "openclaw",
"environment": "production",
"version": "${OPENCLAW_VERSION}"
},
"sampling": {
"enabled": false,
"rate": 0.1,
"excludeLevels": ["error", "warn"]
}
}
}
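For local development, a trimmed-down variant of the same structure is usually enough. This is a sketch rather than a canonical default; the field names mirror the full example above:

```json
{
  "logging": {
    "level": "debug",
    "format": "json",
    "output": {
      "console": { "enabled": true, "colorize": true },
      "file": { "enabled": false }
    }
  }
}
```

Colorized console output helps when tailing logs interactively; file output and rotation matter mostly in production.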
41.2.2 Choosing the Right Log Level
| Level | Recommended Use | Approximate Volume |
|---|---|---|
| `debug` | Local development, active troubleshooting | Very high (10–50 entries per tool call) |
| `info` | Production default, records key events | Moderate (3–8 entries per tool call) |
| `warn` | Only surface potential issues | Low (exceptional conditions only) |
| `error` | Only record failures | Very low (alert-only use) |
Production recommendation: Use info by default. When debugging a specific Session, you can switch levels dynamically without restarting the Gateway:
# Temporarily upgrade to debug for full tool-call details
openclaw config set logging.level debug
# Enable debug only for a specific Agent
openclaw config set logging.agentFilter "my-agent-id:debug"
# Restore default after debugging
openclaw config set logging.level info
41.2.3 Log Output Path Planning
Recommended production directory structure:
~/.openclaw/logs/
├── openclaw.log              # Active log file (rolling writes)
├── openclaw.2026-04-25.gz    # Compressed archive (auto-generated)
├── openclaw.2026-04-24.gz
└── gateway-access.log        # Gateway HTTP access log (separate)
Log rotation parameter notes:
- `maxSize: "100MB"`: triggers rotation when a single file exceeds 100 MB; reduce to `20MB` for personal use
- `maxFiles: 7`: retains 7 days of logs, sufficient for most post-incident audits; extend to 30 for compliance scenarios
- `compress: true`: gzip compression on archived files saves approximately 70% of disk space
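These rotation settings also bound worst-case disk usage. A quick back-of-envelope sketch, assuming archives compress to roughly 30% of their original size (the ~70% savings noted above):

```shell
# Worst-case log footprint: one active file plus maxFiles compressed archives.
MAX_SIZE_MB=100
MAX_FILES=7
COMPRESSED_PCT=30   # assumption: archives retain ~30% of original size after gzip

ARCHIVE_MB=$(( MAX_FILES * MAX_SIZE_MB * COMPRESSED_PCT / 100 ))
TOTAL_MB=$(( MAX_SIZE_MB + ARCHIVE_MB ))
echo "archives: ${ARCHIVE_MB} MB, total: ${TOTAL_MB} MB"
```

With the defaults above, that is about 210 MB of archives and 310 MB overall, comfortably small for most hosts.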
41.2.4 Structured Log Field Examples
A typical completed tool-call log entry:
{
"timestamp": "2026-04-26T09:23:41.882Z",
"level": "info",
"event": "tool_call_completed",
"agentId": "agent_7f3k2m",
"sessionId": "sess_9a1b3c",
"tool": "bash",
"duration": 342,
"tokenUsage": {
"input": 1204,
"output": 387
},
"exitCode": 0,
"error": null,
"service": "openclaw",
"environment": "production",
"version": "2026.4.22"
}
A Session start log entry:
{
"timestamp": "2026-04-26T09:23:40.001Z",
"level": "info",
"event": "session_started",
"agentId": "agent_7f3k2m",
"sessionId": "sess_9a1b3c",
"trigger": "cron",
"model": "claude-opus-4-5",
"thinkingMode": "standard",
"maxTokenBudget": 50000,
"parentSessionId": null,
"service": "openclaw"
}
A tool failure log entry:
{
"timestamp": "2026-04-26T09:24:11.543Z",
"level": "error",
"event": "tool_call_failed",
"agentId": "agent_7f3k2m",
"sessionId": "sess_9a1b3c",
"tool": "web_fetch",
"duration": 30001,
"error": {
"code": "TIMEOUT",
"message": "Request exceeded 30s timeout",
"retryCount": 2,
"willRetry": false
},
"service": "openclaw"
}
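What makes these failure entries useful is that `error.code` is a closed vocabulary, so failures aggregate cleanly. A sketch of counting failures by code; the here-doc stands in for piping the real log file (`cat ~/.openclaw/logs/openclaw.log`):

```shell
# Count tool failures by error code; non-failure events are filtered out.
codes=$(jq -r 'select(.event == "tool_call_failed") | .error.code' <<'EOF' | sort | uniq -c | sort -rn
{"event":"tool_call_failed","error":{"code":"TIMEOUT"}}
{"event":"tool_call_failed","error":{"code":"TIMEOUT"}}
{"event":"tool_call_failed","error":{"code":"RATE_LIMIT"}}
{"event":"tool_call_completed","tool":"bash"}
EOF
)
echo "$codes"
```

A sudden shift in this distribution (say, RATE_LIMIT overtaking TIMEOUT) usually points at the upstream dependency rather than the Agent.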
41.3 Health Check Endpoint Analysis
The Gateway exposes a local health check endpoint for fast, point-in-time status assessment.
41.3.1 Endpoint Address
http://127.0.0.1:18789/health
By default this listens only on the loopback interface and is not exposed externally. To enable remote access for CI/CD or monitoring systems, configure explicitly:
{
"gateway": {
"health": {
"host": "0.0.0.0",
"port": 18789,
"auth": {
"enabled": true,
"token": "${HEALTH_CHECK_TOKEN}"
}
}
}
}
41.3.2 Full Response Field Analysis
{
"status": "healthy",
"version": "2026.4.22",
"uptime": 86432,
"gateway": {
"status": "running",
"pid": 12847,
"port": 18789,
"connections": {
"active": 3,
"idle": 12,
"total": 58204
}
},
"sessions": {
"active": 2,
"queued": 0,
"completed24h": 147,
"failed24h": 3,
"successRate24h": 0.9796
},
"models": {
"current": "claude-opus-4-5",
"available": ["claude-opus-4-5", "claude-sonnet-4-5", "claude-haiku-4-5"],
"lastModelCallMs": 1843
},
"resources": {
"memoryUsedMB": 412,
"memoryLimitMB": 2048,
"cpuPercent": 12.4,
"diskAvailableGB": 48.3
},
"tools": {
"registered": 28,
"errorsLastHour": 1,
"errorRate1h": 0.012
},
"compaction": {
"count24h": 7,
"lastCompactionAt": "2026-04-26T08:41:12Z"
},
"timestamp": "2026-04-26T09:30:00Z"
}
Field monitoring value notes:
- `status`: Top-level health state, one of `healthy` / `degraded` / `unhealthy`. `degraded` means some functionality is impaired but the Gateway is still operational
- `sessions.successRate24h`: Rolling 24-hour Session success rate; alert when below 0.95
- `sessions.failed24h`: Absolute failure count; combine with the success rate to account for low-traffic periods where the rate may appear artificially high
- `resources.memoryUsedMB / memoryLimitMB`: Memory usage ratio; above 85% warrants a restart or limit increase
- `tools.errorRate1h`: Tool error rate in the last hour; above 0.05 (5%) warrants investigation
- `compaction.count24h`: Context Compaction event count; frequent triggering (>20/day) indicates a mismatch between task scale and token budget
- `models.lastModelCallMs`: Last LLM call latency; useful for detecting API throttling
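These thresholds can be enforced mechanically. Below is a watchdog sketch: it evaluates a health snapshot with `jq -e` (whose exit status reflects the boolean result) and flags breaches. The here-doc is canned sample data; in practice fetch it with `health=$(curl -s http://127.0.0.1:18789/health)`:

```shell
# Check key health thresholds against a /health snapshot (sample data below).
health=$(cat <<'EOF'
{"sessions":{"successRate24h":0.9796},"resources":{"memoryUsedMB":412,"memoryLimitMB":2048},"tools":{"errorRate1h":0.012}}
EOF
)

ok=1
echo "$health" | jq -e '.sessions.successRate24h >= 0.95' >/dev/null \
  || { echo "ALERT: session success rate below 0.95"; ok=0; }
echo "$health" | jq -e '(.resources.memoryUsedMB / .resources.memoryLimitMB) < 0.85' >/dev/null \
  || { echo "ALERT: memory utilization above 85%"; ok=0; }
echo "$health" | jq -e '.tools.errorRate1h <= 0.05' >/dev/null \
  || { echo "ALERT: tool error rate above 5%"; ok=0; }
[ "$ok" -eq 1 ] && echo "all checks passed"
```

Run it from cron every minute or two and wire the ALERT lines into your paging channel of choice.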
41.3.3 Using curl for Health Checks
# Basic status check
curl -s http://127.0.0.1:18789/health | jq '.status'
# Check Session success rate
curl -s http://127.0.0.1:18789/health | jq '.sessions.successRate24h'
# Check memory utilization percentage
curl -s http://127.0.0.1:18789/health | \
jq '(.resources.memoryUsedMB / .resources.memoryLimitMB * 100 | round | tostring) + "% memory used"'
# Kubernetes readiness probe usage
curl -sf http://127.0.0.1:18789/health && echo "READY" || echo "NOT READY"
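If the Gateway runs under Kubernetes, the same endpoint can back liveness and readiness probes. A sketch; the port must match your `gateway.health` configuration, and the timing values are starting points, not recommendations:

```yaml
livenessProbe:
  httpGet:
    path: /health
    port: 18789
  initialDelaySeconds: 10
  periodSeconds: 30
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /health
    port: 18789
  periodSeconds: 10
```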
41.4 Monitoring Value of Key Log Fields
41.4.1 Field Index and Meaning
| Field | Type | Monitoring Value |
|---|---|---|
| `timestamp` | ISO 8601 string | Millisecond precision; used for latency calculation and time-series correlation |
| `level` | Enum | Filter severe events; `error` level demands immediate attention |
| `agentId` | String | Track the behavioral pattern of a single Agent across Sessions |
| `sessionId` | String | Groups all logs for a single task; the core dimension for root cause analysis |
| `tool` | String | Identifies high-error-rate tools; pinpoints configuration or permission issues |
| `duration` | Integer (ms) | P99 latency analysis; detects tool performance degradation |
| `error.code` | String | Error classification (TIMEOUT / AUTH_FAIL / RATE_LIMIT, etc.) |
| `error.retryCount` | Integer | High retry counts indicate upstream instability |
| `tokenUsage.input` | Integer | Context growth rate; predicts when Compaction will trigger |
| `tokenUsage.output` | Integer | Model output volume; correlates with quality assessment |
41.4.2 Combined Field Query Examples
Using jq to extract analytical value from log files:
# Count errors by tool
cat ~/.openclaw/logs/openclaw.log | \
jq -r 'select(.event == "tool_call_failed") | .tool' | \
sort | uniq -c | sort -rn
# Find the slowest tool calls (over 10 seconds)
cat ~/.openclaw/logs/openclaw.log | \
jq 'select(.duration > 10000) | {tool, duration, sessionId}' | \
jq -s 'sort_by(-.duration) | .[0:10]'
# Calculate total token consumption for a specific Session
SESSION_ID="sess_9a1b3c"
cat ~/.openclaw/logs/openclaw.log | \
jq --arg sid "$SESSION_ID" \
'select(.sessionId == $sid and .tokenUsage != null) | .tokenUsage.input + .tokenUsage.output' | \
  paste -sd+ - | bc
# Find Sessions that failed repeatedly
cat ~/.openclaw/logs/openclaw.log | \
jq -r 'select(.event == "session_failed") | .sessionId' | \
awk 'seen[$0]++ {print $0 " appeared " seen[$0] " times"}' | head -10
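The same log stream also yields percentile latencies. Here is a sketch computing an approximate P99 of tool-call duration (the here-doc is sample data; pipe the real log file instead). With only a handful of entries the 99th percentile degenerates to the maximum, which is expected:

```shell
# Approximate P99 of tool-call duration in milliseconds.
p99=$(jq -s 'map(select(.duration != null) | .duration) | sort | .[((length * 0.99) | floor)]' <<'EOF'
{"event":"tool_call_completed","duration":120}
{"event":"tool_call_completed","duration":342}
{"event":"tool_call_completed","duration":8800}
{"event":"tool_call_failed","duration":30001}
EOF
)
echo "P99 duration: ${p99} ms"
```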
41.5 10 Key Metrics and Reference Thresholds
41.5.1 Metric Definitions
| Metric | Calculation | Warning Threshold | Critical Threshold | Notes |
|---|---|---|---|---|
| Session success rate | successes / total (24h rolling) | < 0.95 | < 0.90 | Core business health |
| Tool error rate | failed tool calls / total (1h) | > 0.03 | > 0.08 | Tool layer stability |
| P99 tool latency | 99th percentile `duration` value | > 5000ms | > 15000ms | Tool call performance |
| Compaction frequency | count/24h per Agent | > 10 | > 25 | Token budget fit |
| LLM response time | last `modelCallMs` | > 3000ms | > 8000ms | API health status |
| Tokens per Session | average `tokenUsage.input + output` | > 40000 | > 80000 | Cost control |
| Memory utilization | usedMB / limitMB | > 0.75 | > 0.90 | Capacity planning |
| Gateway restarts | `event=gateway_restart` count (24h) | > 1 | > 3 | Stability indicator |
| Active connections | `connections.active` | > 50 | > 100 | Concurrency capacity |
| High-retry tool calls | ratio with `retryCount` > 2 | > 0.05 | > 0.15 | Upstream dependency stability |
41.5.2 Business Interpretation of Metrics
Session success rate is the most important business indicator. Dropping below 95% typically means a commonly used tool has a configuration problem, or the LLM is unable to correctly interpret the instructions in SKILL.md. Prioritize investigating what changed recently.
Compaction frequency is a hidden performance signal. Every Compaction truncates the context, which can cause the Agent to forget earlier operation results, leading to duplicated work or logical errors. High frequency indicates that task design didn't adequately account for token budgets.
Average tokens per Session directly correlates with LLM API cost. Sessions exceeding 80,000 tokens may indicate inefficient context inflation: for example, injecting entire large files into context rather than using tools to read them on demand.
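The per-Session token average is easy to derive from the logs. A sketch over inline samples (replace the here-doc with the real log file):

```shell
# Mean total tokens per Session: sum input+output per sessionId, then average.
avg=$(jq -s '
  map(select(.tokenUsage != null))
  | group_by(.sessionId)
  | map(map(.tokenUsage.input + .tokenUsage.output) | add)
  | add / length' <<'EOF'
{"sessionId":"sess_a","tokenUsage":{"input":1200,"output":400}}
{"sessionId":"sess_a","tokenUsage":{"input":2000,"output":600}}
{"sessionId":"sess_b","tokenUsage":{"input":900,"output":300}}
EOF
)
echo "average tokens per session: ${avg}"
```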
41.6 Integration with Prometheus / Grafana
OpenClaw's structured logs provide a natural integration point with existing monitoring stacks. The recommended approach uses prometheus-json-exporter to bridge the health check endpoint, and Loki or Vector to process structured logs.
41.6.1 prometheus-json-exporter Configuration
Installation:
# macOS
brew install prometheus-json-exporter
# Or using Docker
docker run -d -p 7979:7979 \
-v $(pwd)/json_exporter.yml:/etc/json_exporter/config.yml \
prometheuscommunity/json-exporter \
--config.file /etc/json_exporter/config.yml
json_exporter.yml configuration (scraping the OpenClaw health check endpoint):
modules:
openclaw:
metrics:
- name: openclaw_session_success_rate
type: gauge
help: "Session success rate over last 24 hours"
path: "{ .sessions.successRate24h }"
- name: openclaw_session_active
type: gauge
help: "Currently active sessions"
path: "{ .sessions.active }"
- name: openclaw_tool_error_rate_1h
type: gauge
help: "Tool error rate in the last hour"
path: "{ .tools.errorRate1h }"
- name: openclaw_memory_used_mb
type: gauge
help: "Gateway memory used in MB"
path: "{ .resources.memoryUsedMB }"
- name: openclaw_memory_limit_mb
type: gauge
help: "Gateway memory limit in MB"
path: "{ .resources.memoryLimitMB }"
- name: openclaw_compaction_count_24h
type: gauge
help: "Context compaction events in last 24 hours"
path: "{ .compaction.count24h }"
- name: openclaw_last_model_call_ms
type: gauge
help: "Last LLM API call latency in milliseconds"
path: "{ .models.lastModelCallMs }"
- name: openclaw_gateway_uptime_seconds
type: counter
help: "Gateway uptime in seconds"
path: "{ .uptime }"
Add to Prometheus scrape_configs:
scrape_configs:
- job_name: 'openclaw'
metrics_path: /probe
params:
module: [openclaw]
target: ['http://127.0.0.1:18789/health']
static_configs:
- targets: ['localhost:7979']
relabel_configs:
- source_labels: [__address__]
target_label: __param_target
- target_label: __address__
replacement: 'localhost:7979'
41.6.2 Vector Log Collection Configuration
Use Vector to ship OpenClaw's JSON logs to Loki or Elasticsearch:
# vector.toml
[sources.openclaw_logs]
type = "file"
include = ["~/.openclaw/logs/openclaw.log"]
read_from = "beginning"
[transforms.parse_openclaw]
type = "remap"
inputs = ["openclaw_logs"]
source = '''
. = parse_json!(string!(.message))
.source = "openclaw"
'''
[transforms.filter_errors]
type = "filter"
inputs = ["parse_openclaw"]
condition = '.level == "error" || .level == "warn"'
[sinks.loki]
type = "loki"
inputs = ["parse_openclaw"]
endpoint = "http://localhost:3100"
labels.job = "openclaw"
labels.level = "{{ level }}"
labels.agent_id = "{{ agentId }}"
encoding.codec = "json"
[sinks.error_alerts]
type = "http"
inputs = ["filter_errors"]
uri = "http://alertmanager:9093/api/v1/alerts"
encoding.codec = "json"
41.6.3 Grafana Dashboard Core Panels
Recommended 4-row Dashboard layout:
Row 1: Overview (Session success rate / Active Sessions / Tool error rate / Gateway uptime)
Row 2: Performance (LLM response latency trend / P99 tool latency / Token consumption heatmap)
Row 3: Resources (Memory utilization / Compaction frequency trend / Connection count)
Row 4: Error Detail (Error log stream / Error distribution by tool / High-retry calls)
Key PromQL queries:
# Session success rate (highlight when below 0.95)
openclaw_session_success_rate < 0.95
# Memory utilization ratio
openclaw_memory_used_mb / openclaw_memory_limit_mb
# Tool error rate change trend (5-minute average)
avg_over_time(openclaw_tool_error_rate_1h[5m])
# Is Gateway online (1 = online, 0 = offline)
up{job="openclaw"}
41.7 Alert Rule Design
41.7.1 Prometheus AlertManager Rules
# openclaw_alerts.yml
groups:
- name: openclaw_critical
rules:
- alert: OpenClawGatewayDown
expr: up{job="openclaw"} == 0
for: 1m
labels:
severity: critical
annotations:
summary: "OpenClaw Gateway is unreachable"
description: "The health check endpoint has not responded for 1 minute. Check the Gateway process immediately."
- alert: OpenClawSessionSuccessRateLow
expr: openclaw_session_success_rate < 0.90
for: 5m
labels:
severity: critical
annotations:
summary: "Session success rate critically low (current: {{ $value | humanize }})"
description: "24-hour Session success rate has fallen below 90%, indicating possible systemic failure."
- alert: OpenClawMemoryOverLimit
expr: (openclaw_memory_used_mb / openclaw_memory_limit_mb) > 0.90
for: 3m
labels:
severity: critical
annotations:
summary: "Gateway memory usage exceeds 90%"
description: "Current memory usage is {{ $value | humanizePercentage }}. OOM risk is high. Restart or scale up immediately."
- name: openclaw_warning
rules:
- alert: OpenClawToolErrorRateHigh
expr: openclaw_tool_error_rate_1h > 0.05
for: 10m
labels:
severity: warning
annotations:
summary: "Tool error rate too high ({{ $value | humanizePercentage }})"
description: "Tool call error rate exceeded 5% in the last hour. Check tool configuration and permissions."
- alert: OpenClawCompactionFrequencyHigh
expr: openclaw_compaction_count_24h > 20
for: 0m
labels:
severity: warning
annotations:
summary: "Context Compaction frequency too high ({{ $value }} times/day)"
description: "More than 20 Compaction events per day may cause context loss and logical errors. Optimize token budget configuration."
- alert: OpenClawLLMLatencyHigh
expr: openclaw_last_model_call_ms > 5000
for: 5m
labels:
severity: warning
annotations:
summary: "LLM API response latency too high ({{ $value }}ms)"
description: "Last LLM call took over 5 seconds. Possible API throttling or network issue."
- alert: OpenClawSessionSuccessRateDegraded
expr: openclaw_session_success_rate < 0.95
for: 10m
labels:
severity: warning
annotations:
summary: "Session success rate declining (current: {{ $value | humanize }})"
description: "Success rate below 95%. Review recent failed Session logs."
41.7.2 Alert Notification Channel Configuration
# alertmanager.yml
route:
group_by: ['alertname', 'severity']
group_wait: 30s
group_interval: 5m
repeat_interval: 4h
receiver: 'default'
routes:
- match:
severity: critical
receiver: 'pagerduty-critical'
repeat_interval: 30m
- match:
severity: warning
receiver: 'slack-warnings'
receivers:
- name: 'pagerduty-critical'
pagerduty_configs:
- routing_key: '${PAGERDUTY_KEY}'
description: '{{ .CommonAnnotations.summary }}'
- name: 'slack-warnings'
slack_configs:
- api_url: '${SLACK_WEBHOOK_URL}'
channel: '#openclaw-alerts'
title: '[{{ .Status | toUpper }}] {{ .CommonAnnotations.summary }}'
text: '{{ .CommonAnnotations.description }}'
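One refinement worth adding: when a critical alert is firing, its warning-level counterpart is pure noise. AlertManager inhibition rules can suppress it. A sketch; note that with no `equal` labels listed, any firing critical alert mutes all warnings, which may be broader than you want:

```yaml
inhibit_rules:
  - source_matchers:
      - severity = "critical"
    target_matchers:
      - severity = "warning"
```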
41.8 The openclaw doctor Command
openclaw doctor is the built-in diagnostic tool and should be the first command you run when an alert fires or when you suspect a configuration issue.
41.8.1 Doctor Check Items
openclaw gateway doctor
Typical output:
OpenClaw Diagnostic Report - 2026-04-26 09:30:00
[✓] Gateway process running (PID: 12847)
[✓] Gateway port 18789 accessible
[✓] Health check responding (status: healthy)
[✓] Configuration file valid (openclaw.json)
[✓] Model API reachable (claude-opus-4-5, latency: 1843ms)
[✓] Log directory writable (~/.openclaw/logs)
[✗] Disk space low: 2.1 GB available (threshold: 5 GB)
[✓] All registered tools (28) responding to ping
[!] Tool 'web_fetch' error rate 8.2% in last hour (threshold: 5%)
[✓] No active sessions in error state
[✓] Memory usage 20.1% (412/2048 MB)
[✓] Context compaction count: 7/day (threshold: 20)
Summary: 1 error, 1 warning
Action required: Free disk space; investigate web_fetch errors
41.8.2 Common doctor Subcommands
# Check only Gateway connectivity
openclaw gateway doctor --check connectivity
# Check configuration completeness of all tools
openclaw gateway doctor --check tools
# Output machine-readable JSON format
openclaw gateway doctor --format json
# Verbose mode (prints raw data for each check)
openclaw gateway doctor --verbose
# Check configuration for a specific Agent
openclaw gateway doctor --agent my-agent-id
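The `--format json` output lends itself to CI gating: fail a deploy pipeline when doctor reports errors. The schema below (`summary.errors`) is hypothetical, not a documented contract; inspect the actual JSON your version emits and adapt the filter:

```shell
# Gate a pipeline on doctor results. In practice:
#   report=$(openclaw gateway doctor --format json)
report='{"summary":{"errors":1,"warnings":1}}'   # canned sample (hypothetical schema)
errors=$(echo "$report" | jq '.summary.errors')
if [ "$errors" -gt 0 ]; then
  echo "doctor reports ${errors} error(s); failing the pipeline"
  # exit 1   # uncomment in a real pipeline
fi
```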
41.9 Debugging Techniques
41.9.1 Dynamic Log Level Adjustment
Change log levels dynamically without restarting the Gateway:
# Switch to debug to capture full tool-call detail
openclaw config set logging.level debug
# Stream and pretty-print logs in real time
tail -f ~/.openclaw/logs/openclaw.log | jq '.'
# Filter to a specific Session
tail -f ~/.openclaw/logs/openclaw.log | \
jq 'select(.sessionId == "sess_9a1b3c")'
# Show only errors and warnings
tail -f ~/.openclaw/logs/openclaw.log | \
jq 'select(.level == "error" or .level == "warn")'
# Restore info level after debugging
openclaw config set logging.level info
41.9.2 Manual RPC Testing
The Gateway exposes RPC endpoints for manual tool testing:
# List registered tools
curl -X POST http://127.0.0.1:18789/rpc \
-H "Content-Type: application/json" \
-d '{"method": "tools/list", "id": 1}'
# Manually invoke a tool (testing the bash tool)
curl -X POST http://127.0.0.1:18789/rpc \
-H "Content-Type: application/json" \
-d '{
"method": "tools/call",
"params": {
"name": "bash",
"arguments": {"command": "echo hello"}
},
"id": 2
}'
# Check current Session states
curl -s http://127.0.0.1:18789/sessions | jq '.'
41.9.3 Sandbox Mode Debugging
Use openclaw sandbox explain to analyze tool-call paths in a sandboxed environment without triggering real side effects:
# Run an Agent in sandbox mode and explain its decision-making
openclaw sandbox explain \
--agent my-agent-id \
--message "sort today's emails" \
--show-tool-calls \
--show-reasoning
41.10 Production Problem Diagnosis Workflow
41.10.1 Standard Diagnostic SOP
When an alert fires, follow this sequence:
Step 1: Confirm Gateway is online
curl -sf http://127.0.0.1:18789/health | jq '.status'
# If no response: openclaw gateway status
# If Gateway is stopped: openclaw gateway dashboard for history
Step 2: Run doctor
openclaw gateway doctor --format json > /tmp/doctor_$(date +%s).json
cat /tmp/doctor_*.json | jq '.issues'
Step 3: Review failed Session logs
# Find the most recently failed Session IDs
cat ~/.openclaw/logs/openclaw.log | \
jq 'select(.event == "session_failed") | {sessionId, timestamp, error}' | \
tail -5
# Extract the complete log for a specific Session
SESSION_ID="sess_xxxx"
cat ~/.openclaw/logs/openclaw.log | \
jq --arg sid "$SESSION_ID" 'select(.sessionId == $sid)'
Step 4: Identify the root-cause tool
# Find which tool has the most errors
cat ~/.openclaw/logs/openclaw.log | \
jq 'select(.event == "tool_call_failed") | .tool' | \
sort | uniq -c | sort -rn | head -5
Step 5: Validate the fix
# Manually test the problematic tool
openclaw sandbox explain --message "test <tool-name> operation" --show-tool-calls
# Re-run doctor to confirm resolution
openclaw gateway doctor
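Steps 3 and 4 above can be collapsed into a single triage pass. A sketch over inline sample log lines; point `LOG` at `~/.openclaw/logs/openclaw.log` in practice:

```shell
# One-pass triage: list failed Sessions, then find the most error-prone tool.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
{"event":"session_failed","sessionId":"sess_9a1b3c","error":"budget exhausted"}
{"event":"tool_call_failed","sessionId":"sess_9a1b3c","tool":"web_fetch"}
{"event":"tool_call_failed","sessionId":"sess_9a1b3c","tool":"web_fetch"}
{"event":"tool_call_failed","sessionId":"sess_9a1b3c","tool":"bash"}
EOF

echo "-- failed sessions --"
jq -r 'select(.event == "session_failed") | .sessionId' "$LOG" | sort -u

top_tool=$(jq -r 'select(.event == "tool_call_failed") | .tool' "$LOG" \
  | sort | uniq -c | sort -rn | head -1 | awk '{print $2}')
echo "-- top failing tool: ${top_tool} --"
rm -f "$LOG"
```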
41.10.2 Common Issue Quick Reference
| Symptom | Most Likely Cause | Quick Verification Command |
|---|---|---|
| Gateway won't start | Port conflict or config file error | openclaw gateway doctor --check config |
| All Sessions timing out | LLM API key invalid or network issue | openclaw models list |
| Tool calls getting permission errors | Incomplete permission declaration in SKILL.md | openclaw security audit |
| Memory growing continuously | Large file content being injected into Context | Review tokenUsage trend |
| Frequent Compaction | `maxTokenBudget` set too low | Adjust the `maxTokenBudget` parameter |
The next chapter presents 10 real-world deployment scenarios, showing how OpenClaw is applied across different business contexts.