Chapter 41

Monitoring and Observability: Structured Logging, Health Checks, and Alert Configuration

41.1 The Observability Challenge for AI Agents

Traditional software behaves predictably: given the same input, the output is stable and consistent. AI Agents work fundamentally differently — their decisions are driven by large language models, meaning that the same instruction can take completely different tool-calling paths at different times and in different contexts. This non-determinism renders conventional monitoring approaches severely inadequate for Agent workloads.

41.1.1 Why AI Agents Are Harder to Monitor

Path explosion: A simple instruction like "clean up my inbox" might trigger anywhere from 3 to 20 tool calls, depending on the current state of the mailbox, model version, and context window length. You can't monitor a fixed call chain the way you would with a traditional API.

Semantic ambiguity of errors: When a tool returns HTTP 200 but the content says "no relevant data found," the Agent might handle it gracefully — or it might enter an infinite retry loop. Traditional error-rate metrics can't capture these semantic-level failures.

Temporal scale variation: A single Session might last 30 seconds (a simple query) or 4 hours (a complex refactoring task). Fixed time-window monitoring metrics have limited value in this context.

Unpredictable resource consumption: LLM token usage and Context Compaction frequency directly impact cost and latency, yet both are difficult to estimate before a task begins.

Irreversible tool side effects: Agent actions like deleting files, sending emails, or deploying services cannot be undone once executed. Monitoring must alert before problems occur, not merely record them afterward.

41.1.2 OpenClaw's Observability Design Philosophy

OpenClaw adopts a structured-log-first design: all runtime events are emitted in JSON format, with each field carrying a fixed semantic meaning that makes machine parsing and aggregation straightforward. Health check endpoints provide real-time status snapshots rather than relying on the latency of log aggregation. Alert rules are built on business semantics (consecutive failures, abnormal Sessions) rather than simple technical thresholds (CPU > 80%).

These three layers together form a complete observability system for AI Agents: logs (what happened), health checks (what the current state is), and alerts (what requires immediate action).


41.2 Logging Configuration in Detail

OpenClaw's logging behavior is controlled via the logging section of openclaw.json.

41.2.1 Complete Configuration Structure

{
  "logging": {
    "level": "info",
    "format": "json",
    "output": {
      "console": {
        "enabled": true,
        "colorize": false
      },
      "file": {
        "enabled": true,
        "path": "~/.openclaw/logs/openclaw.log",
        "rotation": {
          "maxSize": "100MB",
          "maxFiles": 7,
          "compress": true
        }
      },
      "syslog": {
        "enabled": false,
        "facility": "local0",
        "host": "127.0.0.1",
        "port": 514
      }
    },
    "fields": {
      "service": "openclaw",
      "environment": "production",
      "version": "${OPENCLAW_VERSION}"
    },
    "sampling": {
      "enabled": false,
      "rate": 0.1,
      "excludeLevels": ["error", "warn"]
    }
  }
}
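
A quick way to confirm the section still parses after editing (the config path is an assumption; adjust to wherever your openclaw.json lives):

# Pretty-print just the logging subtree; jq fails loudly on invalid JSON
jq '.logging' ~/.openclaw/openclaw.json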

41.2.2 Choosing the Right Log Level

| Level | Recommended Use                            | Approximate Volume                      |
|-------|--------------------------------------------|-----------------------------------------|
| debug | Local development, active troubleshooting  | Very high (10–50 entries per tool call) |
| info  | Production default, records key events     | Moderate (3–8 entries per tool call)    |
| warn  | Only surface potential issues              | Low (exceptional conditions only)       |
| error | Only record failures                       | Very low (alert-only use)               |

Production recommendation: Use info by default. When debugging a specific Session, you can switch levels dynamically without restarting the Gateway:

# Temporarily upgrade to debug for full tool-call details
openclaw config set logging.level debug

# Enable debug only for a specific Agent
openclaw config set logging.agentFilter "my-agent-id:debug"

# Restore default after debugging
openclaw config set logging.level info

41.2.3 Log Output Path Planning

Recommended production directory structure:

~/.openclaw/logs/
├── openclaw.log           # Active log file (rolling writes)
├── openclaw.2026-04-25.gz # Compressed archive (auto-generated)
├── openclaw.2026-04-24.gz
└── gateway-access.log     # Gateway HTTP access log (separate)

Log rotation parameter notes:

- maxSize (100MB): the active file rotates once it reaches this size; combined with maxFiles, this caps pre-compression disk usage at roughly 700 MB.
- maxFiles (7): number of rotated archives kept; older ones are deleted, giving about a week of history at one rotation per day.
- compress (true): rotated archives are gzipped (the .gz files above); JSON logs typically compress by roughly an order of magnitude.
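
To sanity-check rotation in practice, look at the compressed archives; a rough sketch:

# Average compressed archive size per day (rough daily-volume estimate)
ls -l ~/.openclaw/logs/openclaw.*.gz | \
  awk '{sum += $5; n++} END {if (n) printf "%.1f MB avg/day across %d archives\n", sum/n/1048576, n}'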

41.2.4 Structured Log Field Examples

A typical completed tool-call log entry:

{
  "timestamp": "2026-04-26T09:23:41.882Z",
  "level": "info",
  "event": "tool_call_completed",
  "agentId": "agent_7f3k2m",
  "sessionId": "sess_9a1b3c",
  "tool": "bash",
  "duration": 342,
  "tokenUsage": {
    "input": 1204,
    "output": 387
  },
  "exitCode": 0,
  "error": null,
  "service": "openclaw",
  "environment": "production",
  "version": "2026.4.22"
}

A Session start log entry:

{
  "timestamp": "2026-04-26T09:23:40.001Z",
  "level": "info",
  "event": "session_started",
  "agentId": "agent_7f3k2m",
  "sessionId": "sess_9a1b3c",
  "trigger": "cron",
  "model": "claude-opus-4-5",
  "thinkingMode": "standard",
  "maxTokenBudget": 50000,
  "parentSessionId": null,
  "service": "openclaw"
}

A tool failure log entry:

{
  "timestamp": "2026-04-26T09:24:11.543Z",
  "level": "error",
  "event": "tool_call_failed",
  "agentId": "agent_7f3k2m",
  "sessionId": "sess_9a1b3c",
  "tool": "web_fetch",
  "duration": 30001,
  "error": {
    "code": "TIMEOUT",
    "message": "Request exceeded 30s timeout",
    "retryCount": 2,
    "willRetry": false
  },
  "service": "openclaw"
}

41.3 Health Check Endpoint Analysis

The Gateway exposes a local health check endpoint for fast, point-in-time status assessment.

41.3.1 Endpoint Address

http://127.0.0.1:18789/health

By default this listens only on the loopback interface and is not exposed externally. To enable remote access for CI/CD or monitoring systems, configure explicitly:

{
  "gateway": {
    "health": {
      "host": "0.0.0.0",
      "port": 18789,
      "auth": {
        "enabled": true,
        "token": "${HEALTH_CHECK_TOKEN}"
      }
    }
  }
}
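
With auth enabled, remote probes must present the token. A minimal sketch, assuming the Gateway accepts a standard Bearer header (the header name is an assumption; verify against your deployment):

# Remote health probe with token auth (Bearer header assumed)
curl -sf -H "Authorization: Bearer ${HEALTH_CHECK_TOKEN}" \
  http://gateway-host:18789/health | jq '.status'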

41.3.2 Full Response Field Analysis

{
  "status": "healthy",
  "version": "2026.4.22",
  "uptime": 86432,
  "gateway": {
    "status": "running",
    "pid": 12847,
    "port": 18789,
    "connections": {
      "active": 3,
      "idle": 12,
      "total": 58204
    }
  },
  "sessions": {
    "active": 2,
    "queued": 0,
    "completed24h": 147,
    "failed24h": 3,
    "successRate24h": 0.9796
  },
  "models": {
    "current": "claude-opus-4-5",
    "available": ["claude-opus-4-5", "claude-sonnet-4-5", "claude-haiku-4-5"],
    "lastModelCallMs": 1843
  },
  "resources": {
    "memoryUsedMB": 412,
    "memoryLimitMB": 2048,
    "cpuPercent": 12.4,
    "diskAvailableGB": 48.3
  },
  "tools": {
    "registered": 28,
    "errorsLastHour": 1,
    "errorRate1h": 0.012
  },
  "compaction": {
    "count24h": 7,
    "lastCompactionAt": "2026-04-26T08:41:12Z"
  },
  "timestamp": "2026-04-26T09:30:00Z"
}

Field monitoring value notes:

- sessions.successRate24h: the core business health signal; Section 41.5 sets warning below 0.95 and critical below 0.90.
- tools.errorRate1h: tool-layer stability; pair with the log queries in 41.4.2 to find the offending tool.
- resources.memoryUsedMB vs. memoryLimitMB: capacity planning; alert above 90% utilization.
- compaction.count24h: token budget fit; high values risk context loss (see 41.5.2).
- models.lastModelCallMs: LLM API health; one slow call is noise, a sustained rise is not.

41.3.3 Using curl for Health Checks

# Basic status check
curl -s http://127.0.0.1:18789/health | jq '.status'

# Check Session success rate
curl -s http://127.0.0.1:18789/health | jq '.sessions.successRate24h'

# Check memory utilization percentage
curl -s http://127.0.0.1:18789/health | \
  jq '(.resources.memoryUsedMB / .resources.memoryLimitMB * 100 | round | tostring) + "% memory used"'

# Kubernetes readiness probe usage
curl -sf http://127.0.0.1:18789/health && echo "READY" || echo "NOT READY"
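
On hosts without a full monitoring stack, a plain shell watchdog can record status over time; a minimal sketch:

# Poll the endpoint every 30s and append a timestamped status line
while sleep 30; do
  status=$(curl -sf --max-time 5 http://127.0.0.1:18789/health | jq -r '.status' 2>/dev/null)
  echo "$(date -u +%FT%TZ) ${status:-unreachable}" >> ~/.openclaw/logs/health-watch.log
done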

41.4 Monitoring Value of Key Log Fields

41.4.1 Field Index and Meaning

| Field             | Type            | Monitoring Value                                                                 |
|-------------------|-----------------|----------------------------------------------------------------------------------|
| timestamp         | ISO 8601 string | Millisecond precision; used for latency calculation and time-series correlation  |
| level             | Enum            | Filter severe events; error level demands immediate attention                     |
| agentId           | String          | Track the behavioral pattern of a single Agent across Sessions                    |
| sessionId         | String          | Groups all logs for a single task; the core dimension for root-cause analysis     |
| tool              | String          | Identifies high-error-rate tools; pinpoints configuration or permission issues    |
| duration          | Integer (ms)    | P99 latency analysis; detects tool performance degradation                        |
| error.code        | String          | Error classification (TIMEOUT / AUTH_FAIL / RATE_LIMIT, etc.)                     |
| error.retryCount  | Integer         | High retry counts indicate upstream instability                                   |
| tokenUsage.input  | Integer         | Context growth rate; predicts when Compaction will trigger                        |
| tokenUsage.output | Integer         | Model output volume; correlates with quality assessment                           |
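
The duration field supports the P99 analysis mentioned above. A jq sketch over the NDJSON log (nearest-rank approximation):

# Approximate P99 duration (ms) per tool from completed calls
jq -s 'map(select(.event == "tool_call_completed"))
  | group_by(.tool)
  | map({tool: .[0].tool,
         p99: (map(.duration) | sort | .[((length - 1) * 0.99 | floor)])})' \
  ~/.openclaw/logs/openclaw.log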

41.4.2 Combined Field Query Examples

Using jq to extract analytical value from log files:

# Count errors by tool
cat ~/.openclaw/logs/openclaw.log | \
  jq -r 'select(.event == "tool_call_failed") | .tool' | \
  sort | uniq -c | sort -rn

# Find the slowest tool calls (over 10 seconds)
cat ~/.openclaw/logs/openclaw.log | \
  jq 'select(.duration > 10000) | {tool, duration, sessionId}' | \
  jq -s 'sort_by(-.duration) | .[0:10]'

# Calculate total token consumption for a specific Session
SESSION_ID="sess_9a1b3c"
cat ~/.openclaw/logs/openclaw.log | \
  jq --arg sid "$SESSION_ID" \
  'select(.sessionId == $sid and .tokenUsage != null) | .tokenUsage.input + .tokenUsage.output' | \
  paste -sd+ - | bc   # the trailing "-" makes BSD/macOS paste read stdin

# Find Sessions that failed repeatedly
cat ~/.openclaw/logs/openclaw.log | \
  jq -r 'select(.event == "session_failed") | .sessionId' | \
  awk 'seen[$0]++ {print $0 " appeared " seen[$0] " times"}' | head -10

41.5 10 Key Metrics and Reference Thresholds

41.5.1 Metric Definitions

| Metric                | Calculation                      | Warning Threshold | Critical Threshold | Notes                         |
|-----------------------|----------------------------------|-------------------|--------------------|-------------------------------|
| Session success rate  | successes / total (24h rolling)  | < 0.95            | < 0.90             | Core business health          |
| Tool error rate       | failed tool calls / total (1h)   | > 0.03            | > 0.08             | Tool-layer stability          |
| P99 tool latency      | 99th percentile duration value   | > 5000ms          | > 15000ms          | Tool call performance         |
| Compaction frequency  | count/24h per Agent              | > 10              | > 25               | Token budget fit              |
| LLM response time     | last modelCallMs                 | > 3000ms          | > 8000ms           | API health status             |
| Tokens per Session    | average tokenUsage.input + output| > 40000           | > 80000            | Cost control                  |
| Memory utilization    | usedMB / limitMB                 | > 0.75            | > 0.90             | Capacity planning             |
| Gateway restarts      | event=gateway_restart (24h)      | > 1               | > 3                | Stability indicator           |
| Active connections    | connections.active               | > 50              | > 100              | Concurrency capacity          |
| High-retry tool calls | ratio with retryCount > 2        | > 0.05            | > 0.15             | Upstream dependency stability |
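
The critical thresholds above can also be enforced without Prometheus; a shell sketch that exits nonzero when any is crossed (handy in cron or CI):

# Exit 2 if the Gateway is unreachable, 1 if a critical threshold is breached
h=$(curl -sf http://127.0.0.1:18789/health) || { echo "gateway unreachable"; exit 2; }
echo "$h" | jq -e '
  (.sessions.successRate24h >= 0.90)
  and (.tools.errorRate1h <= 0.08)
  and ((.resources.memoryUsedMB / .resources.memoryLimitMB) <= 0.90)
' > /dev/null || { echo "critical threshold breached"; exit 1; }
echo "all critical thresholds OK"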

41.5.2 Business Interpretation of Metrics

Session success rate is the most important business indicator. Dropping below 95% typically means a commonly used tool has a configuration problem, or the LLM is unable to correctly interpret the instructions in SKILL.md. Prioritize investigating what changed recently.

Compaction frequency is a hidden performance signal. Every Compaction truncates the context, which can cause the Agent to forget earlier operation results, leading to duplicated work or logical errors. High frequency indicates that task design didn't adequately account for token budgets.

Average tokens per Session directly correlates with LLM API cost. Sessions exceeding 80,000 tokens may indicate inefficient context inflation — for example, injecting entire large files into context rather than using tools to read them on demand.


41.6 Integration with Prometheus / Grafana

OpenClaw's structured logs provide a natural integration point with existing monitoring stacks. The recommended approach uses prometheus-json-exporter to bridge the health check endpoint, and Loki or Vector to process structured logs.

41.6.1 prometheus-json-exporter Configuration

Installation:

# macOS
brew install prometheus-json-exporter

# Or using Docker
docker run -d -p 7979:7979 \
  -v $(pwd)/json_exporter.yml:/etc/json_exporter/config.yml \
  prometheuscommunity/json-exporter \
  --config.file /etc/json_exporter/config.yml

json_exporter.yml configuration (scraping the OpenClaw health check endpoint):

modules:
  openclaw:
    metrics:
      - name: openclaw_session_success_rate
        type: gauge
        help: "Session success rate over last 24 hours"
        path: "{ .sessions.successRate24h }"

      - name: openclaw_session_active
        type: gauge
        help: "Currently active sessions"
        path: "{ .sessions.active }"

      - name: openclaw_tool_error_rate_1h
        type: gauge
        help: "Tool error rate in the last hour"
        path: "{ .tools.errorRate1h }"

      - name: openclaw_memory_used_mb
        type: gauge
        help: "Gateway memory used in MB"
        path: "{ .resources.memoryUsedMB }"

      - name: openclaw_memory_limit_mb
        type: gauge
        help: "Gateway memory limit in MB"
        path: "{ .resources.memoryLimitMB }"

      - name: openclaw_compaction_count_24h
        type: gauge
        help: "Context compaction events in last 24 hours"
        path: "{ .compaction.count24h }"

      - name: openclaw_last_model_call_ms
        type: gauge
        help: "Last LLM API call latency in milliseconds"
        path: "{ .models.lastModelCallMs }"

      - name: openclaw_gateway_uptime_seconds
        type: counter
        help: "Gateway uptime in seconds"
        path: "{ .uptime }"
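
Before pointing Prometheus at the exporter, verify it can probe the endpoint directly:

# Manual probe: should return openclaw_* metrics in Prometheus text format
curl -s "http://localhost:7979/probe?module=openclaw&target=http://127.0.0.1:18789/health"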

Add to Prometheus scrape_configs:

scrape_configs:
  - job_name: 'openclaw'
    metrics_path: /probe
    params:
      module: [openclaw]
    static_configs:
      # The target is the JSON endpoint to probe; the relabel rules below move
      # it into the ?target= query parameter and point the scrape at the exporter.
      - targets: ['http://127.0.0.1:18789/health']
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - target_label: __address__
        replacement: 'localhost:7979'

41.6.2 Vector Log Collection Configuration

Use Vector to ship OpenClaw's JSON logs to Loki or Elasticsearch:

# vector.toml

[sources.openclaw_logs]
type = "file"
# The file source does not expand "~"; use an absolute path
# (Vector interpolates ${HOME} from the environment)
include = ["${HOME}/.openclaw/logs/openclaw.log"]
read_from = "beginning"

[transforms.parse_openclaw]
type = "remap"
inputs = ["openclaw_logs"]
source = '''
  . = parse_json!(string!(.message))
  .source = "openclaw"
'''

[transforms.filter_errors]
type = "filter"
inputs = ["parse_openclaw"]
condition = '.level == "error" || .level == "warn"'

[sinks.loki]
type = "loki"
inputs = ["parse_openclaw"]
endpoint = "http://localhost:3100"
labels.job = "openclaw"
labels.level = "{{ level }}"
labels.agent_id = "{{ agentId }}"
encoding.codec = "json"

[sinks.error_alerts]
type = "http"
inputs = ["filter_errors"]
# Alertmanager's v1 API has been removed; v2 expects a JSON array of alert
# objects, so in practice add a remap step that shapes the payload first.
uri = "http://alertmanager:9093/api/v2/alerts"
encoding.codec = "json"
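
Vector can lint this file before deployment:

# Validate config syntax and component wiring
vector validate vector.toml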

41.6.3 Grafana Dashboard Core Panels

Recommended 4-row Dashboard layout:

Row 1: Overview    (Session success rate / Active Sessions / Tool error rate / Gateway uptime)
Row 2: Performance (LLM response latency trend / P99 tool latency / Token consumption heatmap)
Row 3: Resources   (Memory utilization / Compaction frequency trend / Connection count)
Row 4: Error Detail (Error log stream / Error distribution by tool / High-retry calls)

Key PromQL queries:

# Session success rate (highlight when below 0.95)
openclaw_session_success_rate < 0.95

# Memory utilization ratio
openclaw_memory_used_mb / openclaw_memory_limit_mb

# Tool error rate change trend (5-minute average)
avg_over_time(openclaw_tool_error_rate_1h[5m])

# Is Gateway online (1 = online, 0 = offline)
up{job="openclaw"}

41.7 Alert Rule Design

41.7.1 Prometheus AlertManager Rules

# openclaw_alerts.yml
groups:
  - name: openclaw_critical
    rules:
      - alert: OpenClawGatewayDown
        # up only tracks the exporter; absent() catches the case where the
        # exporter is fine but its probe of the Gateway returns nothing.
        expr: up{job="openclaw"} == 0 or absent(openclaw_gateway_uptime_seconds)
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "OpenClaw Gateway is unreachable"
          description: "No health metrics for 1 minute. Check the Gateway process immediately."

      - alert: OpenClawSessionSuccessRateLow
        expr: openclaw_session_success_rate < 0.90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Session success rate critically low (current: {{ $value | humanize }})"
          description: "24-hour Session success rate has fallen below 90%, indicating possible systemic failure."

      - alert: OpenClawMemoryOverLimit
        expr: (openclaw_memory_used_mb / openclaw_memory_limit_mb) > 0.90
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "Gateway memory usage exceeds 90%"
          description: "Current memory usage is {{ $value | humanizePercentage }}. OOM risk is high. Restart or scale up immediately."

  - name: openclaw_warning
    rules:
      - alert: OpenClawToolErrorRateHigh
        expr: openclaw_tool_error_rate_1h > 0.05
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Tool error rate too high ({{ $value | humanizePercentage }})"
          description: "Tool call error rate exceeded 5% in the last hour. Check tool configuration and permissions."

      - alert: OpenClawCompactionFrequencyHigh
        expr: openclaw_compaction_count_24h > 20
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "Context Compaction frequency too high ({{ $value }} times/day)"
          description: "More than 20 Compaction events per day may cause context loss and logical errors. Optimize token budget configuration."

      - alert: OpenClawLLMLatencyHigh
        expr: openclaw_last_model_call_ms > 5000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "LLM API response latency too high ({{ $value }}ms)"
          description: "Last LLM call took over 5 seconds. Possible API throttling or network issue."

      - alert: OpenClawSessionSuccessRateDegraded
        expr: openclaw_session_success_rate < 0.95
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Session success rate declining (current: {{ $value | humanize }})"
          description: "Success rate below 95%. Review recent failed Session logs."
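
Lint the rule file before loading it into Prometheus:

# promtool ships with Prometheus and checks rule syntax and PromQL expressions
promtool check rules openclaw_alerts.yml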

41.7.2 Alert Notification Channel Configuration

# alertmanager.yml
route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      repeat_interval: 30m
    - match:
        severity: warning
      receiver: 'slack-warnings'

receivers:
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - routing_key: '${PAGERDUTY_KEY}'
        description: '{{ .CommonAnnotations.summary }}'

  - name: 'slack-warnings'
    slack_configs:
      - api_url: '${SLACK_WEBHOOK_URL}'
        channel: '#openclaw-alerts'
        title: '[{{ .Status | toUpper }}] {{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'
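
amtool performs the equivalent check for the routing configuration:

# Validate routing tree and receiver configuration
amtool check-config alertmanager.yml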

41.8 The openclaw doctor Command

openclaw doctor is the built-in diagnostic tool and should be the first command you run when an alert fires or when you suspect a configuration issue.

41.8.1 Doctor Check Items

openclaw gateway doctor

Typical output:

OpenClaw Diagnostic Report — 2026-04-26 09:30:00

[✓] Gateway process running (PID: 12847)
[✓] Gateway port 18789 accessible
[✓] Health check responding (status: healthy)
[✓] Configuration file valid (openclaw.json)
[✓] Model API reachable (claude-opus-4-5, latency: 1843ms)
[✓] Log directory writable (~/.openclaw/logs)
[✗] Disk space low: 2.1 GB available (threshold: 5 GB)
[✓] All registered tools (28) responding to ping
[!] Tool 'web_fetch' error rate 8.2% in last hour (threshold: 5%)
[✓] No active sessions in error state
[✓] Memory usage 20.1% (412/2048 MB)
[✓] Context compaction count: 7/day (threshold: 20)

Summary: 1 error, 1 warning
Action required: Free disk space; investigate web_fetch errors

41.8.2 Common doctor Subcommands

# Check only Gateway connectivity
openclaw gateway doctor --check connectivity

# Check configuration completeness of all tools
openclaw gateway doctor --check tools

# Output machine-readable JSON format
openclaw gateway doctor --format json

# Verbose mode (prints raw data for each check)
openclaw gateway doctor --verbose

# Check configuration for a specific Agent
openclaw gateway doctor --agent my-agent-id
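
The JSON output lends itself to scheduled runs. A cron sketch, assuming the report exposes its issue list under .issues as used in Section 41.10 (the exact schema is an assumption):

# Nightly doctor run; keep the report and surface any issues found
OUT=/tmp/doctor-$(date +%F).json
openclaw gateway doctor --format json > "$OUT"
jq -e '.issues | length == 0' "$OUT" > /dev/null || \
  echo "doctor found issues; see $OUT" >&2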

41.9 Debugging Techniques

41.9.1 Dynamic Log Level Adjustment

Change log levels dynamically without restarting the Gateway:

# Switch to debug to capture full tool-call detail
openclaw config set logging.level debug

# Stream and pretty-print logs in real time
tail -f ~/.openclaw/logs/openclaw.log | jq '.'

# Filter to a specific Session
tail -f ~/.openclaw/logs/openclaw.log | \
  jq 'select(.sessionId == "sess_9a1b3c")'

# Show only errors and warnings
tail -f ~/.openclaw/logs/openclaw.log | \
  jq 'select(.level == "error" or .level == "warn")'

# Restore info level after debugging
openclaw config set logging.level info

41.9.2 Manual RPC Testing

The Gateway exposes RPC endpoints for manual tool testing:

# List registered tools
curl -X POST http://127.0.0.1:18789/rpc \
  -H "Content-Type: application/json" \
  -d '{"method": "tools/list", "id": 1}'

# Manually invoke a tool (testing the bash tool)
curl -X POST http://127.0.0.1:18789/rpc \
  -H "Content-Type: application/json" \
  -d '{
    "method": "tools/call",
    "params": {
      "name": "bash",
      "arguments": {"command": "echo hello"}
    },
    "id": 2
  }'

# Check current Session states
curl -s http://127.0.0.1:18789/sessions | jq '.'

41.9.3 Sandbox Mode Debugging

Use openclaw sandbox explain to analyze tool-call paths in a sandboxed environment without triggering real side effects:

# Run an Agent in sandbox mode and explain its decision-making
openclaw sandbox explain \
  --agent my-agent-id \
  --message "sort today's emails" \
  --show-tool-calls \
  --show-reasoning

41.10 Production Problem Diagnosis Workflow

41.10.1 Standard Diagnostic SOP

When an alert fires, follow this sequence:

Step 1: Confirm Gateway is online

curl -sf http://127.0.0.1:18789/health | jq '.status'
# No response? Check the process:   openclaw gateway status
# Process stopped? Review history:  openclaw gateway dashboard

Step 2: Run doctor

OUT=/tmp/doctor_$(date +%s).json
openclaw gateway doctor --format json > "$OUT"
jq '.issues' "$OUT"

Step 3: Review failed Session logs

# Find the most recently failed Session IDs
cat ~/.openclaw/logs/openclaw.log | \
  jq 'select(.event == "session_failed") | {sessionId, timestamp, error}' | \
  tail -5

# Extract the complete log for a specific Session
SESSION_ID="sess_xxxx"
cat ~/.openclaw/logs/openclaw.log | \
  jq --arg sid "$SESSION_ID" 'select(.sessionId == $sid)'

Step 4: Identify the root-cause tool

# Find which tool has the most errors
cat ~/.openclaw/logs/openclaw.log | \
  jq -r 'select(.event == "tool_call_failed") | .tool' | \
  sort | uniq -c | sort -rn | head -5

Step 5: Validate the fix

# Manually test the problematic tool
openclaw sandbox explain --message "test <tool-name> operation" --show-tool-calls

# Re-run doctor to confirm resolution
openclaw gateway doctor
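
Steps 1 through 4 can be bundled into a single triage script for on-call use; a sketch:

#!/usr/bin/env bash
# Triage sketch: gateway status, doctor issues, recent failures, worst tools
set -u
LOG="$HOME/.openclaw/logs/openclaw.log"

echo "== 1. gateway status =="
status=$(curl -sf http://127.0.0.1:18789/health | jq -r '.status')
echo "${status:-UNREACHABLE}"

echo "== 2. doctor issues =="
openclaw gateway doctor --format json | jq '.issues'

echo "== 3. last failed sessions =="
jq -r 'select(.event == "session_failed") | "\(.timestamp) \(.sessionId)"' "$LOG" | tail -5

echo "== 4. top failing tools =="
jq -r 'select(.event == "tool_call_failed") | .tool' "$LOG" | sort | uniq -c | sort -rn | head -5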

41.10.2 Common Issue Quick Reference

| Symptom                              | Most Likely Cause                             | Quick Verification Command              |
|--------------------------------------|-----------------------------------------------|-----------------------------------------|
| Gateway won't start                  | Port conflict or config file error            | openclaw gateway doctor --check config  |
| All Sessions timing out              | LLM API key invalid or network issue          | openclaw models list                    |
| Tool calls getting permission errors | Incomplete permission declaration in SKILL.md | openclaw security audit                 |
| Memory growing continuously          | Large file content injected into Context      | Review tokenUsage trend                 |
| Frequent Compaction                  | maxTokenBudget set too low                    | Adjust maxTokenBudget parameter         |

The next chapter presents 10 real-world deployment scenarios, showing how OpenClaw is applied across different business contexts.

💬 Comments