Description

Structured logging, distributed tracing, and metrics collection patterns for building observable systems. Use when implementing logging infrastructure, setting up distributed tracing with OpenTelemetry, designing metrics collection (RED/USE methods), configuring alerting and dashboards, or reviewing observability practices. Covers structured JSON logging, context propagation, trace sampling, Prometheus/Grafana stack, alert design, and PII/secret scrubbing.

README (SKILL.md)

Logging & Observability

Name: Logging Observability
Author: wpank

Patterns for building observable systems across the three pillars: logs, metrics, and traces.

Three Pillars

Pillar	Purpose	Question It Answers	Example
Logs	What happened	Why did this request fail?	`{"level":"error","msg":"payment declined","user_id":"u_82"}`
Metrics	How much / how fast	Is latency increasing?	`http_request_duration_seconds{route="/api/orders"} 0.342`
Traces	Request flow	Where is the bottleneck?	Span: `api-gateway → auth → order-service → db`

Each pillar is strongest when correlated. Embed trace_id in every log line to jump from a log entry to the full distributed trace.

Structured Logging

Always emit logs as structured JSON — never free-text strings.

Required Fields

Field	Purpose	Required
`timestamp`	ISO-8601 with milliseconds	Yes
`level`	Severity (DEBUG … FATAL)	Yes
`service`	Originating service name	Yes
`message`	Human-readable description	Yes
`trace_id`	Distributed trace correlation	Yes
`span_id`	Current span within trace	Yes
`correlation_id`	Business-level correlation (order ID)	When applicable
`error`	Structured error object	On errors
`context`	Request-specific metadata	Recommended
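The field list above can be written down as a type so that a record missing a required field fails at compile time. This is a minimal sketch: `LogEntry`, `buildLogEntry`, and the sample IDs are illustrative names and values, not part of any logging library.

```typescript
// Hypothetical schema mirroring the Required Fields table.
type Level = "TRACE" | "DEBUG" | "INFO" | "WARN" | "ERROR" | "FATAL";

interface LogEntry {
  timestamp: string; // ISO-8601 with milliseconds
  level: Level;
  service: string;
  message: string;
  trace_id: string;
  span_id: string;
  correlation_id?: string; // when applicable
  error?: { name: string; message: string; stack?: string }; // on errors
  context?: Record<string, unknown>; // recommended
}

// Builds one record. `now` is injectable so the output is deterministic in tests.
function buildLogEntry(
  fields: Omit<LogEntry, "timestamp">,
  now: Date = new Date(),
): LogEntry {
  // toISOString() always emits millisecond precision in UTC.
  return { timestamp: now.toISOString(), ...fields };
}

const entry = buildLogEntry(
  {
    level: "ERROR",
    service: "order-service",
    message: "payment declined",
    trace_id: "4bf92f3577b34da6a3ce929d0e0e4736",
    span_id: "00f067aa0ba902b7",
    correlation_id: "ORD-1234",
  },
  new Date(Date.UTC(2024, 0, 15, 10, 30, 0, 123)),
);

// Emit exactly one JSON object per line.
const line = JSON.stringify(entry);
```

In practice a logger such as Pino or structlog would produce the record; the type is mainly useful as a shared contract between services.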

Context Enrichment

Attach context at the middleware level so downstream logs inherit automatically:

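import { AsyncLocalStorage } from 'node:async_hooks';
import { randomUUID } from 'node:crypto';

// One store per process; each request gets its own context object.
const asyncLocalStorage = new AsyncLocalStorage();

// Register after authentication middleware so req.user is populated.
app.use((req, res, next) => {
  const ctx = {
    // Reuse the caller's trace ID when present, otherwise start a new one.
    trace_id: req.headers['x-trace-id'] || randomUUID(),
    request_id: randomUUID(),
    user_id: req.user?.id,
    method: req.method,
    path: req.path,
  };
  // Everything called from next() can read ctx via asyncLocalStorage.getStore().
  asyncLocalStorage.run(ctx, () => next());
});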
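On the reading side, a log helper can pull the context back out of the store so call sites never pass `trace_id` by hand. The sketch below assumes Node's built-in `AsyncLocalStorage`; `RequestContext`, `logLine`, and `handleOrder` are hypothetical names used only for illustration.

```typescript
import { AsyncLocalStorage } from "node:async_hooks";

// Hypothetical context shape; a subset of the fields the middleware sets.
interface RequestContext {
  trace_id: string;
  request_id: string;
  user_id?: string;
}

const asyncLocalStorage = new AsyncLocalStorage<RequestContext>();

// Merges the ambient request context into every record.
// Outside a request (no store), the record simply has no context fields.
function logLine(
  level: string,
  message: string,
  fields: Record<string, unknown> = {},
): string {
  const ctx = asyncLocalStorage.getStore() ?? {};
  return JSON.stringify({ level, message, ...ctx, ...fields });
}

// Code called inside run(), including after an await, sees the same store.
async function handleOrder(): Promise<string> {
  await Promise.resolve(); // stand-in for real async work
  return logLine("info", "order placed", { order_id: "ORD-1234" });
}

const pending = asyncLocalStorage.run(
  { trace_id: "t-1", request_id: "r-1", user_id: "u_82" },
  () => handleOrder(),
);
```

Explicit `fields` are spread last, so a call site can still override a context value when it needs to.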

Library Recommendations

Library	Language	Strengths	Perf
Pino	Node.js	Fastest Node logger, low overhead	Excellent
structlog	Python	Composable processors, context binding	Good
zerolog	Go	Zero-allocation JSON logging	Excellent
zap	Go	High performance, typed fields	Excellent
tracing	Rust	Spans + events, async-aware	Excellent

Choose a logger that outputs structured JSON natively. Avoid loggers requiring post-processing.

Log Levels

Level	When to Use	Example
FATAL	App cannot continue, process will exit	Database connection pool exhausted
ERROR	Operation failed, needs attention	Payment charge failed: CARD_DECLINED
WARN	Unexpected but recoverable	Retry 2/3 for upstream timeout
INFO	Normal business events	Order ORD-1234 placed successfully
DEBUG	Developer troubleshooting	Cache miss for key user:82:preferences
TRACE	Very fine-grained (rarely in prod)	Entering validateAddress with payload

Rules: Production default = INFO and above. If you log an ERROR, someone should act on it. Every FATAL should trigger an alert.
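The "INFO and above" rule comes down to comparing positions in an ordered list of levels. A minimal sketch, assuming a hypothetical `LevelFilter` class; the threshold is mutable so it can be changed at runtime without a redeploy.

```typescript
// Severity order from the table above, lowest to highest.
const LEVELS = ["TRACE", "DEBUG", "INFO", "WARN", "ERROR", "FATAL"] as const;
type Level = (typeof LEVELS)[number];

// Hypothetical filter; not the API of any particular logging library.
class LevelFilter {
  private threshold: Level;

  constructor(threshold: Level = "INFO") {
    this.threshold = threshold;
  }

  // Call from an admin endpoint or signal handler to change verbosity live.
  setLevel(level: Level): void {
    this.threshold = level;
  }

  // A record is emitted when its level is at or above the threshold.
  isEnabled(level: Level): boolean {
    return LEVELS.indexOf(level) >= LEVELS.indexOf(this.threshold);
  }
}

const filter = new LevelFilter(); // production default: INFO
const debugInProd = filter.isEnabled("DEBUG"); // false until the level is lowered
```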

Distributed Tracing

OpenTelemetry Setup

Always prefer OpenTelemetry over vendor-specific SDKs:

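// Load this file before any application module so auto-instrumentation can patch libraries.
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';

const sdk = new NodeSDK({
  serviceName: 'order-service',
  traceExporter: new OTLPTraceExporter({
    // OTLP over HTTP; 4318 is the collector's default HTTP port.
    url: 'http://otel-collector:4318/v1/traces',
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();

// Flush buffered spans before the process exits.
process.on('SIGTERM', () => {
  sdk.shutdown().finally(() => process.exit(0));
});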

Span Creation

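import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('order-service');

async function processOrder(order: Order) {
  // startActiveSpan makes the span current, so child spans nest under it.
  return tracer.startActiveSpan('processOrder', async (span) => {
    try {
      span.setAttribute('order.id', order.id);
      span.setAttribute('order.total_cents', order.totalCents);
      await validateInventory(order);
      await chargePayment(order);
      span.setStatus({ code: SpanStatusCode.OK });
    } catch (err) {
      // A caught value is not guaranteed to be an Error, so normalise it first.
      const error = err instanceof Error ? err : new Error(String(err));
      span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
      span.recordException(error);
      throw err;
    } finally {
      // Always end the span, or it is never exported.
      span.end();
    }
  });
}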

Context Propagation

Use W3C Trace Context (traceparent header) — default in OTel
Propagate across HTTP, gRPC, and message queues
For async workers: serialise traceparent into the job payload
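For the async-worker case, the sketch below shows one way to validate a `traceparent` value and carry it inside a job payload. `parseTraceparent`, `enqueue`, and the `Job` shape are hypothetical; only the header layout (version `00`, 32-hex trace ID, 16-hex parent ID, 2-hex flags) comes from the W3C Trace Context format.

```typescript
// version "00" - 32 hex trace ID - 16 hex parent span ID - 2 hex flags (lowercase only)
const TRACEPARENT = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

interface TraceParent {
  traceId: string;
  parentId: string;
  sampled: boolean;
}

function parseTraceparent(header: string): TraceParent | null {
  const match = TRACEPARENT.exec(header.trim());
  if (!match) return null;
  const [, traceId, parentId, flags] = match;
  // All-zero IDs are invalid.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  // Bit 0 of the flags byte is the "sampled" flag.
  return { traceId, parentId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}

// Hypothetical job envelope: the header travels next to the business payload.
interface Job<T> {
  payload: T;
  traceparent?: string;
}

function enqueue<T>(payload: T, incoming: string | undefined): Job<T> {
  // Drop malformed headers instead of propagating them.
  return incoming !== undefined && parseTraceparent(incoming) !== null
    ? { payload, traceparent: incoming.trim() }
    : { payload };
}

// Producer side: attach the header. Worker side: parse it to continue the trace.
const job = enqueue(
  { orderId: "ORD-1234" },
  "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
);
const parent = job.traceparent ? parseTraceparent(job.traceparent) : null;
```

When the OpenTelemetry API is available, `propagation.inject` and `propagation.extract` handle the same round trip, and hand-parsing is only needed where the SDK is not.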

Trace Sampling

Strategy	Use When
Always On	Low-traffic services, debugging
Probabilistic (N%)	General production use
Rate-limited (N/sec)	High-throughput services
Tail-based	When you need all error traces

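Aim to keep 100% of error traces regardless of strategy. Head-based strategies (always on, probabilistic, rate-limited) decide before the outcome is known, so guaranteeing this requires tail-based sampling.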
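The difference between a head-based and a tail-based decision can be sketched as two small functions. This illustrates the idea and is not how any particular SDK or collector implements it; `headSample`, `tailSample`, and the 1000 ms slow-trace threshold are assumptions.

```typescript
// Head-based: decide from the trace ID alone, before the request runs.
// Hashing on the ID means every service reaches the same decision for a trace.
function headSample(traceId: string, ratio: number): boolean {
  // Interpret the last 8 hex chars as an integer in [0, 2^32 - 1].
  const bucket = parseInt(traceId.slice(-8), 16);
  return bucket < ratio * 0x100000000;
}

// Tail-based: decide after the trace has finished, when the outcome is known.
interface FinishedTrace {
  traceId: string;
  hasError: boolean;
  durationMs: number;
}

function tailSample(t: FinishedTrace, ratio: number, slowMs = 1000): boolean {
  if (t.hasError) return true; // keep every error trace
  if (t.durationMs >= slowMs) return true; // keep slow traces (assumed threshold)
  return headSample(t.traceId, ratio); // sample the rest probabilistically
}
```

Tail-based sampling needs every span of a trace to be buffered in one place until the trace completes, which is why it usually runs in a collector and not in the application.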

Metrics Collection

RED Method (Request-Driven)

Monitor these three for every service endpoint:

Metric	What It Measures	Prometheus Example
Rate	Requests/sec	`rate(http_requests_total[5m])`
Errors	Failed request ratio	`sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))`
Duration	Response time	`histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))`
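To make the three RED numbers concrete, the sketch below computes them from raw request records held in memory. Prometheus derives the same quantities from counters and histogram buckets instead; `RequestRecord`, `percentile`, and `summarize` are hypothetical helpers.

```typescript
interface RequestRecord {
  status: number;
  durationMs: number;
}

interface RedSummary {
  ratePerSec: number; // Rate
  errorRatio: number; // Errors, as a fraction of all requests
  p99Ms: number; // Duration
}

// Nearest-rank percentile over a sorted copy of the input.
function percentile(values: number[], p: number): number {
  if (values.length === 0) return 0;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(0, rank - 1)];
}

// Summarises all requests observed in a window of `windowSec` seconds.
function summarize(records: RequestRecord[], windowSec: number): RedSummary {
  // 5xx responses count as failures, matching status=~"5.." in the table.
  const errors = records.filter((r) => r.status >= 500).length;
  return {
    ratePerSec: records.length / windowSec,
    errorRatio: records.length === 0 ? 0 : errors / records.length,
    p99Ms: percentile(records.map((r) => r.durationMs), 99),
  };
}
```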

USE Method (Resource-Driven)

For infrastructure components (CPU, memory, disk, network):

Metric	What It Measures	Example
Utilization	% resource busy	CPU usage at 78%
Saturation	Work queued/waiting	12 requests queued in thread pool
Errors	Error events on resource	3 disk I/O errors in last minute

Monitoring Stack

Tool	Category	Best For
Prometheus	Metrics	Pull-based metrics, alerting rules
Grafana	Visualisation	Dashboards for metrics, logs, traces
Jaeger	Tracing	Distributed trace visualisation
Loki	Logs	Log aggregation (pairs with Grafana)
OpenTelemetry	Collection	Vendor-neutral telemetry collection

Recommendation: Start with OTel Collector → Prometheus + Grafana + Loki + Jaeger. Migrate to SaaS only when operational overhead justifies cost.

Alert Design

Severity Levels

Severity	Response Time	Example
P1	Immediate	Service fully down, data loss
P2	< 30 min	Error rate > 5%, latency p99 > 5s
P3	Business hours	Disk > 80%, cert expiring in 7 days
P4	Best effort	Non-critical deprecation warning

Alert Fatigue Prevention

Alert on symptoms, not causes — "error rate > 5%" not "pod restarted"
Multi-window, multi-burn-rate — catch both sudden spikes and slow burns
Require runbook links — every alert must link to diagnosis and remediation
Review monthly — delete or tune alerts that never fire or always fire
Group related alerts — use inhibition rules to suppress child alerts
Set appropriate thresholds — if alert fires daily and is ignored, raise threshold or delete
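The multi-window, multi-burn-rate item can be expressed as a short calculation: burn rate is the observed error ratio divided by the error budget (1 minus the SLO target), and an alert fires only when a long and a short window both exceed the threshold. The sketch below is illustrative; `burnRate` and `shouldPage` are hypothetical names, and the 14.4 threshold with a 1 h / 5 min window pair is a commonly cited starting point that should be treated as an assumption to tune.

```typescript
// Burn rate 1 means the error budget is being spent exactly as fast as the SLO allows.
function burnRate(errorRatio: number, sloTarget: number): number {
  return errorRatio / (1 - sloTarget);
}

interface WindowPair {
  longRatio: number; // error ratio over the long window, e.g. 1 h
  shortRatio: number; // error ratio over the short window, e.g. 5 min
}

// Long window: the problem is sustained. Short window: it is still happening now.
function shouldPage(w: WindowPair, sloTarget: number, threshold: number): boolean {
  return (
    burnRate(w.longRatio, sloTarget) >= threshold &&
    burnRate(w.shortRatio, sloTarget) >= threshold
  );
}

// Assumed example: 99.9% SLO, 2% of requests failing in both windows.
const page = shouldPage({ longRatio: 0.02, shortRatio: 0.02 }, 0.999, 14.4);
```

Requiring both windows is what stops the alert from firing on a spike that has already recovered.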

Dashboard Patterns

Overview Dashboard ("War Room")

Total requests/sec across all services
Global error rate (%) with trendline
p50 / p95 / p99 latency
Active alerts count by severity
Deployment markers overlaid on graphs

Service Dashboard (Per-Service)

RED metrics for each endpoint
Dependency health (upstream/downstream success rates)
Resource utilisation (CPU, memory, connections)
Top errors table with count and last seen

Observability Checklist

Every service must have:

Structured JSON logging with consistent schema
Correlation / trace IDs propagated on all requests
RED metrics exposed for every external endpoint
Health check endpoints (/healthz and /readyz)
Distributed tracing with OpenTelemetry
Dashboards for RED metrics and resource utilisation
Alerts for error rate, latency, and saturation with runbook links
Log level configurable at runtime without redeployment
PII scrubbing verified and tested
Retention policies defined for logs, metrics, and traces
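For the health check item, the usual split is that `/healthz` reports whether the process is alive and `/readyz` reports whether its dependencies are reachable. A framework-agnostic sketch, assuming hypothetical `liveness` and `readiness` functions whose results an HTTP handler would map onto the two routes:

```typescript
// A dependency probe resolves to true when the dependency is usable.
type Check = () => Promise<boolean>;

interface HealthResponse {
  status: number; // HTTP status code to return
  body: { status: string; failed: string[] };
}

// /healthz: no dependency checks, so a database outage does not restart the pod.
function liveness(): HealthResponse {
  return { status: 200, body: { status: "ok", failed: [] } };
}

// /readyz: 503 while any dependency check fails or throws.
async function readiness(checks: Record<string, Check>): Promise<HealthResponse> {
  const failed: string[] = [];
  for (const [name, check] of Object.entries(checks)) {
    try {
      if (!(await check())) failed.push(name);
    } catch {
      failed.push(name); // a throwing probe counts as a failure
    }
  }
  return failed.length === 0
    ? { status: 200, body: { status: "ok", failed } }
    : { status: 503, body: { status: "unavailable", failed } };
}
```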

Anti-Patterns

Anti-Pattern	Problem	Fix
Logging PII	Privacy/compliance violation	Mask or exclude PII; use token references
Excessive logging	Storage costs balloon, signal drowns	Log business events, not data flow
Unstructured logs	Cannot query or alert on fields	Use structured JSON with consistent schema
String interpolation	Breaks structured fields, injection risk	Pass fields as metadata, not in message
Missing correlation IDs	Cannot trace across services	Generate and propagate trace_id everywhere
Alert storms	On-call fatigue, real issues buried	Use grouping, inhibition, deduplication
Metrics with high cardinality	Prometheus OOM, dashboard timeouts	Never use user ID or request ID as label
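One common way to keep metric label cardinality bounded is to normalise request paths before using them as a `route` label, so that `/api/orders/1` and `/api/orders/2` share one time series. A minimal sketch; `normalizeRoute` and its `:id` / `:uuid` placeholders are assumptions, and a framework's own route template is preferable when it is available.

```typescript
const UUID = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

// Replaces ID-like path segments with placeholders and drops the query string.
function normalizeRoute(path: string): string {
  return path
    .split("?")[0]
    .split("/")
    .map((segment) => {
      if (/^\d+$/.test(segment)) return ":id";
      if (UUID.test(segment)) return ":uuid";
      return segment;
    })
    .join("/");
}

const routeLabel = normalizeRoute("/api/orders/12345?expand=items");
```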

NEVER Do

NEVER log passwords, tokens, API keys, or secrets — even at DEBUG level
NEVER use console.log / print in production — use a structured logger
NEVER use user IDs, emails, or request IDs as metric labels — cardinality will explode
NEVER create alerts without a runbook link — unactionable alerts erode trust
NEVER rely on logs alone — you need metrics and traces for full observability
NEVER log request/response bodies by default — opt-in only, with PII redaction
NEVER ignore log volume — set budgets and alert when a service exceeds daily quota
NEVER skip context propagation in async flows — broken traces are worse than no traces

Usage Guidance

This skill appears coherent and low-risk: it only provides code patterns and recommendations for logging, tracing, and metrics and does not request secrets or system access. Before using: 1) review and adapt the code snippets to your environment (don’t paste secrets into logs), 2) ensure your OTLP/Prometheus endpoints are internal and authenticated as needed, and 3) if you follow the README install hints, verify the source URL and avoid running unfamiliar install commands. If you need higher assurance, ask the publisher for a canonical repository/homepage (the skill's source/homepage are unknown).

Capability Analysis

Type: OpenClaw Skill
Name: logging-observability
Version: 0.1.0

The skill provides comprehensive guidance on observability best practices, including structured logging, distributed tracing with OpenTelemetry, and metrics collection. It explicitly warns against logging sensitive data like PII and secrets, and offers standard installation instructions. No malicious code, data exfiltration, or prompt injection attempts against the AI agent were identified in `_meta.json`, `SKILL.md`, or `README.md`.

Capability Assessment

✓ Purpose & Capability

The name/description (logging, tracing, metrics) match the SKILL.md content. The instructions only reference logging libraries, OpenTelemetry, Prometheus/Grafana, and code-level instrumentation — all appropriate for an observability skill. There are no unrelated required env vars, binaries, or config paths.

✓ Instruction Scope

SKILL.md contains implementation patterns and code snippets (middleware for context enrichment, OTel setup, spans, sampling, RED/USE metrics). The snippets reference request headers and user IDs, which is expected for correlating traces/logs. The instructions explicitly recommend PII/secret scrubbing. There are no directives to read arbitrary system files, export unrelated credentials, or post data to unknown external endpoints (OTLP exporter URL points to an internal collector hostname, which is typical).

ℹ Install Mechanism

This is instruction-only (no install spec, no code files), which is lower risk. README includes example installation commands (npx add with a GitHub tree URL and manual copy instructions). Those README install hints are informal and the npx URL (a GitHub tree path) may not work as-is; since the skill has no formal install step and no remote downloads, there is no execution-time install risk from the skill itself. Still, verify any external installation commands before running them.

✓ Credentials

The skill declares no required environment variables, credentials, or config paths. The SKILL.md does not instruct the agent to read environment secrets or request unrelated tokens. This is proportionate for an observability guidance skill.

✓ Persistence & Privilege

The skill does not request persistent presence (always: false) and has no install-time components that modify agent/system-wide configs. Agent autonomous invocation is allowed (platform default) but not combined with other red flags.

Version History

v0.1.0

Initial release of logging-observability, providing comprehensive patterns for building observable systems.

- Covers structured JSON logging, distributed tracing with OpenTelemetry, and metrics collection (RED/USE methods).
- Provides guidelines for log levels, context enrichment, and sensitive data scrubbing.
- Includes recommended logging libraries and monitoring stack (Prometheus, Grafana, Loki, Jaeger).
- Offers alerting best practices and dashboard design patterns for production systems.

Metadata

Slug logging-observability

Version 0.1.0

License —

All-time Installs 12

Active Installs 12

Total Versions 1

Frequently Asked Questions

What is Logging Observability?

Structured logging, distributed tracing, and metrics collection patterns for building observable systems. It is an AI Agent Skill for Claude Code / OpenClaw, with 1789 downloads so far.

How do I install Logging Observability?

Run "/install logging-observability" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is Logging Observability free?

Yes, Logging Observability is completely free (open-source). You can download, install and use it at no cost.

Which platforms does Logging Observability support?

Logging Observability is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created Logging Observability?

It is built and maintained by wpank (@wpank); the current version is v0.1.0.
