← 返回 Skills 市场
wpank

Logging Observability

作者 wpank · GitHub ↗ · v0.1.0
cross-platform ✓ 安全检测通过
1789
总下载
1
收藏
12
当前安装
1
版本数
在 OpenClaw 中安装
/install logging-observability
功能描述
Structured logging, distributed tracing, and metrics collection patterns for building observable systems. Use when implementing logging infrastructure, setting up distributed tracing with OpenTelemetry, designing metrics collection (RED/USE methods), configuring alerting and dashboards, or reviewing observability practices. Covers structured JSON logging, context propagation, trace sampling, Prometheus/Grafana stack, alert design, and PII/secret scrubbing.
使用说明 (SKILL.md)

Logging & Observability

Patterns for building observable systems across the three pillars: logs, metrics, and traces.

Three Pillars

Pillar Purpose Question It Answers Example
Logs What happened Why did this request fail? {"level":"error","msg":"payment declined","user_id":"u_82"}
Metrics How much / how fast Is latency increasing? http_request_duration_seconds{route="/api/orders"} 0.342
Traces Request flow Where is the bottleneck? Span: api-gateway → auth → order-service → db

Each pillar is strongest when correlated. Embed trace_id in every log line to jump from a log entry to the full distributed trace.


Structured Logging

Always emit logs as structured JSON — never free-text strings.

Required Fields

Field Purpose Required
timestamp ISO-8601 with milliseconds Yes
level Severity (DEBUG … FATAL) Yes
service Originating service name Yes
message Human-readable description Yes
trace_id Distributed trace correlation Yes
span_id Current span within trace Yes
correlation_id Business-level correlation (order ID) When applicable
error Structured error object On errors
context Request-specific metadata Recommended

Context Enrichment

Attach context at the middleware level so downstream logs inherit automatically:

app.use((req, res, next) => {
  const ctx = {
    trace_id: req.headers['x-trace-id'] || crypto.randomUUID(),
    request_id: crypto.randomUUID(),
    user_id: req.user?.id,
    method: req.method,
    path: req.path,
  };
  asyncLocalStorage.run(ctx, () => next());
});

Library Recommendations

Library Language Strengths Perf
Pino Node.js Fastest Node logger, low overhead Excellent
structlog Python Composable processors, context binding Good
zerolog Go Zero-allocation JSON logging Excellent
zap Go High performance, typed fields Excellent
tracing Rust Spans + events, async-aware Excellent

Choose a logger that outputs structured JSON natively. Avoid loggers requiring post-processing.


Log Levels

Level When to Use Example
FATAL App cannot continue, process will exit Database connection pool exhausted
ERROR Operation failed, needs attention Payment charge failed: CARD_DECLINED
WARN Unexpected but recoverable Retry 2/3 for upstream timeout
INFO Normal business events Order ORD-1234 placed successfully
DEBUG Developer troubleshooting Cache miss for key user:82:preferences
TRACE Very fine-grained (rarely in prod) Entering validateAddress with payload

Rules: Production default = INFO and above. If you log an ERROR, someone should act on it. Every FATAL should trigger an alert.


Distributed Tracing

OpenTelemetry Setup

Always prefer OpenTelemetry over vendor-specific SDKs:

import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';

const sdk = new NodeSDK({
  serviceName: 'order-service',
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4318/v1/traces',
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();

Span Creation

const tracer = trace.getTracer('order-service');

async function processOrder(order: Order) {
  return tracer.startActiveSpan('processOrder', async (span) => {
    try {
      span.setAttribute('order.id', order.id);
      span.setAttribute('order.total_cents', order.totalCents);
      await validateInventory(order);
      await chargePayment(order);
      span.setStatus({ code: SpanStatusCode.OK });
    } catch (err) {
      span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
      span.recordException(err);
      throw err;
    } finally {
      span.end();
    }
  });
}

Context Propagation

  • Use W3C Trace Context (traceparent header) — default in OTel
  • Propagate across HTTP, gRPC, and message queues
  • For async workers: serialise traceparent into the job payload

Trace Sampling

Strategy Use When
Always On Low-traffic services, debugging
Probabilistic (N%) General production use
Rate-limited (N/sec) High-throughput services
Tail-based When you need all error traces

Always sample 100% of error traces regardless of strategy.


Metrics Collection

RED Method (Request-Driven)

Monitor these three for every service endpoint:

Metric What It Measures Prometheus Example
Rate Requests/sec rate(http_requests_total[5m])
Errors Failed request ratio rate(http_requests_total{status=~"5.."}[5m])
Duration Response time histogram_quantile(0.99, http_request_duration_seconds)

USE Method (Resource-Driven)

For infrastructure components (CPU, memory, disk, network):

Metric What It Measures Example
Utilization % resource busy CPU usage at 78%
Saturation Work queued/waiting 12 requests queued in thread pool
Errors Error events on resource 3 disk I/O errors in last minute

Monitoring Stack

Tool Category Best For
Prometheus Metrics Pull-based metrics, alerting rules
Grafana Visualisation Dashboards for metrics, logs, traces
Jaeger Tracing Distributed trace visualisation
Loki Logs Log aggregation (pairs with Grafana)
OpenTelemetry Collection Vendor-neutral telemetry collection

Recommendation: Start with OTel Collector → Prometheus + Grafana + Loki + Jaeger. Migrate to SaaS only when operational overhead justifies cost.


Alert Design

Severity Levels

Severity Response Time Example
P1 Immediate Service fully down, data loss
P2 \x3C 30 min Error rate > 5%, latency p99 > 5s
P3 Business hours Disk > 80%, cert expiring in 7 days
P4 Best effort Non-critical deprecation warning

Alert Fatigue Prevention

  • Alert on symptoms, not causes — "error rate > 5%" not "pod restarted"
  • Multi-window, multi-burn-rate — catch both sudden spikes and slow burns
  • Require runbook links — every alert must link to diagnosis and remediation
  • Review monthly — delete or tune alerts that never fire or always fire
  • Group related alerts — use inhibition rules to suppress child alerts
  • Set appropriate thresholds — if alert fires daily and is ignored, raise threshold or delete

Dashboard Patterns

Overview Dashboard ("War Room")

  • Total requests/sec across all services
  • Global error rate (%) with trendline
  • p50 / p95 / p99 latency
  • Active alerts count by severity
  • Deployment markers overlaid on graphs

Service Dashboard (Per-Service)

  • RED metrics for each endpoint
  • Dependency health (upstream/downstream success rates)
  • Resource utilisation (CPU, memory, connections)
  • Top errors table with count and last seen

Observability Checklist

Every service must have:

  • Structured JSON logging with consistent schema
  • Correlation / trace IDs propagated on all requests
  • RED metrics exposed for every external endpoint
  • Health check endpoints (/healthz and /readyz)
  • Distributed tracing with OpenTelemetry
  • Dashboards for RED metrics and resource utilisation
  • Alerts for error rate, latency, and saturation with runbook links
  • Log level configurable at runtime without redeployment
  • PII scrubbing verified and tested
  • Retention policies defined for logs, metrics, and traces

Anti-Patterns

Anti-Pattern Problem Fix
Logging PII Privacy/compliance violation Mask or exclude PII; use token references
Excessive logging Storage costs balloon, signal drowns Log business events, not data flow
Unstructured logs Cannot query or alert on fields Use structured JSON with consistent schema
String interpolation Breaks structured fields, injection risk Pass fields as metadata, not in message
Missing correlation IDs Cannot trace across services Generate and propagate trace_id everywhere
Alert storms On-call fatigue, real issues buried Use grouping, inhibition, deduplication
Metrics with high cardinality Prometheus OOM, dashboard timeouts Never use user ID or request ID as label

NEVER Do

  1. NEVER log passwords, tokens, API keys, or secrets — even at DEBUG level
  2. NEVER use console.log / print in production — use a structured logger
  3. NEVER use user IDs, emails, or request IDs as metric labels — cardinality will explode
  4. NEVER create alerts without a runbook link — unactionable alerts erode trust
  5. NEVER rely on logs alone — you need metrics and traces for full observability
  6. NEVER log request/response bodies by default — opt-in only, with PII redaction
  7. NEVER ignore log volume — set budgets and alert when a service exceeds daily quota
  8. NEVER skip context propagation in async flows — broken traces are worse than no traces
安全使用建议
This skill appears coherent and low-risk: it only provides code patterns and recommendations for logging, tracing, and metrics and does not request secrets or system access. Before using: 1) review and adapt the code snippets to your environment (don’t paste secrets into logs), 2) ensure your OTLP/Prometheus endpoints are internal and authenticated as needed, and 3) if you follow the README install hints, verify the source URL and avoid running unfamiliar install commands. If you need higher assurance, ask the publisher for a canonical repository/homepage (the skill's source/homepage are unknown).
功能分析
Type: OpenClaw Skill Name: logging-observability Version: 0.1.0 The skill provides comprehensive guidance on observability best practices, including structured logging, distributed tracing with OpenTelemetry, and metrics collection. It explicitly warns against logging sensitive data like PII and secrets, and offers standard installation instructions. No malicious code, data exfiltration, or prompt injection attempts against the AI agent were identified in `_meta.json`, `SKILL.md`, or `README.md`.
能力评估
Purpose & Capability
The name/description (logging, tracing, metrics) match the SKILL.md content. The instructions only reference logging libraries, OpenTelemetry, Prometheus/Grafana, and code-level instrumentation — all appropriate for an observability skill. There are no unrelated required env vars, binaries, or config paths.
Instruction Scope
SKILL.md contains implementation patterns and code snippets (middleware for context enrichment, OTel setup, spans, sampling, RED/USE metrics). The snippets reference request headers and user IDs, which is expected for correlating traces/logs. The instructions explicitly recommend PII/secret scrubbing. There are no directives to read arbitrary system files, export unrelated credentials, or post data to unknown external endpoints (OTLP exporter URL points to an internal collector hostname, which is typical).
Install Mechanism
This is instruction-only (no install spec, no code files), which is lower risk. README includes example installation commands (npx add with a GitHub tree URL and manual copy instructions). Those README install hints are informal and the npx URL (a GitHub tree path) may not work as-is; since the skill has no formal install step and no remote downloads, there is no execution-time install risk from the skill itself. Still, verify any external installation commands before running them.
Credentials
The skill declares no required environment variables, credentials, or config paths. The SKILL.md does not instruct the agent to read environment secrets or request unrelated tokens. This is proportionate for an observability guidance skill.
Persistence & Privilege
The skill does not request persistent presence (always: false) and has no install-time components that modify agent/system-wide configs. Agent autonomous invocation is allowed (platform default) but not combined with other red flags.
如何使用
  1. 确保已安装 OpenClaw(本地或 Docker 部署)
  2. 在对话框中输入安装命令:/install logging-observability
  3. 安装完成后,直接呼叫该 Skill 的名称或使用 /logging-observability 触发
  4. 根据 Skill 的参数说明提供必要输入,即可获得结构化输出
版本历史
v0.1.0
Initial release of logging-observability, providing comprehensive patterns for building observable systems. - Covers structured JSON logging, distributed tracing with OpenTelemetry, and metrics collection (RED/USE methods). - Provides guidelines for log levels, context enrichment, and sensitive data scrubbing. - Includes recommended logging libraries and monitoring stack (Prometheus, Grafana, Loki, Jaeger). - Offers alerting best practices and dashboard design patterns for production systems.
元数据
Slug logging-observability
版本 0.1.0
许可证
累计安装 12
当前安装数 12
历史版本数 1
常见问题

Logging Observability 是什么?

Structured logging, distributed tracing, and metrics collection patterns for building observable systems. Use when implementing logging infrastructure, setting up distributed tracing with OpenTelemetry, designing metrics collection (RED/USE methods), configuring alerting and dashboards, or reviewing observability practices. Covers structured JSON logging, context propagation, trace sampling, Prometheus/Grafana stack, alert design, and PII/secret scrubbing. 它是一个面向 Claude Code / OpenClaw 的 AI Agent Skill 插件,目前累计下载 1789 次。

如何安装 Logging Observability?

在 OpenClaw 或 Claude Code 对话框中运行命令「/install logging-observability」即可一键安装,无需额外配置。

Logging Observability 是免费的吗?

是的,Logging Observability 完全免费(开源免费),可自由下载、安装和使用。

Logging Observability 支持哪些平台?

Logging Observability 跨平台运行,可在任意部署了 OpenClaw / Claude Code 的环境中使用(cross-platform)。

谁开发了 Logging Observability?

由 wpank(@wpank)开发并维护,当前版本 v0.1.0。

💬 留言讨论