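Chapter 55: Monitoring and Observability Plugins: Tracing Every Claude Decision
55.1 Why Claude Needs Observability
Deploying Claude in production poses a distinctive challenge: Claude's reasoning process is a black box. When Claude makes a bad decision (calling the wrong tool, producing a low-quality reply, or consuming more tokens than expected), you can see only the input and the output unless you have observability infrastructure in place. Everything that happened in between stays hidden.
Observability is a concept from control theory: a system's observability is the degree to which its internal state can be inferred from its external outputs. For Claude, an observability plugin has to answer these core questions: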
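- What: Which decisions did Claude make? Which tools did it call?
- Why: Why did Claude choose this tool rather than another?
- How long: How long did each step take? Where is the bottleneck?
- How much: How many tokens did the conversation consume? What did it cost?
- What went wrong: At which step did the error occur? What was the root cause?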
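55.2 The Three Pillars of Observability
Following the design philosophy of OpenTelemetry, Claude's observability can likewise be divided into three dimensions:
Metrics
Time-series data that reflects the aggregate state of the system. Typical Claude metrics: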
claude.session.count       New sessions created per minute
claude.tool.calls_total    Total number of tool calls (grouped by tool name)
claude.tool.latency_p99    P99 latency of tool calls
claude.tokens.input        Cumulative input tokens
claude.tokens.output       Cumulative output tokens
claude.cost.usd            API call cost (USD)
claude.error.rate          Tool call failure rate
claude.response.latency    Time to first token
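
As a rough illustration of how two of the aggregate metrics above could be derived from raw observations, the sketch below computes a nearest-rank P99 and a failure rate from a list of tool-call samples. The `ToolCallSample` type and both function names are hypothetical and not part of any plugin SDK. Real backends usually estimate percentiles from histogram buckets rather than from raw samples.

```typescript
// Hypothetical raw observation of one tool call (illustrative only).
interface ToolCallSample {
  tool: string;
  latencyMs: number;
  success: boolean;
}

// Nearest-rank percentile: the smallest value with at least p% of samples at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) return NaN;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  const index = Math.min(sorted.length, Math.max(1, rank)) - 1;
  return sorted[index];
}

// Fraction of calls that failed, in [0, 1]; 0 when there are no calls.
function errorRate(samples: ToolCallSample[]): number {
  if (samples.length === 0) return 0;
  const failures = samples.filter((s) => !s.success).length;
  return failures / samples.length;
}

// 100 synthetic samples with latencies 1..100 ms; every 10th call fails.
const samples: ToolCallSample[] = Array.from({ length: 100 }, (_, i) => ({
  tool: "query_database",
  latencyMs: i + 1,
  success: i % 10 !== 0,
}));

const p99 = percentile(samples.map((s) => s.latencyMs), 99);
const rate = errorRate(samples);
console.log({ p99, rate });
```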
Logs
Structured event records that capture the details of a single operation:
{
  "timestamp": "2026-04-28T10:23:45.123Z",
  "level": "INFO",
  "event": "tool_call",
  "sessionId": "sess_abc123",
  "toolName": "query_database",
  "input": { "sql": "SELECT COUNT(*) FROM users WHERE...", "limit": 100 },
  "output": { "rows": 1, "data": [{ "count": 42891 }] },
  "latencyMs": 234,
  "inputTokens": 1847,
  "outputTokens": 312
}
Traces
The complete execution path across multiple tool calls, showing the causal relationships within one session:
Session sess_abc123 [3.4s]
├─ LLM Inference #1 [1.2s] - decision: call query_database
│  └─ tool: query_database [234ms]
├─ LLM Inference #2 [0.8s] - decision: call analyze_results
│  └─ tool: analyze_results [412ms]
└─ LLM Inference #3 [1.1s] - generate the final reply
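
To make the parent/child structure of such a trace concrete, here is a minimal sketch that rebuilds an indented tree from a flat list of spans linked by `parentSpanId`. The `FlatSpan` shape and `renderTree` are illustrative names assumed for this example. A span whose parent is missing from the list is treated as a root.

```typescript
// Minimal span shape for this sketch (hypothetical, for illustration).
interface FlatSpan {
  spanId: string;
  parentSpanId?: string;
  operation: string;
  durationMs: number;
}

// Render spans as an indented tree; children appear under their parent in input order.
function renderTree(spans: FlatSpan[]): string[] {
  const known = new Set(spans.map((s) => s.spanId));
  const children = new Map<string | undefined, FlatSpan[]>();
  for (const span of spans) {
    // Spans whose parent is not in the list are promoted to roots.
    const parent =
      span.parentSpanId !== undefined && known.has(span.parentSpanId)
        ? span.parentSpanId
        : undefined;
    const siblings = children.get(parent) ?? [];
    siblings.push(span);
    children.set(parent, siblings);
  }

  const lines: string[] = [];
  const visit = (parentId: string | undefined, prefix: string): void => {
    const kids = children.get(parentId) ?? [];
    kids.forEach((span, i) => {
      const last = i === kids.length - 1;
      lines.push(`${prefix}${last ? "└─" : "├─"} ${span.operation} [${span.durationMs}ms]`);
      visit(span.spanId, prefix + (last ? "   " : "│  "));
    });
  };
  visit(undefined, "");
  return lines;
}

const tree = renderTree([
  { spanId: "s1", operation: "LLM Inference #1", durationMs: 1200 },
  { spanId: "t1", parentSpanId: "s1", operation: "tool: query_database", durationMs: 234 },
  { spanId: "s2", operation: "LLM Inference #2", durationMs: 800 },
  { spanId: "t2", parentSpanId: "s2", operation: "tool: analyze_results", durationMs: 412 },
  { spanId: "s3", operation: "LLM Inference #3", durationMs: 1100 },
]);
console.log(tree.join("\n"));
```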
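55.3 Architecture of the Monitoring Plugin
monitoring-plugin/
├── plugin.json
├── hooks/
│   ├── session-start.ts     ← session start
│   ├── session-end.ts       ← session end
│   ├── pre-tool.ts          ← records the start of a tool call
│   ├── post-tool.ts         ← records the end of a tool call and computes latency
│   ├── pre-response.ts      ← records the start of inference
│   └── post-response.ts     ← records the end of inference and token consumption
├── types/
│   └── telemetry.ts         ← shared data structures
├── monitor/
│   ├── collector.ts         ← metric collection and reporting
│   ├── session-tracker.ts   ← per-session span bookkeeping
│   ├── metrics.ts           ← metric buffering and flushing
│   └── audit-logger.ts      ← decision audit log
├── exporters/
│   ├── opentelemetry.ts     ← OpenTelemetry export
│   ├── datadog.ts           ← Datadog export
│   └── file.ts              ← local file export
└── dashboard/
    └── queries.md           ← preset queries (for Grafana/Datadog)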
55.4 Implementing the Complete Monitoring Plugin
plugin.json
{
  "name": "claude-observability",
  "version": "1.0.0",
  "description": "Full observability plugin for Claude Code: metrics, logs, and traces",
  "author": "Platform Team <[email protected]>",
  "config": {
    "schema": {
      "exportTarget": {
        "type": "string",
        "description": "Where to export telemetry",
        "enum": ["console", "file", "opentelemetry", "datadog"],
        "default": "console"
      },
      "otlpEndpoint": {
        "type": "string",
        "description": "OpenTelemetry collector endpoint",
        "default": "http://localhost:4318"
      },
      "datadogApiKey": {
        "type": "string",
        "description": "Datadog API key",
        "secret": true
      },
      "samplingRate": {
        "type": "number",
        "description": "Trace sampling rate (0.0 to 1.0)",
        "default": 1.0,
        "minimum": 0,
        "maximum": 1
      },
      "logLevel": {
        "type": "string",
        "enum": ["debug", "info", "warn", "error"],
        "default": "info"
      }
    }
  },
  "hooks": {
    "preToolCall": "./dist/hooks/pre-tool.js",
    "postToolCall": "./dist/hooks/post-tool.js",
    "preResponse": "./dist/hooks/pre-response.js",
    "postResponse": "./dist/hooks/post-response.js",
    "sessionStart": "./dist/hooks/session-start.js",
    "sessionEnd": "./dist/hooks/session-end.js"
  },
  "monitor": {
    "collector": "./dist/monitor/collector.js",
    "sampling": 1.0
  }
}
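
The schema above declares defaults and a `samplingRate` range, but the later listings never show how those settings are applied. The following is a minimal sketch, under the assumption that the plugin receives the user's settings as a partial object: `resolveConfig` merges them over the defaults and rejects invalid values, and `shouldSample` makes a deterministic per-session sampling decision so that all spans of one session are kept or dropped together. Both function names and the FNV-1a-style hash are choices made for this illustration, not part of the plugin SDK.

```typescript
type ExportTarget = "console" | "file" | "opentelemetry" | "datadog";

interface ObservabilityConfig {
  exportTarget: ExportTarget;
  otlpEndpoint: string;
  samplingRate: number;
  logLevel: "debug" | "info" | "warn" | "error";
}

// Defaults copied from the config schema in plugin.json.
const DEFAULTS: ObservabilityConfig = {
  exportTarget: "console",
  otlpEndpoint: "http://localhost:4318",
  samplingRate: 1.0,
  logLevel: "info",
};

const EXPORT_TARGETS: readonly string[] = ["console", "file", "opentelemetry", "datadog"];

// Merge user settings over the defaults and reject out-of-range values.
function resolveConfig(user: Partial<ObservabilityConfig>): ObservabilityConfig {
  const config: ObservabilityConfig = { ...DEFAULTS, ...user };
  if (!EXPORT_TARGETS.includes(config.exportTarget)) {
    throw new Error(`unknown exportTarget: ${config.exportTarget}`);
  }
  if (!(config.samplingRate >= 0 && config.samplingRate <= 1)) {
    throw new RangeError("samplingRate must be between 0 and 1");
  }
  return config;
}

// Hash the session ID into [0, 1) and compare it with the sampling rate.
function shouldSample(sessionId: string, samplingRate: number): boolean {
  let hash = 2166136261; // FNV-1a 32-bit offset basis
  for (let i = 0; i < sessionId.length; i++) {
    hash ^= sessionId.charCodeAt(i);
    hash = Math.imul(hash, 16777619) >>> 0; // FNV prime, kept as unsigned 32-bit
  }
  return hash / 4294967296 < samplingRate;
}

const config = resolveConfig({ samplingRate: 0.25 });
console.log(config.exportTarget, shouldSample("sess_abc123", config.samplingRate));
```

Hashing the session ID instead of calling `Math.random()` keeps the decision stable across hooks, so a pre hook and its post hook agree on whether the session is being traced.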
Core data structures
// types/telemetry.ts
export interface SessionContext {
  sessionId: string;
  userId?: string;
  projectId?: string;
  model: string;
  startTime: Date;
  spans: SpanContext[];
}

export interface SpanContext {
  spanId: string;
  parentSpanId?: string;
  operation: string;
  startTime: Date;
  endTime?: Date;
  attributes: Record<string, string | number | boolean>;
  events: SpanEvent[];
  status: "ok" | "error" | "running";
  errorMessage?: string;
}

export interface SpanEvent {
  timestamp: Date;
  name: string;
  attributes?: Record<string, unknown>;
}

export interface TokenUsage {
  inputTokens: number;
  outputTokens: number;
  cacheReadTokens: number;
  cacheWriteTokens: number;
  costUsd: number;
}
Session tracker
// monitor/session-tracker.ts
import type { SessionContext, SpanContext } from "../types/telemetry.js";
import { randomUUID } from "crypto";

class SessionTracker {
  private sessions = new Map<string, SessionContext>();

  startSession(sessionId: string, model: string, userId?: string): SessionContext {
    const session: SessionContext = {
      sessionId,
      userId,
      model,
      startTime: new Date(),
      spans: [],
    };
    this.sessions.set(sessionId, session);
    return session;
  }

  // Removes the session from the tracker and returns it so the caller can export it
  endSession(sessionId: string): SessionContext | undefined {
    const session = this.sessions.get(sessionId);
    this.sessions.delete(sessionId);
    return session;
  }

  startSpan(
    sessionId: string,
    operation: string,
    parentSpanId?: string
  ): SpanContext {
    const span: SpanContext = {
      spanId: randomUUID(),
      parentSpanId,
      operation,
      startTime: new Date(),
      attributes: {},
      events: [],
      status: "running",
    };
    const session = this.sessions.get(sessionId);
    if (session) {
      session.spans.push(span);
    }
    return span;
  }

  endSpan(
    span: SpanContext,
    status: "ok" | "error",
    attributes?: Record<string, string | number | boolean>
  ): void {
    span.endTime = new Date();
    span.status = status;
    if (attributes) {
      Object.assign(span.attributes, attributes);
    }
  }

  getSession(sessionId: string): SessionContext | undefined {
    return this.sessions.get(sessionId);
  }
}

export const sessionTracker = new SessionTracker();
Hook implementation: before a tool call
// hooks/pre-tool.ts
import type { PreToolCallHook, HookContext } from "@claude/plugin-sdk";
import type { SpanContext } from "../types/telemetry.js";
import { sessionTracker } from "../monitor/session-tracker.js";

// Span store used to correlate the pre and post hooks
export const activeSpans = new Map<string, SpanContext>();

export const preToolCall: PreToolCallHook = async (
  toolName: string,
  toolInput: Record<string, unknown>,
  context: HookContext
) => {
  const sessionId = context.session.id;

  // Start a span for this tool call
  const span = sessionTracker.startSpan(
    sessionId,
    `tool:${toolName}`,
    context.currentSpanId
  );
  span.attributes.toolName = toolName;
  span.attributes.inputSize = JSON.stringify(toolInput).length;

  // Record a summary of the tool input (keys only, never values, to avoid leaking sensitive data)
  span.events.push({
    timestamp: new Date(),
    name: "tool.input",
    attributes: {
      toolName,
      inputKeys: Object.keys(toolInput).join(","),
    },
  });

  // Store the span so the post hook can pick it up
  activeSpans.set(`${sessionId}:${toolName}:${span.spanId}`, span);

  // Inject the spanId into the context so the post hook can look the span up
  context.setMetadata("currentSpanId", span.spanId);

  return { action: "allow" };
};
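
The hook above records only the input's key names. If a little more detail is useful for debugging, one option is to also record each field's type and serialized size while still never storing a value. The sketch below shows that idea; `summarizeInput` is a hypothetical helper introduced for this example, not something the hook above calls.

```typescript
// Describe each top-level field by type and serialized size, never by value.
function summarizeInput(input: Record<string, unknown>): Record<string, string> {
  const summary: Record<string, string> = {};
  for (const [key, value] of Object.entries(input)) {
    const kind = Array.isArray(value) ? "array" : value === null ? "null" : typeof value;
    // JSON.stringify yields undefined for values such as functions; count those as size 0.
    const text: string | undefined = JSON.stringify(value);
    summary[key] = `${kind}(${text ? text.length : 0})`;
  }
  return summary;
}

const summary = summarizeInput({ sql: "SELECT 1", limit: 100, tags: ["a", "b"] });
console.log(summary);
```

The sizes are lengths of the JSON serialization, so a string field counts its surrounding quotes.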
Hook implementation: after a tool call
// hooks/post-tool.ts
import type { PostToolCallHook, HookContext } from "@claude/plugin-sdk";
import { sessionTracker } from "../monitor/session-tracker.js";
import { activeSpans } from "./pre-tool.js";
import { metricsCollector } from "../monitor/metrics.js";

export const postToolCall: PostToolCallHook = async (
  toolName: string,
  toolInput: Record<string, unknown>,
  toolResult: unknown,
  context: HookContext
) => {
  const sessionId = context.session.id;
  const spanId = context.getMetadata("currentSpanId") as string | undefined;
  if (!spanId) return { action: "allow" };

  const spanKey = `${sessionId}:${toolName}:${spanId}`;
  const span = activeSpans.get(spanKey);
  if (span) {
    const latencyMs = Date.now() - span.startTime.getTime();
    const success = !(toolResult as Record<string, unknown>)?.isError;

    sessionTracker.endSpan(span, success ? "ok" : "error", {
      latencyMs,
      // JSON.stringify(undefined) returns undefined, so fall back to null for tools with no result
      outputSize: JSON.stringify(toolResult ?? null).length,
      success,
    });
    activeSpans.delete(spanKey);

    // Update metrics. Note that sessionId is a high-cardinality label;
    // consider dropping it if the backend stores one time series per label set.
    metricsCollector.increment("claude.tool.calls_total", {
      tool: toolName,
      success: String(success),
      sessionId,
    });
    metricsCollector.histogram("claude.tool.latency_ms", latencyMs, {
      tool: toolName,
    });

    if (!success) {
      const errorMsg = (toolResult as Record<string, unknown>)?.error as string | undefined;
      metricsCollector.increment("claude.tool.errors_total", {
        tool: toolName,
        // Truncated to keep the label short
        error: errorMsg?.substring(0, 50) ?? "unknown",
      });
    }
  }

  return { action: "allow" };
};
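
Using a truncated error message as a metric label can still produce many distinct label values, because messages often embed IDs, paths, or timings. One common alternative is to map messages onto a small fixed set of classes before labeling. The sketch below shows that approach; the class names, the patterns, and `classifyError` are all assumptions made for illustration and would need tuning against the errors your tools actually return.

```typescript
type ErrorClass = "timeout" | "permission" | "not_found" | "rate_limit" | "other";

// Ordered rules: the first matching pattern wins.
const RULES: Array<[RegExp, ErrorClass]> = [
  [/timed? ?out|deadline/i, "timeout"],
  [/permission|forbidden|unauthorized|denied/i, "permission"],
  [/not found|no such/i, "not_found"],
  [/rate limit|too many requests/i, "rate_limit"],
];

// Map a free-form error message to a bounded label value.
function classifyError(message: string | undefined): ErrorClass {
  if (!message) return "other";
  for (const [pattern, label] of RULES) {
    if (pattern.test(message)) return label;
  }
  return "other";
}

const classified = [
  "Connection timed out after 30000ms",
  "ENOENT: no such file or directory",
  "permission denied for table users",
  "unexpected token < in JSON",
].map(classifyError);
console.log(classified);
```

With this in place, the `error` label can take at most five values no matter how varied the underlying messages are, and the full message can still go to the logs.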
Hook implementation: after a response (token billing)
// hooks/post-response.ts
import type { PostResponseHook, HookContext } from "@claude/plugin-sdk";
import { metricsCollector } from "../monitor/metrics.js";

// Claude API pricing in USD per 1M tokens. Prices change; verify against the current price list.
const PRICING: Record<string, { input: number; output: number }> = {
  "claude-3-5-sonnet": { input: 3.0, output: 15.0 },
  "claude-3-5-haiku": { input: 0.8, output: 4.0 },
  "claude-3-opus": { input: 15.0, output: 75.0 },
};

// Model IDs usually carry a date suffix (for example "claude-3-5-sonnet-20241022"),
// so match on the prefix instead of requiring an exact key.
function findPricing(model: string): { input: number; output: number } | undefined {
  const key = Object.keys(PRICING).find((prefix) => model.startsWith(prefix));
  return key ? PRICING[key] : undefined;
}

export const postResponse: PostResponseHook = async (
  response: string,
  context: HookContext
) => {
  const usage = context.usage;
  const model = context.session.model;
  if (!usage) return { action: "allow" };

  // Unknown models are recorded with a cost of 0. Cache read/write tokens are not priced here.
  const pricing = findPricing(model);
  const costUsd = pricing
    ? (usage.inputTokens / 1_000_000) * pricing.input +
      (usage.outputTokens / 1_000_000) * pricing.output
    : 0;

  // Record token usage metrics
  const labels = { model, sessionId: context.session.id };
  metricsCollector.add("claude.tokens.input", usage.inputTokens, labels);
  metricsCollector.add("claude.tokens.output", usage.outputTokens, labels);
  metricsCollector.add("claude.cost.usd", costUsd, labels);

  return { action: "allow" };
};
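
The `TokenUsage` structure defined earlier carries `cacheReadTokens` and `cacheWriteTokens`, yet the hook prices only plain input and output tokens. The sketch below works through the arithmetic with cache tokens included. The two multipliers are assumptions (cache writes priced at 1.25 times and cache reads at 0.1 times the base input price) and should be checked against the provider's current price list. `estimateCostUsd` is a name invented for this example.

```typescript
interface Usage {
  inputTokens: number;
  outputTokens: number;
  cacheReadTokens: number;
  cacheWriteTokens: number;
}

// Prices in USD per 1M tokens.
interface Price {
  input: number;
  output: number;
}

// Assumed multipliers relative to the base input price; verify before relying on them.
const CACHE_WRITE_MULTIPLIER = 1.25;
const CACHE_READ_MULTIPLIER = 0.1;

function estimateCostUsd(usage: Usage, price: Price): number {
  const perToken = (usdPerMillion: number): number => usdPerMillion / 1_000_000;
  return (
    usage.inputTokens * perToken(price.input) +
    usage.outputTokens * perToken(price.output) +
    usage.cacheWriteTokens * perToken(price.input * CACHE_WRITE_MULTIPLIER) +
    usage.cacheReadTokens * perToken(price.input * CACHE_READ_MULTIPLIER)
  );
}

// Token counts taken from the log record example: 1847 input and 312 output tokens.
// 1847 * 3 / 1e6 + 312 * 15 / 1e6 = 0.005541 + 0.00468 = 0.010221 USD
const cost = estimateCostUsd(
  { inputTokens: 1847, outputTokens: 312, cacheReadTokens: 0, cacheWriteTokens: 0 },
  { input: 3.0, output: 15.0 }
);
console.log(cost.toFixed(6));
```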
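Metrics collector implementation
// monitor/metrics.ts
export interface MetricPoint {
  name: string;
  value: number;
  labels: Record<string, string>;
  timestamp: Date;
  type: "counter" | "gauge" | "histogram";
}

// Anything that can ship a batch of points to a backend
export interface MetricExporter {
  export(points: MetricPoint[]): Promise<void>;
}

class MetricsCollector {
  private buffer: MetricPoint[] = [];
  private flushInterval: NodeJS.Timeout;

  constructor(private readonly exporter: MetricExporter, flushMs = 5000) {
    this.flushInterval = setInterval(() => void this.flush(), flushMs);
    // Do not keep the process alive just for the flush timer
    this.flushInterval.unref();
  }

  increment(name: string, labels: Record<string, string> = {}, value = 1): void {
    this.buffer.push({ name, value, labels, timestamp: new Date(), type: "counter" });
  }

  add(name: string, value: number, labels: Record<string, string> = {}): void {
    this.buffer.push({ name, value, labels, timestamp: new Date(), type: "counter" });
  }

  histogram(name: string, value: number, labels: Record<string, string> = {}): void {
    this.buffer.push({ name, value, labels, timestamp: new Date(), type: "histogram" });
  }

  private async flush(): Promise<void> {
    if (this.buffer.length === 0) return;
    const points = this.buffer.splice(0, this.buffer.length);
    try {
      await this.exporter.export(points);
    } catch (err) {
      // Telemetry must never crash the host process: report the failure and drop the batch
      console.error("metrics export failed:", err);
    }
  }

  // Stop the timer and flush whatever is still buffered
  async destroy(): Promise<void> {
    clearInterval(this.flushInterval);
    await this.flush();
  }
}

// Instance shared by the hooks. A console exporter is assumed here to match the
// default exportTarget; swap in another exporter according to the plugin settings.
export const metricsCollector = new MetricsCollector({
  export: async (points) => {
    for (const point of points) console.log(JSON.stringify(point));
  },
});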
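
The collector buffers one point per call, so an exporter typically has to combine points that belong to the same series before sending them. Below is a minimal sketch of that step for counters, assuming a series is identified by the metric name plus its labels regardless of label order. `Point`, `seriesKey`, and `aggregateCounters` are illustrative names, not part of the listing above.

```typescript
interface Point {
  name: string;
  value: number;
  labels: Record<string, string>;
  type: "counter" | "gauge" | "histogram";
}

// Stable series key: metric name plus labels sorted by key.
function seriesKey(point: Point): string {
  const labels = Object.keys(point.labels)
    .sort()
    .map((key) => `${key}=${point.labels[key]}`)
    .join(",");
  return `${point.name}{${labels}}`;
}

// Sum counter points per series; gauges and histograms are ignored in this sketch.
function aggregateCounters(points: Point[]): Map<string, number> {
  const totals = new Map<string, number>();
  for (const point of points) {
    if (point.type !== "counter") continue;
    const key = seriesKey(point);
    totals.set(key, (totals.get(key) ?? 0) + point.value);
  }
  return totals;
}

const totals = aggregateCounters([
  { name: "claude.tool.calls_total", value: 1, type: "counter", labels: { tool: "query_database", success: "true" } },
  { name: "claude.tool.calls_total", value: 1, type: "counter", labels: { success: "true", tool: "query_database" } },
  { name: "claude.tool.calls_total", value: 1, type: "counter", labels: { tool: "analyze_results", success: "false" } },
  { name: "claude.tool.latency_ms", value: 234, type: "histogram", labels: { tool: "query_database" } },
]);
console.log([...totals.entries()]);
```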
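55.5 OpenTelemetry Export
OpenTelemetry is a widely adopted industry standard for observability and supports exporting data to backends such as Jaeger, Grafana Tempo, Datadog, and New Relic.
// exporters/opentelemetry.ts
import { context, SpanStatusCode, trace } from "@opentelemetry/api";
import type { Attributes, Span, Tracer } from "@opentelemetry/api";
import { NodeSDK } from "@opentelemetry/sdk-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";
import { OTLPMetricExporter } from "@opentelemetry/exporter-metrics-otlp-http";
import { PeriodicExportingMetricReader } from "@opentelemetry/sdk-metrics";
import type { SessionContext } from "../types/telemetry.js";

export class OpenTelemetryExporter {
  private sdk: NodeSDK;
  private tracer: Tracer;

  constructor(endpoint: string, serviceName: string) {
    this.sdk = new NodeSDK({
      traceExporter: new OTLPTraceExporter({
        url: `${endpoint}/v1/traces`,
      }),
      metricReader: new PeriodicExportingMetricReader({
        exporter: new OTLPMetricExporter({
          url: `${endpoint}/v1/metrics`,
        }),
        exportIntervalMillis: 10000,
      }),
      serviceName,
    });
    this.sdk.start();
    // The original listing used an undeclared `tracer`; obtain one from the global API
    this.tracer = trace.getTracer("claude-observability");
  }

  exportSession(session: SessionContext): void {
    // Convert the SessionContext into OpenTelemetry spans. Spans are stored in
    // start order, so a parent has already been created when its child is reached.
    const exported = new Map<string, Span>();
    for (const span of session.spans) {
      const parent = span.parentSpanId ? exported.get(span.parentSpanId) : undefined;
      const otelSpan = this.tracer.startSpan(
        span.operation,
        {
          startTime: span.startTime,
          attributes: {
            "claude.session.id": session.sessionId,
            "claude.model": session.model,
            ...span.attributes,
          },
        },
        parent ? trace.setSpan(context.active(), parent) : undefined
      );
      exported.set(span.spanId, otelSpan);

      for (const event of span.events) {
        otelSpan.addEvent(event.name, event.attributes as Attributes | undefined, event.timestamp);
      }

      // Spans still marked "running" are exported with an unset status instead of an error
      if (span.status === "ok") {
        otelSpan.setStatus({ code: SpanStatusCode.OK });
      } else if (span.status === "error") {
        otelSpan.setStatus({ code: SpanStatusCode.ERROR, message: span.errorMessage });
      }
      otelSpan.end(span.endTime ?? new Date());
    }
  }
}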
55.6 Cost Tracking Dashboard
Preset queries (Grafana dashboard JSON fragment)
{
  "title": "Claude Cost Tracker",
  "panels": [
    {
      "title": "Daily Cost (USD)",
      "type": "timeseries",
      "targets": [
        {
          "expr": "sum(increase(claude_cost_usd_total[24h])) by (model)",
          "legendFormat": "{{model}}"
        }
      ]
    },
    {
      "title": "Tool Call Latency P99",
      "type": "gauge",
      "targets": [
        {
          "expr": "histogram_quantile(0.99, sum by (le) (rate(claude_tool_latency_ms_bucket[5m])))",
          "legendFormat": "P99 Latency (ms)"
        }
      ]
    },
    {
      "title": "Error Rate by Tool",
      "type": "barchart",
      "targets": [
        {
          "expr": "sum(rate(claude_tool_errors_total[1h])) by (tool) / sum(rate(claude_tool_calls_total[1h])) by (tool)"
        }
      ]
    }
  ]
}
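
The cost panel depends on how a counter's growth over a time window is computed. As a simplified sketch of that idea, the function below sums the growth across ordered samples and treats any drop as a counter reset to zero, for example after a process restart. This is not Prometheus's exact algorithm, which also extrapolates to the window boundaries. `counterIncrease` is a name chosen for this example.

```typescript
// Total growth of a monotonically increasing counter across time-ordered samples.
// A drop is treated as a reset to zero, so the post-reset value counts in full.
function counterIncrease(samples: number[]): number {
  let total = 0;
  for (let i = 1; i < samples.length; i++) {
    const delta = samples[i] - samples[i - 1];
    total += delta >= 0 ? delta : samples[i];
  }
  return total;
}

// 10 -> 12 -> 15 grows by 5; the drop to 2 is a reset contributing 2; 2 -> 4 adds 2.
const increase = counterIncrease([10, 12, 15, 2, 4]);
console.log(increase);
```

A per-second `rate` is this increase divided by the window length in seconds, which is why a rate must be multiplied by 86400, not 24, to approximate a daily total.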
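55.7 Decision Audit Log
In compliance-sensitive settings, metrics and traces are not enough: why Claude made a particular decision must also be auditable.
// monitor/audit-logger.ts
import { createHash } from "node:crypto";
import { appendFile } from "node:fs/promises";
import type { DecisionEvent, HookContext } from "@claude/plugin-sdk";

interface AuditRecord {
  timestamp: string;
  sessionId: string;
  userId: string;
  decisionType: "tool_call" | "tool_skip" | "response_generated" | "safety_block";
  toolName?: string;
  reasoning?: string; // Summary of Claude's thinking (if extended thinking is enabled)
  inputHash: string; // Hash of the input (the raw content is not stored, to protect privacy)
  outputHash?: string;
  metadata: Record<string, unknown>;
}

// Helper assumed here (the original listing leaves hashContent undefined): SHA-256 hex digest
function hashContent(content: string): string {
  return createHash("sha256").update(content, "utf8").digest("hex");
}

export class AuditLogger {
  constructor(private readonly auditLogPath: string) {}

  async logDecision(event: DecisionEvent, session: HookContext["session"]): Promise<void> {
    const record: AuditRecord = {
      timestamp: new Date().toISOString(),
      sessionId: session.id,
      userId: session.userId ?? "anonymous",
      decisionType: event.type,
      toolName: event.toolName,
      reasoning: event.thinking?.substring(0, 500), // Keep only the first 500 characters
      // Fall back to null so a missing input still serializes to a string
      inputHash: hashContent(JSON.stringify(event.input ?? null)),
      outputHash: event.output ? hashContent(JSON.stringify(event.output)) : undefined,
      metadata: {
        model: session.model,
        turnIndex: session.turnIndex,
      },
    };

    // Write to the audit log (immutable storage)
    await this.writeAuditRecord(record);
  }

  private async writeAuditRecord(record: AuditRecord): Promise<void> {
    // A real implementation should write to append-only storage,
    // for example AWS S3 with Object Lock, or PostgreSQL with an insert-only policy
    const line = JSON.stringify(record) + "\n";
    await appendFile(this.auditLogPath, line, "utf8");
  }
}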
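
Hashing `JSON.stringify` output directly makes the digest depend on property order, so two inputs with identical content can produce different hashes. If audit records need to be compared across sessions, one option is to canonicalize before hashing. The sketch below sorts object keys at every level and then applies SHA-256; `canonicalize` and `hashCanonical` are hypothetical helpers for this illustration, and the choice of SHA-256 is an assumption, since the chapter does not name a hash function.

```typescript
import { createHash } from "node:crypto";

// Serialize with object keys sorted at every level so equal content yields equal text.
function canonicalize(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map(canonicalize).join(",")}]`;
  }
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const body = Object.keys(obj)
      .sort()
      .map((key) => `${JSON.stringify(key)}:${canonicalize(obj[key])}`)
      .join(",");
    return `{${body}}`;
  }
  // Values JSON cannot represent (undefined, functions) are written as null.
  const text: string | undefined = JSON.stringify(value);
  return text ?? "null";
}

// SHA-256 hex digest of the canonical form.
function hashCanonical(value: unknown): string {
  return createHash("sha256").update(canonicalize(value), "utf8").digest("hex");
}

const h1 = hashCanonical({ sql: "SELECT 1", limit: 100 });
const h2 = hashCanonical({ limit: 100, sql: "SELECT 1" });
console.log(h1 === h2);
```

Note that an unsalted hash of a short or guessable input can be reversed by brute force, so hashing alone does not anonymize low-entropy fields.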
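Summary
The monitoring and observability plugin makes Claude's black-box decision process transparent and traceable. Three core pillars (metrics for aggregate state, logs for event detail, and traces for causality) together form a complete observability stack. Telemetry collection through the Hooks layer is non-invasive: Claude Code's core code needs no modification. OpenTelemetry's standard interfaces allow integration with mainstream monitoring backends such as Grafana, Datadog, and Jaeger. Cost tracking and decision audit logs are essential compliance infrastructure in enterprise deployments. The next chapter discusses how to deploy a private plugin registry in an enterprise environment.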