Chapter 45

Observability: Logging, Metrics and Tracing

Observability: Logging, Metrics and Tracing

In 2016, Charity Majors wrote in a blog post: "You are not debugging code. You are debugging a system." This sentence precisely describes the operational challenge of modern distributed systems: your code may be entirely correct, yet your service begins misbehaving because an upstream service's latency spiked, a database connection pool was exhausted, or a Kubernetes node is under memory pressure. In this environment, the traditional approach of "look at the logs to find the error" is nowhere near sufficient.

The word "observability" comes from control theory: a system's observability is the degree to which you can infer its internal state from its external outputs. A highly observable system allows you to understand any problem you have never seen before in production โ€” using data, not prediction. This chapter covers the three pillars of observability for Go services: logging, metrics, and distributed tracing.

Level 1 ยท Why Observability

Each Pillar Has Its Own Blind Spots

Logs are the oldest and most intuitive observability tool. When a 404 or 500 appears, you open the logs, find the relevant lines, and see what happened. Logs have high information density โ€” a single log entry can carry arbitrary context. But logs have fundamental limitations.

First, logs are event-oriented: they record "what happened" but not "how often." You can see from a log that one database query took 5 seconds, but you cannot tell whether this is a one-in-a-thousand transient event or something that happens every ten seconds. Second, log volume can overwhelm you. A medium-scale microservices cluster may produce hundreds of thousands of log lines per second. At this volume, searching and correlating logs is expensive. Third, logs have no aggregation semantics: you cannot extract "the P99 response time over the past 5 minutes" directly from raw logs โ€” you need to ship them into Elasticsearch or ClickHouse first, then write a query, introducing several minutes of latency.

Metrics address logs' weakness in aggregation. Metrics are numerical summaries of system state, naturally suited for time-series databases like Prometheus. You can query P99 latency, requests per second, and error rates in real time. But metrics have low information density โ€” a counter only tells you "how many times," not "what each occurrence looked like." When P99 latency suddenly spikes, metrics tell you that it happened but not why.

Traces solve a problem specific to distributed systems: a request traverses multiple services, each contributing some portion of the total latency โ€” where is the bottleneck? Distributed tracing stitches together the complete lifetime of a request โ€” from the client, through the API gateway, through business services, all the way to the database โ€” into a tree of time-ordered Spans. However, tracing is the most expensive pillar: it requires propagating trace context across all service boundaries, and its storage costs are higher than both logs and metrics.

True observability is not picking one of the three. It is using all three together: metrics alert you that a problem exists (P99 latency is high), traces help you locate the problem (which database query in which service is slow), and logs help you understand the details (the exact SQL statement, its parameters, and the error message).

The Cost of Observability

Observability is not free. Storing logs, metrics, and traces requires infrastructure. Collecting them adds CPU and memory overhead to your application. Maintaining the observability stack itself requires engineering time.

A single Prometheus instance at medium scale can handle millions of samples per second, but when you have thousands of services each with hundreds of metrics, you need federation or a horizontally scalable solution like Thanos or Cortex. Jaeger trace storage may require tens of gigabytes to multiple terabytes per day, depending on sampling rate. Log storage is typically the most expensive of the three.

This is why sampling is so important in observability engineering. It is neither possible nor necessary to record a complete trace for every single request โ€” but we must ensure that problematic requests (high latency, errors) are not discarded.

Level 2 ยท How They Work Under the Hood

log/slog: Go's Standard for Structured Logging

Go 1.21 introduced the log/slog package, the standard library's official answer to structured logging. Before it, the ecosystem was fragmented between zap (open-sourced by Uber) and zerolog, with each project writing its own adapter layer.

The core design of slog is the Handler interface:

type Handler interface {
    Enabled(context.Context, Level) bool
    Handle(context.Context, Record) error
    WithAttrs(attrs []Attr) Handler
    WithGroup(name string) Handler
}

A slog.Logger does not write log output directly. It builds a Record and passes it to a Handler. The standard library ships two built-in handlers: TextHandler (human-readable format) and JSONHandler (machine-readable format). Production environments almost universally use JSONHandler, because JSON logs can be parsed directly by Elasticsearch, Loki, ClickHouse, and similar tools.

Key differences between slog and zap:

Log levels and sampling: In high-concurrency systems, Debug log volume can exceed Info log volume by orders of magnitude. Enabling Debug logging in production is usually catastrophic. slog's Enabled(ctx, level) mechanism enables early exit at the outermost log call site, avoiding the cost of building a Record that will be discarded:

if logger.Enabled(ctx, slog.LevelDebug) {
    // Only do expensive string formatting when Debug is actually enabled
    logger.Debug("detailed request log", "headers", formatHeaders(r.Header))
}

Prometheus: The Pull Model for Metrics

Prometheus uses a pull model: the Prometheus Server periodically makes HTTP requests to the /metrics endpoint of monitored services, scraping a current snapshot of metrics. This is the opposite of the push model used by tools like StatsD.

Advantages of the pull model:

Prometheus's four fundamental metric types:

Counter: a monotonically increasing cumulative value, for recording total event counts (total requests, total errors). Do not use a Counter's raw value directly; use the rate() function to compute the per-second increment.

Gauge: a value that can increase or decrease, for recording instantaneous state (active connections, memory usage, queue depth).

Histogram: distributes observations into predefined buckets, for measuring distributions of latency, request size, etc. A Histogram automatically provides _bucket, _sum, and _count metrics, enabling computation of any percentile (P50, P95, P99).

Summary: similar to Histogram but computes percentiles on the client side. Because percentiles are non-additive across replicas (you cannot average two services' P99 to get the overall P99), Summary is less flexible than Histogram in multi-replica environments.

The Go Prometheus client's in-memory storage: each Metric is added to a Registry (global or custom) when registered. During a scrape, the Registry iterates all Metrics and serializes them to Prometheus text format. This serialization is O(n) in the number of metrics and typically completes in milliseconds.

OpenTelemetry: The Unified Standard for Distributed Tracing

OpenTelemetry (OTel) emerged from the merger of OpenCensus and OpenTracing and is now a CNCF incubating project. Its mission is to be a unified standard for collecting and exporting observability data.

OTel's core concepts:

Trace: a global view of one complete request, consisting of a group of Spans, identified by a globally unique Trace ID (128 bits).

Span: one unit of work within a trace, with a start and end time, a status (OK/Error), attributes (key-value pairs), and events (log-like records anchored to a point in time). A Span has a Span ID (64 bits) and an optional Parent Span ID, forming a tree structure.

Context Propagation: when a request crosses a service boundary, the trace context (Trace ID + Span ID + sampling flags) must travel with the request. The W3C TraceContext specification defines a standard HTTP header format (traceparent), which OTel implements.

Exporter: after the OTel SDK collects Span data, an Exporter sends it to a storage backend (Jaeger, Zipkin, Tempo, etc.) or to an OTel Collector via OTLP (OpenTelemetry Protocol), which can fan out to multiple backends.

The Go OTel SDK architecture: a TracerProvider is the entry point for tracing. You create a Tracer from it and use the Tracer to create Spans. The sampling strategy is configured when initializing the TracerProvider and is transparent to application code.

Level 3 ยท Code in Practice

Structured Logging for a Gin Service with log/slog

package main

import (
    "context"
    "log/slog"
    "net/http"
    "os"
    "time"

    "github.com/gin-gonic/gin"
    "github.com/google/uuid"
)

type contextKey string

const traceIDKey contextKey = "trace_id"

func loggerFromCtx(ctx context.Context) *slog.Logger {
    logger := slog.Default()
    if traceID, ok := ctx.Value(traceIDKey).(string); ok {
        logger = logger.With("trace_id", traceID)
    }
    return logger
}

func SlogMiddleware() gin.HandlerFunc {
    return func(c *gin.Context) {
        start := time.Now()
        traceID := c.GetHeader("X-Trace-ID")
        if traceID == "" {
            traceID = uuid.New().String()
        }

        ctx := context.WithValue(c.Request.Context(), traceIDKey, traceID)
        c.Request = c.Request.WithContext(ctx)
        c.Header("X-Trace-ID", traceID)

        c.Next()

        logger := loggerFromCtx(ctx)
        logger.Info("http request",
            "method", c.Request.Method,
            "path", c.Request.URL.Path,
            "status", c.Writer.Status(),
            "latency_ms", time.Since(start).Milliseconds(),
            "client_ip", c.ClientIP(),
            "bytes", c.Writer.Size(),
        )
    }
}

func main() {
    handler := slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{
        Level: slog.LevelInfo,
    })
    slog.SetDefault(slog.New(handler))

    r := gin.New()
    r.Use(SlogMiddleware())

    r.GET("/users/:id", func(c *gin.Context) {
        logger := loggerFromCtx(c.Request.Context())
        logger.Debug("fetching user", "user_id", c.Param("id"))
        c.JSON(http.StatusOK, gin.H{"id": c.Param("id")})
    })

    r.Run(":8080")
}

Prometheus Instrumentation for a Gin Service

package metrics

import (
    "strconv"
    "time"

    "github.com/gin-gonic/gin"
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
    httpRequestDuration = promauto.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "http_request_duration_seconds",
            Help:    "Distribution of HTTP request processing times",
            Buckets: []float64{.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10},
        },
        []string{"method", "path", "status"},
    )

    httpRequestsTotal = promauto.NewCounterVec(
        prometheus.CounterOpts{
            Name: "http_requests_total",
            Help: "Total number of HTTP requests",
        },
        []string{"method", "path", "status"},
    )

    httpActiveRequests = promauto.NewGauge(
        prometheus.GaugeOpts{
            Name: "http_active_requests",
            Help: "Number of HTTP requests currently being processed",
        },
    )

    businessErrorsTotal = promauto.NewCounterVec(
        prometheus.CounterOpts{
            Name: "business_errors_total",
            Help: "Total number of business logic errors",
        },
        []string{"error_type"},
    )
)

func PrometheusMiddleware() gin.HandlerFunc {
    return func(c *gin.Context) {
        start := time.Now()
        httpActiveRequests.Inc()
        defer httpActiveRequests.Dec()

        c.Next()

        status := strconv.Itoa(c.Writer.Status())
        duration := time.Since(start).Seconds()
        // Use the route template (/users/:id), NOT the actual path
        // to avoid a high-cardinality explosion in Prometheus
        path := c.FullPath()
        if path == "" {
            path = "unknown"
        }

        labels := prometheus.Labels{
            "method": c.Request.Method,
            "path":   path,
            "status": status,
        }
        httpRequestDuration.With(labels).Observe(duration)
        httpRequestsTotal.With(labels).Inc()
    }
}

func RegisterMetricsEndpoint(r *gin.Engine) {
    r.GET("/metrics", gin.WrapH(promhttp.Handler()))
}

The choice of c.FullPath() over c.Request.URL.Path is critical. If you use the actual URL path (e.g., /users/12345), every distinct user ID generates a new label combination, causing a high-cardinality problem: Prometheus maintains independent time series for every unique label set, and its memory usage explodes.

Distributed Tracing Across HTTP Calls with OpenTelemetry

package tracing

import (
    "context"
    "net/http"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/codes"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
    "go.opentelemetry.io/otel/propagation"
    "go.opentelemetry.io/otel/sdk/resource"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
    semconv "go.opentelemetry.io/otel/semconv/v1.21.0"
)

func InitTracer(ctx context.Context, serviceName, endpoint string) (*sdktrace.TracerProvider, error) {
    exporter, err := otlptracehttp.New(ctx,
        otlptracehttp.WithEndpoint(endpoint),
        otlptracehttp.WithInsecure(),
    )
    if err != nil {
        return nil, err
    }

    res, err := resource.New(ctx,
        resource.WithAttributes(
            semconv.ServiceName(serviceName),
            semconv.ServiceVersion("1.0.0"),
        ),
    )
    if err != nil {
        return nil, err
    }

    tp := sdktrace.NewTracerProvider(
        sdktrace.WithBatcher(exporter),
        sdktrace.WithResource(res),
        // Sample 10% of requests in production
        sdktrace.WithSampler(sdktrace.TraceIDRatioBased(0.1)),
    )

    otel.SetTracerProvider(tp)
    otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
        propagation.TraceContext{}, // W3C TraceContext standard
        propagation.Baggage{},
    ))

    return tp, nil
}

func fetchUserFromDB(ctx context.Context, userID int64) (*User, error) {
    tracer := otel.Tracer("user-service")
    ctx, span := tracer.Start(ctx, "db.findUser")
    defer span.End()

    span.SetAttributes(
        attribute.Int64("user.id", userID),
        attribute.String("db.system", "postgresql"),
        attribute.String("db.statement", "SELECT * FROM users WHERE id = $1"),
    )

    user, err := db.FindUser(ctx, userID)
    if err != nil {
        span.RecordError(err)
        span.SetStatus(codes.Error, err.Error())
        return nil, err
    }

    span.SetAttributes(attribute.String("user.name", user.Name))
    return user, nil
}

// Inject the current trace context into outbound HTTP request headers
func callDownstream(ctx context.Context, url string) (*http.Response, error) {
    req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
    if err != nil {
        return nil, err
    }

    // Injects W3C traceparent header automatically
    otel.GetTextMapPropagator().Inject(ctx, propagation.HeaderCarrier(req.Header))

    return http.DefaultClient.Do(req)
}

W3C TraceContext traceparent format: 00-{trace-id}-{parent-id}-{flags}, for example 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01. A downstream service extracts this value and calls otel.GetTextMapPropagator().Extract(ctx, propagation.HeaderCarrier(req.Header)) to restore the trace context. Any new Span created in that context automatically becomes a child of the upstream Span.

Injecting Trace IDs into Log Entries

The key to correlating metrics, traces, and logs is ensuring every log entry carries the Trace ID of the current request. When you see a latency anomaly in Grafana, you can jump directly to the corresponding trace, then click "view logs" to see all log entries written during that Span.

import "go.opentelemetry.io/otel/trace"

func slogWithTrace(ctx context.Context) *slog.Logger {
    span := trace.SpanFromContext(ctx)
    if !span.SpanContext().IsValid() {
        return slog.Default()
    }
    return slog.Default().With(
        "trace_id", span.SpanContext().TraceID().String(),
        "span_id",  span.SpanContext().SpanID().String(),
    )
}

Grafana's Loki data source can extract the trace_id field from log entries and automatically generate jump links to Tempo (Grafana's distributed tracing backend), creating seamless navigation between logs, metrics, and traces.

Building a Grafana Dashboard

A Grafana dashboard is built around PromQL queries. Key monitoring panels:

# Request rate (per second)
rate(http_requests_total[5m])

# P99 latency (requires Histogram type)
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, path)
)

# Error rate (fraction of 5xx responses)
sum(rate(http_requests_total{status=~"5.."}[5m])) by (path)
/
sum(rate(http_requests_total[5m])) by (path)

# Active connections (Gauge)
http_active_requests

A well-designed Grafana dashboard follows the USE method (Utilization, Saturation, Errors) or RED method (Rate, Errors, Duration) rather than dumping every metric onto a single page.

Level 4 ยท Advanced Topics and Edge Cases

Log Sampling Strategies: Head vs Tail

Head-based sampling: at the moment a request enters the system, decide probabilistically whether to trace it (e.g., 10%). The advantage is simplicity and low overhead; the disadvantage is that 90% of the time, a valuable error request will be discarded.

Tail-based sampling: collect all Spans first; after the complete trace is assembled, decide whether to keep it based on the outcome (did an error occur? did latency exceed a threshold?). The advantage is that all "interesting" requests are guaranteed to be recorded; the disadvantage is that a middle tier (typically the OTel Collector) must buffer all Spans for a complete trace in memory, making it more complex and memory-intensive.

OTel Collector supports tail-based sampling via tailsamplingprocessor:

# otel-collector-config.yaml
processors:
  tail_sampling:
    decision_wait: 10s  # wait 10s to collect the complete trace
    policies:
      - name: errors-policy
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow-traces-policy
        type: latency
        latency: {threshold_ms: 1000}
      - name: sample-10-percent
        type: probabilistic
        probabilistic: {sampling_percentage: 10}

This configuration retains all error requests, all requests exceeding 1 second, and 10% of all other requests.

Exemplars: Linking Metrics to Traces

An Exemplar (as defined in the OpenMetrics specification) is a specific sample point attached to a Prometheus Histogram bucket, carrying a Trace ID. When Grafana displays P99 latency and it suddenly spikes, you can click the Exemplar to jump directly to the corresponding trace.

Using Exemplars in Go:

import "github.com/prometheus/client_golang/prometheus"

// Use ObserveWithExemplar instead of Observe
httpRequestDuration.With(labels).(prometheus.ExemplarObserver).ObserveWithExemplar(
    duration,
    prometheus.Labels{
        "traceID": span.SpanContext().TraceID().String(),
    },
)

Prometheus must be started with --enable-feature=exemplar-storage to persist Exemplar data.

Continuous Profiling with Pyroscope

Distributed tracing tells you which Span is slowest but not why. Continuous profiling fills that gap. Pyroscope (now part of Grafana) is an open-source continuous profiling tool that periodically takes pprof CPU and memory samples from running Go processes, storing them with timestamps for historical analysis.

Integrating Pyroscope into a Go service:

import "github.com/grafana/pyroscope-go"

func initPyroscope(serviceName string) {
    pyroscope.Start(pyroscope.Config{
        ApplicationName: serviceName,
        ServerAddress:   "http://pyroscope:4040",
        ProfileTypes: []pyroscope.ProfileType{
            pyroscope.ProfileCPU,
            pyroscope.ProfileAllocObjects,
            pyroscope.ProfileAllocSpace,
            pyroscope.ProfileInuseObjects,
            pyroscope.ProfileInuseSpace,
            pyroscope.ProfileGoroutines,
        },
    })
}

Pyroscope supports OTel integration, injecting the Trace ID as a label into the profile โ€” enabling "click from a Trace directly into the flame graph from that exact moment in time."

SLI / SLO / SLA: Turning Metrics into Commitments

SLI (Service Level Indicator): a quantitative measure of service quality, typically expressed as a PromQL query. Examples: "fraction of successful requests", "P99 latency."

SLO (Service Level Objective): the target value for an SLI. Examples: "99.9% of requests should complete within 200 ms", "monthly availability must be at least 99.95%."

SLA (Service Level Agreement): the SLO expressed as an external legal commitment, typically with penalty clauses for violations.

Managing SLOs centers on the concept of the error budget: if your SLO is 99.9%, your monthly error budget is 0.1% ร— 30 days ร— 24 hours ร— 60 minutes = 43.2 minutes of permitted downtime. When the error budget is depleted, the team should halt new feature releases and focus exclusively on reliability improvements.

Tracking error budget burn rate in Prometheus:

# Does the error rate over the past 1 hour exceed 14.4x the SLO budget rate?
(
  sum(rate(http_requests_total{status=~"5.."}[1h]))
  /
  sum(rate(http_requests_total[1h]))
) > 14.4 * 0.001  # 14.4x burn rate for a 99.9% SLO

Alert Fatigue Management

Too many alerts is more dangerous than no alerts: engineers begin ignoring them, and real problems are buried in noise.

Alert design principles:

  1. Every alert must be actionable: when an alert fires, the on-call engineer should immediately know what to do. If there is no runbook, the alert should not exist.

  2. Alert on SLO burn rate, not raw metric thresholds: instead of alerting "P99 latency exceeds 500 ms," alert "error budget burn rate is too fast (burn rate > 14.4x)."

  3. Multi-window, multi-burn-rate alerting (from Google's SRE book):

# Short window, fast alert โ€” catches rapidly developing incidents
- alert: HighBurnRateFast
  expr: |
    (error_rate_1h > 14.4 * 0.001) and (error_rate_5m > 14.4 * 0.001)
  for: 2m
  labels:
    severity: critical

# Long window, slow alert โ€” catches slow-burning issues, suppresses transient spikes
- alert: HighBurnRateSlow
  expr: |
    (error_rate_6h > 6 * 0.001) and (error_rate_30m > 6 * 0.001)
  for: 15m
  labels:
    severity: warning

Cost Optimization for High-Volume Logging

The most effective way to optimize log costs is reducing log volume at the source, not compressing at the storage layer.

Reuse structured fields: avoid repeating the same fields (service name, environment, version) in every log entry. Add these once at slog.Logger initialization via With; they will automatically appear in all subsequent entries.

Log sampling: for high-frequency, low-value log entries (health check requests, heartbeats), apply sampling:

var healthCheckCount atomic.Int64

func healthCheckHandler(c *gin.Context) {
    if healthCheckCount.Add(1)%100 == 0 {
        slog.Info("health check", "total_checks", healthCheckCount.Load())
    }
    c.Status(http.StatusOK)
}

Tiered storage: keep hot data (last 24 hours) on Elasticsearch hot nodes; query cold data (older than 7 days) via Loki or S3 + Athena, dramatically reducing storage costs.

Observability is not a one-time infrastructure investment โ€” it is a continuously evolving practice. As system complexity grows, your investment in observability should grow proportionally. An observable system gives you, at 3 a.m. during a production incident, enough data to find the root cause within 15 minutes. That capability is worth far more than its cost.

Rate this chapter
4.7  / 5  (3 ratings)

๐Ÿ’ฌ Comments