Observability Guide

Three Pillars

Logs: Timestamped event records (what happened)
Metrics: Numeric time-series data (how the system is performing)
Traces: Request path across services (where the latency is)
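
To make the distinction concrete, here is a minimal sketch in which a single request produces all three signals. It uses only the Python standard library. The names `handle_request`, `metrics`, and the `demo` logger are hypothetical, and the in-memory counter and returned dict merely stand in for a real metrics backend and tracer.

```python
import json
import logging
import time
import uuid
from collections import Counter

logger = logging.getLogger("demo")  # hypothetical logger name
metrics = Counter()                 # hypothetical stand-in for a metrics backend


def handle_request(path: str) -> dict:
    """Handle one fake request and emit a log, a metric, and a trace span."""
    trace_id = uuid.uuid4().hex     # ties the log line and the span together
    start = time.perf_counter()

    time.sleep(0.01)                # placeholder for real work

    duration_s = time.perf_counter() - start

    # Metric: a number aggregated over time (how the system is performing).
    metrics["requests_total"] += 1

    # Log: a timestamped record of one event (what happened).
    logger.info(json.dumps({
        "ts": time.time(),
        "event": "request_handled",
        "path": path,
        "trace_id": trace_id,
    }))

    # Trace span: timing for one step of the request path (where the latency is).
    return {"trace_id": trace_id, "name": path, "duration_s": duration_s}
```

Sharing one `trace_id` between the log line and the span is what lets a reader jump from a slow trace to the matching log records.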

Tools

Prometheus + Grafana: Metrics collection and dashboards
Jaeger / Zipkin: Distributed tracing
ELK Stack: Elasticsearch, Logstash, and Kibana for logs
OpenTelemetry: Vendor-neutral instrumentation standard
Datadog / New Relic: All-in-one commercial APM

SLO/SLA/SLI

SLI (Indicator): Measurement, e.g., request success rate
SLO (Objective): Target for an SLI, e.g., 99.9% success rate
SLA (Agreement): Contract with consequences for missing the SLO
Error budget: Amount of failure the SLO allows over its period, e.g., 0.1% of requests for a 99.9% SLO
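
The arithmetic linking these terms can be sketched as follows. This is an illustration under stated assumptions, not a standard API: the function names `error_budget` and `allowed_downtime_minutes` are hypothetical, and the SLI is assumed to be a simple ratio of good events to total events.

```python
def error_budget(slo_target: float, total_events: int, bad_events: int) -> dict:
    """Compare a measured SLI against an SLO and report the error budget.

    slo_target is a fraction, e.g. 0.999 for a 99.9% SLO.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    if total_events <= 0 or not 0 <= bad_events <= total_events:
        raise ValueError("need total_events > 0 and 0 <= bad_events <= total_events")

    sli = (total_events - bad_events) / total_events  # measured success rate
    budget = (1.0 - slo_target) * total_events        # failures the SLO allows
    return {
        "sli": sli,
        "slo_met": sli >= slo_target,
        "budget_events": budget,
        "budget_remaining": budget - bad_events,      # negative means overspent
    }


def allowed_downtime_minutes(slo_target: float, window_days: int) -> float:
    """Time-based view of the same budget: minutes of downtime the SLO allows."""
    return (1.0 - slo_target) * window_days * 24 * 60
```

For example, `error_budget(0.999, 1_000_000, 400)` reports a budget of about 1,000 failed requests with about 600 remaining, and `allowed_downtime_minutes(0.999, 30)` gives roughly 43.2 minutes for a 30-day window.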