Three Pillars
Logs | Timestamped event records — what happened |
Metrics | Numeric time-series data — how is the system performing |
Traces | Request path across services — where is the latency |
Tools
Prometheus + Grafana | Metrics collection and dashboards |
Jaeger / Zipkin | Distributed tracing |
ELK Stack | Elasticsearch, Logstash, Kibana for logs |
OpenTelemetry | Vendor-neutral instrumentation standard |
Datadog / New Relic | All-in-one commercial APM |
SLO/SLA/SLI
SLI (Indicator) | Measurement: e.g., request success rate |
SLO (Objective) | Target: e.g., 99.9% success rate |
SLA (Agreement) | Contract with consequences for missing SLO |
Error budget | Allowed failure within SLO period |