SRE Practice Guide
SLI / SLO / SLA Definitions
| Term | Definition | Example |
|---|---|---|
| SLI (Indicator) | The measurable metric that indicates service health | Request success rate, latency p99, error rate |
| SLO (Objective) | Target value for the SLI over a time window | 99.9% availability over 30 days |
| SLA (Agreement) | Contractual commitment, with consequences for missing the SLO | 99.9% uptime; 10% service credit if below |
| Error Budget | 1 - SLO = allowable downtime/errors | 99.9% SLO = 43.8 min/month budget (average month; 43.2 min over exactly 30 days) |
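
The relationship between the four terms can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the `SLO` dataclass, the function names, and the request counts are all hypothetical, and the choice to treat a window with no traffic as healthy is an assumption.

```python
from dataclasses import dataclass


@dataclass
class SLO:
    """Hypothetical container: a target ratio over a time window."""
    target: float      # e.g. 0.999 for 99.9%
    window_days: int   # e.g. 30


def availability_sli(good_events: int, total_events: int) -> float:
    """SLI: fraction of events that were good (assumption: no traffic counts as healthy)."""
    if total_events == 0:
        return 1.0
    return good_events / total_events


def error_budget_remaining(slo: SLO, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent; negative once the SLO is missed."""
    allowed_bad = (1 - slo.target) * total_events
    actual_bad = total_events - good_events
    if allowed_bad == 0:
        return 1.0 if actual_bad == 0 else float("-inf")
    return 1 - actual_bad / allowed_bad


slo = SLO(target=0.999, window_days=30)
sli = availability_sli(good_events=999_500, total_events=1_000_000)
print(f"SLI = {sli:.4%}, meets SLO: {sli >= slo.target}")
print(f"budget remaining: {error_budget_remaining(slo, 999_500, 1_000_000):.0%}")
```

With 500 failed requests out of a million, the SLI sits above the 99.9% target and roughly half of the budget is still available. An SLA would sit on top of this as a contractual rule, for example a credit issued when the SLI ends the window below the target.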
Common SLIs
| Service Type | Key SLIs |
|---|---|
| Request/Response (API) | Availability (2xx/total), latency p99, error rate |
| Data Pipeline | Freshness (time since last successful run), correctness |
| Storage | Durability (data loss rate), read/write availability, latency |
| Batch Processing | Throughput, completion rate, success rate |
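
As a rough sketch of how the request/response and pipeline SLIs above might be computed from raw samples: the `(status_code, latency_ms)` tuple shape, the function names, the use of a nearest-rank percentile, and counting only 5xx responses as errors are all assumptions made for illustration.

```python
import math
from datetime import datetime, timezone


def percentile(values, pct: int):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def request_slis(requests):
    """requests: non-empty list of (status_code, latency_ms) tuples (hypothetical shape)."""
    total = len(requests)
    ok = sum(1 for status, _ in requests if 200 <= status < 300)
    server_errors = sum(1 for status, _ in requests if status >= 500)
    return {
        "availability": ok / total,            # 2xx / total, as in the table
        "error_rate": server_errors / total,   # assumption: only 5xx count as errors
        "latency_p99_ms": percentile([lat for _, lat in requests], 99),
    }


def freshness_seconds(last_success: datetime, now: datetime) -> float:
    """Data pipeline freshness: time elapsed since the last successful run."""
    return (now - last_success).total_seconds()


sample = [(200, 10)] * 98 + [(500, 900), (200, 50)]
print(request_slis(sample))
print(freshness_seconds(
    datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc),
    datetime(2024, 1, 1, 0, 5, tzinfo=timezone.utc),
))
```

In production these values usually come from a metrics system rather than in-process lists; the point here is only the definition of each ratio.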
Error Budget Calculation
# SLO: 99.9% availability over 30 days
Error Budget = (1 - 0.999) × 30 × 24 × 60 = 43.2 minutes
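# Current burn rate (1 = spending the budget exactly over the SLO window)
Burn Rate = Error Rate / (1 - SLO)
# Share of the budget consumed during an alert window
Budget Consumed = Burn Rate × (window / SLO window)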
# Alert: fast burn (last 1h burning 2% of monthly budget)
Fast Burn Alert: burn_rate > 14.4 for 1h
→ page on-call
# Alert: slow burn (last 6h burning 5% of monthly budget)
Slow Burn Alert: burn_rate > 6 for 6h
→ create ticket
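
The arithmetic above can be checked with a short Python sketch. It takes burn rate as the observed error rate divided by the budgeted error rate (1 - SLO), and the share of budget spent as that rate scaled by window / SLO window. The function names and the sample 2% error rate are hypothetical; the thresholds are the ones listed above.

```python
def error_budget_minutes(slo: float, window_days: int) -> float:
    """Allowed 'bad' minutes in the SLO window."""
    return (1 - slo) * window_days * 24 * 60


def burn_rate(error_rate: float, slo: float) -> float:
    """How many times faster than budgeted the service is failing."""
    return error_rate / (1 - slo)


def budget_consumed(rate: float, window_hours: float, slo_window_days: int) -> float:
    """Fraction of the whole budget spent by burning at `rate` for `window_hours`."""
    return rate * window_hours / (slo_window_days * 24)


def alert_action(rate_1h: float, rate_6h: float):
    """Thresholds from the rules above: >14.4 over 1h pages, >6 over 6h opens a ticket."""
    if rate_1h > 14.4:
        return "page on-call"
    if rate_6h > 6:
        return "create ticket"
    return None


SLO_TARGET = 0.999
print(f"budget: {error_budget_minutes(SLO_TARGET, 30):.1f} min")
print(f"1h at 14.4x spends {budget_consumed(14.4, 1, 30):.0%} of the budget")
print(f"6h at 6x spends {budget_consumed(6, 6, 30):.0%} of the budget")
# Hypothetical incident: 2% of requests failing in both windows
print(alert_action(burn_rate(0.02, SLO_TARGET), burn_rate(0.02, SLO_TARGET)))
```

The thresholds follow directly from the budget share each alert is meant to catch: 2% of a 720-hour budget in 1 hour is 0.02 × 720 / 1 = 14.4, and 5% in 6 hours is 0.05 × 720 / 6 = 6.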
Availability Numbers
| Availability | Downtime/Year | Downtime/Month | Downtime/Week |
|---|---|---|---|
| 99% (two nines) | 3.65 days | 7.31 hours | 1.68 hours |
| 99.9% (three nines) | 8.77 hours | 43.8 min | 10.1 min |
| 99.95% | 4.38 hours | 21.9 min | 5.04 min |
| 99.99% (four nines) | 52.6 min | 4.38 min | 1.01 min |
| 99.999% (five nines) | 5.26 min | 26.3 sec | 6.05 sec |
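
The figures in this table can be reproduced with a small Python sketch. The period lengths are an assumption inferred from the numbers: an average year of 365.25 days and a month of one twelfth of that, which is why 99.9% gives 43.8 min/month here rather than the 43.2 min of a fixed 30-day window. The function names are hypothetical.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # assumption: average year, including leap days

PERIOD_SECONDS = {
    "year": SECONDS_PER_YEAR,
    "month": SECONDS_PER_YEAR / 12,    # assumption: average month (about 30.44 days)
    "week": 7 * 24 * 3600,
}


def allowed_downtime_seconds(availability_pct: float, period: str) -> float:
    """Downtime permitted in `period` at the given availability percentage."""
    return (1 - availability_pct / 100) * PERIOD_SECONDS[period]


def humanize(seconds: float) -> str:
    """Format a duration in the largest unit that keeps the value at 1 or above."""
    for unit, size in (("days", 86400), ("hours", 3600), ("min", 60)):
        if seconds >= size:
            return f"{seconds / size:.3g} {unit}"
    return f"{seconds:.3g} sec"


for pct in (99, 99.9, 99.95, 99.99, 99.999):
    row = [humanize(allowed_downtime_seconds(pct, p)) for p in ("year", "month", "week")]
    print(f"{pct}%: " + " | ".join(row))
```

Changing the entries in `PERIOD_SECONDS` is enough to recompute the table for a different convention, such as a strict 30-day month.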