Define and implement Service Level Indicators (SLIs) and Service Level Objectives (SLOs) with error budgets and alerting. Use when establishing reliability targets, implementing SRE practices, or measuring service performance.
Framework for defining and implementing Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets.
Implement measurable reliability targets using SLIs, SLOs, and error budgets to balance reliability with innovation velocity.
SLA (Service Level Agreement)
↓ Contract with customers
SLO (Service Level Objective)
↓ Internal reliability target
SLI (Service Level Indicator)
↓ Actual measurement
# Successful requests / Total requests
sum(rate(http_requests_total{status!~"5.."}[28d]))
/
sum(rate(http_requests_total[28d]))
# Requests below latency threshold / Total requests
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
/
sum(rate(http_request_duration_seconds_count[28d]))
# Successful writes / Total writes
sum(storage_writes_successful_total)
/
sum(storage_writes_total)
Reference: See references/slo-definitions.md
| SLO % | Downtime/Month | Downtime/Year |
|---|---|---|
| 99% | 7.2 hours | 3.65 days |
| 99.9% | 43.2 minutes | 8.76 hours |
| 99.95% | 21.6 minutes | 4.38 hours |
| 99.99% | 4.32 minutes | 52.56 minutes |
Consider:
Example SLOs: