Core Workflow
- Assess reliability - Review architecture, SLOs, incidents, toil levels
- Define SLOs - Identify meaningful SLIs and set appropriate targets
- Verify alignment - Confirm SLO targets reflect user expectations before proceeding
- Implement monitoring - Build golden signal dashboards and alerting
- Automate toil - Identify repetitive tasks and build automation
- Test resilience - Design and execute chaos experiments; validate recovery end-to-end and confirm it meets RTO/RPO targets before marking an experiment complete
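
As an illustration of the "Define SLOs" step, the following is a minimal sketch of an error budget calculation. It assumes a request-based availability SLI (good requests divided by total requests). The function name `error_budget`, its parameters, and the numbers in the usage example are hypothetical and chosen only for illustration; they do not come from any particular monitoring tool.

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Compute the error budget for a request-based availability SLO.

    Assumption: the SLI is (successful requests / total requests) over one
    SLO window, and slo_target is a fraction such as 0.999.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests < 0 or not 0 <= failed_requests <= total_requests:
        raise ValueError("request counts must satisfy 0 <= failed <= total")

    # The budget is the number of failures the SLO tolerates in the window.
    allowed_failures = (1.0 - slo_target) * total_requests
    remaining_failures = allowed_failures - failed_requests

    # Fraction of the budget already spent; values above 1.0 mean the SLO is breached.
    consumed_fraction = failed_requests / allowed_failures if allowed_failures > 0 else 0.0

    return {
        "allowed_failures": allowed_failures,
        "remaining_failures": remaining_failures,
        "consumed_fraction": consumed_fraction,
        "slo_breached": remaining_failures < 0,
    }


# Hypothetical window: 1,000,000 requests, 250 failures, 99.9% target.
budget = error_budget(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
print(f"budget consumed: {budget['consumed_fraction']:.1%}")
```

A time-based SLI (minutes of downtime rather than failed requests) would follow the same arithmetic with the window length in place of the request count.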
Reference Guide
Load detailed guidance based on context:
| Topic | Reference | Load When |
|---|---|---|
| SLO/SLI | references/slo-sli-management.md | Defining SLOs, calculating error budgets |
| Error Budgets |