Alert design and on-call practices — severity, runbooks, SLO burn-rate alerting, escalation. Use when designing alerts, writing runbooks, reducing alert fatigue, setting up escalation policies, implementing SLO-based alerting, or improving on-call processes.
Good alerting wakes humans only when humans need to act. Severity tiers, SLO burn-rate alerting, and runbooks turn noisy thresholds into a calm on-call rotation. Alert fatigue is the number one quality problem — fix it by tracking page volume and deleting low-value alerts.
Define clear tiers so responders know how urgently to act:
| Severity | Response Time | Action | Example |
|---|---|---|---|
| P1 / Critical | Immediate (< 5 min) | Page on-call, wake people up | Total service outage, data loss, security breach |
| P2 / High | Within 30 min | Page during business hours | Degraded service, error rate > SLO, partial outage |
| P3 / Medium | Within 4 hours | Ticket, fix today | Elevated latency, disk 80% full, cert expiring in 7 days |
| P4 / Low | Within 1 business day | Ticket, fix this sprint | Deprecated API usage, non-critical background job failures |
Only P1 and P2 should page. P3 and P4 must create tickets and never page. Every alert has exactly one severity; if you cannot decide, it is P3.
The single most important decision in alert design is whether the alert pages a human or files a ticket. Page when users are currently impacted, data is at risk, or the situation will get worse without intervention. Ticket when the issue can wait hours or days, recovery is likely automatic, or it is a warning about future risk. If the responder's only action is "acknowledge and wait", it should be a ticket. If you page more than twice a week for non-incidents, your alerts need tuning.
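The page-versus-ticket decision above can be sketched as a routing function; the `Alert` fields and severity-routing names here are illustrative, not from any specific alerting tool:

```python
from dataclasses import dataclass

@dataclass
class Alert:
    users_impacted: bool   # users are affected right now
    data_at_risk: bool     # data loss or corruption is possible
    worsening: bool        # will degrade without intervention
    action_is_wait: bool   # responder can only acknowledge and wait

def route(alert: Alert) -> str:
    """Return 'page' only when a human must act now; otherwise 'ticket'."""
    if alert.action_is_wait:
        return "ticket"
    if alert.users_impacted or alert.data_at_risk or alert.worsening:
        return "page"
    return "ticket"
```

Encoding the decision this way makes the paging criteria reviewable: any new alert must justify which of the three paging conditions it meets.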
Threshold alerts like "error rate > 1%" are noisy. SLO alerting fires only when you are consuming your error budget too fast.
`burn_rate = actual_error_rate / (1 - SLO_target)`

This alerts early on severe burns and late on slow burns, matching human response capabilities.
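A minimal sketch of the burn-rate calculation with a two-window check. The 14.4x threshold and the 1h/5m window pairing follow the common multi-window multi-burn-rate recipe from the Google SRE Workbook; treat them as starting points to tune, not fixed values:

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """How fast the error budget is being consumed; 1.0 = exactly on budget."""
    return error_rate / (1.0 - slo_target)

def should_page(long_window_rate: float, short_window_rate: float,
                slo_target: float, threshold: float = 14.4) -> bool:
    """Page only when both the long (e.g. 1h) and short (e.g. 5m) windows
    burn fast, so a spike that has already recovered pages no one."""
    return (burn_rate(long_window_rate, slo_target) >= threshold
            and burn_rate(short_window_rate, slo_target) >= threshold)

# For a 99.9% SLO the budget is 0.1%, so a 1.44% error rate burns at 14.4x.
```

Requiring both windows to exceed the threshold is what suppresses transient spikes: the short window confirms the burn is still happening now.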
Every alert that pages must have a runbook with: what the alert means in one sentence, likely causes ranked by frequency, diagnosis steps with dashboard and command links, mitigation (quick fix plus escalation path), and the team and channel that owns it.
Track pages per week per person; target less than two pages per on-call shift. Review every alert weekly — if it fired and no action was taken, delete or downgrade it. Merge per-instance alerts into per-service grouped alerts. Set thresholds from percentile baselines, not guesses. Auto-resolve transient spikes within 5 minutes. Silence known alerts proactively during deploys and planned maintenance.
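The weekly review loop above can be sketched as a small analysis over firing records; the record shape (`alert`, `actioned` fields) is an assumption about what your alerting history export contains:

```python
from collections import Counter

def review_alerts(firings: list[dict]) -> tuple[dict, list[str]]:
    """firings: one dict per alert firing, e.g. {"alert": "api_5xx", "actioned": True}.
    Returns per-alert firing counts and the alerts to delete or downgrade:
    those that fired during the review window but never led to action."""
    counts = Counter(f["alert"] for f in firings)
    actioned = {f["alert"] for f in firings if f["actioned"]}
    to_prune = [name for name in counts if name not in actioned]
    return dict(counts), to_prune
```

Running this weekly gives the review meeting a concrete agenda: every alert on the prune list needs a defense or a deletion.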
Runbook template for every paging alert:
# Alert: <alert_name>
## What it means
One sentence explaining the condition and its user impact.
## Likely causes
- Cause 1 (most common)
- Cause 2
- Cause 3
## Diagnosis steps
1. Check <dashboard_link> for the current state
2. Run <command> to verify the symptom
3. Check recent deploys: <deploy_log_link>
## Mitigation
- Quick fix: <rollback command or toggle>
- If that doesn't work: <escalation path>
## Escalation
- Team: <team_name> (#slack-channel)
- Secondary: <person or team>
Hand off at a consistent time (e.g. Monday 10:00 local). The outgoing on-call writes a summary covering active incidents, flaky alerts, known issues, and anything to watch. The incoming on-call reviews and acknowledges. Shadow rotations pair new team members with experienced on-call for 1-2 cycles before they take primary.
During a P1/P2 incident: open an incident channel immediately, post an initial summary (what is broken, who is affected, current severity), update every 15-30 minutes even if the update is "still investigating", post to a public status page if external users are affected, and after resolution publish a timeline, root cause, and follow-up actions within 48 hours.