Alert design and on-call practices — severity, runbooks, SLO burn-rate alerting, escalation. Use when designing alerts, writing runbooks, reducing alert fatigue, setting up escalation policies, implementing SLO-based alerting, or improving on-call processes.
Good alerting wakes humans only when humans need to act. Severity tiers, SLO burn-rate alerting, and runbooks turn noisy thresholds into a calm on-call rotation. Alert fatigue is the number one quality problem — fix it by tracking page volume and deleting low-value alerts.
Define clear tiers so responders know how urgently to act:
| Severity | Response Time | Action | Example |
|---|---|---|---|
| P1 / Critical | Immediate (< 5 min) | Page on-call, wake people up | Total service outage, data loss, security breach |
| P2 / High | Within 30 min | Page during business hours | Degraded service, error rate > SLO, partial outage |
| P3 / Medium | Within 4 hours | Ticket, fix today | Elevated latency, disk 80% full, cert expiring in 7 days |
| P4 / Low | Within 1 business day | Ticket, fix this sprint | Deprecated API usage, non-critical background job failures |
Only P1 and P2 should page. P3 and P4 must create tickets and never page. Every alert has exactly one severity; if you cannot decide, it is P3.
The single most important decision in alert design is whether the alert pages a human or files a ticket. Page when users are currently impacted, data is at risk, or the situation will get worse without intervention. Ticket when the issue can wait hours or days, recovery is likely automatic, or it is a warning about future risk. If the responder's only action is "acknowledge and wait", it should be a ticket. If you page more than twice a week for non-incidents, your alerts need tuning.
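The page-versus-ticket decision above can be sketched as a routing function; the `Alert` fields and severity-routing names here are illustrative, not from any specific alerting tool:

```python
from dataclasses import dataclass

@dataclass
class Alert:
    users_impacted: bool   # users are affected right now
    data_at_risk: bool     # data loss or corruption is possible
    worsening: bool        # will degrade without intervention
    action_is_wait: bool   # responder can only acknowledge and wait

def route(alert: Alert) -> str:
    """Return 'page' only when a human must act now; otherwise 'ticket'."""
    if alert.action_is_wait:
        return "ticket"
    if alert.users_impacted or alert.data_at_risk or alert.worsening:
        return "page"
    return "ticket"
```

Encoding the decision this way makes the paging criteria reviewable: any new alert must justify which of the three paging conditions it meets.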
Threshold alerts like "error rate > 1%" are noisy. SLO alerting fires only when you are consuming your error budget too fast.
`burn_rate = actual_error_rate / (1 - SLO_target)`

This alerts early on severe burns and late on slow burns, matching human response capabilities.
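A minimal sketch of the burn-rate calculation with a two-window check. The 14.4x threshold and the 1h/5m window pairing follow the common multi-window multi-burn-rate recipe from the Google SRE Workbook; treat them as starting points to tune, not fixed values:

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """How fast the error budget is being consumed; 1.0 = exactly on budget."""
    return error_rate / (1.0 - slo_target)

def should_page(long_window_rate: float, short_window_rate: float,
                slo_target: float, threshold: float = 14.4) -> bool:
    """Page only when both the long (e.g. 1h) and short (e.g. 5m) windows
    burn fast, so a spike that has already recovered pages no one."""
    return (burn_rate(long_window_rate, slo_target) >= threshold
            and burn_rate(short_window_rate, slo_target) >= threshold)

# For a 99.9% SLO the budget is 0.1%, so a 1.44% error rate burns at 14.4x.
```

Requiring both windows to exceed the threshold is what suppresses transient spikes: the short window confirms the burn is still happening now.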
Every alert that pages must have a runbook with: what the alert means in one sentence, likely causes ranked by frequency, diagnosis steps with dashboard and command links, mitigation (quick fix plus escalation path), and the team and channel that owns it.
Track pages per week per person; target less than two pages per on-call shift. Review every alert weekly — if it fired and no action was taken, delete or downgrade it. Merge per-instance alerts into per-service grouped alerts. Set thresholds from percentile baselines, not guesses. Auto-resolve transient spikes within 5 minutes. Silence known alerts proactively during deploys and planned maintenance.
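The weekly review loop above can be sketched as a small analysis over firing records; the record shape (`alert`, `actioned` fields) is an assumption about what your alerting history export contains:

```python
from collections import Counter

def review_alerts(firings: list[dict]) -> tuple[dict, list[str]]:
    """firings: one dict per alert firing, e.g. {"alert": "api_5xx", "actioned": True}.
    Returns per-alert firing counts and the alerts to delete or downgrade:
    those that fired during the review window but never led to action."""
    counts = Counter(f["alert"] for f in firings)
    actioned = {f["alert"] for f in firings if f["actioned"]}
    to_prune = [name for name in counts if name not in actioned]
    return dict(counts), to_prune
```

Running this weekly gives the review meeting a concrete agenda: every alert on the prune list needs a defense or a deletion.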
Runbook template for every paging alert:
# Alert: <alert_name>
## What it means
One sentence explaining the condition and its user impact.
## Likely causes
- Cause 1 (most common)
- Cause 2
- Cause 3
## Diagnosis steps
1. Check <dashboard_link> for the current state
2. Run <command> to verify the symptom
3. Check recent deploys: <deploy_log_link>
## Mitigation
- Quick fix: <rollback command or toggle>
- If that doesn't work: <escalation path>
## Escalation
- Team: <team_name> (#slack-channel)
- Secondary: <person or team>
Hand off at a consistent time (e.g. Monday 10:00 local). The outgoing on-call writes a summary covering active incidents, flaky alerts, known issues, and anything to watch. The incoming on-call reviews and acknowledges. Shadow rotations pair new team members with experienced on-call for 1-2 cycles before they take primary.
During a P1/P2 incident: open an incident channel immediately, post an initial summary (what is broken, who is affected, current severity), update every 15-30 minutes even if the update is "still investigating", post to a public status page if external users are affected, and after resolution publish a timeline, root cause, and follow-up actions within 48 hours.