Production incident handling — triage, communicate, mitigate, fix, postmortem. Use when production is broken, users are affected, an outage is in progress, or a deploy has caused regressions and the priority is restoring service.
When production breaks, restore service first and understand it later. Triage user impact, communicate early, mitigate with the fastest safe action (often a rollback), then fix the root cause and write a blameless postmortem within 48 hours.
Answer three questions before doing anything else: what is the user impact (total outage, degraded, cosmetic), what changed recently (deploy, config flip, dependency update), and is there a quick rollback option. The answers decide whether you mitigate by reverting or by patching forward.
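The revert-vs-patch decision above can be sketched as a tiny helper. This is a minimal sketch; the `Triage` fields and `mitigation_plan` name are illustrative assumptions, not a real incident tool.

```python
# Minimal triage sketch: decide between rollback and forward fix.
# All names here are illustrative, not from any real tooling.
from dataclasses import dataclass

@dataclass
class Triage:
    impact: str          # "total", "degraded", or "cosmetic"
    recent_change: bool  # deploy / config flip / dependency bump just landed?
    rollback_safe: bool  # is there a quick, safe rollback path?

def mitigation_plan(t: Triage) -> str:
    if t.recent_change and t.rollback_safe:
        return "rollback"       # fastest safe action: revert the change
    if t.impact in ("total", "degraded"):
        return "workaround"     # flag off, shed traffic, bypass cache
    return "normal-triage"      # cosmetic: no emergency action needed

print(mitigation_plan(Triage("total", True, True)))  # rollback
```

The key property: a recent change with a safe rollback short-circuits everything else, because reverting is almost always faster than diagnosing.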
Notify stakeholders immediately with what is broken, who is affected, and an ETA to next update — not an ETA to fix. Post to the status page or incident channel. Re-post on a regular cadence even if there is no progress; silence is worse than bad news.
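A recurring update with an ETA to the next update (not to the fix) can be templated like this. The message format and the 15-minute cadence are assumptions for illustration; the posting mechanism (status page, chat webhook) is out of scope here.

```python
# Hedged sketch of a stakeholder update message; cadence and wording
# are assumptions -- adapt to your status page or incident channel.
from datetime import datetime, timedelta, timezone

def status_update(broken: str, affected: str, cadence_min: int = 15) -> str:
    now = datetime.now(timezone.utc)
    nxt = now + timedelta(minutes=cadence_min)
    return (
        f"[{now:%H:%M} UTC] INCIDENT UPDATE\n"
        f"What is broken: {broken}\n"
        f"Who is affected: {affected}\n"
        f"Next update by: {nxt:%H:%M} UTC"  # ETA to next update, not to fix
    )

print(status_update("checkout API returning 500s", "all EU users"))
```

Re-send on the stated cadence even when the body is "no change since last update".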
Mitigation restores service; the fix comes later. Roll back if it is safe and quick. Apply a workaround (feature flag off, traffic shed, cache bypass) to stop the bleeding. Do not debug in production if you can reproduce the issue elsewhere.
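The feature-flag workaround mentioned above amounts to a kill switch. A minimal sketch, assuming a plain in-process dict as the flag store; real systems would read flags from a config service so the switch takes effect without a deploy.

```python
# Kill-switch sketch: turn the broken path off, fall back to the old one.
# The dict-backed flag store and function names are assumptions.
FLAGS = {"new_checkout_flow": True}

def kill_switch(flag: str) -> None:
    """Turn a feature off to stop the bleeding; the fix comes later."""
    FLAGS[flag] = False

def checkout(order_id: str) -> str:
    if FLAGS.get("new_checkout_flow"):
        raise RuntimeError("new flow is the broken path")
    return f"processed {order_id} via legacy flow"

kill_switch("new_checkout_flow")
print(checkout("ord-123"))  # processed ord-123 via legacy flow
```

Note the mitigation changes state, not code: that is what makes it faster and safer than an emergency patch.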
Once service is stable, identify the root cause, apply the fix in a normal change-management path, and deploy with extra monitoring. Resist the urge to ship the fix straight to prod under incident adrenaline — that is how you create the next incident.
Write a blameless postmortem covering the timeline, root cause analysis, what went well, what did not, and concrete corrective actions. Focus on systems and processes, never individuals. See the postmortem-writing skill for the full template.
User reports the API is returning 500s after the latest deploy. The agent checks the deploy diff, identifies the breaking change, recommends an immediate rollback while preparing a forward fix, drafts a stakeholder update, and schedules the postmortem.
Classify severity early; the level determines who responds and how fast:
| Severity | Definition | Response |
|---|---|---|
| SEV-1 | Total outage or data loss | All-hands, immediate |
| SEV-2 | Major feature down or large user subset affected | On-call + manager |
| SEV-3 | Degraded performance, workaround exists | On-call only |
| SEV-4 | Cosmetic or low-impact bug | Normal triage |
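The severity table above can be expressed as a small classifier. A hedged sketch: the levels and responses mirror the table, while the boolean inputs and function names are illustrative assumptions.

```python
# Severity classification mirroring the table; paging mechanics are
# out of scope -- this only maps conditions to a level and response.
SEVERITY_RESPONSE = {
    "SEV-1": "all-hands, immediate",
    "SEV-2": "on-call + manager",
    "SEV-3": "on-call only",
    "SEV-4": "normal triage",
}

def classify(total_outage: bool, major_feature_down: bool,
             workaround_exists: bool) -> str:
    if total_outage:
        return "SEV-1"
    if major_feature_down:
        return "SEV-2"
    if workaround_exists:
        return "SEV-3"
    return "SEV-4"

print(SEVERITY_RESPONSE[classify(False, True, False)])  # on-call + manager
```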
For small teams, one person may wear multiple hats, but the incident-commander role (coordinating and communicating) should never be combined with hands-on debugging.