Triage production incidents with severity assessment, root cause hypothesis, impact analysis, and escalation recommendations. Use when analyzing outages, errors, or production issues.
Follow this systematic approach for every incident:
## Incident Summary
- **Reported At:** [timestamp]
- **Reported By:** [source/person]
- **Initial Symptom:** [what was observed]
- **Affected System:** [service/component name]
Classify the incident using this severity matrix:
| Severity | Impact | Examples | Response Time |
|---|---|---|---|
| SEV-1 (Critical) | Complete outage, data loss, security breach | Service down for all users, data corruption, unauthorized access | Immediate (< 15 min) |
| SEV-2 (High) | Major feature broken, significant user impact | Payment processing failing, login broken for a subset of users | < 30 minutes |
| SEV-3 (Medium) | Degraded performance, workaround exists | Slow response times, intermittent errors | < 2 hours |
| SEV-4 (Low) | Minor issue, minimal user impact | UI glitch, non-critical feature affected | < 24 hours |
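The severity matrix can be encoded as a small lookup so that classification is applied consistently. The following is a minimal sketch, not a definitive rule set: the `Signals` fields and the first-match ordering in `classify` are assumptions made for illustration, while the response-time targets are copied from the matrix above.

```python
from dataclasses import dataclass
from datetime import timedelta
from enum import IntEnum


class Severity(IntEnum):
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4


# Response-time targets taken from the severity matrix.
RESPONSE_TARGET = {
    Severity.SEV1: timedelta(minutes=15),
    Severity.SEV2: timedelta(minutes=30),
    Severity.SEV3: timedelta(hours=2),
    Severity.SEV4: timedelta(hours=24),
}


@dataclass
class Signals:
    """Hypothetical triage inputs; the field names are illustrative only."""
    complete_outage: bool = False
    data_loss: bool = False
    security_breach: bool = False
    major_feature_broken: bool = False
    degraded_performance: bool = False


def classify(s: Signals) -> Severity:
    """Return the most severe level whose criteria match (first match wins)."""
    if s.complete_outage or s.data_loss or s.security_breach:
        return Severity.SEV1
    if s.major_feature_broken:
        return Severity.SEV2
    if s.degraded_performance:
        return Severity.SEV3
    return Severity.SEV4


sev = classify(Signals(major_feature_broken=True))
print(sev.name, RESPONSE_TARGET[sev])
```

When several signals are set at once, the most severe matching level is returned, which mirrors the usual practice of triaging to the worst observed impact.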
Document the blast radius:
## Impact Analysis
### Affected Components
- [ ] Frontend/UI
- [ ] API/Backend
- [ ] Database
- [ ] Third-party integrations
- [ ] Infrastructure
### User Impact
- **Users Affected:** [number/percentage]
- **Regions Affected:** [list regions]
- **Customer Tiers:** [enterprise/pro/free]
### Business Impact
- **Revenue at Risk:** $[amount]/hour
- **SLA Status:** [within/breaching]
- **Reputational Risk:** [low/medium/high]
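The user and revenue fields above reduce to simple arithmetic. As a rough sketch, assuming the total user count and an hourly revenue figure are known (both function names and the sample numbers are hypothetical):

```python
def users_affected_pct(affected: int, total: int) -> float:
    """Percentage of users affected, for the 'Users Affected' field."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * affected / total


def revenue_at_risk(hourly_rate: float, hours_elapsed: float) -> float:
    """Cumulative revenue at risk, given a $/hour rate and elapsed hours."""
    return hourly_rate * hours_elapsed


# Illustrative numbers only, not taken from a real incident.
print(users_affected_pct(250, 1000))   # 25.0
print(revenue_at_risk(1200.0, 1.5))    # 1800.0
```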
Generate hypotheses based on available data:
## Root Cause Hypotheses
### Hypothesis 1: [Most Likely]
- **Theory:** [description]
- **Supporting Evidence:** [what points to this]
- **Contradicting Evidence:** [what doesn't fit]
- **Confidence:** [high/medium/low]
- **Investigation Steps:**
1. [step 1]
2. [step 2]
### Hypothesis 2: [Alternative]
- **Theory:** [description]
- **Supporting Evidence:** [evidence]
- **Confidence:** [level]
Use these common root cause categories to narrow down hypotheses and choose first steps:
| Category | Indicators | First Steps |
|---|---|---|
| Deployment | Recent deploy, gradual rollout issues | Check deploy logs, rollback |
| Capacity | High CPU/memory, request timeouts | Scale up, check autoscaling |
| Dependency | External service errors, timeout patterns | Check status pages, circuit breakers |
| Data | Query errors, migration issues | Check DB logs, recent migrations |
| Configuration | Feature flag changes, config updates | Review recent config changes |
| Network | DNS issues, connectivity problems | Check network status, DNS resolution |
| Security | Auth failures, rate limiting | Review auth logs, check for attacks |
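The category table can serve as a first-pass filter over free-text observations. The sketch below ranks categories by naive keyword hits. The keyword lists in `CATEGORY_HINTS` are assumptions loosely derived from the Indicators column, and substring matching is deliberately crude (for example, `auth` would also match `author`), so treat the result as a starting hypothesis rather than a diagnosis.

```python
from __future__ import annotations

# Keyword hints per category; hypothetical, loosely based on the Indicators column.
CATEGORY_HINTS = {
    "Deployment": ("deploy", "rollout"),
    "Capacity": ("cpu", "memory", "timeout"),
    "Dependency": ("external", "status page", "circuit breaker", "timeout"),
    "Data": ("query", "migration"),
    "Configuration": ("feature flag", "config"),
    "Network": ("dns", "connectivity"),
    "Security": ("auth", "rate limit"),
}

# First steps copied from the table above.
FIRST_STEPS = {
    "Deployment": "Check deploy logs, rollback",
    "Capacity": "Scale up, check autoscaling",
    "Dependency": "Check status pages, circuit breakers",
    "Data": "Check DB logs, recent migrations",
    "Configuration": "Review recent config changes",
    "Network": "Check network status, DNS resolution",
    "Security": "Review auth logs, check for attacks",
}


def rank_categories(observations: list[str]) -> list[tuple[str, int]]:
    """Return (category, hit_count) pairs with at least one hit, best first."""
    text = " ".join(observations).lower()
    scores = {
        category: sum(text.count(keyword) for keyword in keywords)
        for category, keywords in CATEGORY_HINTS.items()
    }
    hits = [(category, n) for category, n in scores.items() if n > 0]
    # sorted() is stable, so ties keep the table's row order.
    return sorted(hits, key=lambda pair: -pair[1])


ranked = rank_categories([
    "errors started after the latest deploy",
    "gradual rollout at 25%",
])
for category, hits in ranked:
    print(category, hits, "->", FIRST_STEPS[category])
```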
## Escalation Path
### Immediate Notifications
- [ ] On-call engineer: [name/team]
- [ ] Team lead: [name]
- [ ] Stakeholders: [list]
### Escalation Triggers
- If not resolved in [X] minutes → Escalate to [team/person]
- If a data integrity issue is confirmed → Escalate to [security/data team]
- If a customer-facing SLA is breached → Notify [customer success]
### Communication Plan
- **Internal:** [Slack channel, frequency]
- **External:** [Status page update needed? Customer comms?]
Default escalation matrix by severity:
| Severity | Primary | Backup | Management | Customer Comms |
|---|---|---|---|---|
| SEV-1 | On-call + Team Lead | Director | VP within 30 min | Immediate |
| SEV-2 | On-call | Team Lead | Director if > 1 hr | If > 30 min |
| SEV-3 | On-call | N/A | Daily standup | If requested |
| SEV-4 | Ticket owner | On-call | N/A | N/A |
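The time-based rules in the escalation matrix can be expressed as a lookup plus a couple of elapsed-time checks. This is a minimal sketch under assumptions: the function names `notify_list` and `customer_comms_due` are hypothetical, SEV-1 is read as "notify the VP right away so the 30-minute window is met", and the SEV-3 "Daily standup" entry is treated as not requiring an active notification.

```python
from __future__ import annotations

from datetime import timedelta

# Primary and backup contacts copied from the escalation matrix.
ESCALATION = {
    "SEV-1": {"primary": "On-call + Team Lead", "backup": "Director"},
    "SEV-2": {"primary": "On-call", "backup": "Team Lead"},
    "SEV-3": {"primary": "On-call", "backup": None},
    "SEV-4": {"primary": "Ticket owner", "backup": "On-call"},
}


def notify_list(severity: str, elapsed: timedelta) -> list[str]:
    """Roles to notify for a severity, given time since the incident started."""
    recipients = [ESCALATION[severity]["primary"]]
    if severity == "SEV-1":
        recipients.append("VP")        # matrix: VP within 30 min
    elif severity == "SEV-2" and elapsed > timedelta(hours=1):
        recipients.append("Director")  # matrix: Director if > 1 hr
    return recipients


def customer_comms_due(severity: str, elapsed: timedelta,
                       requested: bool = False) -> bool:
    """Whether customer communication is due according to the matrix."""
    if severity == "SEV-1":
        return True                              # Immediate
    if severity == "SEV-2":
        return elapsed > timedelta(minutes=30)   # If > 30 min
    if severity == "SEV-3":
        return requested                         # If requested
    return False                                 # SEV-4: N/A


print(notify_list("SEV-2", timedelta(minutes=90)))
print(customer_comms_due("SEV-2", timedelta(minutes=10)))
```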
Use this template for documenting the triage:
# Incident Triage Report
## Summary
- **Incident ID:** [INC-XXXX]
- **Severity:** [SEV-1/2/3/4]
- **Status:** [investigating/identified/monitoring/resolved]
- **Started:** [timestamp]
- **Duration:** [ongoing/X hours]
## What Happened
[Brief description of the incident]
## Impact
- **Users Affected:** [count/percentage]
- **Services Affected:** [list]
- **Business Impact:** [description]
## Root Cause
- **Confirmed/Suspected:** [status]
- **Category:** [deployment/capacity/dependency/etc.]
- **Description:** [detailed explanation]
## Timeline
| Time | Event |
|------|-------|
| HH:MM | Incident detected |
| HH:MM | Triage started |
| HH:MM | Root cause identified |
| HH:MM | Mitigation applied |
## Actions Taken
1. [Action 1]
2. [Action 2]
## Escalations
- [Who was notified and when]
## Next Steps
- [ ] [Immediate action]
- [ ] [Follow-up action]
- [ ] [Post-incident review scheduled]
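The Timeline table in the report can also feed the Duration field and simple metrics such as time to mitigation. A small sketch, assuming `HH:MM` timestamps, unique event names, and an incident that crosses midnight at most once (the helper name and the sample times are hypothetical):

```python
from __future__ import annotations

from datetime import datetime, timedelta


def minutes_between(timeline: list[tuple[str, str]],
                    start_event: str, end_event: str) -> int:
    """Minutes elapsed between two named events in an HH:MM timeline."""
    times = {event: datetime.strptime(hhmm, "%H:%M") for hhmm, event in timeline}
    delta = times[end_event] - times[start_event]
    if delta < timedelta(0):
        # Assumption: the end event happened on the following day.
        delta += timedelta(days=1)
    return int(delta.total_seconds() // 60)


# Illustrative timeline using the event names from the template.
timeline = [
    ("14:02", "Incident detected"),
    ("14:10", "Triage started"),
    ("14:45", "Root cause identified"),
    ("15:05", "Mitigation applied"),
]
print(minutes_between(timeline, "Incident detected", "Mitigation applied"))  # 63
```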
For rapid assessment, work through this checklist: