Name: Postmortem Template
Author: testified-oss

| Instead of... | Use... |
|---------------|--------|
| "John caused the outage" | "The deployment process allowed an untested change" |
| "Sarah made a mistake" | "The system lacked safeguards against this action" |
| "The engineer should have known" | "The documentation/training didn't cover this scenario" |
| "Human error" | "The interface design permitted incorrect input" |
| "Carelessness" | "The process lacked verification steps" |

# Postmortem: [Incident Title]

**Incident ID:** [INC-XXXX]
**Date:** [YYYY-MM-DD]
**Author:** [Name]
**Status:** Draft | In Review | Final
**Severity:** SEV-1 | SEV-2 | SEV-3

## Executive Summary

[2-3 sentence summary: what happened, impact, how it was resolved]

**Duration:** [Start time] - [End time] ([X hours Y minutes])
**Time to Detect:** [X minutes]
**Time to Resolve:** [X hours]
**Customer Impact:** [Brief description of user-facing impact]

---

## Timeline

All times in [timezone].

| Time | Event |
|------|-------|
| HH:MM | [Triggering event or change] |
| HH:MM | [First symptoms observed] |
| HH:MM | [Alert fired / Incident reported] |
| HH:MM | [Triage began] |
| HH:MM | [Root cause identified] |
| HH:MM | [Mitigation applied] |
| HH:MM | [Service restored] |
| HH:MM | [Incident declared resolved] |

---

## Impact Assessment

### User Impact
- **Users Affected:** [Number or percentage]
- **Geographic Regions:** [Affected regions]
- **Duration of Impact:** [Time users experienced issues]

### Business Impact
- **Revenue Impact:** $[amount] (estimated)
- **SLA Status:** [Met / Breached - details]
- **Support Tickets:** [Number created]
- **Customer Escalations:** [Number and severity]

### Technical Impact
- **Services Affected:** [List of services]
- **Data Impact:** [None / Delayed / Lost - details]
- **Dependent Systems:** [Downstream effects]

---

## Contributing Factors

### Primary Factors

1. **[Factor 1 Title]**
   - **Description:** [What happened]
   - **Why it mattered:** [How it contributed to the incident]
   - **Evidence:** [Logs, metrics, observations]

2. **[Factor 2 Title]**
   - **Description:** [What happened]
   - **Why it mattered:** [How it contributed]
   - **Evidence:** [Supporting data]

### Secondary Factors

- [Factor that amplified impact or delayed resolution]
- [Process gap that allowed the primary cause]
- [Missing safeguard or detection mechanism]

### Environmental Factors

- [Timing considerations - peak traffic, maintenance window]
- [Team capacity - on-call coverage, expertise availability]
- [External dependencies - third-party status]

---

## Root Cause Analysis

### The Five Whys

1. **Why did [symptom] occur?**
   → Because [cause 1]

2. **Why did [cause 1] happen?**
   → Because [cause 2]

3. **Why did [cause 2] happen?**
   → Because [cause 3]

4. **Why did [cause 3] happen?**
   → Because [cause 4]

5. **Why did [cause 4] happen?**
   → Because [root cause]

### Root Cause Summary

[1-2 paragraph summary of the root cause(s), focusing on systemic issues rather than individual actions]

---

## What Went Well

- [Effective alert that caught the issue quickly]
- [Team coordination during incident]
- [Runbook that helped with resolution]
- [Communication that kept stakeholders informed]

## What Went Poorly

- [Detection delay or missed alerts]
- [Missing runbook or documentation]
- [Communication gaps]
- [Tool or process limitations]

## Where We Got Lucky

- [Factors that could have made this worse]
- [Near-misses during resolution]
- [Timing that reduced impact]

---

## Action Items

### Immediate (< 1 week)

| ID | Action | Owner | Due Date | Status |
|----|--------|-------|----------|--------|
| A1 | [Specific action to prevent recurrence] | [Name] | [Date] | Open |
| A2 | [Immediate mitigation measure] | [Name] | [Date] | Open |

### Short-term (< 1 month)

| ID | Action | Owner | Due Date | Status |
|----|--------|-------|----------|--------|
| A3 | [Process improvement] | [Name] | [Date] | Open |
| A4 | [Monitoring enhancement] | [Name] | [Date] | Open |

### Long-term (< 1 quarter)

| ID | Action | Owner | Due Date | Status |
|----|--------|-------|----------|--------|
| A5 | [Architectural change] | [Name] | [Date] | Open |
| A6 | [Systemic improvement] | [Name] | [Date] | Open |

---

## Prevention Measures

### Detection Improvements
- [ ] [New alert or monitoring to catch this earlier]
- [ ] [Dashboard or visibility enhancement]

### Prevention Controls
- [ ] [Automated safeguard to prevent recurrence]
- [ ] [Process gate or approval requirement]
- [ ] [Validation or testing improvement]

### Response Improvements
- [ ] [Runbook creation or update]
- [ ] [Training or knowledge sharing]
- [ ] [Tool or automation improvement]

---

## Appendix

### Supporting Data
- [Link to logs]
- [Link to metrics/dashboards]
- [Link to related incidents]

### Participants
- **Incident Commander:** [Name]
- **Investigators:** [Names]
- **Postmortem Author:** [Name]
- **Reviewers:** [Names]

### Document History
| Date | Author | Changes |
|------|--------|---------|
| [Date] | [Name] | Initial draft |
| [Date] | [Name] | Added action items |

| Source | What to Look For |
|--------|------------------|
| Monitoring/APM | Error rates, latency changes, anomalies |
| Log aggregation | Error messages, exceptions, warnings |
| Deployment tools | Recent deploys, config changes |
| Incident chat | First reports, updates, decisions |
| Alert history | What fired, when, acknowledgments |
| Change management | Approved changes in timeframe |
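
Once events have been pulled from the sources above, they can be merged into a single chronological list before the timeline is written up. The following is a minimal sketch, assuming each source has already been exported as a list of (ISO 8601 timestamp, description) pairs. The `Event` type, the `merge_sources` helper, and the sample records are hypothetical illustrations, not part of any particular tool.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Event:
    """One timeline entry (hypothetical structure for illustration)."""
    time: datetime
    source: str
    description: str


def merge_sources(sources: dict[str, list[tuple[str, str]]]) -> list[Event]:
    """Merge per-source (ISO timestamp, description) pairs into one sorted list.

    Timestamps are normalized to UTC so that sources reporting in
    different timezones still sort correctly.
    """
    events = []
    for source, records in sources.items():
        for stamp, description in records:
            when = datetime.fromisoformat(stamp)
            if when.tzinfo is None:
                # Assumption: timestamps without an offset are already in UTC.
                when = when.replace(tzinfo=timezone.utc)
            events.append(Event(when.astimezone(timezone.utc), source, description))
    return sorted(events, key=lambda e: e.time)


# Made-up sample data, one record per source.
sample = {
    "Deployment tools": [("2024-01-15T14:00:00+00:00", "Deploy started")],
    "Alert history": [("2024-01-15T15:10:00+01:00", "Error-rate alert fired")],
    "Incident chat": [("2024-01-15T14:05:00", "First user report")],
}

timeline = merge_sources(sample)
for event in timeline:
    # Print rows in the same shape as the Timeline table in the template.
    print(f"| {event.time:%H:%M} | [{event.source}] {event.description} |")
```

Normalizing to one timezone before sorting matters here: the alert in the sample carries a +01:00 offset and would otherwise appear an hour late in the merged timeline.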

## Detailed Timeline

### Pre-Incident
- **[T-24h]** [Relevant changes or conditions]
- **[T-1h]** [Immediate precursors]

### Incident Active
- **[T+0]** [Triggering event]
- **[T+5m]** [First observable symptom]
- **[T+10m]** [Alert fired]
- **[T+15m]** [On-call acknowledged]
- **[T+30m]** [Initial hypothesis formed]
- **[T+45m]** [Root cause identified]
- **[T+60m]** [Mitigation applied]

### Post-Incident
- **[T+2h]** [Verification complete]
- **[T+24h]** [Postmortem scheduled]
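
The relative labels in the timeline above can be derived from absolute timestamps once the triggering event is chosen as the anchor. The following is a minimal sketch: `format_offset` is a hypothetical helper and the sample times are made up. It also computes time to detect and time to resolve, assuming (the template does not define them) that both are measured from the triggering event, to the alert and to the mitigation respectively.

```python
from datetime import datetime, timedelta


def format_offset(delta: timedelta) -> str:
    """Render a timedelta as a label such as T-1h, T+0, T+5m, or T+1h30m."""
    total_minutes = int(delta.total_seconds() / 60)  # truncates toward zero
    if total_minutes == 0:
        return "T+0"
    sign = "-" if total_minutes < 0 else "+"
    hours, minutes = divmod(abs(total_minutes), 60)
    if hours and minutes:
        return f"T{sign}{hours}h{minutes}m"
    if hours:
        return f"T{sign}{hours}h"
    return f"T{sign}{minutes}m"


# Hypothetical anchor events; the times are made up.
trigger = datetime(2024, 1, 15, 14, 0)
events = [
    (datetime(2024, 1, 15, 13, 0), "Config change merged"),
    (trigger, "Triggering event"),
    (datetime(2024, 1, 15, 14, 5), "First observable symptom"),
    (datetime(2024, 1, 15, 14, 10), "Alert fired"),
    (datetime(2024, 1, 15, 15, 0), "Mitigation applied"),
    (datetime(2024, 1, 15, 16, 0), "Verification complete"),
]

labels = [format_offset(when - trigger) for when, _ in events]
for label, (_, description) in zip(labels, events):
    print(f"- **[{label}]** {description}")

# Assumed definitions: both intervals start at the triggering event.
time_to_detect = events[3][0] - trigger
time_to_resolve = events[4][0] - trigger
print(f"Time to Detect: {time_to_detect}, Time to Resolve: {time_to_resolve}")
```

If your organization measures detection from the first symptom rather than from the trigger, only the start point of the subtraction changes.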

## Contributing Factors Analysis

### Technical Factors

| Category | Questions | Example Findings |
|----------|-----------|------------------|
| **Code** | What code contributed? | Missing null check, race condition |
| **Config** | What configuration was involved? | Incorrect timeout, wrong endpoint |
| **Infrastructure** | What infra limitations existed? | Insufficient capacity, single point of failure |
| **Dependencies** | What external factors were involved? | Third-party API latency, network issues |

### Process Factors

| Category | Questions | Example Findings |
|----------|-----------|------------------|
| **Testing** | What wasn't tested? | No load testing, missing edge cases |
| **Review** | What review gaps existed? | No security review, rushed approval |
| **Deployment** | What deployment issues occurred? | No canary, missing rollback plan |
| **Monitoring** | What wasn't monitored? | Missing alert, no dashboard |

### Organizational Factors

| Category | Questions | Example Findings |
|----------|-----------|------------------|
| **Knowledge** | What wasn't known? | Undocumented dependency, tribal knowledge |
| **Communication** | What wasn't communicated? | No handoff, missed status page update |
| **Resources** | What constraints existed? | Short-staffed, competing priorities |
| **Incentives** | What pressures contributed? | Deadline pressure, velocity metrics |

| Bad | Good |
|-----|------|
| "Improve monitoring" | "Add alert for error rate > 1% on /api/auth" |
| "Be more careful" | "Add pre-deploy checklist to CI pipeline" |
| "Review code better" | "Require 2 approvers for changes to payment module" |
| "Fix the bug" | "Add null check in UserService.getProfile()" |
| "Document the process" | "Create runbook for database failover by [date]" |

| Category | Purpose | Examples |
|----------|---------|----------|
| Detect | Find issues faster | Alerts, dashboards, health checks |
| Prevent | Stop issues from occurring | Validation, guards, automation |
| Mitigate | Reduce impact when issues occur | Circuit breakers, fallbacks, caching |
| Respond | Improve incident response | Runbooks, training, tooling |
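
As one illustration of the Detect category, a threshold alert on error rate (such as an alert for an error rate above 1% on a single endpoint) reduces to a small comparison. This is a sketch in plain Python, not tied to any monitoring system; the function names, the default threshold, and the request counts are assumptions made for the example.

```python
def error_rate(total_requests: int, failed_requests: int) -> float:
    """Fraction of requests that failed; 0.0 when there was no traffic."""
    if total_requests == 0:
        return 0.0
    return failed_requests / total_requests


def should_alert(total_requests: int, failed_requests: int,
                 threshold: float = 0.01) -> bool:
    """True when the error rate in the window is strictly above the threshold."""
    return error_rate(total_requests, failed_requests) > threshold


# Made-up counts for one evaluation window on a single endpoint.
noisy_window = should_alert(total_requests=2000, failed_requests=30)
quiet_window = should_alert(total_requests=2000, failed_requests=10)
print(noisy_window, quiet_window)
```

In practice a rule like this would normally live in the monitoring system's own alert configuration rather than in application code; the sketch only makes the condition explicit so that the action item can be verified.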

## Postmortem Draft: [Incident Title]

### Quick Summary
- **What happened:** [Brief description]
- **Impact:** [User and business impact]
- **Duration:** [How long]
- **Root cause:** [Primary cause]

### Key Learnings
1. [Learning 1]
2. [Learning 2]
3. [Learning 3]

### Priority Action Items
1. [ ] [Most important action] - Owner: [Name] - Due: [Date]
2. [ ] [Second action] - Owner: [Name] - Due: [Date]
3. [ ] [Third action] - Owner: [Name] - Due: [Date]

### Full Document
[Complete postmortem using template above]
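
Each priority action item in the draft above carries an owner and a due date, and its wording should be specific enough to verify. A rough sketch of checking that mechanically follows, assuming action items are kept as simple records. `ActionItem`, `VAGUE_PHRASES`, and `review_action_item` are hypothetical names, and the phrase list only mirrors the vague examples from the Bad/Good comparison; it is not an exhaustive rule set.

```python
from dataclasses import dataclass
from typing import Optional

# Vague wordings taken from the Bad/Good comparison; extend as needed.
VAGUE_PHRASES = (
    "improve monitoring",
    "be more careful",
    "review code better",
    "fix the bug",
    "document the process",
)


@dataclass
class ActionItem:
    """Hypothetical record for one action item."""
    action: str
    owner: Optional[str] = None
    due_date: Optional[str] = None  # e.g. "2024-02-01"


def review_action_item(item: ActionItem) -> list[str]:
    """Return a list of problems; an empty list means none were detected."""
    problems = []
    text = item.action.strip().lower().rstrip(".")
    if text in VAGUE_PHRASES:
        problems.append("action is too vague to verify")
    if not item.owner:
        problems.append("missing owner")
    if not item.due_date:
        problems.append("missing due date")
    return problems


vague = ActionItem("Improve monitoring")
specific = ActionItem(
    "Add alert for error rate > 1% on /api/auth",
    owner="on-call lead",       # made-up owner
    due_date="2024-02-01",      # made-up date
)
print(review_action_item(vague))
print(review_action_item(specific))
```

An exact-match phrase list is deliberately crude; it catches only the most common non-actions, and the reviewers still need to judge whether an item is measurable.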

- **Assume good intent:** People made the best decisions with available information
- **Focus on systems:** Ask "what allowed this to happen?" not "who caused this?"
- **No blame, no shame:** Create safety for honest discussion
- **Learn, don't punish:** Goal is improvement, not discipline
- **Multiple causes:** Incidents rarely have a single root cause
- **Seek understanding:** "How did it make sense to do X at the time?"
