Learn from failures without blame. Improve systems, not shame people.
Learn from failures without blame. Improve systems, not shame people.
Every incident is a gift—an opportunity to make the system stronger. Blame prevents learning.
| Blame Culture | Learning Culture |
|---|---|
| "Who messed up?" | "How did the system allow this?" |
| "They should have known" | "Why wasn't it obvious?" |
| "Follow the process!" | "Is the process followable?" |
| "Don't let it happen again" | "How do we prevent this class of problem?" |
Key insight: People did what made sense to them at the time, with the information they had.
What happened, when, impact.
| Time (UTC) | Event | Actor/System |
|---|---|---|
| 14:00 | Deployment started | CI/CD |
| 14:05 | Error rate increased | Monitoring |
| 14:07 | On-call paged | PagerDuty |
| ... | ... | ... |
Not "human error"—go deeper:
| Action | Owner | Due Date | Priority |
|---|---|---|---|
| Add validation | @alice | 2026-02-15 | P1 |
| Improve runbook | @bob | 2026-02-10 | P2 |
| ... | ... | ... | ... |
Rule: Every action item has an owner and date. No orphan items.
Opening (5 min)
"We're here to understand what happened and improve. This is blameless—we assume everyone acted reasonably with the info they had. Focus on systems and processes, not individuals."
Timeline Walk-through (20 min)
Root Cause Discussion (15 min)
Action Items (15 min)
Closing (5 min)
"Thank you for the candid discussion. We'll share the write-up for review. Any final thoughts?"
| Anti-Pattern | Why It's Harmful |
|---|---|
| Naming individuals in root cause | Creates fear, hides future problems |
| "They should have..." | Hindsight bias, doesn't fix systems |
| No action items | Wasted learning opportunity |
| Too many action items | Nothing gets done |
| Action items without owners | Nothing gets done |
| Never following up | Actions drift, cynicism grows |
| Only for big incidents | Small incidents have big lessons |
| Severity | Criteria | Post-Mortem Required? |
|---|---|---|
| SEV1 | Customer-facing outage > 30min | Yes, within 48 hours |
| SEV2 | Degraded service, workaround exists | Yes, within 1 week |
| SEV3 | Internal impact, no customer effect | Recommended |
| SEV4 | Near-miss, caught before impact | Optional but valuable |
Post-mortems feed into:
A personal example from Alex's own evolution...
What we learned from the Phoenix incident:
These lessons became: ADR-006, RISKS.md, the 5-layer protection system.
Every failure makes the architecture stronger.34:["$","$L3b",null,{"content":"$3c","frontMatter":{"name":"post-mortem","description":"Learn from failures without blame. Improve systems, not shame people.","tier":"extended","applyTo":"/incident,/postmortem,/retro,/failure,**/outage"}}]