What is a postmortem?
A postmortem is a written review after an incident: what happened, the timeline, the impact, the root cause, and the concrete actions to prevent a recurrence. The "blameless" part is essential — it focuses on systems and process, not on punishing whoever was at the keyboard.
Why it matters
Outages are inevitable; learning from them is optional, and that learning is what separates teams that improve from teams that repeat the same failure. A blameless culture encourages honest postmortems, and honest postmortems produce real fixes. Running postmortems well is a hallmark of mature operations and a frequent topic in senior interviews.
What to learn
- The structure: timeline, impact, root cause, actions
- Blameless culture and psychological safety
- The five whys for root-cause analysis
- Distinguishing root cause from triggers
- Actionable follow-ups with owners and dates
- Sharing learnings across the org
- Tracking action items to completion
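The structure above can be sketched as a small data model. This is one possible shape, not a standard schema: the class names, field names, and the example incident are all invented for illustration. It keeps the trigger and the root cause as separate fields and gives every action an owner and a due date so open and overdue items can be tracked.

```python
from dataclasses import dataclass, field
from datetime import date, datetime


@dataclass
class ActionItem:
    description: str
    owner: str          # the team or role accountable for the follow-up
    due: date
    done: bool = False


@dataclass
class Postmortem:
    title: str
    impact: str
    trigger: str        # the event that set the incident off
    root_cause: str     # the underlying weakness that let the trigger do harm
    timeline: list[tuple[datetime, str]] = field(default_factory=list)
    whys: list[str] = field(default_factory=list)
    actions: list[ActionItem] = field(default_factory=list)

    def open_actions(self) -> list[ActionItem]:
        """Actions that have not been completed yet."""
        return [a for a in self.actions if not a.done]

    def overdue_actions(self, today: date) -> list[ActionItem]:
        """Open actions whose due date has already passed."""
        return [a for a in self.open_actions() if a.due < today]


# Hypothetical incident, invented for this example.
pm = Postmortem(
    title="Checkout API outage",
    impact="Checkout requests failed for 42 minutes.",
    trigger="A config change removed the database connection limit.",
    root_cause="Config changes reach production without validation or a staged rollout.",
    timeline=[
        (datetime(2024, 3, 5, 14, 2), "Config change deployed"),
        (datetime(2024, 3, 5, 14, 6), "Error-rate alert fired"),
        (datetime(2024, 3, 5, 14, 44), "Change rolled back, service recovered"),
    ],
    whys=[
        "Why did checkout fail? The database refused new connections.",
        "Why? The service opened more connections than the database allows.",
        "Why? A config change removed the connection limit.",
        "Why did that reach production? Config changes are not validated before deploy.",
        "Why not? Config was never treated as code with tests and a staged rollout.",
    ],
    actions=[
        ActionItem("Add schema validation to config deploys", "platform team", date(2024, 3, 19)),
        ActionItem("Roll config changes out to one region first", "release team", date(2024, 4, 2)),
    ],
)
```

In this sketch the five whys end at a process gap (unvalidated config), while the trigger (the specific config change) is recorded separately. Calling pm.overdue_actions(date.today()) in a weekly review is one way to track action items to completion.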
Common pitfall
Writing a postmortem that names a person as the cause — "engineer ran the wrong command." That kills the honesty future postmortems depend on and ignores the real question: why did the system let one command cause an outage? Blame the missing guardrail, not the human; fix the system so the mistake cannot recur.
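As an illustration of "fix the system", here is a minimal sketch of the kind of guardrail a postmortem might recommend. Everything in it is hypothetical: the function name, the list of destructive prefixes, and the rule that production commands must be confirmed by typing the environment name. A real guardrail depends on the tooling involved.

```python
class GuardrailError(Exception):
    """Raised when a risky command is blocked by the guardrail."""


# Hypothetical command prefixes treated as destructive.
DESTRUCTIVE_PREFIXES = ("drop ", "delete ", "truncate ")


def check_command(command: str, environment: str, confirmation: str = "") -> str:
    """Return the command if it may run, otherwise raise GuardrailError.

    A destructive command aimed at production must be confirmed by passing
    the environment name, which makes a single mistyped command less likely
    to cause an outage.
    """
    is_destructive = command.strip().lower().startswith(DESTRUCTIVE_PREFIXES)
    if is_destructive and environment == "production" and confirmation != environment:
        raise GuardrailError(
            f"refusing {command!r} on {environment}: confirmation required"
        )
    return command
```

With a check like this in place, the postmortem question shifts from "who ran the command" to "why was the guardrail missing, and does it now cover this case".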
Resources
Primary (free):
- Google SRE — Postmortem culture · docs
- Google SRE — Example postmortem · docs
- PagerDuty — Postmortems · docs
Practice
Take a real or hypothetical incident and write a blameless postmortem: a timeline, the impact, a root cause found with the five whys, and two or three follow-up actions with owners and due dates. Check that no line blames a person. Done when every action targets a system or process change.
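The two checks in this exercise can be roughed out in code as a self-review aid. This is a crude sketch under stated assumptions: the blame phrases are an invented starting list, the action format (a dict with "description", "owner", and "target" keys) is hypothetical, and a phrase match is no substitute for reading the draft carefully.

```python
# Hypothetical phrases that often signal blame; adjust the list for your team.
BLAME_PHRASES = (
    "human error",
    "operator error",
    "should have known",
    "careless",
    "failed to follow",
)

VALID_TARGETS = {"system", "process"}


def find_blame_lines(text: str) -> list[str]:
    """Return the lines of the draft that contain a blame phrase."""
    return [
        line
        for line in text.splitlines()
        if any(phrase in line.lower() for phrase in BLAME_PHRASES)
    ]


def find_action_problems(actions: list[dict]) -> list[str]:
    """Describe each action that lacks an owner or a system/process target."""
    problems = []
    for action in actions:
        name = action.get("description", "<unnamed>")
        if not action.get("owner"):
            problems.append(f"{name}: no owner")
        if action.get("target") not in VALID_TARGETS:
            problems.append(f"{name}: does not target a system or process change")
    return problems


# Invented draft and actions, used only to exercise the checks.
draft = (
    "Impact: checkout unavailable for 42 minutes.\n"
    "Root cause: human error during the deploy.\n"
    "Root cause: the deploy tool had no canary stage."
)
actions = [
    {"description": "Add a canary stage", "owner": "release team", "target": "system"},
    {"description": "Remind engineers to be careful", "owner": "", "target": "person"},
]
```

Here find_blame_lines(draft) flags only the "human error" line, and find_action_problems(actions) reports two problems for the second action. The exercise is done when both functions return an empty list and a human read-through agrees.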
Outcomes
- Write a structured, blameless postmortem.
- Find a root cause with the five whys.
- Separate the root cause from the trigger.
- Produce follow-up actions with owners and dates.