How to Write an Incident Postmortem
An incident postmortem is the difference between an outage that happens once and one that happens again. Many teams skip them, or write ones that assign blame instead of identifying system failures. Here's how to write one that actually prevents recurrence.
After an incident is resolved, the temptation is to move on. The system is back up, the immediate pain is over, and there's a backlog of regular work waiting. Skipping the postmortem feels like a rational tradeoff. It isn't. Without a postmortem, the conditions that caused the incident stay in place, and the same incident is likely to recur.
The blameless principle
A postmortem that assigns blame is worse than no postmortem. When people expect to be blamed, they under-report what they did during an incident, avoid admitting mistakes in retros, and become less likely to flag risks proactively in the future. The culture deteriorates.
Blameless postmortems operate on the assumption that given the same information, pressures, and tools available at the time, the people involved made the best decisions they could. The goal is to find the system conditions that made failure possible, not to identify the person who triggered it.
This doesn't mean no accountability. It means accountability is directed at processes and systems, not individuals. "The deploy pipeline didn't require a staging review" is a system fix. "Alice didn't test in staging" is blame.
When to write a postmortem
Not every incident warrants a full postmortem. Use these thresholds:
- Always: any incident with more than 30 minutes of customer impact
- Always: any data loss, even partial
- Always: any incident that required escalation beyond the on-call engineer
- Consider it: any incident that required manual intervention to resolve
- Skip it: self-resolving incidents with under 5 minutes of impact and a clear, non-novel cause
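The thresholds above can be encoded as a small triage function so the decision is applied consistently. This is a minimal sketch: the `Incident` fields and the `postmortem_decision` name are hypothetical, and the fallback for incidents that match none of the listed rules is an assumption, since the list does not cover every case.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    # Hypothetical fields, named after the thresholds in the list above.
    customer_impact_minutes: float
    data_loss: bool = False
    escalated_beyond_on_call: bool = False
    manual_intervention: bool = False
    self_resolved: bool = False
    novel_cause: bool = True


def postmortem_decision(incident: Incident) -> str:
    """Return 'always', 'consider', or 'skip' for a resolved incident."""
    if (
        incident.customer_impact_minutes > 30
        or incident.data_loss
        or incident.escalated_beyond_on_call
    ):
        return "always"
    if incident.manual_intervention:
        return "consider"
    if (
        incident.self_resolved
        and incident.customer_impact_minutes < 5
        and not incident.novel_cause
    ):
        return "skip"
    # Assumption: anything the thresholds do not cover is left to judgment.
    return "consider"
```

For example, `postmortem_decision(Incident(customer_impact_minutes=40))` returns `"always"`, while a self-resolving two-minute blip with a known cause returns `"skip"`.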
Postmortem template
Incident: [Title]
Date: YYYY-MM-DD
Duration: HH:MM (detection to resolution)
Severity: P1 / P2 / P3
Author: [Name]
Reviewed by: [Team]
Summary
One paragraph. What happened, how long it lasted, what impact customers experienced. Written for someone who wasn't on-call.
Impact
- Users affected: [N] or [% of traffic]
- Features affected: [list]
- Revenue impact: [$X] if calculable
- SLA impact: [yes/no, contractual obligations triggered]
Timeline
HH:MM — [event]
HH:MM — [event]
...
Root cause
The direct technical cause. Then the contributing conditions. Use five-whys if needed.
Detection
How was the incident detected? By an uptime monitor, a customer report, or the on-call engineer? How long did detection take? Could we have detected it earlier?
Response
What was done to mitigate and resolve. What worked. What didn't. What slowed the response down.
Action items
[ ] Owner — description — due date
[ ] Owner — description — due date
Writing a useful timeline
The timeline is the most important section. It reconstructs the full sequence of events with timestamps. A good timeline surfaces the gap between when the incident actually started and when it was detected — which is almost always longer than teams expect.
Example timeline for a database connection exhaustion incident:
14:12 — Deploy of v2.4.1 completed to production
14:18 — Database connection pool begins exhausting (not yet visible to users)
14:31 — First user-facing 503 errors begin appearing on API routes
14:44 — PingBase alert fires: /api/health returning 503
14:46 — On-call engineer paged, begins investigation
14:52 — Root cause identified: new query in v2.4.1 not closing connections
15:03 — Hotfix deployed, connection pool recovering
15:11 — Error rate returns to baseline, incident resolved
Total customer impact: 40 minutes (14:31–15:11)
Detection gap: 13 minutes (14:31–14:44)
Notice the detection gap. User-facing errors started 13 minutes before the alert fired, and the connection pool had already been exhausting for 13 minutes before that. That's 13 minutes of users experiencing errors with no one aware. This is why the timeline matters: it makes the gap explicit, and the action item becomes "reduce check interval from 5 minutes to 1 minute."
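The figures in the example timeline can be computed from the timestamps rather than counted by hand. A minimal sketch, assuming every event falls on the same day; the `minutes_between` helper and the dictionary keys are illustrative names, not part of any tool:

```python
from datetime import datetime


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two same-day HH:MM timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Key timestamps taken from the example timeline above.
timeline = {
    "fault_began": "14:18",        # connection pool begins exhausting
    "first_user_errors": "14:31",  # first user-facing 503s
    "alert_fired": "14:44",        # monitoring alert
    "resolved": "15:11",           # error rate back to baseline
}

customer_impact = minutes_between(timeline["first_user_errors"], timeline["resolved"])
detection_gap = minutes_between(timeline["first_user_errors"], timeline["alert_fired"])
latent_period = minutes_between(timeline["fault_began"], timeline["first_user_errors"])

print(f"Total customer impact: {customer_impact} minutes")
print(f"Detection gap: {detection_gap} minutes")
print(f"Fault present before users noticed: {latent_period} minutes")
```

With these timestamps the script reports 40 minutes of customer impact, a 13-minute detection gap, and 13 minutes during which the fault was present but not yet visible to users.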
Five-whys root cause analysis
For the root cause section, five-whys forces you past the surface technical cause to the underlying system conditions:
- Why did the site go down? The database ran out of connections.
- Why did it run out of connections? A new query in v2.4.1 opened connections without closing them.
- Why did that get deployed? Code review didn't catch the connection leak.
- Why didn't code review catch it? We don't have automated checks for unclosed database resources.
- Why don't we have those checks? We've never added linting rules for resource management in this language.
The action item from level 1 is "fix the query." The action item from level 5 is "add a linter rule that prevents this class of bug permanently." Level 5 action items are the ones that prevent recurrence.
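To make the level 5 action item concrete, here is a rough sketch of what an automated check for unclosed connections could look like. It assumes a Python codebase, which the example incident does not specify, and a deliberately crude rule: any call to a function named `connect` must appear inside a `with` statement. The `find_unmanaged_connects` name is hypothetical, and a real linter rule would need to handle more cases, such as connections closed in a `finally` block.

```python
import ast


def find_unmanaged_connects(source: str) -> list[int]:
    """Return line numbers of connect() calls not wrapped in a with statement."""
    tree = ast.parse(source)

    # Collect every node that appears inside a with-statement's context expression.
    managed: set[int] = set()
    for node in ast.walk(tree):
        if isinstance(node, (ast.With, ast.AsyncWith)):
            for item in node.items:
                for sub in ast.walk(item.context_expr):
                    managed.add(id(sub))

    flagged = []
    for node in ast.walk(tree):
        if not isinstance(node, ast.Call) or id(node) in managed:
            continue
        func = node.func
        # Handles both module.connect(...) and a bare connect(...).
        name = func.attr if isinstance(func, ast.Attribute) else getattr(func, "id", None)
        if name == "connect":
            flagged.append(node.lineno)
    return sorted(flagged)


LEAKY = (
    "import sqlite3\n"
    "conn = sqlite3.connect('app.db')\n"
    "conn.execute('SELECT 1')\n"
)
SAFE = (
    "import sqlite3\n"
    "from contextlib import closing\n"
    "with closing(sqlite3.connect('app.db')) as conn:\n"
    "    conn.execute('SELECT 1')\n"
)

print(find_unmanaged_connects(LEAKY))
print(find_unmanaged_connects(SAFE))
```

Run in CI, a check like this flags line 2 of the leaky snippet and nothing in the safe one. The point is not this specific rule but that the fix lives in the pipeline rather than in a reviewer's attention.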
Action items that actually get done
Postmortem action items die when they have no owner and no due date. Structure each one with:
- Owner: one person, not a team
- Description: specific and verifiable ("add DB connection linter rule to CI" not "improve monitoring")
- Due date: within two weeks for P1 items, four weeks for P2
- Priority: detection items first, then prevention
Review open action items at your next engineering all-hands. If items are consistently not completed, the process of writing them is theater. Raise the issue explicitly.
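The structure above maps naturally onto a small record type, which makes overdue items easy to list at review time. A sketch under stated assumptions: the `ActionItem` class and helper names are hypothetical, and due dates are counted from the incident date, which the guidance above leaves open.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Deadlines from the list above: two weeks for P1, four weeks for P2.
DUE_WITHIN = {"P1": timedelta(weeks=2), "P2": timedelta(weeks=4)}


@dataclass
class ActionItem:
    owner: str        # one person, not a team
    description: str  # specific and verifiable
    due: date
    kind: str         # "detection" or "prevention"
    done: bool = False


def latest_due_date(severity: str, incident_date: date) -> date:
    """Latest acceptable due date, counted from the incident date (assumption)."""
    return incident_date + DUE_WITHIN[severity]


def review_order(items: list[ActionItem]) -> list[ActionItem]:
    """Detection items first, then prevention; earliest due date first within each."""
    return sorted(items, key=lambda item: (item.kind != "detection", item.due))


def overdue(items: list[ActionItem], today: date) -> list[ActionItem]:
    """Open items whose due date has passed."""
    return [item for item in items if not item.done and item.due < today]
```

For a P1 incident on 2024-01-01, `latest_due_date("P1", date(2024, 1, 1))` gives 2024-01-15. Calling `overdue(items, date.today())` before each review produces the list to raise explicitly.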
The detection action item: always check your monitoring
Every postmortem should ask: how was this detected, and how could we have detected it earlier? The detection gap in the timeline — between when the incident started and when anyone knew — is almost always reducible.
Common detection improvements:
- Reduce monitor check interval (5 min → 1 min)
- Add content checks to catch pages that return 200 but show error states
- Add a health check endpoint that tests dependencies directly
- Add response time alerts so slowness triggers an alert before it becomes downtime
- Add Slack/PagerDuty routing so alerts reach the right person faster
The goal: the next incident of this type should be detected in under 2 minutes, not 13.
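Two of the improvements above, a health check that tests dependencies directly and a content check for pages that return 200 but show an error, can be sketched in a few lines. This is a framework-agnostic illustration: the function names, the probe callables, and the error marker strings are all assumptions, and a real endpoint would wire `health_check` into whatever web framework the service uses.

```python
from typing import Callable


def health_check(dependencies: dict[str, Callable[[], None]]) -> tuple[int, dict[str, str]]:
    """Run each dependency probe; report 503 if any probe raises."""
    results: dict[str, str] = {}
    for name, probe in dependencies.items():
        try:
            probe()  # e.g. run "SELECT 1" against the database
            results[name] = "ok"
        except Exception as exc:
            results[name] = f"failed: {exc}"
    status = 200 if all(value == "ok" for value in results.values()) else 503
    return status, results


def looks_like_error_page(
    status_code: int,
    body: str,
    markers: tuple[str, ...] = ("something went wrong", "internal server error"),
) -> bool:
    """True when a response returns 200 but its body contains an error marker."""
    lowered = body.lower()
    return status_code == 200 and any(marker in lowered for marker in markers)
```

In the connection exhaustion example, a database probe behind the health endpoint would have started failing when the pool ran out, rather than waiting for API routes to return 503s.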