
How to Write an Incident Postmortem

An incident postmortem is the difference between an outage that happens once and one that happens again. Most teams skip them, or write ones that assign blame instead of identifying system failures. Here's how to write one that actually prevents recurrence.

After an incident is resolved, the temptation is to move on. The system is back up, the immediate pain is over, and there's a backlog of regular work waiting. Skipping the postmortem feels like a rational tradeoff. It isn't. Without a postmortem, the conditions that caused the incident are still in place, and the same incident is likely to recur.


The blameless principle

A postmortem that assigns blame is worse than no postmortem. When people expect to be blamed, they under-report what they did during an incident, avoid admitting mistakes in retros, and become less likely to flag risks proactively in the future. The culture deteriorates.

Blameless postmortems operate on the assumption that given the same information, pressures, and tools available at the time, the people involved made the best decisions they could. The goal is to find the system conditions that made failure possible, not to identify the person who triggered it.

This doesn't mean no accountability. It means accountability is directed at processes and systems, not individuals. "The deploy pipeline didn't require a staging review" is a system fix. "Alice didn't test in staging" is blame.


When to write a postmortem

Not every incident warrants a full postmortem. Use these thresholds:


Postmortem template

Incident: [Title]

Date: YYYY-MM-DD
Duration: HH:MM (detection to resolution)
Severity: P1 / P2 / P3
Author: [Name]
Reviewed by: [Team]


Summary
One paragraph. What happened, how long it lasted, what impact customers experienced. Written for someone who wasn't on-call.


Impact
- Users affected: [N] or [% of traffic]
- Features affected: [list]
- Revenue impact: [$X] if calculable
- SLA impact: [yes/no, contractual obligations triggered]


Timeline
HH:MM — [event]
HH:MM — [event]
...


Root cause
The direct technical cause. Then the contributing conditions. Use five-whys if needed.


Detection
How was the incident detected? By an uptime monitor, a customer report, or the on-call engineer? How long did detection take? Could we have detected it earlier?


Response
What was done to mitigate and resolve. What worked. What didn't. What slowed the response down.


Action items
[ ] Owner — description — due date
[ ] Owner — description — due date


Writing a useful timeline

The timeline is the most important section. It reconstructs the full sequence of events with timestamps. A good timeline surfaces the gap between when the incident actually started and when it was detected — which is almost always longer than teams expect.

Example timeline for a database connection exhaustion incident:

14:12 — Deploy of v2.4.1 completed to production

14:18 — Database connection pool begins exhausting (not yet visible to users)

14:31 — First user-facing 503 errors begin appearing on API routes

14:44 — PingBase alert fires: /api/health returning 503

14:46 — On-call engineer paged, begins investigation

14:52 — Root cause identified: new query in v2.4.1 not closing connections

15:03 — Hotfix deployed, connection pool recovering

15:11 — Error rate returns to baseline, incident resolved

Total customer impact: 40 minutes (14:31–15:11)
Detection gap: 13 minutes (14:31–14:44)
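
The two totals above are simple timestamp subtraction, so they can be derived from the timeline rather than typed by hand. Here is a minimal sketch in Python, assuming all events fall on the same day. The dictionary keys and the minutes_between helper are names made up for this illustration, not part of any tool.

```python
from datetime import datetime, timedelta

# Timestamps copied from the example timeline (same day, 24-hour clock).
# The key names are hypothetical labels chosen for this sketch.
TIMELINE = {
    "fault_started": "14:18",        # connection pool begins exhausting
    "user_impact_started": "14:31",  # first user-facing 503 errors
    "alert_fired": "14:44",          # uptime alert fires
    "resolved": "15:11",             # error rate back to baseline
}


def minutes_between(start: str, end: str) -> int:
    """Whole minutes from start to end, both given as HH:MM on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta / timedelta(minutes=1))


customer_impact = minutes_between(TIMELINE["user_impact_started"], TIMELINE["resolved"])
detection_gap = minutes_between(TIMELINE["user_impact_started"], TIMELINE["alert_fired"])
fault_to_alert = minutes_between(TIMELINE["fault_started"], TIMELINE["alert_fired"])

print(f"Total customer impact: {customer_impact} minutes")
print(f"Detection gap: {detection_gap} minutes")
print(f"Underlying fault to alert: {fault_to_alert} minutes")
```

The third figure measures from the first timeline entry where the fault existed (14:18) rather than from the first user-facing error, which is worth reporting alongside the detection gap.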

Notice the detection gap. User-facing errors started 13 minutes before the alert fired, and the connection pool had begun exhausting 13 minutes before that, at 14:18. That's 13 minutes of users experiencing errors with no one aware. This is the gap uptime monitoring exists to shrink: the timeline makes it explicit, and the action item becomes "reduce check interval from 5 minutes to 1 minute."


Five-whys root cause analysis

For the root cause section, five-whys forces you past the surface technical cause to the underlying system conditions:

  1. Why did the site go down? The database ran out of connections.
  2. Why did it run out of connections? A new query in v2.4.1 opened connections without closing them.
  3. Why did that get deployed? Code review didn't catch the connection leak.
  4. Why didn't code review catch it? We don't have automated checks for unclosed database resources.
  5. Why don't we have those checks? We've never added linting rules for resource management in this language.

The action item from level 1 is "fix the query." The action item from level 5 is "add a linter rule that prevents this class of bug permanently." Level 5 action items are the ones that prevent recurrence.


Action items that actually get done

Postmortem action items die when they have no owner and no due date. Structure each one as in the template: an owner, a description, and a due date.

Review open action items at your next engineering all-hands. If items are consistently not completed, the process of writing them is theater. Raise the issue explicitly.
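
One way to make that review mechanical is to flag every open item that is missing an owner, missing a due date, or past due. The sketch below is a hypothetical illustration in Python: the ActionItem fields mirror the "Owner, description, due date" line in the template, and the sample items, owners, and dates are invented.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class ActionItem:
    description: str
    owner: Optional[str] = None
    due: Optional[date] = None
    done: bool = False


def problems(item: ActionItem, today: date) -> List[str]:
    """Return the reasons an open action item is at risk of never getting done."""
    if item.done:
        return []
    found = []
    if not item.owner:
        found.append("no owner")
    if item.due is None:
        found.append("no due date")
    elif item.due < today:
        found.append("overdue")
    return found


# Invented sample data, for illustration only.
items = [
    ActionItem("Fix the query that leaks connections", owner="owner-a",
               due=date(2024, 1, 10), done=True),
    ActionItem("Add a linter rule for unclosed database resources",
               owner="owner-b", due=date(2024, 1, 20)),
    ActionItem("Reduce uptime check interval to 1 minute"),
]

today = date(2024, 2, 1)
for item in items:
    issues = problems(item, today)
    if issues:
        print(f"{item.description}: {', '.join(issues)}")
```

With this sample data, the completed item is skipped, the second is reported as overdue, and the third is reported as having no owner and no due date.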


The detection action item: always check your monitoring

Every postmortem should ask: how was this detected, and how could we have detected it earlier? The detection gap in the timeline — between when the incident started and when anyone knew — is almost always reducible.

Common detection improvements:

The goal: the next incident of this type should be detected in under 2 minutes, not 13.
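
To see how the check interval bounds detection time, here is a rough model in Python. It assumes a polling monitor that alerts after a fixed number of consecutive failed checks, and it ignores check timeouts and notification latency. The function name and the failures_to_alert parameter are hypothetical, not settings of any particular monitoring product.

```python
from typing import Tuple


def detection_delay_bounds(interval_min: float, failures_to_alert: int = 1) -> Tuple[float, float]:
    """Best and worst case minutes from failure start to alert for a polling check.

    Simplified model: checks run every interval_min minutes, and the alert
    fires on the Nth consecutive failed check (N = failures_to_alert).
    Worst case: the failure begins just after a check, so the first failed
    check is a full interval away. Best case: it begins just before a check.
    """
    if interval_min <= 0 or failures_to_alert < 1:
        raise ValueError("interval must be positive and failures_to_alert at least 1")
    best = interval_min * (failures_to_alert - 1)
    worst = interval_min * failures_to_alert
    return best, worst


for interval in (5, 1):
    best, worst = detection_delay_bounds(interval, failures_to_alert=2)
    print(f"{interval}-minute checks, alert on 2nd failure: {best} to {worst} minutes")
```

Under this model, a 1-minute interval with an alert on the second consecutive failure gives a worst case of 2 minutes, while a 5-minute interval with the same rule gives a worst case of 10.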
