The Incident Management Playbook: From Alert to Resolution

Most incidents don't become disasters because of bad infrastructure — they become disasters because the response is uncoordinated. Here's a playbook that works for teams of any size.

Phase 1: Detection

The incident management process starts before a human wakes up. You need automated detection that's sensitive enough to catch real problems fast, but configured to suppress noise that would cause alert fatigue.

What you need in place before the next incident:

Uptime monitors on your critical endpoints, checking every 1–2 minutes
Response time alerts configured with thresholds (warn at 2s, critical at 5s)
Multi-region checks to confirm an outage is real, not a network blip
Alert channels that reach the right people — on-call rotation, Slack, PagerDuty
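
As a rough illustration of the threshold and multi-region items above, here is a minimal Python sketch. The 2s and 5s thresholds come from the list; the function names, the region names, and the rule that at least two regions must fail before an outage counts as real are assumptions made for the example, not part of any particular monitoring tool.

```python
from enum import Enum
from typing import Mapping


class Level(Enum):
    OK = "ok"
    WARN = "warn"
    CRITICAL = "critical"


# Thresholds taken from the list above.
WARN_SECONDS = 2.0
CRITICAL_SECONDS = 5.0


def classify_response_time(seconds: float) -> Level:
    """Map a measured response time to an alert level."""
    if seconds >= CRITICAL_SECONDS:
        return Level.CRITICAL
    if seconds >= WARN_SECONDS:
        return Level.WARN
    return Level.OK


def outage_confirmed(region_up: Mapping[str, bool], min_failing: int = 2) -> bool:
    """Treat an outage as real only when several regions fail at once.

    min_failing=2 is an assumed default; a single failing region is
    treated as a possible network blip.
    """
    failing = sum(1 for up in region_up.values() if not up)
    return failing >= min_failing
```

With these assumptions, one failing region out of three does not page anyone, while two failing regions do.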

Detection without alerting is just logging. Make sure alerts go somewhere they'll be seen by whoever is responsible for responding.

Phase 2: Acknowledgment

The moment an on-call engineer sees the alert, they should acknowledge it. Acknowledgment serves two purposes:

It stops the escalation clock: if no one acknowledges within N minutes, the alert should escalate to someone else
It signals to the team "someone is on this" — preventing two engineers from simultaneously diving into the same incident without coordination

For small teams, acknowledgment can be as simple as posting in a Slack channel: "Seeing the alert, investigating now. On it."
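
The escalation rule described above can be sketched in a few lines of Python. The function name and the 10-minute default for N are assumptions for illustration; pick a timeout that fits your on-call setup.

```python
from datetime import datetime, timedelta
from typing import Optional


def should_escalate(
    fired_at: datetime,
    acknowledged_at: Optional[datetime],
    now: datetime,
    ack_timeout: timedelta = timedelta(minutes=10),  # assumed value for N
) -> bool:
    """Escalate only if the alert is still unacknowledged after the timeout."""
    if acknowledged_at is not None:
        return False  # someone is on it, so the escalation clock stops
    return now - fired_at >= ack_timeout
```

A scheduler or the alerting tool itself would call something like this periodically for each open alert.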

Phase 3: Triage

Triage answers one question: how bad is this? The answer determines how many people you pull in and how much urgency to communicate externally.

Severity levels (adapt to your context):

SEV-1: Critical

Complete outage. All users are affected. Core functionality is inaccessible. Immediate all-hands response. Update status page immediately.

SEV-2: Major

Significant degradation. Large portion of users affected or key features broken. On-call engineer plus one other. Update status page.

SEV-3: Minor

Limited impact. Small percentage of users affected or non-critical functionality broken. On-call engineer handles alone. Status page update optional.

Classify quickly and communicate the classification. It tells everyone — including your support team — how to respond to user inquiries.
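
To make classification fast under pressure, some teams encode the levels as a small decision function. The sketch below is one possible encoding in Python. The percentage cutoffs (95% for "all users", 25% for "large portion") and all names are assumptions chosen for the example; the response plans mirror the level descriptions above.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3


@dataclass(frozen=True)
class ResponsePlan:
    responders: str
    status_page: str


PLANS = {
    Severity.SEV1: ResponsePlan("all hands", "update immediately"),
    Severity.SEV2: ResponsePlan("on-call engineer plus one other", "update"),
    Severity.SEV3: ResponsePlan("on-call engineer alone", "optional"),
}

# Assumed cutoffs; adapt to your context.
ALL_USERS_PCT = 95.0
LARGE_PORTION_PCT = 25.0


def classify(pct_users_affected: float, core_inaccessible: bool,
             key_feature_broken: bool) -> Severity:
    """Pick a severity level from impact facts known at triage time."""
    if core_inaccessible and pct_users_affected >= ALL_USERS_PCT:
        return Severity.SEV1
    if (pct_users_affected >= LARGE_PORTION_PCT
            or key_feature_broken or core_inaccessible):
        return Severity.SEV2
    return Severity.SEV3
```

For example, `PLANS[classify(40.0, False, False)]` yields the SEV-2 plan: on-call engineer plus one other, with a status page update.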

Phase 4: Communication

User communication is the most overlooked part of incident response — and the most trust-damaging when done poorly.

The golden rule: update your status page before users ask. If users are hitting errors and your status page shows all green, the trust damage is compounded — not just "the product broke" but "they didn't even tell us."

Status page update cadence:

First update: As soon as you've confirmed the incident is real (within 5 minutes of acknowledgment)
Subsequent updates: Every 15–30 minutes, even if there's nothing new to report
Resolution update: What was fixed, when, and a brief explanation

The first update doesn't need to have answers. "We are aware of an issue affecting [feature]. Our team is investigating. We will provide an update in 15 minutes." That's enough to stop users from feeling ignored.
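
The cadence above is easy to automate as a reminder. The following Python sketch uses the 5-minute first-update deadline and the short end of the 15 to 30 minute interval from this section; the function names are hypothetical, and posting to an actual status page is left out because it depends on your provider.

```python
from datetime import datetime, timedelta
from typing import Optional

FIRST_UPDATE_DEADLINE = timedelta(minutes=5)  # from acknowledgment
UPDATE_INTERVAL = timedelta(minutes=15)       # short end of the 15-30 minute range


def first_update_message(feature: str, next_update_minutes: int = 15) -> str:
    """Fill in the first-update template; no answers needed yet."""
    return (
        f"We are aware of an issue affecting {feature}. "
        "Our team is investigating. "
        f"We will provide an update in {next_update_minutes} minutes."
    )


def next_update_due(acknowledged_at: datetime,
                    last_update_at: Optional[datetime]) -> datetime:
    """Return when the next status page update is due."""
    if last_update_at is None:
        return acknowledged_at + FIRST_UPDATE_DEADLINE
    return last_update_at + UPDATE_INTERVAL
```

A bot could compare `next_update_due(...)` with the current time and nudge the incident channel when an update is overdue.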

Phase 5: Investigation and mitigation

The goal of investigation isn't to find the root cause — that comes later in the postmortem. The goal is to find the fastest path to restoring service. Those are often different things.

Start with "what changed?" Most incidents are caused by recent changes — a deploy, a config change, a dependency update, a traffic spike. Look at your deployment history from the last 24 hours first.
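
If your deploys and config changes are recorded anywhere with timestamps, answering "what changed?" can be a simple filter. Here is a minimal Python sketch, assuming a hypothetical `Change` record; where the records come from (CI system, audit log) is up to you.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List


@dataclass(frozen=True)
class Change:
    kind: str          # e.g. "deploy", "config", "dependency"
    description: str
    at: datetime


def recent_changes(changes: Iterable[Change], now: datetime,
                   window: timedelta = timedelta(hours=24)) -> List[Change]:
    """Return changes made within the window, newest first."""
    cutoff = now - window
    in_window = [c for c in changes if cutoff <= c.at <= now]
    return sorted(in_window, key=lambda c: c.at, reverse=True)
```

Listing the newest change first reflects the usual heuristic that the most recent change is the most likely suspect.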

Mitigation before root cause:

If a deploy caused it: roll back
If traffic spiked: scale up or rate-limit
If a database is overloaded: identify and kill slow queries, enable connection pooling
If a third-party service is down: switch to fallback or disable the integration

Restoring service first is almost always the right call. You can investigate why the database got overloaded after users can use the product again.
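
The mitigation list above can also live in code or config as a tiny runbook lookup, so the on-call engineer doesn't have to recall it from memory at 3 a.m. This Python sketch copies the four cases from the list; the cause keys and the fallback message for unknown causes are assumptions.

```python
from typing import Dict, List

# First mitigations to try, keyed by suspected cause (keys are illustrative).
MITIGATIONS: Dict[str, List[str]] = {
    "deploy": ["roll back"],
    "traffic_spike": ["scale up", "rate-limit"],
    "database_overload": ["identify and kill slow queries",
                          "enable connection pooling"],
    "third_party_outage": ["switch to fallback", "disable the integration"],
}


def first_mitigations(suspected_cause: str) -> List[str]:
    """Look up the first things to try; unknown causes get an assumed default."""
    return MITIGATIONS.get(
        suspected_cause,
        ["no runbook entry: keep investigating and pull in help"],
    )
```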

Phase 6: Resolution

Resolution is when your monitors confirm the service is healthy again — not when you think you've fixed it. Wait for at least 2–3 consecutive clean checks before marking the incident as resolved.
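
The "consecutive clean checks" rule can be expressed as a small check over recent monitor results. This is a sketch in Python; the function name is hypothetical, and the default of 3 is the upper end of the 2 to 3 range mentioned above.

```python
from typing import Sequence


def is_resolved(check_results: Sequence[bool], required_clean: int = 3) -> bool:
    """True only if the most recent `required_clean` checks all passed.

    check_results is ordered oldest to newest; True means the check was clean.
    """
    if required_clean < 1:
        raise ValueError("required_clean must be at least 1")
    if len(check_results) < required_clean:
        return False  # not enough data since the fix to call it resolved
    return all(check_results[-required_clean:])
```

A single passing check right after a fix is not enough here: `is_resolved([False, True])` stays false until enough clean checks accumulate.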

When you resolve:

Update your status page with a resolution message
Notify your support team so they can respond to any open tickets
Schedule the postmortem (within 48 hours)
Brief your broader team if it was a significant incident

Phase 7: Postmortem

The postmortem is where you convert a bad day into lasting improvement. Write it within 48 hours while memories are fresh.

A good postmortem covers:

Timeline: When did the incident start? When was it detected? When was it resolved?
Impact: How many users were affected? What functionality was broken? For how long?
Contributing factors: Not just "the database crashed" but why it crashed and what conditions led to that
What went well: What worked in your response? (This matters — it tells you what to preserve)
Action items: Specific, assigned, time-bounded tasks to prevent recurrence
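
Two items in this list lend themselves to a little structure: the timeline, which yields time-to-detect and total duration, and action items, which should each have an owner and a due date. The Python sketch below shows one way to capture both; the class and field names are assumptions for illustration, not a standard postmortem schema.

```python
from dataclasses import dataclass
from datetime import date, datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class Timeline:
    started_at: datetime
    detected_at: datetime
    resolved_at: datetime

    @property
    def time_to_detect(self) -> timedelta:
        return self.detected_at - self.started_at

    @property
    def duration(self) -> timedelta:
        """Total time users were affected, from start to resolution."""
        return self.resolved_at - self.started_at


@dataclass(frozen=True)
class ActionItem:
    description: str
    owner: str = ""
    due: Optional[date] = None

    def is_actionable(self) -> bool:
        """Assigned and time-bounded, as the list above asks."""
        return bool(self.owner.strip()) and self.due is not None
```

Flagging action items where `is_actionable()` is false before the postmortem is published is a cheap way to catch tasks that nobody owns.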

Blameless postmortems are more useful than blame-assigning ones. The goal is systemic improvement, not finding who made the mistake.

The incident management checklist

Good incident management is a skill that compounds over time. Each incident you handle well makes the next one faster and less stressful. The teams that are calmest under pressure didn't get there by luck — they ran the process enough times that it became muscle memory.
