The Incident Management Playbook: From Alert to Resolution
When an incident turns into a disaster, the cause is usually not bad infrastructure but an uncoordinated response. Here's a playbook that works for teams of any size.
Phase 1: Detection
The incident management process starts before a human wakes up. You need automated detection that's sensitive enough to catch real problems fast, but configured to suppress noise that would cause alert fatigue.
What you need in place before the next incident:
- Uptime monitors on your critical endpoints, checking every 1–2 minutes
- Response time alerts configured with thresholds (warn at 2s, critical at 5s)
- Multi-region checks to confirm an outage is real, not a network blip
- Alert channels that reach the right people — on-call rotation, Slack, PagerDuty
Detection without alerting is just logging. Make sure alerts reach whoever is responsible for responding, through a channel they will actually see.
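As a rough illustration of the threshold and multi-region items above, here is a minimal Python sketch. The function names, the region names, and the "at least two regions failing" rule are assumptions made for the example, not features of any particular monitoring tool. The 2s and 5s values come from the list above.

```python
from typing import Dict

WARN_SECONDS = 2.0      # warn threshold from the list above
CRITICAL_SECONDS = 5.0  # critical threshold from the list above


def classify_response_time(seconds: float) -> str:
    """Map a measured response time to an alert level."""
    if seconds >= CRITICAL_SECONDS:
        return "critical"
    if seconds >= WARN_SECONDS:
        return "warn"
    return "ok"


def outage_confirmed(region_results: Dict[str, bool], min_failing: int = 2) -> bool:
    """Treat an outage as real only if enough regions agree.

    `region_results` maps a region name to True (check passed) or
    False (check failed). `min_failing` is an assumed default.
    """
    failing = sum(1 for passed in region_results.values() if not passed)
    return failing >= min_failing
```

With this rule, a single failing region is treated as a possible network blip, while two or more failing regions would trigger the alert.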
Phase 2: Acknowledgment
The moment an on-call engineer sees the alert, they should acknowledge it. Acknowledgment serves two purposes:
- It stops the escalation timer: if no one acknowledges within N minutes, the alert should escalate to someone else
- It signals to the team "someone is on this" — preventing two engineers from simultaneously diving into the same incident without coordination
For small teams, acknowledgment can be as simple as posting in a Slack channel: "Seeing the alert, investigating now."
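The escalation rule can be sketched in a few lines of Python. This is an illustration only: `current_responder`, the escalation chain, and the 10-minute timeout (standing in for "N minutes") are hypothetical, and a real paging tool would handle this for you.

```python
from datetime import datetime, timedelta
from typing import List, Optional


def current_responder(
    fired_at: datetime,
    now: datetime,
    acknowledged_by: Optional[str],
    escalation_chain: List[str],
    ack_timeout: timedelta = timedelta(minutes=10),  # assumed value for "N"
) -> str:
    """Return who owns the alert right now.

    If someone acknowledged, they own it. Otherwise the alert moves one
    step down `escalation_chain` for every `ack_timeout` that passes
    without acknowledgment, stopping at the last person in the chain.
    """
    if acknowledged_by is not None:
        return acknowledged_by
    steps = max(0, int((now - fired_at) / ack_timeout))
    return escalation_chain[min(steps, len(escalation_chain) - 1)]
```

An acknowledgment freezes ownership, which is what tells the rest of the team that someone is on it.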
Phase 3: Triage
Triage answers one question: how bad is this? The answer determines how many people you pull in and how much urgency to communicate externally.
Severity levels (adapt to your context):
SEV-1: Critical
Complete outage. All users are affected. Core functionality is inaccessible. Immediate all-hands response. Update status page immediately.
SEV-2: Major
Significant degradation. Large portion of users affected or key features broken. On-call engineer plus one other. Update status page.
SEV-3: Minor
Limited impact. Small percentage of users affected or non-critical functionality broken. On-call engineer handles alone. Status page update optional.
Classify quickly and communicate the classification. It tells everyone — including your support team — how to respond to user inquiries.
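To make classification fast and repeatable, the severity table above could be encoded roughly like this. The numeric cutoffs (90% and 25% of users) are assumptions chosen for the sketch, since the levels above are described qualitatively. Adjust them to your context.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SeverityPolicy:
    name: str
    responders: str
    status_page: str


# Response expectations taken from the severity descriptions above.
POLICIES = {
    "SEV-1": SeverityPolicy("Critical", "all hands", "update immediately"),
    "SEV-2": SeverityPolicy("Major", "on-call engineer plus one other", "update"),
    "SEV-3": SeverityPolicy("Minor", "on-call engineer alone", "update optional"),
}


def classify(fraction_users_affected: float, core_functionality_down: bool) -> str:
    """Pick a severity level. The 0.9 and 0.25 cutoffs are illustrative assumptions."""
    if core_functionality_down and fraction_users_affected >= 0.9:
        return "SEV-1"
    if core_functionality_down or fraction_users_affected >= 0.25:
        return "SEV-2"
    return "SEV-3"
```

For example, `POLICIES[classify(0.02, False)]` would return the SEV-3 policy: the on-call engineer handles it alone and the status page update is optional.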
Phase 4: Communication
User communication is the most overlooked part of incident response — and the most trust-damaging when done poorly.
The golden rule: update your status page before users ask. If users are hitting errors and your status page shows all green, the trust damage is compounded — not just "the product broke" but "they didn't even tell us."
Status page update cadence:
- First update: As soon as you've confirmed the incident is real (within 5 minutes of acknowledgment)
- Subsequent updates: Every 15–30 minutes, even if there's nothing new to report
- Resolution update: What was fixed, when, and a brief explanation
The first update doesn't need to have answers. "We are aware of an issue affecting [feature]. Our team is investigating. We will provide an update in 15 minutes." That's enough to stop users from feeling ignored.
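A small helper can take the thinking out of the first update and the cadence check. The sketch below reuses the message wording quoted above. The function names are hypothetical, and the 30-minute default is simply the upper end of the 15–30 minute cadence.

```python
from datetime import datetime, timedelta


def first_update(feature: str, next_update_minutes: int = 15) -> str:
    """Build the initial status page message from the template above."""
    return (
        f"We are aware of an issue affecting {feature}. "
        "Our team is investigating. "
        f"We will provide an update in {next_update_minutes} minutes."
    )


def update_overdue(
    last_update_at: datetime,
    now: datetime,
    max_gap: timedelta = timedelta(minutes=30),
) -> bool:
    """True if more than `max_gap` has passed since the last status update."""
    return now - last_update_at > max_gap
```

Wiring `update_overdue` into a reminder in the incident channel is one way to keep updates flowing even when there is nothing new to report.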
Phase 5: Investigation and mitigation
The goal of investigation isn't to find the root cause — that comes later in the postmortem. The goal is to find the fastest path to restoring service. Those are often different things.
Start with "what changed?" Most incidents are caused by recent changes — a deploy, a config change, a dependency update, a traffic spike. Look at your deployment history from the last 24 hours first.
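If your change history is available as data, the "what changed?" question reduces to a filter and a sort. Here is a minimal sketch, assuming a hypothetical `Change` record. How you populate it (deploy tool, config audit log, dependency bot) depends on your stack.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Change:
    kind: str          # e.g. "deploy", "config", "dependency"
    description: str
    applied_at: datetime


def recent_changes(
    changes: List[Change],
    incident_start: datetime,
    window: timedelta = timedelta(hours=24),
) -> List[Change]:
    """Changes applied in the window before the incident, newest first."""
    candidates = [
        c for c in changes
        if incident_start - window <= c.applied_at <= incident_start
    ]
    return sorted(candidates, key=lambda c: c.applied_at, reverse=True)
```

Sorting newest first puts the most likely suspect at the top of the list.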
Mitigation before root cause:
- If a deploy caused it: roll back
- If traffic spiked: scale up or rate-limit
- If a database is overloaded: identify and kill slow queries, enable connection pooling
- If a third-party service is down: switch to fallback or disable the integration
Restoring service first is almost always the right call. You can investigate why the database got overloaded after users can use the product again.
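The mitigation list above works well as a lookup table that an on-call engineer (or a chat bot) can consult under pressure. The sketch below is one possible encoding. The `Cause` names and the fallback message for an unknown cause are assumptions added for the example.

```python
from enum import Enum
from typing import Optional


class Cause(Enum):
    BAD_DEPLOY = "bad deploy"
    TRAFFIC_SPIKE = "traffic spike"
    DATABASE_OVERLOAD = "database overload"
    THIRD_PARTY_DOWN = "third-party outage"


# First mitigation per suspected cause, taken from the list above.
FIRST_MOVE = {
    Cause.BAD_DEPLOY: "roll back",
    Cause.TRAFFIC_SPIKE: "scale up or rate-limit",
    Cause.DATABASE_OVERLOAD: "identify and kill slow queries, enable connection pooling",
    Cause.THIRD_PARTY_DOWN: "switch to fallback or disable the integration",
}


def first_move(cause: Optional[Cause]) -> str:
    """Return the first mitigation to try for a suspected cause."""
    if cause is None:
        # Assumed fallback when nothing obvious has been identified yet.
        return "keep looking for what changed"
    return FIRST_MOVE[cause]
```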
Phase 6: Resolution
Resolution is when your monitors confirm the service is healthy again, not when you think you've fixed it. Wait for at least two consecutive clean checks, preferably three, before marking the incident as resolved.
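The consecutive-clean-checks rule is easy to express in code. Here is a minimal sketch, assuming check results arrive as a list of booleans ordered oldest to newest. The function name and the default of three checks are choices made for the example.

```python
from typing import Iterable


def is_resolved(check_results: Iterable[bool], required_clean: int = 3) -> bool:
    """True if the most recent `required_clean` checks all passed.

    `check_results` is ordered oldest to newest; True means the check passed.
    """
    if required_clean < 1:
        raise ValueError("required_clean must be at least 1")
    results = list(check_results)
    if len(results) < required_clean:
        return False  # not enough data yet to call it resolved
    return all(results[-required_clean:])
```

A single failed check resets the count, so a service that is still flapping never gets marked as resolved.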
When you resolve:
- Update your status page with a resolution message
- Notify your support team so they can respond to any open tickets
- Schedule the postmortem (within 48 hours)
- Brief your broader team if it was a significant incident
Phase 7: Postmortem
The postmortem is where you convert a bad day into lasting improvement. Write it within 48 hours while memories are fresh.
A good postmortem covers:
- Timeline: When did the incident start? When was it detected? When was it resolved?
- Impact: How many users were affected? What functionality was broken? For how long?
- Contributing factors: Not just "the database crashed" but why it crashed and what conditions led to that
- What went well: What worked in your response? (This matters — it tells you what to preserve)
- Action items: Specific, assigned, time-bounded tasks to prevent recurrence
Blameless postmortems are more useful than blame-assigning ones. The goal is systemic improvement, not finding who made the mistake.
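Capturing the postmortem as structured data makes it easier to compare incidents over time. The sketch below mirrors the sections listed above. The class and field names are hypothetical, and the two derived durations are one common way to summarize a timeline, not the only one.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List


@dataclass
class ActionItem:
    task: str       # specific
    owner: str      # assigned
    due: datetime   # time-bounded


@dataclass
class Postmortem:
    started_at: datetime
    detected_at: datetime
    resolved_at: datetime
    users_affected: int
    contributing_factors: List[str] = field(default_factory=list)
    went_well: List[str] = field(default_factory=list)
    action_items: List[ActionItem] = field(default_factory=list)

    @property
    def time_to_detect(self) -> timedelta:
        """How long the incident ran before detection."""
        return self.detected_at - self.started_at

    @property
    def time_to_resolve(self) -> timedelta:
        """Total duration from start to resolution."""
        return self.resolved_at - self.started_at
```

Note that nothing in this structure records who caused the incident, which keeps the record aligned with a blameless review.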
Good incident management is a skill that compounds over time. Each incident you handle well makes the next one faster and less stressful. The teams that are calmest under pressure didn't get there by luck — they ran the process enough times that it became muscle memory.