On-Call Best Practices: How to Handle Incidents Without Burning Out
Being on-call is part of owning production software. Done poorly, it destroys sleep, morale, and retention. Done well, it builds confidence and accountability. Here's what good on-call looks like.
The cost of bad on-call
Alert fatigue is real. When engineers are paged multiple times a night for things that don't require immediate action, they learn to ignore alerts. When they do, real outages go unnoticed. The fix isn't more alerts — it's better alerts.
Burnout from on-call is a common cause of attrition at engineering-heavy companies. Engineers who are consistently paged overnight, with no escalation path and no clear process, leave. The institutional knowledge leaves with them. The cycle repeats with the next hire.
The goal of on-call done right: the team is confident that when something breaks, someone will know — but they're not stressed about every alert, and they sleep most nights.
Designing alert tiers: not everything should page
The most important structural decision in on-call is separating alerts by urgency. There are three tiers:
| Tier | Definition | Response | Channel |
|---|---|---|---|
| Critical (P1) | Users affected right now | Respond in <15 min, 24/7 | Phone call / SMS |
| Warning (P2) | Degraded, not down; will worsen | Respond in <4 hours, business hours | Slack notification |
| Info (P3) | Anomaly worth knowing about | Review next business day | Email digest |
The critical mistake is routing P2 and P3 alerts through the same phone-call channel as P1 alerts. Engineers quickly learn that most pages aren't critical, and start sleeping through them — including the real ones.
When in doubt about which tier an alert belongs to, ask: "would I be upset if someone called me at 3am about this?" If the honest answer is yes, it's P2 or P3.
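As a rough illustration of the tier table, here is a minimal Python sketch of tier-based routing. The names (`Tier`, `Route`, `classify`, the channel strings) are hypothetical and not tied to any particular paging tool; the only thing taken from the table is that P1 alone reaches a channel that wakes someone.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    P1 = "critical"
    P2 = "warning"
    P3 = "info"


@dataclass(frozen=True)
class Route:
    channel: str          # where the notification goes (hypothetical names)
    wakes_someone: bool   # True only for channels that interrupt sleep


# Mirrors the tier table: only P1 reaches the phone.
ROUTES = {
    Tier.P1: Route(channel="phone", wakes_someone=True),
    Tier.P2: Route(channel="slack", wakes_someone=False),
    Tier.P3: Route(channel="email_digest", wakes_someone=False),
}


def classify(users_affected_now: bool, will_worsen: bool) -> Tier:
    """Pick a tier from two yes/no questions based on the table's definitions."""
    if users_affected_now:
        return Tier.P1
    if will_worsen:
        return Tier.P2
    return Tier.P3


def route(tier: Tier) -> Route:
    """Look up the notification route for a tier."""
    return ROUTES[tier]
```

Keeping the mapping in one place makes it easy to audit: if more than one tier ever has `wakes_someone=True`, that is worth a second look.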
Alert quality: the signal-to-noise problem
Every alert that fires and doesn't require action trains engineers to ignore alerts. This is the alert fatigue loop. The fix is ruthless alert hygiene:
- Alert on symptoms, not resource metrics. Don't alert when CPU is over 80%. Alert when error rate exceeds 1% or response time exceeds 3s: signals that users are actually affected.
- Every alert must be actionable. If the alert fires and the response is "wait and see," it's not a P1. Demote it or add a delay. A P1 alert means someone must do something right now.
- Review all alerts that fired last week. For every alert: was it a real problem? Did it require the action it triggered? If it was noise, tune or remove it.
- Add a grace period before alerting. A single failed uptime check might be a transient network blip. Three consecutive failures mean something is wrong. Configure PingBase to send an alert after multiple consecutive failures, not the first.
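Two of the rules above are easy to sketch in code: the symptom thresholds (error rate over 1%, response time over 3s) and the consecutive-failure grace period. This is a minimal Python sketch with hypothetical names; it is not how PingBase or any other tool implements these rules, only an illustration of the logic.

```python
from dataclasses import dataclass


@dataclass
class ConsecutiveFailureGate:
    """Fires only after `threshold` failed checks in a row."""
    threshold: int = 3
    streak: int = 0

    def observe(self, check_passed: bool) -> bool:
        """Record one check result; return True if an alert should fire now."""
        if check_passed:
            self.streak = 0      # any success resets the count
            return False
        self.streak += 1
        # Fire exactly once, when the streak first reaches the threshold.
        return self.streak == self.threshold


def users_affected(error_rate: float, response_time_s: float) -> bool:
    """Symptom check using the thresholds from the text: >1% errors or >3s responses."""
    return error_rate > 0.01 or response_time_s > 3.0
```

With the default threshold of three, a lone failure followed by a success never alerts, and a sustained outage alerts once rather than on every subsequent failed check.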
Runbooks: the most underrated tool in incident response
A runbook is a document that tells an on-call engineer exactly what to do when a specific alert fires. It's the difference between an engineer who knows the system cold walking through a fix, and a new team member staring at an alert at 2am with no idea where to start.
Every alert should link to a runbook. The runbook should answer:
- What does this alert mean? What broke?
- How do I verify the problem? (Which dashboard, which command, which log query)
- What are the likely causes? (In order of probability, based on past incidents)
- What are the steps to resolve each cause?
- If I can't resolve it, who do I escalate to?
- How do I communicate status to users? (Status page update template, Slack message format)
Runbooks don't need to be long. A two-paragraph runbook that covers the 80% case is infinitely better than no runbook. Start sparse and expand as incidents reveal gaps.
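One way to keep runbooks honest is a small check that flags alerts with no runbook and runbooks with unanswered questions. The sketch below is an assumption about how you might model this in Python; the section keys are hypothetical labels for the six questions listed above, not a standard schema.

```python
from dataclasses import dataclass, field

# Hypothetical keys for the six questions a runbook should answer.
REQUIRED_SECTIONS = (
    "meaning",
    "verification",
    "likely_causes",
    "resolution_steps",
    "escalation",
    "communication",
)


@dataclass
class Runbook:
    alert_name: str
    sections: dict[str, str] = field(default_factory=dict)

    def missing_sections(self) -> list[str]:
        """Return required sections that are absent or blank, in checklist order."""
        return [s for s in REQUIRED_SECTIONS if not self.sections.get(s, "").strip()]


def alerts_without_runbooks(alert_names: list[str], runbooks: list[Runbook]) -> list[str]:
    """Return the alerts that have no runbook at all."""
    covered = {rb.alert_name for rb in runbooks}
    return [name for name in alert_names if name not in covered]
```

A sparse runbook still passes the "exists" check; `missing_sections` then shows which gaps to fill after the next incident.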
On-call rotations: fair, sustainable, and explicitly documented
A good on-call rotation is:
- Time-bounded. One week on, N weeks off (where N depends on team size). The rotation should be predictable and scheduled weeks in advance.
- Fairly distributed. Everyone carries their share, including senior engineers. Having on-call fall entirely on junior engineers is not a sustainable or equitable approach.
- Shadowed for new engineers. Before going on-call solo, new team members should shadow an experienced on-caller for at least one rotation. This builds confidence and transfers knowledge.
- Compensated or traded for time off. Being woken up at night is a real cost. Either compensate financially or give time off after a heavy on-call week. Teams that don't do this lose engineers.
Tools: PagerDuty, Opsgenie, and Grafana OnCall all support on-call schedules, automatic escalation policies, and schedule overrides for vacations.
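To make "one week on, N weeks off" concrete, here is a minimal Python sketch of a strict round-robin schedule. It assumes every engineer takes an equal turn and ignores overrides and swaps, which the tools above handle for you; the function names are illustrative only.

```python
from datetime import date, timedelta


def weekly_rotation(engineers: list[str], start: date, weeks: int) -> list[tuple[date, str]]:
    """Assign one engineer per week, round-robin, beginning on `start`."""
    if not engineers:
        raise ValueError("rotation needs at least one engineer")
    return [
        (start + timedelta(weeks=i), engineers[i % len(engineers)])
        for i in range(weeks)
    ]


def weeks_off_between_shifts(team_size: int) -> int:
    """With one week on and a strict round-robin, N = team_size - 1."""
    return team_size - 1
```

For a team of three, each person is on for one week and off for two, which is a useful sanity check on whether the team is large enough to sustain the rotation.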
Incident response: the first 15 minutes
When a P1 alert fires, the first 15 minutes determine whether the incident lasts 30 minutes or 4 hours. A clear response protocol reduces chaos:
- Acknowledge the alert immediately. This stops the escalation chain. It doesn't mean you know the fix, only that you're looking at it.
- Open the incident channel. Create a dedicated Slack channel or incident room. All communication about the incident happens there. No side conversations. This creates a searchable record.
- Update the status page. Within 5 minutes, post a status page update: "We're investigating reports of [symptom]. Updates will follow." This reduces inbound support noise dramatically.
- Identify the scope. Is this affecting all users or a subset? All regions or one? Since a specific deploy or config change? These questions narrow the cause quickly.
- Establish an incident commander. One person owns the incident. Others help. Without this, two engineers might be working on conflicting mitigations simultaneously.
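The status page step is the easiest one to forget under pressure, so it can help to template it. The following Python sketch fills in the message quoted above and flags when the 5-minute window has passed; the function names are hypothetical, and posting to an actual status page is left out.

```python
from datetime import datetime, timedelta

STATUS_DEADLINE = timedelta(minutes=5)  # from the checklist: post within 5 minutes


def initial_status_update(symptom: str) -> str:
    """Fill in the first-update template quoted in the checklist."""
    return f"We're investigating reports of {symptom}. Updates will follow."


def status_update_overdue(alert_fired_at: datetime, now: datetime, posted: bool) -> bool:
    """True if no update has been posted and the 5-minute window has passed."""
    return not posted and now - alert_fired_at > STATUS_DEADLINE
```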
Post-mortems: learning from incidents
A post-mortem (or post-incident review) is a structured document written after every significant incident. Its purpose is learning, not blame.
A blameless post-mortem asks: what did the system do, given the information available to it? Not: who made a mistake? The goal is systemic improvement, not individual punishment. Systems that blame individuals create cultures where engineers hide problems.
Every post-mortem should include:
- A timeline of what happened (factual, not interpretive)
- Root cause analysis — the actual technical reason, not just "human error"
- Contributing factors — what made this easier to happen or harder to detect/fix
- Action items — concrete changes with owners and due dates
- Detection time — how long from incident start to alert
- Resolution time — how long from alert to recovery
The action items are the most important part. A post-mortem with no follow-through is just a blame document with extra steps. Assign owners, add items to the engineering backlog, and review completion at the next team meeting.
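Detection time and resolution time fall straight out of the timeline, so they are worth computing rather than estimating. A minimal Python sketch, assuming three timestamps per incident (the field names are illustrative):

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class IncidentTimeline:
    started_at: datetime    # when the problem actually began
    alerted_at: datetime    # when the first alert fired
    recovered_at: datetime  # when service was restored

    @property
    def detection_time(self) -> timedelta:
        """Incident start to alert."""
        return self.alerted_at - self.started_at

    @property
    def resolution_time(self) -> timedelta:
        """Alert to recovery."""
        return self.recovered_at - self.alerted_at
```

For example, an incident that began at 02:00, alerted at 02:12, and recovered at 02:47 has a detection time of 12 minutes and a resolution time of 35 minutes. Tracking both across incidents shows whether the weak spot is monitoring or response.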
The role of external uptime monitoring in on-call
Your internal monitoring tells you when your services are struggling. External uptime monitoring — from a tool like PingBase — tells you when users can't reach your service at all.
This is the most critical alert you can receive. If a user tries to load your app and gets a timeout, a 502, or a security warning, that's a P1 incident regardless of what your internal dashboards show.
Route uptime alerts to your primary on-call channel with the highest urgency. A PingBase alert that fires at 3am because your site is down is exactly the alert that should wake someone up. It's unambiguous: the service is unreachable. Anything that isn't affecting users right now can wait until morning.
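For intuition about what an external check does, here is a minimal Python sketch of a single probe using only the standard library. It is an illustration under simplifying assumptions, not a description of how PingBase works: it runs from one location, makes one request, and treats "no response" or any 5xx status as an outage. The function names and the 10-second timeout are assumptions.

```python
import urllib.error
import urllib.request
from typing import Optional


def probe(url: str, timeout_s: float = 10.0) -> Optional[int]:
    """Fetch `url` once. Return the HTTP status, or None if no response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code   # the server answered, but with a 4xx/5xx
    except OSError:
        return None       # covers timeouts, DNS and TLS failures, refused connections


def is_outage(status: Optional[int]) -> bool:
    """Treat no response or any 5xx as 'users cannot reach the service'."""
    return status is None or status >= 500
```

In practice you would combine `is_outage` with a consecutive-failure rule and run the probe from outside your own infrastructure, since a check that shares a network with the service can fail or succeed for the wrong reasons.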
The first alert to set up: external uptime monitoring
PingBase checks your site every minute and alerts your on-call channel the moment it goes down. Free for up to 5 monitors — setup takes 5 minutes.