
On-Call Best Practices: How to Handle Incidents Without Burning Out

Being on-call is part of owning production software. Done poorly, it destroys sleep, morale, and retention. Done well, it builds confidence and accountability. Here's what good on-call looks like.

The cost of bad on-call

Alert fatigue is real. When engineers are paged multiple times a night for things that don't require immediate action, they learn to ignore alerts. When they do, real outages go unnoticed. The fix isn't more alerts — it's better alerts.

Burnout from on-call is a leading cause of attrition at engineering-heavy companies. Engineers who are consistently paged overnight, with no escalation path and no clear process, leave. The institutional knowledge leaves with them. The cycle repeats with the next hire.

The goal of on-call done right: the team is confident that when something breaks, someone will know — but they're not stressed about every alert, and they sleep most nights.


Designing alert tiers: not everything should page

The most important structural decision in on-call is separating alerts by urgency. There are three tiers:

Tier | Definition | Response | Channel
Critical (P1) | Users affected right now | Respond in <15 min, 24/7 | Phone call / SMS
Warning (P2) | Degraded, not down; will worsen | Respond in <4 hours, business hours | Slack notification
Info (P3) | Anomaly worth knowing about | Review next business day | Email digest

The critical mistake is routing P2 and P3 alerts through the same phone-call channel as P1 alerts. Engineers quickly learn that most pages aren't critical, and start sleeping through them — including the real ones.

When in doubt about which tier an alert belongs to, ask: "would I be upset if someone called me at 3am about this?" If the honest answer is yes, it's P2 or P3.
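To make the tier split concrete, here is a minimal sketch of routing by tier. The channel names and the two yes/no questions are illustrative assumptions drawn from the tier table, not the configuration of any particular paging tool.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    P1 = "critical"
    P2 = "warning"
    P3 = "info"


@dataclass(frozen=True)
class Route:
    channel: str          # where the notification goes (placeholder names)
    respond_within: str   # response target from the tier table
    wakes_someone: bool   # only P1 is allowed to interrupt sleep


# Mirrors the tier table; adjust the channel names to your own tooling.
ROUTES = {
    Tier.P1: Route("phone_or_sms", "15 minutes, 24/7", True),
    Tier.P2: Route("slack", "4 hours, business hours", False),
    Tier.P3: Route("email_digest", "next business day", False),
}


def classify(users_affected_now: bool, will_worsen: bool) -> Tier:
    """Pick a tier from two yes/no questions about the alert."""
    if users_affected_now:
        return Tier.P1
    if will_worsen:
        return Tier.P2
    return Tier.P3


def route(users_affected_now: bool, will_worsen: bool) -> Route:
    """Look up where an alert should go based on its tier."""
    return ROUTES[classify(users_affected_now, will_worsen)]
```

The point of keeping the mapping in one place is that a P2 can never reach the phone channel by accident: the only way to page is to claim users are affected right now.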


Alert quality: the signal-to-noise problem

Every alert that fires and doesn't require action trains engineers to ignore alerts. This is the alert fatigue loop. The fix is ruthless alert hygiene:

  1. Alert on symptoms, not causes. Don't alert when CPU is over 80%. Alert when error rate exceeds 1% or response time exceeds 3s, because those indicate users are affected.
  2. Every alert must be actionable. If the alert fires and the response is "wait and see," it's not a P1. Demote it or add a delay. A P1 alert means someone must do something right now.
  3. Review all alerts that fired last week. For every alert: was it a real problem? Did it require the action it triggered? If it was noise, tune or remove it.
  4. Add a grace period before alerting. A single failed uptime check might be a transient network blip. Three consecutive failures mean something is wrong. Configure PingBase to send an alert only after multiple consecutive failures, not on the first.
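Items 1 and 4 can be sketched in a few lines. This is an illustration of the logic only, not how PingBase or any other tool implements it. The function and class names are made up for the example, and which response-time measure you compare against 3s (average, p95, and so on) is left to you.

```python
from dataclasses import dataclass


def users_affected(error_rate: float, response_time_s: float) -> bool:
    """Symptom check using the thresholds from item 1 (1% errors, 3s responses)."""
    return error_rate > 0.01 or response_time_s > 3.0


@dataclass
class ConsecutiveFailureGate:
    """Fires once after `threshold` failed checks in a row."""
    threshold: int = 3
    failures: int = 0
    alerted: bool = False

    def record(self, check_ok: bool) -> bool:
        """Record one check result; return True only when an alert should be sent."""
        if check_ok:
            # Any success resets the streak and re-arms the gate.
            self.failures = 0
            self.alerted = False
            return False
        self.failures += 1
        if self.failures >= self.threshold and not self.alerted:
            self.alerted = True
            return True
        return False
```

With a threshold of 3, a single failed check followed by a success never alerts, and a long outage alerts once rather than on every failed check.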

Runbooks: the most underrated tool in incident response

A runbook is a document that tells an on-call engineer exactly what to do when a specific alert fires. It's the difference between an engineer who knows the system cold walking through a fix, and a new team member staring at an alert at 2am with no idea where to start.

Every alert should link to a runbook. The runbook should answer:

Runbooks don't need to be long. A two-paragraph runbook that covers the 80% case is infinitely better than no runbook. Start sparse and expand as incidents reveal gaps.
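One way to enforce "every alert links to a runbook" is a small check that runs wherever your alert definitions live. The sketch below assumes a hypothetical AlertRule record with a runbook_url field; real alerting tools store this differently, so treat the shape as a placeholder.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AlertRule:
    name: str
    tier: str                          # "P1", "P2", or "P3"
    runbook_url: Optional[str] = None  # hypothetical field for this sketch


def missing_runbooks(rules: list[AlertRule]) -> list[str]:
    """Return names of alert rules with no runbook link, paging alerts first."""
    missing = [r for r in rules if not (r.runbook_url or "").strip()]
    # "P1" sorts before "P2" and "P3", so the gaps that hurt most at 2am come first.
    return [r.name for r in sorted(missing, key=lambda r: (r.tier, r.name))]


rules = [
    AlertRule("checkout-error-rate", "P1", "https://wiki.example.com/runbooks/checkout-errors"),
    AlertRule("queue-depth", "P2"),
    AlertRule("site-unreachable", "P1", ""),
]
print(missing_runbooks(rules))  # ['site-unreachable', 'queue-depth']
```

Running a check like this in CI turns "we should write a runbook" into a failing build, which tends to get runbooks written.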


On-call rotations: fair, sustainable, and explicitly documented

A good on-call rotation is:

Tools: PagerDuty, Opsgenie, and Grafana OnCall all support rotation management, automatic escalation policies, and override scheduling for vacations.
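If you want to see what those tools compute under the hood, a rotation with overrides is a small amount of logic. This is a rough sketch, assuming weekly shifts and a single primary; the function names and the one-week shift length are choices made for the example.

```python
from datetime import date, timedelta


def build_rotation(engineers, start, weeks, overrides=None):
    """Assign one primary per week, round-robin, honoring overrides.

    overrides maps a week's start date to the engineer covering it,
    for example when the scheduled person is on vacation.
    """
    if not engineers:
        raise ValueError("rotation needs at least one engineer")
    overrides = overrides or {}
    schedule = []
    for i in range(weeks):
        week_start = start + timedelta(weeks=i)
        scheduled = engineers[i % len(engineers)]
        schedule.append((week_start, overrides.get(week_start, scheduled)))
    return schedule


def shifts_per_person(schedule):
    """Count shifts per engineer so an uneven load is visible."""
    counts = {}
    for _, person in schedule:
        counts[person] = counts.get(person, 0) + 1
    return counts
```

Counting shifts after overrides are applied is the useful part: swaps are fine, but if one person keeps absorbing them, the count shows it.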


Incident response: the first 15 minutes

When a P1 alert fires, the first 15 minutes determine whether the incident lasts 30 minutes or 4 hours. A clear response protocol reduces chaos:

  1. Acknowledge the alert immediately. This stops the escalation chain. It doesn't mean you know the fix, just that you're looking at it.
  2. Open the incident channel. Create a dedicated Slack channel or incident room. All communication about the incident happens there. No side conversations. This creates a searchable record.
  3. Update the status page. Within 5 minutes, post a status page update: "We're investigating reports of [symptom]. Updates will follow." This reduces inbound support noise dramatically.
  4. Identify the scope. Is this affecting all users or a subset? All regions or one? Did it start after a specific deploy or config change? These questions narrow the cause quickly.
  5. Establish an incident commander. One person owns the incident. Others help. Without this, two engineers might be working on conflicting mitigations simultaneously.
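The first three steps lend themselves to a small model: acknowledging stops escalation, everything gets logged in one place, and the first status update follows a template. The sketch below is illustrative only. The Incident class, the escalation chain, and the 5-minute escalation step are assumptions for the example, not the behavior of a specific paging tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Incident:
    symptom: str
    opened_at: datetime
    acknowledged_at: Optional[datetime] = None
    timeline: list = field(default_factory=list)

    def log(self, when: datetime, note: str) -> None:
        """Append to the single, searchable record of the incident."""
        self.timeline.append((when, note))

    def acknowledge(self, when: datetime, who: str) -> None:
        # Acknowledging stops escalation; it does not claim a fix.
        self.acknowledged_at = when
        self.log(when, f"{who} acknowledged")


def who_to_page(incident, now, chain, step=timedelta(minutes=5)):
    """Return the next person to page, or None once the alert is acknowledged."""
    if incident.acknowledged_at is not None:
        return None
    # Move one level up the chain for every `step` that passes without an ack.
    level = max(0, int((now - incident.opened_at) / step))
    return chain[min(level, len(chain) - 1)]


def initial_status_update(incident):
    """First public message, using the wording from step 3."""
    return f"We're investigating reports of {incident.symptom}. Updates will follow."
```

Note that the status update needs only the symptom, not the cause. That is what makes it possible to post within the first 5 minutes.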

Post-mortems: learning from incidents

A post-mortem (or post-incident review) is a structured document written after every significant incident. Its purpose is learning, not blame.

A blameless post-mortem asks: what did the system do, given the information available to it? Not: who made a mistake? The goal is systemic improvement, not individual punishment. Systems that blame individuals create cultures where engineers hide problems.

Every post-mortem should include:

The action items are the most important part. A post-mortem with no follow-through is just a blame document with extra steps. Assign owners, add items to the engineering backlog, and review completion at the next team meeting.


The role of external uptime monitoring in on-call

Your internal monitoring tells you when your services are struggling. External uptime monitoring — from a tool like PingBase — tells you when users can't reach your service at all.

This is the most critical alert you can receive. If a user tries to load your app and gets a timeout, a 502, or a security warning, that's a P1 incident regardless of what your internal dashboards show.

Route uptime alerts to your primary on-call channel with the highest urgency. A PingBase alert that fires at 3am because your site is down is exactly the alert that should wake someone up. It's unambiguous: the service is unreachable. Most other alerts can wait until morning.
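For a sense of what an external check looks at, here is a minimal probe built on Python's standard library. It is a sketch of the idea, not PingBase's implementation: the result labels and the choice to treat 4xx responses as "up" (the site answered) are assumptions for the example. The opener parameter exists only so the function can be exercised without a network.

```python
import socket
import ssl
import urllib.error
import urllib.request


def probe(url: str, timeout: float = 10.0, opener=urllib.request.urlopen) -> str:
    """Return "up" or a short reason the site looks down from outside."""
    try:
        with opener(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as exc:
        status = exc.code              # 4xx/5xx responses arrive as exceptions
    except urllib.error.URLError as exc:
        if isinstance(exc.reason, ssl.SSLError):
            return "tls_error"         # what a browser shows as a security warning
        if isinstance(exc.reason, (socket.timeout, TimeoutError)):
            return "timeout"
        return "unreachable"           # DNS failure, refused connection, and so on
    except (socket.timeout, TimeoutError):
        return "timeout"               # timed out after the connection was made
    return "server_error" if status >= 500 else "up"
```

Pair a probe like this with a consecutive-failure threshold before paging, so a single dropped request from the checker's network doesn't wake anyone.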

The first alert to set up: external uptime monitoring

PingBase checks your site every minute and alerts your on-call channel the moment it goes down. Free for up to 5 monitors — setup takes 5 minutes.
