Incident Communication Best Practices for SaaS
The technical work of fixing an outage is one thing. Communicating with users while it's happening is another. Done badly, even a short incident becomes a lasting trust problem. Done well, it becomes a demonstration of your reliability culture.
Here's the uncomfortable truth about incidents: users often judge you more by how you communicate than by how quickly you fix the problem. A 30-minute outage with regular updates and a clear resolution message leaves a better impression than a 10-minute outage that users only find out about after the fact.
Incident communication is a skill. These are the practices that separate teams that handle it well from teams that make a bad situation worse.
The first rule: acknowledge fast, before you know anything
The most common incident communication mistake is waiting until you know what's wrong before posting anything. This is backwards. The purpose of the first update isn't to explain the problem — it's to show users that you're aware and working on it.
You should post a status update within 5 minutes of declaring an incident, even if all you can say is:
Example first update (2 minutes after incident starts)
"We are investigating reports of elevated error rates affecting the API. Engineers are investigating. Next update in 10 minutes."
That update does three things: it confirms you know about the problem, it tells users something is being done, and it sets an expectation for the next update. Users who see this are much less likely to open a support ticket.
What you should not do: wait 20 minutes before posting anything because you wanted to include the root cause. By then, frustrated users have already emailed support, posted on social media, and formed the impression that you don't know what's happening.
Update frequency: commit to a cadence and keep it
When you post your first update, you're making an implicit promise: "I will keep you informed." The fastest way to lose user trust during an incident is to make that promise and then go quiet.
Recommended update cadence:
- First update: Within 5 minutes of declaring the incident
- During investigation: Every 15-30 minutes, even if you have no new information
- When you've identified the problem: Immediately — don't wait for the scheduled update
- When a fix is deployed: Immediately — plus what you're watching to confirm resolution
- When resolved: Post a resolution update with a brief summary of what happened and how long users were affected
The "even if you have no new information" part is important. A brief "We continue to investigate. Engineers are working on the issue. Next update in 15 minutes." tells users that the problem hasn't been forgotten. Silence does the opposite.
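The cadence above can be turned into a simple "when is the next update due?" check. The following is a minimal sketch, not part of any particular tool: the function names are hypothetical, the 5-minute deadline comes from the list above, and the 15-minute interval is the lower end of the recommended 15-30 minute range.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Values taken from the recommended cadence; the names are chosen for this sketch.
FIRST_UPDATE_DEADLINE = timedelta(minutes=5)
INVESTIGATION_INTERVAL = timedelta(minutes=15)  # lower end of the 15-30 minute range


def next_update_due(declared_at: datetime, last_update_at: Optional[datetime]) -> datetime:
    """Return the latest time by which the next scheduled update should be posted."""
    if last_update_at is None:
        # Nothing posted yet: the first update is due 5 minutes after declaring.
        return declared_at + FIRST_UPDATE_DEADLINE
    # Otherwise the clock restarts from the most recent update.
    return last_update_at + INVESTIGATION_INTERVAL


def is_overdue(declared_at: datetime, last_update_at: Optional[datetime], now: datetime) -> bool:
    """True when the promised update time has already passed."""
    return now > next_update_due(declared_at, last_update_at)


# Example values only; the date and times are arbitrary.
declared = datetime(2024, 1, 1, 14, 30, tzinfo=timezone.utc)
first_post = declared + timedelta(minutes=2)
print(next_update_due(declared, None))        # first update deadline
print(next_update_due(declared, first_post))  # next scheduled update
```

A check like this only covers the scheduled updates. Identifying the problem or deploying a fix should still trigger an immediate post, regardless of what the timer says.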
What to include in each update
Every status update should answer four questions, adapted to what you know at that moment:
- What is affected? Be specific. "Users attempting to log in" is better than "some users." "The API is returning 503 errors" is better than "there are issues."
- What is not affected? If you know certain features are unaffected, say so. "Existing sessions are unaffected. Users who are already logged in can continue to use the service normally." This immediately narrows the scope of the problem in users' minds.
- What are you doing about it? Even a vague answer is better than nothing. "Engineers are investigating the root cause." "We have identified the issue and are deploying a fix."
- When is the next update? Always commit to the next update time. This prevents the anxiety of "are they still working on it?"
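One way to make sure no update goes out with an element missing is to model the four questions as fields and refuse to render an incomplete update. This is an illustrative sketch: the `StatusUpdate` class and its field names are hypothetical, and treating "what is not affected" as optional is an assumption based on the "if you know" wording above.

```python
from dataclasses import dataclass


@dataclass
class StatusUpdate:
    affected: str          # What is affected?
    action: str            # What are you doing about it?
    next_update: str       # When is the next update?
    unaffected: str = ""   # What is not affected? Optional: fill in only when known.

    def missing_fields(self) -> list:
        """Return the names of required fields that are empty or whitespace."""
        required = ("affected", "action", "next_update")
        return [name for name in required if not getattr(self, name).strip()]

    def render(self) -> str:
        """Join the elements into one message, or fail if a required one is missing."""
        missing = self.missing_fields()
        if missing:
            raise ValueError("update is missing: " + ", ".join(missing))
        parts = [self.affected, self.unaffected, self.action, self.next_update]
        return " ".join(part.strip() for part in parts if part.strip())


update = StatusUpdate(
    affected="Users attempting to log in are seeing errors.",
    unaffected="Existing sessions are unaffected.",
    action="Engineers are investigating the root cause.",
    next_update="Next update in 15 minutes.",
)
print(update.render())
```

The check only confirms that each element is present. Whether the wording is specific enough is still a human judgment.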
Good incident update — all four elements
"We have identified the root cause: a database configuration change deployed at 14:30 UTC is causing connection pool exhaustion. Users attempting to create new records are experiencing errors. Read operations are unaffected. We are rolling back the configuration change now. We expect full recovery within 10 minutes. Next update at 15:20 UTC."
Poor incident update — vague and unhelpful
"We are aware of issues and working to resolve them. We apologize for any inconvenience."
Tone: direct and factual, not apologetic and hedged
Incident updates are not the place for corporate hedging language. Phrases like "some users may be experiencing" and "intermittent issues with certain functionality" read as evasion. Users know something is wrong — they're reading your status page because they're affected. Speak clearly.
Things to avoid:
- "We apologize for any inconvenience" — Hollow. Say what you're doing instead.
- "Some users may be experiencing" — If you're posting an incident update, users are experiencing it. Own it.
- "Intermittent issues" — Users don't experience "intermittent." They experience "it was broken when I tried to use it." Say what's broken.
- Technical jargon your users don't understand — "We are experiencing elevated p99 latency in our us-east Kubernetes pods" means nothing to most users. Translate: "The API is responding slowly, causing some requests to time out."
- Promises you can't keep — Don't say "we expect to resolve this within 30 minutes" unless you're confident. Setting expectations you miss is worse than not setting them.
The right tone is direct, factual, and calm. You're a professional handling a technical problem. Acknowledge the impact, explain what's happening in plain language, and communicate what you're doing about it.
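A draft update can be checked for the phrases listed above before it is posted. The sketch below is a deliberately simple, hypothetical helper: the phrase list is taken from the "things to avoid" in this section, and plain case-insensitive substring matching is an assumption that will miss reworded variants.

```python
# Phrases taken from the "things to avoid" list; extend with your own.
HEDGING_PHRASES = (
    "we apologize for any inconvenience",
    "some users may be experiencing",
    "intermittent issues",
)


def find_hedging(text: str) -> list:
    """Return the hedging phrases that appear in a draft update (case-insensitive)."""
    lowered = text.lower()
    return [phrase for phrase in HEDGING_PHRASES if phrase in lowered]


draft = "We are aware of issues and working to resolve them. We apologize for any inconvenience."
for phrase in find_hedging(draft):
    print("Consider rewriting:", phrase)
```

Jargon and promises you cannot keep are harder to detect mechanically, so a check like this complements a quick read-through rather than replacing it.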
The resolution update: close the loop properly
Many teams do the hard work of communicating during an incident and then drop the ball on the resolution update. They mark the incident as resolved without explanation, or they post "Issue resolved" with no context.
The resolution update is the last impression users have of how you handled the incident. Make it count:
Good resolution update
"Resolved. The API is fully operational as of 15:18 UTC. Total incident duration: 48 minutes. Root cause: a database configuration change introduced at 14:30 UTC caused connection pool exhaustion under normal load. We rolled back the configuration at 15:15 UTC and have confirmed full recovery. We are reviewing our deployment process to prevent similar configuration changes from reaching production without additional verification. We'll post a full postmortem within 24 hours."
This tells users: what happened, how long it lasted, what caused it, what fixed it, and what you're doing to prevent recurrence. That final element — "what you're doing to prevent recurrence" — is important. It shows that you learned from the incident and are taking it seriously.
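The "total incident duration" figure is worth computing rather than estimating under pressure. A minimal sketch using the times from the example above, assuming impact began when the configuration change was deployed at 14:30 UTC; the function name is hypothetical and the calendar date is arbitrary.

```python
from datetime import datetime, timezone


def incident_duration_minutes(started_at: datetime, resolved_at: datetime) -> int:
    """Whole minutes between the start of impact and confirmed recovery."""
    return int((resolved_at - started_at).total_seconds() // 60)


# Times from the example resolution update; the date is a placeholder.
started = datetime(2024, 1, 1, 14, 30, tzinfo=timezone.utc)
resolved = datetime(2024, 1, 1, 15, 18, tzinfo=timezone.utc)
print(incident_duration_minutes(started, resolved))  # 48
```

Keeping both timestamps in UTC, as the example updates do, avoids off-by-an-hour errors when team members are in different time zones.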
Using PingBase's incident timeline
PingBase's status page includes an incident timeline — a chronological log of all updates posted during an incident. When you post updates from your PingBase dashboard, they appear on the public status page in real time with timestamps.
How to use it effectively during an incident:
- Post updates directly from the dashboard or API. You don't need to be at a computer: PingBase's mobile-responsive dashboard lets you post updates from your phone during an incident.
- Use the incident status states correctly. PingBase supports: Investigating, Identified, Monitoring, and Resolved. Move through these states as your understanding develops. Users can see the progression and know you're moving toward resolution.
- Share the status page URL proactively. When an incident starts, post the status page URL in your product's error messages, in your Slack community, and on Twitter. The more users who find the status page, the fewer who send support tickets.
- Let the timeline stand as your postmortem for smaller incidents. For minor incidents, the timeline of updates is often enough context. For major incidents, link to a full postmortem document from the resolution update.
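The four states form a natural progression, and it can help to model that progression in your own tooling or runbooks. The sketch below is a local model only and does not call or describe the PingBase API. The state names come from the list above, while the rule that an incident may only move forward (skipping states is allowed, moving backward is not) is an assumption made for illustration.

```python
from enum import IntEnum


class IncidentState(IntEnum):
    # Ordered so that a later stage compares greater than an earlier one.
    INVESTIGATING = 1
    IDENTIFIED = 2
    MONITORING = 3
    RESOLVED = 4


def can_transition(current: IncidentState, target: IncidentState) -> bool:
    """Assumed rule: only forward moves are valid; skipping a stage is allowed."""
    return target > current


def advance(current: IncidentState, target: IncidentState) -> IncidentState:
    """Return the new state, or raise if the move would go backward or stay put."""
    if not can_transition(current, target):
        raise ValueError(f"cannot move from {current.name} to {target.name}")
    return target


state = IncidentState.INVESTIGATING
state = advance(state, IncidentState.IDENTIFIED)
state = advance(state, IncidentState.MONITORING)
print(state.name)  # MONITORING
```

If a fix fails and the problem returns, a stricter or looser rule may suit your team better, for example allowing a move from Monitoring back to Identified.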
Preparing before incidents happen
The best time to build your incident communication process is before you have an incident. A few things to set up in advance:
- Template your updates. Have a template ready for first acknowledgment, investigation updates, and resolution. Filling in specifics during a stressful incident is easier than writing from scratch.
- Know who posts updates. In a team, designate one person as the incident communicator — separate from the engineers investigating the problem. The communicator writes updates while engineers fix the problem. Mixing these roles slows both down.
- Set up alert channels so you know about incidents before users do. PingBase sends alerts via email, Slack, Discord, or webhook when a monitor goes down. Configure these so you're not finding out about incidents from Twitter.
- Know your status page URL and share it early. Put your status page URL in your app's error messages, in your documentation, and in your onboarding emails. Users should know where to go before they need it.
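The update templates mentioned above can be as simple as a few strings with named placeholders. A minimal sketch using Python's standard `string.Template`: the template wording is adapted from the examples in this article, and the template keys and placeholder names are hypothetical.

```python
from string import Template

# One template per update type; the wording is adapted from the examples in this article.
TEMPLATES = {
    "acknowledgment": Template(
        "We are investigating reports of $symptom affecting $component. "
        "Next update in $minutes minutes."
    ),
    "investigation": Template(
        "We continue to investigate. Engineers are working on the issue. "
        "Next update in $minutes minutes."
    ),
    "resolution": Template(
        "Resolved. $component is fully operational as of $resolved_at UTC. "
        "Total incident duration: $duration minutes. Root cause: $root_cause"
    ),
}


def fill(kind: str, **fields) -> str:
    """Fill a template; substitute() raises KeyError if a placeholder is left unset."""
    return TEMPLATES[kind].substitute(**fields)


print(fill("acknowledgment", symptom="elevated error rates", component="the API", minutes=10))
```

Using `substitute()` rather than `safe_substitute()` is a deliberate choice here: a missing value fails loudly instead of publishing an update with a raw `$placeholder` in it.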
The long-term compounding effect
Every incident is an opportunity. Not a good one — you'd rather not have incidents — but an opportunity nonetheless. A team that communicates well during an outage comes out of it with users who trust them more, not less. "They kept us informed the whole time and explained exactly what happened" is a statement that people make about services they stay with.
Teams that go silent during incidents, or post vague non-updates, lose trust that's hard to rebuild. The technical quality of your service matters. So does the way you treat users when things go wrong.
Set up your incident timeline
PingBase status pages include incident management with a public timeline. Free to get started, no credit card required.