The Modern DevOps Monitoring Stack: Tools and Best Practices
Monitoring isn't a single tool — it's a stack. Uptime checks, infrastructure metrics, application traces, logs, and alerting each solve a different piece of the observability puzzle. Here's how to build it without the complexity spiral.
Why "just use one tool" doesn't work
There's a temptation to find the single monitoring platform that does everything. A few enterprise tools claim to — but at a price that assumes you have a team of 200+ engineers and a $200k/year observability budget. For most teams, the right answer is a focused stack of purpose-built tools that each do one thing well.
The modern DevOps monitoring stack has four distinct layers, each answering a different question:
- External uptime monitoring — Is the service reachable from the outside world?
- Infrastructure metrics — How is the underlying hardware/cloud performing?
- Application performance monitoring (APM) — Where are the bottlenecks inside the code?
- Log aggregation — What happened, in what order, when something went wrong?
Alerting sits across all four layers. Let's go through each one.
Layer 1: External uptime monitoring
Uptime monitoring checks your service from outside your infrastructure. It's the only layer that catches the problems your users see before you do: a crashed server, a failed DNS record, an expired SSL certificate causing browser security warnings.
This is your first line of defense. Before anything else, you need external checks running on a short interval — every 1–5 minutes — with alerts going somewhere that will actually wake someone up.
What to monitor externally:
- Your primary domain and any subdomains with public traffic
- Critical API endpoints (health check, authentication, payment flows)
- SSL certificate expiry for all public-facing domains
- DNS resolution — particularly if you use a custom DNS provider
Tools: PingBase, Better Uptime, UptimeRobot. PingBase checks from multiple locations simultaneously, which reduces false positives caused by regional network issues.
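To make two of these checks concrete, here is a minimal Python sketch of certificate expiry arithmetic and of multi-location confirmation. It is not how any of the tools above is implemented. The function names, the location names, and the default quorum of 2 are assumptions chosen for illustration, and the sketch assumes Python 3.9 or later.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days remaining before a certificate's notAfter timestamp.

    `not_after` uses the format returned by
    ssl.SSLSocket.getpeercert()["notAfter"], e.g. "Jun  1 12:00:00 2030 GMT".
    `now` must be a timezone-aware datetime.
    """
    expires_epoch = ssl.cert_time_to_seconds(not_after)
    return (expires_epoch - now.timestamp()) / 86400


def confirmed_down(results: dict[str, bool], quorum: int = 2) -> bool:
    """Report an outage only if at least `quorum` locations failed.

    `results` maps a (hypothetical) probe location to True when the check
    succeeded. A single failing region is treated as a likely network blip.
    """
    failures = sum(1 for ok in results.values() if not ok)
    return failures >= quorum


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(days_until_expiry("Jun  1 12:00:00 2030 GMT", now))
    print(confirmed_down({"us-east": False, "eu-west": True, "ap-south": True}))
```

Requiring agreement between locations trades a slightly slower detection for fewer pages caused by a single regional network problem.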
Layer 2: Infrastructure metrics
Infrastructure metrics tell you how your servers, containers, and cloud resources are performing. CPU, memory, disk I/O, network throughput — these are the vital signs of your underlying compute.
For most teams, this means one of:
- Prometheus + Grafana — the open-source default. Self-hosted, highly configurable, steep learning curve but very powerful. Prometheus scrapes metrics from your services; Grafana visualizes them.
- Datadog / New Relic — managed SaaS. Agent-based collection, built-in dashboards, expensive at scale but fast to get started.
- Cloud-native tools — AWS CloudWatch, Google Cloud Monitoring, Azure Monitor. Good if you're fully on one cloud and want the path of least resistance.
The key service-level metrics to instrument from day one are request rate, error rate, and latency (the RED method: Rate, Errors, Duration). For the infrastructure itself, track CPU utilization, memory pressure, disk saturation, and network errors.
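As a rough illustration of the three RED numbers, the sketch below derives them from a window of request records. The `Request` type, the choice to count only 5xx responses as errors, and the nearest-rank 95th percentile are assumptions made for the example. In practice a metrics system such as Prometheus would compute these from counters and histograms.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int         # HTTP status code of the response
    duration_ms: float  # time taken to serve the request


def red_summary(requests: list[Request], window_seconds: float) -> dict[str, float]:
    """Rate, error ratio, and p95 duration for one observation window."""
    if not requests:
        return {"rate_per_s": 0.0, "error_ratio": 0.0, "p95_ms": 0.0}

    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank percentile: ceil(0.95 * n), computed with integers
    # to avoid floating-point rounding surprises.
    rank = (95 * len(durations) + 99) // 100
    errors = sum(1 for r in requests if r.status >= 500)

    return {
        "rate_per_s": len(requests) / window_seconds,
        "error_ratio": errors / len(requests),
        "p95_ms": durations[rank - 1],
    }
```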
Layer 3: Application performance monitoring
APM tools instrument your code to capture traces — a record of everything that happened during a single request: which functions were called, how long each took, and what queries were run.
Distributed tracing is especially valuable in microservice architectures, where a single user request might touch 5–10 services. Without tracing, pinpointing which service is causing a slowdown requires guesswork.
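To show why trace data removes that guesswork, here is a small sketch that takes the spans of one request and works out how much time each service spent on its own work. The `Span` shape is a simplified assumption, not the data model of any particular tracing tool, and the calculation assumes child spans run one after another rather than concurrently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]  # None for the root span of the request
    service: str
    duration_ms: float


def self_time_by_service(spans: list[Span]) -> dict[str, float]:
    """Time each service spent itself, excluding time spent in child spans."""
    # Sum the durations of the direct children of every span.
    child_total: dict[str, float] = {}
    for span in spans:
        if span.parent_id is not None:
            child_total[span.parent_id] = (
                child_total.get(span.parent_id, 0.0) + span.duration_ms
            )

    # Self time = own duration minus time attributed to direct children.
    totals: dict[str, float] = {}
    for span in spans:
        own = span.duration_ms - child_total.get(span.span_id, 0.0)
        totals[span.service] = totals.get(span.service, 0.0) + own
    return totals


if __name__ == "__main__":
    trace = [
        Span("a", None, "gateway", 500.0),
        Span("b", "a", "auth", 40.0),
        Span("c", "a", "orders", 430.0),
        Span("d", "c", "database", 400.0),
    ]
    times = self_time_by_service(trace)
    print(max(times, key=times.get))  # service with the most self time
```

In this made-up trace the gateway span is the longest overall, but most of the time is spent further down the call chain, which is the kind of detail that is hard to see without tracing.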
Common APM tools:
| Tool | Best for | Cost model |
|---|---|---|
| Sentry | Error tracking + basic performance | Free tier, then per event |
| Honeycomb | High-cardinality distributed tracing | Per event volume |
| Jaeger / Tempo | Open-source distributed tracing | Self-hosted, free |
| Datadog APM | Full-stack in one platform | Per host/month, expensive |
For small teams: Sentry covers most error tracking needs with minimal setup. As you scale, add OpenTelemetry instrumentation to your services so you can swap backends without re-instrumenting.
Layer 4: Log aggregation
Logs are the narrative of what your system did. Metrics tell you that something went wrong; logs tell you exactly what happened, when, and why.
At small scale, you can get by with reading logs directly from servers or your cloud provider's log viewer. But as soon as you have more than one server — or more than a few services — you need a centralized log aggregation system.
Common choices:
- Loki + Grafana — open-source, lightweight, designed for log aggregation without indexing everything. Works well with Prometheus/Grafana stacks.
- Elasticsearch + Kibana (the core of the ELK stack, usually with Logstash for ingestion) — full-text indexing, powerful querying, resource-intensive to run yourself.
- Logtail / Axiom — managed log platforms with generous free tiers and good DX.
- CloudWatch Logs / Google Cloud Logging — if you're on a single cloud, native log aggregation with no extra setup.
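Whichever backend you choose, centralized logs are generally easier to search when each line is structured rather than free text. The sketch below emits one JSON object per log line using only Python's standard `logging` module. The field names and the `request_id` attribute are assumptions for illustration, not a schema required by any of the tools listed above.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # `request_id` is a hypothetical field supplied via `extra=`.
        request_id = getattr(record, "request_id", None)
        if request_id is not None:
            payload["request_id"] = request_id
        return json.dumps(payload)


def build_logger(name: str, stream) -> logging.Logger:
    """Create a logger that writes JSON lines to `stream`."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()   # avoid duplicate handlers on repeated calls
    logger.propagate = False  # keep output out of the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    log = build_logger("checkout", sys.stdout)
    log.error("payment failed", extra={"request_id": "req-123"})
```

Carrying a shared identifier such as a request ID on every line is one way to reconstruct the order of events across services once logs are in one place.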
Alerting: the glue between layers
Every layer generates signals. The job of alerting is to route the right signals to the right people without creating alert fatigue — the state where so many alerts fire that engineers start ignoring them.
Best practices for alert design:
- Alert on symptoms, not causes. Alert when users are affected (high error rate, high latency, site down), not when a metric crosses an arbitrary threshold (CPU > 80%).
- Every alert needs a runbook. An alert that fires without a clear action is noise. Write down what to do when each alert fires.
- Tier your urgency. Not every problem needs to wake someone at 3am. Use severity levels: critical (page immediately), warning (notify during business hours), info (log for review).
- Set error budgets, not fixed thresholds. A 1% error rate at 100 requests/day is one failed request; at 1M requests/day it is 10,000. Size your alerts accordingly.
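The error budget point can be sketched in a few lines. The example below assumes a success-rate target (SLO) of 99.9% and pages only when the budget is burning quickly and enough requests have actually failed. The threshold values are illustrative assumptions, not recommendations.

```python
def burn_rate(failed: int, total: int, slo: float) -> float:
    """How fast the error budget is being used up.

    1.0 means errors arrive exactly at the rate the SLO allows;
    higher values mean the budget will run out early.
    """
    if total == 0:
        return 0.0
    allowed_error_ratio = 1.0 - slo
    return (failed / total) / allowed_error_ratio


def should_page(
    failed: int,
    total: int,
    slo: float = 0.999,          # assumed target: 99.9% of requests succeed
    max_burn_rate: float = 5.0,  # assumed paging threshold
    min_failed: int = 10,        # ignore windows with only a handful of failures
) -> bool:
    """Page only when the burn is fast and backed by real failure volume."""
    return failed >= min_failed and burn_rate(failed, total, slo) >= max_burn_rate


if __name__ == "__main__":
    print(should_page(failed=1, total=100))               # low traffic, one failure
    print(should_page(failed=10_000, total=1_000_000))    # same 1% rate at high traffic
```

Both calls see the same 1% error rate, but only the high-traffic case has enough failures behind it to justify waking someone up.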
Alert routing tools: PagerDuty, Opsgenie, and the alert routing built into Grafana and Datadog all work well. Start simple: email and Slack for most alerts, phone calls for critical incidents.
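A simple routing policy along these lines might look like the following sketch. The channel names and the decision to defer out-of-hours warnings are assumptions for illustration. Tools such as PagerDuty or Opsgenie express the same idea through their own configuration rather than code like this.

```python
from enum import Enum


class Severity(Enum):
    CRITICAL = "critical"  # page immediately
    WARNING = "warning"    # notify during business hours
    INFO = "info"          # log for review


# Hypothetical channel names; map these to real integrations in practice.
ROUTES = {
    Severity.CRITICAL: ["phone", "slack"],
    Severity.WARNING: ["slack", "email"],
    Severity.INFO: ["log"],
}


def channels_for(severity: Severity, business_hours: bool) -> list[str]:
    """Pick notification channels for an alert."""
    if severity is Severity.WARNING and not business_hours:
        # Hold warnings until the next working day instead of waking anyone.
        return ["deferred"]
    return list(ROUTES[severity])


if __name__ == "__main__":
    print(channels_for(Severity.CRITICAL, business_hours=False))
    print(channels_for(Severity.WARNING, business_hours=False))
```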
A minimal stack for a small team
You don't need everything at once. Here's a pragmatic starting point that covers the essentials without drowning in tooling:
- PingBase — external uptime monitoring, SSL checks, status page. Free tier covers 5 monitors. Takes 5 minutes to set up.
- Sentry — error tracking and basic performance. Free tier is generous. Add the SDK to your app and you're done.
- Cloud-native logs — use your cloud provider's built-in logging for now. Centralize later when you have the volume to justify it.
- Grafana Cloud — free hosted Prometheus + Grafana for infrastructure metrics. Enough for most small teams.
This stack costs nearly nothing, covers all four layers, and can be set up in a day. Expand each layer as your needs grow.
Start with the layer that matters most
External uptime monitoring is the first thing to set up — it's the only tool that catches what your users see. PingBase takes 5 minutes to configure and monitors every minute.