DevOps 9 min read

The Modern DevOps Monitoring Stack: Tools and Best Practices

Monitoring isn't a single tool — it's a stack. Uptime checks, infrastructure metrics, application traces, logs, and alerting each solve a different piece of the observability puzzle. Here's how to build it without the complexity spiral.

Why "just use one tool" doesn't work

There's a temptation to find the single monitoring platform that does everything. A few enterprise tools claim to — but at a price that assumes you have a team of 200+ engineers and a $200k/year observability budget. For most teams, the right answer is a focused stack of purpose-built tools that each do one thing well.

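The modern DevOps monitoring stack has four distinct layers, each answering a different question:

  1. External uptime monitoring: can users reach the service right now?
  2. Infrastructure metrics: how are the servers, containers, and cloud resources performing?
  3. Application performance monitoring: what happened during a single request, and where did the time go?
  4. Log aggregation: what exactly happened, when, and why?

Alerting sits across all four layers. Let's go through each one.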


Layer 1: External uptime monitoring

Uptime monitoring checks your service from outside your infrastructure. It's the only layer that catches the problems your users see before you do: a crashed server, a failed DNS record, an expired SSL certificate causing browser security warnings.

This is your first line of defense. Before anything else, you need external checks running on a short interval — every 1–5 minutes — with alerts going somewhere that will actually wake someone up.
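One of those external checks, certificate expiry, comes down to simple date arithmetic. Here is a minimal sketch, assuming you already have the certificate's `notAfter` string (for example from `ssl.SSLSocket.getpeercert()["notAfter"]`). The function names and the 14-day warning window are illustrative assumptions, not any tool's defaults.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days remaining before a certificate's notAfter timestamp.

    `not_after` uses the format found in getpeercert(),
    e.g. "Jan 11 00:00:00 2030 GMT".
    """
    expires_at = ssl.cert_time_to_seconds(not_after)  # seconds since epoch (UTC)
    return (expires_at - now.timestamp()) / 86400


def should_warn(not_after: str, now: datetime, warn_days: int = 14) -> bool:
    """True when the certificate expires within `warn_days` (assumed window)."""
    return days_until_expiry(not_after, now) <= warn_days


if __name__ == "__main__":
    now = datetime(2030, 1, 1, tzinfo=timezone.utc)
    print(days_until_expiry("Jan 11 00:00:00 2030 GMT", now))
```

Passing `now` in explicitly keeps the check easy to test; in a real check you would pass `datetime.now(timezone.utc)`.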

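What to monitor externally: at a minimum, the failure modes described above, meaning the availability of your public endpoints, DNS resolution, and SSL certificate expiry.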

Tools: PingBase, Better Uptime, UptimeRobot. PingBase checks from multiple locations simultaneously, which eliminates false positives from regional network issues.
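To see why multi-location checks cut false positives, consider a simple quorum rule: only declare an outage when more than half of the probe locations fail. This is a sketch of the general idea, not PingBase's actual algorithm; the location names and the 50% quorum are assumptions.

```python
from typing import Mapping


def confirmed_down(results: Mapping[str, bool], quorum: float = 0.5) -> bool:
    """Decide whether a service is down based on checks from several locations.

    `results` maps a location name to True if its check succeeded.
    The service counts as down only when the share of failing
    locations is strictly greater than `quorum`.
    """
    if not results:
        return False  # no data: do not alert on nothing
    failures = sum(1 for ok in results.values() if not ok)
    return failures / len(results) > quorum


if __name__ == "__main__":
    # One region fails, two succeed: likely a regional network issue.
    print(confirmed_down({"us-east": False, "eu-west": True, "ap-south": True}))
```

A single failing region stays below the quorum, so nobody gets paged for a routing blip between one probe and your server.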


Layer 2: Infrastructure metrics

Infrastructure metrics tell you how your servers, containers, and cloud resources are performing. CPU, memory, disk I/O, network throughput — these are the vital signs of your underlying compute.

For most teams, this means one of:

The key service-level metrics to instrument from day one are request rate, error rate, and latency (the RED method: Rate, Errors, Duration). For the infrastructure itself: CPU utilization, memory pressure, disk saturation, and network errors.
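As a rough illustration of the RED method, the sketch below computes the three numbers from a window of request records. The `Request` type, the choice to count only 5xx responses as errors, and the use of a nearest-rank p95 for latency are assumptions made for the example; a metrics system such as Prometheus would normally do this aggregation for you.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class Request:
    status: int         # HTTP status code
    duration_ms: float  # time taken to serve the request


def red_summary(requests: Sequence[Request], window_seconds: float) -> Dict[str, float]:
    """Rate, Errors, Duration for one observation window."""
    if not requests:
        return {"rate_per_s": 0.0, "error_rate": 0.0, "p95_ms": 0.0}
    durations = sorted(r.duration_ms for r in requests)
    errors = sum(1 for r in requests if r.status >= 500)
    # Nearest-rank 95th percentile: ceil(0.95 * n), done in integer arithmetic.
    rank = max(1, (95 * len(durations) + 99) // 100)
    return {
        "rate_per_s": len(requests) / window_seconds,
        "error_rate": errors / len(requests),
        "p95_ms": durations[rank - 1],
    }


if __name__ == "__main__":
    sample = [Request(500 if i < 2 else 200, float(i + 1)) for i in range(20)]
    print(red_summary(sample, window_seconds=10))
```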


Layer 3: Application performance monitoring

APM tools instrument your code to capture traces — a record of everything that happened during a single request: which functions were called, how long each took, and what queries were run.

Distributed tracing is especially valuable in microservice architectures, where a single user request might touch 5–10 services. Without tracing, pinpointing which service is causing a slowdown requires guesswork.
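Here is a small sketch of what tracing buys you. Given the spans of one trace, it works out each service's "self time" (its own duration minus the time spent in child spans) and names the service where the request actually spent its time. The `Span` shape is a simplified assumption, not the schema of any particular tracing backend, and the subtraction assumes child spans run one after another rather than in parallel.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Optional, Sequence


@dataclass(frozen=True)
class Span:
    span_id: str
    parent_id: Optional[str]  # None for the root span
    service: str
    duration_ms: float


def self_time_by_service(spans: Sequence[Span]) -> Dict[str, float]:
    """Sum, per service, the time not accounted for by child spans."""
    child_time: Dict[str, float] = defaultdict(float)
    for span in spans:
        if span.parent_id is not None:
            child_time[span.parent_id] += span.duration_ms
    totals: Dict[str, float] = defaultdict(float)
    for span in spans:
        # Clamp at zero in case children overlap (parallel calls).
        totals[span.service] += max(0.0, span.duration_ms - child_time[span.span_id])
    return dict(totals)


def slowest_service(spans: Sequence[Span]) -> str:
    totals = self_time_by_service(spans)
    return max(totals, key=totals.get)


if __name__ == "__main__":
    trace = [
        Span("a", None, "gateway", 500.0),
        Span("b", "a", "auth", 50.0),
        Span("c", "a", "orders", 400.0),
        Span("d", "c", "db", 350.0),
    ]
    print(slowest_service(trace))
```

In this made-up trace the gateway span is the longest, but almost all of its time is spent waiting on downstream calls, which is exactly the kind of thing that is hard to see without traces.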

Common APM tools:

Tool | Best for | Cost model
Sentry | Error tracking + basic performance | Free tier, then per event
Honeycomb | High-cardinality distributed tracing | Per event volume
Jaeger / Tempo | Open-source distributed tracing | Self-hosted, free
Datadog APM | Full-stack in one platform | Per host/month, expensive

For small teams: Sentry covers most error tracking needs with minimal setup. As you scale, add OpenTelemetry instrumentation to your services so you can swap backends without re-instrumenting.


Layer 4: Log aggregation

Logs are the narrative of what your system did. Metrics tell you that something went wrong; logs tell you exactly what happened, when, and why.

At small scale, you can get by with reading logs directly from servers or your cloud provider's log viewer. But as soon as you have more than one server — or more than a few services — you need a centralized log aggregation system.
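Centralized logs are far easier to search when every line is structured. The sketch below emits one JSON object per log line using only Python's standard `logging` module. The field names (`service`, `request_id`) are assumptions chosen for illustration; use whatever fields your aggregation system indexes.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Optional context passed via `extra=...` (assumed field names).
            "service": getattr(record, "service", None),
            "request_id": getattr(record, "request_id", None),
        }
        return json.dumps(payload)


def build_logger(name: str, stream=None) -> logging.Logger:
    """Create a logger that writes JSON lines to `stream` (stderr by default)."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger("checkout", sys.stdout)
    log.info("payment failed", extra={"service": "checkout", "request_id": "abc123"})
```

With a shared `request_id` on every line, you can pull up everything one request did across services with a single query.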

Common choices:


Alerting: the glue between layers

Every layer generates signals. The job of alerting is to route the right signals to the right people without creating alert fatigue — the state where so many alerts fire that engineers start ignoring them.

Best practices for alert design:

  1. Alert on symptoms, not causes. Alert when users are affected (high error rate, high latency, site down), not when a metric crosses an arbitrary threshold (CPU > 80%).
  2. Every alert needs a runbook. An alert that fires without a clear action is noise. Write down what to do when each alert fires.
  3. Tier your urgency. Not every problem needs to wake someone at 3am. Use severity levels: critical (page immediately), warning (notify during business hours), info (log for review).
  4. Set error budgets, not fixed thresholds. A 1% error rate at 100 requests/day is a single failed request; at 1M requests/day it is 10,000. Size your alerts accordingly.
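The error budget idea from the last item can be sketched in a few lines. The burn rate is the observed error rate divided by the error rate your SLO allows, so a value above 1 means you are spending budget faster than planned. The 99.9% target, the minimum of 10 failures, and the 2x burn threshold are illustrative assumptions, not recommended values.

```python
def burn_rate(failed: int, total: int, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    if total == 0:
        return 0.0
    allowed_error_rate = 1.0 - slo_target
    return (failed / total) / allowed_error_rate


def should_page(
    failed: int,
    total: int,
    slo_target: float = 0.999,    # assumed SLO: 99.9% of requests succeed
    min_failures: int = 10,       # ignore tiny samples
    burn_threshold: float = 2.0,  # page when burning budget at 2x or more
) -> bool:
    """Page only when enough requests failed AND the budget is burning fast."""
    return failed >= min_failures and burn_rate(failed, total, slo_target) >= burn_threshold


if __name__ == "__main__":
    print(should_page(failed=1, total=100))               # 1% of a tiny sample
    print(should_page(failed=10_000, total=1_000_000))    # 1% of real traffic
```

Both calls see the same 1% error rate, but only the second represents enough failed requests to justify waking someone up.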

Alert routing tools: PagerDuty, Opsgenie, and the alert routing built into Grafana and Datadog all work well. Start simple: email and Slack for most alerts, phone calls for critical incidents.
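The "start simple" routing described above fits in a small lookup. This sketch maps the three severity levels to channels; the channel names and the hold-until-morning behavior for warnings are assumptions for illustration, and a tool like PagerDuty or Opsgenie would express the same policy in its own configuration.

```python
from enum import Enum
from typing import List


class Severity(Enum):
    CRITICAL = "critical"  # page immediately
    WARNING = "warning"    # notify during business hours
    INFO = "info"          # log for review


def channels_for(severity: Severity, business_hours: bool) -> List[str]:
    """Pick notification channels for an alert (hypothetical channel names)."""
    if severity is Severity.CRITICAL:
        return ["phone", "slack"]
    if severity is Severity.WARNING:
        # Outside business hours, hold the alert instead of waking anyone.
        return ["slack", "email"] if business_hours else ["hold_until_business_hours"]
    return ["log"]


if __name__ == "__main__":
    print(channels_for(Severity.WARNING, business_hours=False))
```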


A minimal stack for a small team

You don't need everything at once. Here's a pragmatic starting point that covers the essentials without drowning in tooling:

  1. PingBase — external uptime monitoring, SSL checks, status page. Free tier covers 5 monitors. Takes 5 minutes to set up.
  2. Sentry — error tracking and basic performance. Free tier is generous. Add the SDK to your app and you're done.
  3. Cloud-native logs — use your cloud provider's built-in logging for now. Centralize later when you have the volume to justify it.
  4. Grafana Cloud — free hosted Prometheus + Grafana for infrastructure metrics. Enough for most small teams.

This stack costs nearly nothing, covers all four layers, and can be set up in a day. Expand each layer as your needs grow.

Start with the layer that matters most

External uptime monitoring is the first thing to set up — it's the only tool that catches what your users see. PingBase takes 5 minutes to configure and monitors every minute.
