Monitoring Microservices on Kubernetes: A Practical Guide
Kubernetes makes it easy to deploy many services. It makes it harder to know what's happening across them. Here's how to build observability into a Kubernetes-based architecture without drowning in complexity.
Why Kubernetes makes monitoring harder
In a traditional setup, a service runs on a predictable set of servers. You know the IPs, you can SSH in to check logs, and monitoring tools can target stable hosts.
Kubernetes changes this fundamentally. Pods are ephemeral: they can be created, moved, restarted, or destroyed at any time. IPs change. The same service might run in dozens of pods spread over multiple nodes. A single user request often touches several services before a response comes back.
This dynamic nature is what makes Kubernetes powerful — and what makes traditional monitoring approaches break down. You need a monitoring approach built for ephemeral, distributed systems.
The four pillars of Kubernetes observability
A complete observability strategy for Kubernetes covers four areas:
- Metrics: Quantitative data about system state — request rate, error rate, latency, CPU, memory, pod restart count.
- Logs: The event stream from each container — what happened and when, in human-readable form.
- Traces: A record of a single request as it flows through multiple services, showing which service caused latency or errors.
- External uptime monitoring: Independent checks from outside the cluster to verify that what users see matches what your internal monitoring shows.
The fourth pillar is often forgotten in Kubernetes setups. Internal metrics tell you the cluster is healthy. External monitoring tells you what users actually experience, which can differ when an ingress controller is misconfigured, a DNS record breaks, or an SSL certificate expires.
Metrics with Prometheus and Grafana
Prometheus is the de facto standard for Kubernetes metrics. It uses a pull model: Prometheus scrapes metrics endpoints exposed by your services on a schedule. Kubernetes service discovery lets Prometheus automatically find new pods as they start and stop.
The kube-prometheus-stack Helm chart installs Prometheus, Grafana, and a set of pre-built dashboards and alerting rules that cover:
- Cluster-level metrics: node CPU, memory, disk
- Pod-level metrics: restart count, resource requests vs. limits, OOM kills
- Kubernetes API server metrics
- Kubelet and container runtime metrics
For application metrics, instrument your services with a Prometheus client library (available for most major languages). Expose a /metrics endpoint and add a ServiceMonitor resource (a Prometheus Operator custom resource, which kube-prometheus-stack installs) to tell Prometheus to scrape it. Track request count and error count as counters and request duration as a histogram. Together these give you the RED metrics (Rate, Errors, Duration) that matter most for services.
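To make the RED signals concrete, here is a minimal standard-library sketch of the bookkeeping a Prometheus client library does for you: two counters and a histogram with cumulative buckets. The class name, the bucket bounds, and the choice to count only 5xx responses as errors are illustrative assumptions. In a real service you would use the client library's own counter and histogram types rather than this stand-in.

```python
from bisect import bisect_left
from dataclasses import dataclass, field

# Upper bounds in seconds for the duration histogram (illustrative values).
BUCKETS = (0.05, 0.1, 0.25, 0.5, 1.0, 2.5)


@dataclass
class RedMetrics:
    """Hypothetical in-memory stand-in for a request counter, an error
    counter, and a request-duration histogram."""

    requests: int = 0
    errors: int = 0
    duration_sum: float = 0.0
    # One slot per bucket, plus a final slot for values above the last bound.
    bucket_counts: list = field(
        default_factory=lambda: [0] * (len(BUCKETS) + 1)
    )

    def observe(self, status: int, seconds: float) -> None:
        """Record one finished request."""
        self.requests += 1
        if status >= 500:  # assumption: only 5xx responses count as errors
            self.errors += 1
        self.duration_sum += seconds
        # bisect_left finds the first bucket whose upper bound is >= seconds.
        self.bucket_counts[bisect_left(BUCKETS, seconds)] += 1

    def error_ratio(self) -> float:
        return self.errors / self.requests if self.requests else 0.0

    def cumulative_buckets(self) -> list:
        """Return (upper_bound, count) pairs where each count includes all
        smaller buckets, which is how Prometheus exposes histograms."""
        running, result = 0, []
        for bound, count in zip(BUCKETS + (float("inf"),), self.bucket_counts):
            running += count
            result.append((bound, running))
        return result


metrics = RedMetrics()
for status, seconds in [(200, 0.03), (500, 0.3), (200, 0.1), (200, 3.0)]:
    metrics.observe(status, seconds)

print(metrics.error_ratio())
print(metrics.cumulative_buckets())
```

Rate is not stored anywhere in this sketch: it is the increase of the request counter over a time window, which Prometheus computes at query time from the scraped counter values.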
Log aggregation with Loki
Kubernetes containers write logs to stdout/stderr. The kubelet captures these, but by default they live only on the node and are reachable only via kubectl logs. After a container restart you can still read the previous instance's output with kubectl logs --previous, but once a pod is deleted or evicted, its logs are gone.
You need a log aggregation layer that collects logs as they're produced and stores them centrally. The most common Kubernetes-native setup is:
- Promtail (or Grafana Alloy) — a DaemonSet that runs on every node and ships logs to Loki
- Loki — a log aggregation system that stores logs indexed by labels (pod name, namespace, container name)
- Grafana — for querying and viewing logs alongside metrics dashboards
This stack (often called PLG, for Promtail + Loki + Grafana) integrates tightly with Prometheus/Grafana and is typically much lighter to run than Elasticsearch. Loki indexes only the labels attached to each log stream, not the full text of the log lines, which reduces storage cost significantly.
Log everything in structured JSON format. Key fields to always include: level, request_id, service, user_id (where applicable). This makes filtering and correlation much easier.
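As a minimal sketch of what that looks like in practice, the formatter below uses only Python's standard logging module to emit one JSON object per line with the fields listed above. The class name, the "checkout" service name, and the sample IDs are hypothetical, and the timestamp and message fields are additions for illustration.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname.lower(),
            "service": self.service,
            "message": record.getMessage(),
            # Values passed via `extra=` become attributes on the record.
            "request_id": getattr(record, "request_id", None),
        }
        user_id = getattr(record, "user_id", None)
        if user_id is not None:  # include user_id only where applicable
            payload["user_id"] = user_id
        return json.dumps(payload)


def build_logger(service: str) -> logging.Logger:
    """Create a logger that writes JSON lines to stdout, where the
    container runtime picks them up."""
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter(service))
    logger = logging.getLogger(service)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


logger = build_logger("checkout")
logger.info("order placed", extra={"request_id": "req-123", "user_id": "u-42"})
```

Writing to stdout rather than to a file keeps the service compatible with the node-level collector described above, which reads container output.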
Distributed tracing with OpenTelemetry
Traces are what let you follow a single user request as it crosses multiple service boundaries. Without tracing, when a user reports "the checkout is slow," you can't tell from metrics alone whether the problem is in the frontend service, the payment service, or the database.
OpenTelemetry (OTel) has become the standard instrumentation layer for distributed tracing. It provides:
- SDKs for most major languages, many of which can auto-instrument common frameworks (Express, FastAPI, Spring Boot, gRPC, etc.)
- A vendor-neutral wire format (OTLP) so you can switch backends without re-instrumenting
- A collector (the OpenTelemetry Collector), commonly run in Kubernetes as a DaemonSet or Deployment, that receives traces from services and forwards them to a backend
For the backend, options include Jaeger (open-source, self-hosted), Grafana Tempo (integrates with the Prometheus/Loki stack), or Honeycomb/Datadog if you prefer managed SaaS.
Propagate trace context across service calls via HTTP headers (W3C Trace Context is the standard). This ties together all the spans from a single user request into one trace, even across dozens of services.
Kubernetes-specific metrics to watch
Beyond the standard RED metrics, there are several Kubernetes-specific signals that indicate cluster health problems before they cause user-visible failures:
| Metric | Why it matters |
|---|---|
| Pod restart count | Restarts indicate crashes or OOM kills — a rising restart count is a leading indicator of instability |
| Pending pods | Pods stuck in Pending often mean resource exhaustion or node pressure — they can't be scheduled |
| OOMKilled containers | Container was killed for exceeding its memory limit — memory limit is set too low or there's a leak |
| Node pressure (memory, disk, PID) | When a node reports a pressure condition, the kubelet starts evicting pods |
| PVC usage | Persistent volumes running full cause pod failures and data loss |
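
The signals in the table can be expressed as PromQL queries. The sketch below keeps one candidate expression per row and builds a URL for the Prometheus instant-query HTTP endpoint (/api/v1/query). It assumes the metric names exposed by kube-state-metrics and the kubelet in a typical kube-prometheus-stack install. The time windows, the 85% threshold, the dictionary keys, and the base URL are illustrative assumptions to adjust for your cluster.

```python
from urllib.parse import urlencode

# One PromQL expression per signal. Thresholds and windows are assumptions.
QUERIES = {
    "pod_restarts": "increase(kube_pod_container_status_restarts_total[15m]) > 0",
    "pending_pods": 'sum by (namespace) (kube_pod_status_phase{phase="Pending"}) > 0',
    "oom_killed": (
        "kube_pod_container_status_last_terminated_reason"
        '{reason="OOMKilled"} == 1'
    ),
    "node_pressure": (
        "kube_node_status_condition"
        '{condition=~"MemoryPressure|DiskPressure|PIDPressure",status="true"} == 1'
    ),
    "pvc_usage": (
        "kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes > 0.85"
    ),
}


def instant_query_url(base_url: str, name: str) -> str:
    """Build the URL for a Prometheus instant query for the named signal."""
    query_string = urlencode({"query": QUERIES[name]})
    return f"{base_url.rstrip('/')}/api/v1/query?{query_string}"


# Hypothetical in-cluster address; substitute your own Prometheus service.
print(instant_query_url("http://prometheus.monitoring.svc:9090", "pending_pods"))
```

The same expressions can serve as a starting point for dashboard panels or alerting rules.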
External monitoring: the view from outside the cluster
Your Prometheus setup tells you what's happening inside the cluster. But it can't tell you what a user in Frankfurt experiences when they open your app — because it's running inside the same cluster you're monitoring.
External monitoring completes the picture. From outside the cluster, PingBase checks:
- Whether your public endpoints are reachable and returning the correct response codes
- Whether SSL certificates for your domains are valid and not approaching expiry
- Response time as seen from multiple geographic regions
This catches things internal monitoring misses: a misconfigured ingress that breaks a specific path, a DNS propagation failure, a certificate expiry that internal checks don't surface because they bypass TLS.
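For illustration, the checks in the list above can be approximated with Python's standard library. This is a rough sketch of what an external probe does, not how PingBase is implemented. The function names are hypothetical, and it runs from one location only, whereas a hosted service probes from several regions.

```python
import socket
import ssl
import time
import urllib.error
import urllib.request
from typing import Optional


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days left before a certificate's notAfter timestamp,
    e.g. 'Jan 20 00:00:00 2030 GMT'."""
    expires_at = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires_at - current) / 86400


def fetch_cert_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Open a verified TLS connection and return the certificate's notAfter."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def check_endpoint(url: str, expected_status: int = 200, timeout: float = 5.0) -> dict:
    """Request a public URL and report status code and response time."""
    started = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as exc:
        status = exc.code  # the server answered, but with an error status
    except OSError as exc:
        # DNS failure, refused connection, TLS error, or timeout
        return {"url": url, "ok": False, "error": str(exc)}
    elapsed = time.monotonic() - started
    return {
        "url": url,
        "ok": status == expected_status,
        "status": status,
        "seconds": round(elapsed, 3),
    }
```

A probe would call check_endpoint on each public URL and fetch_cert_not_after on each domain, then alert when a check fails or days_until_expiry drops below a chosen threshold. The probe has to run outside the cluster, otherwise it shares the blind spots of internal monitoring.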
Add one PingBase monitor per public-facing endpoint. Pair it with your status page so that during incidents, users have a place to check rather than filing support tickets.
Alerting strategy for Kubernetes
With all these signals, alerting discipline is critical. A few principles:
- Use Alertmanager with Prometheus for internal cluster alerts. It handles deduplication, grouping, and routing so you don't get 50 alerts when one node fails.
- Group alerts by service and severity. A page-level alert should only fire when users are affected. Warning-level alerts can batch into a daily digest.
- Test your alerting pipeline regularly. Alerts that haven't fired in months may be misconfigured. Run drills: deliberately kill a pod and verify the alert fires and routes correctly.
- Keep runbooks linked from every alert. An alert without a runbook creates panic. A link to a Notion/Confluence page with "when this fires, do X" reduces mean time to resolution dramatically.
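To show why grouping matters, here is a toy model of the first two principles. It is not Alertmanager's implementation, and the label names, severity values, and receiver names are illustrative assumptions. Alerts that share the same grouping labels collapse into one notification, and the group's highest severity decides where it goes.

```python
from collections import defaultdict


def group_alerts(alerts: list, group_by: tuple = ("alertname", "service")) -> dict:
    """Bucket alerts that share the same values for the grouping labels."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert["labels"].get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)


def route(group: list) -> str:
    """Page a human only if at least one alert in the group is page-level."""
    severities = {alert["labels"].get("severity") for alert in group}
    return "pager" if "page" in severities else "daily-digest"


# One failing node makes 50 pods of the same service report errors.
alerts = [
    {"labels": {"alertname": "HighErrorRate", "service": "checkout",
                "severity": "page", "pod": f"checkout-{i}"}}
    for i in range(50)
]
alerts.append(
    {"labels": {"alertname": "DiskFillingUp", "service": "reports",
                "severity": "warning", "pod": "reports-0"}}
)

for key, group in group_alerts(alerts).items():
    print(key, len(group), route(group))
```

In this example 51 alerts become two notifications: one page for the checkout service and one digest entry for the reports service.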
Add external monitoring to your Kubernetes setup
PingBase monitors your public endpoints from outside the cluster — catching what internal metrics miss. Free for up to 5 monitors, no credit card required.