Monitoring Microservices on Kubernetes: A Practical Guide

Kubernetes makes it easy to deploy many services. It makes it harder to know what's happening across them. Here's how to build observability into a Kubernetes-based architecture without drowning in complexity.

Why Kubernetes makes monitoring harder

In a traditional setup, a service runs on a predictable set of servers. You know the IPs, you can SSH in to check logs, and monitoring tools can target stable hosts.

Kubernetes changes this fundamentally. Pods are ephemeral: they can be created, moved, restarted, or destroyed at any time. IPs change. The same service might run in dozens of pods spread across multiple nodes, and a single user request may touch several services before a response is returned.

This dynamic nature is what makes Kubernetes powerful — and what makes traditional monitoring approaches break down. You need a monitoring approach built for ephemeral, distributed systems.


The four pillars of Kubernetes observability

A complete observability strategy for Kubernetes covers four areas:

  1. Metrics: Quantitative data about system state — request rate, error rate, latency, CPU, memory, pod restart count.
  2. Logs: The event stream from each container — what happened and when, in human-readable form.
  3. Traces: A record of a single request as it flows through multiple services, showing which service caused latency or errors.
  4. External uptime monitoring: Independent checks from outside the cluster to verify that what users see matches what your internal monitoring shows.

The fourth pillar is often forgotten in Kubernetes setups. Internal metrics tell you the cluster is healthy. External monitoring tells you what users actually experience, which can differ when an ingress controller is misconfigured, a DNS record breaks, or an SSL/TLS certificate expires.


Metrics with Prometheus and Grafana

Prometheus is the de facto standard for Kubernetes metrics. It uses a pull model: Prometheus scrapes metrics endpoints exposed by your services on a schedule. Kubernetes service discovery lets Prometheus automatically find new pods as they start and stop.

The kube-prometheus-stack Helm chart installs Prometheus, Grafana, and a set of pre-built dashboards and alerting rules that cover the cluster itself: nodes, pods, and core Kubernetes components.

For application metrics, instrument your services with a Prometheus client library (available for every major language). Expose a /metrics endpoint and add a ServiceMonitor resource to tell Prometheus to scrape it. Track request count and error count as counters, and request duration as a histogram. Together these give you the RED metrics (Rate, Errors, Duration) that matter most for services.
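To make the RED metrics concrete, here is a minimal sketch of what a /metrics response contains, written with only the Python standard library. The class name, metric names, and bucket boundaries are assumptions chosen for illustration. A real service would use an official Prometheus client library rather than hand-rolling this.

```python
class RedMetrics:
    """Hand-rolled RED metrics, for illustration only."""

    # Upper bounds in seconds; hypothetical values, tune them to your latency profile.
    BUCKETS = (0.05, 0.1, 0.25, 0.5, 1.0, 2.5)

    def __init__(self):
        self.requests = 0
        self.errors = 0
        self.duration_sum = 0.0
        self.bucket_counts = [0] * len(self.BUCKETS)

    def observe(self, duration_s, status_code):
        """Record one finished request."""
        self.requests += 1
        if status_code >= 500:
            self.errors += 1
        self.duration_sum += duration_s
        # Prometheus histogram buckets are cumulative: a 0.03s request
        # counts toward every bucket whose upper bound is >= 0.03.
        for i, upper in enumerate(self.BUCKETS):
            if duration_s <= upper:
                self.bucket_counts[i] += 1

    def render(self):
        """Return the metrics in the Prometheus text exposition format."""
        name = "http_request_duration_seconds"
        lines = [
            "# TYPE http_requests_total counter",
            f"http_requests_total {self.requests}",
            "# TYPE http_request_errors_total counter",
            f"http_request_errors_total {self.errors}",
            f"# TYPE {name} histogram",
        ]
        for upper, count in zip(self.BUCKETS, self.bucket_counts):
            lines.append(f'{name}_bucket{{le="{upper}"}} {count}')
        # The +Inf bucket always equals the total observation count.
        lines.append(f'{name}_bucket{{le="+Inf"}} {self.requests}')
        lines.append(f"{name}_sum {self.duration_sum}")
        lines.append(f"{name}_count {self.requests}")
        return "\n".join(lines) + "\n"
```

Serving the string returned by render() at /metrics is what Prometheus scrapes. Rate and error ratio are then computed at query time from the counters, and latency percentiles from the histogram buckets.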


Log aggregation with Loki

Kubernetes containers write logs to stdout/stderr. The kubelet captures these, but by default they live only on the node and are reachable only via kubectl logs while the pod exists. After a container restart, only the previous instance's logs are kept (kubectl logs --previous), and when a pod is deleted or replaced, its logs are gone.

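You need a log aggregation layer that collects logs as they're produced and stores them centrally. The most common Kubernetes-native setup is Promtail running on each node to collect container logs, Loki to store them, and Grafana to query them.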

This stack (often called PLG, for Promtail + Loki + Grafana) integrates tightly with Prometheus/Grafana and is much lighter than Elasticsearch. Loki indexes only the labels attached to each log stream rather than the full log text, which reduces storage cost significantly.

Log everything in structured JSON format. Key fields to always include: level, request_id, service, user_id (where applicable). This makes filtering and correlation much easier.
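As one possible way to do this in Python, the sketch below uses the standard logging module with a custom formatter that emits one JSON object per line. The names JsonFormatter and build_logger, the "ts" and "message" keys, and the sample field values are assumptions for illustration, not a prescribed schema.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON object."""

    def format(self, record):
        entry = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            # These fields arrive via the `extra` argument of the logging call.
            "service": getattr(record, "service", "unknown"),
            "request_id": getattr(record, "request_id", None),
            "message": record.getMessage(),
        }
        user_id = getattr(record, "user_id", None)
        if user_id is not None:  # include user_id only where applicable
            entry["user_id"] = user_id
        return json.dumps(entry)


def build_logger(name, stream=sys.stdout):
    """Create a logger that writes JSON lines to stdout by default."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger("checkout")
    log.info(
        "payment authorized",
        extra={"service": "checkout", "request_id": "req-123", "user_id": "u-42"},
    )
```

Because every line is valid JSON written to stdout, the log collector needs no custom parsing rules, and a query can filter on request_id to pull together one request's entries from several services.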


Distributed tracing with OpenTelemetry

Traces are what let you follow a single user request as it crosses multiple service boundaries. Without tracing, when a user reports "the checkout is slow," you can't tell from metrics alone whether the problem is in the frontend service, the payment service, or the database.

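OpenTelemetry (OTel) has become the standard instrumentation layer for distributed tracing. It provides vendor-neutral APIs and SDKs for most major languages, plus the OpenTelemetry Collector for receiving, processing, and exporting telemetry.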

For the backend, options include Jaeger (open-source, self-hosted), Grafana Tempo (integrates with the Prometheus/Loki stack), or Honeycomb/Datadog if you prefer managed SaaS.

Propagate trace context across service calls via HTTP headers (W3C Trace Context is the standard). This ties together all the spans from a single user request into one trace, even across dozens of services.
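To show what propagation means on the wire, the sketch below builds and forwards a W3C Trace Context traceparent header by hand. The function names are hypothetical, and the sketch skips tracestate and future header versions. In practice the OpenTelemetry SDK's propagators do this for you.

```python
import re
import secrets

# version "00" - 16-byte trace id - 8-byte parent (span) id - 1-byte flags, all lowercase hex
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled=True):
    """Start a new trace, as the first service to receive a request would."""
    trace_id = secrets.token_hex(16)
    span_id = secrets.token_hex(8)
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def child_traceparent(incoming):
    """Build the header for an outgoing call made while handling `incoming`.

    The trace id is kept so that all spans join the same trace; the span id
    is replaced with a fresh one identifying the current service's span.
    """
    match = TRACEPARENT_RE.match(incoming or "")
    if match is None:
        return new_traceparent()  # missing or malformed header: start a new trace
    trace_id, parent_id, flags = match.groups()
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return new_traceparent()  # all-zero ids are invalid per the spec
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


if __name__ == "__main__":
    incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
    outgoing_headers = {"traceparent": child_traceparent(incoming)}
    print(outgoing_headers)
```

Every service that forwards the header this way contributes spans with the same trace id, which is what lets the tracing backend stitch them into one trace.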


Kubernetes-specific metrics to watch

Beyond the standard RED metrics, there are several Kubernetes-specific signals that indicate cluster health problems before they cause user-visible failures:

Metric | Why it matters
Pod restart count | Restarts indicate crashes or OOM kills; a rising restart count is a leading indicator of instability
Pending pods | Pods stuck in Pending often mean resource exhaustion or node pressure; they can't be scheduled
OOMKilled containers | Container was killed for exceeding its memory limit; the limit is set too low or there's a leak
Node pressure (CPU, memory, disk) | When nodes hit pressure conditions, Kubernetes starts evicting pods
PVC usage | Persistent volumes running full cause pod failures and data loss
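The first three signals in the table can all be read from a pod's status object. The sketch below assumes the pod has already been fetched and parsed into a dict (for example from kubectl get pods -o json). The function name and the restart threshold are hypothetical. In a Prometheus setup, the same signals normally come from kube-state-metrics rather than ad hoc scripts.

```python
def pod_warnings(pod, restart_threshold=3):
    """Return warnings for one pod object parsed from Kubernetes API JSON."""
    name = pod.get("metadata", {}).get("name", "<unknown>")
    status = pod.get("status", {})
    warnings = []

    # A pod that stays Pending has not been scheduled or cannot start.
    if status.get("phase") == "Pending":
        warnings.append(f"{name}: phase is Pending")

    for cs in status.get("containerStatuses", []):
        container = cs.get("name", "<unknown>")

        # restartCount is cumulative over the pod's lifetime.
        restarts = cs.get("restartCount", 0)
        if restarts >= restart_threshold:
            warnings.append(f"{name}/{container}: {restarts} restarts")

        # lastState.terminated describes how the previous instance ended.
        terminated = cs.get("lastState", {}).get("terminated") or {}
        if terminated.get("reason") == "OOMKilled":
            warnings.append(f"{name}/{container}: previous instance was OOMKilled")

    return warnings


if __name__ == "__main__":
    sample = {
        "metadata": {"name": "api-7d9f"},
        "status": {
            "phase": "Running",
            "containerStatuses": [
                {
                    "name": "api",
                    "restartCount": 5,
                    "lastState": {"terminated": {"reason": "OOMKilled"}},
                }
            ],
        },
    }
    for warning in pod_warnings(sample):
        print(warning)
```

A one-off check like this is useful for debugging. For alerting, the trend matters more than the snapshot, which is why the restart count is usually evaluated as a rate over time.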

External monitoring: the view from outside the cluster

Your Prometheus setup tells you what's happening inside the cluster. But it can't tell you what a user in Frankfurt experiences when they open your app, because Prometheus itself runs inside the cluster it monitors.

External monitoring completes the picture. From outside the cluster, PingBase checks your public endpoints over the same path real users take.

This catches things internal monitoring misses: a misconfigured ingress that breaks a specific path, a DNS propagation failure, a certificate expiry that internal checks don't surface because they bypass TLS.
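To illustrate one of these failure modes, the sketch below checks certificate expiry the way an outside client sees it, using only the Python standard library. The function names and the example.com host are placeholders. It shows the idea behind an external check and does not describe how PingBase implements it.

```python
import socket
import ssl
from datetime import datetime, timezone


def fetch_cert_not_after(host, port=443, timeout=5.0):
    """Connect like a browser would and return the certificate's notAfter string.

    The default context verifies the chain and hostname, so an already
    expired or mismatched certificate raises ssl.SSLCertVerificationError.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after, now=None):
    """Convert a notAfter string such as 'Jan  5 09:34:43 2030 GMT' to whole days left."""
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return (expires - now).days


if __name__ == "__main__":
    # example.com is a placeholder; run this from a machine outside the cluster.
    remaining = days_until_expiry(fetch_cert_not_after("example.com"))
    print(f"certificate expires in {remaining} days")
```

The check only means something when it runs outside the cluster and resolves the public hostname, because that is the path that exercises DNS, the ingress, and TLS termination together.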

Add one PingBase monitor per public-facing endpoint. Pair it with your status page so that during incidents, users have a place to check rather than filing support tickets.


Alerting strategy for Kubernetes

With all these signals, alerting discipline is critical. A few principles:

Add external monitoring to your Kubernetes setup

PingBase monitors your public endpoints from outside the cluster — catching what internal metrics miss. Free for up to 5 monitors, no credit card required.
