
DevOps April 6, 2026 12 min read

Monitor Your Microservices: A Complete Guide

Microservices distribute failure across many services instead of concentrating it in one. That's a feature — but it makes monitoring harder. Here's how to get visibility across your entire service mesh.

Monitoring a monolith is straightforward: one application, one health check, one deploy. When it's down, everything is down. When it's up, everything is up.

Microservices break this model. You have 5, 10, or 50 services, each with its own health, its own deployment cadence, and its own failure modes. A degraded recommendation service might not affect checkout. A broken notification service might silently fail to send emails while everything else runs normally. An unhealthy worker pool might be processing queue messages three times slower than expected without triggering any alert.

Effective microservices monitoring requires thinking about each layer: individual service health, per-service response times, asynchronous workers, the gateway, and the customer-visible aggregate.

Layer 1: Per-service health checks

Every service should expose a health endpoint. The common convention is GET /health or GET /healthz (popularized by Kubernetes). This endpoint should return 200 when the service is healthy and a non-200 status, typically 503, when it isn't.

A good health endpoint does more than return 200:

# Example health response

{
  "status": "ok",
  "version": "1.4.2",
  "uptime": 86400,
  "dependencies": {
    "database": "ok",
    "redis": "ok",
    "external_api": "degraded"
  }
}

This structure answers three questions at a glance: is the service healthy, what version is running, and which dependencies are contributing to any degradation. The dependencies block is particularly valuable because it lets you distinguish "this service is broken" from "this service is healthy but one of its dependencies is degraded."
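A minimal sketch of how a service might assemble this response, assuming each dependency probe has already been reduced to "ok", "degraded", or "down". The function name, the `critical` list, and the rule that only a critical dependency being down produces a 503 are illustrative assumptions, not something the format itself prescribes.

```javascript
// Hypothetical helper: builds the health payload and the HTTP status code.
// `dependencies` maps a dependency name to "ok" | "degraded" | "down".
// `critical` lists the dependencies the service cannot work without (an assumption).
const HEALTH_SEVERITY = { ok: 0, degraded: 1, down: 2 };
const HEALTH_LABELS = ["ok", "degraded", "down"];

function buildHealthResponse({ version, startedAtMs, dependencies, critical = [] }, nowMs = Date.now()) {
  // Overall status is the worst state among critical dependencies only.
  // A critical dependency with no reported state is treated as down.
  let worst = HEALTH_SEVERITY.ok;
  for (const name of critical) {
    const severity = HEALTH_SEVERITY[dependencies[name]] ?? HEALTH_SEVERITY.down;
    worst = Math.max(worst, severity);
  }

  return {
    statusCode: worst === HEALTH_SEVERITY.down ? 503 : 200,
    body: {
      status: HEALTH_LABELS[worst],
      version,
      uptime: Math.floor((nowMs - startedAtMs) / 1000), // seconds
      dependencies, // non-critical states are reported but do not change `status`
    },
  };
}
```

With `database` and `redis` marked critical and `external_api` left non-critical, a degraded `external_api` still yields a 200 with `"status": "ok"`, which matches the example response above.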

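In PingBase, set up one HTTP monitor per service pointing at its health endpoint. Use content validation to assert that the status field is "ok". This catches the case where a service returns 200 but reports itself as unhealthy. If the validation is a plain substring match, the string must match the serialized output exactly: compact JSON contains "status":"ok", while pretty-printed JSON like the example above contains "status": "ok" with a space after the colon.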
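To make the "200 but unhealthy" case concrete, here is a sketch of the logic a content check expresses. It parses the JSON rather than matching a substring, so it does not depend on whether the serializer puts a space after the colon. `isServiceHealthy` is a hypothetical name for illustration, not a PingBase function.

```javascript
// Returns true only when the status code is 200 AND the body reports "ok".
function isServiceHealthy(statusCode, bodyText) {
  if (statusCode !== 200) return false;
  try {
    const body = JSON.parse(bodyText);
    return body !== null && typeof body === "object" && body.status === "ok";
  } catch {
    // Not valid JSON (for example an HTML error page): treat as unhealthy.
    return false;
  }
}
```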

Layer 2: Response time thresholds per service

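In a microservices architecture, latency cascades. If Service A calls Service B which calls Service C, and Service C is slow, the delay propagates up the call chain: every caller waits at least as long as its slowest synchronous dependency, and retries or fan-out can amplify it further. A service that's responding in 800ms when it normally takes 100ms is often a symptom of a deeper dependency problem rather than a fault in the service itself.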
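A small worked example of that propagation, assuming a strictly sequential chain (A calls B, B calls C) with no retries or parallel calls. The per-service processing times are made-up numbers chosen to reproduce the 100ms and 800ms figures.

```javascript
// ownTimesMs[i] is the time service i spends on its own work, in call order.
// The latency observed at service i is its own time plus everything downstream.
function observedLatencies(ownTimesMs) {
  const observed = new Array(ownTimesMs.length);
  let downstream = 0;
  for (let i = ownTimesMs.length - 1; i >= 0; i--) {
    downstream += ownTimesMs[i];
    observed[i] = downstream;
  }
  return observed;
}

// Normal day: A = 20ms, B = 50ms, C = 30ms of own work.
const normal = observedLatencies([20, 50, 30]);   // [100, 80, 30]
// C slows to 730ms: A and B do the same work as before but look slow too.
const slowC = observedLatencies([20, 50, 730]);   // [800, 780, 730]
```

Only C changed, yet all three services breach a threshold set near their baseline. Comparing which services slowed down together is what points back to C.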

Set response time thresholds based on each service's baseline, not arbitrary round numbers:

Service type	Typical baseline	Suggested alert threshold
Health check endpoint	<50ms	200ms
Read-only data service	50–200ms	500ms
Write / mutation service	100–400ms	1000ms
Orchestration / gateway	200–500ms	2000ms
ML inference service	200ms–2s	5000ms

Calibrate these against your actual p95 response times, not the table above. The goal is to alert on meaningful deviation from normal, not to hit an arbitrary target.
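One way to derive a threshold from real measurements is sketched below. It uses the nearest-rank definition of a percentile, and the 2x multiplier over p95 is an arbitrary starting point chosen for illustration, not a recommendation from any particular tool.

```javascript
// Nearest-rank percentile: the smallest sample such that at least p% of
// samples are less than or equal to it.
function percentile(samplesMs, p) {
  if (samplesMs.length === 0) throw new Error("no samples");
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

// Hypothetical helper: alert threshold as a multiple of the observed p95.
function suggestThreshold(samplesMs, multiplier = 2) {
  return Math.round(percentile(samplesMs, 95) * multiplier);
}
```

For example, with 20 samples of 10, 20, ..., 200ms the p95 is 190ms, so `suggestThreshold` returns 380. Recompute periodically, since baselines drift as traffic and code change.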

Layer 3: Async workers and background services

Microservices architectures often have more async workers than synchronous services. Message consumers, event processors, scheduled aggregation jobs, data pipeline workers — these have no HTTP endpoint to monitor. They either run and process, or they silently stop.

Heartbeat monitoring is the right pattern. Each worker pings a unique URL after each successful processing cycle. If no ping arrives within the expected period plus a grace window, the monitor alerts.

Common async components to monitor with heartbeats:

Message queue consumers — ping after processing each batch, or on a fixed interval if processing is continuous
Event stream processors — Kafka consumers, SQS workers, pub/sub subscribers
Data pipeline stages — ETL jobs that run on schedule
Scheduled aggregation — analytics rollup jobs, report generators
Cleanup and maintenance workers — log rotation, expired session cleanup, temp file deletion

# Example: Kafka consumer with heartbeat

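// `consumer` and `processMessage` stand in for your own client wrapper and handler.
const HEARTBEAT_URL = process.env.PINGBASE_HEARTBEAT_URL;
let lastCycleAt = Date.now();

async function processMessages() {
  while (true) {
    const messages = await consumer.poll({ timeout: 1000 });

    for (const message of messages) {
      await processMessage(message);
    }

    if (messages.length > 0) {
      await consumer.commitOffsets();
    }

    // Record that a full cycle completed, even when the queue was empty
    lastCycleAt = Date.now();
  }
}

// Ping once a minute, but only if the loop completed a cycle in the last minute.
// An unconditional timer would keep pinging while the loop is stuck,
// which hides exactly the failure the heartbeat is meant to catch.
setInterval(() => {
  if (Date.now() - lastCycleAt < 60_000) {
    fetch(HEARTBEAT_URL).catch(() => {});
  }
}, 60_000); // Every minute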
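The worker side is only half of the pattern. On the monitor side, the expected period and the grace window together decide when a missing ping becomes an alert. The function below is a generic sketch of that logic under assumed state names ("up", "late", "down"); it is not PingBase's implementation.

```javascript
// lastPingAtMs: timestamp of the most recent heartbeat received.
// periodMs: how often the worker is expected to ping.
// graceMs: extra slack before a missed ping is treated as a failure.
function heartbeatState(lastPingAtMs, periodMs, graceMs, nowMs = Date.now()) {
  const elapsed = nowMs - lastPingAtMs;
  if (elapsed <= periodMs) return "up";
  if (elapsed <= periodMs + graceMs) return "late"; // overdue, not yet alerting
  return "down"; // alert
}
```

A worker that pings every 60 seconds with a 30 second grace window would alert once 90 seconds pass with no ping. Set the grace window wider than your longest normal batch so slow but healthy cycles do not page anyone.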

Layer 4: The API gateway or BFF

Most microservices architectures have an API gateway or backend-for-frontend (BFF) layer that aggregates calls to downstream services. This is the layer your users actually interact with.

Monitor the gateway differently from internal services:

Monitor from outside your network. Internal health checks tell you the service is running. External checks from PingBase tell you whether external users can actually reach it — catching firewall rules, CDN issues, and DNS problems that internal checks miss.
Test actual user-facing endpoints, not just /health. The gateway health endpoint might be fine while a specific route is broken due to a downstream service failure.
Assert response content, not just status codes. The gateway might return 200 with an error body when an upstream service is degraded.
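The three points above can be sketched as a single check. This assumes Node 18 or later for the global `fetch` and `AbortSignal.timeout`; the function names and the shape of the `expectations` object are hypothetical, and the URL and strings in the usage comment are placeholders.

```javascript
// Pure evaluation step: compares a response against what a user should see.
function evaluateResponse({ status, body }, { expectStatus = 200, mustContain = [], mustNotContain = [] } = {}) {
  const failures = [];
  if (status !== expectStatus) {
    failures.push(`expected status ${expectStatus}, got ${status}`);
  }
  for (const text of mustContain) {
    if (!body.includes(text)) failures.push(`missing "${text}"`);
  }
  for (const text of mustNotContain) {
    if (body.includes(text)) failures.push(`unexpected "${text}"`);
  }
  return { ok: failures.length === 0, failures };
}

// Network step: call a real user-facing route from outside your network.
async function checkEndpoint(url, expectations, timeoutMs = 5000) {
  try {
    const res = await fetch(url, { signal: AbortSignal.timeout(timeoutMs) });
    return evaluateResponse({ status: res.status, body: await res.text() }, expectations);
  } catch (err) {
    // DNS failures, refused connections, and timeouts all land here.
    return { ok: false, failures: [`request failed: ${err.message}`] };
  }
}

// Usage (placeholder URL and strings):
// checkEndpoint("https://api.example.com/v1/products", {
//   mustContain: ['"items"'],
//   mustNotContain: ['"error"'],
// }).then(console.log);
```

Keeping the evaluation separate from the network call makes the rules easy to test without hitting the gateway.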

Layer 5: The public status page

Your internal monitoring gives your engineering team visibility into service health. Your status page gives your users visibility. In a microservices architecture, the mapping between internal services and user-visible components isn't always 1:1.

Design your status page around user-visible functionality, not internal service names:

Internal services	Status page component
auth-service, session-service, user-service	Authentication
api-gateway, routing-service	API
notification-service, email-worker, template-service	Email & Notifications
billing-service, payment-processor-adapter	Billing
cdn, asset-service, frontend-app	Dashboard

Users don't know what auth-service is. They know whether they can log in. Structure your status page around their experience, and map your monitors to the appropriate component.
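A sketch of how that mapping might be applied in code. The component map mirrors the table above; the state names ("operational", "degraded", "outage"), the worst-of rule, and the choice to treat a service with no reported state as operational are all assumptions made for illustration.

```javascript
const COMPONENT_MAP = {
  "Authentication": ["auth-service", "session-service", "user-service"],
  "API": ["api-gateway", "routing-service"],
  "Email & Notifications": ["notification-service", "email-worker", "template-service"],
  "Billing": ["billing-service", "payment-processor-adapter"],
  "Dashboard": ["cdn", "asset-service", "frontend-app"],
};

const STATE_RANK = { operational: 0, degraded: 1, outage: 2 };

// A component shows the worst state among the services behind it.
function componentStatuses(componentMap, serviceStates) {
  const result = {};
  for (const [component, services] of Object.entries(componentMap)) {
    let worst = "operational";
    for (const service of services) {
      const state = serviceStates[service] ?? "operational";
      if (STATE_RANK[state] > STATE_RANK[worst]) worst = state;
    }
    result[component] = worst;
  }
  return result;
}
```

Worst-of is deliberately pessimistic. If one service behind a component can fail without users noticing, a weighted or per-service rule may describe their experience better.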

Monitoring during deployments

In a microservices architecture, deployments are continuous. Services deploy independently, often multiple times per day. Each deployment is a potential incident source.

Two patterns worth implementing:

Pause monitors during rolling deployments. When a service is deploying, individual instances restart sequentially. A monitor checking during this window might catch an instance mid-restart and fire a false alert. Use PingBase's GitHub Action to pause the relevant monitor during deployment and resume it after health checks pass.

# .github/workflows/deploy.yml

- name: Pause monitor during deploy
  uses: pingbase/pause-monitor@v1
  with:
    api-key: ${{ secrets.PINGBASE_API_KEY }}
    monitor-id: ${{ vars.PAYMENT_SERVICE_MONITOR_ID }}

- name: Deploy payment-service
  run: kubectl rollout restart deployment/payment-service

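- name: Wait for rollout
  run: kubectl rollout status deployment/payment-service --timeout=5m

# Runs even if the deploy or rollout fails, so the monitor is never left paused
- name: Resume monitor
  if: always()
  uses: pingbase/resume-monitor@v1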
  with:
    api-key: ${{ secrets.PINGBASE_API_KEY }}
    monitor-id: ${{ vars.PAYMENT_SERVICE_MONITOR_ID }}

Post-deploy verification. After a deployment completes, run a targeted check against the health endpoint before resuming monitoring. If the check fails, roll back rather than resuming monitoring and waiting for an alert.
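A sketch of what such a verification step could look like, assuming Node 18 or later for the global `fetch`. The function names, the default attempt counts, and the rule of requiring several consecutive passes are assumptions for illustration; tune them to how long your service takes to warm up.

```javascript
// Pure decision: "resume" once enough consecutive checks pass, else "rollback".
function decideAfterDeploy(checkResults, requiredConsecutive = 3) {
  let streak = 0;
  for (const passed of checkResults) {
    streak = passed ? streak + 1 : 0;
    if (streak >= requiredConsecutive) return "resume";
  }
  return "rollback";
}

// Polls the health endpoint until the decision is "resume" or attempts run out.
async function verifyDeployment(healthUrl, { attempts = 10, delayMs = 3000, requiredConsecutive = 3 } = {}) {
  const results = [];
  for (let i = 0; i < attempts; i++) {
    try {
      const res = await fetch(healthUrl, { signal: AbortSignal.timeout(2000) });
      results.push(res.status === 200);
    } catch {
      results.push(false); // timeout or connection error counts as a failed check
    }
    if (decideAfterDeploy(results, requiredConsecutive) === "resume") return "resume";
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  return "rollback";
}
```

A CI step could call `verifyDeployment`, exit with a non-zero code on "rollback", and let a following step run `kubectl rollout undo deployment/payment-service`.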

Organizing monitors at scale

When you have 20+ services, flat monitor lists become unmanageable. Use PingBase's monitor groups to organize by team, environment, or service tier:

By tier: Core (auth, API gateway, billing) vs. Supporting (notifications, analytics, recommendations)
By team: Platform, Product, Data
By environment: Production, Staging (with different alert thresholds)

For teams that manage monitors programmatically, PingBase's REST API supports bulk creation and tagging, which is useful when spinning up a new service follows a repeatable pattern.
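The repeatable part is usually generating consistent definitions from a service inventory. The sketch below only builds plain objects. Every field name in it (`type`, `responseTimeThresholdMs`, `tags`) is hypothetical and does not reflect PingBase's actual API schema, so check the API reference before sending anything like this.

```javascript
// services: [{ name, baseUrl, team, tier, thresholdMs }]
// Returns one HTTP monitor definition per service, tagged by team, tier, and environment.
function buildMonitorDefinitions(services, { environment = "production" } = {}) {
  return services.map(({ name, baseUrl, team, tier, thresholdMs }) => ({
    name: `${name} (${environment})`,
    type: "http", // hypothetical field
    url: new URL("/health", baseUrl).toString(),
    responseTimeThresholdMs: thresholdMs, // hypothetical field
    tags: [`team:${team}`, `tier:${tier}`, `env:${environment}`],
  }));
}
```

Keeping the inventory in version control means a new service gets its monitor, tags, and threshold in the same pull request that creates it.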

What PingBase gives you for microservices

HTTP monitors per service with response time thresholds and content validation
Heartbeat monitors for async workers with configurable periods and grace windows
DNS monitoring for service mesh entry points and internal service discovery
SSL monitoring per domain/subdomain
Monitor groups for organizing by team or tier
GitHub Action for pausing monitors during deployments
REST API and CLI for programmatic monitor management
Status page showing user-visible components, not internal service names
Multi-region checks to catch regional failures

The free tier covers 5 monitors — enough to cover the critical path (gateway, auth, billing) while evaluating. Pro at $9/month covers up to 10 monitors. Business at $29/month is unlimited.
