Monitor Your Microservices: A Complete Guide
Microservices distribute failure across many services instead of concentrating it in one. That's a feature — but it makes monitoring harder. Here's how to get visibility across your entire service mesh.
Monitoring a monolith is straightforward: one application, one health check, one deploy. When it's down, everything is down. When it's up, everything is up.
Microservices break this model. You have 5, 10, or 50 services, each with its own health, its own deployment cadence, and its own failure modes. A degraded recommendation service might not affect checkout. A broken notification service might silently fail to send emails while everything else runs normally. An unhealthy worker pool might be processing queue messages three times slower than expected without triggering any alert.
Effective microservices monitoring requires thinking about each layer: individual service health, inter-service dependencies, asynchronous workers, and the customer-visible aggregate.
Layer 1: Per-service health checks
Every service should expose a health endpoint. The usual convention is GET /health or GET /healthz (the Kubernetes convention). This endpoint should return 200 when the service is healthy and a non-200 status, such as 503, when it isn't.
A good health endpoint does more than return 200:
# Example health response
{
  "status": "ok",
  "version": "1.4.2",
  "uptime": 86400,
  "dependencies": {
    "database": "ok",
    "redis": "ok",
    "external_api": "degraded"
  }
}
This structure gives you at a glance: is the service healthy, what version is running, and which dependencies are contributing to any degradation. The dependencies block is particularly valuable — it lets you distinguish "this service is broken" from "this service is healthy but one of its dependencies is degraded."
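In PingBase, set up one HTTP monitor per service pointing at its health endpoint. Use content validation to assert that "status":"ok" appears in the response body, written exactly as your service serializes it (compact JSON has no space after the colon; a pretty-printed body like the example above does). This catches the case where a service returns 200 but reports itself as unhealthy.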
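As a rough illustration of how a service might assemble that response, the sketch below derives the overall status and HTTP code from per-dependency results. The `buildHealthReport` helper, the `critical` list, and the `"error"` status value are assumptions made for this example; they are not part of any PingBase or Kubernetes API.

```javascript
// Hypothetical helper: builds a health response like the example above.
// `critical` names the dependencies whose failure should make the whole
// service report unhealthy; anything else may degrade without failing it.
function buildHealthReport(dependencies, { version, startedAt, critical = [], now = Date.now() }) {
  const unhealthy = critical.some((name) => dependencies[name] !== "ok");
  return {
    httpStatus: unhealthy ? 503 : 200,
    body: {
      status: unhealthy ? "error" : "ok",
      version,
      uptime: Math.floor((now - startedAt) / 1000), // seconds since start
      dependencies,
    },
  };
}

// Usage: a degraded non-critical dependency still yields 200 and "ok".
const report = buildHealthReport(
  { database: "ok", redis: "ok", external_api: "degraded" },
  { version: "1.4.2", startedAt: Date.now() - 86_400_000, critical: ["database", "redis"] }
);
console.log(report.httpStatus, JSON.stringify(report.body));
```

Which dependencies count as critical is a per-service decision. Note that `JSON.stringify` emits compact JSON, so a body produced this way contains the substring "status":"ok" with no space after the colon.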
Layer 2: Response time thresholds per service
In a microservices architecture, latency cascades. If Service A calls Service B which calls Service C, and Service C is slow, the slowness propagates up the call chain: each caller's response time includes the time it spends waiting on the services below it. A service that's responding in 800ms when it normally takes 100ms may be a symptom of a deeper dependency problem.
Set response time thresholds based on each service's baseline, not arbitrary round numbers:
| Service type | Typical baseline | Suggested alert threshold |
|---|---|---|
| Health check endpoint | <50ms | 200ms |
| Read-only data service | 50–200ms | 500ms |
| Write / mutation service | 100–400ms | 1000ms |
| Orchestration / gateway | 200–500ms | 2000ms |
| ML inference service | 200ms–2s | 5000ms |
Calibrate these against your actual p95 response times, not the table above. The goal is to alert on meaningful deviation from normal, not to hit an arbitrary target.
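One way to do that calibration, sketched under assumptions: compute the p95 from recent response-time samples and alert at a multiple of it. The function names and the 2x multiplier are illustrative choices for this example, not values taken from PingBase.

```javascript
// Nearest-rank percentile over a list of response times in milliseconds.
function percentile(samples, p) {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100); // 1-based rank
  return sorted[Math.max(rank, 1) - 1];
}

// Assumption: alert at a fixed multiple of the observed p95. The 2x default
// is a starting point to tune, not a recommendation.
function suggestThreshold(samples, multiplier = 2) {
  return Math.round(percentile(samples, 95) * multiplier);
}

// Usage with a handful of recent response times (ms).
const recent = [82, 95, 101, 88, 110, 97, 450, 93, 105, 99];
console.log(suggestThreshold(recent));
```

With only a few samples the p95 is dominated by the slowest request, so feed this a window large enough to represent normal traffic.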
Layer 3: Async workers and background services
Microservices architectures often have more async workers than synchronous services. Message consumers, event processors, scheduled aggregation jobs, data pipeline workers — these have no HTTP endpoint to monitor. They either run and process, or they silently stop.
Heartbeat monitoring is the right pattern. Each worker pings a unique URL after each successful processing cycle. If no ping arrives within the expected period plus a grace window, the monitor alerts.
Common async components to monitor with heartbeats:
- Message queue consumers — ping after processing each batch, or on a fixed interval if processing is continuous
- Event stream processors — Kafka consumers, SQS workers, pub/sub subscribers
- Data pipeline stages — ETL jobs that run on schedule
- Scheduled aggregation — analytics rollup jobs, report generators
- Cleanup and maintenance workers — log rotation, expired session cleanup, temp file deletion
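// Example: Kafka consumer with heartbeat
// `consumer` stands in for your Kafka client; poll() and commitOffsets() are illustrative.
let lastCycleAt = Date.now();

async function processMessages() {
  while (true) {
    const messages = await consumer.poll({ timeout: 1000 });
    for (const message of messages) {
      await processMessage(message);
    }
    await consumer.commitOffsets();
    lastCycleAt = Date.now(); // The loop completed a cycle, even if the batch was empty
    // Ping after each successful batch
    if (messages.length > 0) {
      await fetch(process.env.PINGBASE_HEARTBEAT_URL).catch(() => {});
    }
  }
}

// Also ping on a timer when the queue is empty, but only while the loop is
// still cycling. An unconditional timer would keep pinging after the consumer hangs.
setInterval(() => {
  if (Date.now() - lastCycleAt < 60_000) {
    fetch(process.env.PINGBASE_HEARTBEAT_URL).catch(() => {});
  }
}, 60_000); // Every minute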
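On the monitoring side, the alerting rule for a heartbeat can be pictured as a simple time comparison. The sketch below is an assumption about how such a check could work, using a period and a grace window; it is not PingBase's actual implementation, and `isHeartbeatOverdue` is a hypothetical name.

```javascript
// Hypothetical monitor-side check: a heartbeat is overdue once the expected
// period plus a grace window has elapsed since the last ping.
function isHeartbeatOverdue(lastPingAt, { periodMs, graceMs, now = Date.now() }) {
  return now - lastPingAt > periodMs + graceMs;
}

// A worker expected to ping every 60s, with 30s of grace:
const config = { periodMs: 60_000, graceMs: 30_000 };
const lastPing = Date.parse("2024-01-01T00:00:00Z");
console.log(isHeartbeatOverdue(lastPing, { ...config, now: lastPing + 75_000 }));  // false: still within grace
console.log(isHeartbeatOverdue(lastPing, { ...config, now: lastPing + 120_000 })); // true: past period + grace
```

The grace window absorbs normal jitter such as a slow batch or a brief network delay. Set it from how late a healthy worker can plausibly be.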
Layer 4: The API gateway or BFF
Most microservices architectures have an API gateway or backend-for-frontend (BFF) layer that aggregates calls to downstream services. This is the layer your users actually interact with.
Monitor the gateway differently from internal services:
- Monitor from outside your network. Internal health checks tell you the service is running. External checks from PingBase tell you whether external users can actually reach it — catching firewall rules, CDN issues, and DNS problems that internal checks miss.
- Test actual user-facing endpoints, not just /health. The gateway health endpoint might be fine while a specific route is broken due to a downstream service failure.
- Assert response content, not just status codes. The gateway might return 200 with an error body when an upstream service is degraded.
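To illustrate the last point, here is a minimal sketch of a check that passes only when both the status code and the body look right. `evaluateResponse` and its option names are hypothetical, chosen for this example; in practice you would configure the equivalent assertions in your monitoring tool.

```javascript
// Hypothetical check evaluator: passes only if the status code matches AND
// the body contains what a real success payload should contain.
function evaluateResponse({ status, body }, { expectedStatus = 200, mustContain = [], mustNotContain = [] }) {
  const failures = [];
  if (status !== expectedStatus) failures.push(`status ${status}, expected ${expectedStatus}`);
  for (const text of mustContain) {
    if (!body.includes(text)) failures.push(`missing ${text}`);
  }
  for (const text of mustNotContain) {
    if (body.includes(text)) failures.push(`unexpected ${text}`);
  }
  return { ok: failures.length === 0, failures };
}

// A 200 that carries an error body fails the check.
const result = evaluateResponse(
  { status: 200, body: '{"error":"upstream unavailable"}' },
  { mustContain: ['"items"'], mustNotContain: ['"error"'] }
);
console.log(result.ok, result.failures);
```

A status-only check would have passed this response, which is exactly the failure mode a content assertion is meant to catch.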
Layer 5: The public status page
Your internal monitoring gives your engineering team visibility into service health. Your status page gives your users visibility. In a microservices architecture, the mapping between internal services and user-visible components isn't always 1:1.
Design your status page around user-visible functionality, not internal service names:
| Internal services | Status page component |
|---|---|
| auth-service, session-service, user-service | Authentication |
| api-gateway, routing-service | API |
| notification-service, email-worker, template-service | Email & Notifications |
| billing-service, payment-processor-adapter | Billing |
| cdn, asset-service, frontend-app | Dashboard |
Users don't know what auth-service is. They know whether they can log in. Structure your status page around their experience, and map your monitors to the appropriate component.
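The mapping in the table can be expressed as a small roll-up: each status page component takes the worst status among the services behind it. The status names and the worst-of rule are assumptions for this sketch; some teams prefer to show a component as degraded rather than down when only one of several services fails.

```javascript
// Severity order, lowest to highest. The status names are illustrative.
const SEVERITY = ["operational", "degraded", "down"];

// Mapping taken from the table above (abbreviated).
const COMPONENTS = {
  Authentication: ["auth-service", "session-service", "user-service"],
  API: ["api-gateway", "routing-service"],
  Billing: ["billing-service", "payment-processor-adapter"],
};

// A component shows the worst status among its services. Services with no
// reported state are treated as operational here; you may prefer "unknown".
function componentStatuses(components, serviceStates) {
  const result = {};
  for (const [component, services] of Object.entries(components)) {
    const worst = Math.max(
      0,
      ...services.map((s) => SEVERITY.indexOf(serviceStates[s] ?? "operational"))
    );
    result[component] = SEVERITY[worst];
  }
  return result;
}

console.log(componentStatuses(COMPONENTS, { "session-service": "degraded", "api-gateway": "down" }));
```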
Monitoring during deployments
In a microservices architecture, deployments are continuous. Services deploy independently, often multiple times per day. Each deployment is a potential incident source.
Two patterns worth implementing:
Pause monitors during rolling deployments. When a service is deploying, individual instances restart sequentially. A monitor checking during this window might catch an instance mid-restart and fire a false alert. Use PingBase's GitHub Action to pause the relevant monitor during deployment and resume it after health checks pass.
# .github/workflows/deploy.yml
- name: Pause monitor during deploy
  uses: pingbase/pause-monitor@v1
  with:
    api-key: ${{ secrets.PINGBASE_API_KEY }}
    monitor-id: ${{ vars.PAYMENT_SERVICE_MONITOR_ID }}
- name: Deploy payment-service
  run: kubectl rollout restart deployment/payment-service
- name: Wait for rollout
  run: kubectl rollout status deployment/payment-service
- name: Resume monitor
  uses: pingbase/resume-monitor@v1
  with:
    api-key: ${{ secrets.PINGBASE_API_KEY }}
    monitor-id: ${{ vars.PAYMENT_SERVICE_MONITOR_ID }}
Post-deploy verification. After a deployment completes, run a targeted check against the health endpoint before resuming monitoring. If the check fails, roll back rather than resuming monitoring and waiting for an alert.
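A post-deploy check along these lines could be a short script that polls the health endpoint a few times before declaring the deploy good. This is a sketch under assumptions: `verifyHealth`, its options, and the `HEALTH_URL` variable are hypothetical, and it expects the health body to carry a `status` field as in the earlier example. It relies on the global `fetch` available in Node 18 and later.

```javascript
// Polls a health URL until it reports healthy or attempts run out.
// `fetchFn` and `sleep` are injectable so the logic can be tested offline.
async function verifyHealth(url, { attempts = 5, delayMs = 2000, fetchFn = fetch, sleep = defaultSleep } = {}) {
  for (let i = 1; i <= attempts; i++) {
    try {
      const res = await fetchFn(url);
      if (res.ok) {
        const body = await res.json();
        if (body.status === "ok") return true;
      }
    } catch {
      // Network error or non-JSON body: treat as not yet healthy.
    }
    if (i < attempts) await sleep(delayMs);
  }
  return false;
}

function defaultSleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Usage in a deploy script (HEALTH_URL is a hypothetical variable):
// verifyHealth(process.env.HEALTH_URL).then((healthy) => {
//   if (!healthy) process.exit(1); // let the pipeline run its rollback step
// });
```

A non-zero exit lets the pipeline branch into a rollback step, for example `kubectl rollout undo deployment/payment-service`, instead of resuming the monitor.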
Organizing monitors at scale
When you have 20+ services, flat monitor lists become unmanageable. Use PingBase's monitor groups to organize by team, environment, or service tier:
- By tier: Core (auth, API gateway, billing) vs. Supporting (notifications, analytics, recommendations)
- By team: Platform, Product, Data
- By environment: Production, Staging (with different alert thresholds)
For teams using the API to manage monitors programmatically, PingBase's REST API supports bulk creation and tagging — useful when spinning up a new service follows a repeatable pattern.
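The repeatable pattern can be captured as a function that turns a service description into monitor definitions, which a script could then submit in bulk. Everything about the object shape below is made up for illustration: the field names, the tag format, and `monitorsForService` itself are not PingBase's API schema, so check the API reference for the real request format.

```javascript
// Hypothetical monitor definitions for a new service. The field names are
// invented for illustration and are NOT PingBase's API schema.
function monitorsForService({ name, baseUrl, tier, team, workers = [] }) {
  const tags = [`tier:${tier}`, `team:${team}`, `service:${name}`];
  const http = {
    type: "http",
    name: `${name} health`,
    url: `${baseUrl}/health`,
    tags,
  };
  const heartbeats = workers.map((worker) => ({
    type: "heartbeat",
    name: `${name} ${worker}`,
    tags,
  }));
  return [http, ...heartbeats];
}

const definitions = monitorsForService({
  name: "billing-service",
  baseUrl: "https://billing.internal.example.com",
  tier: "core",
  team: "platform",
  workers: ["invoice-worker"],
});
console.log(JSON.stringify(definitions, null, 2));
```

Keeping the template in code means every new service gets the same baseline coverage and the same tags for grouping by tier and team.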
What PingBase gives you for microservices
- HTTP monitors per service with response time thresholds and content validation
- Heartbeat monitors for async workers with configurable periods and grace windows
- DNS monitoring for service mesh entry points and internal service discovery
- SSL monitoring per domain/subdomain
- Monitor groups for organizing by team or tier
- GitHub Action for pausing monitors during deployments
- REST API and CLI for programmatic monitor management
- Status page showing user-visible components, not internal service names
- Multi-region checks to catch regional failures
The free tier covers 5 monitors — enough to cover the critical path (gateway, auth, billing) while evaluating. Pro at $9/month covers up to 10 monitors. Business at $29/month is unlimited.