After setting up monitoring for dozens of clients across startups and enterprises, we have developed a framework that works. Most guides tell you to use Prometheus and Grafana. That is not wrong. But that advice does not tell you what to actually monitor.
This is the 10-layer monitoring framework we implement for our clients. It comes from years of learning what breaks in production and what warning signs to watch for.
Every environment is different. But these layers cover the fundamentals that apply to most Kubernetes setups.
The Layers
We break monitoring down into ten layers. Each layer answers a different question. If you skip a layer, you will have blind spots.
Here is how we think about it:
- System and Infrastructure
- Application Performance
- HTTP and API Endpoints
- Database
- Cache
- Message Queues
- Tracing Infrastructure
- SSL and Certificates
- External Dependencies
- Log Patterns and Errors
Let me walk through each one.
Layer 1: System and Infrastructure
This is the foundation. If the underlying infrastructure is unhealthy, nothing else matters.
We monitor two levels here: the nodes where pods run, and the pods themselves.
Node Level
Your pods run on nodes. If a node is struggling, your pods will struggle too.
We use Prometheus with Node Exporter to collect these metrics:
- CPU usage and load average
- Memory usage and available memory
- Disk usage and disk I/O
- Network I/O
- Node up or down status
- Kubelet health
A common mistake is only watching pod metrics. Your pod might look fine while the node it runs on is running out of disk space. We learned this the hard way.
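To make that concrete, here is a minimal sketch of the kind of check we mean: querying the Prometheus HTTP API for free disk space per node. The Prometheus URL and the 10% threshold are placeholders for illustration; in practice this lives in an alerting rule, not a script.

```python
import requests

# Placeholder in-cluster Prometheus address; adjust for your setup.
PROMETHEUS_URL = "http://prometheus.monitoring.svc:9090"

# Node Exporter filesystem metrics: fraction of disk space still available per mount.
QUERY = (
    'node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} '
    '/ node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}'
)

resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": QUERY}, timeout=10)
resp.raise_for_status()

for result in resp.json()["data"]["result"]:
    node = result["metric"].get("instance", "unknown")
    mount = result["metric"].get("mountpoint", "?")
    free = float(result["value"][1])
    if free < 0.10:  # example threshold: alert when less than 10% of disk is free
        print(f"LOW DISK on {node} {mount}: {free:.1%} free")
```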
Pod and Container Level
For pods, we track:
- Pod up or down status
- Container restart counts
- Resource requests vs actual usage
- Whether pods are hitting their memory or CPU limits
Kubernetes Error States
Kubernetes has specific error states that tell you something is wrong. We alert on these:
- CrashLoopBackOff - Container keeps crashing and restarting. Something is broken in the app or config.
- ImagePullBackOff - Cannot pull the container image. Registry issue or wrong image name.
- OOMKilled - Container ran out of memory and got killed. Need to increase limits or fix a memory leak.
- Pending pods - Pod is stuck and cannot be scheduled. Usually a resource or node selector issue.
- Evicted pods - Node ran out of resources and kicked out the pod.
- Failed liveness or readiness probes - App is not responding to health checks.
If you see CrashLoopBackOff in production, you want to know immediately. Not when a user complains.
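Here is a rough sketch of what we look for, using the official Kubernetes Python client. It assumes kubeconfig access and only covers a few of the states above; in a real setup the alerts come from kube-state-metrics, but the logic is the same.

```python
from kubernetes import client, config

# Load credentials from kubeconfig; use config.load_incluster_config() inside a pod.
config.load_kube_config()
v1 = client.CoreV1Api()

BAD_REASONS = {"CrashLoopBackOff", "ImagePullBackOff", "ErrImagePull"}

for pod in v1.list_pod_for_all_namespaces().items:
    name = f"{pod.metadata.namespace}/{pod.metadata.name}"

    # Pods stuck in Pending usually point at a resource or node selector problem.
    if pod.status.phase == "Pending":
        print(f"PENDING: {name}")

    for cs in pod.status.container_statuses or []:
        waiting = cs.state.waiting
        if waiting and waiting.reason in BAD_REASONS:
            print(f"{waiting.reason}: {name} (restarts: {cs.restart_count})")
```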
Layer 2: Application Performance
System metrics tell you if the infrastructure is healthy. Application metrics tell you if the code is behaving.
This is where APM tools shine. We implement different tools based on client needs and budget:
Full APM platforms (metrics, traces, logs, errors):
- New Relic - Great free tier with 100GB/month. Good for startups. Shows code-level details, transaction traces, and error tracking out of the box.
- Datadog APM - Full-stack visibility if you are already using Datadog for infrastructure. Gets expensive at scale.
- Dynatrace - Enterprise-grade with AI-powered root cause analysis. Higher price point but less manual setup.
- SigNoz - Open source full APM alternative. Metrics, traces, and logs in one tool. Self-hosted.
- Elastic APM - Part of the Elastic stack. Good if you already use Elasticsearch.
Distributed tracing only:
- Jaeger - Tracing for microservices. Does not do metrics or logs. Pairs well with Prometheus for a complete setup.
- Zipkin - Similar to Jaeger. Lightweight tracing.
- Grafana Tempo - Tracing backend that integrates with Grafana. Works with Loki and Prometheus.
Instrumentation:
- OpenTelemetry - Vendor neutral. Instrument once, send data to any backend.
What we track with APM:
- Response times for each endpoint
- Error rates
- Transaction traces through the system
- Slow database queries
- Slow external API calls
The goal is to see what the application is actually doing. When a request is slow, you want to know if it is the code, the database, or an external service.
A trace showing a 3 second database query is more useful than a generic “high latency” alert.
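If you instrument with OpenTelemetry, the code side is small. Here is a minimal sketch; the service, span, and function names are made up for illustration, and it assumes an exporter and TracerProvider are configured elsewhere (for example via opentelemetry-instrument).

```python
import time

from opentelemetry import trace

tracer = trace.get_tracer("checkout-service")  # hypothetical service name


def load_order_from_db(order_id: str) -> None:
    time.sleep(0.05)  # stand-in for a real database query


def call_payment_provider(order_id: str) -> None:
    time.sleep(0.10)  # stand-in for a real external API call


def process_order(order_id: str) -> None:
    # Each nested span carries its own timing, so a slow database call shows up
    # as a slow child span instead of a vague "high latency" number.
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order.id", order_id)

        with tracer.start_as_current_span("db.load_order"):
            load_order_from_db(order_id)

        with tracer.start_as_current_span("payment.charge"):
            call_payment_provider(order_id)
```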
Layer 3: HTTP and API Endpoints
This is blackbox monitoring with Prometheus. We do not care how the app works internally. We just check if it responds correctly from the outside.
What we probe:
- Health check endpoints
- Critical user flows like login, checkout, or search
- Response status codes
- Response latency
We run these probes from multiple locations. Your app might work fine from inside the cluster but be unreachable from the internet.
If the health endpoint returns 200, great. If it returns 500 or times out, something is wrong.
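The check itself is nothing fancy. Here is a sketch of what a probe boils down to, with placeholder URLs; a tool like the Prometheus Blackbox Exporter does the same thing on a schedule and exposes the results as metrics.

```python
import requests

# Hypothetical endpoints to probe.
ENDPOINTS = [
    "https://example.com/healthz",
    "https://example.com/api/login",
]

for url in ENDPOINTS:
    try:
        resp = requests.get(url, timeout=5)
        status = "OK" if resp.status_code == 200 else f"BAD ({resp.status_code})"
        print(f"{url}: {status}, {resp.elapsed.total_seconds() * 1000:.0f} ms")
    except requests.RequestException as exc:
        print(f"{url}: FAILED ({exc})")
```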
Layer 4: Database
Databases cause a lot of production issues. They deserve their own monitoring layer.
What we track:
- Active connections vs connection pool size
- Query latency
- Slow queries
- Replication lag (if you have replicas)
- Lock waits and deadlocks
- Disk and memory usage
- Database up or down status
Connection pool exhaustion is a classic issue. Your app starts throwing errors because it cannot get a database connection. If you track connection usage, you see it coming before it breaks.
Replication lag matters if you read from replicas. A replica that is 30 seconds behind will serve stale data.
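As an illustration, here is roughly what those checks look like against PostgreSQL, with placeholder connection details. In practice an exporter such as postgres_exporter collects these numbers for Prometheus, but the underlying queries are this simple.

```python
import psycopg2

# Placeholder DSN for illustration.
conn = psycopg2.connect("dbname=app host=db.internal user=monitor")
cur = conn.cursor()

# Connections in use vs the server's configured maximum.
cur.execute("SELECT count(*) FROM pg_stat_activity")
active = cur.fetchone()[0]
cur.execute("SHOW max_connections")
max_conn = int(cur.fetchone()[0])
print(f"connections: {active}/{max_conn}")

# Replication lag in seconds, measured on a replica (NULL on a primary).
cur.execute("SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))")
lag = cur.fetchone()[0]
if lag is not None:
    print(f"replication lag: {lag:.1f}s")

cur.close()
conn.close()
```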
Layer 5: Cache
We use Redis heavily. When Redis has issues, the app slows down or fails.
What we track:
- Hit and miss ratio
- Memory usage
- Eviction rate
- Connection count
- Redis up or down status
A dropping hit ratio means your cache is not working well. Either keys are expiring too fast or you are caching the wrong things.
High eviction rate means Redis is running out of memory and throwing away data. You need more memory or better cache policies.
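Everything above comes straight out of Redis INFO. A quick sketch with a placeholder host; redis_exporter exposes the same fields as Prometheus metrics.

```python
import redis

r = redis.Redis(host="redis.internal", port=6379)  # placeholder connection
info = r.info()

hits = info["keyspace_hits"]
misses = info["keyspace_misses"]
hit_ratio = hits / (hits + misses) if (hits + misses) else 0.0

print(f"hit ratio: {hit_ratio:.1%}")
print(f"used memory: {info['used_memory_human']}")
print(f"evicted keys: {info['evicted_keys']}")
print(f"connected clients: {info['connected_clients']}")
```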
Layer 6: Message Queues
We use message queues for async processing. If the queue backs up, work is not getting done.
What we track:
- Queue depth and lag
- Consumer lag
- Messages per second in and out
- Dead letter queue size
- Queue up or down status
Consumer lag is the big one. If your consumers cannot keep up with producers, the lag grows. Eventually you have a backlog of thousands of messages and users wondering why their jobs are not processing.
Dead letter queues catch failed messages. If that queue is growing, something is failing repeatedly.
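As a sketch, here is what watching queue depth can look like against the RabbitMQ management API. The host, credentials, threshold, and the .dlq naming convention are all assumptions for illustration; a Kafka setup would read consumer lag from an exporter instead, but the idea is the same.

```python
import requests

# Placeholder management API endpoint and credentials.
RABBIT_API = "http://rabbitmq.internal:15672/api/queues"

resp = requests.get(RABBIT_API, auth=("monitor", "secret"), timeout=10)
resp.raise_for_status()

for queue in resp.json():
    depth = queue.get("messages", 0)
    if depth > 1000:  # example threshold for "work is piling up"
        print(f"backlog on {queue['name']}: {depth} messages")
    # Assumes dead letter queues follow a .dlq naming convention.
    if queue["name"].endswith(".dlq") and depth > 0:
        print(f"dead letter queue {queue['name']} has {depth} messages")
```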
Layer 7: Tracing Infrastructure
We run Jaeger for distributed tracing. But Jaeger itself needs monitoring.
What we track:
- Collector health and up or down status
- Span ingestion rate
- Storage backend health
- Dropped spans
If Jaeger is down, you lose visibility into your traces. You will not know until you need a trace and it is not there.
Dropped spans mean Jaeger cannot keep up. You are losing data.
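A quick sketch of checking the collector from the outside. The admin port (14269 is the collector default) and the exact metric names vary by Jaeger version, so treat these as placeholders to verify against your deployment.

```python
import requests

# Placeholder service address; 14269 is the collector's default admin port.
ADMIN_URL = "http://jaeger-collector.observability.svc:14269"

# The admin port serves a health check at / and Prometheus metrics at /metrics.
health = requests.get(f"{ADMIN_URL}/", timeout=5)
print(f"collector health: HTTP {health.status_code}")

metrics = requests.get(f"{ADMIN_URL}/metrics", timeout=5).text
for line in metrics.splitlines():
    # Counter names differ across versions; the "dropped" span counters are
    # the ones worth alerting on.
    if "spans_dropped" in line and not line.startswith("#"):
        print(line)
```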
Layer 8: SSL and Certificates
Expired certificates cause outages. And every single time, it is embarrassing, because it was preventable.
What we track:
- Certificate expiry dates
- Days until expiry (alert at 30, 14, and 7 days)
- Domain validation status
- TLS version
Our first alert fires 30 days before expiry. That gives plenty of time to renew. Some teams wait until 7 days. That is asking for trouble if someone is on vacation.
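Checking expiry does not need anything fancy. Here is a minimal sketch using only the Python standard library, with a placeholder domain list; a probing tool can export the same number as a metric so Prometheus handles the alert thresholds.

```python
import socket
import ssl
import time

def days_until_expiry(host: str, port: int = 443) -> int:
    """Return the number of days until the host's TLS certificate expires."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    expires = ssl.cert_time_to_seconds(cert["notAfter"])
    return int((expires - time.time()) // 86400)

for host in ["example.com"]:  # placeholder domain list
    days = days_until_expiry(host)
    if days <= 30:  # first warning threshold from above
        print(f"{host}: certificate expires in {days} days")
```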
Layer 9: External Dependencies
Your app probably depends on services you do not control. Payment providers. Auth services. Third party APIs.
What we track:
- Response times from external APIs
- Error rates from external calls
- Availability of critical third party services
When Stripe or Auth0 has issues, you want to know before your users do. Sometimes the problem is not your code. It is a dependency.
We keep a simple Grafana dashboard showing the health of all external services we depend on.
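One way to get those numbers is to instrument the outbound calls themselves. Here is a sketch using prometheus_client; the metric names, labels, port, and URL are made up for illustration.

```python
import requests
from prometheus_client import Counter, Histogram, start_http_server

# Hypothetical metric and label names for the dependency dashboard.
EXTERNAL_LATENCY = Histogram(
    "external_request_duration_seconds",
    "Latency of calls to external services",
    ["service"],
)
EXTERNAL_ERRORS = Counter(
    "external_request_errors_total",
    "Failed calls to external services",
    ["service"],
)

def call_external(service: str, url: str):
    """GET an external URL while recording latency and failures per service."""
    with EXTERNAL_LATENCY.labels(service=service).time():
        try:
            resp = requests.get(url, timeout=5)
            resp.raise_for_status()
            return resp
        except requests.RequestException:
            EXTERNAL_ERRORS.labels(service=service).inc()
            raise

if __name__ == "__main__":
    start_http_server(8000)  # expose /metrics for Prometheus to scrape
    call_external("payments", "https://payments.example.com/health")  # placeholder URL
```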
Layer 10: Log Patterns and Errors
Metrics tell you something is wrong. Logs tell you what is wrong.
We use centralized log management to monitor:
Error rate changes
- Sudden spike in 5xx errors
- Unusual increase in 4xx errors
- New error types appearing
Specific error patterns
We search for patterns that indicate real problems:
- “timeout” - Something is taking too long
- “connection refused” - Cannot connect to a service
- “deadlock” - Database contention issue
- “out of memory” - Memory pressure
- “disk full” - Storage issue
- “connection pool exhausted” - Need more connections
- “circuit breaker open” - Downstream service is failing
When we see a spike in timeout errors, we know to look at network or downstream services. When we see deadlock patterns, we check the database.
Pattern matching on logs catches issues that metrics miss.
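Here is a minimal sketch of that kind of pattern scan, reading log lines from stdin. A real setup does this in the log pipeline (Loki, Elasticsearch, or similar), but the idea is the same.

```python
import re
import sys
from collections import Counter

# The patterns listed above; counts per pattern give a cheap signal of
# which kind of failure is spiking.
PATTERNS = {
    "timeout": re.compile(r"timeout", re.IGNORECASE),
    "connection refused": re.compile(r"connection refused", re.IGNORECASE),
    "deadlock": re.compile(r"deadlock", re.IGNORECASE),
    "out of memory": re.compile(r"out of memory", re.IGNORECASE),
    "disk full": re.compile(r"disk full", re.IGNORECASE),
    "pool exhausted": re.compile(r"connection pool exhausted", re.IGNORECASE),
    "circuit breaker open": re.compile(r"circuit breaker open", re.IGNORECASE),
}

counts = Counter()
# Example: kubectl logs deploy/api --since=5m | python scan_logs.py
for line in sys.stdin:
    for name, pattern in PATTERNS.items():
        if pattern.search(line):
            counts[name] += 1

for name, count in counts.most_common():
    print(f"{name}: {count}")
```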
What We Do Not Monitor
Being honest here. We do not monitor everything.
We skip:
- Per-request logging on high-traffic endpoints (too expensive)
- Debug-level logs in production (too noisy)
- Metrics with high cardinality that blow up storage costs
There are trade-offs. We accept some blind spots to keep costs reasonable and dashboards usable.
Alerting Philosophy
Not everything needs an alert. We follow a simple rule:
Alert on symptoms, not causes.
High CPU does not always mean a problem. Users getting errors is always a problem.
We have three levels:
- Page someone - Users are affected right now. 5xx errors, service down, data loss risk.
- Slack notification - Something needs attention today. Certificate expiring in 14 days, disk at 80%.
- Just log it - Interesting but not urgent. High CPU that did not cause user impact.
If an alert fires and we do nothing about it, we delete the alert. Alert fatigue is real. Every alert should mean something.
Wrapping Up
This is the framework we have refined over years of implementing monitoring for clients. Ten layers covering infrastructure, applications, dependencies, and errors.
The key is not having fancy tools. It is knowing what questions to ask:
- Is the infrastructure healthy?
- Is the application behaving?
- Can users actually use it?
- Are the dependencies working?
- What errors are happening?
If you can answer those questions from your monitoring, you are in good shape.
Happy to answer questions if you have them.
Need Help Setting This Up?
We implement this monitoring framework for clients running Kubernetes in production. Whether you are a startup needing a cost-effective setup or an enterprise requiring full observability, we can help.
Book a free 30-minute consultation to discuss your monitoring needs. No sales pitch - just an honest conversation about what would work for your setup.