www.tasrieit.com
DevOps, Kubernetes & Cloud Consulting

The 10-Layer Kubernetes Monitoring Checklist

The exact framework we use when auditing monitoring setups for clients running Kubernetes in production.

By Amjad Syed | Founder & CEO, Tasrie IT Services

1. System & Infrastructure

Is the underlying infrastructure healthy?

Tools: Prometheus + Node Exporter, kube-state-metrics

2. Application Performance

Is the code behaving correctly?

Tools: New Relic (free tier), Datadog APM, SigNoz (open source), Jaeger | Instrumentation: OpenTelemetry

3. HTTP, API & Real User Monitoring

Can users actually reach and use the application?

Synthetic: Blackbox Exporter, Checkly | API: Runscope, Postman Monitors | RUM: Datadog RUM, LogRocket, Sentry

4. Database

Is the database healthy and performing?

Metric | Warning | Critical
Connection pool usage | 70% | 90%
Replication lag | 10s | 60s
Query latency p95 | 500ms | 2s
Tools: PostgreSQL Exporter, MySQL Exporter, PgHero, PMM
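
The database thresholds above can be kept as data so that one function classifies any reading. This is a minimal Python sketch, not a definitive implementation: the metric key names are hypothetical, and it assumes a value at or above a threshold triggers that level.

```python
from typing import NamedTuple


class Threshold(NamedTuple):
    warning: float
    critical: float


# Threshold values come from the checklist; the key names are hypothetical.
DB_THRESHOLDS = {
    "connection_pool_usage_pct": Threshold(warning=70, critical=90),
    "replication_lag_s": Threshold(warning=10, critical=60),
    "query_latency_p95_ms": Threshold(warning=500, critical=2000),
}


def classify(metric: str, value: float) -> str:
    """Return 'ok', 'warning', or 'critical' for a single reading."""
    threshold = DB_THRESHOLDS[metric]
    # Check the critical level first so it takes precedence over warning.
    if value >= threshold.critical:
        return "critical"
    if value >= threshold.warning:
        return "warning"
    return "ok"
```

For example, `classify("replication_lag_s", 45)` falls between the two thresholds and would be reported as a warning.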

5. Cache

Is the cache working effectively?

Tools: Redis Exporter, Memcached Exporter, CloudWatch (ElastiCache)
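
One common way to judge whether a cache is effective is its hit ratio. Redis reports cumulative `keyspace_hits` and `keyspace_misses` counters in `INFO stats`, so the ratio is best computed over the difference between two samples. The sketch below assumes each sample is a plain dict holding those two counters; how the samples are collected is left out.

```python
from typing import Optional


def hit_ratio(prev: dict, curr: dict) -> Optional[float]:
    """Cache hit ratio over the window between two counter samples.

    Returns None when there were no lookups in the window
    (or the counters were reset), since the ratio is undefined.
    """
    hits = curr["keyspace_hits"] - prev["keyspace_hits"]
    misses = curr["keyspace_misses"] - prev["keyspace_misses"]
    total = hits + misses
    if hits < 0 or misses < 0 or total == 0:
        return None
    return hits / total
```

What counts as a healthy ratio depends on the workload, so no threshold is assumed here.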

6. Message Queues

Is async work getting processed?

Tools: Kafka Exporter, Burrow, RabbitMQ Prometheus Plugin, SQS Exporter
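
Whether async work is being processed is usually read from consumer lag: the newest offset in each partition minus the offset the consumer group has committed. The sketch below is illustrative only. The function names are hypothetical, and the "stalled" rule (lag is non-zero and never shrinks across samples) is a simplifying assumption, not the algorithm used by Burrow or any exporter.

```python
from typing import Dict, List


def consumer_lag(log_end: Dict[int, int], committed: Dict[int, int]) -> Dict[int, int]:
    """Per-partition lag: newest offset minus last committed offset."""
    return {
        partition: max(end - committed.get(partition, 0), 0)
        for partition, end in log_end.items()
    }


def is_stalled(total_lag_samples: List[int]) -> bool:
    """Assumed heuristic: lag is non-zero and did not decrease at any step."""
    if len(total_lag_samples) < 2 or total_lag_samples[-1] == 0:
        return False
    pairs = zip(total_lag_samples, total_lag_samples[1:])
    return all(later >= earlier for earlier, later in pairs)
```

A consumer that is keeping up shows lag that rises and falls; lag that only grows suggests the consumers are stuck or under-provisioned.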

7. Tracing Infrastructure

Is your observability infrastructure healthy?

Tools: Built-in Jaeger/Tempo metrics, Prometheus

8. SSL & Certificates

Will certificates expire and cause an outage?

Tools: Blackbox Exporter (probe_ssl_earliest_cert_expiry), cert-manager
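
The `probe_ssl_earliest_cert_expiry` metric is a Unix timestamp in seconds, so an expiry check is a subtraction against the current time. This Python sketch shows the arithmetic; the function names are hypothetical, and the 14-day default mirrors the "cert expiring in 14 days" example used in the alerting tiers of this checklist.

```python
import time
from typing import Optional

SECONDS_PER_DAY = 86_400


def days_until_expiry(expiry_ts: float, now: Optional[float] = None) -> float:
    """Days remaining before the certificate expires (negative if expired)."""
    current = time.time() if now is None else now
    return (expiry_ts - current) / SECONDS_PER_DAY


def needs_renewal_notice(
    expiry_ts: float, now: Optional[float] = None, notice_days: float = 14
) -> bool:
    """True when the certificate expires within the notice window."""
    return days_until_expiry(expiry_ts, now) <= notice_days
```

Passing `now` explicitly keeps the check deterministic in tests; in production it defaults to the current time.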

9. External Dependencies

Are third-party services working?

Tools: StatusGator, Instatus, Hyperping, your own probes
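
"Your own probes" can be as small as a periodic HTTP check per dependency. The sketch below uses only the Python standard library. The dependency names and URLs in the usage note are placeholders, and treating any 2xx or 3xx status as healthy is an assumption that may not suit every API.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status of a GET request, or 0 if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses still carry a status code.
        return err.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and similar.
        return 0


def probe_dependencies(
    urls: Dict[str, str], fetch: Callable[[str], int] = http_status
) -> Dict[str, bool]:
    """Map each dependency name to True if its probe looks healthy."""
    return {name: 200 <= fetch(url) < 400 for name, url in urls.items()}
```

A call might look like `probe_dependencies({"payments": "https://payments.example.com/health"})`. The `fetch` parameter exists so the probe logic can be tested without network access.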

10. Log Patterns & Errors

What specific errors are happening?

Tools: Loki, Elasticsearch, CloudWatch Logs, Datadog Logs
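
Answering "what specific errors are happening?" usually means counting known error patterns in log lines. Log backends do this with their own query languages; the Python sketch below only illustrates the idea on an in-memory list of lines. The pattern names and regular expressions are hypothetical and should be replaced with strings your services actually emit.

```python
import re
from collections import Counter
from typing import Iterable

# Hypothetical patterns for illustration only.
ERROR_PATTERNS = {
    "timeout": re.compile(r"timed? ?out", re.IGNORECASE),
    "oom": re.compile(r"out of memory|OOMKilled", re.IGNORECASE),
    "conn_refused": re.compile(r"connection refused", re.IGNORECASE),
}


def count_error_patterns(lines: Iterable[str]) -> Counter:
    """Count how many lines match each named error pattern."""
    counts: Counter = Counter()
    for line in lines:
        for name, pattern in ERROR_PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```

Tracking these counts over time shows which error type is spiking, which a generic "error rate" number hides.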

Alerting Philosophy

Rule #1: Alert on symptoms, not causes. High CPU isn't always a problem. Users getting errors is always a problem.

Page Someone

Users affected NOW. 5xx errors, service down, data loss risk.

Channel: PagerDuty

Slack Notification

Needs attention today. Cert expiring in 14 days, disk at 80%.

Channel: Slack

Just Log It

Interesting but not urgent. High CPU without user impact.

Channel: Dashboard
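
The three tiers above reduce to two questions asked in order: are users affected now, and does it need attention today? A minimal Python sketch of that routing follows. The `Alert` fields and the channel strings are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    name: str
    users_affected: bool      # e.g. 5xx errors, service down, data loss risk
    needs_action_today: bool  # e.g. cert expiring in 14 days, disk at 80%


def route(alert: Alert) -> str:
    """Pick a channel by user impact first, then by urgency."""
    if alert.users_affected:
        return "pagerduty"
    if alert.needs_action_today:
        return "slack"
    return "dashboard"
```

Checking user impact first means an alert that is both user-facing and "fix today" always pages someone.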

Key principle: If an alert fires and you do nothing about it, delete the alert. Alert fatigue is real.
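
One way to apply this principle is to review alert history periodically and list alerts that fired repeatedly without anyone acting on them. The sketch below assumes a hypothetical history format of `(alert_name, action_taken)` tuples, one per firing, and a hypothetical `min_fires` cutoff so that an alert seen once or twice is not flagged prematurely.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def unactioned_alerts(
    history: Iterable[Tuple[str, bool]], min_fires: int = 5
) -> List[str]:
    """Names of alerts that fired at least min_fires times with no action."""
    fires: Counter = Counter()
    acted: Counter = Counter()
    for name, action_taken in history:
        fires[name] += 1
        if action_taken:
            acted[name] += 1
    return sorted(
        name
        for name, count in fires.items()
        if count >= min_fires and acted[name] == 0
    )
```

The result is a list of candidates for deletion or demotion to a dashboard, not an automatic verdict.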

Common Mistakes to Avoid

Quick Reference: Tool Stacks

Budget-Friendly (Open Source)

  • Prometheus + Grafana (metrics)
  • Loki (logs)
  • Jaeger or Tempo (traces)
  • Blackbox Exporter (synthetic)
  • Alertmanager → Slack

Enterprise Stack

  • Datadog or New Relic (all-in-one)
  • PagerDuty (alerting & on-call)
  • Checkly (synthetic monitoring)
  • StatusGator (external deps)
  • LogRocket (session replay)

Need Help Implementing This?

We've implemented this exact framework for clients with 400+ servers.
Let us audit your monitoring setup and show you what you're missing.