K8s Observability Stack: What We Deploy in 2026

The production Kubernetes observability stack we deploy in 2026. Prometheus Operator, Loki, OpenTelemetry, Tempo, eBPF tools, and how the pieces fit together.

Engineering Team

Most “Kubernetes observability” articles list 20 tools and call it a day. Useful if you are writing a CNCF landscape report. Useless if you actually have to deploy something on Monday.

This is the stack we deploy. The specific tools at each layer, the versions running in production, the trade-offs we made, and the patterns that survive contact with real workloads.

If you are setting up Kubernetes observability for the first time, or you are unhappy with a stack that has grown organically over three years and is now both expensive and unreliable, this is the reference.

The stack at a glance

| Layer | What we deploy | Why this and not that |
| --- | --- | --- |
| Metrics scraping | Prometheus + Prometheus Operator | Standard, declarative, the most mature K8s observability primitive |
| Metric long-term storage | Thanos or Grafana Mimir | Cross-cluster querying, S3-backed, no Prometheus retention pain |
| Logs | Grafana Loki + Grafana Alloy | Label-based, S3-backed, cheap; OTel logs as the alternative |
| Traces | Grafana Tempo + OpenTelemetry Collector | Object-storage-backed, integrates with Loki and Prometheus |
| Auto-instrumentation | OpenTelemetry Operator | Java, Python, Node, .NET, Go - no app code changes |
| Events | kubernetes-event-exporter + Falco | K8s events to Loki; runtime security events |
| Network + kernel observability | Cilium Hubble + Pixie | eBPF-based, no sidecars, golden signals automatic |
| Visualization | Grafana | Universal frontend for everything above |
| Alerting | Alertmanager | Standard, native to Prometheus, integrates with PagerDuty/OpsGenie |

That is the whole picture. The rest of this post explains why each choice and what changes at scale.

Why “K8s observability” is not just “monitoring with extra steps”

Before the tools, a quick reset on what makes K8s observability different from VM monitoring.

The shape of Kubernetes workloads breaks classic APM assumptions. Pods come and go. A 30-second-old container can be normal (Job) or alarming (CrashLoop). Cardinality explodes when you label metrics with pod names because pod names are ephemeral. Service discovery is dynamic. Network traffic is east-west more than north-south. The kernel and the orchestrator both have opinions about what happened to your packet.
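
To make the cardinality point concrete, here is a rough sketch of the arithmetic. The label names and counts are hypothetical, and the product is a worst-case upper bound rather than what any particular cluster will produce.

```python
from math import prod


def series_upper_bound(distinct_values_per_label: dict[str, int]) -> int:
    """Worst-case number of time series for one metric:
    the product of the distinct values each label can take."""
    return prod(distinct_values_per_label.values())


# Hypothetical label sets, for illustration only.
stable_labels = {"namespace": 5, "app": 20, "status_code": 6}
# Pod names churn, so over a retention window one label can take hundreds of values.
with_pod_label = {**stable_labels, "pod": 300}

print(series_upper_bound(stable_labels))   # 600
print(series_upper_bound(with_pod_label))  # 180000
```

One ephemeral label multiplies everything else, which is why the cardinality controls matter more than any single tool choice.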

The stack you build needs to handle all of this declaratively (because pods are managed declaratively), with high-cardinality controls (because labels multiply), and with service mesh and ingress integration (because that is where requests actually flow).

This is why we do not retrofit Datadog or a generic APM into Kubernetes and call it observability. The K8s-native stack matches the shape of the workloads. That is the entire reason for the existence of Kubernetes-specific observability.

Layer 1: Metrics scraping with Prometheus Operator

Prometheus is the foundation. Not vanilla Prometheus though - Prometheus Operator with the kube-prometheus-stack Helm chart.

Why the Operator and not vanilla Prometheus:

  • ServiceMonitor and PodMonitor CRDs mean monitoring lives in YAML alongside application manifests. Adding a service means adding a ServiceMonitor, not editing prometheus.yml.
  • PrometheusRule CRD keeps alerts and recording rules in version control with the services they relate to.
  • Multi-replica with Thanos sidecar is built in for HA.
  • kube-state-metrics, node-exporter, and Alertmanager ship as part of the stack with sensible defaults.

The kube-prometheus-stack Helm chart deploys all of this in one shot. We Helm-install it via ArgoCD so configuration lives in Git, not in a cluster operator’s terminal history.
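
As an illustration of what "adding a ServiceMonitor" looks like, here is a minimal sketch that builds one as a plain dictionary and prints it as JSON (which kubectl also accepts). The service name, namespace, `app` label, and the port name `metrics` are hypothetical. The `release` label is an assumption about how the chart's default selector picks up ServiceMonitors; check your own Prometheus resource before relying on it.

```python
import json


def service_monitor(name: str, namespace: str, app_label: str,
                    port: str = "metrics", interval: str = "30s") -> dict:
    """Build a ServiceMonitor manifest for Services labelled app=<app_label>."""
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "ServiceMonitor",
        "metadata": {
            "name": name,
            "namespace": namespace,
            # Assumption: Prometheus selects ServiceMonitors by Helm release label.
            "labels": {"release": "kube-prometheus-stack"},
        },
        "spec": {
            "selector": {"matchLabels": {"app": app_label}},
            # `port` is the *name* of the Service port that exposes /metrics.
            "endpoints": [{"port": port, "path": "/metrics", "interval": interval}],
        },
    }


manifest = service_monitor("api", "prod", "api")
print(json.dumps(manifest, indent=2))
```

The manifest sits next to the Deployment and Service in the same Git repository, so monitoring is reviewed in the same pull request as the application change.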

What we change from defaults:

  • Retention: Default is 10 days. We set Prometheus to 24-48 hours for local scraping and push everything older to Thanos or Mimir. Long retention on Prometheus pods is wasteful.
  • External labels: cluster: prod-eu-west-1 so multi-cluster querying makes sense.
  • Resource requests: kube-prometheus-stack defaults under-request memory. We typically set requests to 2-4 GiB on the Prometheus pod for a 50-node cluster.
  • NodeSelector / PodAntiAffinity: Keep Prometheus and Alertmanager off the same node.
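
A sketch of the first three overrides as a Helm values fragment, generated here as JSON (Helm reads JSON values files because JSON is a subset of YAML). The concrete numbers come from the list above; the function name is made up for illustration, and scheduling constraints are left out because they depend on your node layout.

```python
import json


def prometheus_overrides(cluster: str, retention: str = "24h",
                         memory_request: str = "2Gi") -> dict:
    """Values fragment for kube-prometheus-stack: short local retention,
    a cluster external label, and a larger memory request."""
    return {
        "prometheus": {
            "prometheusSpec": {
                "retention": retention,
                "externalLabels": {"cluster": cluster},
                "resources": {"requests": {"memory": memory_request}},
            }
        }
    }


values = prometheus_overrides("prod-eu-west-1")
print(json.dumps(values, indent=2))
```

Keeping this fragment in Git next to the ArgoCD application means the retention and label choices are reviewable like any other change.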

Common mistake we fix in audits: people set Prometheus retention to 30 days, fill an EBS volume, and then have no way to query historical metrics because Prometheus is fighting itself. Short local retention plus long-term storage solves it.
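
The disk math behind this is simple. The Prometheus documentation gives a rough sizing rule: retention time in seconds, times ingested samples per second, times bytes per sample (typically 1 to 2 bytes). The ingestion rate below is a hypothetical figure, not a measurement from any real cluster.

```python
def needed_disk_bytes(retention_hours: float, samples_per_second: float,
                      bytes_per_sample: float = 2.0) -> float:
    """Rough Prometheus disk estimate:
    retention_seconds * ingested_samples_per_second * bytes_per_sample."""
    return retention_hours * 3600 * samples_per_second * bytes_per_sample


SAMPLES_PER_SECOND = 100_000  # hypothetical ingestion rate

for label, hours in (("48h", 48), ("30d", 30 * 24)):
    gigabytes = needed_disk_bytes(hours, SAMPLES_PER_SECOND) / 1e9
    print(f"{label}: {gigabytes:.1f} GB")
# 48h: 34.6 GB
# 30d: 518.4 GB
```

The estimate ignores WAL and compaction overhead, so treat it as a floor. Even so, the 15x gap between the two retention settings is the point.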

Layer 2: Long-term storage with Thanos or Mimir

After about a week of metrics, you need somewhere cheaper to put them. The two production options that work:

Thanos (thanos.io) - sidecar-based, ships Prometheus blocks to S3 or GCS, runs Querier and Store Gateway for historical queries. Mature, battle-tested. The downside is operational complexity: you are running Sidecar, Store Gateway, Querier, Compactor, Receive, and Ruler if you want everything.

Grafana Mimir - the Cortex fork from Grafana Labs. Single-binary deployment option (or microservices), S3-backed, designed for multi-tenant scale. Easier to operate at the cost of less ecosystem maturity than Thanos.

Our default is Thanos for organizations already running Prometheus Operator and Mimir for greenfield deployments where operational simplicity wins.

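Both store metric blocks in object storage. We use AWS S3 with lifecycle policies that move blocks to Glacier after 90 days for cold retention. Blocks in an archive tier that needs a restore step cannot be queried until they are restored, so treat that tier as cold retention only. The same pattern applies on Azure (Blob Storage) and GCP (Cloud Storage).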

Layer 3: Logs with Loki and Alloy

Logs are where stack choice becomes opinionated. The three real options in 2026:

  1. Grafana Loki with Alloy or Promtail collectors - what we deploy by default
  2. OpenTelemetry logs with the Collector - emerging, increasingly viable
  3. Elastic / OpenSearch - we deploy it when clients have existing Elastic investment

We deploy Loki because:

  • Label-based, not full-text-indexed. You query “give me logs from namespace=prod, app=api, level=error” not “match these words across all logs.” This makes Loki dramatically cheaper at scale - storage is roughly 1/10th of indexed alternatives.
  • S3-backed by default. Cheap, durable, no Elasticsearch cluster to operate.
  • Native Grafana integration. Same UI as metrics and traces. Correlation across signals is just clicking through.

What we configure:

  • Per-tenant label cardinality limits. We set stream and label limits explicitly (for example max_global_streams_per_user and max_label_names_per_series) instead of relying on defaults. Without limits, a misconfigured app blows up your bill.
  • Retention by namespace. Production logs 90 days, staging 14 days, dev 3 days. We use Loki's compactor-based retention with per-stream rules, plus S3 lifecycle policies.
  • Alloy on every node as a DaemonSet. Alloy is the successor to Promtail and unifies metric scraping, log collection, and trace forwarding in one agent. Read our migration guide from Promtail to Alloy if you are running the older version.
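
The retention tiers above map onto Loki's compactor retention settings roughly as follows. This is a sketch of the relevant config fragment only, not a complete Loki configuration. The namespace names `staging` and `dev` are assumptions, and recent Loki versions need further compactor settings before retention can be enabled.

```python
def hours(days: int) -> str:
    """Express a retention period in hours, e.g. 3 days -> '72h'."""
    return f"{days * 24}h"


def loki_retention(default_days: int, per_namespace_days: dict[str, int]) -> dict:
    """Global retention plus per-stream overrides keyed on the namespace label."""
    streams = [
        {"selector": f'{{namespace="{ns}"}}', "priority": 1, "period": hours(days)}
        for ns, days in sorted(per_namespace_days.items())
    ]
    return {
        "compactor": {"retention_enabled": True},
        "limits_config": {
            "retention_period": hours(default_days),  # applies to everything else
            "retention_stream": streams,
        },
    }


# Production keeps the 90-day default; staging and dev are shortened.
config = loki_retention(90, {"staging": 14, "dev": 3})
for rule in config["limits_config"]["retention_stream"]:
    print(rule["selector"], rule["period"])
# {namespace="dev"} 72h
# {namespace="staging"} 336h
```

Setting the longest period as the default and shortening the noisy environments means a namespace nobody remembered to classify keeps its logs instead of silently losing them.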

When we do not use Loki:

  • Heavy text-search requirements where the search needs to span unstructured log content. Loki’s label-first approach is wrong for forensic text-mining; OpenSearch or Elastic wins there.
  • Compliance requirements that need specific certifications (FedRAMP, certain HIPAA interpretations) where a managed service is required.

Layer 4: Traces with Tempo and OpenTelemetry

Tracing is where most K8s observability stacks underinvest. Then a slow request becomes a six-engineer war room because nobody knows where time is spent.

Tempo for storage. Object-storage-backed, no indexing tax, integrates natively with Loki and Prometheus via Grafana’s “linked signals” view.

OpenTelemetry Operator for instrumentation. The Operator deploys auto-instrumentation agents per language: for example, the pod annotation instrumentation.opentelemetry.io/inject-java: "true" triggers Java agent injection. The same pattern applies to Python, Node.js, .NET, and Go, with no application code changes.
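
A small sketch of how the annotation ends up on a workload. It patches a Deployment manifest held as a dictionary. The helper name is made up for illustration, and it assumes an Instrumentation resource already exists in the namespace. Go auto-instrumentation has additional requirements in the Operator documentation that this sketch does not cover.

```python
import copy

# Pod annotations the OpenTelemetry Operator watches for, one per language.
INJECT_ANNOTATIONS = {
    "java": "instrumentation.opentelemetry.io/inject-java",
    "python": "instrumentation.opentelemetry.io/inject-python",
    "nodejs": "instrumentation.opentelemetry.io/inject-nodejs",
    "dotnet": "instrumentation.opentelemetry.io/inject-dotnet",
    "go": "instrumentation.opentelemetry.io/inject-go",
}


def with_auto_instrumentation(deployment: dict, language: str) -> dict:
    """Return a copy of the Deployment with the injection annotation set
    on the pod template (not on the Deployment's own metadata)."""
    patched = copy.deepcopy(deployment)
    metadata = patched["spec"]["template"].setdefault("metadata", {})
    metadata.setdefault("annotations", {})[INJECT_ANNOTATIONS[language]] = "true"
    return patched


deployment = {"kind": "Deployment", "spec": {"template": {"spec": {"containers": []}}}}
patched = with_auto_instrumentation(deployment, "java")
print(patched["spec"]["template"]["metadata"]["annotations"])
# {'instrumentation.opentelemetry.io/inject-java': 'true'}
```

The annotation belongs on the pod template because injection happens when pods are admitted, so existing pods only pick it up after a rollout.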

Sampling strategy matters more than the tool choice. We typically set:

  • Head sampling: 100% for low-volume services, 10% for high-volume, 1% for highest-volume (think hot paths in a payment gateway).
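  • Tail sampling: enabled for errors and slow requests. Tail sampling only sees traces that reach the collector, so for services where errored or slow traces must always be kept, the SDK exports every span and the percentage-based reduction happens at the gateway tier instead.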
  • Sampling rules per route: health-check endpoints sampled at 0.1%, business endpoints at higher rates.
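
The combined decision can be sketched as a single function, written as it would run at a point that sees the whole trace. This illustrates the logic only; it is not the OpenTelemetry Collector's implementation or configuration. The route names, the 10% default, and the 1000 ms slow threshold are assumptions.

```python
ROUTE_RATIOS = {"/healthz": 0.001}  # health checks at 0.1%
DEFAULT_RATIO = 0.10                # hypothetical rate for a high-volume service
SLOW_MS = 1000                      # hypothetical "slow request" threshold


def ratio_keep(trace_id_hex: str, ratio: float) -> bool:
    """Deterministic ratio decision: the same trace ID always yields the
    same answer, so every hop agrees without coordination."""
    low_64_bits = int(trace_id_hex[-16:], 16)
    return low_64_bits < ratio * 2**64


def keep_trace(trace_id_hex: str, route: str, has_error: bool, duration_ms: float) -> bool:
    """Keep every errored or slow trace; sample the rest by route."""
    if has_error or duration_ms >= SLOW_MS:
        return True
    return ratio_keep(trace_id_hex, ROUTE_RATIOS.get(route, DEFAULT_RATIO))


print(keep_trace("f" * 32, "/checkout", has_error=False, duration_ms=40))    # False
print(keep_trace("f" * 32, "/checkout", has_error=False, duration_ms=2500))  # True
print(keep_trace("f" * 32, "/healthz", has_error=True, duration_ms=5))       # True
```

The error and latency checks come first on purpose: the ratio only decides the fate of traces nobody would have looked at anyway.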

The OpenTelemetry Collector runs as a DaemonSet for the per-node collection layer and as a Deployment for the gateway tier. Gateway-tier collectors are where tail sampling and routing happen.

Layer 5: Events and runtime security

Two related but distinct streams:

Kubernetes events - the API server emits events when pods schedule, fail, get evicted, OOM, etc. Default lifetime is one hour. We deploy kubernetes-event-exporter to ship them to Loki. Now “why did this pod restart yesterday” is one query.

Runtime security events - what is actually happening inside containers at the syscall level. We deploy Falco for this. Detects things like a shell spawned inside a container, unexpected file access, suspicious network connections. Events stream to Falcosidekick which forwards to Loki, Slack, or PagerDuty.

A common mistake: teams deploy Falco then ignore the alerts because the default rules are noisy. Spend an hour tuning the rules and the signal-to-noise ratio is excellent. Untuned, you will tune Falco out of your alerting and lose runtime security visibility.

Layer 6: Network and eBPF

This is the layer most stacks skip and then desperately need when a production incident hits.

Cilium Hubble - if you run Cilium as your CNI (which we recommend for new clusters), Hubble is free. Real-time network flow visibility at L3, L4, and L7. “Which pod is talking to which pod, on what port, with what HTTP status code” answerable instantly.

Pixie - kernel-level observability via eBPF without sidecars or app changes. Auto-instruments HTTP, gRPC, DNS, Postgres, MySQL, Redis, Kafka, MongoDB at the kernel level. You get RED metrics (rate, errors, duration) per service automatically.

We do not deploy both on every cluster. Hubble for network-heavy workloads (microservices with east-west traffic) and Pixie when teams need automatic golden-signal instrumentation without touching code. There is some overlap; pick based on what your team actually needs.

We go deeper on this layer in our eBPF observability post.

Layer 7: Visualization with Grafana

Grafana is the universal frontend for everything above. Dashboards for metrics (Prometheus, Thanos), logs (Loki), traces (Tempo), and even K8s events through the Loki integration.

What we deploy:

  • Grafana with file-based provisioning through the Grafana Operator. Dashboards live in Git, not in someone’s browser.
  • Standard dashboards we bundle: cluster health, workload golden signals, ingress, node resource pressure, control plane health, certificate expiry, top-N spenders by namespace.
  • SLO dashboards for every user-facing service. Burn rate alerts on the SLO error budget.
  • Auth via OIDC integrated with corporate IdP. No shared Grafana logins.

Our Grafana consulting service covers Grafana setup in more detail.

Layer 8: Alerting

Alertmanager ships bundled with the Prometheus stack. Alert rules live in PrometheusRule CRDs; the routing tree lives in the Alertmanager configuration (or in AlertmanagerConfig CRDs).

Two principles we follow:

  1. Symptom-based alerts, not cause-based. Alert on “user latency exceeded SLO” not “CPU on pod X is high.” The CPU might be a cause, but the symptom alert is what your on-call needs.
  2. Multi-burn-rate alerting on SLOs. Fast burn (1-hour and 5-minute windows) for urgent issues, slow burn (6-hour and 30-minute windows) for non-page-worthy degradation.
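
The burn-rate arithmetic behind the second principle, as a sketch. Burn rate is the observed error ratio divided by the error budget (1 minus the SLO). The 2% and 5% budget fractions follow the commonly cited Google SRE Workbook values, and the 30-day SLO period is an assumption.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are arriving."""
    return error_ratio / (1.0 - slo)


def threshold(budget_fraction: float, window_hours: float,
              period_hours: float = 30 * 24) -> float:
    """Burn rate that consumes `budget_fraction` of the budget within `window_hours`."""
    return budget_fraction * period_hours / window_hours


def should_alert(long_window_errors: float, short_window_errors: float,
                 slo: float, limit: float) -> bool:
    """Fire only when both the long and the short window exceed the limit,
    so the alert clears quickly once the problem stops."""
    return (burn_rate(long_window_errors, slo) >= limit
            and burn_rate(short_window_errors, slo) >= limit)


fast = threshold(0.02, 1)  # 2% of the budget in 1 hour
slow = threshold(0.05, 6)  # 5% of the budget in 6 hours
print(round(fast, 1), round(slow, 1))  # 14.4 6.0

# 99.9% SLO, 2% errors over the long window but recovered in the short one.
print(should_alert(0.02, 0.0005, slo=0.999, limit=fast))  # False
```

The short window is what stops a resolved incident from paging for the rest of the hour.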

Route alerts to PagerDuty or OpsGenie. Use Alertmanager’s grouping to suppress alert storms when a single root cause triggers cascading symptoms.
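
Grouping is easier to reason about with a toy model. The sketch below buckets alerts by a chosen set of labels, which is the idea behind Alertmanager's group_by setting. It is not Alertmanager's implementation, and the alert names and labels are hypothetical.

```python
from collections import defaultdict


def group_alerts(alerts: list[dict], group_by: list[str]) -> dict[tuple, list[dict]]:
    """Bucket alerts that share the same values for the `group_by` labels.
    Each bucket would become one notification instead of many."""
    groups: dict[tuple, list[dict]] = defaultdict(list)
    for alert in alerts:
        key = tuple(alert["labels"].get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)


alerts = [
    {"labels": {"alertname": "HighLatency", "cluster": "prod-eu-west-1", "service": "api"}},
    {"labels": {"alertname": "HighLatency", "cluster": "prod-eu-west-1", "service": "checkout"}},
    {"labels": {"alertname": "HighLatency", "cluster": "prod-eu-west-1", "service": "search"}},
    {"labels": {"alertname": "CertExpiry", "cluster": "prod-eu-west-1", "service": "ingress"}},
]

grouped = group_alerts(alerts, ["cluster", "alertname"])
for key, members in grouped.items():
    print(key, len(members))
# ('prod-eu-west-1', 'HighLatency') 3
# ('prod-eu-west-1', 'CertExpiry') 1
```

Four firing alerts become two notifications. The choice of grouping labels decides whether a cascading failure reads as one incident or as a storm.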

The starter stack vs the enterprise stack

You do not need everything from day one. Here is what makes sense at each scale:

Starter stack (1-2 clusters, under 50 nodes)

  • kube-prometheus-stack (Prometheus + Grafana + Alertmanager + kube-state-metrics + node-exporter)
  • Loki + Alloy for logs
  • That is it

Skip Thanos until retention or multi-cluster querying actually matters. Skip Tempo until you have a real tracing use case. Skip Pixie until automatic instrumentation is the bottleneck.

This stack runs in about 4-6 GiB of memory total on a 50-node cluster. The cluster has more important things to do than monitor itself.

Production stack (3-10 clusters, 200+ nodes total)

Add to the starter:

  • Thanos or Mimir for long-term metrics
  • Tempo for traces
  • OpenTelemetry Operator for auto-instrumentation
  • kubernetes-event-exporter for K8s event archival
  • Falco for runtime security
  • Cilium Hubble if using Cilium CNI

This is where the Production Kubernetes Cluster Setup Observability add-on lands you ($1,495 add-on on top of the $2,995 core setup).

Enterprise stack (10+ clusters, multi-region)

Add:

  • Pixie for kernel-level visibility
  • Multi-region Thanos or Mimir with global query view
  • Dedicated observability namespace with strict RBAC isolation
  • Backup and DR for the observability stack itself (it is critical infrastructure)
  • Cost observability (Kubecost or OpenCost) to track per-namespace observability spend

Common stack anti-patterns we keep fixing

Things we see in audits regularly:

  1. Prometheus retention set to 30+ days, then complaints about Prometheus crashing. Short local retention, long-term in object storage. Stop treating Prometheus as a database.
  2. No label cardinality limits in Loki. One bad service ships pod names as labels, blows up the chunk index, makes queries slow for everyone.
  3. OpenTelemetry instrumentation deployed but no sampling configured. Trace storage cost balloons. The team turns off tracing entirely. Configure sampling on day one.
  4. Grafana auth set to admin/admin. We have seen this in production. Configure OIDC before any dashboards exist.
  5. Falco deployed and ignored. Tune the rules or do not deploy it - silent runtime security is worse than no runtime security.
  6. Alertmanager pointing at a single Slack channel for everything. Route alerts. Group them. Suppress storms. Your on-call needs signal, not noise.
  7. No dashboards in Git. Dashboards exported via Grafana UI, lost when the Grafana pod restarts. Use GitOps with the Grafana Operator.

These show up in the Observability category of our 47-point Kubernetes production readiness checklist. If you want to self-assess, run the checklist; if you want a senior engineer to do it, the Kubernetes Production Readiness Audit is the $495 fixed-price version.

What we do NOT deploy

A few tools from CNCF landscape posts that we have either retired or skip:

  • EFK stack (Elasticsearch + Fluentd + Kibana): too expensive at scale, Loki replaces it for almost all K8s log use cases.
  • Promtail (standalone): deprecated in favor of Alloy. If you are still on Promtail, plan a migration.
  • Jaeger (standalone): still works fine, but Tempo integrates better with the Grafana stack we standardize on.
  • Cortex: superseded by Mimir.
  • Prometheus alerts via webhooks to email: low-signal, gets ignored. Real alerting goes through Alertmanager to PagerDuty or OpsGenie.

How this fits with your existing tools

You almost certainly already have some observability. The migration question is usually: “we already have Datadog or New Relic, why would we self-host this?”

The honest answer:

  • Cost: Self-hosted Prometheus + Loki + Tempo runs about 1/4 to 1/10 of equivalent Datadog spend at scale. The break-even is usually 200+ nodes or 5+ engineers' worth of observability investigation time.
  • Data ownership: Telemetry data stays in your cloud account. For data residency-sensitive workloads (UK GDPR, KSA PDPL, UAE Federal Decree 45), this is often the deciding factor.
  • Vendor lock-in: The Grafana + Prometheus + Loki + Tempo stack is fully open source. You can switch backends without rewriting dashboards or alerts.

When we recommend staying with managed (Datadog, New Relic, Honeycomb): teams with under 50 engineers who cannot afford a dedicated platform engineering function, organizations where vendor support contracts are required, and clients who genuinely need features the open-source stack does not cover (specialized RUM, certain anomaly detection products).


Want help building this stack?

If you want to deploy this stack on your cluster without the operational learning curve, we offer two productized engagements:

  • Kubernetes Production Readiness Audit ($495, 2 weeks): scores your current observability posture against the Observability category of our 47-point checklist. Identifies what is missing, what is misconfigured, and what to fix in what order.
  • Production Kubernetes Cluster Setup ($2,995 core, $1,495 Observability add-on): production-ready EKS, AKS, or GKE cluster with the full observability stack deployed.

If you are running observability at scale and need ongoing operational support, our Kubernetes observability service covers the long-term operations.

Already have the stack mostly in place and want a self-assessment? The 47-point Kubernetes Production Readiness Checklist is published in full and ungated.

Published on May 28, 2026
