eBPF Observability on K8s: We Replaced Sidecars

The first time we replaced a service mesh sidecar with eBPF instrumentation, cluster-wide CPU usage dropped 18% overnight. No application code changed. The team thought something was broken until we showed them the dashboards.

This is the post we wish existed when we started moving Kubernetes observability off sidecars and onto eBPF in 2024. It covers what eBPF actually gives you, where it does not replace traditional instrumentation, the production gotchas, and the three tools that matter: Cilium Hubble, Pixie, and Parca.

What eBPF is, in 90 seconds

eBPF (extended Berkeley Packet Filter) runs sandboxed programs in the Linux kernel. Originally for network packet filtering, it now powers networking (Cilium replacing kube-proxy), security (Falco, Tetragon), and observability (Pixie, Hubble, Parca).

The point for Kubernetes observability: eBPF programs see every syscall, every network packet, every CPU sample - across every pod on the node - without modifying applications or adding sidecars.

That last part is the unlock. Traditional Kubernetes observability requires one of two things:

App-level instrumentation: rewrite your code with OpenTelemetry, Prometheus client libraries, etc.
Sidecars: inject Envoy or similar proxies next to every pod to capture traffic.

Both work. Both cost something. App instrumentation requires engineering time and code ownership. Sidecars consume CPU and memory, and they add a proxy hop to every request. At scale (200+ pods), sidecars become a meaningful portion of your cluster’s spend.

eBPF replaces these two patterns for many observability use cases. Not all - we will get to the limits.

The 60-second verdict

Use case	Right tool
HTTP/gRPC golden signals (rate, errors, latency) without app changes	Pixie
Pod-to-pod network flow visibility (L3, L4, L7)	Cilium Hubble
Continuous CPU profiling across the cluster	Parca
Service mesh features beyond observability (mTLS, retries, traffic shifting)	Sidecar-based mesh (Istio sidecar, Linkerd)
Application-level business metrics (orders processed, queue depth)	Prometheus client library in app
Distributed trace context propagation	OpenTelemetry SDK in app

eBPF is excellent at what the kernel can see. It is not magic for things only the application knows.

Why we keep replacing sidecars with eBPF

Concrete numbers from real migrations:

Service mesh sidecar removal (Istio with Envoy, then ambient mode + eBPF)

Metric	With Envoy sidecars	With ambient + eBPF
Per-pod overhead	80-120 MiB memory, ~50-100 millicores CPU	<10 MiB memory, <10 millicores CPU
Per-pod proxy hops per request	2 (app → sidecar → network → sidecar → app)	0 (no per-pod proxy; a shared node-level ztunnel handles L4)
P99 added latency per hop	2-5ms	<0.5ms
Cluster spend for 500 pods	~$3,200/month sidecar overhead	~$300/month eBPF overhead

That is not an exotic edge case. That is what happens when you replace 500 Envoy sidecars with Istio ambient mode and eBPF-based observability.
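The per-pod figures in the table scale linearly with pod count, so the cluster-wide overhead is plain multiplication. A minimal Python sketch using the table's ranges for 500 pods follows. The function name is ours, the "<10" entries are treated as an upper bound of 10, and dollar figures are left out because they depend on your instance pricing.

```python
def cluster_overhead(pods, mem_mib_per_pod, cpu_millicores_per_pod):
    """Return (memory in GiB, CPU in cores) for a given per-pod overhead."""
    mem_gib = pods * mem_mib_per_pod / 1024
    cpu_cores = pods * cpu_millicores_per_pod / 1000
    return mem_gib, cpu_cores

PODS = 500

# Ranges from the table: 80-120 MiB and 50-100 millicores per Envoy sidecar.
sidecar_low = cluster_overhead(PODS, 80, 50)
sidecar_high = cluster_overhead(PODS, 120, 100)

# Upper bound for the eBPF column: under 10 MiB and under 10 millicores per pod.
ebpf_upper = cluster_overhead(PODS, 10, 10)

print(f"sidecars: {sidecar_low[0]:.1f}-{sidecar_high[0]:.1f} GiB, "
      f"{sidecar_low[1]:.0f}-{sidecar_high[1]:.0f} cores")
print(f"eBPF:     under {ebpf_upper[0]:.1f} GiB, under {ebpf_upper[1]:.0f} cores")
```

For 500 pods that works out to roughly 39-59 GiB of memory and 25-50 cores held by sidecars, against under 5 GiB and under 5 cores for the eBPF column.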

Pixie auto-instrumentation versus manual OpenTelemetry rollout:

Activity	Manual OTel	Pixie via eBPF
Engineer-hours to instrument 30 services	60-120 hours	0 hours
Time to first HTTP latency histogram	2-6 weeks	<30 minutes
Code changes required	Yes (per service)	None
Trade-off	Full custom span control	Generic golden signals only

The trade-off is real. Pixie gives you HTTP/gRPC golden signals and protocol metrics automatically. It does not give you custom span attributes or business-context tracing. For that you still need OpenTelemetry. Most teams need both: eBPF for the baseline, OTel for the parts your code knows.

Tool 1: Cilium Hubble for network observability

Cilium is an eBPF-based CNI (Container Network Interface) plugin. It can replace kube-proxy and the default Linux iptables-based networking with eBPF programs that route traffic in the kernel.

Hubble is Cilium’s observability layer. If you run Cilium as your CNI, Hubble is “free” - already on every node, capturing every flow.

What Hubble gives you:

Real-time flow visibility: every flow between every pod, with source pod name, destination pod name, protocol, and port, plus L7 details (HTTP status code, gRPC method) where L7 visibility is enabled.
DNS observability: which pods resolved which hostnames, including DNS resolution failures.
NetworkPolicy enforcement visibility: which traffic was allowed, which was dropped, and which policy caused the drop.
Service map: auto-generated graph of pod-to-pod and service-to-service communication.
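To make the flow data concrete, here is a minimal Python sketch that counts dropped flows by source, destination, and reason. It assumes newline-delimited JSON shaped like the output of `hubble observe -o json`; the sample records and pod names are hypothetical, and field names such as `verdict`, `pod_name`, and `drop_reason_desc` should be checked against the Hubble version you run.

```python
import json
from collections import Counter

# Hypothetical sample lines, shaped like `hubble observe -o json` output.
SAMPLE = """\
{"flow": {"verdict": "FORWARDED", "source": {"namespace": "shop", "pod_name": "cart-1"}, "destination": {"namespace": "shop", "pod_name": "db-0"}}}
{"flow": {"verdict": "DROPPED", "drop_reason_desc": "POLICY_DENIED", "source": {"namespace": "shop", "pod_name": "web-1"}, "destination": {"namespace": "shop", "pod_name": "db-0"}}}
{"flow": {"verdict": "DROPPED", "drop_reason_desc": "POLICY_DENIED", "source": {"namespace": "shop", "pod_name": "web-1"}, "destination": {"namespace": "shop", "pod_name": "db-0"}}}
"""

def endpoint(ep):
    """Format an endpoint as namespace/pod, tolerating missing fields."""
    return f"{ep.get('namespace', '?')}/{ep.get('pod_name', '?')}"

def dropped_flows(lines):
    """Count dropped flows keyed by (source, destination, reason)."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        flow = record.get("flow", record)  # accept wrapped or bare flow objects
        if flow.get("verdict") != "DROPPED":
            continue
        key = (
            endpoint(flow.get("source", {})),
            endpoint(flow.get("destination", {})),
            flow.get("drop_reason_desc", "UNKNOWN"),
        )
        counts[key] += 1
    return counts

drops = dropped_flows(SAMPLE.splitlines())
for (src, dst, reason), n in drops.most_common():
    print(f"{n:4d}  {src} -> {dst}  ({reason})")
```

In practice the same function could read `sys.stdin` with the CLI output piped in. The grouping key is the part worth adapting, for example by namespace only.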

When we deploy Cilium + Hubble:

New clusters on EKS, AKS, or GKE where we can choose the CNI. Hubble is the default observability for any new cluster we build via our Production Kubernetes Cluster Setup.
Clusters with significant east-west traffic where the existing CNI (Amazon VPC CNI, Azure CNI, or the GKE default) leaves you blind to pod-to-pod flows.
Anywhere NetworkPolicies are enforced and someone needs to debug “why is this connection failing.”

When we do not:

Existing production clusters where the CNI is deeply embedded and migration risk is high. CNI changes are not free.
Clusters where the cloud provider’s native CNI is a hard requirement (some compliance frameworks specifically reference VPC-native networking).

Hubble UI is decent for ad-hoc exploration. For dashboarding we export Hubble metrics to Prometheus and visualize in Grafana.
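Once Hubble metrics are in Prometheus, a dashboard panel is just a PromQL query against the Prometheus HTTP API. The sketch below builds an instant-query URL and reads a response. The Prometheus address is hypothetical, and the query assumes Hubble's drop metric is enabled so that `hubble_drop_total` is exported with a `reason` label; verify the metric names in your Cilium version.

```python
from urllib.parse import urlencode

# Assumptions: hypothetical in-cluster Prometheus address; Hubble's "drop"
# metric enabled so that hubble_drop_total carries a "reason" label.
PROMETHEUS_URL = "http://prometheus.monitoring.svc:9090"
DROP_RATE_QUERY = "sum by (reason) (rate(hubble_drop_total[5m]))"

def instant_query_url(base_url, promql):
    """Build a URL for the Prometheus instant-query endpoint (/api/v1/query)."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"

def drops_by_reason(response):
    """Map drop reason -> drops per second from an instant-query response."""
    results = response["data"]["result"]
    return {r["metric"].get("reason", "unknown"): float(r["value"][1]) for r in results}

# Hand-written response in the instant-vector shape the API returns.
example_response = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"reason": "Policy denied"}, "value": [1700000000, "1.5"]},
        ],
    },
}

print(instant_query_url(PROMETHEUS_URL, DROP_RATE_QUERY))
print(drops_by_reason(example_response))
```

Grafana runs the same kind of query for you; the sketch is mainly useful for alert scripts or for a quick check that the metrics are flowing.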

Tool 2: Pixie for golden signals without code changes

Pixie (donated to CNCF by New Relic) is the auto-instrumentation tool we deploy when teams want HTTP, gRPC, and database protocol metrics without touching application code.

How it works: an eBPF program attaches to syscalls and network functions in the kernel. It parses HTTP requests, gRPC calls, Postgres queries, MySQL queries, Redis commands, Kafka traffic, DNS, MongoDB - all without the application knowing.

What Pixie gives you, automatically:

HTTP RED metrics (rate, errors, duration) per service, per route, per HTTP status code
gRPC method-level metrics for any service using gRPC
Database query observability: top queries by frequency, top slow queries, error rates
Service map with edge-level golden signals
Continuous CPU profiles per service, queryable on demand
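For readers new to RED metrics, the following Python sketch shows the arithmetic behind rate, error ratio, and p99 duration per service. The request records are hypothetical stand-ins for what a tool like Pixie reconstructs from traffic, and the nearest-rank percentile is one simple choice of method, not necessarily the one any particular tool uses.

```python
import math
from collections import defaultdict

# Hypothetical request records: (service, HTTP status, latency in ms).
REQUESTS = [
    ("checkout", 200, 12.0),
    ("checkout", 200, 15.0),
    ("checkout", 500, 240.0),
    ("checkout", 200, 18.0),
    ("cart", 200, 5.0),
    ("cart", 404, 4.0),
]
WINDOW_SECONDS = 60

def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]

def red_metrics(requests, window_seconds):
    """Compute rate, error ratio, and p99 duration per service."""
    by_service = defaultdict(list)
    for service, status, latency_ms in requests:
        by_service[service].append((status, latency_ms))
    report = {}
    for service, rows in by_service.items():
        latencies = sorted(latency for _, latency in rows)
        errors = sum(1 for status, _ in rows if status >= 500)  # count 5xx only
        report[service] = {
            "rate_rps": len(rows) / window_seconds,
            "error_ratio": errors / len(rows),
            "p99_ms": percentile(latencies, 99),
        }
    return report

report = red_metrics(REQUESTS, WINDOW_SECONDS)
for service, metrics in report.items():
    print(service, metrics)
```

Counting only 5xx responses as errors is a deliberate choice here: a 404 is usually the client's problem, not the service's.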

What we configure:

Long-term storage: Pixie’s default is in-cluster, last 24 hours. We export to Prometheus and Loki for longer retention.
Data egress policy: by default Pixie sends some telemetry to its cloud. Self-hosted mode keeps everything local. For regulated workloads (UK GDPR, KSA PDPL), self-hosted is the only acceptable option.

The thing nobody warns you about: in our deployments, Pixie consumes 5-15% of node CPU on busy nodes. eBPF is not free - its probes run on every traced syscall and packet. We have seen Pixie deployments slow down latency-sensitive workloads. Always test on a staging cluster before production.

The right pattern: Pixie on the cluster, but with selective namespace targeting. Apply it to your top 10-20 critical services, not to every namespace by default.

Tool 3: Parca for continuous profiling

Parca gives you continuous CPU profiles across the entire cluster, with eBPF. No language-specific agent, no profiler library in your app. The eBPF program samples stack traces from the kernel.
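A sampling profiler turns many captured stack traces into a profile by counting. The Python sketch below shows that aggregation step only: it folds identical stacks together and computes each function's share of on-CPU samples. The stack samples are hypothetical and hard-coded; a real agent captures them in the kernel at a fixed frequency.

```python
from collections import Counter

# Hypothetical stack samples, outermost frame first.
SAMPLES = [
    ("main", "handle_request", "parse_json"),
    ("main", "handle_request", "parse_json"),
    ("main", "handle_request", "query_db"),
    ("main", "gc"),
]

def fold(samples):
    """Collapse identical stacks into 'a;b;c' -> sample count."""
    return Counter(";".join(stack) for stack in samples)

def self_cpu_share(samples):
    """Share of samples in which each function was the leaf (on-CPU) frame."""
    leaves = Counter(stack[-1] for stack in samples)
    total = len(samples)
    return {func: count / total for func, count in leaves.items()}

folded = fold(SAMPLES)
shares = self_cpu_share(SAMPLES)

for stack, count in folded.most_common():
    print(f"{stack} {count}")
for func, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{func}: {share:.0%}")
```

The folded `a;b;c count` lines are the usual input format for flame graph tooling, which is why profiles from different languages can be rendered the same way.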

Why this matters: traditional profiling is reactive. Something is slow, you turn on the profiler, you reproduce the slowness, you look at the profile. By the time you do all that, the incident is over.

Parca is always-on. Every pod has continuous CPU sampling. When something gets slow, you go back in time and look at what was burning CPU during the incident. No reproduction needed.

When we deploy Parca:

Latency-sensitive services where unexplained CPU spikes are a recurring pattern
Memory leak hunts (Parca can also store heap profiles scraped from pprof endpoints, for languages that expose them, such as Go)
Cost optimization investigations - which workloads actually use the CPU they request

When we skip:

Workloads where CPU is never the bottleneck (most stateless web tiers, where I/O dominates)
Tiny clusters where one engineer can manually profile when needed

Parca writes profiles to object storage. The data is searchable by service, time range, and stack trace pattern.

Where eBPF does not replace traditional instrumentation

The honest gaps:

Custom application metrics: eBPF cannot see “number of orders processed” because that is a business concept the kernel does not understand. Prometheus client library in the app is the right tool.

Distributed trace context propagation: traces span multiple services. The kernel sees individual requests, not the full causal chain. You need OpenTelemetry SDK in the application to propagate trace IDs through headers.

TLS payload inspection: TLS encryption and decryption typically happen in user space, so kernel-level probes see only ciphertext. eBPF sees connection establishment and metadata but not the HTTP body inside TLS. Some tools work around this in specific setups (Cilium with TLS interception, Pixie with uprobes on TLS library calls such as OpenSSL’s SSL_read and SSL_write), but this adds complexity and has security implications.

Application-defined error categorization: Pixie sees an HTTP 500 but does not know if it is “expected upstream failure” or “data corruption.” The app has to label that.

Browser / client-side observability: eBPF is server-side. Real User Monitoring still needs client SDKs.

We use eBPF and OTel together. eBPF for the automatic baseline; OTel for the parts that need application context.
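The trace-context gap is easiest to see in code. Propagation means the application copies the trace ID from the incoming request onto every outgoing call, which only the application can do. The Python sketch below hand-rolls the W3C `traceparent` header to show the mechanics; in a real service the OpenTelemetry SDK does this for you, and the helper names here are ours.

```python
import re
import secrets

# W3C traceparent: version "00", 32-hex trace ID, 16-hex parent span ID, 2-hex flags.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def new_traceparent(sampled=True):
    """Start a new trace with random trace and span IDs."""
    trace_id = secrets.token_hex(16)
    span_id = secrets.token_hex(8)
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"

def child_traceparent(incoming):
    """Keep the incoming trace ID and mint a new span ID for the outgoing call."""
    match = TRACEPARENT_RE.match(incoming or "")
    # All-zero IDs are invalid, so treat them like a missing header.
    if not match or set(match.group(1)) == {"0"} or set(match.group(2)) == {"0"}:
        return new_traceparent()
    trace_id, _parent_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"

def outgoing_headers(incoming_headers):
    """Headers to attach to a downstream request (expects lowercase keys)."""
    return {"traceparent": child_traceparent(incoming_headers.get("traceparent"))}

incoming = {"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}
print(outgoing_headers(incoming))
```

The kernel sees both the incoming and the outgoing request, but nothing at that level says the second was caused by the first. That link exists only because the application carried the trace ID across.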

Production gotchas we have hit

Kernel version requirements: eBPF features are kernel-version-dependent. Cilium requires Linux 5.4+ for full feature parity. Pixie requires 4.14+ but works better on 5.x. AWS Bottlerocket and most modern node AMIs are fine; older Amazon Linux 2 AMIs may not be. Minimums move between releases, so check each tool’s current system requirements before upgrading.

Node CPU overhead: as mentioned, eBPF programs run on every relevant kernel event. On busy nodes, this is a measurable cost. We always benchmark on staging before rolling out to production.

Cluster autoscaling interaction: Cilium has a known scale-out lag when new nodes join the cluster. The first few seconds of a new node’s pods may see transient connectivity issues. We have a runbook for this; if you are running Karpenter with aggressive scale-out, plan for it.

Pixie’s data egress: by default, Pixie sends some metadata to its cloud (a legacy of its origins as a hosted product). Self-hosted mode is the safe default for regulated industries. Pin this in your deployment.

eBPF + LSM hooks security implications: eBPF programs run in the kernel. They are sandboxed (the eBPF verifier prevents most disasters), but a privileged eBPF deployment is effectively a node-level capability. Restrict who can deploy eBPF programs with strict RBAC and admission controls.

Multi-architecture support: eBPF works on AMD64 and ARM64. Most tools support both. We verify ARM64 compatibility before deploying to Graviton or Ampere-based nodes.
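The kernel-version and architecture gotchas lend themselves to a preflight check you can run on a node before installing anything. This is a minimal Python sketch: the minimum versions are the ones quoted in the gotchas above, the function names are ours, and the numbers should be confirmed against each tool's current documentation.

```python
import platform
import re

# Minimums taken from the kernel-version gotcha; confirm against current docs.
MIN_KERNEL = {"cilium": (5, 4), "pixie": (4, 14)}
SUPPORTED_ARCHS = {"x86_64", "amd64", "aarch64", "arm64"}

def parse_kernel(release):
    """Extract (major, minor) from a release string like '5.10.230-223.885.amzn2.x86_64'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if not match:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))

def preflight(release, machine):
    """Return a list of problems; an empty list means no known blocker."""
    problems = []
    kernel = parse_kernel(release)
    for tool, minimum in MIN_KERNEL.items():
        if kernel < minimum:  # tuple comparison: (4, 14) < (5, 4)
            problems.append(
                f"{tool}: kernel {kernel[0]}.{kernel[1]} is below {minimum[0]}.{minimum[1]}"
            )
    if machine.lower() not in SUPPORTED_ARCHS:
        problems.append(f"unsupported architecture: {machine}")
    return problems

if __name__ == "__main__":
    # Check the node this script runs on.
    print(preflight(platform.release(), platform.machine()) or "no known blockers")
```

Passing this check does not mean every feature will work, since individual features can need newer kernels than the base requirement. It only catches the failures that are cheapest to find early.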

The architecture pattern that works

For most production clusters we build, the layered observability stack with eBPF looks like:

Layer 1 (Network): Cilium CNI with Hubble
Layer 2 (Workload): Pixie for automatic HTTP/gRPC/DB metrics
Layer 3 (CPU): Parca for continuous profiling
Layer 4 (Custom signals): OpenTelemetry SDK in apps for business metrics + traces
Layer 5 (Logs): Grafana Alloy collecting container logs to Loki
Layer 6 (Aggregation): Prometheus Operator scraping eBPF tool metrics + custom metrics
Layer 7 (Visualization): Grafana correlating across all signals
Layer 8 (Storage): Thanos or Mimir for long-term metrics, S3 for traces and logs

eBPF is layers 1-3. Application-level instrumentation handles layer 4. Everything else is the standard observability stack.

This is the pattern we documented in our Kubernetes observability stack post and what we deploy via our Kubernetes observability service.

What to deploy first if you are starting from zero

If you have no eBPF observability today and want to start, the order:

Pixie first. Single Helm install, immediate value, no CNI changes. If Pixie does not work in your environment (kernel version, resource budget), you discover it before you commit further.
Cilium second if you have CNI flexibility. Hubble comes free with it. This is a heavier change because it replaces your CNI - test extensively in staging.
Parca third when you have a specific CPU-profiling use case. It is high-value but more niche than the first two.

Skip Pixie entirely if you cannot tolerate the per-node CPU cost or you have strict data residency requirements that self-hosted mode does not satisfy (rare, but real).

Vendor and managed options

For teams that want eBPF observability without operating the stack:

Grafana Cloud + Grafana Beyla: Grafana’s eBPF-based auto-instrumentation. Lighter touch than Pixie, integrates natively with Grafana Cloud.
Datadog Universal Service Monitoring (USM): Datadog’s eBPF agent. Works if you are already a Datadog customer.
New Relic OpenTelemetry collector with eBPF: New Relic’s eBPF instrumentation.
Isovalent Enterprise for Cilium: commercial Cilium with enterprise support.

We typically deploy and operate the open-source stack for clients who want to own their telemetry data. For organizations with mature vendor relationships and no platform engineering capacity, the managed options are reasonable.

What we tell clients in audits

We see a lot of K8s clusters during the Production Readiness Audit. Three patterns:

No eBPF anywhere, full sidecar mesh: highest opportunity for cost reduction. Path to recommend: migrate to a sidecarless mesh such as Istio ambient mode over 2-3 sprints. Result: 70-80% reduction in mesh overhead.
eBPF deployed but unconfigured: Pixie running in default mode, ingesting more data than the team will ever use, costing CPU. Path: namespace-target the busy services, drop the rest.
eBPF in production but not in observability stack: Cilium running for networking, but Hubble metrics not flowing to Prometheus. Easy fix, big visibility win.

Most clusters fall into pattern #1 or #3. Both are worth the change.

Want help deploying eBPF observability?

If you are running Kubernetes and want eBPF observability done right - the right tools for your workload, sized correctly, integrated with the rest of your monitoring stack - that is exactly what we deploy.

Kubernetes Observability service: our dedicated K8s-native observability practice. eBPF deployment, Prometheus Operator integration, Loki and Tempo, all production-grade.
Production Kubernetes Cluster Setup ($2,995): new EKS, AKS, or GKE cluster with the Observability add-on ($1,495) deploys Pixie, Cilium Hubble, and the full Prometheus stack from day one.
Kubernetes Production Readiness Audit ($495): if you have observability but are not sure if it is working, the audit’s Observability category surfaces what is missing, what is over-deployed, and what to fix in what order.

Free self-assessment? The 47-point Kubernetes Production Readiness Checklist includes the Observability category in full.



Tasrie IT Support
