Free resource · Ungated

Kubernetes Production Readiness Checklist

47 production-readiness checks across six categories: the exact checklist a senior engineer would run against your production EKS, AKS, or GKE cluster, published in full with no email gate or sign-up wall.

In 47 checks, you'll have

Security, Reliability, Scalability, Observability, Cost, Compliance

Every check links to a specific configuration or tool

Free for unlimited personal and internal team use

Same methodology as our paid $495 audit

How to use this checklist

Work through each section against one production cluster. For every item, mark it Pass, Partial, Fail, or N/A. Capture a one-line finding for any Partial or Fail item with a specific recommendation and an effort estimate.

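Once all 47 items are scored, group findings by category and score each category 0-100 as the percentage of applicable items that passed. Decide up front how a Partial counts and leave N/A items out of the denominator, so scores stay comparable between clusters. This gives you an executive scorecard that non-technical stakeholders can interpret.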
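As a rough illustration of the category scoring, here is a minimal Python sketch. The weighting is an assumption, not part of the checklist: the text only says scores reflect how many items passed, so counting a Partial as half credit and excluding N/A items are illustrative choices, and `category_scores` is a hypothetical helper name.

```python
from collections import defaultdict

# Assumed weights: Partial as half credit is an illustrative choice,
# not something the checklist prescribes.
WEIGHTS = {"Pass": 1.0, "Partial": 0.5, "Fail": 0.0}


def category_scores(results):
    """results: iterable of (category, status) pairs, one per checklist item."""
    earned = defaultdict(float)
    applicable = defaultdict(int)
    for category, status in results:
        if status == "N/A":
            continue  # not applicable: excluded from the denominator
        earned[category] += WEIGHTS[status]
        applicable[category] += 1
    return {
        category: round(100 * earned[category] / applicable[category])
        for category in applicable
    }


sample = [
    ("Security", "Pass"),
    ("Security", "Pass"),
    ("Security", "Pass"),
    ("Security", "Partial"),
    ("Security", "Fail"),
    ("Cost", "Pass"),
    ("Cost", "N/A"),
]
scores = category_scores(sample)
print(scores)  # {'Security': 70, 'Cost': 100}
```

Whatever weighting you pick, keep it fixed across audits so that scores from different clusters or quarters remain comparable.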

Prioritize remediation by blast radius (what breaks if this fails) and likelihood (how probable the failure is). High blast radius and high likelihood = critical, fix first. Build a 30-60-90-day remediation roadmap from there.
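One way to turn blast radius and likelihood into a 30-60-90-day roadmap is sketched below. The 1-3 scales, the multiplication, and the thresholds of 6 and 3 are all assumptions made for illustration; the checklist names the two dimensions but does not prescribe a numeric scheme. `Finding` and `roadmap` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    check: str
    blast_radius: int  # assumed scale: 1 = one workload, 2 = a namespace, 3 = whole cluster
    likelihood: int    # assumed scale: 1 = unlikely, 2 = plausible, 3 = probable

    @property
    def risk(self):
        return self.blast_radius * self.likelihood


def roadmap(findings):
    """Bucket findings into 30-, 60-, and 90-day windows, highest risk first."""
    plan = {30: [], 60: [], 90: []}
    for finding in sorted(findings, key=lambda f: f.risk, reverse=True):
        if finding.risk >= 6:      # assumed threshold for "critical, fix first"
            window = 30
        elif finding.risk >= 3:    # assumed threshold for the middle bucket
            window = 60
        else:
            window = 90
        plan[window].append(finding.check)
    return plan


plan = roadmap([
    Finding("PodDisruptionBudgets", blast_radius=2, likelihood=2),
    Finding("RBAC scope", blast_radius=3, likelihood=3),
    Finding("Idle resource detection", blast_radius=1, likelihood=1),
    Finding("etcd backup and restore", blast_radius=3, likelihood=2),
])
print(plan)
```

In this sample the two cluster-wide findings land in the 30-day window, the namespace-level finding in the 60-day window, and the low-risk cost item in the 90-day window.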

The 47 checks

Six categories. Every check linked to a specific configuration, tool, or operational practice.

Security

10 checks

RBAC, Pod Security, NetworkPolicies, secrets, image provenance, CIS benchmark.

1. RBAC scope

No cluster-admin bound to ServiceAccounts; least-privilege Roles per namespace.

2. Pod Security Admission

Enforced at namespace level; restricted profile for production workloads.

3. NetworkPolicies

Default-deny baseline with explicit ingress and egress allows per workload.

4. Secret encryption at rest

Envelope encryption backed by the cloud provider's KMS, not just base64-encoded etcd entries.

5. Image provenance

Private registry plus signature verification with cosign (part of Sigstore).

6. ServiceAccount token hygiene

Tokens not auto-mounted (automountServiceAccountToken: false) unless the workload actually needs API access.

7. Container security context

runAsNonRoot and readOnlyRootFilesystem set to true; ALL capabilities dropped by default.

8. CIS Kubernetes Benchmark

kube-bench score reviewed; failing controls prioritized in remediation.

9. API server audit logging

Enabled and routed to a SIEM or long-term log store, not just stdout.

10. Runtime vulnerability scanning

Trivy or Grype scanning the images of running workloads, with severity-based gates.
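Two of the security checks above (1 and 7) can be spot-checked from exported manifests. The sketch below assumes the input is already parsed JSON, for example the `items` of `kubectl get clusterrolebindings -o json` and a single Pod object. The helper names are hypothetical, and the RBAC function only looks at ClusterRoleBindings, so a RoleBinding that references cluster-admin would not be reported.

```python
def cluster_admin_service_accounts(bindings):
    """Return namespace/name of ServiceAccounts bound to cluster-admin (check 1)."""
    hits = []
    for binding in bindings:
        if (binding.get("roleRef") or {}).get("name") != "cluster-admin":
            continue
        for subject in binding.get("subjects") or []:
            if subject.get("kind") == "ServiceAccount":
                hits.append(f'{subject.get("namespace")}/{subject.get("name")}')
    return hits


def security_context_gaps(pod):
    """Return (container, problem) pairs for check 7 on one Pod manifest."""
    spec = pod.get("spec") or {}
    pod_non_root = (spec.get("securityContext") or {}).get("runAsNonRoot")
    gaps = []
    for container in spec.get("containers") or []:
        ctx = container.get("securityContext") or {}
        name = container.get("name")
        # A container-level runAsNonRoot takes precedence over the pod-level value.
        if ctx.get("runAsNonRoot", pod_non_root) is not True:
            gaps.append((name, "runAsNonRoot not set"))
        if ctx.get("readOnlyRootFilesystem") is not True:
            gaps.append((name, "readOnlyRootFilesystem not set"))
        if "ALL" not in ((ctx.get("capabilities") or {}).get("drop") or []):
            gaps.append((name, "capabilities.drop lacks ALL"))
    return gaps


bindings = [
    {"roleRef": {"kind": "ClusterRole", "name": "cluster-admin"},
     "subjects": [{"kind": "ServiceAccount", "name": "ci-bot", "namespace": "ci"}]},
    {"roleRef": {"kind": "ClusterRole", "name": "cluster-admin"},
     "subjects": [{"kind": "Group", "name": "ops"}]},
]
pod = {"spec": {
    "securityContext": {"runAsNonRoot": True},
    "containers": [
        {"name": "app", "securityContext": {
            "readOnlyRootFilesystem": True, "capabilities": {"drop": ["ALL"]}}},
        {"name": "sidecar"},
    ],
}}
print(cluster_admin_service_accounts(bindings))  # ['ci/ci-bot']
print(security_context_gaps(pod))
```

In the sample, the `sidecar` container inherits runAsNonRoot from the pod but is flagged for the two settings that can only be set per container.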

Reliability

8 checks

PDBs, anti-affinity, probes, resource limits, graceful shutdown, etcd backups.

11. PodDisruptionBudgets

Present on every stateful and critical-path workload to survive drains.

12. Multi-zone distribution

Topology spread constraints or anti-affinity rules across AZs.

13. Autoscaling configured

HPA or KEDA with realistic min and max, targeting workload-appropriate metrics.

14. Resource requests and limits

Set on every container; no unbounded workloads competing for node resources.

15. Probes tuned correctly

Liveness, readiness, and startup probes calibrated to avoid restart stampedes.

16. Graceful shutdown

terminationGracePeriodSeconds plus preStop hooks where needed for clean exits.

17. Image pull policy

No imagePullPolicy: IfNotPresent combined with mutable tags (such as latest) in production; immutable tags or digests preferred.

18. etcd backup and restore

Automated backups plus a restore procedure tested within the last 90 days. On managed control planes (EKS, AKS, GKE) the provider operates etcd, so this check applies to cluster-state backups, for example with Velero.
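Checks 11 and 14 lend themselves to a quick scripted pass over exported Deployments or StatefulSets. The following is a sketch under stated assumptions: inputs are parsed JSON objects, the helper names are hypothetical, and PDB matching is simplified to `selector.matchLabels` only (matchExpressions are ignored).

```python
def missing_resources(workload):
    """List containers lacking CPU or memory requests or limits (check 14)."""
    missing = []
    for container in workload["spec"]["template"]["spec"]["containers"]:
        resources = container.get("resources") or {}
        for kind in ("requests", "limits"):
            declared = resources.get(kind) or {}
            for resource in ("cpu", "memory"):
                if resource not in declared:
                    missing.append(f'{container["name"]}: {kind}.{resource}')
    return missing


def covered_by_pdb(workload, pdbs):
    """True if a PDB in the same namespace selects the pod template (check 11).

    Simplification: only selector.matchLabels is evaluated.
    """
    labels = workload["spec"]["template"]["metadata"].get("labels") or {}
    namespace = workload["metadata"].get("namespace", "default")
    for pdb in pdbs:
        if pdb["metadata"].get("namespace", "default") != namespace:
            continue
        wanted = (pdb["spec"].get("selector") or {}).get("matchLabels") or {}
        if wanted and all(labels.get(k) == v for k, v in wanted.items()):
            return True
    return False


deployment = {
    "metadata": {"name": "api", "namespace": "prod"},
    "spec": {"template": {
        "metadata": {"labels": {"app": "api"}},
        "spec": {"containers": [{
            "name": "api",
            "resources": {
                "requests": {"cpu": "250m", "memory": "256Mi"},
                "limits": {"memory": "512Mi"},
            },
        }]},
    }},
}
pdbs = [{
    "metadata": {"name": "api-pdb", "namespace": "prod"},
    "spec": {"minAvailable": 1, "selector": {"matchLabels": {"app": "api"}}},
}]
print(missing_resources(deployment))  # ['api: limits.cpu']
print(covered_by_pdb(deployment, pdbs))  # True
```

Some teams omit CPU limits on purpose; if that is your policy, record the item as N/A with the reason rather than letting it count as a silent failure.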

Scalability

7 checks

Cluster Autoscaler / Karpenter, HPA, VPA, node pool sizing, CoreDNS, CNI mode.

19. Cluster Autoscaler or Karpenter

Configured, healthy, and tuned to workload arrival patterns.

20. Metrics-server health

Running and serving HPA queries reliably; custom metrics flowing where needed.

21. Vertical Pod Autoscaler

Evaluated for right-sizing; recommendation-only mode (updateMode: Off) at minimum on top workloads.

22. Node pool sizing

Workload profile matches node SKU; no oversized general-purpose nodes for batch.

23. Ingress controller scaling

Horizontally scaled with PDBs to handle traffic peaks without outages.

24. CoreDNS configuration

Replica count and autoscaler tuned to cluster size; no DNS bottlenecks at scale.

25. CNI and kube-proxy mode

IPVS vs iptables decision documented; appropriate for cluster scale.
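For the autoscaling-related items, a short script can flag HorizontalPodAutoscaler and VerticalPodAutoscaler objects that deserve a closer look. This sketch assumes parsed `autoscaling/v2` HPA objects and VPA objects as input; the `min_floor` threshold of 2 is an illustrative assumption rather than a Kubernetes rule, and the function names are hypothetical.

```python
def hpa_findings(hpa, min_floor=2):
    """Flag HPA settings that leave little room to scale.

    min_floor is an assumed availability threshold, not a Kubernetes default.
    """
    spec = hpa["spec"]
    low = spec.get("minReplicas", 1)  # the API defaults minReplicas to 1
    high = spec["maxReplicas"]
    findings = []
    if low < min_floor:
        findings.append(f"minReplicas {low} is below {min_floor}")
    if high <= low:
        findings.append("maxReplicas leaves no headroom above minReplicas")
    if not spec.get("metrics"):
        findings.append("no explicit metrics configured")
    return findings


def vpa_recommend_only(vpa):
    """True when the VPA only publishes recommendations (updateMode "Off")."""
    return (vpa["spec"].get("updatePolicy") or {}).get("updateMode") == "Off"


hpa = {"spec": {
    "minReplicas": 1,
    "maxReplicas": 1,
    "metrics": [{"type": "Resource", "resource": {
        "name": "cpu",
        "target": {"type": "Utilization", "averageUtilization": 70},
    }}],
}}
vpa = {"spec": {"updatePolicy": {"updateMode": "Off"}}}

print(hpa_findings(hpa))
print(vpa_recommend_only(vpa))  # True
```

The sample HPA is pinned at a single replica, so it is flagged twice: once for the low floor and once for having no room to scale out.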

Observability

8 checks

Metrics, logs, traces, alerts, dashboards, SLOs, cost observability, audit logs.

26. Metrics collection

Prometheus or managed equivalent collecting cluster, workload, and control-plane metrics.

27. Log aggregation

Logs centralized in Loki, Elastic, CloudWatch, Azure Monitor, or Google Cloud Logging with retention.

28. Distributed tracing

OpenTelemetry or service-mesh tracing across top user-facing services.

29. Actionable alerting

Alerts deduplicated and routed to on-call; no noisy alerts ignored by the team.

30. Operational dashboards

Cluster health, workload latency, error rate, and saturation visible at a glance.

31. SLOs and error budgets

Defined for the top three user-facing services with error budget tracking.

32. Cost observability

Kubecost, OpenCost, or cloud-native equivalent providing per-namespace cost data.

33. Audit and access log retention

Retained per regulatory and incident-response requirements; 90 days minimum.
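The error budget tracking in check 31 comes down to simple arithmetic. As a worked example, assuming a time-based availability SLO and a 30-day window (both are illustrative choices; the checklist does not fix either), the budget can be computed like this:

```python
def error_budget_minutes(slo_target, window_days=30):
    """Allowed minutes of unavailability for an availability SLO over the window."""
    window_minutes = window_days * 24 * 60
    return (1 - slo_target) * window_minutes


def budget_remaining(slo_target, bad_minutes, window_days=30):
    """Fraction of the error budget still unspent (negative once overspent)."""
    return 1 - bad_minutes / error_budget_minutes(slo_target, window_days)


# A 99.9% SLO over 30 days allows 0.1% of 43,200 minutes.
print(round(error_budget_minutes(0.999), 1))    # 43.2
# After 10.8 minutes of downtime, three quarters of the budget is left.
print(round(budget_remaining(0.999, 10.8), 2))  # 0.75
```

Request-based SLOs follow the same pattern with failed and total request counts in place of minutes.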

Cost

6 checks

Right-sizing, spot nodes, idle resources, PV lifecycle, egress, RIs / Savings Plans.

34. Right-sizing gap

Declared CPU and memory requests compared against P95 actual usage.

35. Spot and preemptible usage

Used for fault-tolerant workloads; on-demand baseline kept for stateful services.

36. Idle resource detection

Unused PVs, orphaned LoadBalancers, zero-replica deployments identified.

37. Persistent Volume lifecycle

Snapshot policy, retention, and tier-down for cold data implemented.

38. Egress cost analysis

Cross-AZ chatter, NAT gateway charges, and internet egress quantified and optimized.

39. Reserved Instance coverage

Savings Plans, Reserved Instances, or committed use discounts applied to baseline node consumption.
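The right-sizing gap in check 34 compares a declared request with observed P95 usage. Below is a minimal sketch, assuming usage samples have already been exported from your metrics system in the same unit as the request (millicores here). The nearest-rank percentile method and the function names are illustrative choices.

```python
def p95(samples):
    """Nearest-rank 95th percentile of a non-empty list of usage samples."""
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def right_sizing_gap(requested, samples):
    """Return (P95 usage, fraction of the request left unused at P95)."""
    observed = p95(samples)
    return observed, (requested - observed) / requested


# Hypothetical data: a container requesting 1000m CPU, with 20 usage samples
# ranging from 100m to 290m.
cpu_samples = list(range(100, 300, 10))
observed, gap = right_sizing_gap(1000, cpu_samples)
print(observed, gap)  # 280 0.72
```

A gap of 0.72 means 72% of the requested CPU is unused even at the 95th percentile, which makes the workload a strong candidate for a lower request.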

Compliance

8 checks

Data residency, encryption, audit retention, SSO, change management, DR.

40. Data residency

Workloads and storage in the correct region for GDPR, PDPL, NESA, or PIPEDA.

41. Encryption in transit

TLS at ingress plus workload-to-workload encryption; mTLS via service mesh preferred.

42. Encryption at rest

PVs, etcd, Secrets, and container registry all encrypted with managed keys.

43. Audit log retention

Retention meets the regulatory floor for every jurisdiction you operate in.

44. SSO and OIDC for kubectl

Corporate IdP integration; no shared kubeconfig files or long-lived service tokens.

45. GitOps change management

PR approval workflow; no kubectl apply from engineer laptops in production.

46. Vulnerability remediation SLA

Documented and met; Critical CVEs patched within 7 days, High within 30.

47. Disaster recovery

RTO and RPO documented; DR drill executed within the last 90 days.
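The time-based thresholds in checks 46 and 47 are easy to verify mechanically once findings and drill dates are recorded. The sketch below uses the deadlines stated in those checks. The input shape, the function names, and the placeholder identifiers are assumptions for illustration.

```python
from datetime import date, timedelta

# Deadlines from check 46; severities without a stated SLA are ignored here.
SLA_DAYS = {"Critical": 7, "High": 30}


def overdue_cves(findings, today):
    """findings: iterable of (cve_id, severity, detected_on) with date objects."""
    late = []
    for cve_id, severity, detected_on in findings:
        limit = SLA_DAYS.get(severity)
        if limit is not None and today - detected_on > timedelta(days=limit):
            late.append(cve_id)
    return late


def drill_is_current(last_drill, today, max_age_days=90):
    """Check 47: was the last DR drill run within the past 90 days?"""
    return today - last_drill <= timedelta(days=max_age_days)


today = date(2024, 6, 30)
open_findings = [
    ("EXAMPLE-1", "Critical", date(2024, 6, 20)),  # placeholder ID, 10 days old
    ("EXAMPLE-2", "High", date(2024, 6, 10)),      # placeholder ID, 20 days old
    ("EXAMPLE-3", "Medium", date(2024, 1, 1)),     # no SLA defined for Medium
]
print(overdue_cves(open_findings, today))         # ['EXAMPLE-1']
print(drill_is_current(date(2024, 3, 1), today))  # False
```

Only the Critical finding is overdue in this sample, and a drill last run on 1 March is 121 days old, so check 47 would fail.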

Productized service

Want a senior engineer to run this for you?

The checklist tells you what to check. Running it accurately, prioritizing findings by business risk, and producing a board-ready report is the harder part. We do all of that for a fixed $495 in 2 weeks.

$495 USD · fixed price · 2 weeks

Frequently asked questions

About the checklist and how to use it.

Can I use this checklist for free?

Yes. This is the same checklist a senior engineer would run during a paid audit. We publish it ungated because the checklist itself is not the hard part; running it accurately, prioritizing findings by business risk, and producing a board-ready report is. If you want help with the running and reporting part, see the paid Kubernetes Production Readiness Audit.

How long does it take to run this myself?

A platform engineer who knows the cluster can typically work through the 47 items in 2-3 working days, plus another day to write up findings. A senior engineer who has run dozens of audits can do it in 1-2 days because they recognize common failure patterns faster.

Do I need to check every item on every cluster?

Not necessarily. Smaller clusters or non-production workloads may legitimately skip some items (for example, multi-region or disaster recovery checks may be N/A for a single-region pre-production environment). The checklist is a maximum list, not a minimum requirement.

What clusters does this apply to?

Amazon EKS, Azure AKS, and Google GKE, the three managed Kubernetes services this checklist is written for. Most items also apply to other Kubernetes distributions, but the specific tooling and cloud-region recommendations assume EKS, AKS, or GKE.

How do I prioritize findings I uncover?

Score each finding on two dimensions: blast radius (what breaks if this fails: one workload, a namespace, or the whole cluster) and likelihood (how probable the failure is given the current configuration). Critical findings combine high blast radius with high likelihood. The paid audit produces this scoring automatically with effort estimates.

More free tools and resources

ROI calculators for DevOps, Kubernetes, business automation, and data analytics, all ungated.
