Free resource · Ungated

Kubernetes Production Readiness Checklist

47 production-readiness checks across six categories. The exact checklist a senior engineer would run against your production EKS, AKS, or GKE cluster - published in full, with no email gate or sign-up wall.

In 47 checks you'll have

Security, Reliability, Scalability, Observability, Cost, Compliance

Every check links to a specific configuration or tool

Free for unlimited personal and internal team use

Same methodology as our paid $495 audit

47 production checks · 6 categories · Free, ungated

How to use this checklist

Work through each section against one production cluster. For every item, mark it Pass, Partial, Fail, or N/A. Capture a one-line finding for any Partial or Fail item with a specific recommendation and an effort estimate.

Once all 47 items are scored, group findings by category and score each category 0-100 based on the percentage of applicable items that passed. This gives you an executive scorecard that non-technical stakeholders can interpret.
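
The category scoring can be sketched in a few lines of Python. This is a minimal illustration, not the audit's actual formula: it assumes N/A items are excluded from the denominator and that Partial and Fail both count as not passed. The function name `category_scores` and the sample results are hypothetical.

```python
from collections import defaultdict


def category_scores(results):
    """Map each category to a 0-100 score from (category, status) pairs.

    Assumption: score = passed / applicable * 100, where N/A items are
    excluded and Partial or Fail items count as not passed.
    """
    passed = defaultdict(int)
    applicable = defaultdict(int)
    for category, status in results:
        if status == "N/A":
            continue  # not applicable: leave it out of the denominator
        applicable[category] += 1
        if status == "Pass":
            passed[category] += 1
    return {c: round(100 * passed[c] / applicable[c]) for c in applicable}


# Hypothetical sample: five Security items and one Cost item.
sample = [
    ("Security", "Pass"),
    ("Security", "Partial"),
    ("Security", "Fail"),
    ("Security", "Pass"),
    ("Security", "N/A"),
    ("Cost", "Pass"),
]
print(category_scores(sample))
```

If you would rather give Partial items half credit, change the counting rule; the point is to pick one rule and apply it to every category.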

Prioritize remediation by blast radius (what breaks if this fails) and likelihood (how probable the failure is). A finding with both high blast radius and high likelihood is critical: fix it first. Build a 30-60-90-day remediation roadmap from there.
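
One way to turn that rule into a sortable roadmap is sketched below. The three-level scale, the multiplication, the "major" and "minor" labels, and the mapping onto 30, 60, and 90 days are all assumptions made for illustration; only "high blast radius plus high likelihood is critical" comes from the checklist itself.

```python
LEVELS = {"low": 1, "medium": 2, "high": 3}


def risk(finding):
    """Blast radius times likelihood, each on an assumed 1-3 scale."""
    return LEVELS[finding["blast_radius"]] * LEVELS[finding["likelihood"]]


def triage(finding):
    """Return (label, roadmap_days) for one finding."""
    score = risk(finding)
    if score == 9:  # high blast radius and high likelihood
        return "critical", 30
    if score >= 3:  # assumed middle band
        return "major", 60
    return "minor", 90


def roadmap(findings):
    """Order findings so the highest-risk ones come first."""
    return sorted(findings, key=risk, reverse=True)


# Hypothetical findings for illustration.
findings = [
    {"id": "no-pdb", "blast_radius": "medium", "likelihood": "medium"},
    {"id": "cluster-admin-sa", "blast_radius": "high", "likelihood": "high"},
    {"id": "idle-pv", "blast_radius": "low", "likelihood": "medium"},
]
for f in roadmap(findings):
    print(f["id"], *triage(f))
```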

The 47 checks

Six categories. Every check linked to a specific configuration, tool, or operational practice.

Security

10 checks

RBAC, Pod Security, NetworkPolicies, secrets, image provenance, CIS benchmark.

  1. RBAC scope

    No cluster-admin bound to ServiceAccounts; least-privilege Roles per namespace.

  2. Pod Security Admission

    Enforced at namespace level; restricted profile for production workloads.

  3. NetworkPolicies

    Default-deny baseline with explicit ingress and egress allows per workload.

  4. Secret encryption at rest

    Cloud KMS-backed encryption, not just base64-encoded etcd entries.

  5. Image provenance

    Private registry plus signature verification with Sigstore tooling such as cosign.

  6. ServiceAccount token hygiene

    Tokens not auto-mounted unless the workload actually needs API access.

  7. Container security context

    runAsNonRoot, readOnlyRootFilesystem, drop ALL capabilities by default.

  8. CIS Kubernetes Benchmark

    kube-bench score reviewed; failing controls prioritized in remediation.

  9. API server audit logging

    Enabled and routed to a SIEM or long-term log store, not just stdout.

  10. Runtime vulnerability scanning

    Trivy or Grype scanning the images of running workloads, with severity-based gates.
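
Two of the security checks (ServiceAccount token hygiene and container security context) can be checked mechanically against a pod spec. The sketch below assumes the spec has already been loaded into a Python dict, for example from `kubectl get pod -o json`. The field names are the standard Kubernetes ones; the function name and finding messages are hypothetical. It inspects only the pod spec, so a token disabled on the ServiceAccount object itself would not be detected.

```python
def security_findings(pod_spec):
    """List security-context gaps in a pod spec given as a dict."""
    findings = []
    pod_sc = pod_spec.get("securityContext", {})

    # The token is mounted by default unless explicitly disabled.
    if pod_spec.get("automountServiceAccountToken", True):
        findings.append("pod: ServiceAccount token is auto-mounted")

    for c in pod_spec.get("containers", []):
        name = c.get("name", "<unnamed>")
        sc = c.get("securityContext", {})

        # runAsNonRoot may be set per container or inherited from the pod.
        if not sc.get("runAsNonRoot", pod_sc.get("runAsNonRoot", False)):
            findings.append(f"{name}: runAsNonRoot is not set")
        if not sc.get("readOnlyRootFilesystem", False):
            findings.append(f"{name}: root filesystem is writable")
        if "ALL" not in sc.get("capabilities", {}).get("drop", []):
            findings.append(f"{name}: capabilities do not drop ALL")
    return findings


# Hypothetical hardened spec: expected to produce no findings.
hardened = {
    "automountServiceAccountToken": False,
    "securityContext": {"runAsNonRoot": True},
    "containers": [{
        "name": "api",
        "securityContext": {
            "readOnlyRootFilesystem": True,
            "capabilities": {"drop": ["ALL"]},
        },
    }],
}
print(security_findings(hardened))
print(security_findings({"containers": [{"name": "api"}]}))
```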

Reliability

8 checks

PDBs, anti-affinity, probes, resource limits, graceful shutdown, etcd backups.

  11. PodDisruptionBudgets

    Present on every stateful and critical-path workload to survive drains.

  12. Multi-zone distribution

    Topology spread constraints or anti-affinity rules across AZs.

  13. Autoscaling configured

    HPA or KEDA with realistic min and max, targeting workload-appropriate metrics.

  14. Resource requests and limits

    Set on every container; no unbounded workloads competing for node resources.

  15. Probes tuned correctly

    Liveness, readiness, and startup probes calibrated to avoid restart stampedes.

  16. Graceful shutdown

    terminationGracePeriodSeconds plus preStop hooks where needed for clean exits.

  17. Image pull policy

    No IfNotPresent with mutable tags in production; immutable tags preferred.

  18. etcd backup and restore

    Automated backups (or an equivalent cluster-state backup where the provider manages etcd) plus a restore procedure tested within the last 90 days.
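
The resource and probe checks are straightforward to automate as a first pass. The sketch below assumes container specs loaded as Python dicts and flags containers that lack CPU or memory requests and limits, or that have no readiness probe. It tests presence only: whether a probe is tuned well still needs human judgment. The function name and messages are hypothetical.

```python
REQUIRED = ("cpu", "memory")


def reliability_findings(containers):
    """Flag missing requests, limits, and readiness probes."""
    findings = []
    for c in containers:
        name = c.get("name", "<unnamed>")
        resources = c.get("resources", {})
        for kind in ("requests", "limits"):
            declared = resources.get(kind, {})
            missing = [r for r in REQUIRED if r not in declared]
            if missing:
                findings.append(f"{name}: {kind} missing {', '.join(missing)}")
        if "readinessProbe" not in c:
            findings.append(f"{name}: no readinessProbe")
    return findings


# Hypothetical container with a CPU limit and readiness probe missing.
containers = [{
    "name": "api",
    "resources": {
        "requests": {"cpu": "250m", "memory": "256Mi"},
        "limits": {"memory": "512Mi"},
    },
}]
print(reliability_findings(containers))
```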

Scalability

7 checks

Cluster Autoscaler / Karpenter, HPA, VPA, node pool sizing, CoreDNS, CNI mode.

  19. Cluster Autoscaler or Karpenter

    Configured, healthy, and tuned to workload arrival patterns.

  20. Metrics-server health

    Running and serving HPA queries reliably; custom metrics flowing where needed.

  21. Vertical Pod Autoscaler

    Evaluated for right-sizing; recommendation-only mode (updateMode "Off") at minimum on top workloads.

  22. Node pool sizing

    Workload profile matches node SKU; no oversized general-purpose nodes for batch.

  23. Ingress controller scaling

    Horizontally scaled with PDBs to handle traffic peaks without outages.

  24. CoreDNS configuration

    Replica count and autoscaler tuned to cluster size; no DNS bottlenecks at scale.

  25. CNI and kube-proxy mode

    IPVS vs iptables decision documented; appropriate for cluster scale.
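
When reviewing whether HPA min and max values are realistic, it helps to reproduce the scaling arithmetic by hand. The sketch below follows the documented HPA formula, desired = ceil(current * currentMetric / targetMetric), with the default 10% tolerance, then clamps to the configured bounds. It is a simplification: it ignores stabilization windows, scaling policies, and pod readiness handling, and the function name is hypothetical.

```python
import math


def desired_replicas(current, metric, target, min_replicas, max_replicas,
                     tolerance=0.1):
    """Approximate the HPA replica calculation for a single metric."""
    ratio = metric / target
    if abs(ratio - 1.0) <= tolerance:
        desired = current  # close enough to target: no scaling
    else:
        desired = math.ceil(current * ratio)
    # The HPA never goes outside its configured bounds.
    return max(min_replicas, min(max_replicas, desired))


# 4 replicas at 90% CPU against a 60% target, bounds 2..10.
print(desired_replicas(4, 90, 60, 2, 10))
# Same idea, but the max of 5 caps the result.
print(desired_replicas(3, 200, 100, 2, 5))
```

If the second call is a realistic peak for your workload, the max is too low: the autoscaler wants six replicas and can only run five.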

Observability

8 checks

Metrics, logs, traces, alerts, dashboards, SLOs, cost observability, audit logs.

  26. Metrics collection

    Prometheus or managed equivalent collecting cluster, workload, and control-plane metrics.

  27. Log aggregation

    Logs centralized in Loki, Elastic, CloudWatch, or Azure Monitor with retention.

  28. Distributed tracing

    OpenTelemetry or service-mesh tracing across top user-facing services.

  29. Actionable alerting

    Alerts deduplicated and routed to on-call; no noisy alerts ignored by the team.

  30. Operational dashboards

    Cluster health, workload latency, error rate, and saturation visible at a glance.

  31. SLOs and error budgets

    Defined for the top three user-facing services with error budget tracking.

  32. Cost observability

    Kubecost, OpenCost, or cloud-native equivalent providing per-namespace cost data.

  33. Audit and access log retention

    Retained per regulatory and incident-response requirements; 90 days minimum.
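
For the SLO check, "error budget tracking" comes down to simple arithmetic. The sketch below assumes a request-based availability SLO: the budget is the number of failures the target allows over the window, and the remaining budget is whatever has not been consumed. The function name and the sample numbers are hypothetical.

```python
def error_budget(slo_target, total_requests, failed_requests):
    """Compute allowed failures and budget consumption for one window."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1, e.g. 0.999")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed = total_requests * (1 - slo_target)
    consumed = failed_requests / allowed
    return {
        "allowed_failures": allowed,
        "consumed": consumed,        # 1.0 means the budget is exhausted
        "remaining": 1 - consumed,   # negative means the SLO is breached
    }


# Hypothetical window: 99.9% target, 1M requests, 250 failures.
budget = error_budget(0.999, 1_000_000, 250)
print(f"remaining budget: {budget['remaining']:.0%}")
```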

Cost

6 checks

Right-sizing, spot nodes, idle resources, PV lifecycle, egress, RIs / Savings Plans.

  34. Right-sizing gap

    Declared CPU and memory requests compared against P95 actual usage.

  35. Spot and preemptible usage

    Used for fault-tolerant workloads; baseline on-demand for stateful services.

  36. Idle resource detection

    Unused PVs, orphaned LoadBalancers, zero-replica deployments identified.

  37. Persistent Volume lifecycle

    Snapshot policy, retention, and tier-down for cold data implemented.

  38. Egress cost analysis

    Cross-AZ chatter, NAT gateway charges, and internet egress quantified and optimized.

  39. Reserved Instance coverage

    Savings Plans or Reserved Instances applied to baseline node consumption.
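
The right-sizing gap is the difference between what a container requests and what it actually uses at P95. The sketch below assumes you have exported usage samples (for example CPU in millicores) from your metrics store, and uses the nearest-rank method for the percentile. Other percentile definitions give slightly different values. The function names are hypothetical.

```python
def p95(samples):
    """Nearest-rank 95th percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("need at least one usage sample")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integers
    return ordered[rank - 1]


def rightsizing_gap(request, usage_samples):
    """Compare a declared request against P95 observed usage."""
    observed = p95(usage_samples)
    return {
        "p95": observed,
        # Share of the request that sits unused at P95; negative values
        # mean the workload regularly exceeds its request.
        "unused_ratio": (request - observed) / request,
    }


# Hypothetical: 200m CPU requested, usage samples from 1m to 100m.
print(rightsizing_gap(200, list(range(1, 101))))
```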

Compliance

8 checks

Data residency, encryption, audit retention, SSO, change management, DR.

  40. Data residency

    Workloads and storage in the correct region for GDPR, PDPL, NESA, or PIPEDA.

  41. Encryption in transit

    TLS at ingress plus workload-to-workload encryption; mTLS via service mesh preferred.

  42. Encryption at rest

    PVs, etcd, Secrets, and container registry all encrypted with managed keys.

  43. Audit log retention

    Retention meets the regulatory floor for the applicable jurisdiction.

  44. SSO and OIDC for kubectl

    Corporate IdP integration; no shared kubeconfig files or long-lived service tokens.

  45. GitOps change management

    PR approval workflow; no kubectl apply from engineer laptops in production.

  46. Vulnerability remediation SLA

    Documented and met; Critical CVEs patched within 7 days, High within 30.

  47. Disaster recovery

    RTO and RPO documented; DR drill executed within the last 90 days.
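
Whether the vulnerability remediation SLA is "met" can be checked from a findings export. The sketch below uses the 7-day and 30-day windows from the checklist. The record layout (`id`, `severity`, `detected`, `patched`) and the finding identifiers are hypothetical, so adapt them to whatever your scanner exports. Open findings are aged against today's date.

```python
from datetime import date

SLA_DAYS = {"Critical": 7, "High": 30}


def sla_breaches(findings, today):
    """Return (id, age_in_days) for findings that exceeded their SLA."""
    breaches = []
    for f in findings:
        limit = SLA_DAYS.get(f["severity"])
        if limit is None:
            continue  # no SLA defined for this severity
        closed = f.get("patched") or today  # still open: age to today
        age = (closed - f["detected"]).days
        if age > limit:
            breaches.append((f["id"], age))
    return breaches


# Hypothetical export with placeholder identifiers.
export = [
    {"id": "finding-1", "severity": "Critical",
     "detected": date(2024, 1, 1), "patched": date(2024, 1, 5)},
    {"id": "finding-2", "severity": "Critical",
     "detected": date(2024, 1, 1), "patched": None},
    {"id": "finding-3", "severity": "High",
     "detected": date(2024, 1, 1), "patched": date(2024, 1, 31)},
]
print(sla_breaches(export, today=date(2024, 1, 20)))
```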

Productized service

Want a senior engineer to run this for you?

The checklist tells you what to check. Running it accurately, prioritizing findings by business risk, and producing a board-ready report is the harder part. We do all of that for a fixed $495 in 2 weeks.


Frequently asked questions

About the checklist and how to use it.

Can I use this checklist for free?

Yes. This is the same checklist a senior engineer would run during a paid audit. We publish it ungated because the checklist itself is not the hard part - running it accurately, prioritizing findings by business risk, and producing a board-ready report is. If you want help with the running and reporting part, see the paid Kubernetes Production Readiness Audit.

How long does it take to run this myself?

A platform engineer who knows the cluster can typically work through the 47 items in 2-3 working days, plus another day to write up findings. A senior engineer who has run dozens of audits can do it in 1-2 days because pattern recognition is faster.

Do I need to check every item on every cluster?

Not necessarily. Smaller clusters or non-production workloads may legitimately skip some items (for example, multi-region or disaster recovery checks may be N/A for a single-region pre-production environment). The checklist is a maximum list, not a minimum requirement.

What clusters does this apply to?

Amazon EKS, Azure AKS, and Google GKE - the three major managed Kubernetes services. Most items also apply to other Kubernetes distributions, but the specific tooling and cloud-region recommendations are written for EKS, AKS, and GKE.

How do I prioritize findings I uncover?

Score each finding by two dimensions: blast radius (what breaks if this fails - one workload, a namespace, the whole cluster) and likelihood (how probable the failure is given current configuration). Critical findings are high blast radius plus high likelihood. The paid audit produces this scoring automatically with effort estimates.

