Kubernetes is powerful but complex. After implementing Kubernetes clusters for dozens of organizations, we’ve seen the same mistakes repeated across companies of all sizes. These mistakes lead to outages, security breaches, cost overruns, and frustrated engineering teams.
This guide shares the 10 most critical Kubernetes mistakes we’ve encountered, explains why they happen, and provides actionable solutions based on real-world experience. Learn from others’ mistakes to avoid repeating them.
Table of Contents
- Not Setting Resource Requests and Limits
- Running Everything as Root
- No Network Policies (Default Allow-All)
- Ignoring Pod Disruption Budgets
- Storing Secrets in Git
- Not Implementing Health Checks
- Treating Kubernetes Like VMs
- No Disaster Recovery or Backup Strategy
- Over-Engineering from Day One
- Neglecting Observability and Monitoring
Mistake #1: Not Setting Resource Requests and Limits
The Problem
What happens:
- Pods scheduled on nodes without considering actual resource needs
- Single pod consuming entire node’s resources
- Other pods starved of CPU/memory, causing cascading failures
- Unpredictable costs from over-provisioning
- Cluster autoscaler making poor decisions
Real-world example: A SaaS company experienced a production outage when a memory leak in one application consumed all available memory on a node, causing the kubelet to crash and all pods on that node to fail. The issue cascaded as traffic shifted to remaining nodes, overwhelming them as well.
Cost: 4.5 hours downtime, $180K in lost revenue
Why It Happens
- Developers unfamiliar with resource management
- “It works in development” mentality
- Avoiding the effort of profiling actual usage
- Fear of setting limits that might restrict performance
The Solution
Set appropriate resource requests and limits:
apiVersion: apps/v1
kind: Deployment
metadata:
name: api
spec:
template:
spec:
containers:
- name: api
image: myapi:1.0
resources:
requests:
cpu: "200m" # Minimum guaranteed CPU
memory: "256Mi" # Minimum guaranteed memory
limits:
cpu: "500m" # Maximum CPU (throttled beyond this)
memory: "512Mi" # Maximum memory (OOMKilled beyond this)
How to determine appropriate values:
- Start with Vertical Pod Autoscaler (VPA) recommendations:
kubectl apply -f - <<EOF
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
name: api-vpa
spec:
targetRef:
apiVersion: apps/v1
kind: Deployment
name: api
  updatePolicy:
    updateMode: "Off" # Recommendation only; the VPA will not evict or resize pods
EOF
# Get recommendations after a few days
kubectl describe vpa api-vpa
- Monitor actual usage with Prometheus:
# 95th percentile memory usage over 7 days
quantile_over_time(0.95, container_memory_working_set_bytes{container="api"}[7d])
# 95th percentile CPU usage over 7 days
quantile_over_time(0.95, rate(container_cpu_usage_seconds_total{container="api"}[5m])[7d:5m])
- Set requests to the observed p95 and limits to roughly 2x the request (a worked example follows the LimitRange below)
- Use LimitRanges to enforce defaults:
apiVersion: v1
kind: LimitRange
metadata:
name: default-limits
namespace: production
spec:
limits:
- default: # Default limits if not specified
cpu: "500m"
memory: "512Mi"
defaultRequest: # Default requests if not specified
cpu: "100m"
memory: "128Mi"
max: # Maximum allowed
cpu: "4000m"
memory: "8Gi"
min: # Minimum required
cpu: "50m"
memory: "64Mi"
type: Container
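For example, suppose the Prometheus queries above report a p95 of roughly 180m CPU and 240Mi memory for the api container (illustrative numbers, not taken from the incident). Applying the rule of thumb gives a fragment like this sketch:

# Fragment of the container spec, sized from assumed p95 measurements (180m CPU, 240Mi memory)
resources:
  requests:
    cpu: "200m"      # p95 CPU, rounded up
    memory: "256Mi"  # p95 memory, rounded up
  limits:
    cpu: "400m"      # ~2x the CPU request
    memory: "512Mi"  # ~2x the memory request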
Related: Our Kubernetes cost optimization guide covers right-sizing in detail.
Mistake #2: Running Everything as Root
The Problem
What happens:
- Containers run with UID 0 (root)
- Full access to node filesystem if container escapes
- Privilege escalation attacks possible
- Compliance failures (PCI DSS, HIPAA)
- Audit findings and security violations
Real-world example: A financial services company failed a PCI DSS audit because 80% of their containers ran as root. The auditor identified this as a critical finding, blocking their ability to process credit card payments until remediated. Fixing required 6 weeks of work across 100+ microservices.
Cost: $450K in remediation effort, delayed product launch
Why It Happens
- Default Dockerfiles don’t create a non-root user
- “It just works” with root (file permissions, port binding)
- Developers unaware of security implications
- Legacy applications expecting root access
The Solution
1. Update Dockerfiles to create and use non-root user:
FROM node:18-alpine
# Create app user
RUN addgroup -g 1001 app && \
adduser -u 1001 -G app -s /bin/sh -D app
# Set working directory and ownership
WORKDIR /app
COPY --chown=app:app . .
# Install dependencies
RUN npm ci --only=production
# Switch to non-root user
USER app
EXPOSE 3000
CMD ["node", "server.js"]
2. Enforce non-root in Kubernetes:
apiVersion: v1
kind: Pod
metadata:
name: secure-app
spec:
securityContext:
runAsNonRoot: true # Reject if user is root
runAsUser: 1001 # Explicit UID
fsGroup: 2000 # Volume ownership
seccompProfile:
type: RuntimeDefault
containers:
- name: app
image: myapp:1.0
securityContext:
allowPrivilegeEscalation: false
readOnlyRootFilesystem: true
capabilities:
drop:
- ALL
add:
- NET_BIND_SERVICE # Only if binding to ports < 1024
3. Use Pod Security Standards to enforce:
apiVersion: v1
kind: Namespace
metadata:
name: production
labels:
pod-security.kubernetes.io/enforce: restricted
pod-security.kubernetes.io/audit: restricted
pod-security.kubernetes.io/warn: restricted
4. Audit existing workloads:
# Find pods running as root
kubectl get pods --all-namespaces -o json | \
jq -r '.items[] |
select(.spec.securityContext.runAsUser == 0 or
(.spec.securityContext.runAsUser == null and
.spec.containers[].securityContext.runAsUser == null)) |
"\(.metadata.namespace)/\(.metadata.name)"'
Related: Our Kubernetes security best practices guide covers this in depth.
Mistake #3: No Network Policies (Default Allow-All)
The Problem
What happens:
- All pods can communicate with all other pods
- Compromised pod can access entire cluster
- Lateral movement for attackers
- No segmentation or isolation
- Compliance failures
Real-world example: An e-commerce platform had a vulnerability in their public-facing web application. Attackers compromised a web pod and used it to directly access the internal database pod, exfiltrating customer payment information. The breach went undetected for 3 weeks.
Cost: $2.3M in fines and legal settlements, severe reputation damage
Why It Happens
- Default Kubernetes networking is allow-all
- “We’ll add network policies later” (never happens)
- Complexity of defining policies
- Fear of breaking applications
The Solution
1. Start with default deny-all policy:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: default-deny-all
namespace: production
spec:
podSelector: {} # Applies to all pods
policyTypes:
- Ingress
- Egress
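Be aware that a default deny on Egress also blocks DNS lookups, which breaks service discovery for every pod in the namespace. A common companion policy (a sketch; it assumes kube-dns runs in kube-system with the standard k8s-app=kube-dns label) re-allows DNS for all pods:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns
  namespace: production
spec:
  podSelector: {}          # Applies to all pods in the namespace
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
      podSelector:
        matchLabels:
          k8s-app: kube-dns
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53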
2. Allow specific traffic patterns:
---
# Allow web tier to access API tier
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: web-to-api
namespace: production
spec:
podSelector:
matchLabels:
tier: api
policyTypes:
- Ingress
ingress:
- from:
- podSelector:
matchLabels:
tier: web
ports:
- protocol: TCP
port: 8080
---
# Allow API tier to access database
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: api-to-database
namespace: production
spec:
podSelector:
matchLabels:
tier: database
policyTypes:
- Ingress
ingress:
- from:
- podSelector:
matchLabels:
tier: api
ports:
- protocol: TCP
port: 5432
---
# Allow egress for DNS and external APIs
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: api-egress
namespace: production
spec:
podSelector:
matchLabels:
tier: api
policyTypes:
- Egress
egress:
- to: # Database
- podSelector:
matchLabels:
tier: database
ports:
- protocol: TCP
port: 5432
  - to: # DNS (kube-dns in kube-system; selectors combined in one element so both must match)
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
      podSelector:
        matchLabels:
          k8s-app: kube-dns
ports:
- protocol: UDP
port: 53
  - to: # External HTTPS (namespaceSelector only covers in-cluster pods; external endpoints need ipBlock)
    - ipBlock:
        cidr: 0.0.0.0/0
ports:
- protocol: TCP
port: 443
3. Test network policies before enforcement:
# Test connectivity from web pod to api pod
kubectl exec -n production web-pod -- curl -v http://api-service:8080/health
# Test connectivity from web pod to database (should fail)
kubectl exec -n production web-pod -- nc -zv database-service 5432
4. Audit network policies:
# Check which namespaces have network policies
kubectl get networkpolicies --all-namespaces
# Visualize policies (using kubectl-netpol plugin)
kubectl netpol viz -n production
5. Use Cilium for advanced L7 policies:
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
name: api-l7-policy
namespace: production
spec:
endpointSelector:
matchLabels:
tier: api
ingress:
- fromEndpoints:
- matchLabels:
tier: web
toPorts:
- ports:
- port: "8080"
protocol: TCP
rules:
http:
- method: "GET"
path: "/api/v1/.*"
- method: "POST"
path: "/api/v1/orders"
Mistake #4: Ignoring Pod Disruption Budgets
The Problem
What happens:
- Node drains during maintenance kill all replicas simultaneously
- Cluster autoscaler scales down nodes with critical pods
- Upgrades cause application downtime
- Zero availability during voluntary disruptions
Real-world example: A SaaS company performed routine node upgrades during business hours. The upgrade drained nodes without respecting application availability, taking down all replicas of their authentication service simultaneously. Users couldn’t log in for 15 minutes during peak usage.
Cost: 15 minutes downtime, 12,000 affected users, reputation damage
Why It Happens
- Developers unaware that Pod Disruption Budgets (PDBs) exist
- Assuming Kubernetes “just handles” high availability
- No testing of maintenance procedures
- Lack of operational documentation
The Solution
1. Create Pod Disruption Budgets for critical services:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: api-pdb
namespace: production
spec:
minAvailable: 2 # Always keep at least 2 pods running
selector:
matchLabels:
app: api
tier: api
Or use a percentage:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: api-pdb
namespace: production
spec:
maxUnavailable: "30%" # Allow maximum 30% unavailable
selector:
matchLabels:
app: api
2. Ensure sufficient replicas:
apiVersion: apps/v1
kind: Deployment
metadata:
name: api
spec:
replicas: 5 # With minAvailable: 2, allows draining 3 at once
template:
metadata:
labels:
app: api
tier: api
3. Test node drains:
# Drain node safely (respects PDBs)
kubectl drain node-1 --ignore-daemonsets --delete-emptydir-data
# Check PDB status
kubectl get pdb -n production
kubectl describe pdb api-pdb -n production
# If PDB blocks drain, check why
kubectl get pods -n production -l app=api -o wide
4. Set PDBs for all stateful workloads:
---
# Database PDB
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: database-pdb
namespace: production
spec:
minAvailable: 2 # Always keep 2 replicas (master + 1 standby)
selector:
matchLabels:
app: postgresql
---
# Cache PDB
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: redis-pdb
namespace: production
spec:
maxUnavailable: 1 # Allow only 1 Redis pod down at a time
selector:
matchLabels:
app: redis
5. Use anti-affinity to spread replicas across nodes:
apiVersion: apps/v1
kind: Deployment
metadata:
name: api
spec:
replicas: 5
template:
spec:
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchLabels:
app: api
topologyKey: kubernetes.io/hostname # Spread across nodes
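On Kubernetes 1.19+, topology spread constraints are an alternative that also caps the skew between nodes or zones; a minimal sketch of a fragment that would sit under spec.template.spec, alongside or instead of the anti-affinity above:

topologySpreadConstraints:
- maxSkew: 1
  topologyKey: kubernetes.io/hostname   # Or topology.kubernetes.io/zone to spread across zones
  whenUnsatisfiable: ScheduleAnyway     # Use DoNotSchedule for a hard guarantee
  labelSelector:
    matchLabels:
      app: api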
Mistake #5: Storing Secrets in Git
The Problem
What happens:
- Database passwords, API keys in Git repository
- Secrets exposed in commit history forever
- Public repository leak = immediate breach
- Compliance violations (SOC 2, ISO 27001)
Real-world example: A startup accidentally pushed AWS credentials to a public GitHub repository. Within 12 minutes, attackers discovered the credentials and spun up $50,000 worth of EC2 instances for cryptocurrency mining. GitHub’s automated scanning alerted them, but damage was done.
Cost: $50,000 in fraudulent charges, security incident response, credential rotation
Why It Happens
- Convenience (“it’s just a private repo”)
- Lack of secrets management solution
- Unclear boundaries between config and secrets
- Accidental commits (.env files)
The Solution
1. Use external secrets management:
Option A: HashiCorp Vault
apiVersion: v1
kind: Pod
metadata:
name: app
annotations:
vault.hashicorp.com/agent-inject: "true"
vault.hashicorp.com/role: "app-role"
vault.hashicorp.com/agent-inject-secret-database: "database/creds/app"
spec:
serviceAccountName: app
containers:
- name: app
image: myapp:1.0
env:
- name: DB_PASSWORD_FILE
  value: "/vault/secrets/database" # The Vault agent writes the secret to this file; the app reads the file, not an env literal
Option B: External Secrets Operator
apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
name: aws-secrets-manager
namespace: production
spec:
provider:
aws:
service: SecretsManager
region: us-east-1
auth:
jwt:
serviceAccountRef:
name: app
---
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
name: database-credentials
namespace: production
spec:
refreshInterval: 1h
secretStoreRef:
name: aws-secrets-manager
kind: SecretStore
target:
name: database-secret
creationPolicy: Owner
data:
- secretKey: password
remoteRef:
key: production/database
property: password
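The operator materializes a regular Kubernetes Secret named database-secret (per the target above), which pods consume as usual; a brief sketch (the app name and image are placeholders):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
  namespace: production
spec:
  template:
    spec:
      containers:
      - name: app
        image: myapp:1.0
        env:
        - name: DB_PASSWORD
          valueFrom:
            secretKeyRef:
              name: database-secret   # Created and refreshed by the ExternalSecret above
              key: password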
2. Use Sealed Secrets for GitOps:
# Install Sealed Secrets controller
kubectl apply -f https://github.com/bitnami-labs/sealed-secrets/releases/download/v0.24.0/controller.yaml
# Create sealed secret (safe to commit)
kubectl create secret generic mysecret \
--from-literal=password=supersecret \
--dry-run=client -o yaml | \
kubeseal -o yaml > sealed-secret.yaml
# Commit sealed-secret.yaml to Git safely
git add sealed-secret.yaml
git commit -m "Add database credentials"
3. Scan repositories for secrets:
# Install Gitleaks
brew install gitleaks
# Scan repository
gitleaks detect --source . --verbose
# Pre-commit hook
cat > .git/hooks/pre-commit << 'EOF'
#!/bin/bash
gitleaks protect --staged --verbose
EOF
chmod +x .git/hooks/pre-commit
4. Rotate compromised secrets immediately:
# If secrets leaked:
# 1. Rotate credentials immediately
# 2. Update secret in secrets manager
# 3. Restart pods to pick up new secret
kubectl rollout restart deployment/app -n production
# 4. Rewrite Git history if the secret was committed (git filter-repo is the
#    recommended modern tool; the older filter-branch shown here also works)
git filter-branch --force --index-filter \
"git rm --cached --ignore-unmatch .env" \
--prune-empty --tag-name-filter cat -- --all
Mistake #6: Not Implementing Health Checks
The Problem
What happens:
- Kubernetes sends traffic to pods that aren’t ready
- Pods stuck in a crash loop briefly appear “healthy” after each restart
- Application deadlocks undetected
- Users experience errors despite “healthy” cluster
Real-world example: An API service had a subtle database connection pool exhaustion issue. The pod remained running, but all API requests timed out. Without proper health checks, Kubernetes continued routing traffic to the broken pod for 8 minutes until manually restarted.
Cost: 8 minutes of 50% error rate, customer complaints
Why It Happens
- “The pod is running, so it must be working”
- Lack of understanding of liveness vs. readiness probes
- Difficulty implementing health check logic
- Fear that health checks will cause false positives
The Solution
1. Implement both liveness and readiness probes:
apiVersion: apps/v1
kind: Deployment
metadata:
name: api
spec:
template:
spec:
containers:
- name: api
image: myapi:1.0
ports:
- containerPort: 8080
# Liveness: Is the application alive?
# Failure = restart the container
livenessProbe:
httpGet:
path: /healthz
port: 8080
initialDelaySeconds: 30 # Wait for app to start
periodSeconds: 10 # Check every 10 seconds
timeoutSeconds: 5
failureThreshold: 3 # Restart after 3 failures
# Readiness: Is the application ready for traffic?
# Failure = remove from service endpoints
readinessProbe:
httpGet:
path: /readyz
port: 8080
initialDelaySeconds: 10
periodSeconds: 5
timeoutSeconds: 3
failureThreshold: 2 # Remove after 2 failures
# Startup: Special probe for slow-starting apps
startupProbe:
httpGet:
path: /healthz
port: 8080
initialDelaySeconds: 0
periodSeconds: 10
failureThreshold: 30 # Allow 300 seconds to start
2. Implement comprehensive health check endpoints:
// /healthz endpoint (liveness)
func healthzHandler(w http.ResponseWriter, r *http.Request) {
// Basic health check - is the app alive?
w.WriteHeader(http.StatusOK)
w.Write([]byte("OK"))
}
// /readyz endpoint (readiness)
func readyzHandler(w http.ResponseWriter, r *http.Request) {
// Check all dependencies
checks := []struct {
name string
fn func() error
}{
{"database", checkDatabase},
{"cache", checkCache},
{"message_queue", checkMessageQueue},
}
for _, check := range checks {
if err := check.fn(); err != nil {
w.WriteHeader(http.StatusServiceUnavailable)
fmt.Fprintf(w, "NOT READY: %s failed: %v\n", check.name, err)
return
}
}
w.WriteHeader(http.StatusOK)
w.Write([]byte("READY"))
}
func checkDatabase() error {
ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
defer cancel()
return db.PingContext(ctx)
}
3. Use exec probes for non-HTTP applications:
livenessProbe:
exec:
command:
- /bin/grpc_health_probe
- -addr=:9090
initialDelaySeconds: 30
periodSeconds: 10
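For plain TCP services (databases, caches) where shipping a probe binary isn’t practical, a tcpSocket probe is often enough; a minimal sketch (port number assumed):

readinessProbe:
  tcpSocket:
    port: 6379          # e.g. Redis; a successful TCP connect marks the pod ready
  initialDelaySeconds: 5
  periodSeconds: 5
# Note: Kubernetes 1.24+ also offers a native grpc: probe type for gRPC services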
4. Monitor probe failures:
# Alert on frequent container restarts (e.g., caused by failing liveness probes)
rate(kube_pod_container_status_restarts_total[5m]) > 0.1
Mistake #7: Treating Kubernetes Like VMs
The Problem
What happens:
- SSH into containers to debug
- Manual configuration changes inside running containers
- Stateful data stored in container filesystem
- Long-lived pods treated like long-lived VMs
- Expecting containers to survive restarts
Real-world example: An operations team was manually SSH-ing into containers to apply configuration changes and hotfixes. When a node failed, all their manual changes were lost, and they couldn’t reproduce the production configuration. It took 6 hours to rebuild the correct state.
Cost: 6 hours outage, loss of configuration state
Why It Happens
- VM mindset from previous infrastructure
- “It’s faster to just SSH in and fix it”
- Lack of understanding of the Kubernetes declarative model
- Insufficient automation and GitOps
The Solution
1. Embrace immutable infrastructure:
# ❌ Bad: Logging into container and making changes
kubectl exec -it api-pod -- /bin/bash
# (makes manual changes inside container)
# ✅ Good: Update deployment manifest, redeploy
# Edit deployment.yaml
kubectl apply -f deployment.yaml
# Or use GitOps (ArgoCD, Flux)
2. Use ConfigMaps and Secrets for configuration:
# Configuration as code
apiVersion: v1
kind: ConfigMap
metadata:
name: api-config
data:
app.conf: |
log_level: info
max_connections: 100
cache_ttl: 300
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: api
spec:
template:
spec:
containers:
- name: api
image: myapi:1.0
volumeMounts:
- name: config
mountPath: /etc/app/
volumes:
- name: config
configMap:
name: api-config
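One caveat: updating a ConfigMap does not restart the pods that consume it, so a running application may keep serving old values. A common pattern (a sketch; the hash is a placeholder you would generate in CI) is to embed a checksum of the config in a pod-template annotation so that any config change rolls the Deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  template:
    metadata:
      annotations:
        checksum/config: "9f86d081884c7d65"   # Placeholder; e.g. sha256 of the rendered api-config
    spec:
      containers:
      - name: api
        image: myapi:1.0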
3. Debug without exec:
# View logs instead of SSH
kubectl logs api-pod -n production --tail=100 -f
# Port-forward for debugging
kubectl port-forward api-pod 8080:8080
# Copy files out for analysis
kubectl cp production/api-pod:/tmp/debug.log ./debug.log
# Use ephemeral debug containers (Kubernetes 1.23+)
kubectl debug api-pod -it --image=busybox --target=api
4. Use GitOps for declarative management:
# ArgoCD Application
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: api
spec:
  project: default
  source:
repoURL: https://github.com/myorg/k8s-manifests
path: production/api
targetRevision: main
destination:
server: https://kubernetes.default.svc
namespace: production
syncPolicy:
automated:
prune: true
selfHeal: true
5. Use readOnlyRootFilesystem:
apiVersion: v1
kind: Pod
metadata:
name: api
spec:
containers:
- name: api
image: myapi:1.0
securityContext:
readOnlyRootFilesystem: true # Prevent writes to container FS
volumeMounts:
- name: tmp
mountPath: /tmp # Writable temp directory
- name: cache
mountPath: /var/cache
volumes:
- name: tmp
emptyDir: {}
- name: cache
emptyDir: {}
Mistake #8: No Disaster Recovery or Backup Strategy
The Problem
What happens:
- Accidental kubectl delete deletes production resources
- Cluster failure with no recovery plan
- etcd corruption with no backup
- Persistent volumes lost with no snapshots
Real-world example: A developer accidentally ran kubectl delete namespace production instead of kubectl delete namespace production-test. The entire production environment was deleted. Without backups, they had to recreate everything from scratch, taking 14 hours and losing some stateful data permanently.
Cost: 14 hours downtime, permanent data loss, $420K in lost revenue
Why It Happens
- “It won’t happen to us” mentality
- Backup perceived as low priority
- Complexity of distributed system backups
- Lack of DR planning and testing
The Solution
1. Use Velero for cluster-level backups:
# Install Velero
velero install \
--provider aws \
--bucket velero-backups \
--backup-location-config region=us-east-1 \
--snapshot-location-config region=us-east-1 \
--use-node-agent
# Create backup schedule
velero schedule create daily-backup \
--schedule="0 2 * * *" \
--ttl 168h \
--include-namespaces production,staging
# Backup specific application
velero backup create api-backup \
--selector app=api \
--include-namespaces production
# Restore from backup
velero restore create --from-backup daily-backup-20260109
2. Enable persistent volume snapshots:
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
name: daily-snapshots
driver: ebs.csi.aws.com
deletionPolicy: Delete
parameters:
tags: "environment=production"
---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
name: database-snapshot
namespace: production
spec:
volumeSnapshotClassName: daily-snapshots
source:
persistentVolumeClaimName: database-pvc
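Restoring is a matter of creating a new PVC that references the snapshot as its dataSource; a sketch (storage class and size are assumptions that should match the original volume):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: database-pvc-restored
  namespace: production
spec:
  storageClassName: gp3                # Assumption: same class as the original PVC
  dataSource:
    name: database-snapshot
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi                   # Must be at least the size of the snapshot's source volume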
3. Implement GitOps (infrastructure as code):
# All Kubernetes manifests in Git
myrepo/
├── base/
│ ├── deployments/
│ ├── services/
│ └── kustomization.yaml
├── overlays/
│ ├── production/
│ └── staging/
└── README.md
# Disaster recovery = recreate from Git
kubectl apply -k overlays/production/
4. Test disaster recovery procedures:
# Quarterly DR drill
# 1. Backup production
velero backup create dr-test
# 2. Create test cluster
eksctl create cluster --name dr-test
# 3. Restore backup
velero restore create --from-backup dr-test
# 4. Verify applications
kubectl get pods --all-namespaces
curl https://test-api.example.com/health
# 5. Document findings and update DR runbook
5. Protect against accidental deletion:
The most effective guard is RBAC that simply never grants the delete verb to day-to-day roles, as shown below. An admission controller such as OPA Gatekeeper or Kyverno can additionally block deletion of protected namespaces, and the Velero backups above remain your safety net when prevention fails.
# RBAC to limit delete permissions
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: developer
namespace: production
rules:
- apiGroups: ["", "apps"]
resources: ["pods", "deployments", "services"]
verbs: ["get", "list", "watch", "create", "update", "patch"]
# Notice: "delete" is NOT included
Mistake #9: Over-Engineering from Day One
The Problem
What happens:
- Complex service mesh before understanding requirements
- Multi-cluster federation for single application
- Operators and CRDs for simple workloads
- Team overwhelmed by tooling complexity
- Long time-to-production
Real-world example: A startup with 3 engineers and 5 microservices implemented Istio service mesh, ArgoCD, Crossplane, Cilium, OPA Gatekeeper, and Vault on day one. They spent 8 weeks on infrastructure with zero application progress. When an engineer left, no one understood the entire system.
Cost: 8 weeks of development time, technical debt, team frustration
Why It Happens
- Resume-driven development
- “Best practices” without context
- Fear of future scaling problems
- Cargo culting from large companies
The Solution
1. Start simple, add complexity when needed:
Week 1-2: Basic Kubernetes
# Just Deployments, Services, Ingress
apiVersion: apps/v1
kind: Deployment
metadata:
name: api
spec:
replicas: 3
template:
spec:
containers:
- name: api
image: myapi:1.0
---
apiVersion: v1
kind: Service
metadata:
name: api
spec:
selector:
app: api
ports:
- port: 80
targetPort: 8080
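And a plain Ingress in front of the Service; a sketch in which the hostname and ingress class are placeholders:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api
spec:
  ingressClassName: nginx              # Placeholder; use whichever controller you run
  rules:
  - host: api.example.com              # Placeholder hostname
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: api
            port:
              number: 80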
Month 2: Add observability
- Prometheus + Grafana
- Basic alerts
Month 3: Add autoscaling
- HPA for pods
- Cluster Autoscaler
Month 4-6: Add based on actual needs
- Service mesh (if you have 50+ microservices)
- GitOps (when manual deploys become bottleneck)
- Advanced security (when compliance requires it)
2. Use managed services:
# ❌ Self-hosting everything
- Self-managed Prometheus
- Self-managed Grafana
- Self-managed Vault
- Self-managed ArgoCD
# ✅ Use managed when available
- AWS CloudWatch (or AWS Managed Prometheus)
- AWS Managed Grafana
- AWS Secrets Manager
- Simple kubectl apply (initially)
3. Avoid premature optimization:
# ❌ Over-engineered for 3 services
- Service mesh with mTLS
- Multi-cluster federation
- Custom operators for everything
- Complex Crossplane compositions
# ✅ Simple and maintainable
- Plain Kubernetes networking
- Single cluster
- Standard Kubernetes resources
- Cloud provider integrations
4. Decision framework:
| Tool | Use When… | Avoid If… |
|---|---|---|
| Service Mesh | 50+ microservices, complex traffic patterns | < 20 services, simple architecture |
| GitOps | 5+ engineers, frequent deploys | 1-2 engineers, infrequent changes |
| Crossplane | Multi-cloud, complex infrastructure | Single cloud, simple needs |
| Operators | Managing stateful systems at scale | Stateless apps, simple deployments |
Mistake #10: Neglecting Observability and Monitoring
The Problem
What happens:
- Issues discovered by users, not monitoring
- No visibility into application performance
- Debugging requires kubectl exec and guessing
- Mean time to resolution (MTTR) measured in hours
- No capacity planning data
Real-world example: A platform experienced gradual performance degradation over 3 days as database connections slowly leaked. Without proper monitoring, they only discovered the issue when the system completely failed during peak traffic. Post-mortem showed clear metrics trending negatively for days—but no one was watching.
Cost: 6 hours complete outage, $280K lost revenue
Why It Happens
- “We’ll add monitoring later” (never happens)
- Perceived as operational overhead
- Developers unfamiliar with observability tools
- No ownership of production health
The Solution
1. Implement the three pillars of observability:
Metrics (Prometheus):
# ServiceMonitor for Prometheus Operator
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: api-metrics
namespace: production
spec:
selector:
matchLabels:
app: api
endpoints:
- port: metrics
interval: 30s
path: /metrics
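The ServiceMonitor above selects a Service (not pods) and expects a port literally named metrics; a companion Service sketch (port numbers assumed):

apiVersion: v1
kind: Service
metadata:
  name: api
  namespace: production
  labels:
    app: api                 # Matched by the ServiceMonitor selector
spec:
  selector:
    app: api
  ports:
  - name: metrics            # Name referenced by the ServiceMonitor endpoint
    port: 9090
    targetPort: 9090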
Logs (Loki/ELK):
# Fluent Bit config for log aggregation
apiVersion: v1
kind: ConfigMap
metadata:
name: fluent-bit-config
namespace: logging
data:
fluent-bit.conf: |
[INPUT]
Name tail
Path /var/log/containers/*.log
Parser docker
Tag kube.*
[OUTPUT]
Name loki
Match kube.*
Host loki.logging.svc
Port 3100
Traces (Jaeger/Tempo):
# OpenTelemetry Collector
apiVersion: v1
kind: ConfigMap
metadata:
name: otel-collector-config
data:
config.yaml: |
receivers:
otlp:
protocols:
grpc:
http:
processors:
batch:
exporters:
jaeger:
endpoint: jaeger.tracing.svc:14250
service:
pipelines:
traces:
receivers: [otlp]
processors: [batch]
exporters: [jaeger]
2. Define and monitor SLOs:
# Example SLO: 99.9% of requests < 500ms latency
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: api-slos
spec:
groups:
- name: api-latency-slo
interval: 30s
rules:
- record: api:latency:slo_ratio
expr: |
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
/
sum(rate(http_request_duration_seconds_count[5m]))
- alert: SLOBudgetBurn
expr: api:latency:slo_ratio < 0.999
for: 5m
annotations:
summary: "API latency SLO burning ({{ $value }})"
3. Create actionable alerts:
# Good alert (actionable)
- alert: HighErrorRate
expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
for: 5m
labels:
severity: critical
annotations:
summary: "High error rate on {{ $labels.service }}"
description: "{{ $labels.service }} error rate is {{ $value | humanizePercentage }}"
runbook: "https://wiki.example.com/runbooks/high-error-rate"
# Bad alert (noisy, not actionable)
- alert: PodRestarted
expr: rate(kube_pod_container_status_restarts_total[5m]) > 0
# Fires constantly, no context, unclear action
4. Build comprehensive dashboards:
{
"dashboard": {
"title": "API Service Health",
"panels": [
{
"title": "Request Rate",
"targets": [
{
"expr": "sum(rate(http_requests_total[5m])) by (status)"
}
]
},
{
"title": "Latency (p50, p95, p99)",
"targets": [
{
"expr": "histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))"
},
{
"expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))"
},
{
"expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))"
}
]
},
{
"title": "Error Rate",
"targets": [
{
"expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m]))"
}
]
},
{
"title": "Resource Usage (CPU, Memory)",
"targets": [
{
"expr": "sum(rate(container_cpu_usage_seconds_total{pod=~\"api-.*\"}[5m]))"
},
{
"expr": "sum(container_memory_working_set_bytes{pod=~\"api-.*\"})"
}
]
}
]
}
}
Related: Our Kubernetes consulting services include observability implementation.
Bonus Mistake: Not Learning from Incidents
The Problem
Issues repeat because no post-mortems are conducted or lessons aren’t applied.
The Solution
1. Conduct blameless post-mortems:
# Incident Post-Mortem Template
## Incident Summary
- Date: 2026-01-09
- Duration: 45 minutes
- Impact: 15% error rate on API
- Severity: P1
## Timeline
- 14:32: Alert fired for high error rate
- 14:35: Engineer acknowledged alert
- 14:40: Root cause identified (database connection pool exhausted)
- 15:05: Increased connection pool size
- 15:17: Service recovered
## Root Cause
Database connection pool configured for 20 max connections, insufficient for current traffic (150 req/s).
## Contributing Factors
1. No load testing before production deployment
2. Connection pool size not monitored
3. No alerts on connection pool saturation
## Action Items
- [ ] Add monitoring for connection pool utilization (Owner: DevOps, Due: 2026-01-15)
- [ ] Implement load testing in CI/CD (Owner: Platform, Due: 2026-01-22)
- [ ] Review connection pool configs across all services (Owner: Backend, Due: 2026-01-18)
- [ ] Add runbook for connection pool issues (Owner: SRE, Due: 2026-01-12)
## What Went Well
- Fast alert response time (3 minutes)
- Clear monitoring identified issue quickly
- Good team communication
## What Could Improve
- Earlier load testing would have caught this
- Connection pool metrics should have existed
2. Track incident trends:
# Incidents per week
count(incident_created_timestamp) by (severity)
# Mean time to resolution
avg(incident_resolved_timestamp - incident_created_timestamp) by (severity)
Conclusion
Kubernetes mistakes are common and often expensive. The good news: they’re avoidable with proper planning, automation, and operational discipline.
Key takeaways:
- Set resource requests and limits - Prevent resource exhaustion
- Never run as root - Follow least privilege principle
- Implement network policies - Default deny, explicit allow
- Use Pod Disruption Budgets - Maintain availability during disruptions
- Never commit secrets to Git - Use external secrets management
- Implement health checks - Enable self-healing
- Embrace immutability - Infrastructure as code, not manual changes
- Plan for disasters - Backup, test recovery procedures
- Start simple - Add complexity when needed
- Invest in observability - Monitor, alert, learn from incidents
Need help avoiding these mistakes in your Kubernetes journey? Tasrie IT Services provides Kubernetes consulting with a focus on production readiness, security, and operational excellence. Our team has helped dozens of organizations successfully adopt Kubernetes without the common pitfalls.
Schedule a consultation to discuss your Kubernetes implementation and get expert guidance.
Related Resources
- Kubernetes Consulting Services
- Kubernetes Security Best Practices
- Kubernetes Cost Optimization
- AWS EKS Consulting
- Azure AKS Consulting
- Google GKE Consulting