Engineering

10 Critical Kubernetes Mistakes to Avoid in 2026 (And How to Fix Them)

Tasrie IT Services

Kubernetes is powerful but complex. After implementing Kubernetes clusters for dozens of organizations, we’ve seen the same mistakes repeated across companies of all sizes. These mistakes lead to outages, security breaches, cost overruns, and frustrated engineering teams.

This guide shares the 10 most critical Kubernetes mistakes we’ve encountered, explains why they happen, and provides actionable solutions based on real-world experience. Learn from others’ mistakes to avoid repeating them.

Table of Contents

  1. Not Setting Resource Requests and Limits
  2. Running Everything as Root
  3. No Network Policies (Default Allow-All)
  4. Ignoring Pod Disruption Budgets
  5. Storing Secrets in Git
  6. Not Implementing Health Checks
  7. Treating Kubernetes Like VMs
  8. No Disaster Recovery or Backup Strategy
  9. Over-Engineering from Day One
  10. Neglecting Observability and Monitoring

Mistake #1: Not Setting Resource Requests and Limits

The Problem

What happens:

  • Pods scheduled on nodes without considering actual resource needs
  • Single pod consuming entire node’s resources
  • Other pods starved of CPU/memory, causing cascading failures
  • Unpredictable costs from over-provisioning
  • Cluster autoscaler making poor decisions

Real-world example: A SaaS company experienced a production outage when a memory leak in one application consumed all available memory on a node, causing the kubelet to crash and all pods on that node to fail. The issue cascaded as traffic shifted to remaining nodes, overwhelming them as well.

Cost: 4.5 hours downtime, $180K in lost revenue

Why It Happens

  • Developers unfamiliar with resource management
  • “It works in development” mentality
  • Avoiding the effort of profiling actual usage
  • Fear of setting limits that might restrict performance

The Solution

Set appropriate resource requests and limits:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  template:
    spec:
      containers:
      - name: api
        image: myapi:1.0
        resources:
          requests:
            cpu: "200m"      # Minimum guaranteed CPU
            memory: "256Mi"  # Minimum guaranteed memory
          limits:
            cpu: "500m"      # Maximum CPU (throttled beyond this)
            memory: "512Mi"  # Maximum memory (OOMKilled beyond this)

How to determine appropriate values:

  1. Start with VPA recommendations:
kubectl apply -f - <<EOF
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  updatePolicy:
    updateMode: "Off"  # Recommendation only
EOF

# Get recommendations after a few days
kubectl describe vpa api-vpa
  2. Monitor actual usage with Prometheus:
# 95th percentile memory usage over 7 days
quantile_over_time(0.95, container_memory_working_set_bytes{container="api"}[7d])

# 95th percentile CPU usage over 7 days
quantile_over_time(0.95, rate(container_cpu_usage_seconds_total{container="api"}[5m])[7d:5m])
  3. Set requests to p95, limits to 2x requests (a worked example follows the LimitRange below)
  4. Use LimitRanges to enforce defaults:
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: production
spec:
  limits:
  - default:  # Default limits if not specified
      cpu: "500m"
      memory: "512Mi"
    defaultRequest:  # Default requests if not specified
      cpu: "100m"
      memory: "128Mi"
    max:  # Maximum allowed
      cpu: "4000m"
      memory: "8Gi"
    min:  # Minimum required
      cpu: "50m"
      memory: "64Mi"
    type: Container
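
As a worked example, suppose Prometheus shows the api container peaking around 300Mi of memory and 150m of CPU at the 95th percentile. Following the rule above (requests at p95, limits at roughly 2x requests), the container's resources block would look something like this (the numbers are illustrative):

resources:
  requests:
    cpu: "150m"      # ~p95 CPU usage
    memory: "300Mi"  # ~p95 memory usage
  limits:
    cpu: "300m"      # ~2x the CPU request
    memory: "600Mi"  # ~2x the memory request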

Related: Our Kubernetes cost optimization guide covers right-sizing in detail.

Mistake #2: Running Everything as Root

The Problem

What happens:

  • Containers run with UID 0 (root)
  • Full access to node filesystem if container escapes
  • Privilege escalation attacks possible
  • Compliance failures (PCI DSS, HIPAA)
  • Audit findings and security violations

Real-world example: A financial services company failed a PCI DSS audit because 80% of their containers ran as root. The auditor identified this as a critical finding, blocking their ability to process credit card payments until remediated. Fixing required 6 weeks of work across 100+ microservices.

Cost: $450K in remediation effort, delayed product launch

Why It Happens

  • Default Dockerfile doesn’t create non-root user
  • “It just works” with root (file permissions, port binding)
  • Developers unaware of security implications
  • Legacy applications expecting root access

The Solution

1. Update Dockerfiles to create and use non-root user:

FROM node:18-alpine

# Create app user
RUN addgroup -g 1001 app && \
    adduser -u 1001 -G app -s /bin/sh -D app

# Set working directory and ownership
WORKDIR /app
COPY --chown=app:app . .

# Install dependencies
RUN npm ci --only=production

# Switch to non-root user
USER app

EXPOSE 3000
CMD ["node", "server.js"]

2. Enforce non-root in Kubernetes:

apiVersion: v1
kind: Pod
metadata:
  name: secure-app
spec:
  securityContext:
    runAsNonRoot: true  # Reject if user is root
    runAsUser: 1001     # Explicit UID
    fsGroup: 2000       # Volume ownership
    seccompProfile:
      type: RuntimeDefault
  containers:
  - name: app
    image: myapp:1.0
    securityContext:
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      capabilities:
        drop:
        - ALL
        add:
        - NET_BIND_SERVICE  # Only if binding to ports < 1024

3. Use Pod Security Standards to enforce:

apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
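
Before turning enforcement on, it helps to preview which existing pods would violate the restricted profile. A server-side dry run of the label change reports the would-be violations without changing anything:

# Preview violations without enforcing yet
kubectl label --dry-run=server --overwrite namespace production \
  pod-security.kubernetes.io/enforce=restricted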

4. Audit existing workloads:

# Find pods running as root
kubectl get pods --all-namespaces -o json | \
  jq -r '.items[] |
  select(.spec.securityContext.runAsUser == 0 or
         (.spec.securityContext.runAsUser == null and
          .spec.containers[].securityContext.runAsUser == null)) |
  "\(.metadata.namespace)/\(.metadata.name)"'

Related: Our Kubernetes security best practices guide covers this in depth.

Mistake #3: No Network Policies (Default Allow-All)

The Problem

What happens:

  • All pods can communicate with all other pods
  • Compromised pod can access entire cluster
  • Lateral movement for attackers
  • No segmentation or isolation
  • Compliance failures

Real-world example: An e-commerce platform had a vulnerability in their public-facing web application. Attackers compromised a web pod and used it to directly access the internal database pod, exfiltrating customer payment information. The breach went undetected for 3 weeks.

Cost: $2.3M in fines and legal settlements, severe reputation damage

Why It Happens

  • Default Kubernetes networking is allow-all
  • “We’ll add network policies later” (never happens)
  • Complexity of defining policies
  • Fear of breaking applications

The Solution

1. Start with default deny-all policy:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}  # Applies to all pods
  policyTypes:
  - Ingress
  - Egress

2. Allow specific traffic patterns:

---
# Allow web tier to access API tier
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: web-to-api
  namespace: production
spec:
  podSelector:
    matchLabels:
      tier: api
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          tier: web
    ports:
    - protocol: TCP
      port: 8080

---
# Allow API tier to access database
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-to-database
  namespace: production
spec:
  podSelector:
    matchLabels:
      tier: database
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          tier: api
    ports:
    - protocol: TCP
      port: 5432

---
# Allow egress for DNS and external APIs
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-egress
  namespace: production
spec:
  podSelector:
    matchLabels:
      tier: api
  policyTypes:
  - Egress
  egress:
  - to:  # Database
    - podSelector:
        matchLabels:
          tier: database
    ports:
    - protocol: TCP
      port: 5432
  - to:  # DNS
    - namespaceSelector:
        matchLabels:
          name: kube-system
    - podSelector:
        matchLabels:
          k8s-app: kube-dns
    ports:
    - protocol: UDP
      port: 53
  - to:  # External HTTPS
    - namespaceSelector: {}
    ports:
    - protocol: TCP
      port: 443

3. Test network policies before enforcement:

# Test connectivity from web pod to api pod
kubectl exec -n production web-pod -- curl -v http://api-service:8080/health

# Test connectivity from web pod to database (should fail)
kubectl exec -n production web-pod -- nc -zv database-service 5432

4. Audit network policies:

# Check which namespaces have network policies
kubectl get networkpolicies --all-namespaces

# Visualize policies (using kubectl-netpol plugin)
kubectl netpol viz -n production

5. Use Cilium for advanced L7 policies:

apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: api-l7-policy
  namespace: production
spec:
  endpointSelector:
    matchLabels:
      tier: api
  ingress:
  - fromEndpoints:
    - matchLabels:
        tier: web
    toPorts:
    - ports:
      - port: "8080"
        protocol: TCP
      rules:
        http:
        - method: "GET"
          path: "/api/v1/.*"
        - method: "POST"
          path: "/api/v1/orders"

Mistake #4: Ignoring Pod Disruption Budgets

The Problem

What happens:

  • Node drains during maintenance kill all replicas simultaneously
  • Cluster autoscaler scales down nodes with critical pods
  • Upgrades cause application downtime
  • Zero availability during voluntary disruptions

Real-world example: A SaaS company performed routine node upgrades during business hours. The upgrade drained nodes without respecting application availability, taking down all replicas of their authentication service simultaneously. Users couldn’t log in for 15 minutes during peak usage.

Cost: 15 minutes downtime, 12,000 affected users, reputation damage

Why It Happens

  • Developers unaware that Pod Disruption Budgets (PDBs) exist
  • Assuming Kubernetes “just handles” high availability
  • No testing of maintenance procedures
  • Lack of operational documentation

The Solution

1. Create Pod Disruption Budgets for critical services:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
  namespace: production
spec:
  minAvailable: 2  # Always keep at least 2 pods running
  selector:
    matchLabels:
      app: api
      tier: api

Or use percentage:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
  namespace: production
spec:
  maxUnavailable: "30%"  # Allow maximum 30% unavailable
  selector:
    matchLabels:
      app: api

2. Ensure sufficient replicas:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 5  # With minAvailable: 2, allows draining 3 at once
  template:
    metadata:
      labels:
        app: api
        tier: api

3. Test node drains:

# Drain node safely (respects PDBs)
kubectl drain node-1 --ignore-daemonsets --delete-emptydir-data

# Check PDB status
kubectl get pdb -n production
kubectl describe pdb api-pdb -n production

# If PDB blocks drain, check why
kubectl get pods -n production -l app=api -o wide
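
Once maintenance on the node is finished, return it to the scheduling pool:

# Mark the node schedulable again after maintenance
kubectl uncordon node-1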

4. Set PDBs for all stateful workloads:

---
# Database PDB
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: database-pdb
  namespace: production
spec:
  minAvailable: 2  # Always keep 2 replicas (master + 1 standby)
  selector:
    matchLabels:
      app: postgresql

---
# Cache PDB
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: redis-pdb
  namespace: production
spec:
  maxUnavailable: 1  # Allow only 1 Redis pod down at a time
  selector:
    matchLabels:
      app: redis

5. Anti-affinity for replica distribution:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 5
  template:
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: api
              topologyKey: kubernetes.io/hostname  # Spread across nodes

Mistake #5: Storing Secrets in Git

The Problem

What happens:

  • Database passwords, API keys in Git repository
  • Secrets exposed in commit history forever
  • Public repository leak = immediate breach
  • Compliance violations (SOC 2, ISO 27001)

Real-world example: A startup accidentally pushed AWS credentials to a public GitHub repository. Within 12 minutes, attackers discovered the credentials and spun up $50,000 worth of EC2 instances for cryptocurrency mining. GitHub’s automated scanning alerted them, but damage was done.

Cost: $50,000 in fraudulent charges, security incident response, credential rotation

Why It Happens

  • Convenience (“it’s just a private repo”)
  • Lack of secrets management solution
  • Unclear boundaries between config and secrets
  • Accidental commits (.env files)

The Solution

1. Use external secrets management:

Option A: HashiCorp Vault

apiVersion: v1
kind: Pod
metadata:
  name: app
  annotations:
    vault.hashicorp.com/agent-inject: "true"
    vault.hashicorp.com/role: "app-role"
    vault.hashicorp.com/agent-inject-secret-database: "database/creds/app"
spec:
  serviceAccountName: app
  containers:
  - name: app
    image: myapp:1.0
    env:
    - name: DB_CREDS_PATH
      value: "/vault/secrets/database"  # Vault injects the secret as a file; the app reads it from this path

Option B: External Secrets Operator

apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: aws-secrets-manager
  namespace: production
spec:
  provider:
    aws:
      service: SecretsManager
      region: us-east-1
      auth:
        jwt:
          serviceAccountRef:
            name: app

---
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: database-credentials
  namespace: production
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager
    kind: SecretStore
  target:
    name: database-secret
    creationPolicy: Owner
  data:
  - secretKey: password
    remoteRef:
      key: production/database
      property: password

2. Use Sealed Secrets for GitOps:

# Install Sealed Secrets controller
kubectl apply -f https://github.com/bitnami-labs/sealed-secrets/releases/download/v0.24.0/controller.yaml

# Create sealed secret (safe to commit)
kubectl create secret generic mysecret \
  --from-literal=password=supersecret \
  --dry-run=client -o yaml | \
  kubeseal -o yaml > sealed-secret.yaml

# Commit sealed-secret.yaml to Git safely
git add sealed-secret.yaml
git commit -m "Add database credentials"

3. Scan repositories for secrets:

# Install Gitleaks
brew install gitleaks

# Scan repository
gitleaks detect --source . --verbose

# Pre-commit hook
cat > .git/hooks/pre-commit << 'EOF'
#!/bin/bash
gitleaks protect --staged --verbose
EOF
chmod +x .git/hooks/pre-commit
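
Alongside scanning, a .gitignore entry for common secret-bearing files is a cheap first line of defense (illustrative patterns; adjust for your repositories):

# .gitignore — keep local secrets and keys out of commits
.env
.env.*
*.pem
*.key
kubeconfig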

4. Rotate compromised secrets immediately:

# If secrets leaked:
# 1. Rotate credentials immediately
# 2. Update secret in secrets manager
# 3. Restart pods to pick up new secret
kubectl rollout restart deployment/app -n production

# 4. Rewrite Git history (if exposed in commits)
#    (git filter-repo is the modern replacement for the deprecated filter-branch)
git filter-branch --force --index-filter \
  "git rm --cached --ignore-unmatch .env" \
  --prune-empty --tag-name-filter cat -- --all

Mistake #6: Not Implementing Health Checks

The Problem

What happens:

  • Kubernetes sends traffic to pods that aren’t ready
  • Pods in crash loop restart appear “healthy” briefly
  • Application deadlocks undetected
  • Users experience errors despite “healthy” cluster

Real-world example: An API service had a subtle database connection pool exhaustion issue. The pod remained running, but all API requests timed out. Without proper health checks, Kubernetes continued routing traffic to the broken pod for 8 minutes until manually restarted.

Cost: 8 minutes of 50% error rate, customer complaints

Why It Happens

  • “The pod is running, so it must be working”
  • Lack of understanding of liveness vs. readiness probes
  • Difficulty implementing health check logic
  • Fear that health checks will cause false positives

The Solution

1. Implement both liveness and readiness probes:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  template:
    spec:
      containers:
      - name: api
        image: myapi:1.0
        ports:
        - containerPort: 8080

        # Liveness: Is the application alive?
        # Failure = restart the container
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 30  # Wait for app to start
          periodSeconds: 10         # Check every 10 seconds
          timeoutSeconds: 5
          failureThreshold: 3       # Restart after 3 failures

        # Readiness: Is the application ready for traffic?
        # Failure = remove from service endpoints
        readinessProbe:
          httpGet:
            path: /readyz
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
          timeoutSeconds: 3
          failureThreshold: 2       # Remove after 2 failures

        # Startup: Special probe for slow-starting apps
        startupProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 0
          periodSeconds: 10
          failureThreshold: 30      # Allow 300 seconds to start

2. Implement comprehensive health check endpoints:

package main

import (
    "context"
    "database/sql"
    "fmt"
    "net/http"
    "time"
)

// db is the application's database handle, initialized at startup.
var db *sql.DB

// /healthz endpoint (liveness)
func healthzHandler(w http.ResponseWriter, r *http.Request) {
    // Basic health check - is the app alive?
    w.WriteHeader(http.StatusOK)
    w.Write([]byte("OK"))
}

// /readyz endpoint (readiness)
func readyzHandler(w http.ResponseWriter, r *http.Request) {
    // Check all dependencies
    checks := []struct {
        name string
        fn   func() error
    }{
        {"database", checkDatabase},
        {"cache", checkCache},
        {"message_queue", checkMessageQueue},
    }

    for _, check := range checks {
        if err := check.fn(); err != nil {
            w.WriteHeader(http.StatusServiceUnavailable)
            fmt.Fprintf(w, "NOT READY: %s failed: %v\n", check.name, err)
            return
        }
    }

    w.WriteHeader(http.StatusOK)
    w.Write([]byte("READY"))
}

func checkDatabase() error {
    ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
    defer cancel()
    return db.PingContext(ctx)
}

3. Use exec probes for non-HTTP applications:

livenessProbe:
  exec:
    command:
    - /bin/grpc_health_probe
    - -addr=:9090
  initialDelaySeconds: 30
  periodSeconds: 10

4. Monitor probe failures:

# Alert on frequent container restarts (often a symptom of failing liveness probes)
rate(kube_pod_container_status_restarts_total[5m]) > 0.1
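
Wrapped as a Prometheus alerting rule in the style used later in this guide, that check might look like the following sketch (the threshold and labels are illustrative):

- alert: ContainerRestartingFrequently
  expr: rate(kube_pod_container_status_restarts_total[5m]) > 0.1
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "{{ $labels.namespace }}/{{ $labels.pod }} is restarting frequently"
    description: "Check liveness/readiness probe results and recent container logs."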

Mistake #7: Treating Kubernetes Like VMs

The Problem

What happens:

  • SSH into containers to debug
  • Manual configuration changes inside running containers
  • Stateful data stored in container filesystem
  • Long-lived pods treated like long-lived VMs
  • Expecting containers to survive restarts

Real-world example: An operations team was manually SSH-ing into containers to apply configuration changes and hotfixes. When a node failed, all their manual changes were lost, and they couldn’t reproduce the production configuration. It took 6 hours to rebuild the correct state.

Cost: 6 hours outage, loss of configuration state

Why It Happens

  • VM mindset from previous infrastructure
  • “It’s faster to just SSH in and fix it”
  • Lack of understanding of the Kubernetes declarative model
  • Insufficient automation and GitOps

The Solution

1. Embrace immutable infrastructure:

# ❌ Bad: Logging into container and making changes
kubectl exec -it api-pod -- /bin/bash
# (makes manual changes inside container)

# ✅ Good: Update deployment manifest, redeploy
# Edit deployment.yaml
kubectl apply -f deployment.yaml
# Or use GitOps (ArgoCD, Flux)

2. Use ConfigMaps and Secrets for configuration:

# Configuration as code
apiVersion: v1
kind: ConfigMap
metadata:
  name: api-config
data:
  app.conf: |
    log_level: info
    max_connections: 100
    cache_ttl: 300

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  template:
    spec:
      containers:
      - name: api
        image: myapi:1.0
        volumeMounts:
        - name: config
          mountPath: /etc/app/
      volumes:
      - name: config
        configMap:
          name: api-config
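
Mounted ConfigMaps are refreshed on the node eventually, but most applications read configuration only at startup. Assuming that is the case here, a common pattern is to roll the Deployment after changing the ConfigMap (api-config.yaml is whatever file holds the ConfigMap above):

# Apply the updated ConfigMap, then restart pods so they re-read config
kubectl apply -f api-config.yaml
kubectl rollout restart deployment/api -n production
kubectl rollout status deployment/api -n production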

3. Debug without exec:

# View logs instead of SSH
kubectl logs api-pod -n production --tail=100 -f

# Port-forward for debugging
kubectl port-forward api-pod 8080:8080

# Copy files out for analysis
kubectl cp production/api-pod:/tmp/debug.log ./debug.log

# Use ephemeral debug containers (Kubernetes 1.23+)
kubectl debug api-pod -it --image=busybox --target=api

4. Use GitOps for declarative management:

# ArgoCD Application
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: api
spec:
  source:
    repoURL: https://github.com/myorg/k8s-manifests
    path: production/api
    targetRevision: main
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true
      selfHeal: true

5. Use ReadOnlyRootFilesystem:

apiVersion: v1
kind: Pod
metadata:
  name: api
spec:
  containers:
  - name: api
    image: myapi:1.0
    securityContext:
      readOnlyRootFilesystem: true  # Prevent writes to container FS
    volumeMounts:
    - name: tmp
      mountPath: /tmp  # Writable temp directory
    - name: cache
      mountPath: /var/cache
  volumes:
  - name: tmp
    emptyDir: {}
  - name: cache
    emptyDir: {}

Mistake #8: No Disaster Recovery or Backup Strategy

The Problem

What happens:

  • Accidental kubectl delete deletes production resources
  • Cluster failure with no recovery plan
  • etcd corruption with no backup
  • Persistent volumes lost with no snapshots

Real-world example: A developer accidentally ran kubectl delete namespace production instead of kubectl delete namespace production-test. The entire production environment was deleted. Without backups, they had to recreate everything from scratch, taking 14 hours and losing some stateful data permanently.

Cost: 14 hours downtime, permanent data loss, $420K in lost revenue

Why It Happens

  • “It won’t happen to us” mentality
  • Backup perceived as low priority
  • Complexity of distributed system backups
  • Lack of DR planning and testing

The Solution

1. Use Velero for cluster-level backups:

# Install Velero
velero install \
  --provider aws \
  --bucket velero-backups \
  --backup-location-config region=us-east-1 \
  --snapshot-location-config region=us-east-1 \
  --use-node-agent

# Create backup schedule
velero schedule create daily-backup \
  --schedule="0 2 * * *" \
  --ttl 168h \
  --include-namespaces production,staging

# Backup specific application
velero backup create api-backup \
  --selector app=api \
  --include-namespaces production

# Restore from backup
velero restore create --from-backup daily-backup-20260109

2. Enable persistent volume snapshots:

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: daily-snapshots
driver: ebs.csi.aws.com
deletionPolicy: Delete
parameters:
  tags: "environment=production"

---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: database-snapshot
  namespace: production
spec:
  volumeSnapshotClassName: daily-snapshots
  source:
    persistentVolumeClaimName: database-pvc

3. Implement GitOps (infrastructure as code):

# All Kubernetes manifests in Git
myrepo/
├── base/
│   ├── deployments/
│   ├── services/
│   └── kustomization.yaml
├── overlays/
│   ├── production/
│   └── staging/
└── README.md

# Disaster recovery = recreate from Git
kubectl apply -k overlays/production/
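
For that layout, a minimal overlays/production/kustomization.yaml might look like the following sketch (the image name mirrors the earlier examples):

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: production
resources:
- ../../base        # pull in the shared base manifests
images:
- name: myapi
  newTag: "1.0"     # pin the production image tag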

4. Test disaster recovery procedures:

# Quarterly DR drill
# 1. Backup production
velero backup create dr-test

# 2. Create test cluster
eksctl create cluster --name dr-test

# 3. Restore backup
velero restore create --from-backup dr-test

# 4. Verify applications
kubectl get pods --all-namespaces
curl https://test-api.example.com/health

# 5. Document findings and update DR runbook

5. Protect against accidental deletion:

# ResourceQuota to cap how many PersistentVolumeClaims the namespace can hold
apiVersion: v1
kind: ResourceQuota
metadata:
  name: namespace-protection
  namespace: production
spec:
  hard:
    persistentvolumeclaims: "100"

---
# RBAC to limit delete permissions
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: developer
  namespace: production
rules:
- apiGroups: ["", "apps"]
  resources: ["pods", "deployments", "services"]
  verbs: ["get", "list", "watch", "create", "update", "patch"]
  # Notice: "delete" is NOT included
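
A Role has no effect until it is bound to someone. A matching RoleBinding might look like this sketch (the group name is an assumption; use whatever your identity provider exposes):

apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: developer-binding
  namespace: production
subjects:
- kind: Group
  name: developers   # assumed group name from your identity provider
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: developer
  apiGroup: rbac.authorization.k8s.io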

Mistake #9: Over-Engineering from Day One

The Problem

What happens:

  • Complex service mesh before understanding requirements
  • Multi-cluster federation for single application
  • Operators and CRDs for simple workloads
  • Team overwhelmed by tooling complexity
  • Long time-to-production

Real-world example: A startup with 3 engineers and 5 microservices implemented Istio service mesh, ArgoCD, Crossplane, Cilium, OPA Gatekeeper, and Vault on day one. They spent 8 weeks on infrastructure with zero application progress. When an engineer left, no one understood the entire system.

Cost: 8 weeks of development time, technical debt, team frustration

Why It Happens

  • Resume-driven development
  • “Best practices” without context
  • Fear of future scaling problems
  • Cargo culting from large companies

The Solution

1. Start simple, add complexity when needed:

Week 1-2: Basic Kubernetes

# Just Deployments, Services, Ingress
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
      - name: api
        image: myapi:1.0
---
apiVersion: v1
kind: Service
metadata:
  name: api
spec:
  selector:
    app: api
  ports:
  - port: 80
    targetPort: 8080
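
The comment above also mentions Ingress; a minimal Ingress to round out the Week 1-2 setup might look like this sketch (the hostname and ingress class are placeholders and assume an ingress controller is already installed):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api
spec:
  ingressClassName: nginx      # assumes an NGINX ingress controller
  rules:
  - host: api.example.com      # placeholder hostname
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: api
            port:
              number: 80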

Month 2: Add observability

  • Prometheus + Grafana
  • Basic alerts

Month 3: Add autoscaling

  • HPA for pods
  • Cluster Autoscaler
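
For the autoscaling step, a minimal HorizontalPodAutoscaler targeting the api Deployment might look like the following sketch (the utilization target is illustrative):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # requires CPU requests to be set (see Mistake #1)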

Month 4-6: Add based on actual needs

  • Service mesh (if you have 50+ microservices)
  • GitOps (when manual deploys become bottleneck)
  • Advanced security (when compliance requires it)

2. Use managed services:

# ❌ Self-hosting everything
- Self-managed Prometheus
- Self-managed Grafana
- Self-managed Vault
- Self-managed ArgoCD

# ✅ Use managed when available
- AWS CloudWatch (or AWS Managed Prometheus)
- AWS Managed Grafana
- AWS Secrets Manager
- Simple kubectl apply (initially)

3. Avoid premature optimization:

# ❌ Over-engineered for 3 services
- Service mesh with mTLS
- Multi-cluster federation
- Custom operators for everything
- Complex Crossplane compositions

# ✅ Simple and maintainable
- Plain Kubernetes networking
- Single cluster
- Standard Kubernetes resources
- Cloud provider integrations

4. Decision framework:

| Tool         | Use When…                                   | Avoid If…                          |
| ------------ | ------------------------------------------- | ---------------------------------- |
| Service Mesh | 50+ microservices, complex traffic patterns | < 20 services, simple architecture |
| GitOps       | 5+ engineers, frequent deploys              | 1-2 engineers, infrequent changes  |
| Crossplane   | Multi-cloud, complex infrastructure         | Single cloud, simple needs         |
| Operators    | Managing stateful systems at scale          | Stateless apps, simple deployments |

Mistake #10: Neglecting Observability and Monitoring

The Problem

What happens:

  • Issues discovered by users, not monitoring
  • No visibility into application performance
  • Debugging requires kubectl exec and guessing
  • Mean time to resolution (MTTR) measured in hours
  • No capacity planning data

Real-world example: A platform experienced gradual performance degradation over 3 days as database connections slowly leaked. Without proper monitoring, they only discovered the issue when the system completely failed during peak traffic. Post-mortem showed clear metrics trending negatively for days—but no one was watching.

Cost: 6 hours complete outage, $280K lost revenue

Why It Happens

  • “We’ll add monitoring later” (never happens)
  • Perceived as operational overhead
  • Developers unfamiliar with observability tools
  • No ownership of production health

The Solution

1. Implement the three pillars of observability:

Metrics (Prometheus):

# ServiceMonitor for Prometheus Operator
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: api-metrics
  namespace: production
spec:
  selector:
    matchLabels:
      app: api
  endpoints:
  - port: metrics
    interval: 30s
    path: /metrics

Logs (Loki/ELK):

# Fluent Bit config for log aggregation
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: logging
data:
  fluent-bit.conf: |
    [INPUT]
        Name              tail
        Path              /var/log/containers/*.log
        Parser            docker
        Tag               kube.*

    [OUTPUT]
        Name   loki
        Match  kube.*
        Host   loki.logging.svc
        Port   3100

Traces (Jaeger/Tempo):

# OpenTelemetry Collector
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
data:
  config.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
          http:
    processors:
      batch:
    exporters:
      jaeger:
        endpoint: jaeger.tracing.svc:14250
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [batch]
          exporters: [jaeger]

2. Define and monitor SLOs:

# Example SLO: 99.9% of requests < 500ms latency
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: api-slos
spec:
  groups:
  - name: api-latency-slo
    interval: 30s
    rules:
    - record: api:latency:slo_ratio
      expr: |
        sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
        /
        sum(rate(http_request_duration_seconds_count[5m]))

    - alert: SLOBudgetBurn
      expr: api:latency:slo_ratio < 0.999
      for: 5m
      annotations:
        summary: "API latency SLO burning ({{ $value }})"

3. Create actionable alerts:

# Good alert (actionable)
- alert: HighErrorRate
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
      / sum(rate(http_requests_total[5m])) by (service) > 0.05
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "High error rate on {{ $labels.service }}"
    description: "{{ $labels.service }} error rate is {{ $value | humanizePercentage }}"
    runbook: "https://wiki.example.com/runbooks/high-error-rate"

# Bad alert (noisy, not actionable)
- alert: PodRestarted
  expr: rate(kube_pod_container_status_restarts_total[5m]) > 0
  # Fires constantly, no context, unclear action

4. Build comprehensive dashboards:

{
  "dashboard": {
    "title": "API Service Health",
    "panels": [
      {
        "title": "Request Rate",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total[5m])) by (status)"
          }
        ]
      },
      {
        "title": "Latency (p50, p95, p99)",
        "targets": [
          {
            "expr": "histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))"
          },
          {
            "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))"
          },
          {
            "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))"
          }
        ]
      },
      {
        "title": "Error Rate",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m]))"
          }
        ]
      },
      {
        "title": "Resource Usage (CPU, Memory)",
        "targets": [
          {
            "expr": "sum(rate(container_cpu_usage_seconds_total{pod=~\"api-.*\"}[5m]))"
          },
          {
            "expr": "sum(container_memory_working_set_bytes{pod=~\"api-.*\"})"
          }
        ]
      }
    ]
  }
}

Related: Our Kubernetes consulting services include observability implementation.

Bonus Mistake: Not Learning from Incidents

The Problem

Issues repeat because no post-mortems are conducted or lessons aren’t applied.

The Solution

1. Conduct blameless post-mortems:

# Incident Post-Mortem Template

## Incident Summary
- Date: 2026-01-09
- Duration: 45 minutes
- Impact: 15% error rate on API
- Severity: P1

## Timeline
- 14:32: Alert fired for high error rate
- 14:35: Engineer acknowledged alert
- 14:40: Root cause identified (database connection pool exhausted)
- 15:05: Increased connection pool size
- 15:17: Service recovered

## Root Cause
Database connection pool configured for 20 max connections, insufficient for current traffic (150 req/s).

## Contributing Factors
1. No load testing before production deployment
2. Connection pool size not monitored
3. No alerts on connection pool saturation

## Action Items
- [ ] Add monitoring for connection pool utilization (Owner: DevOps, Due: 2026-01-15)
- [ ] Implement load testing in CI/CD (Owner: Platform, Due: 2026-01-22)
- [ ] Review connection pool configs across all services (Owner: Backend, Due: 2026-01-18)
- [ ] Add runbook for connection pool issues (Owner: SRE, Due: 2026-01-12)

## What Went Well
- Fast alert response time (3 minutes)
- Clear monitoring identified issue quickly
- Good team communication

## What Could Improve
- Earlier load testing would have caught this
- Connection pool metrics should have existed

2. Track incident trends:

# Incidents per week
count(incident_created_timestamp) by (severity)

# Mean time to resolution
avg(incident_resolved_timestamp - incident_created_timestamp) by (severity)

Conclusion

Kubernetes mistakes are common and often expensive. The good news: they’re avoidable with proper planning, automation, and operational discipline.

Key takeaways:

  1. Set resource requests and limits - Prevent resource exhaustion
  2. Never run as root - Follow least privilege principle
  3. Implement network policies - Default deny, explicit allow
  4. Use Pod Disruption Budgets - Maintain availability during disruptions
  5. Never commit secrets to Git - Use external secrets management
  6. Implement health checks - Enable self-healing
  7. Embrace immutability - Infrastructure as code, not manual changes
  8. Plan for disasters - Backup, test recovery procedures
  9. Start simple - Add complexity when needed
  10. Invest in observability - Monitor, alert, learn from incidents

Need help avoiding these mistakes in your Kubernetes journey? Tasrie IT Services provides Kubernetes consulting with a focus on production readiness, security, and operational excellence. Our team has helped dozens of organizations successfully adopt Kubernetes without the common pitfalls.

Schedule a consultation to discuss your Kubernetes implementation and get expert guidance.
