CrashLoopBackOff on Azure Kubernetes Service (AKS) has all the same causes as on any Kubernetes cluster, plus a set of Azure-specific issues that only appear on AKS. At Tasrie IT Services, we manage dozens of AKS clusters for clients across Europe and the Middle East. Many CrashLoopBackOff issues we see are caused by Azure-specific configurations that do not exist on EKS or GKE.
This guide covers both the general CrashLoopBackOff debugging process and the Azure-specific causes you need to check when running on AKS.
Quick Diagnosis on AKS
Start with the standard Kubernetes debugging commands:
# Check pod status
kubectl get pods -n <namespace>
# Get exit code and events
kubectl describe pod <pod-name> -n <namespace>
# Check previous container logs
kubectl logs <pod-name> -n <namespace> --previous
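If you only need the last exit code, a jsonpath query pulls it straight out of the pod status (index 0 assumes a single-container pod; adjust otherwise):
# Print the last termination exit code of the first container
kubectl get pod <pod-name> -n <namespace> \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'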
If the standard debugging does not reveal the cause, move to the Azure-specific checks below.
General CrashLoopBackOff Fixes (Apply to All Clusters)
Before checking Azure-specific issues, rule out the common causes:
| Exit Code | Cause | Fix |
|---|---|---|
| 1 | Application error | Check logs for error messages, missing env vars, connection failures |
| 127 | Command not found | Fix entrypoint command or image |
| 137 | OOMKilled | Increase memory limit or fix memory leak |
| 139 | Segfault | Update base image or check architecture mismatch |
For detailed exit code debugging, see our CrashLoopBackOff error guide.
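To spot the worst offenders across a namespace before digging into individual pods, sorting by restart count is a useful first pass (again assuming single-container pods):
# List pods ordered by restart count of the first container
kubectl get pods -n <namespace> \
  --sort-by='.status.containerStatuses[0].restartCount'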
Azure-Specific Cause 1: Azure Key Vault CSI Driver Failures
Many AKS applications use the Azure Key Vault Provider for Secrets Store CSI Driver to mount secrets from Azure Key Vault. When this fails, the pod enters CrashLoopBackOff because the application cannot read its secrets.
# Check if the SecretProviderClass exists
kubectl get secretproviderclass -n <namespace>
# Check CSI driver pods
kubectl get pods -n kube-system -l app=secrets-store-csi-driver
# Check the SecretProviderClass events
kubectl describe secretproviderclass <name> -n <namespace>
# Check pod events for mount errors
kubectl describe pod <pod-name> -n <namespace> | grep -i "key vault\|csi\|secret"
Common error messages:
| Error | Cause | Fix |
|---|---|---|
| failed to get keyvault client | Identity not configured | Set up Workload Identity or managed identity |
| keyvault forbidden | Identity lacks access policy | Add Get/List permissions in Key Vault access policies |
| secret not found | Secret name wrong or does not exist | Verify secret name in Key Vault matches SecretProviderClass |
| SecretProviderClass not found | Missing SecretProviderClass resource | Create the SecretProviderClass in the correct namespace |
Fix Key Vault access with Workload Identity:
# Check if the service account has the workload identity annotation
kubectl get sa <service-account> -n <namespace> -o yaml | grep azure.workload.identity
# Verify the federated credential exists
az identity federated-credential list --identity-name <identity-name> --resource-group <rg>
# Check Key Vault access policies
az keyvault show --name <vault-name> --query 'properties.accessPolicies'
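For reference, a minimal SecretProviderClass wired to Workload Identity looks like the sketch below; the resource name, client ID, vault name, tenant ID, and secret name are illustrative placeholders you must replace:
# Minimal SecretProviderClass sketch for Workload Identity (placeholders are illustrative)
kubectl apply -n <namespace> -f - <<'EOF'
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: app-keyvault            # referenced by the pod's CSI volume
spec:
  provider: azure
  parameters:
    usePodIdentity: "false"
    clientID: "<client-id>"     # client ID of the managed identity
    keyvaultName: "<vault-name>"
    tenantId: "<tenant-id>"
    objects: |
      array:
        - |
          objectName: <secret-name>
          objectType: secret
EOF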
Azure-Specific Cause 2: Azure CNI IP Exhaustion
AKS with Azure CNI assigns a real Azure VNET IP to each pod. When the subnet runs out of IPs, new pods cannot get an IP address and may crash during startup.
# Check subnet IP usage
az network vnet subnet show --resource-group <rg> --vnet-name <vnet> --name <subnet> --query 'ipConfigurations | length(@)'
# Check the subnet address prefix (to work out total IP capacity)
az network vnet subnet show --resource-group <rg> --vnet-name <vnet> --name <subnet> --query addressPrefix
# Check the Azure CNI pods
kubectl get pods -n kube-system -l k8s-app=azure-cni
kubectl logs -n kube-system -l k8s-app=azure-cni --tail=20
Signs of IP exhaustion:
- Pods stuck in ContainerCreating before entering CrashLoopBackOff
- Events showing Failed to allocate address
- New nodes failing to join the cluster
Fixes:
- Expand the subnet address range
- Switch to Azure CNI Overlay mode, which assigns pod IPs from a separate overlay address space instead of the VNET subnet
- Switch to kubenet networking (simpler but with limitations)
- Reduce the number of pods per node with --max-pods (see the sketch below)
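Note that --max-pods is typically fixed at node pool creation, so lowering it usually means adding a replacement pool; a sketch (pool name and node count are illustrative):
# Add a node pool with a lower pods-per-node ceiling, then drain the old pool
az aks nodepool add \
  --cluster-name <cluster-name> \
  --resource-group <rg> \
  --name smallpods \
  --node-count 3 \
  --max-pods 30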
Azure-Specific Cause 3: Managed Identity and AAD Issues
Applications using Azure Managed Identity (via Workload Identity or AAD Pod Identity) crash when they cannot authenticate to Azure services.
# Check Workload Identity webhook
kubectl get pods -n kube-system -l azure-workload-identity.io/system=true
# Check if the service account is annotated correctly
kubectl get sa <sa-name> -n <namespace> -o yaml
# Check pod for identity-related env vars
kubectl exec <pod-name> -n <namespace> -- env | grep AZURE
Common errors in application logs:
| Error | Cause | Fix |
|---|---|---|
| ManagedIdentityCredential authentication failed | Workload Identity not configured | Configure federated credential for the service account |
| DefaultAzureCredential failed | No Azure identity available | Ensure identity is assigned and RBAC is correct |
| AADSTS700016: Application not found | Wrong client ID | Verify the client ID in the service account annotation |
Fix Workload Identity setup:
# Create user-assigned managed identity
az identity create --name <identity-name> --resource-group <rg>
# Create federated credential
az identity federated-credential create \
--name <fedcred-name> \
--identity-name <identity-name> \
--resource-group <rg> \
--issuer $(az aks show -n <cluster-name> -g <rg> --query oidcIssuerProfile.issuerUrl -o tsv) \
--subject system:serviceaccount:<namespace>:<sa-name>
# Annotate the service account
kubectl annotate sa <sa-name> -n <namespace> \
azure.workload.identity/client-id=<client-id>
# Label the service account
kubectl label sa <sa-name> -n <namespace> \
azure.workload.identity/use=true
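After annotating and labelling, restart the workload so the mutating webhook processes freshly created pods, then confirm the injected environment; note that on recent Workload Identity releases the azure.workload.identity/use=true label belongs on the pod template rather than the service account:
# Restart so the webhook mutates new pods
kubectl rollout restart deployment/<deployment-name> -n <namespace>
# The webhook injects AZURE_CLIENT_ID, AZURE_TENANT_ID,
# AZURE_FEDERATED_TOKEN_FILE and AZURE_AUTHORITY_HOST
kubectl exec <pod-name> -n <namespace> -- env | grep '^AZURE_'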
Azure-Specific Cause 4: AKS Node Pool Issues
CrashLoopBackOff can be caused by AKS node pool problems that do not appear on other platforms.
VM Size Mismatch
If the node pool uses a VM size with little headroom above the pod's resource requests, the pod schedules but gets CPU-throttled or OOM-killed once load pushes it past what the node can actually provide.
# Check node pool VM size
az aks nodepool list --resource-group <rg> --cluster-name <cluster-name> -o table
# Check node resources
kubectl describe node <node-name> | grep -A 5 "Allocatable"
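To see how much of the node's capacity is already spoken for, compare the allocated-resources summary against the allocatable figures and list what is scheduled there:
# Requested vs allocatable on the node
kubectl describe node <node-name> | grep -A 8 "Allocated resources"
# Pods currently scheduled on that node
kubectl get pods --all-namespaces --field-selector spec.nodeName=<node-name>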
Ephemeral OS Disk Full
AKS nodes using ephemeral OS disks have limited disk space tied to the VM cache size. Container images and logs can fill this quickly.
# Check node disk pressure
kubectl describe node <node-name> | grep DiskPressure
# SSH into node and check disk
kubectl debug node/<node-name> -it --image=mcr.microsoft.com/cbl-mariner/busybox:2.0
# Inside the debug pod: chroot /host df -h
Fix: Use a larger VM size or switch to managed OS disks.
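Disk type is also fixed at node pool creation, so the switch means adding a replacement pool; a sketch (pool name and disk size are illustrative):
# Add a node pool backed by a managed OS disk instead of ephemeral
az aks nodepool add \
  --cluster-name <cluster-name> \
  --resource-group <rg> \
  --name manageddisk \
  --node-osdisk-type Managed \
  --node-osdisk-size 128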
Accelerated Networking Conflicts
Some container networking configurations conflict with Azure Accelerated Networking, causing intermittent pod crashes.
# Check if accelerated networking is enabled
az vmss show --resource-group <node-rg> --name <vmss-name> --query 'virtualMachineProfile.networkProfile.networkInterfaceConfigurations[0].enableAcceleratedNetworking'
Azure-Specific Cause 5: Azure Container Registry (ACR) Pull Failures
AKS pods can end up in CrashLoopBackOff if the initial image pull succeeds but subsequent pulls fail (with imagePullPolicy: Always), for example after the cluster's pull permissions on ACR change.
# Check ACR integration
az aks check-acr --name <cluster-name> --resource-group <rg> --acr <acr-name>
# Check if AKS has ACR pull role
az role assignment list --scope /subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.ContainerRegistry/registries/<acr-name> --query "[?principalName=='<kubelet-identity>']"
Fix ACR pull permissions:
# Attach ACR to AKS (adds AcrPull role)
az aks update --name <cluster-name> --resource-group <rg> --attach-acr <acr-name>
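If permissions check out, confirm the tag itself still exists; retention policies or cleanup jobs that delete tags produce the same pull failures on restart:
# Verify the repository and tag are still present in ACR
az acr repository show-tags --name <acr-name> --repository <repository> -o table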
Azure-Specific Cause 6: Network Security Group (NSG) Blocking Traffic
NSG rules can block outbound traffic that the application needs, causing it to crash when it cannot reach Azure services or external APIs.
# Check effective NSG rules on the node subnet
az network nsg rule list --nsg-name <nsg-name> --resource-group <rg> -o table
# Test connectivity from inside a pod
kubectl run nettest --image=mcr.microsoft.com/cbl-mariner/busybox:2.0 --rm -it --restart=Never -- wget -qO- https://management.azure.com/health 2>&1
Ports that must be open for AKS:
- 443 — API server, Azure services
- 9000 — Tunnelfront/aks-link
- 1194 — UDP for tunnel communication
- 123 — NTP (UDP)
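A throwaway pod can probe those paths from inside the cluster; a sketch reusing the busybox image from above (assumes the busybox build ships TLS-enabled wget, as in the earlier connectivity test):
# DNS resolution and outbound HTTPS checks from inside the cluster
kubectl run nettest --image=mcr.microsoft.com/cbl-mariner/busybox:2.0 \
  --rm -it --restart=Never -- sh -c \
  'nslookup mcr.microsoft.com && wget -qO- https://mcr.microsoft.com/v2/'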
AKS Debugging Tools
Azure Monitor and Container Insights
# Enable Container Insights if not already enabled
az aks enable-addons --addon monitoring --name <cluster-name> --resource-group <rg>
# Query container logs in Azure Monitor
az monitor log-analytics query \
--workspace <workspace-id> \
--analytics-query "ContainerLogV2 | where PodName == '<pod-name>' | order by TimeGenerated desc | take 50"
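Restart counts are also recorded in the KubePodInventory table, which is a quick way to rank crash-looping pods across the cluster (a sketch; schema as documented for Container Insights):
# Rank pods by container restarts over the last hour
az monitor log-analytics query \
  --workspace <workspace-id> \
  --analytics-query "KubePodInventory | where TimeGenerated > ago(1h) | summarize Restarts = max(ContainerRestartCount) by Name | top 10 by Restarts"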
AKS Diagnostics
# Run AKS diagnostics
az aks get-credentials --resource-group <rg> --name <cluster-name>
az aks show --resource-group <rg> --name <cluster-name> --query 'provisioningState'
# Check AKS cluster health
az aks show --resource-group <rg> --name <cluster-name> --query 'powerState'
kubectl debug on AKS Nodes
# AKS-specific node debugging (uses Mariner base image)
kubectl debug node/<node-name> -it --image=mcr.microsoft.com/cbl-mariner/busybox:2.0
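Once inside, chroot to the host to reach the node's own tooling; a few checks that often explain node-level crashes (paths assume a standard AKS Mariner node with containerd):
# Run from inside the debug pod
chroot /host crictl ps -a                      # includes recently exited containers
chroot /host journalctl -u kubelet --no-pager | tail -n 50
chroot /host df -h /var/lib/containerd         # image/layer disk usage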
AKS CrashLoopBackOff Checklist
- [ ] Check exit code and logs (kubectl describe pod, kubectl logs --previous)
- [ ] Check if Key Vault CSI driver is failing (SecretProviderClass, identity)
- [ ] Check subnet IP availability (az network vnet subnet show)
- [ ] Check Workload Identity configuration (service account annotations)
- [ ] Check ACR pull permissions (az aks check-acr)
- [ ] Check NSG rules for blocked outbound traffic
- [ ] Check node pool VM size and disk pressure
- [ ] Check Azure service health (status.azure.com)
For general CrashLoopBackOff debugging that applies to all Kubernetes platforms, see our CrashLoopBackOff pod fix guide. For a broader overview of Kubernetes troubleshooting, see our production troubleshooting guide.
Need Help With AKS CrashLoopBackOff Issues?
Azure AKS adds a layer of complexity that requires Azure-specific expertise. Our engineers at Tasrie IT Services manage dozens of AKS clusters and can help you resolve CrashLoopBackOff issues fast and build the infrastructure to prevent them.
Our AKS consulting services include:
- AKS incident response with engineers who understand Azure networking, identity, and storage
- Workload Identity and Key Vault integration setup and troubleshooting
- AKS cluster optimisation for reliability, security, and performance
We also provide consulting for EKS and GKE if you run multi-cloud Kubernetes.