Cloud Infrastructure Audit Checklist: What We Check First (2026)

Engineering Team 2026-03-19

Every consulting engagement we take starts with an infrastructure audit. Before writing a line of code or changing a single configuration, we need to understand what exists, what is broken, and what is wasting money.

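This is the actual checklist we use. It covers six domains: security, cost, reliability, performance, operational maturity, and compliance and governance. The example commands below use the AWS CLI; we apply the same domains to Azure and GCP environments.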

Why Audit Before Anything Else

Teams call us when something is wrong — costs are spiralling, deployments keep breaking, or they just had a security incident. The instinct is to jump straight to fixing the obvious problem. We resist that.

An audit usually reveals that the “obvious” problem is a symptom of deeper issues. High costs are caused by missing autoscaling and oversized instances. Deployment failures trace back to missing health checks and no rollback strategy. Security incidents stem from overly permissive IAM roles and unencrypted state files.

Fixing the symptom without fixing the root cause means we will be back in 6 months fixing the next symptom.

The Audit Framework: 6 Domains

We score each domain on a 1-5 scale:

| Score | Level | Description |
|-------|-------|-------------|
| 1 | Critical | Immediate action needed, significant risk |
| 2 | Below standard | Major gaps, needs attention within 30 days |
| 3 | Acceptable | Meets minimum requirements, room for improvement |
| 4 | Good | Well-implemented, minor optimisations available |
| 5 | Excellent | Best-in-class, optimised and automated |

Domain 1: Security (Check First)

Security issues can cause immediate business impact, so we always start here.

Identity and Access Management

  • MFA enabled for all human users — especially root/admin accounts
  • No root account access keys — root should never be used for daily operations
  • Least-privilege IAM policies — no *:* permissions, no AdministratorAccess attached to service roles
  • IAM roles for services — applications use IAM roles, not long-lived access keys
  • Unused IAM users and roles removed — check for users who have not logged in for 90+ days
  • Cross-account access reviewed — understand who has access from external accounts
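# Find IAM users with console access but no MFA
# Credential report CSV: column 4 = password_enabled, column 8 = mfa_active
# (report generation is asynchronous; retry get-credential-report if it is not ready yet)
aws iam generate-credential-report
aws iam get-credential-report --query 'Content' --output text | \
  base64 -d | awk -F, 'NR > 1 && $4 == "true" && $8 == "false" { print $1 }'

# List active access keys per user; check when each key was last used with
#   aws iam get-access-key-last-used --access-key-id <key-id>
# (list-users prints tab-separated names, so split them one per line for xargs)
aws iam list-users --query 'Users[*].UserName' --output text | tr '\t' '\n' | \
  xargs -I {} aws iam list-access-keys --user-name {} \
  --query 'AccessKeyMetadata[?Status==`Active`].[UserName,AccessKeyId,CreateDate]' --output text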
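
The 90-day check can also be run against the credential report, which records a last-used date for each access key. The following is a minimal sketch rather than a definitive implementation: the function name `stale_access_keys` is hypothetical, and it assumes the report's standard column order (user in column 1, key 1 active and last-used in columns 9 and 11, key 2 in columns 14 and 16).

```shell
# Sketch: print "user,key-slot,last-used" for active access keys that were
# last used before a cutoff date, or never used (reported as N/A).
# Reads the decoded credential report CSV on stdin.
stale_access_keys() {
  # $1 = cutoff as an ISO date, e.g. 2025-12-19 (ISO timestamps sort as plain strings)
  awk -F, -v cutoff="$1" '
    NR == 1 { next }                      # skip the CSV header row
    {
      check($1, "key1", $9,  $11)
      check($1, "key2", $14, $16)
    }
    function check(user, slot, active, last_used) {
      if (active == "true" && (last_used == "N/A" || last_used < cutoff))
        print user "," slot "," last_used
    }'
}

# Usage (assumes GNU date and configured AWS CLI credentials):
# aws iam get-credential-report --query 'Content' --output text | base64 -d | \
#   stale_access_keys "$(date -u -d '90 days ago' +%Y-%m-%d)"
```

Passing the cutoff as an argument keeps the date arithmetic outside the function, so the same filter works with BSD or GNU `date`.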

Network Security

  • No security groups with 0.0.0.0/0 on SSH (22) or RDP (3389) — restrict to specific IPs or VPN
  • VPC flow logs enabled — for all VPCs, sent to CloudWatch or S3
  • No public S3 buckets — unless intentionally hosting public content
  • Private subnets for databases and internal services — only load balancers in public subnets
  • WAF in front of public-facing applications — protects against OWASP top 10
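# Find security groups allowing 0.0.0.0/0 on port 22
# Note: from-port=22 only matches rules whose port range starts at 22; rules with
# a wider range (e.g. 0-65535) or all-traffic rules need a separate check
aws ec2 describe-security-groups \
  --filters Name=ip-permission.from-port,Values=22 \
           Name=ip-permission.cidr,Values=0.0.0.0/0 \
  --query 'SecurityGroups[*].{ID:GroupId,Name:GroupName}'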
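
A rule can expose SSH or RDP without naming port 22 or 3389 directly, for example through a wide port range or an all-traffic rule. The sketch below checks for that case. The function name `open_to_world` and the five-column input layout are assumptions made for illustration, and the usage line assumes a recent AWS CLI that provides `describe-security-group-rules`.

```shell
# Sketch: print security group IDs that expose a given TCP port to 0.0.0.0/0.
# Input lines on stdin: group-id protocol from-port to-port cidr
open_to_world() {
  # $1 = port to test, e.g. 22 or 3389
  awk -v port="$1" '
    $5 != "0.0.0.0/0" { next }            # keep only rules open to the world
    $2 == "-1" { print $1; next }         # protocol -1 means all traffic
    $2 == "tcp" && $3 <= port && port <= $4 { print $1 }
  ' | sort -u
}

# Usage (assumes configured AWS CLI credentials):
# aws ec2 describe-security-group-rules \
#   --query 'SecurityGroupRules[?IsEgress==`false`].[GroupId,IpProtocol,FromPort,ToPort,CidrIpv4]' \
#   --output text | open_to_world 22
```

This only inspects IPv4 CIDRs; rules open to `::/0` would need the `CidrIpv6` field added in the same way.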

Encryption

  • S3 buckets encrypted — default encryption enabled on all buckets
  • RDS encryption at rest — enabled on all database instances
  • EBS volumes encrypted — default encryption enabled in account settings
  • TLS everywhere — all internal and external communication over HTTPS/TLS
  • KMS key rotation enabled — automatic rotation for all customer-managed keys
  • Terraform state encrypted — state files containing secrets are encrypted

Logging and Monitoring

  • CloudTrail enabled — in all regions, with log file validation
  • CloudTrail logs sent to centralised S3 bucket — with lifecycle policies
  • GuardDuty enabled — for threat detection across accounts
  • AWS Config enabled — for resource compliance tracking
  • Alerts for critical events — root login, IAM policy changes, security group changes

Domain 2: Cost Optimisation

We typically find that 30-50% of cloud spend is waste. See our FinOps guide for detailed strategies.

Compute

  • Right-sized instances — compare provisioned vCPU and memory against actual utilisation over 14 days
  • Graviton instances where applicable — 20-40% cheaper for compatible workloads
  • Spot instances for non-critical workloads — CI/CD, dev/staging, batch processing
  • No idle instances — check for EC2 instances with < 5% CPU for 7+ days
  • Non-production schedules — dev/staging shut down outside business hours
  • Savings Plans or Reserved Instances — for stable, predictable workloads
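# Find instances with <5% average CPU over 7 days
# date -v-7d is BSD/macOS syntax; on GNU/Linux use: date -u -d '7 days ago' +%Y-%m-%dT%H:%M:%S
start=$(date -u -v-7d +%Y-%m-%dT%H:%M:%S)
end=$(date -u +%Y-%m-%dT%H:%M:%S)
for id in $(aws ec2 describe-instances \
    --filters Name=instance-state-name,Values=running \
    --query 'Reservations[*].Instances[*].InstanceId' --output text); do
  avg=$(aws cloudwatch get-metric-statistics \
    --namespace AWS/EC2 --metric-name CPUUtilization \
    --dimensions Name=InstanceId,Value="$id" \
    --start-time "$start" --end-time "$end" \
    --period 604800 --statistics Average \
    --query 'Datapoints[0].Average' --output text 2>/dev/null)
  # Print only instances under the 5% threshold; "None" (no datapoints) is skipped
  awk -v id="$id" -v avg="$avg" \
    'BEGIN { if (avg ~ /^[0-9.]+$/ && avg + 0 < 5) printf "%s: %.1f%%\n", id, avg }'
done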

Storage

  • S3 lifecycle policies — move infrequently accessed data to IA/Glacier
  • Unattached EBS volumes — delete volumes not attached to any instance
  • Old EBS snapshots — remove snapshots from terminated instances
  • CloudWatch log retention — set retention periods (not indefinite)
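
To size the unattached-volume cleanup before deleting anything, it helps to total what is sitting idle. This is a minimal sketch: `unattached_volume_summary` is a hypothetical helper that expects one "volume-id size-in-GiB" pair per line, which is the shape the usage command produces.

```shell
# Sketch: count unattached volumes and total their size.
# Input lines on stdin: volume-id size-in-GiB
unattached_volume_summary() {
  awk '{ count++; gib += $2 }
       END { printf "%d unattached volumes, %d GiB total\n", count, gib }'
}

# Usage (assumes configured AWS CLI credentials; "available" = not attached):
# aws ec2 describe-volumes --filters Name=status,Values=available \
#   --query 'Volumes[*].[VolumeId,Size]' --output text | unattached_volume_summary
```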

Kubernetes (if applicable)

  • Pod resource requests vs actual usage — most pods over-request by 50-80%
  • VPA recommendations reviewed and applied
  • Karpenter consolidation enabled
  • Namespace-level cost allocation — teams see their costs via Kubecost
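
The over-request figure is simple arithmetic once a pod's request and observed usage are known. The sketch below assumes "over-request" means the share of the request that goes unused, which is one reasonable reading and not a definition taken from a specific tool; `over_request_pct` is a hypothetical helper.

```shell
# Sketch: percentage of a pod's CPU request that goes unused.
# Both arguments are in millicores (request, then observed usage).
over_request_pct() {
  awk -v req="$1" -v used="$2" 'BEGIN {
    if (req <= 0) { print "n/a"; exit }   # no request set, nothing to compare
    printf "%.0f%%\n", (req - used) * 100 / req
  }'
}

over_request_pct 1000 300   # prints 70%
```

The usage figure would typically come from metrics over a representative period, not from a single point-in-time reading.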

Quick Wins

  • Delete unused Elastic IPs — $3.60/month each
  • Remove unused load balancers — $16-22/month each
  • Clean up old AMIs and snapshots
  • Review NAT Gateway usage — often the silent cost killer

Domain 3: Reliability

Backup and Recovery

  • Automated database backups — RDS automated backups enabled with appropriate retention
  • Backup testing — backups restored and verified at least quarterly
  • Disaster recovery plan documented — RPO and RTO defined for each service
  • Multi-AZ for production databases — single-AZ is a single point of failure
  • Cross-region backups — for critical data, replicate to a second region

High Availability

  • Load balancers in front of all web services — no single-instance production services
  • Auto-scaling configured — for all stateless services
  • Health checks enabled — load balancer health checks on all targets
  • Kubernetes readiness and liveness probes — on every container
  • No single points of failure — every critical component has redundancy

Deployment

  • Rolling deployments or blue-green — not big-bang deployments
  • Rollback mechanism — can revert to previous version in < 5 minutes
  • Infrastructure as Code — all infrastructure defined in Terraform/OpenTofu
  • No manual console changes — all changes go through IaC pipeline
  • Deployment frequency — can deploy to production at least weekly

Domain 4: Performance

Application

  • Response time p95 < 500ms — for user-facing endpoints
  • Database query performance — no queries taking > 1 second regularly
  • Connection pooling — for database connections (avoid connection storms)
  • CDN for static assets — CSS, JS, images served from edge locations
  • Caching layer — Redis/ElastiCache for frequently accessed data
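
Checking the p95 target needs a percentile, not an average. The sketch below uses the nearest-rank method, which is one common definition (monitoring tools may interpolate instead). The function name `p95` is hypothetical, and it assumes one response time in milliseconds per input line, for example extracted from access logs.

```shell
# Sketch: nearest-rank 95th percentile of numbers read from stdin (one per line).
p95() {
  sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit
      rank = int((NR * 95 + 99) / 100)    # ceil(0.95 * n) using integer arithmetic
      print v[rank]
    }'
}

printf '%s\n' 120 480 90 510 200 | p95   # prints 510
```

With only five samples the p95 is simply the slowest request, so a meaningful check needs a much larger sample.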

Infrastructure

  • Instance types match workload — compute-optimised for CPU, memory-optimised for caches
  • Storage IOPS adequate — gp3 with provisioned IOPS for database volumes
  • Network throughput — instance type supports required bandwidth
  • DNS TTL appropriate — not too high (delays failover) or too low (excess queries)

Domain 5: Operational Maturity

CI/CD

  • Automated testing in pipeline — unit, integration, and/or end-to-end tests
  • Pipeline runs on every PR — not just on merge
  • Security scanning in pipeline — SAST, dependency audit, container scan
  • Deployment requires approval — for production environments
  • Pipeline execution time < 15 minutes — slow pipelines reduce deployment frequency

Monitoring and Alerting

  • Metrics collection — CPU, memory, disk, network, application metrics
  • Dashboards — for each service showing key health indicators
  • Alerting — critical alerts go to PagerDuty/OpsGenie, not just email
  • On-call rotation — defined, documented, and followed
  • Runbooks — for common incidents (database failover, scaling, rollback)

Documentation

  • Architecture diagram — current, not aspirational
  • Network diagram — VPCs, subnets, peering, VPNs
  • Access documentation — who has access to what, and why
  • Incident response plan — what to do when production goes down

Domain 6: Compliance and Governance

  • Resource tagging policy — all resources tagged with environment, team, project
  • AWS Organizations — separate accounts for production, staging, development
  • Service Control Policies — prevent dangerous actions (deleting CloudTrail, etc.)
  • Budget alerts — alerts at 50%, 80%, and 100% of expected spend
  • Regular access reviews — quarterly review of who has access to what
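
The three budget thresholds amount to a simple comparison of spend against budget. The sketch below illustrates that logic only; `budget_alert_level` is a hypothetical helper, and a managed service such as AWS Budgets would normally evaluate the thresholds and send the notifications.

```shell
# Sketch: report the highest alert threshold (50%, 80%, 100%) that current
# spend has crossed. Assumes budget is greater than zero.
budget_alert_level() {
  # $1 = month-to-date spend, $2 = expected monthly budget (same currency)
  awk -v spend="$1" -v budget="$2" 'BEGIN {
    if      (spend * 100 >= budget * 100) level = "100%"
    else if (spend * 100 >= budget * 80)  level = "80%"
    else if (spend * 100 >= budget * 50)  level = "50%"
    else                                  level = "none"
    print level
  }'
}

budget_alert_level 850 1000   # prints 80%
```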

How We Run the Audit

Day 1: Automated scanning

  • Run AWS Trusted Advisor, Security Hub, and Config
  • Collect CloudWatch metrics for 14 days (or use existing data)
  • Run Prowler for security benchmarking
  • Export Cost Explorer data for the last 3 months

Days 2-3: Manual review

  • Walk through each checklist domain
  • Interview the team on operational practices
  • Review IaC codebases (Terraform, CloudFormation)
  • Assess CI/CD pipelines and deployment processes

Day 4: Report

  • Score each domain (1-5)
  • Prioritise findings: critical → high → medium → low
  • Estimate cost savings and risk reduction
  • Deliver actionable recommendations with timelines

Sample Audit Report Summary

| Domain | Score | Key Findings |
|--------|-------|--------------|
| Security | 2/5 | 3 security groups open to 0.0.0.0/0, root access keys active, no GuardDuty |
| Cost | 2/5 | 45% of instances under 10% CPU, no spot usage, no lifecycle policies |
| Reliability | 3/5 | Single-AZ database, no tested backup restoration, basic health checks |
| Performance | 3/5 | CDN not configured, no caching layer, adequate response times |
| Operations | 2/5 | No IaC, manual deployments, minimal monitoring |
| Compliance | 2/5 | No tagging policy, single AWS account, no budget alerts |
| Overall | 2.3/5 | 12 critical, 8 high, 15 medium findings |

The average first-time audit scores 2-3 out of 5. That is normal. The goal is to identify the highest-impact improvements and prioritise them.
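
The overall figure in the sample report is consistent with an unweighted mean of the six domain scores (14 / 6 ≈ 2.3). Whether domains are ever weighted differently is not stated here, so the sketch below assumes a plain average; `overall_score` is a hypothetical helper.

```shell
# Sketch: unweighted mean of domain scores, rounded to one decimal place.
# Input lines on stdin: domain-name score
overall_score() {
  awk '{ total += $2; n++ } END { if (n) printf "%.1f\n", total / n }'
}

printf '%s\n' 'Security 2' 'Cost 2' 'Reliability 3' \
  'Performance 3' 'Operations 2' 'Compliance 2' | overall_score   # prints 2.3
```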


Want a Professional Infrastructure Audit?

We audit cloud infrastructure across AWS, Azure, and GCP — identifying security risks, cost savings, and reliability gaps in one comprehensive review.

Our DevOps consulting services include:

  • 4-day infrastructure audit — complete review across all 6 domains with actionable report
  • Cost savings identification — we typically find 30-50% waste
  • Security hardening — fix critical findings and implement ongoing monitoring
  • Remediation support — implement the recommendations, not just report them

Every audit starts with a no-obligation discovery call to understand your environment.

Request a free audit consultation →
