Every consulting engagement we take starts with an infrastructure audit. Before writing a line of code or changing a single configuration, we need to understand what exists, what is broken, and what is wasting money.
This is the actual checklist we use. It covers security, cost, reliability, performance, operational maturity, and compliance. The commands and service names below are AWS-specific; the same checks apply to the equivalent services on Azure and GCP.
Why Audit Before Anything Else
Teams call us when something is wrong — costs are spiralling, deployments keep breaking, or they just had a security incident. The instinct is to jump straight to fixing the obvious problem. We resist that.
An audit usually reveals that the “obvious” problem is a symptom of deeper issues. High costs are caused by missing autoscaling and oversized instances. Deployment failures trace back to missing health checks and no rollback strategy. Security incidents stem from overly permissive IAM roles and unencrypted state files.
Fixing the symptom without fixing the root cause means we will be back in six months fixing the next symptom.
The Audit Framework: 6 Domains
We score each domain on a 1-5 scale:
| Score | Level | Description |
|---|---|---|
| 1 | Critical | Immediate action needed, significant risk |
| 2 | Below standard | Major gaps, needs attention within 30 days |
| 3 | Acceptable | Meets minimum requirements, room for improvement |
| 4 | Good | Well-implemented, minor optimisations available |
| 5 | Excellent | Best-in-class, optimised and automated |
Domain 1: Security (Check First)
Security issues can cause immediate business impact, so we always start here.
Identity and Access Management
- MFA enabled for all human users — especially root/admin accounts
- No root account access keys — root should never be used for daily operations
- Least-privilege IAM policies — no `*:*` permissions, no `AdministratorAccess` attached to service roles
- IAM roles for services — applications use IAM roles, not long-lived access keys
- Unused IAM users and roles removed — check for users who have not logged in for 90+ days
- Cross-account access reviewed — understand who has access from external accounts
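# Find IAM users with console access but no MFA
# The report is generated asynchronously; re-run get-credential-report if it is not ready yet
# Credential report CSV columns: 1 = user, 4 = password_enabled, 8 = mfa_active
aws iam generate-credential-report
aws iam get-credential-report --query 'Content' --output text | \
  base64 -d | awk -F, 'NR > 1 && $4 == "true" && $8 == "false" { print $1 }'
# List active access keys with their last-used date (None = never used); review any unused for 90+ days
for user in $(aws iam list-users --query 'Users[*].UserName' --output text); do
  for key in $(aws iam list-access-keys --user-name "$user" \
      --query 'AccessKeyMetadata[?Status==`Active`].AccessKeyId' --output text); do
    last_used=$(aws iam get-access-key-last-used --access-key-id "$key" \
      --query 'AccessKeyLastUsed.LastUsedDate' --output text)
    echo "$user $key last used: $last_used"
  done
done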
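The 90-day check on unused IAM users from the list above can be run against the same credential report. The following is a minimal sketch, not a definitive implementation: `stale_console_users` is a hypothetical helper name, the cutoff date is supplied by the caller, and the column positions assume the standard credential report layout (user in column 1, password_enabled in column 4, password_last_used in column 5).

```shell
# stale_console_users CUTOFF < credential-report.csv
# Prints users that have a console password and whose last login is before
# CUTOFF (an ISO 8601 date such as 2025-01-01), or who have never logged in.
# Assumed columns: $1 = user, $4 = password_enabled, $5 = password_last_used.
stale_console_users() {
  awk -F, -v cutoff="$1" '
    NR > 1 && $4 == "true" &&
    ($5 == "no_information" || $5 == "N/A" || $5 < cutoff) { print $1 }
  '
}

# Example (needs AWS credentials and GNU date, so it is left commented out):
# aws iam get-credential-report --query 'Content' --output text | base64 -d |
#   stale_console_users "$(date -u -d '90 days ago' +%Y-%m-%d)"
```

ISO 8601 timestamps sort lexicographically, so a plain string comparison against the cutoff is enough and no date arithmetic is needed inside awk.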
Network Security
- No security groups with 0.0.0.0/0 on SSH (22) or RDP (3389) — restrict to specific IPs or VPN
- VPC flow logs enabled — for all VPCs, sent to CloudWatch or S3
- No public S3 buckets — unless intentionally hosting public content
- Private subnets for databases and internal services — only load balancers in public subnets
- WAF in front of public-facing applications — protects against OWASP top 10
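# Find security groups allowing 0.0.0.0/0 on port 22
# The two filters can match different rules within the same group, and rules where 22
# falls inside a wider port range are missed, so treat the output as a starting point
aws ec2 describe-security-groups \
  --filters Name=ip-permission.from-port,Values=22 \
            Name=ip-permission.cidr,Values=0.0.0.0/0 \
  --query 'SecurityGroups[*].{ID:GroupId,Name:GroupName}'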
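For the VPC flow logs item above, one way to find gaps is to compare the list of VPC IDs with the resource IDs that already have a flow log. The sketch below assumes both lists come from AWS CLI text output (space- or tab-separated on one line); `vpcs_without_flow_logs` is a hypothetical helper name. It only considers VPC-level flow logs, so a VPC covered purely by subnet-level logs would still be reported.

```shell
# vpcs_without_flow_logs "ALL_VPC_IDS" "RESOURCE_IDS_WITH_FLOW_LOGS"
# Both arguments are space- or tab-separated ID lists.
# Prints every ID from the first list that does not appear in the second.
vpcs_without_flow_logs() {
  awk -v all="$1" -v logged="$2" 'BEGIN {
    n = split(logged, have, /[ \t]+/)
    for (i = 1; i <= n; i++) covered[have[i]] = 1
    m = split(all, ids, /[ \t]+/)
    for (i = 1; i <= m; i++)
      if (ids[i] != "" && !(ids[i] in covered)) print ids[i]
  }'
}

# Example (needs AWS credentials, so it is left commented out):
# all=$(aws ec2 describe-vpcs --query 'Vpcs[*].VpcId' --output text)
# logged=$(aws ec2 describe-flow-logs --query 'FlowLogs[*].ResourceId' --output text)
# vpcs_without_flow_logs "$all" "$logged"
```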
Encryption
- S3 buckets encrypted — default encryption enabled on all buckets
- RDS encryption at rest — enabled on all database instances
- EBS volumes encrypted — default encryption enabled in account settings
- TLS everywhere — all internal and external communication over HTTPS/TLS
- KMS key rotation enabled — automatic rotation for all customer-managed keys
- Terraform state encrypted — state files containing secrets are encrypted
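Several of the encryption items above can be spot-checked from the CLI. The functions below are a read-only sketch with hypothetical names; each call only covers the region configured for the CLI, so a full audit would repeat them per region.

```shell
# List EBS volumes that are not encrypted.
unencrypted_ebs_volumes() {
  aws ec2 describe-volumes --filters Name=encrypted,Values=false \
    --query 'Volumes[*].VolumeId' --output text
}

# List RDS instances without encryption at rest.
unencrypted_rds_instances() {
  aws rds describe-db-instances \
    --query 'DBInstances[?StorageEncrypted==`false`].DBInstanceIdentifier' \
    --output text
}

# Print True or False depending on the account-level EBS default encryption setting.
ebs_default_encryption_enabled() {
  aws ec2 get-ebs-encryption-by-default \
    --query 'EbsEncryptionByDefault' --output text
}
```

Empty output from the first two functions means nothing unencrypted was found in that region.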
Logging and Monitoring
- CloudTrail enabled — in all regions, with log file validation
- CloudTrail logs sent to centralised S3 bucket — with lifecycle policies
- GuardDuty enabled — for threat detection across accounts
- Config enabled — for resource compliance tracking
- Alerts for critical events — root login, IAM policy changes, security group changes
Domain 2: Cost Optimisation
We typically find that 30-50% of cloud spend is waste. See our FinOps guide for detailed strategies.
Compute
- Right-sized instances — compare provisioned CPU and memory with actual utilisation over 14 days
- Graviton instances where applicable — 20-40% cheaper for compatible workloads
- Spot instances for non-critical workloads — CI/CD, dev/staging, batch processing
- No idle instances — check for EC2 instances with < 5% CPU for 7+ days
- Non-production schedules — dev/staging shut down outside business hours
- Savings Plans or Reserved Instances — for stable, predictable workloads
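# Find running instances with <5% average CPU over 7 days
if date --version >/dev/null 2>&1; then
  start=$(date -u -d '7 days ago' +%Y-%m-%dT%H:%M:%S)  # GNU date (Linux)
else
  start=$(date -u -v-7d +%Y-%m-%dT%H:%M:%S)            # BSD date (macOS)
fi
end=$(date -u +%Y-%m-%dT%H:%M:%S)
for id in $(aws ec2 describe-instances \
    --filters Name=instance-state-name,Values=running \
    --query 'Reservations[*].Instances[*].InstanceId' --output text); do
  avg=$(aws cloudwatch get-metric-statistics \
    --namespace AWS/EC2 --metric-name CPUUtilization \
    --dimensions Name=InstanceId,Value="$id" \
    --start-time "$start" --end-time "$end" \
    --period 604800 --statistics Average \
    --query 'Datapoints[0].Average' --output text 2>/dev/null)
  # Skip instances without datapoints; print only those under the 5% threshold
  awk -v id="$id" -v avg="$avg" 'BEGIN { if (avg ~ /^[0-9.]+$/ && avg < 5) printf "%s: %.1f%%\n", id, avg }'
done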
Storage
- S3 lifecycle policies — move infrequently accessed data to IA/Glacier
- Unattached EBS volumes — delete volumes not attached to any instance
- Old EBS snapshots — remove snapshots from terminated instances
- CloudWatch log retention — set retention periods (not indefinite)
Kubernetes (if applicable)
- Pod resource requests vs actual usage — most pods over-request by 50-80%
- VPA recommendations reviewed and applied
- Karpenter consolidation enabled
- Namespace-level cost allocation — teams see their costs via Kubecost
Quick Wins
- Delete unused Elastic IPs — $3.60/month each
- Remove unused load balancers — $16-22/month each
- Clean up old AMIs and snapshots
- Review NAT Gateway usage — often a silent budget killer
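To put a number on the first two quick wins, the per-resource figures from the list above can be multiplied by the counts found during the audit. This is a rough sketch: `quick_win_savings` is a hypothetical helper, the counts in the example are made up, and actual prices vary by region and usage.

```shell
# quick_win_savings EIP_COUNT LB_COUNT
# Prints the estimated monthly saving range in USD, using the figures quoted
# in the list above: $3.60 per unused Elastic IP, $16-22 per unused load balancer.
quick_win_savings() {
  LC_ALL=C awk -v eips="$1" -v lbs="$2" 'BEGIN {
    low  = eips * 3.60 + lbs * 16
    high = eips * 3.60 + lbs * 22
    printf "%.2f-%.2f\n", low, high
  }'
}

# Illustrative counts: 5 unused Elastic IPs and 3 unused load balancers.
quick_win_savings 5 3   # prints 66.00-84.00
```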
Domain 3: Reliability
Backup and Recovery
- Automated database backups — RDS automated backups enabled with appropriate retention
- Backup testing — backups restored and verified at least quarterly
- Disaster recovery plan documented — RPO and RTO defined for each service
- Multi-AZ for production databases — single-AZ is a single point of failure
- Cross-region backups — for critical data, replicate to a second region
High Availability
- Load balancers in front of all web services — no single-instance production services
- Auto-scaling configured — for all stateless services
- Health checks enabled — load balancer health checks on all targets
- Kubernetes readiness and liveness probes — on every container
- No single points of failure — every critical component has redundancy
Deployment
- Rolling deployments or blue-green — not big-bang deployments
- Rollback mechanism — can revert to previous version in < 5 minutes
- Infrastructure as Code — all infrastructure defined in Terraform/OpenTofu
- No manual console changes — all changes go through IaC pipeline
- Deployment frequency — can deploy to production at least weekly
Domain 4: Performance
Application
- Response time p95 < 500ms — for user-facing endpoints
- Database query performance — no queries taking > 1 second regularly
- Connection pooling — for database connections (avoid connection storms)
- CDN for static assets — CSS, JS, images served from edge locations
- Caching layer — Redis/ElastiCache for frequently accessed data
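The p95 target above can be checked from any export of request latencies. The sketch below assumes one latency value in milliseconds per line and uses the nearest-rank method, which is one of several common percentile definitions; `p95` is a hypothetical helper name.

```shell
# p95: read one latency value (ms) per line on stdin and print the
# 95th percentile using the nearest-rank method.
p95() {
  sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      rank = int((NR * 95 + 99) / 100)   # integer form of ceil(0.95 * NR)
      print v[rank]
    }'
}

# Example, assuming a hypothetical file with one latency per line:
# p95 < latencies.txt
```

With small samples a single slow request can dominate the result, so we would run this over at least a few days of traffic before comparing against the 500ms target.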
Infrastructure
- Instance types match workload — compute-optimised for CPU, memory-optimised for caches
- Storage IOPS adequate — gp3 with provisioned IOPS for database volumes
- Network throughput — instance type supports required bandwidth
- DNS TTL appropriate — not too high (delays failover) or too low (excess queries)
Domain 5: Operational Maturity
CI/CD
- Automated testing in pipeline — unit, integration, and/or end-to-end tests
- Pipeline runs on every PR — not just on merge
- Security scanning in pipeline — SAST, dependency audit, container scan
- Deployment requires approval — for production environments
- Pipeline execution time < 15 minutes — slow pipelines reduce deployment frequency
Monitoring and Alerting
- Metrics collection — CPU, memory, disk, network, application metrics
- Dashboards — for each service showing key health indicators
- Alerting — critical alerts go to PagerDuty/OpsGenie, not just email
- On-call rotation — defined, documented, and followed
- Runbooks — for common incidents (database failover, scaling, rollback)
Documentation
- Architecture diagram — current, not aspirational
- Network diagram — VPCs, subnets, peering, VPNs
- Access documentation — who has access to what, and why
- Incident response plan — what to do when production goes down
Domain 6: Compliance and Governance
- Resource tagging policy — all resources tagged with environment, team, project
- AWS Organizations — separate accounts for production, staging, development
- Service Control Policies — prevent dangerous actions (deleting CloudTrail, etc.)
- Budget alerts — alerts at 50%, 80%, and 100% of expected spend
- Regular access reviews — quarterly review of who has access to what
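For the budget alerts item above, the three thresholds are simple percentages of the expected monthly spend. A minimal sketch, with `budget_thresholds` as a hypothetical helper and 12000 as an illustrative figure rather than a recommendation:

```shell
# budget_thresholds EXPECTED_MONTHLY_SPEND
# Prints the spend level at which each alert (50%, 80%, 100%) should fire,
# one tab-separated "percentage amount" pair per line.
budget_thresholds() {
  LC_ALL=C awk -v spend="$1" 'BEGIN {
    n = split("50 80 100", pct, " ")
    for (i = 1; i <= n; i++)
      printf "%d%%\t%.2f\n", pct[i], spend * pct[i] / 100
  }'
}

# Illustrative expected spend of 12000 per month:
budget_thresholds 12000
# 50%   6000.00
# 80%   9600.00
# 100%  12000.00
```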
How We Run the Audit
Day 1: Automated scanning
- Run AWS Trusted Advisor, Security Hub, and Config
- Collect CloudWatch metrics for 14 days (or use existing data)
- Run Prowler for security benchmarking
- Export Cost Explorer data for the last 3 months
Day 2-3: Manual review
- Walk through each checklist domain
- Interview the team on operational practices
- Review IaC codebases (Terraform, CloudFormation)
- Assess CI/CD pipelines and deployment processes
Day 4: Report
- Score each domain (1-5)
- Prioritise findings: critical → high → medium → low
- Estimate cost savings and risk reduction
- Deliver actionable recommendations with timelines
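The prioritisation step can be sketched as a severity sort. The input format here (one finding per line, severity and description separated by a tab) is an assumption for illustration, and `sort_findings` is a hypothetical helper name.

```shell
# sort_findings: read "severity<TAB>description" lines on stdin and print
# them ordered critical, high, medium, low. Unknown severities go last.
sort_findings() {
  awk -F'\t' '
    BEGIN { rank["critical"] = 1; rank["high"] = 2; rank["medium"] = 3; rank["low"] = 4 }
    { r = ($1 in rank) ? rank[$1] : 9; print r "\t" $0 }' |
    sort -s -n -k1,1 | cut -f2-
}

# Example, assuming a hypothetical findings.tsv file:
# sort_findings < findings.tsv
```

The sort is stable, so findings with the same severity keep the order in which they were recorded.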
Sample Audit Report Summary
| Domain | Score | Key Findings |
|---|---|---|
| Security | 2/5 | 3 security groups open to 0.0.0.0/0, root access keys active, no GuardDuty |
| Cost | 2/5 | 45% of instances under 10% CPU, no spot usage, no lifecycle policies |
| Reliability | 3/5 | Single-AZ database, no tested backup restoration, basic health checks |
| Performance | 3/5 | CDN not configured, no caching layer, adequate response times |
| Operations | 2/5 | No IaC, manual deployments, minimal monitoring |
| Compliance | 2/5 | No tagging policy, single AWS account, no budget alerts |
| Overall | 2.3/5 | 12 critical, 8 high, 15 medium findings |
The average first-time audit scores 2-3 out of 5. That is normal. The goal is to identify the highest-impact improvements and prioritise them.
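The overall figure in the sample report is consistent with an unweighted mean of the six domain scores. Treating it as a plain mean is our assumption for this sketch; a weighted scheme would work the same way with different arithmetic. `overall_score` is a hypothetical helper name.

```shell
# overall_score SCORE...
# Prints the unweighted mean of the given domain scores, rounded to one decimal.
overall_score() {
  printf '%s\n' "$@" | LC_ALL=C awk '{ sum += $1 } END { printf "%.1f\n", sum / NR }'
}

# Scores from the sample report: Security, Cost, Reliability,
# Performance, Operations, Compliance.
overall_score 2 2 3 3 2 2   # prints 2.3
```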
Want a Professional Infrastructure Audit?
We audit cloud infrastructure across AWS, Azure, and GCP — identifying security risks, cost savings, and reliability gaps in one comprehensive review.
Our DevOps consulting services include:
- 4-day infrastructure audit — complete review across all 6 domains with actionable report
- Cost savings identification — we typically find 30-50% waste
- Security hardening — fix critical findings and implement ongoing monitoring
- Remediation support — implement the recommendations, not just report them
Every audit starts with a no-obligation discovery call to understand your environment.