Cloud Infrastructure Audit Checklist: What We Check First (2026)

Every consulting engagement we take starts with an infrastructure audit. Before writing a line of code or changing a single configuration, we need to understand what exists, what is broken, and what is wasting money.

This is the actual checklist we use. It covers security, cost, reliability, performance, operational maturity, and compliance. The services and commands named below are AWS examples, but the same six domains apply to Azure and GCP environments.

Why Audit Before Anything Else

Teams call us when something is wrong — costs are spiralling, deployments keep breaking, or they just had a security incident. The instinct is to jump straight to fixing the obvious problem. We resist that.

An audit usually reveals that the “obvious” problem is a symptom of deeper issues. High costs are caused by missing autoscaling and oversized instances. Deployment failures trace back to missing health checks and no rollback strategy. Security incidents stem from overly permissive IAM roles and unencrypted state files.

Fixing the symptom without fixing the root cause means we will be back in 6 months fixing the next symptom.

The Audit Framework: 6 Domains

We score each domain on a 1-5 scale:

Score	Level	Description
1	Critical	Immediate action needed, significant risk
2	Below standard	Major gaps, needs attention within 30 days
3	Acceptable	Meets minimum requirements, room for improvement
4	Good	Well-implemented, minor optimisations available
5	Excellent	Best-in-class, optimised and automated

Domain 1: Security (Check First)

Security issues can cause immediate business impact, so we always start here.

Identity and Access Management

MFA enabled for all human users — especially root/admin accounts
No root account access keys — root should never be used for daily operations
Least-privilege IAM policies — no *:* permissions, no AdministratorAccess attached to service roles
IAM roles for services — applications use IAM roles, not long-lived access keys
Unused IAM users and roles removed — check for users who have not logged in for 90+ days
Cross-account access reviewed — understand who has access from external accounts

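# Find IAM users with console access but no MFA
# Credential report CSV columns: 1 = user, 4 = password_enabled, 8 = mfa_active
# generate-credential-report is asynchronous; retry the second command if the report is not ready
aws iam generate-credential-report
aws iam get-credential-report --query 'Content' --output text | \
  base64 -d | awk -F, 'NR > 1 && $4 == "true" && $8 == "false" { print $1 }'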

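# List each active access key with its last-used date (flag anything older than 90 days)
for user in $(aws iam list-users --query 'Users[*].UserName' --output text); do
  for key in $(aws iam list-access-keys --user-name "$user" \
      --query 'AccessKeyMetadata[?Status==`Active`].AccessKeyId' --output text); do
    echo "$user $key $(aws iam get-access-key-last-used --access-key-id "$key" \
      --query 'AccessKeyLastUsed.LastUsedDate' --output text)"
  done
done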
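
The 90-day check itself is a date comparison. A minimal sketch, assuming you have collected one line per key in the form `user key-id last-used-date` (for instance by calling `aws iam get-access-key-last-used` for each active key), that timestamps are ISO 8601 in UTC, and that a key that was never used shows up as `None`. The function name `stale_keys` and the cutoff argument are our own illustration, not part of the AWS CLI.

```shell
# stale_keys CUTOFF: read "user key-id last-used" lines on stdin and print
# the keys that were never used ("None") or were last used before CUTOFF.
# ISO 8601 UTC timestamps sort lexicographically, so a plain string
# comparison is enough and no date arithmetic is needed.
stale_keys() {
  awk -v cutoff="$1" '$3 == "None" || $3 < cutoff'
}

# Usage (hypothetical file name): pass the date 90 days ago as the cutoff.
#   stale_keys 2025-10-01 < key-usage.txt
```

Passing the cutoff explicitly keeps the sketch portable, because the syntax for "90 days ago" differs between GNU and BSD `date`.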

Network Security

No security groups with 0.0.0.0/0 on SSH (22) or RDP (3389) — restrict to specific IPs or VPN
VPC flow logs enabled — for all VPCs, sent to CloudWatch or S3
No public S3 buckets — unless intentionally hosting public content
Private subnets for databases and internal services — only load balancers in public subnets
WAF in front of public-facing applications — helps block common OWASP Top 10 attack patterns such as injection and cross-site scripting

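# Find security groups allowing 0.0.0.0/0 on port 22
# Caveats: the filters match literal values per group, so a range such as 20-30
# is missed, and the port and CIDR may come from different rules in the same group.
# Treat the output as candidates and verify each one against its actual rules.
aws ec2 describe-security-groups \
  --filters Name=ip-permission.from-port,Values=22 \
           Name=ip-permission.cidr,Values=0.0.0.0/0 \
  --query 'SecurityGroups[*].{ID:GroupId,Name:GroupName}'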
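
Filtering on an exact port can miss rules that open a wider range or all traffic. A stricter check evaluates each ingress rule on its own. The sketch below assumes the rules have already been flattened to one line each in the form `group-id protocol from-port to-port cidr` (that layout, and the function name `open_to_world`, are our own assumptions; any JSON tool can produce it from the `describe-security-groups` output).

```shell
# open_to_world PORT: read "group-id protocol from-port to-port cidr" lines
# (one ingress rule per line) and print the IDs of groups that expose the
# TCP port PORT to 0.0.0.0/0.
# Protocol "-1" means all traffic, so it matches whatever the port fields say.
open_to_world() {
  awk -v port="$1" '
    $5 == "0.0.0.0/0" &&
    ($2 == "-1" || ($2 == "tcp" && $3 + 0 <= port + 0 && port + 0 <= $4 + 0)) {
      print $1
    }
  ' | sort -u
}

# Usage (hypothetical file name):
#   open_to_world 22 < ingress-rules.txt
```

Because each line is a single rule, the port range and the CIDR are always checked together, which avoids false positives from groups where they belong to different rules.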

Encryption

S3 buckets encrypted — default encryption enabled on all buckets
RDS encryption at rest — enabled on all database instances
EBS volumes encrypted — default encryption enabled in account settings
TLS everywhere — all internal and external communication over HTTPS/TLS
KMS key rotation enabled — automatic rotation for customer-managed symmetric keys
Terraform state encrypted — state files containing secrets are encrypted

Logging and Monitoring

CloudTrail enabled — in all regions, with log file validation
CloudTrail logs sent to centralised S3 bucket — with lifecycle policies
GuardDuty enabled — for threat detection across accounts
Config enabled — for resource compliance tracking
Alerts for critical events — root login, IAM policy changes, security group changes

Domain 2: Cost Optimisation

We typically find 30-50% waste. See our FinOps guide for detailed strategies.

Compute

Right-sized instances — compare provisioned CPU and memory with actual utilisation over 14 days
Graviton instances where applicable — often 20-40% better price-performance for compatible workloads
Spot instances for non-critical workloads — CI/CD, dev/staging, batch processing
No idle instances — check for EC2 instances with < 5% CPU for 7+ days
Non-production schedules — dev/staging shut down outside business hours
Savings Plans or Reserved Instances — for stable, predictable workloads

# Print the 7-day average CPU for each running instance (flag anything under 5%)
# The start time uses BSD/macOS date syntax; on GNU/Linux use: date -u -d '7 days ago' +%Y-%m-%dT%H:%M:%S
for id in $(aws ec2 describe-instances \
    --filters Name=instance-state-name,Values=running \
    --query 'Reservations[*].Instances[*].InstanceId' --output text); do
  avg=$(aws cloudwatch get-metric-statistics \
    --namespace AWS/EC2 --metric-name CPUUtilization \
    --dimensions Name=InstanceId,Value="$id" \
    --start-time "$(date -u -v-7d +%Y-%m-%dT%H:%M:%S)" \
    --end-time "$(date -u +%Y-%m-%dT%H:%M:%S)" \
    --period 604800 --statistics Average \
    --query 'Datapoints[0].Average' --output text 2>/dev/null)
  echo "$id: $avg%"
done
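
The loop above prints one `instance-id: average%` line per instance but leaves the threshold check to the reader. A small filter can do that step. This is a sketch under assumptions: the function name `idle_instances` is ours, and it only handles plain decimal averages, skipping lines such as `None%` that appear when CloudWatch returns no datapoints.

```shell
# idle_instances THRESHOLD: read "instance-id: average%" lines on stdin and
# print the IDs whose average CPU is strictly below THRESHOLD.
# Lines without a plain decimal average (for example "None%") are skipped
# rather than treated as zero, so missing data is never reported as idle.
idle_instances() {
  awk -v max="$1" '{
    id = $1;  sub(/:$/, "", id)     # drop the trailing colon
    avg = $2; sub(/%$/, "", avg)    # drop the trailing percent sign
    if (avg ~ /^[0-9]+(\.[0-9]+)?$/ && avg + 0 < max + 0) print id
  }'
}

# Usage (hypothetical file name holding the loop output):
#   idle_instances 5 < cpu-averages.txt
```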

Storage

S3 lifecycle policies — move infrequently accessed data to IA/Glacier
Unattached EBS volumes — delete volumes not attached to any instance
Old EBS snapshots — remove snapshots from terminated instances
CloudWatch log retention — set retention periods (not indefinite)

Kubernetes (if applicable)

Pod resource requests vs actual usage — most pods over-request by 50-80%
VPA recommendations reviewed and applied
Karpenter consolidation enabled
Namespace-level cost allocation — teams see their costs via Kubecost

Quick Wins

Delete unused Elastic IPs — $3.60/month each
Remove unused load balancers — $16-22/month each
Clean up old AMIs and snapshots
Review NAT Gateway usage — data-processing charges often make it a large, easily overlooked cost
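
To put a number on the first two quick wins, multiply the resource counts by the per-item figures quoted above. A minimal sketch of that arithmetic, using only the $3.60 and $16-22 monthly figures from this list; the function name and the example counts are hypothetical.

```shell
# quick_win_savings EIPS LBS: print the low and high estimated monthly
# saving in USD from deleting EIPS unused Elastic IPs ($3.60 each) and
# LBS unused load balancers ($16 to $22 each).
quick_win_savings() {
  awk -v eips="$1" -v lbs="$2" 'BEGIN {
    printf "%.2f %.2f\n", eips * 3.60 + lbs * 16, eips * 3.60 + lbs * 22
  }'
}

quick_win_savings 4 2   # prints: 46.40 58.40
```

For four idle Elastic IPs and two idle load balancers, that is roughly $46 to $58 a month: small per account, but it adds up across many accounts and regions.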

Domain 3: Reliability

Backup and Recovery

Automated database backups — RDS automated backups enabled with appropriate retention
Backup testing — backups restored and verified at least quarterly
Disaster recovery plan documented — RPO and RTO defined for each service
Multi-AZ for production databases — single-AZ is a single point of failure
Cross-region backups — for critical data, replicate to a second region

High Availability

Load balancers in front of all web services — no single-instance production services
Auto-scaling configured — for all stateless services
Health checks enabled — load balancer health checks on all targets
Kubernetes readiness and liveness probes — on every container
No single points of failure — every critical component has redundancy

Deployment

Rolling deployments or blue-green — not big-bang deployments
Rollback mechanism — can revert to previous version in < 5 minutes
Infrastructure as Code — all infrastructure defined in Terraform/OpenTofu
No manual console changes — all changes go through IaC pipeline
Deployment frequency — can deploy to production at least weekly

Domain 4: Performance

Application

Response time p95 < 500ms — for user-facing endpoints
Database query performance — no queries taking > 1 second regularly
Connection pooling — for database connections (avoid connection storms)
CDN for static assets — CSS, JS, images served from edge locations
Caching layer — Redis/ElastiCache for frequently accessed data
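
Checking the p95 target needs a percentile, not an average. A minimal sketch, assuming you can export one response time in milliseconds per line (from access logs or an APM tool) and that the nearest-rank definition of a percentile is acceptable; monitoring tools may interpolate differently, so small differences are expected. The function name `p95` is our own.

```shell
# p95: read one latency value (ms) per line on stdin and print the 95th
# percentile using the nearest-rank method: sort ascending and take the
# value at position ceil(0.95 * n).
p95() {
  sort -n | awk '{ v[NR] = $1 } END {
    if (NR == 0) exit                    # no samples, print nothing
    rank = int((NR * 95 + 99) / 100)     # integer ceiling of 0.95 * NR
    print v[rank]
  }'
}

# Usage (hypothetical file name):
#   [ "$(p95 < latencies-ms.txt)" -lt 500 ] && echo "p95 within target"
```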

Infrastructure

Instance types match workload — compute-optimised for CPU, memory-optimised for caches
Storage IOPS adequate — gp3 with provisioned IOPS for database volumes
Network throughput — instance type supports required bandwidth
DNS TTL appropriate — not too high (delays failover) or too low (excess queries)

Domain 5: Operational Maturity

CI/CD

Automated testing in pipeline — unit, integration, and/or end-to-end tests
Pipeline runs on every PR — not just on merge
Security scanning in pipeline — SAST, dependency audit, container scan
Deployment requires approval — for production environments
Pipeline execution time < 15 minutes — slow pipelines reduce deployment frequency

Monitoring and Alerting

Metrics collection — CPU, memory, disk, network, application metrics
Dashboards — for each service showing key health indicators
Alerting — critical alerts go to PagerDuty/OpsGenie, not just email
On-call rotation — defined, documented, and followed
Runbooks — for common incidents (database failover, scaling, rollback)

Documentation

Architecture diagram — current, not aspirational
Network diagram — VPCs, subnets, peering, VPNs
Access documentation — who has access to what, and why
Incident response plan — what to do when production goes down

Domain 6: Compliance and Governance

Resource tagging policy — all resources tagged with environment, team, project
AWS Organizations — separate accounts for production, staging, development
Service Control Policies — prevent dangerous actions (deleting CloudTrail, etc.)
Budget alerts — alerts at 50%, 80%, and 100% of expected spend
Regular access reviews — quarterly review of who has access to what
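
The budget alert levels translate directly into absolute amounts once a monthly figure is agreed. A minimal sketch of that calculation; the function name and the example budget of 12000 are hypothetical, and wiring the amounts into an actual budgeting service is left out.

```shell
# budget_thresholds AMOUNT: print one "percent amount" line for each of the
# 50%, 80%, and 100% alert levels of the expected monthly spend AMOUNT.
budget_thresholds() {
  for pct in 50 80 100; do
    awk -v b="$1" -v p="$pct" 'BEGIN { printf "%d%% %.2f\n", p, b * p / 100 }'
  done
}

budget_thresholds 12000
# prints:
#   50% 6000.00
#   80% 9600.00
#   100% 12000.00
```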

How We Run the Audit

Day 1: Automated scanning

Run AWS Trusted Advisor, Security Hub, and Config
Export the last 14 days of CloudWatch metrics (or use data the team already collects)
Run Prowler for security benchmarking
Export Cost Explorer data for the last 3 months

Days 2-3: Manual review

Walk through each checklist domain
Interview the team on operational practices
Review IaC codebases (Terraform, CloudFormation)
Assess CI/CD pipelines and deployment processes

Day 4: Report

Score each domain (1-5)
Prioritise findings: critical → high → medium → low
Estimate cost savings and risk reduction
Deliver actionable recommendations with timelines
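
The prioritisation step is an ordered sort on severity. A small sketch, assuming findings are kept as tab-separated `severity<TAB>description` lines (a format we made up for illustration) and that `sort -s` (stable sort) is available, as it is in GNU and BSD sort.

```shell
# sort_findings: read "severity<TAB>description" lines on stdin and print
# them ordered critical, high, medium, low. Unknown severities sort last,
# and findings of equal severity keep their original order.
sort_findings() {
  awk -F'\t' 'BEGIN { r["critical"] = 1; r["high"] = 2; r["medium"] = 3; r["low"] = 4 }
    { k = ($1 in r) ? r[$1] : 9; print k "\t" $0 }' |
    sort -s -n -k1,1 |
    cut -f2-
}

# Usage (hypothetical file name):
#   sort_findings < findings.tsv
```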

Sample Audit Report Summary

Domain	Score	Key Findings
Security	2/5	3 security groups open to 0.0.0.0/0, root access keys active, no GuardDuty
Cost	2/5	45% of instances under 10% CPU, no spot usage, no lifecycle policies
Reliability	3/5	Single-AZ database, no tested backup restoration, basic health checks
Performance	3/5	CDN not configured, no caching layer, adequate response times
Operations	2/5	No IaC, manual deployments, minimal monitoring
Compliance	2/5	No tagging policy, single AWS account, no budget alerts
Overall	2.3/5	12 critical, 8 high, 15 medium findings

The average first-time audit scores 2-3 out of 5. That is normal. The goal is to identify the highest-impact improvements and prioritise them.
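
The overall figure in the sample table is consistent with an unweighted mean of the six domain scores: (2 + 2 + 3 + 3 + 2 + 2) / 6 = 2.33. A minimal sketch of that calculation, assuming equal weighting (a weighted scheme would need different code) and tab-separated `domain<TAB>score` input; the function name is our own.

```shell
# overall_score: read "domain<TAB>score" lines on stdin and print the
# unweighted mean of the scores, rounded to one decimal place.
overall_score() {
  awk -F'\t' '{ sum += $2; n++ } END { if (n > 0) printf "%.1f\n", sum / n }'
}

printf 'Security\t2\nCost\t2\nReliability\t3\nPerformance\t3\nOperations\t2\nCompliance\t2\n' |
  overall_score   # prints: 2.3
```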

Want a Professional Infrastructure Audit?

We audit cloud infrastructure across AWS, Azure, and GCP — identifying security risks, cost savings, and reliability gaps in one comprehensive review.

Our DevOps consulting services include:

4-day infrastructure audit — complete review across all 6 domains with actionable report
Cost savings identification — we typically find 30-50% waste
Security hardening — fix critical findings and implement ongoing monitoring
Remediation support — implement the recommendations, not just report them

Every audit starts with a no-obligation discovery call to understand your environment.

Request a free audit consultation →
