AWS Bill Shock Triage Checklist
Short answer: Use this before assuming AWS itself is the wrong placement.
- This is a decision checklist, not a final price quote.
- Verify final numbers against provider pricing pages and your own bill or quote.
First 30 minutes
Use This Before Debating Migration
A bill spike needs triage before architecture conclusions. Find the recurring driver first.
Copy the triage rows
Paste service, region, account, current spend, baseline spend, and owner into a sheet.
Rank the deltas
Start with the largest recurring month-over-month change, not the loudest complaint.
Classify the driver
Separate compute, network, storage, observability, managed service, support, marketplace, and one-time events.
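The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not a tool: the `TriageRow` class, the `rank_by_delta` helper, and every dollar figure are hypothetical placeholders, not values from a real bill.

```python
from dataclasses import dataclass


@dataclass
class TriageRow:
    """One triage row; field names mirror the sheet columns (hypothetical)."""
    service: str
    region: str
    account: str
    current_spend: float
    baseline_spend: float
    owner: str

    @property
    def delta(self) -> float:
        # Month-over-month change for this line item.
        return self.current_spend - self.baseline_spend


def rank_by_delta(rows):
    """Return rows sorted so the largest increase comes first."""
    return sorted(rows, key=lambda row: row.delta, reverse=True)


# Placeholder numbers only.
rows = [
    TriageRow("CloudWatch", "us-east-1", "production", 1100.0, 800.0, "Observability team"),
    TriageRow("NAT Gateway", "us-east-1", "production", 4200.0, 900.0, "Platform team"),
    TriageRow("S3", "us-west-2", "staging", 650.0, 700.0, "Data team"),
]

ranked = rank_by_delta(rows)
for row in ranked:
    print(f"{row.service:<12} {row.region:<10} delta={row.delta:+.2f} owner={row.owner}")
```

Sorting by delta rather than by total spend keeps attention on what changed, which is the point of the ranking step.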
Filled example
Example: Network Spike Triage
Hypothetical first-pass triage, not a pricing claim.
| Input | Hypothetical value |
|---|---|
| Largest delta | NAT Gateway line item increased from a normal baseline. |
| Likely driver | Routing or traffic path changed after a deployment. |
| Next move | Check route tables, processed bytes, deployment history, and workload owner before migration planning. |
What it flags: The useful first answer is the cost driver and owner, not whether the whole cloud account should move.
Use this when
- An AWS bill jumped and the team does not know which line item caused it.
- People are arguing about migration before the recurring driver is isolated.
- Networking, logs, storage, managed databases, or idle resources may explain the surprise.
Not for
- A full FinOps program or chargeback model.
- Replacing Cost Explorer, CUR, or detailed account-level analysis.
- Declaring AWS too expensive before the bill delta is understood.
Bill shock triage
Do not start with migration. Start with the delta.
Most surprise bills become less mysterious when the biggest month-over-month change is isolated.
Compare this month to the last normal month by service and region.
Separate compute, networking, storage, observability, and managed services.
Look for traffic change, architecture change, retention change, or idle resource.
Delete, resize, re-architect, commit, or migrate only after the driver is known.
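If the starting point is raw line items rather than a filled sheet, the month comparison can be sketched as a group-by on service and region. The tuple layout, month labels, and costs below are assumptions for illustration; a real export from Cost Explorer or CUR will have different columns.

```python
from collections import defaultdict

# Hypothetical line items: (month, service, region, cost in USD).
LINE_ITEMS = [
    ("2024-04", "NAT Gateway", "us-east-1", 500.0),
    ("2024-04", "NAT Gateway", "us-east-1", 400.0),
    ("2024-04", "CloudWatch", "us-east-1", 800.0),
    ("2024-05", "NAT Gateway", "us-east-1", 4200.0),
    ("2024-05", "CloudWatch", "us-east-1", 1100.0),
    ("2024-05", "S3", "us-west-2", 120.0),
]


def totals_by_key(items, month):
    """Sum cost per (service, region) for a single month."""
    totals = defaultdict(float)
    for item_month, service, region, cost in items:
        if item_month == month:
            totals[(service, region)] += cost
    return totals


def deltas(items, baseline_month, current_month):
    """Current minus baseline per (service, region); a missing side counts as 0."""
    baseline = totals_by_key(items, baseline_month)
    current = totals_by_key(items, current_month)
    keys = set(baseline) | set(current)
    return {key: current.get(key, 0.0) - baseline.get(key, 0.0) for key in keys}


result = deltas(LINE_ITEMS, "2024-04", "2024-05")
for (service, region), delta in sorted(result.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{service:<12} {region:<10} {delta:+.2f}")
```

Treating a missing month as zero makes brand-new line items show up as a full delta instead of being dropped.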
Worksheet Fields
Use this as the working version before copying the decision into a doc, ticket, or vendor email.
| Field | Capture | Why it matters |
|---|---|---|
| Baseline | Last normal month, current month, service, region, account, owner. | Separates a real trend from a one-off event. |
| Driver class | Compute, network, storage, observability, managed service, support, marketplace. | Keeps the triage from becoming vague cloud blame. |
| Change event | Traffic, logging, retention, deployment, data path, backup, scale setting. | Connects the bill to something that actually changed. |
| Action | Delete, resize, cap, alert, add endpoint, change retention, commit, migrate. | Turns surprise into a concrete next move. |
Triage-ready
Copy Into A Triage Sheet
Copy this table into a sheet when an AWS bill jumps. The goal is to isolate the recurring driver before debating migration.
| Field | What to enter | Hypothetical example | Why it matters |
|---|---|---|---|
| Service | AWS service with the biggest month-over-month delta | NAT Gateway | Starts with the line item, not general cloud blame. |
| Region | Region where the spend changed | us-east-1 | Finds regional routing or workload changes. |
| Account or project | Account, environment, or owner group | production account | Helps find who can explain the change. |
| Tag or workload | Tag, service name, app, or workload tied to the cost | batch workers | Connects spend to ownership. |
| Current spend | Current month spend for the line item | $4,200 placeholder | Shows the size of the issue. |
| Baseline spend | Last normal month for the same line item | $900 placeholder | Separates a spike from normal run rate. |
| Delta | Current spend minus baseline spend | $3,300 placeholder | Ranks what to investigate first. |
| Likely driver | Compute, network, storage, observability, managed service, support, or marketplace | Network path change | Keeps triage specific. |
| Change event | Deployment, traffic, logging, retention, backup, route, or scale change | Private subnet routing changed | Links bill movement to a cause. |
| Owner | Person or team responsible for the workload or setting | Platform team | Makes the next check actionable. |
| Next check | Delete, resize, cap, alert, route change, retention change, or deeper analysis | Check route tables and NAT processed bytes | Turns surprise into a concrete step. |
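For teams that prefer to generate the row rather than type it, the fields above can be written as tab-separated text with Python's standard `csv` module. This is a sketch under assumptions: the `to_tsv` helper is hypothetical, and the values are the same placeholders used in the example column.

```python
import csv
import io

# Column order follows the triage fields listed above.
FIELDS = [
    "Service", "Region", "Account or project", "Tag or workload",
    "Current spend", "Baseline spend", "Delta", "Likely driver",
    "Change event", "Owner", "Next check",
]


def to_tsv(rows):
    """Serialize triage rows as tab-separated text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Placeholder values only, not real spend.
row = {
    "Service": "NAT Gateway",
    "Region": "us-east-1",
    "Account or project": "production account",
    "Tag or workload": "batch workers",
    "Current spend": 4200,
    "Baseline spend": 900,
    "Likely driver": "Network path change",
    "Change event": "Private subnet routing changed",
    "Owner": "Platform team",
    "Next check": "Check route tables and NAT processed bytes",
}
# Compute the delta instead of typing it, so it cannot drift from the inputs.
row["Delta"] = row["Current spend"] - row["Baseline spend"]

tsv = to_tsv([row])
print(tsv)
```

Most spreadsheet tools split pasted tab-separated text into columns, so the output can go straight into a triage sheet.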
AI prompt
Prompt To Triage A Cloud Bill Spike
Use this with bill exports, Cost Explorer notes, or manually copied line items. It should classify the driver before recommending migration.
You are helping me triage a cloud bill spike. Do not assume provider pricing beyond the line items I provide. Do not recommend migration until the recurring cost driver is classified.
Here are the bill details:
[Paste service, region, account, current spend, baseline spend, tags, and known changes here]
Please:
1. Identify the largest recurring month-over-month delta.
2. Classify the likely driver as compute, network, storage, observability, managed service, support, marketplace, or one-time event.
3. List the most likely change events that could explain the delta.
4. Recommend the next checks before changing providers.
5. Separate quick configuration fixes from architecture changes and migration questions.
6. Label unknowns and avoid unsupported pricing, benchmark, or provider-ranking claims.
Short Answer
- Most AWS bill shock needs a line-item triage before a migration decision.
- Start with the biggest delta from the previous normal month.
- A high bill is not enough evidence that AWS is the wrong placement; it may be one architecture decision, one logging change, or one idle resource class.
Line Items To Check First
- NAT Gateway processing and hourly charges.
- Cross-AZ, inter-region, and internet data transfer.
- CloudWatch logs, metrics, retention, and custom metrics.
- S3 storage class, requests, lifecycle gaps, retrieval, replication, and transfer.
- Idle EC2, overprovisioned instances, unattached EBS volumes, and forgotten load balancers.
- Managed databases, snapshots, backups, and provisioned throughput.
- Support plan changes, marketplace products, and one-off service usage.
Triage Table
| If this jumped | Inspect |
|---|---|
| Networking | NAT, cross-AZ paths, egress, region movement, and load balancer traffic. |
| Observability | Log volume, retention, custom metrics, and debug logging. |
| Storage | Request volume, lifecycle rules, replication, retrieval, and snapshots. |
| Compute | Idle capacity, autoscaling, instance families, commitments, and GPU usage. |
| Managed services | Provisioned capacity, backups, replicas, and default settings. |
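The triage guidance above maps naturally to a lookup. The sketch below is one way to encode it: the checks come from this page, but the keyword-to-driver mapping and the `classify` and `next_checks` helpers are hypothetical and need adjusting to the labels on your own bill.

```python
# Checks per driver class, taken from the triage guidance above.
CHECKS_BY_DRIVER = {
    "network": ["NAT", "cross-AZ paths", "egress", "region movement", "load balancer traffic"],
    "observability": ["log volume", "retention", "custom metrics", "debug logging"],
    "storage": ["request volume", "lifecycle rules", "replication", "retrieval", "snapshots"],
    "compute": ["idle capacity", "autoscaling", "instance families", "commitments", "GPU usage"],
    "managed service": ["provisioned capacity", "backups", "replicas", "default settings"],
}

# Hypothetical keyword mapping (an assumption, not an AWS taxonomy).
# Order matters: the first matching keyword wins.
DRIVER_KEYWORDS = {
    "nat gateway": "network",
    "data transfer": "network",
    "cloudwatch": "observability",
    "s3": "storage",
    "ebs": "storage",
    "ec2": "compute",
    "rds": "managed service",
}


def classify(line_item: str) -> str:
    """Return a driver class for a line-item label, or 'unclassified'."""
    name = line_item.lower()
    for keyword, driver in DRIVER_KEYWORDS.items():
        if keyword in name:
            return driver
    return "unclassified"


def next_checks(line_item: str):
    """Return the checks for a line item's driver class."""
    return CHECKS_BY_DRIVER.get(classify(line_item), ["classify manually"])


for label in ["NAT Gateway", "Amazon CloudWatch", "Marketplace product"]:
    print(label, "->", classify(label), "->", next_checks(label))
```

Keyword matching is deliberately crude. Anything that lands in `unclassified` is a prompt to ask the workload owner, not a result.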
Rough Math
- Monthly surprise = current monthly line item - previous normal baseline.
- Repeatable surprise = line-item delta expected to recur next month.
- Fix payback (in months) = one-time engineering cost / monthly savings.
- If one repeatable line item explains most of the jump, fix that before changing providers.
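A worked version of the rough math, using placeholder numbers only. The function names, the assumption that the whole delta recurs, the total bill jump, and the engineering cost are all hypothetical inputs to replace with your own.

```python
def monthly_surprise(current: float, baseline: float) -> float:
    """Current monthly line item minus the previous normal baseline."""
    return current - baseline


def fix_payback_months(engineering_cost: float, monthly_savings: float) -> float:
    """Months until a one-time fix pays for itself."""
    if monthly_savings <= 0:
        raise ValueError("monthly_savings must be positive")
    return engineering_cost / monthly_savings


# Placeholder inputs, not real spend.
current, baseline = 4200.0, 900.0
surprise = monthly_surprise(current, baseline)

# Assumption: the entire delta is expected to recur next month.
recurring_fraction = 1.0
repeatable = surprise * recurring_fraction

# Assumption: the whole bill rose by this much month over month.
total_bill_jump = 4000.0
share_of_jump = repeatable / total_bill_jump

# Assumption: the fix costs this much in engineering time and removes the delta.
payback = fix_payback_months(6600.0, repeatable)

print(f"surprise={surprise:.2f} repeatable={repeatable:.2f}")
print(f"share of total jump={share_of_jump:.1%} payback={payback:.1f} months")
```

With these inputs a single line item explains most of the jump and the fix pays back in a couple of months, which is the case where fixing in place comes before any provider change.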
Questions To Ask Internally
- What changed in traffic, logging, data volume, architecture, or retention?
- Did a private subnet path start routing through NAT unexpectedly?
- Did a debug flag or log level stay on?
- Did data start crossing zones or regions?
- Did a test workload become always-on?
- Can this be capped, alerted, or deleted today?
Red Flags
- Private subnet traffic going through NAT by default.
- Debug logs retained like production audit logs.
- Data movement costs estimated only after the architecture was chosen, not before.
- No owner for old resources.
- Dashboards that show total spend but not the delta driver.
When To Use The Quiz
- Use the RunPlacement quiz after identifying whether the bill is mostly compute, networking, storage, observability, managed services, or GPU.
- The quiz helps decide whether to optimize AWS, move the workload, or choose a simpler category.
FAQ
What should I check first after an AWS bill spike?
After an AWS bill spike, check the largest month-over-month delta by service, region, and account. Then classify the driver as compute, network, storage, observability, managed service, support, marketplace, or a one-time event. The first fix should target the recurring driver, not the total bill in isolation.
Can NAT Gateway cause AWS bill shock?
Yes. NAT Gateway is billed both per hour and per gigabyte of data processed, so the bill can jump when private subnet traffic takes an unexpected path. Verify current AWS pricing pages before estimating the amount, then inspect routes, endpoints, cross-AZ paths, and high-volume workloads.
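The two-part charge can be sketched as simple arithmetic. The rates below are placeholders, not AWS pricing, and the `nat_gateway_monthly_cost` function is hypothetical; take real rates from the VPC pricing page for your region. The sketch also ignores any separate data transfer charges.

```python
def nat_gateway_monthly_cost(gateway_hours: float, gb_processed: float,
                             hourly_rate: float, per_gb_rate: float) -> float:
    """Hourly charge plus data processing charge for one month."""
    return gateway_hours * hourly_rate + gb_processed * per_gb_rate


# PLACEHOLDER rates, not AWS pricing. Replace with current published rates.
HOURLY_RATE = 0.05
PER_GB_RATE = 0.05

# One gateway running all month (about 730 hours), before and after a traffic change.
normal_month = nat_gateway_monthly_cost(730, 1_000, HOURLY_RATE, PER_GB_RATE)
spike_month = nat_gateway_monthly_cost(730, 20_000, HOURLY_RATE, PER_GB_RATE)

print(f"normal month: {normal_month:.2f}")
print(f"spike month:  {spike_month:.2f}")
print(f"delta:        {spike_month - normal_month:.2f}")
```

The hourly part is identical in both months, so in this sketch the entire delta comes from processed data. That is why processed bytes and route tables are the first things to check.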
Should I migrate away from AWS after one bad bill?
Usually no. Wait until the increase is understood. First decide whether the spike is recurring, fixable in place, or caused by a single architecture, retention, logging, routing, or idle-resource decision. Treat migration as a payback decision: weigh the cost of moving against the recurring monthly savings.
Sources
- https://docs.aws.amazon.com/cur/latest/userguide/what-is-cur.html
- https://aws.amazon.com/vpc/pricing/
- https://aws.amazon.com/cloudwatch/pricing/
- https://aws.amazon.com/s3/pricing/
- https://aws.amazon.com/aws-cost-management/aws-cost-and-usage-reporting/
- https://docs.aws.amazon.com/cost-management/latest/userguide/dashboards.html