AWS Costs: The Most Common Money Pits and How to Find Them

The AWS bill rarely grows because someone makes an expensive wrong decision. It grows because a dozen small leaks belong to nobody. A NAT gateway here, a forgotten volume there, a fleet on On-Demand that has run the same load for two years. Each leak alone is too small to notice. Together they are the reason the bill is a little higher every month.

FinOps is not software you buy, it is a habit you establish. The native AWS tools are plenty to start with. This article is the map of the most common money pits, with concrete places to look and a checklist a team can walk through directly. Written for mid-sized companies running one to three accounts, not for a hyperscaler FinOps department.

What it is not: a guide to chasing every last decimal. Three or four categories usually account for most of the savings; the rest is fine-tuning. All prices are as of June 2026 for us-east-1; eu-central-1 differs slightly. For your own region, the AWS Pricing Calculator always wins.

The NAT Gateway, the Quiet Classic

The NAT gateway is the line item that shows up in almost every audit. It costs around 0.045 USD per hour just for being available, so about 33 USD per month per gateway. On top come 0.045 USD per processed GB, and on top of that the regular data transfer charges. In a multi-AZ setup with one gateway per availability zone, you are quickly at three gateways and roughly 99 USD per month before a single byte has flowed.
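The arithmetic above fits in a few lines of Python. This is a minimal sketch using the rates quoted in this section; the 730 hours per month and the function name `nat_monthly_cost` are my own assumptions, and the regular data transfer charges are left out.

```python
HOURS_PER_MONTH = 730  # assumption: average month length in hours


def nat_monthly_cost(gateways: int, processed_gb: float,
                     hourly_usd: float = 0.045,
                     per_gb_usd: float = 0.045) -> float:
    """Fixed hourly charge for every gateway plus the per-GB processing charge."""
    fixed = gateways * hourly_usd * HOURS_PER_MONTH
    variable = processed_gb * per_gb_usd
    return round(fixed + variable, 2)


# Three gateways (one per AZ) without any traffic
idle_cost = nat_monthly_cost(gateways=3, processed_gb=0)
# The same setup pushing 5,000 GB per month through the NAT
busy_cost = nat_monthly_cost(gateways=3, processed_gb=5000)
print(idle_cost, busy_cost)
```

The point of splitting fixed and variable cost: the fixed part is an architecture question (how many gateways), the variable part is a routing question (what flows through them).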

The most common avoidable mistake: traffic to S3 and DynamoDB runs through the NAT, even though a gateway VPC endpoint is free for exactly that. No hourly rate, no data charge. A gateway endpoint for S3 is a few lines of Terraform and takes what is perhaps the biggest chunk out of the NAT data charge:

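# Gateway endpoint for S3: no hourly rate, no data charge
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = aws_vpc.main.id
  service_name      = "com.amazonaws.eu-central-1.s3" # adjust the region to your VPC
  vpc_endpoint_type = "Gateway"
  # Adds the S3 route to every private route table, so this traffic bypasses the NAT
  route_table_ids   = aws_route_table.private[*].id
}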

For other AWS services there are interface endpoints over PrivateLink. Those carry an hourly charge and a data charge, but save the NAT data charge for chatty service calls, for example to ECR, Secrets Manager, or all the pull traffic of a container deployment. Since late 2025, there is also the Regional NAT Gateway, billed per AZ per hour, which shifts the math for some multi-AZ topologies.

Where to look: in Cost Explorer, filter by the usage type that contains "NatGateway-Bytes". To find which workloads fill the NAT, take the VPC Flow Logs and look for the top talkers.
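Finding the top talkers can be sketched as a small aggregation over flow log records. The sketch assumes the default flow log format (14 space-separated fields with `srcaddr`, `dstaddr`, and `bytes` in positions 4, 5, and 10) and that the records are already filtered to the NAT gateway's network interface; the function name `top_talkers` is hypothetical.

```python
from collections import Counter


def top_talkers(lines, n=5):
    """Sum bytes per (srcaddr, dstaddr) pair from default-format flow log records."""
    totals = Counter()
    for line in lines:
        fields = line.split()
        # The default format has 14 fields; the header line and NODATA/SKIPDATA
        # records carry no numeric byte count and are skipped
        if len(fields) != 14 or not fields[9].isdigit():
            continue
        src, dst, byte_count = fields[3], fields[4], int(fields[9])
        totals[(src, dst)] += byte_count
    return totals.most_common(n)
```

In practice the same grouping is usually done in Athena or CloudWatch Logs Insights; the Python version is just the logic in its smallest form. If the heaviest destinations turn out to be S3 addresses, the gateway endpoint is the fix.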

Idle and Orphaned Resources

Unattached Volumes and Old Snapshots

An EBS volume keeps costing money even when the EC2 instance it was attached to is long gone. 100 GB of gp3 is around 8 USD per month that nobody uses. In an account that has grown over a few years, you often find dozens of them. The sweep takes a minute:

aws ec2 describe-volumes \
  --filters Name=status,Values=available \
  --query "Volumes[].{ID:VolumeId,Size:Size,Type:VolumeType,Created:CreateTime}" \
  --output table

status=available means attached to nothing. Every volume in this list is a candidate for deletion, after a snapshot to be safe. Speaking of snapshots: they are incremental, but they pile up. Without a lifecycle policy via the Data Lifecycle Manager, the snapshot heap grows endlessly, because everyone creates them and nobody cleans up.
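To put a number on the sweep, the JSON output of `describe-volumes` can be priced in a few lines. This is a sketch under assumptions: it expects the full `--output json` response (with `Volumes`, `State`, `Size`, `VolumeType`), uses only the gp3 and gp2 prices quoted in this article, and reports every other volume type as unpriced instead of guessing.

```python
import json

# USD per GB-month; gp3 and gp2 are the figures quoted in this article,
# other volume types are deliberately left out and reported as unpriced
PRICE_PER_GB_MONTH = {"gp3": 0.08, "gp2": 0.10}


def orphaned_volume_cost(describe_volumes_json: str):
    """Return (monthly_usd, unpriced_volume_ids) for volumes in state 'available'."""
    volumes = json.loads(describe_volumes_json)["Volumes"]
    total, unpriced = 0.0, []
    for vol in volumes:
        if vol.get("State") != "available":
            continue  # still attached to an instance
        price = PRICE_PER_GB_MONTH.get(vol["VolumeType"])
        if price is None:
            unpriced.append(vol["VolumeId"])
        else:
            total += vol["Size"] * price
    return round(total, 2), unpriced
```

A monthly figure next to the list tends to speed up the deletion decision considerably.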

Load Balancers, Elastic IPs, and the IPv4 Price

An Application Load Balancer with no registered targets still costs the hourly fee plus its LCUs. Usually it is a relic from a shut-down project that nobody notices. The same goes for Elastic IPs, and here the math got worse: since February 1, 2024, every public IPv4 address costs 0.005 USD per hour, around 3.60 USD per month. And not just the unused ones, but every single one, including those on a running instance. On a larger fleet that is a real line item. Find unused EIPs like this:

aws ec2 describe-addresses \
  --query "Addresses[?AssociationId==null].{IP:PublicIp,AllocationId:AllocationId}" \
  --output table

Dev Environments Around the Clock

The biggest hidden line item in mid-sized companies is dev and staging environments running 24/7 even though people only touch them during work hours. A non-production fleet that sleeps in the evening and on weekends runs roughly 50 of 168 weekly hours. That cuts about 70 percent of the instance hours, and since storage keeps billing while instances are stopped, a saving of about two thirds on that part of the bill is realistic with a simple scheduler that shuts the instances and RDS databases down at night. Nobody needs a staging database at three in the morning.
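The 50-of-168 figure corresponds to a schedule of ten hours on five weekdays. The sketch below shows the decision a scheduler has to make and verifies the hour count; the 08:00 to 18:00 window and both function names are assumptions for illustration, and time zones are ignored.

```python
from datetime import datetime, timedelta


def should_run(now: datetime, start_hour: int = 8, end_hour: int = 18) -> bool:
    """True Monday to Friday from start_hour (inclusive) to end_hour (exclusive)."""
    return now.weekday() < 5 and start_hour <= now.hour < end_hour


def weekly_running_hours(start_hour: int = 8, end_hour: int = 18) -> int:
    """Count the hours of one full week in which should_run() is True."""
    monday = datetime(2024, 1, 1)  # this date is a Monday; any Monday works
    return sum(
        should_run(monday + timedelta(hours=h), start_hour, end_hour)
        for h in range(168)
    )


hours = weekly_running_hours()
saving = 1 - hours / 168  # share of instance hours that no longer run
print(hours, round(saving, 3))
```

A real scheduler would call the stop and start APIs based on this decision, typically triggered by a scheduled rule twice a day.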

Right-Sizing, the Biggest Lever

Right-sizing is the biggest lever and at the same time the most uncomfortable one, because it demands discipline instead of a one-time click. The pattern is the same everywhere: instances are sized generously at launch and never touched again. An m5.xlarge that has idled at 8 percent CPU for a year burns money every month without anyone seeing it.

Cost Optimization Hub and Compute Optimizer

AWS delivers the basic analysis for free. Compute Optimizer evaluates the actual utilization of EC2, RDS, and Lambda and suggests smaller or more modern instances. The Cost Optimization Hub bundles these recommendations across all accounts and regions into one view, with 18 recommendation types, including right-sizing, idle detection, Graviton migration, and Savings Plans and RI suggestions. You can set the lookback to 14, 32, or 93 days. I use the 93 days, because a short window hides every seasonal peak. Note that the 93-day window relies on Compute Optimizer's enhanced infrastructure metrics, which is a paid feature; the 14-day and 32-day windows are free.

The caution here: right-sizing needs real lookback data and headroom for load peaks, not the average of a quiet weekend. Optimize for the average and you produce the next incident.
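The difference between sizing for the peak and sizing for the average is easy to show. The following sketch is not how Compute Optimizer works internally; it is a simplified illustration under assumptions: CPU is the only constraint, the 99th percentile stands in for the peak, the new size should run at no more than 60 percent at that peak, and the list of vCPU sizes is hypothetical.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def recommend_vcpus(cpu_percent_samples, current_vcpus, pct=99,
                    target_peak_util=0.6, sizes=(2, 4, 8, 16, 32, 64)):
    """Smallest size on which the pct-th percentile load stays at or below the target."""
    peak = percentile(cpu_percent_samples, pct)  # percent of current capacity
    busy_vcpus = current_vcpus * peak / 100      # vCPUs actually busy at the peak
    needed = busy_vcpus / target_peak_util       # leave headroom above the peak
    for size in sizes:
        if size >= needed:
            return size
    return sizes[-1]


# A 4-vCPU instance: idle at 8 percent most of the time, 45 percent in 5 of 100 samples
samples = [8] * 95 + [45] * 5
print(recommend_vcpus(samples, current_vcpus=4))          # sized for the peak
print(recommend_vcpus(samples, current_vcpus=4, pct=50))  # sized for the median
```

With these sample values the peak-based call keeps the instance at its current size, while the median-based call would halve it and leave no room for the load spikes.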

Graviton Migration

The underrated part of the recommendations is the migration to Graviton. The ARM-based instances are 20 to 25 percent cheaper depending on the family, at equal or better performance, and the newer Graviton4 families deliver even more price-performance. For PHP, Node, and most web workloads the switch is low-risk: it is essentially an image rebuild for ARM. If you are on Fargate anyway, you flip the architecture in the task definition and build the image for ARM, done.

Steady State on On-Demand

On-Demand is the right choice for variable and unpredictable load. For a baseline that has run steadily for months, it is the most expensive option AWS offers. Running a constant base load entirely on On-Demand wastes between 30 and 70 percent, depending on the commitment.

Layering Savings Plans Correctly

Compute Savings Plans are the flexible default. You commit an hourly amount, say 10 USD per hour, and get a discount that applies across region, instance family, size, operating system, and even Fargate and Lambda. That flexibility is why I prefer them over the more rigid EC2 Instance Savings Plans and Reserved Instances, even though those offer a slightly higher discount. RIs only pay off for truly fixed workloads that are guaranteed not to change.

The rule of thumb is simple and important: cover the stable baseline with Savings Plans, let the peak run on On-Demand. Never commit 100 percent. Commit your entire capacity and you punish every load change and every architecture rework, because the commitment keeps running no matter what. Seventy to eighty percent of the baseline is a healthy value.
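The rule of thumb can be turned into a first estimate of the hourly commitment. This is a sketch with loud assumptions: the baseline is taken as roughly the 10th percentile of hourly On-Demand spend, the discount of 30 percent is a placeholder rather than an AWS figure, and the function name is hypothetical. A Savings Plan commitment is expressed in discounted dollars per hour, which is why the discount enters the calculation at all.

```python
def suggest_commitment(hourly_on_demand_usd, coverage=0.75, assumed_discount=0.30):
    """Hourly commitment that covers `coverage` of the stable baseline.

    hourly_on_demand_usd: non-empty list of hourly compute spend at On-Demand rates.
    assumed_discount: placeholder, replace with the rate of the plan you consider.
    """
    ordered = sorted(hourly_on_demand_usd)
    # Baseline: about 90 percent of all hours are at or above this spend level
    baseline = ordered[len(ordered) // 10]
    covered_on_demand = baseline * coverage
    # Convert On-Demand dollars into the discounted dollars the plan is bought in
    return round(covered_on_demand * (1 - assumed_discount), 2)


# One day: 16 quiet hours at 10 USD, 8 busy hours at 25 USD
day = [10.0] * 16 + [25.0] * 8
print(suggest_commitment(day))
```

The recommendations in Cost Explorer do this on real billing data; the sketch is useful for seeing how strongly the result depends on the coverage factor you choose.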

Spot for Interruptible Work

For interruptible workloads, Spot is up to 70 percent cheaper than On-Demand. That fits the queue workers from the Fargate article perfectly, along with batch jobs and CI runners. The only requirement is that the workloads tolerate an interruption, meaning they are idempotent and can resume their work after an abort. The web service with user traffic stays on On-Demand or Savings Plans.

Storage Waste

gp2 Instead of gp3

The lowest-hanging fruit in almost every account: volumes still running on gp2. gp3 costs 0.08 instead of 0.10 USD per GB-month, which is 20 percent less, and includes 3,000 IOPS and 125 MB/s of throughput at no extra charge, so often more performance than the old gp2. The migration runs without downtime, because AWS modifies the volume while it is in use:

aws ec2 modify-volume --volume-id vol-0abc123 --volume-type gp3

You find the gp2 candidates with the same describe-volumes, filtered on volume-type=gp2. There is rarely a reason not to do this immediately.
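One of the rare reasons to look twice is a large gp2 volume. gp2 scales its baseline with size (3 IOPS per GB, at least 100, at most 16,000), so above 1,000 GB it exceeds the 3,000 IOPS that gp3 includes, and matching it means provisioning extra gp3 IOPS at additional cost. The sketch below flags those volumes; the function names are hypothetical, the prices are the ones from this section, and throughput is not modeled.

```python
def gp2_baseline_iops(size_gb: int) -> int:
    """gp2 baseline: 3 IOPS per GB, at least 100, at most 16,000."""
    return max(100, min(16_000, 3 * size_gb))


def gp3_migration(size_gb: int, gp2_price: float = 0.10, gp3_price: float = 0.08,
                  gp3_included_iops: int = 3_000) -> dict:
    """Monthly storage saving and whether gp3 needs extra provisioned IOPS."""
    saving = round(size_gb * (gp2_price - gp3_price), 2)
    needs_extra_iops = gp2_baseline_iops(size_gb) > gp3_included_iops
    return {"monthly_saving_usd": saving, "needs_extra_iops": needs_extra_iops}


print(gp3_migration(500))    # typical volume: plain switch
print(gp3_migration(2000))   # large volume: check the IOPS before switching
```

For flagged volumes, check the actual IOPS in CloudWatch first; many large volumes never come close to their gp2 baseline.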

S3 Without Lifecycle

Data nobody reads anymore often sits on S3 Standard and costs full price. Lifecycle rules move it automatically to Infrequent Access and later to Glacier, or you use S3 Intelligent-Tiering, which picks the class based on the access pattern itself. One quiet line item almost everyone has is aborted multipart uploads: fragments that were never finished, never cleaned up, and still cost storage. One lifecycle rule handles both:

{
  "Rules": [
    {
      "ID": "tiering-and-cleanup",
      "Status": "Enabled",
      "Filter": {},
      "Transitions": [
        { "Days": 30, "StorageClass": "STANDARD_IA" },
        { "Days": 90, "StorageClass": "GLACIER" }
      ],
      "AbortIncompleteMultipartUpload": { "DaysAfterInitiation": 7 }
    }
  ]
}

Data Transfer, the Invisible Line Item

Data transfer is the line item almost nobody watches, because it shows up in no architecture discussion. Internet egress is tiered: the first 100 GB per month are free, aggregated across all services, after which the next 10 TB cost 0.09 USD per GB and the rate falls with volume.
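Tiered pricing is easiest to understand as a loop over the tiers. In this sketch only the free 100 GB and the 0.09 USD rate come from the text; the width of the first paid tier (10 TB, counted as 10,240 GB) and all higher tiers are assumptions that should be checked against the current price list for your region.

```python
# (tier width in GB, USD per GB)
EGRESS_TIERS = [
    (100, 0.0),            # free allowance, from the text
    (10_240, 0.09),        # rate from the text, width assumed to be 10 TB
    (40_960, 0.085),       # assumption, verify against the price list
    (102_400, 0.07),       # assumption, verify against the price list
    (float("inf"), 0.05),  # assumption, verify against the price list
]


def egress_cost(total_gb: float) -> float:
    """Walk through the tiers and charge each slice of traffic at its own rate."""
    remaining, cost = total_gb, 0.0
    for tier_gb, usd_per_gb in EGRESS_TIERS:
        in_tier = min(remaining, tier_gb)
        cost += in_tier * usd_per_gb
        remaining -= in_tier
        if remaining <= 0:
            break
    return round(cost, 2)


print(egress_cost(1_100))  # 100 GB free, 1,000 GB at the first paid rate
```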

The more interesting line item is internal traffic. Inter-AZ traffic costs 0.01 USD per GB, and that is per direction. A chatty multi-AZ app whose services constantly talk to each other across availability zones pays that in both directions. It sounds like little, but on data-heavy workloads it adds up to a surprisingly large block. Cross-region replication, often switched on for convenience, is more expensive still. And NAT egress is a double hit: traffic through the NAT pays the NAT data charge plus the egress charge.
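Both effects are simple multiplications, which is exactly why they get overlooked. The sketch reads "per direction" as: every GB that crosses an AZ boundary is billed once on the sending and once on the receiving side. That reading, the function names, and ignoring the free egress allowance and higher egress tiers are assumptions of this example.

```python
def inter_az_cost(gb_crossing_az: float, usd_per_gb_per_direction: float = 0.01) -> float:
    """Each GB crossing an AZ boundary is charged on both sides."""
    return round(gb_crossing_az * usd_per_gb_per_direction * 2, 2)


def nat_internet_cost(gb: float, nat_per_gb: float = 0.045,
                      egress_per_gb: float = 0.09) -> float:
    """The double hit: NAT processing charge plus internet egress on the same bytes."""
    return round(gb * (nat_per_gb + egress_per_gb), 2)


# 1,000 GB of service-to-service chatter across AZs per month
print(inter_az_cost(1_000))
# 1,000 GB leaving to the internet through the NAT
print(nat_internet_cost(1_000))
```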

Where to look: in Cost Explorer, group by usage type and search for the items that contain "DataTransfer". The numbers are often higher than expected, especially for apps with a lot of internal communication.

Observability That Eats Itself

Observability costs money, and past a certain point it eats itself. CloudWatch charges for the ingestion and storage of logs, and debug logs in production with unlimited retention are a quiet permanent line item. Every custom metric costs too, and a sprawling metric strategy adds up faster than you think.

The right balance is enough observability to see incidents, but not so much that watching becomes more expensive than what is being watched. A reasonable log retention, debug level only temporarily, and a look at the number of custom metrics usually do it. This point gets more room in the upcoming article on observability in production.

Tagging, the Most Important Unglamorous Measure

You cannot optimize what you cannot attribute. Without consistent cost allocation tags, every optimization is guesswork, because the bill stays one big number instead of breaking down by team, project, and environment. Tagging is the most boring measure in this article and the most important.

In practice that means: activate cost allocation tags in the billing console, define a lean tag policy (Environment, Team, Project, Owner are enough to start), and enforce tagging via a tag policy or service control policy, so nobody creates untagged resources. Only then do Cost Explorer and Budgets show costs per team, and the lump-sum bill becomes a manageable view.
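Before enforcing anything, it helps to know how far the account is from the policy. A compliance check is a few lines once the resources are exported. This sketch assumes each resource is a dict with a hypothetical `Id` field and a `Tags` list of `Key`/`Value` pairs, the shape the EC2 API uses; the required keys are the four from the paragraph above.

```python
REQUIRED_TAGS = ("Environment", "Team", "Project", "Owner")


def missing_tags(resources, required=REQUIRED_TAGS):
    """Map resource id to the required tag keys that are absent or empty."""
    report = {}
    for res in resources:
        tags = {t["Key"]: t.get("Value", "") for t in res.get("Tags", [])}
        missing = [key for key in required if not tags.get(key, "").strip()]
        if missing:
            report[res["Id"]] = missing
    return report
```

An empty report is the precondition for per-team numbers in Cost Explorer; everything that shows up here lands in the untagged remainder of the bill.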

The actual point is organizational, not technical: a money pit without an owner never gets fixed. Tagging creates responsibility, not just a report. When a team sees its own costs, behavior changes on its own.

The Audit Checklist

The following checklist covers the eight categories. A team can walk through it in an afternoon and ends up with a list of concrete actions, sorted by effort.

Category	Check	Action
NAT Gateway	Does S3 or DynamoDB traffic run through the NAT?	Create gateway VPC endpoints
NAT Gateway	How many NAT gateways, how high the data charge?	Interface endpoints for chatty services
Idle	Volumes with status=available?	Snapshot, then delete
Idle	Snapshots without a lifecycle policy?	Set up Data Lifecycle Manager
Idle	Load balancers without targets?	Shut down
Idle	Unused or surplus Elastic IPs?	Release (0.005 USD/h per IP)
Idle	Dev/staging 24/7?	Night and weekend scheduler
Right-Sizing	Cost Optimization Hub enabled?	Enable, 93-day lookback
Right-Sizing	Oversized EC2/RDS?	Apply recommendations, with headroom
Right-Sizing	x86 instead of Graviton?	Migrate to ARM (20 to 25 percent)
Savings Plans	How much of the baseline is committed?	Compute Savings Plans at 70 to 80 percent
Savings Plans	Interruptible workloads on On-Demand?	Move to Spot
Storage	gp2 volumes present?	Modify to gp3 (no downtime)
Storage	S3 without lifecycle?	Tiering and cleanup rules
Data Transfer	High inter-AZ or cross-region costs?	Review data flows, keep AZ-local
Observability	Log retention and custom metrics?	Set retention, thin out metrics
Tagging	Cost allocation tags active and enforced?	Tag policy plus SCP

That is the free version. The detailed one is available as the AWS Cost-Optimization Audit Kit: the checklist as a PDF with a 60-minute guide plus a spreadsheet calculator that projects the concrete saving per category.

How an Audit Runs

The order is the difference between an audit and poking around. First Cost Explorer for the overview, where the money actually goes. Then the Cost Optimization Hub for the quantified recommendations. Then the tags, to attribute the costs. Then the CLI sweeps for the orphaned resources that show up in no recommendation. And only then do you prioritize by impact and effort.

The quick wins first: gp2 to gp3, unused Elastic IPs, idle load balancers, orphaned volumes. Little effort, immediate effect, and they build the confidence for the larger steps. The hard ones after that: right-sizing, the Savings Plans strategy, the NAT architecture. More effort, but also the larger share of the savings.
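The ordering described above can be sketched as a sort over the findings. Everything here is illustrative: the `Finding` type, the two-hour threshold for a quick win, and ranking the harder items by saving per hour of effort are assumptions, not a fixed method.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    name: str
    monthly_saving_usd: float
    effort_hours: float


def prioritize(findings, quick_win_hours: float = 2.0):
    """Quick wins first (largest saving first), then the rest by saving per effort hour."""
    quick = [f for f in findings if f.effort_hours <= quick_win_hours]
    hard = [f for f in findings if f.effort_hours > quick_win_hours]
    quick.sort(key=lambda f: f.monthly_saving_usd, reverse=True)
    hard.sort(key=lambda f: f.monthly_saving_usd / f.effort_hours, reverse=True)
    return quick + hard
```

The numbers fed into it come from the earlier steps: Cost Explorer and the Cost Optimization Hub for the savings, the team's own estimate for the effort.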

One more thing that regularly goes wrong: cleaning up once is not enough. The bill grows again the moment the next team forgets an instance. A quarterly review keeps it clean, and with tags and an enabled Cost Optimization Hub, the second pass takes only a fraction of the time.

What Management Should Ask

You do not need to be an AWS architect to ask the right questions. Three are enough to tell whether costs are managed at all. What percentage of our compute bill is covered by Savings Plans? When did the last right-sizing happen? Do we have tags that show us costs per team? If the answers are "don't know", "never", and "no", the savings are almost certainly in the double-digit percent range.

The 80/20 rule applies here too. Three or four categories usually make up the bulk, and it is not worth having the team chase the last few percent for weeks. When someone external makes sense depends on two things: whether anyone internal has the time and the FinOps knowledge, and how large the bill is. At a five-figure monthly bill, a targeted audit usually pays for itself many times over, simply because the items found would otherwise keep running month after month.

Anti-Patterns

Optimizing without tags. Guessing instead of measuring. Without attribution, nobody knows which line item belongs to which team, and nothing ever gets fixed.

Committing 100 percent to Savings Plans. Takes away the flexibility and punishes every load change. The peak belongs on On-Demand and Spot.

Right-sizing to the average. Ignores load peaks and produces the next incident. Always with headroom.

Cleaning up once and never again. The bill grows again immediately. FinOps is a quarterly rhythm, not a project.

Dev environments 24/7. The biggest avoidable line item in mid-sized companies. A scheduler saves roughly two thirds here.

NAT for S3 traffic. Pays per GB for what would be free over a gateway endpoint.

Buying cost tooling before doing the homework. The native AWS tools are enough to start. A third-party tool only pays off once tagging and basic discipline are in place.

Conclusion

The AWS bill is not fate, it is the sum of many small, fixable decisions. Most of the money pits in this article are findable in an afternoon and half fixed in a week. The NAT swallowing unnecessary S3 traffic, the fleet without Savings Plans, the gp2 volumes, the orphaned resources: all concrete, all measurable, all fixable.

What remains is the habit. Tags, so you can see where the money flows. A quarterly review, so the bill does not run away again. Savings Plans on the baseline, so the base load does not run at full price. That is less work than most people think, and it is the difference between a bill you understand and one that simply grows higher every month.

Is your AWS bill growing faster than your load, and nobody has the time to look into it? Contact me for a technical FinOps audit that finds the money pits in two to three days and prioritizes them by effort and impact.

The self-service version is ready for download: the AWS Cost-Optimization Audit Kit with the checklist, the 60-minute guide, and the spreadsheet calculator. It is free and available via the form below this article or on the resources page.