That Time Our Kubernetes Auto-Scaling Cost Us $15,000 in One Weekend

(And how we stopped it from happening again)

The Panic Sets In

I was brushing my teeth on a Sunday morning when my phone started blowing up:

Slack Alert from AWS: “Your monthly spend has exceeded 80% of budget”
Finance Team: “Why is there a $5,000 charge from AWS this morning?!”

Turns out, our “smart” Kubernetes auto-scaling had gone completely rogue. What we thought was a minor config tweak on Friday afternoon had spun up 87 copies of a service that normally runs 3 pods — burning cash faster than a crypto startup at a Vegas conference.

Here’s what happened (with screenshots from our actual incident report), and how we fixed it.

The Perfect Storm of Bad Decisions

We were running a Java service that processes background jobs. It worked fine for months… until we tried to “optimize” it.

Mistake #1: The Overconfident HPA Config

Our HorizontalPodAutoscaler looked reasonable at first glance:

# What we HOPED would happen:
# "Gently scale between 2-20 pods based on CPU"
# What ACTUALLY happened:
# "SCALE ALL THE THINGS!!"  
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
spec:
  minReplicas: 2
  maxReplicas: 100  # 🤦 WHY did we think this was okay?!
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization       # required in autoscaling/v2
        averageUtilization: 50  # Aggressive AF

Mistake #2: No Safeguards

  • No PodDisruptionBudget: When scaling down, Kubernetes murdered pods like Game of Thrones characters (a minimal PDB sketch follows this list)
  • No AWS billing alerts: We found out from Finance, not our monitoring
  • Bad metric: CPU was a terrible scaling signal for this workload
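
For the record, a minimal PodDisruptionBudget looks like the sketch below (the names and labels are illustrative, not from our manifests); it tells Kubernetes how many pods must stay up during voluntary disruptions like node drains:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: worker-service-pdb    # illustrative name
spec:
  minAvailable: 2             # keep at least 2 workers alive during drains
  selector:
    matchLabels:
      app: worker-service     # assumed label, match it to your Deployment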

How We Debugged The Madness

1. The “Oh Sh*t” Moment

Ran kubectl top pods and saw:

NAME                          CPU(cores)  
worker-service-abc123         5m          # Basically idle  
worker-service-def456         7m  
... (87 more lines of this nonsense)

2. The Root Cause

  • Our Java app had brief CPU spikes during GC
  • HPA saw this as “OMG WE NEED MORE PODS”
  • Each new pod caused more GC spikes, creating a feedback loop (the back-of-the-envelope math after this list shows how fast that compounds)
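
For context on why this snowballs: the HPA's core formula is desiredReplicas = ceil(currentReplicas × currentUtilization / targetUtilization). Take 3 pods briefly reporting 90% CPU against our 50% target (90% is an illustrative number, not from our dashboards): that's ceil(3 × 90 / 50) = 6 pods. If JVM warm-up and GC on the new pods push utilization back up by the next evaluation, 6 becomes 11, 11 becomes 20, and the loop only stops when it slams into maxReplicas.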

(Here’s an actual screenshot from our Prometheus dashboard showing the insanity:)
[Image: CPU % spiking like a heartbeat gone wrong]

How We Fixed It (For Real This Time)

1. We Stopped Using CPU for Scaling

Switched to Kafka queue depth metrics (since this was a queue worker):

metrics:
- type: External  # Now scales based on actual work
  external:
    metric:
      name: kafka_messages_behind
    target:
      type: AverageValue  # required alongside averageValue
      averageValue: 1000
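
Worth flagging: kafka_messages_behind isn't a built-in metric. The External type only works when something is serving the external metrics API (typically an adapter like prometheus-adapter or KEDA), so that piece has to be running before this config does anything.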

2. Added Scaling Friction

maxReplicas: 10  # Hard ceiling  
minReplicas: 1   # Let it breathe
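
Beyond the hard cap, autoscaling/v2 supports a behavior block under the HPA's spec for adding friction explicitly. This is a sketch with illustrative values, not our exact config:

behavior:
  scaleUp:
    stabilizationWindowSeconds: 60   # smooth out sub-minute spikes before adding pods
    policies:
    - type: Pods
      value: 2                       # add at most 2 pods per period
      periodSeconds: 60
  scaleDown:
    stabilizationWindowSeconds: 300  # wait 5 minutes before removing pods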

3. Cost Controls That Actually Work

  • AWS Budget Alert: Now emails AND Slack at $500 increments
  • Cluster Autoscaler Settings:
--scale-down-unneeded-time=15m         # Don't react too fast
--skip-nodes-with-local-storage=true   # Protect stateful stuff

Lessons Learned (The Hard Way)

  1. Auto-scaling isn’t “set and forget” — It’s more like a pet than cattle
  2. Test scaling changes on Friday? Bad idea. Deploy scaling tweaks on Tuesday mornings
  3. Finance teams make great alert systems (but they won’t be happy about it)

Pro Tip: Run kubectl get hpa -w in a terminal before deploying HPA changes. Seeing numbers jump in real-time is terrifying… but less terrifying than an AWS bill.
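
A related habit: kubectl describe hpa on the same object shows an Events section with every scaling decision and the metric reading that triggered it, which makes reconstructing an incident like this far less painful.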

Your Turn

Ever had Kubernetes auto-scaling backfire? Reply with your disaster story — I’ll buy you a coffee if yours was more expensive than ours.
