AI Costs Increasing? A Triage Checklist Before You Migrate

Short answer: When AI costs increase, first separate normal usage growth from waste: longer outputs, retries, failed calls, tool loops, poor routing, missing caching, and always-warm capacity.

Decision rule
  • Triage the largest recurring driver before switching providers, self-hosting, or buying GPU capacity.
  • Verify current provider pricing directly before buying or migrating.

Next action

Triage the bill before migrating

Find whether growth, output length, retries, tool calls, routing, or realtime capacity explains the increase before changing serving mode.

By Andrew Cooper, Founder of RunPlacement. Updated May 2026. Provider-neutral, estimate-labeled guidance. Verify current provider pricing.

Right fit

  • The AI bill rose and the team cannot explain the delta.
  • A migration or self-hosting proposal is being discussed before the cost driver is known.
  • You need a practical path that complements, rather than duplicates, LLM API bill triage.

Quick checks

  • Compare successful request growth against bill growth.
  • Check average output size, retry rate, failed calls, and timeout behavior.
  • Look for tool-call loops, agent steps, or long-context growth.
  • Separate realtime warm capacity from batchable work.
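
These checks can be run against per-request logs. The sketch below is a minimal illustration, not a definitive implementation: the `RequestLog` fields (`attempts`, `succeeded`, `output_tokens`, `tool_calls`, `realtime`) are a hypothetical schema, so map them to whatever your provider or gateway actually records.

```python
from dataclasses import dataclass


@dataclass
class RequestLog:
    # Hypothetical fields; rename to match your own logging.
    attempts: int        # paid attempts for this request, retries included (assumed >= 1)
    succeeded: bool      # whether the request eventually returned a usable result
    output_tokens: int   # output size of the final response
    tool_calls: int      # tool or agent steps taken
    realtime: bool       # True if a user was waiting on the response


def quick_checks(logs: list[RequestLog]) -> dict[str, float]:
    """Summarize the quick checks over one billing period of logs."""
    total = len(logs)
    if total == 0:
        raise ValueError("no log records to summarize")
    successes = [r for r in logs if r.succeeded]
    paid_attempts = sum(r.attempts for r in logs)
    return {
        # Share of requests that ended in a usable result.
        "success_rate": len(successes) / total,
        # Share of paid attempts that were retries rather than first tries.
        "retry_rate": (paid_attempts - total) / paid_attempts,
        # Average output size over successful requests only.
        "avg_output_tokens": sum(r.output_tokens for r in successes) / max(len(successes), 1),
        # Average tool or agent steps per request; a rising value can hint at loops.
        "avg_tool_calls": sum(r.tool_calls for r in logs) / total,
        # Share of requests that did not need a realtime response.
        "batchable_share": sum(1 for r in logs if not r.realtime) / total,
    }
```

Running the same summary on the prior period and comparing the two dictionaries shows which of these measures moved alongside the bill.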

Rough math

  • Cost delta = current month serving cost - normal baseline serving cost.
  • Waste check = paid attempts - successful requests.
  • Output growth check = current average output size / prior average output size.
  • Warm capacity check = paid serving hours * utilization gap, where the utilization gap is the share of paid hours spent idle (1 - utilization).
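
The four formulas above can be written as one small function. This is a rough sketch under stated assumptions: the parameter names are illustrative, and the utilization gap is taken to be `1 - utilization`, with utilization expressed as a fraction between 0 and 1.

```python
def rough_math(
    current_cost: float,
    baseline_cost: float,
    paid_attempts: int,
    successful_requests: int,
    current_avg_output: float,
    prior_avg_output: float,
    paid_serving_hours: float,
    utilization: float,
) -> dict[str, float]:
    """Apply the four rough-math checks to one billing period."""
    if prior_avg_output <= 0:
        raise ValueError("prior_avg_output must be positive")
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be a fraction between 0 and 1")
    return {
        # Cost delta = current month serving cost - normal baseline serving cost.
        "cost_delta": current_cost - baseline_cost,
        # Waste check = paid attempts - successful requests.
        "wasted_attempts": paid_attempts - successful_requests,
        # Output growth check = current average output size / prior average output size.
        "output_growth": current_avg_output / prior_avg_output,
        # Warm capacity check, assuming utilization gap = 1 - utilization.
        "idle_paid_hours": paid_serving_hours * (1.0 - utilization),
    }
```

With made-up inputs such as a 12,000 bill against an 8,000 baseline, 130,000 paid attempts for 100,000 successes, average output rising from 400 to 600 tokens, and 720 paid hours at 25% utilization, the function returns a cost delta of 4,000, 30,000 wasted attempts, output growth of 1.5, and 540 idle paid hours. These numbers are illustrative only, not benchmarks.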

Red flags

  • The team blames the provider before checking logs.
  • Only request count is measured, not output size.
  • Retries and failed calls are invisible.
  • Migration is proposed before caching, routing, or batching are tested.

What to do next

  • Use the "LLM API bill too high" page for API-specific triage.
  • Use the AI inference checklist to capture the missing fields.
  • Use the calculator once the largest driver is known.
  • Use the "batch vs realtime inference" page if warm capacity is the driver.

AI inference cost quiz

Get an AI compute cost read

Triage the largest recurring driver before switching providers, self-hosting, or buying GPU capacity.

Uses actual request volume, latency, GPU need, data movement, priority, and ops tolerance.
Start the AI compute read

FAQ

Why are AI costs increasing?

Common reasons include real usage growth, longer outputs, retries, failed calls, multi-step workflows, larger context, missing caching, poor routing, and always-warm serving capacity.

Should rising AI costs trigger self-hosting?

Not immediately. Self-hosting should be compared after avoidable API waste, batchability, utilization, and operations overhead are visible.

What is the first metric to check?

Compare bill growth with successful request growth. If the bill grew faster than successful requests did, inspect output length, retries, failures, and workflow steps.
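
As a minimal sketch of that first comparison, the snippet below computes period-over-period growth for the bill and for successful requests, then flags the case where cost is outpacing usage. The function names and the `tolerance` default are hypothetical; pick a threshold that matches how noisy your own billing is.

```python
def growth(current: float, prior: float) -> float:
    """Fractional growth from prior to current; 0.25 means +25%."""
    if prior <= 0:
        raise ValueError("prior must be positive")
    return (current - prior) / prior


def cost_outpaces_usage(
    bill_now: float,
    bill_prior: float,
    successes_now: int,
    successes_prior: int,
    tolerance: float = 0.05,  # hypothetical allowance for normal month-to-month noise
) -> bool:
    """True when the bill grew noticeably faster than successful requests."""
    bill_growth = growth(bill_now, bill_prior)
    usage_growth = growth(successes_now, successes_prior)
    return bill_growth > usage_growth + tolerance
```

A `True` result suggests the increase is not explained by usage growth alone, which is the cue to look at output length, retries, failures, and workflow steps.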
