AI Costs Increasing? A Triage Checklist Before You Migrate

Short answer: when AI costs increase, first separate normal usage growth from waste such as longer outputs, retries, failed calls, tool loops, poor routing, missing caching, and always-warm capacity.

Decision rule

Triage the largest recurring driver before switching providers, self-hosting, or buying GPU capacity.
Verify current provider pricing directly before buying or migrating.

Next action

Triage the bill before migrating

Find whether growth, output length, retries, tool calls, routing, or realtime capacity explains the increase before changing serving mode.


By Andrew Cooper, Founder of RunPlacement. Updated May 2026. Provider-neutral, estimate-labeled guidance. Verify current provider pricing.

Right fit

The AI bill rose and the team cannot explain the delta.
A migration or self-hosting proposal is being discussed before the cost driver is known.
You need a practical path that complements, rather than duplicates, LLM API bill triage.

Quick checks

Compare successful request growth against bill growth.
Check average output size, retry rate, failed calls, and timeout behavior.
Look for tool-call loops, agent steps, or long-context growth.
Separate realtime warm capacity from batchable work.
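
The log-based checks above can be sketched in a few lines of Python. This is a minimal illustration, not a provider integration: the `CallRecord` fields are hypothetical names, and the assumption that retries share a `request_id` may not match how your gateway or provider logs attempts.

```python
from dataclasses import dataclass


@dataclass
class CallRecord:
    # Hypothetical log fields; map them to whatever your logs actually export.
    request_id: str      # logical request; assumed to be shared by retries
    succeeded: bool
    timed_out: bool
    output_tokens: int
    tool_calls: int


def quick_checks(records):
    """Summarize retry, failure, timeout, output-size, and tool-call signals."""
    attempts = len(records)
    if attempts == 0:
        raise ValueError("no call records to summarize")
    logical_requests = len({r.request_id for r in records})
    successes = [r for r in records if r.succeeded]
    avg_output = (
        sum(r.output_tokens for r in successes) / len(successes)
        if successes else 0.0
    )
    return {
        "attempts": attempts,
        "logical_requests": logical_requests,
        # Share of attempts that were repeats of an earlier request.
        "retry_rate": (attempts - logical_requests) / attempts,
        "failure_rate": (attempts - len(successes)) / attempts,
        "timeout_rate": sum(r.timed_out for r in records) / attempts,
        "avg_output_tokens": avg_output,
        # A high value here is a hint to look for tool-call loops.
        "max_tool_calls": max(r.tool_calls for r in records),
    }
```

Running this over the current and the prior billing period and comparing the two summaries shows which signal moved most.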

Rough math

Cost delta = current month serving cost - normal baseline serving cost.
Waste check = paid attempts - successful requests.
Output growth check = current average output size / prior average output size.
Warm capacity check = paid serving hours * utilization gap.
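
As a rough sketch, the four checks above translate directly into small Python functions. The function and parameter names are illustrative. The page does not define "utilization gap", so the code assumes it means one minus the utilized fraction of paid serving hours.

```python
def cost_delta(current_month_cost, baseline_cost):
    """Cost delta = current month serving cost - normal baseline serving cost."""
    return current_month_cost - baseline_cost


def waste_check(paid_attempts, successful_requests):
    """Waste check = paid attempts - successful requests."""
    return paid_attempts - successful_requests


def output_growth_check(current_avg_output, prior_avg_output):
    """Output growth check = current average output size / prior average."""
    if prior_avg_output <= 0:
        raise ValueError("prior average output size must be positive")
    return current_avg_output / prior_avg_output


def warm_capacity_check(paid_serving_hours, utilization):
    """Warm capacity check = paid serving hours * utilization gap.

    Assumption: utilization gap = 1 - utilization, with utilization given
    as a fraction between 0 and 1, so the result is paid-but-idle hours.
    """
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be a fraction between 0 and 1")
    return paid_serving_hours * (1.0 - utilization)
```

With made-up numbers, a bill of 12,000 against a baseline of 8,000 gives a delta of 4,000; 130,000 paid attempts against 100,000 successful requests gives 30,000 wasted attempts; average output of 600 against 400 gives a growth ratio of 1.5; and 720 paid hours at 25% utilization gives 540 idle hours.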

Red flags

The team blames the provider before checking logs.
Only request count is measured, not output size.
Retries and failed calls are invisible.
Migration is proposed before caching, routing, or batching are tested.

What to do next

Use the LLM API Bill Too High page for API-specific triage.
Use the AI Inference Cost Checklist to capture the missing fields.
Use the calculator once the largest driver is known.
Use the Batch vs Realtime Inference Cost page if warm capacity is the driver.

Related resources

Use a worksheet before making the call

These supporting pages turn the decision into fields a buyer, engineer, or founder can actually compare.

AI Inference Cost Checklist (AI inference cost, Checklist / 8 sections / source-linked)

A practical checklist for estimating AI inference cost across APIs, managed inference, self-hosted GPUs, batch jobs, realtime endpoints, and hybrid routing.

Related decisions

Keep narrowing the placement question

Follow the adjacent pages when the first answer exposes a deeper cost driver or operating constraint.

LLM API Bill Too High? What to Check First (AI inference cost, Cost triage)

A high LLM API bill is usually a triage problem first: check whether output size, retries, tool calls, caching gaps, routing, or batchable work are driving the increase.

AI Cost Optimization: Practical Levers Before Rebuilding Inference (AI inference cost, Optimization guide)

AI cost optimization usually starts with usage shape: reduce avoidable output, retries, failed calls, over-large prompts, expensive routing, and low utilization before changing infrastructure.

Batch vs Realtime Inference Cost: How to Choose (AI inference cost, Cost estimation)

Batch inference is often cheaper when latency is flexible because work can be queued for higher utilization; realtime inference costs more when warm capacity and strict latency are required.

AI inference cost

When the GPU question is really serving cost

Use these pages when the same GPU quote, idle-cost, or useful GPU-hour question is about production inference rather than one-off training.

Estimator landing page: AI Cost Calculator. Start with broad AI cost, then narrow to API, managed inference, GPU, batch, realtime, or hybrid serving.
Interactive calculator: AI Inference Cost Calculator. Compare API, managed inference, and self-hosted GPU cost per successful request.
Decision page: AI Cost Comparison. Compare serving categories before ranking providers or quotes.
Decision tree: API vs Self-Hosted Inference. Decide when API simplicity, managed serving, or self-hosted GPU control fits.
Optimization guide: AI Cost Optimization. Check output length, retries, routing, caching, batching, and utilization before rebuilding inference.
Triage page: AI Costs Increasing. Find the driver before moving off APIs, switching platforms, or buying GPUs.
Research guide: Realtime vs Batch Research. Decide when queueing, delay tolerance, and avoided warm capacity can change inference cost.
Formula page: Inference Cost Per Request. Use monthly serving cost divided by successful requests as the common comparison unit.
Framework: AI Inference Cost Model. Normalize serving options by monthly cost and successful requests.

Framework

Use the underlying decision model

These framework pages define the terms and formulas behind this specific decision.

AI Inference Cost Model (AI inference cost)

AI inference cost should be compared as effective cost per successful request and monthly serving cost, not just token price or GPU hourly rate.

Useful GPU-Hour Framework (GPU pricing, useful GPU-hour)

Useful GPU-hour cost is the better comparison unit when GPU providers differ in utilization, queueing, reliability, storage behavior, or operational model.

AI inference cost quiz

Get an AI compute cost read

Triage the largest recurring driver before switching providers, self-hosting, or buying GPU capacity.

Uses actual request volume, latency, GPU need, data movement, priority, and ops tolerance.

Start the AI compute read

FAQ

Why are AI costs increasing?

Common reasons include real usage growth, longer outputs, retries, failed calls, multi-step workflows, larger context, missing caching, poor routing, and always-warm serving capacity.

Should rising AI costs trigger self-hosting?

Not immediately. Self-hosting should be compared after avoidable API waste, batchability, utilization, and operations overhead are visible.

What is the first metric to check?

Compare bill growth with successful request growth. If cost grew faster than useful output, inspect output length, retries, failures, and workflow steps.
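
A minimal Python sketch of that first comparison follows. The function names and the 5-point `tolerance` are assumptions chosen for illustration, not thresholds from this page; pick a tolerance that matches the normal noise in your own bill.

```python
def growth(current, prior):
    """Fractional growth from the prior period to the current one."""
    if prior <= 0:
        raise ValueError("prior value must be positive")
    return current / prior - 1.0


def cost_outpaced_usage(bill_current, bill_prior,
                        successes_current, successes_prior,
                        tolerance=0.05):
    """True when the bill grew faster than successful requests.

    `tolerance` is an assumed margin (in fractional growth) below which
    the difference is treated as noise.
    """
    bill_growth = growth(bill_current, bill_prior)
    usage_growth = growth(successes_current, successes_prior)
    return bill_growth - usage_growth > tolerance
```

If this returns `True`, the increase is not explained by useful volume alone, so output length, retries, failures, and workflow steps are the next things to inspect.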
