AI Cost Per Token: When Token Price Helps and When It Misleads
Short answer: AI cost per token is useful for API estimates, but it can mislead when output length, retries, multi-step workflows, failed calls, or fixed serving capacity dominate cost.
- Use token price for API usage math, then convert the estimate into cost per successful request and monthly serving cost before comparing serving modes.
- Verify current provider pricing directly before buying or migrating.
Next action
Connect token price to product cost
Token price matters most after output length, retries, failed calls, traffic mix, and shared serving overhead are included.
Right fit
- You are estimating API spend from prompt and output size.
- A teammate is using token price as the whole cost comparison.
- You need to connect token usage to product margin or monthly budget.
Quick checks
- Estimate input and output tokens separately.
- Include retries, tool calls, multi-step chains, and failed calls.
- Measure successful requests rather than all attempts.
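- Check whether fixed capacity (a managed endpoint or dedicated GPUs) changes the denominator: when the monthly cost is fixed, cost per request depends on how many successful requests that capacity serves.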
Rough math
- API estimate = input tokens / 1,000,000 * input price + output tokens / 1,000,000 * output price.
- Request-level cost = expected input cost + expected output cost + retry and workflow allowance.
- Effective serving cost = total monthly serving cost / successful requests.
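The three formulas above can be sketched in Python as follows. This is a minimal illustration, not a billing tool: the function names, the token counts, the prices per 1,000,000 tokens, the 20% retry and workflow allowance, and the monthly totals are all hypothetical placeholders. Substitute your provider's current prices and your own measured traffic.

```python
def api_estimate(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """API estimate in dollars; prices are quoted per 1,000,000 tokens."""
    input_cost = input_tokens / 1_000_000 * input_price_per_m
    output_cost = output_tokens / 1_000_000 * output_price_per_m
    return input_cost + output_cost


def request_level_cost(input_cost, output_cost, retry_and_workflow_allowance):
    """Expected cost of one request, including a retry and workflow allowance."""
    return input_cost + output_cost + retry_and_workflow_allowance


def effective_serving_cost(total_monthly_serving_cost, successful_requests):
    """Monthly serving cost spread over successful requests only."""
    if successful_requests <= 0:
        raise ValueError("successful_requests must be positive")
    return total_monthly_serving_cost / successful_requests


# Hypothetical inputs: 1,500 input tokens, 500 output tokens,
# $2.00 per 1M input tokens and $8.00 per 1M output tokens.
base = api_estimate(1_500, 500, 2.00, 8.00)

# Hypothetical allowance: 20% on top of the base estimate for retries and extra steps.
per_request = request_level_cost(
    input_cost=1_500 / 1_000_000 * 2.00,
    output_cost=500 / 1_000_000 * 8.00,
    retry_and_workflow_allowance=0.20 * base,
)

# Hypothetical month: an $8,400 total bill and 950,000 successful requests.
per_success = effective_serving_cost(8_400.0, 950_000)

print(f"API estimate per call:       ${base:.4f}")
print(f"Request-level cost:          ${per_request:.4f}")
print(f"Effective cost per success:  ${per_success:.4f}")
```

Dividing by successful requests rather than all attempts is the design choice that matters here: failed calls still appear in the monthly total, so they raise the effective cost instead of diluting it.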
Red flags
- Only input tokens are counted.
- The estimate ignores long outputs.
- Failed attempts are treated as if they created user value.
- Token price is used to dismiss warm GPU, storage, network, or ops cost.
What to do next
- Use the calculator with your current token prices and request volume.
- Use inference cost per request to normalize the result.
- Use AI cost optimization if output size, retries, or routing are the main drivers.
Related decisions
Keep narrowing the placement question
Follow the adjacent pages when the first answer exposes a deeper cost driver or operating constraint.
A useful inference cost per request starts with total monthly serving cost divided by successful inference requests, with failed calls and retries handled explicitly.
AI Cost Optimization: Practical Levers Before Rebuilding Inference (optimization guide)
AI cost optimization usually starts with usage shape: reduce avoidable output, retries, failed calls, over-large prompts, expensive routing, and low utilization before changing infrastructure.
LLM API Bill Too High? What to Check First (cost triage)
A high LLM API bill is usually a triage problem first: check whether output size, retries, tool calls, caching gaps, routing, or batchable work are driving the increase.
Framework
Use the underlying decision model
These framework pages define the terms and formulas behind this specific decision.
AI inference cost should be compared as effective cost per successful request and monthly serving cost, not just token price or GPU hourly rate.
Useful GPU-Hour Framework (GPU pricing)
Useful GPU-hour cost is the better comparison unit when GPU providers differ in utilization, queueing, reliability, storage behavior, or operational model.
AI inference cost quiz
Get an AI compute cost read
Use token price for API usage math, then convert the estimate into cost per successful request and monthly serving cost before comparing serving modes.
Uses actual request volume, latency, GPU need, data movement, priority, and ops tolerance.
FAQ
Is cost per token the same as cost per request?
No. Cost per request depends on the number of input tokens, output tokens, retries, failed calls, and workflow steps used by each successful request.
Why can output tokens dominate AI API cost?
Output tokens often have a separate price and can grow with verbose responses, chain-of-thought-like drafts, tool summaries, or unbounded generation.
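A short sketch of that effect, assuming a hypothetical fixed prompt of 1,000 input tokens and hypothetical prices of $2.00 per 1M input tokens and $8.00 per 1M output tokens. The `output_share` helper is illustrative only; it reports what fraction of a single call's cost comes from output tokens as the response grows.

```python
def output_share(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Fraction of one call's token cost that comes from output tokens."""
    input_cost = input_tokens / 1_000_000 * input_price_per_m
    output_cost = output_tokens / 1_000_000 * output_price_per_m
    total = input_cost + output_cost
    return output_cost / total if total > 0 else 0.0


# Hypothetical prompt size and prices; only the output length varies.
for output_tokens in (100, 500, 2_000):
    share = output_share(1_000, output_tokens, 2.00, 8.00)
    print(f"{output_tokens:>5} output tokens -> {share:.0%} of call cost")
```

Under these assumed prices the output side moves from a minority of the call cost to the large majority as the response lengthens, which is why capping or shortening output is often checked before prompt size.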
Can token price compare API and self-hosted GPU?
Only as a starting point. A fair comparison also includes warm capacity, utilization, shared infrastructure, and operations work.
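One way to see why token price is only a starting point is to compare a fixed monthly serving cost against an API cost per successful request at different volumes. The sketch below is illustrative: the $3,000 monthly figure (standing in for warm GPU capacity, storage, network, and operations work) and the $0.008 API cost per successful request are hypothetical, and it assumes the fixed capacity can actually serve each volume shown.

```python
def fixed_cost_per_success(monthly_fixed_cost, successful_requests):
    """Effective cost per successful request when the monthly cost is fixed."""
    if successful_requests <= 0:
        raise ValueError("successful_requests must be positive")
    return monthly_fixed_cost / successful_requests


def break_even_requests(monthly_fixed_cost, api_cost_per_successful_request):
    """Monthly successful requests at which fixed capacity matches the API cost."""
    if api_cost_per_successful_request <= 0:
        raise ValueError("api_cost_per_successful_request must be positive")
    return monthly_fixed_cost / api_cost_per_successful_request


# Hypothetical figures; replace with your own measured totals.
MONTHLY_FIXED_COST = 3_000.00       # warm capacity + shared infrastructure + ops
API_COST_PER_SUCCESS = 0.008        # includes retries and failed calls

threshold = break_even_requests(MONTHLY_FIXED_COST, API_COST_PER_SUCCESS)
print(f"Break-even volume: about {threshold:,.0f} successful requests per month")

for volume in (100_000, 375_000, 1_000_000):
    fixed = fixed_cost_per_success(MONTHLY_FIXED_COST, volume)
    cheaper = "fixed capacity" if fixed < API_COST_PER_SUCCESS else "API"
    print(f"{volume:>9,} requests: fixed ${fixed:.4f} vs API ${API_COST_PER_SUCCESS:.4f} -> {cheaper}")
```

The same fixed bill produces very different costs per request depending on utilization, which a per-token price cannot show on its own.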