Inference Cost Per Request: Simple Formula
Short answer: Start with total monthly serving cost divided by successful inference requests, and account for failed calls and retries explicitly instead of folding them into the denominator.
- Use successful requests as the denominator, and include all serving costs in the numerator before comparing options.
- Verify current provider pricing directly before buying or migrating.
Right fit
- You need one metric for API, managed inference, and self-hosted GPU.
- A product margin discussion needs a simple cost unit.
- Provider unit pricing is making the comparison hard to explain.
Quick checks
- Define a successful request once and apply that definition consistently to every option being compared.
- Separate retries and failed calls from successful output.
- Include platform fees, warm capacity, shared infrastructure, and operations.
Rough math
- Inference cost per request = total monthly serving cost / successful requests.
- Total monthly serving cost = API or infrastructure cost + shared infrastructure + operations overhead.
- Retry-adjusted cost: failed calls and retries that consume billable work raise the numerator without adding successful requests, so cost per request rises.
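The rough math above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a definitive implementation: the function names are hypothetical, and every dollar figure and request count is a placeholder chosen only to make the arithmetic easy to follow.

```python
def total_monthly_serving_cost(api_or_infra: float, shared_infra: float,
                               ops_overhead: float) -> float:
    """Sum the three numerator terms from the formula (dollars per month)."""
    return api_or_infra + shared_infra + ops_overhead


def cost_per_request(total_cost: float, successful_requests: int) -> float:
    """Divide total monthly serving cost by successful requests only."""
    if successful_requests <= 0:
        raise ValueError("successful_requests must be positive")
    return total_cost / successful_requests


# Hypothetical month with no billable failures.
baseline = cost_per_request(
    total_monthly_serving_cost(6_000.0, 1_500.0, 2_500.0), 400_000
)

# Same month, but failed calls and retries added $800 of billable work
# to the API or infrastructure line without adding successful requests.
with_retries = cost_per_request(
    total_monthly_serving_cost(6_800.0, 1_500.0, 2_500.0), 400_000
)

print(f"baseline:     ${baseline:.4f} per successful request")
print(f"with retries: ${with_retries:.4f} per successful request")
```

With these placeholder inputs the denominator stays at 400,000 successful requests in both cases, so the extra billable work from retries shows up entirely as a higher cost per request.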
Red flags
- The denominator includes failed requests as if they created value.
- The numerator excludes idle capacity or engineer time.
- The team compares token price and GPU rate without normalizing to requests.
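To illustrate the last red flag, the sketch below converts a token price and a GPU hourly rate into the same unit, dollars per request. The helper names and all prices, token counts, and volumes are hypothetical placeholders, not real provider pricing; current rates should be checked with the provider.

```python
def api_cost_per_request(input_tokens: int, output_tokens: int,
                         usd_per_1m_input: float,
                         usd_per_1m_output: float) -> float:
    """Token-priced API: convert per-million-token prices to dollars per request."""
    token_cost = input_tokens * usd_per_1m_input + output_tokens * usd_per_1m_output
    return token_cost / 1_000_000


def gpu_cost_per_request(usd_per_gpu_hour: float, gpu_count: int,
                         billed_hours: float, successful_requests: int) -> float:
    """Hourly-priced GPUs: billed hours include idle and warm capacity."""
    if successful_requests <= 0:
        raise ValueError("successful_requests must be positive")
    return usd_per_gpu_hour * gpu_count * billed_hours / successful_requests


# Placeholder inputs, not real provider prices.
api = api_cost_per_request(1_000, 500, usd_per_1m_input=2.0, usd_per_1m_output=8.0)
gpu = gpu_cost_per_request(2.5, gpu_count=2, billed_hours=720,
                           successful_requests=400_000)

print(f"API: ${api:.4f} per request")
print(f"GPU: ${gpu:.4f} per request")
```

Both results cover only the API or GPU line of the numerator. Shared infrastructure and operations overhead still need to be added before the two options are compared.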
What to do next
- Open the AI inference cost calculator.
- Read the AI inference cost model.
- Use API versus self-hosted inference when the formula exposes a serving-mode decision.
- Use the AI inference cost checklist to collect the inputs.
Related resources
Use a worksheet before making the call
These supporting pages turn the decision into fields a buyer, engineer, or founder can actually compare.
A practical checklist for estimating AI inference cost across APIs, managed inference, self-hosted GPUs, batch jobs, realtime endpoints, and hybrid routing.
GPU Cloud Quote Checklist (GPU pricing; checklist, 7 sections, source-linked). A practical checklist and visual worksheet for comparing GPU cloud quotes beyond the advertised hourly rate.
Related decisions
Keep narrowing the placement question
Follow the adjacent pages when the first answer exposes a deeper cost driver or operating constraint.
API inference usually wins for uncertain or low-volume workloads; self-hosted inference can win when volume, utilization, latency, or control needs justify GPU operations.
Self-Hosted LLM Inference Cost: What to Include (AI inference cost; cost estimation). The GPU hourly rate is only the starting point for self-hosted LLM inference cost; warm capacity, utilization, storage, networking, monitoring, reliability, upgrades, and team time all belong in the estimate.
LLM API Bill Too High? What to Check First (AI inference cost; cost triage). A high LLM API bill is usually a triage problem first: check whether output size, retries, tool calls, caching gaps, routing, or batchable work are driving the increase.
AI inference cost
When the GPU question is really serving cost
Use these pages when the same GPU quote, idle-cost, or useful GPU-hour question is about production inference rather than one-off training.
Framework
Use the underlying decision model
These framework pages define the terms and formulas behind this specific decision.
AI inference cost quiz
Get an AI compute cost read
Use successful requests as the denominator, and include all serving costs in the numerator before comparing options.
Uses actual request volume, latency, GPU need, data movement, priority, and ops tolerance.
FAQ
What is inference cost per request?
It is total monthly serving cost divided by successful inference requests.
Should failed requests be included?
Failed or retried calls should be measured because they can create cost, but the main denominator should be successful requests that produced useful output.
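As a rough illustration of why the denominator matters, the snippet below compares dividing by successful requests with dividing by all attempts. The variable names and figures are hypothetical and chosen only to keep the arithmetic simple.

```python
# Hypothetical monthly figures.
total_cost = 10_000.0   # total monthly serving cost in dollars
successful = 400_000    # requests that produced useful output
failed = 100_000        # failed or retried calls

per_successful = total_cost / successful           # recommended denominator
per_attempt = total_cost / (successful + failed)   # counts failures as if useful

# Fraction by which the per-attempt figure understates the real unit cost.
understatement = 1 - per_attempt / per_successful

print(f"per successful request: ${per_successful:.4f}")
print(f"per attempt:            ${per_attempt:.4f}")
print(f"understatement:         {understatement:.0%}")
```

In this example, counting failed calls in the denominator makes each request look about 20% cheaper than it is, even though the monthly bill is unchanged.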
Why use this formula?
It gives API, managed inference, and self-hosted GPU a common unit for product margin and infrastructure decisions.
Sources
- https://platform.openai.com/docs/pricing
- https://docs.aws.amazon.com/sagemaker/latest/dg/deploy-model.html
- https://docs.aws.amazon.com/sagemaker/latest/dg/batch-transform.html
- https://cloud.google.com/vertex-ai/docs/predictions/overview
- https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/feature-overview.html