AI inference cost / Formula

Inference Cost Per Request: Simple Formula

Short answer: A useful inference cost per request starts with total monthly serving cost divided by successful inference requests, with failed calls and retries handled explicitly.

Decision rule

Use successful requests as the denominator, and include all serving costs in the numerator before comparing options.
Verify current provider pricing directly before buying or migrating.

Next action

Use one comparison unit

Put every serving option into the same denominator: successful requests that produced useful output.

Read the cost model

By Andrew Cooper, Founder of RunPlacement Updated May 2026 Provider-neutral, estimate-labeled guidance Verify current provider pricing

Infographic showing inference cost per request as total monthly serving cost divided by successful inference requests, with API, managed, GPU, retries, shared infrastructure, warm capacity, and operations overhead in the numerator and successful useful outputs in the denominator. — Inference cost per request is a reusable comparison unit for API, managed inference, and self-hosted GPU serving.

Right fit

You need one metric for API, managed inference, and self-hosted GPU.
A product margin discussion needs a simple cost unit.
Provider unit pricing is making the comparison hard to explain.

Quick checks

Define successful request consistently.
Separate retries and failed calls from successful output.
Include platform fees, warm capacity, shared infrastructure, and operations.

Rough math

Inference cost per request = total monthly serving cost / successful requests.
Total monthly serving cost = API or infrastructure cost + shared infrastructure + operations overhead.
Retry-adjusted cost rises when failed calls consume billable work.

Red flags

The denominator includes failed requests as if they created value.
The numerator excludes idle capacity or engineer time.
The team compares token price and GPU rate without normalizing to requests.

What to do next

Open the AI inference cost calculator.
Read the AI inference cost model.
Use API versus self-hosted inference when the formula exposes a serving-mode decision.
Use the AI inference cost checklist to collect the inputs.

Related resources

Use a worksheet before making the call

These supporting pages turn the decision into fields a buyer, engineer, or founder can actually compare.

AI inference costAI Inference Cost ChecklistChecklist / 8 sections / source-linked

A practical checklist for estimating AI inference cost across APIs, managed inference, self-hosted GPUs, batch jobs, realtime endpoints, and hybrid routing.

GPU pricingGPU Cloud Quote ChecklistChecklist / 7 sections / source-linked

A practical checklist and visual worksheet for comparing GPU cloud quotes beyond the advertised hourly rate.

Related decisions

Keep narrowing the placement question

Follow the adjacent pages when the first answer exposes a deeper cost driver or operating constraint.

AI inference costAI Cost Per Token: When Token Price Helps and When It MisleadsFormula guide

AI cost per token is useful for API estimates, but it can mislead when output length, retries, multi-step workflows, failed calls, or fixed serving capacity dominate cost.

AI inference costAI Cost Comparison: API, Managed Inference, GPU Cloud, and BatchCommercial comparison

A useful AI cost comparison compares serving categories by monthly cost, cost per successful request, latency, utilization, and operations burden, not by provider ranking.

AI inference costAPI vs Self-Hosted Inference: Which Costs Less?Commercial comparison

API inference usually wins for uncertain or low-volume workloads; self-hosted inference can win when volume, utilization, latency, or control needs justify GPU operations.

AI inference cost

When the GPU question is really serving cost

Use these pages when the same GPU quote, idle-cost, or useful GPU-hour question is about production inference rather than one-off training.

Estimator landing pageAI Cost CalculatorStart with broad AI cost, then narrow to API, managed inference, GPU, batch, realtime, or hybrid serving. Interactive calculatorAI Inference Cost CalculatorCompare API, managed inference, and self-hosted GPU cost per successful request. Decision pageAI Cost ComparisonCompare serving categories before ranking providers or quotes. Decision treeAPI vs Self-Hosted InferenceDecide when API simplicity, managed serving, or self-hosted GPU control fits. Optimization guideAI Cost OptimizationCheck output length, retries, routing, caching, batching, and utilization before rebuilding inference. Triage pageAI Costs IncreasingFind the driver before moving off APIs, switching platforms, or buying GPUs. Research guideRealtime vs Batch ResearchDecide when queueing, delay tolerance, and avoided warm capacity can change inference cost. Formula pageInference Cost Per RequestUse monthly serving cost divided by successful requests as the common comparison unit. FrameworkAI Inference Cost ModelNormalize serving options by monthly cost and successful requests.

Framework

Use the underlying decision model

These framework pages define the terms and formulas behind this specific decision.

AI inference costAI Inference Cost ModelAI inference cost

AI inference cost should be compared as effective cost per successful request and monthly serving cost, not just token price or GPU hourly rate.

AI inference cost quiz

Get an AI compute cost read

Use successful requests as the denominator, and include all serving costs in the numerator before comparing options.

Uses actual request volume, latency, GPU need, data movement, priority, and ops tolerance.

Start the AI compute read

FAQ

What is inference cost per request?

Inference cost per request is total monthly serving cost divided by successful inference requests. The numerator should include API or infrastructure cost, idle capacity, storage, networking, observability, support, and operations overhead. The denominator should be requests that produced useful output, not every attempted call.

Should failed requests be included?

Failed requests should be measured because they can create cost, but they should not usually be counted as successful output. Track failures and retries in the numerator or as a separate waste rate, then use successful requests as the denominator for product margin and serving-mode comparisons.

Why use this formula?

Use the inference cost per request formula because it gives API, managed inference, and self-hosted GPU one comparison unit. It also exposes hidden drivers such as long outputs, retries, idle warm capacity, and engineering overhead. The formula is directional until populated with real logs, bills, or quotes.

Sources

AI inference cost quiz

Get an AI compute cost read

Use successful requests as the denominator, and include all serving costs in the numerator before comparing options.

Uses actual request volume, latency, GPU need, data movement, priority, and ops tolerance.

Start the AI compute read