Inference Cost Per Request: Simple Formula

Short answer: Inference cost per request is total monthly serving cost divided by successful inference requests. Failed calls and retries are counted as cost in the numerator, not as requests in the denominator.

Decision rule
  • Use successful requests as the denominator, and include all serving costs in the numerator before comparing options.
  • Verify current provider pricing directly before buying or migrating.
By Andrew Cooper, Founder of RunPlacement. Updated May 2026. Provider-neutral, estimate-labeled guidance. Verify current provider pricing.
Figure: Inference cost per request as total monthly serving cost divided by successful inference requests. The numerator covers API, managed, GPU, retries, shared infrastructure, warm capacity, and operations overhead; the denominator counts successful useful outputs.
Inference cost per request is a reusable comparison unit for API, managed inference, and self-hosted GPU serving.

Right fit

  • You need one metric for API, managed inference, and self-hosted GPU.
  • A product margin discussion needs a simple cost unit.
  • Provider unit pricing is making the comparison hard to explain.

Quick checks

  • Define a successful request the same way for every option you compare.
  • Separate retries and failed calls from successful output.
  • Include platform fees, warm capacity, shared infrastructure, and operations.
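The first two checks can be sketched as a small counting step over an attempt log. This is a minimal illustration, not a prescribed method: the `Attempt` record, its `request_id` and `ok` fields, and the rule that a request counts as successful if any of its attempts succeeded are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attempt:
    """One billable call. Hypothetical record shape for illustration."""
    request_id: str  # logical request this attempt belongs to
    ok: bool         # True if the attempt produced useful output


def summarize(attempts):
    """Separate successful requests (denominator) from failed attempts (cost only)."""
    # Assumption: a logical request is successful if at least one attempt succeeded.
    succeeded = {a.request_id for a in attempts if a.ok}
    failed_attempts = sum(1 for a in attempts if not a.ok)
    return {
        "successful_requests": len(succeeded),
        "total_attempts": len(attempts),
        "failed_attempts": failed_attempts,
    }


# r2 needed one retry, r3 never succeeded.
log = [
    Attempt("r1", True),
    Attempt("r2", False),
    Attempt("r2", True),
    Attempt("r3", False),
]
summary = summarize(log)
print(summary)
```

Here four attempts were billed but only two requests belong in the denominator, which is why attempts and successful requests are tracked as separate numbers.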

Rough math

  • Inference cost per request = total monthly serving cost / successful requests.
  • Total monthly serving cost = API or infrastructure cost + shared infrastructure + operations overhead.
  • Retry-adjusted cost rises when failed calls consume billable work.
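As a rough sketch, the two formulas above translate directly into code. The function names and the dollar figures are hypothetical placeholders, not provider prices; substitute your own monthly totals.

```python
def total_serving_cost(api_or_infra_cost, shared_infra_cost, ops_overhead):
    """Total monthly serving cost: the numerator of the formula."""
    return api_or_infra_cost + shared_infra_cost + ops_overhead


def cost_per_request(total_cost, successful_requests):
    """Inference cost per request = total monthly serving cost / successful requests."""
    if successful_requests <= 0:
        raise ValueError("successful_requests must be positive")
    return total_cost / successful_requests


# Hypothetical monthly inputs, in dollars.
total = total_serving_cost(
    api_or_infra_cost=4000.0,
    shared_infra_cost=600.0,
    ops_overhead=1400.0,
)
per_request = cost_per_request(total, successful_requests=1_200_000)
print(f"total=${total:,.2f} per_request=${per_request:.4f}")
```

With these placeholder inputs the total is $6,000 and the cost per successful request is $0.005. Retry cost is not a separate term here; it is assumed to already be inside the API or infrastructure bill.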

Red flags

  • The denominator includes failed requests as if they created value.
  • The numerator excludes idle capacity or engineer time.
  • The team compares token price and GPU rate without normalizing to requests.
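To avoid the last red flag, both pricing styles can be converted to a per-request figure before comparison. The sketch below assumes token-based API pricing quoted per million tokens and GPU pricing quoted per hour; the function names, token counts, and rates are illustrative assumptions, and neither figure yet includes shared infrastructure or operations overhead.

```python
def api_cost_per_request(input_tokens, output_tokens,
                         input_price_per_million, output_price_per_million):
    """Token-priced API: cost of one request from its average token counts."""
    return (input_tokens * input_price_per_million
            + output_tokens * output_price_per_million) / 1_000_000


def gpu_cost_per_request(hourly_rate, billed_hours, successful_requests):
    """Hourly-priced GPU: billed hours (idle time included) spread over successes."""
    if successful_requests <= 0:
        raise ValueError("successful_requests must be positive")
    return hourly_rate * billed_hours / successful_requests


# Hypothetical inputs; verify real prices with the provider.
api_unit = api_cost_per_request(
    input_tokens=1000, output_tokens=500,
    input_price_per_million=1.0, output_price_per_million=2.0,
)
gpu_unit = gpu_cost_per_request(
    hourly_rate=2.0, billed_hours=720, successful_requests=1_000_000,
)
print(f"api=${api_unit:.5f} gpu=${gpu_unit:.5f}")
```

Under these made-up numbers the API works out to $0.002 per request and the GPU to $0.00144, but the comparison is only meaningful once the remaining numerator items are added to both sides.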

What to do next

  • Open the AI inference cost calculator.
  • Read the AI inference cost model.
  • Read API versus self-hosted inference when the formula exposes a serving-mode decision.
  • Use the AI inference cost checklist to collect the inputs.

FAQ

What is inference cost per request?

It is total monthly serving cost divided by successful inference requests.

Should failed requests be included?

Failed or retried calls should be measured because they can create cost, but the main denominator should be successful requests that produced useful output.
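A small sketch of that rule: failed attempts add to the cost but not to the denominator. It assumes every attempt, failed or not, is billed at the same per-attempt rate, which may not hold for a given provider; the names and numbers are hypothetical.

```python
def retry_adjusted_cost_per_request(cost_per_attempt, successful_requests,
                                    failed_billable_attempts):
    """Spread the cost of all billed attempts over successful requests only."""
    if successful_requests <= 0:
        raise ValueError("successful_requests must be positive")
    # Assumption: one successful attempt per successful request.
    total_attempts = successful_requests + failed_billable_attempts
    return cost_per_attempt * total_attempts / successful_requests


# Hypothetical month: 1,000,000 successes plus 50,000 billed failures.
adjusted = retry_adjusted_cost_per_request(
    cost_per_attempt=0.002,
    successful_requests=1_000_000,
    failed_billable_attempts=50_000,
)
print(f"retry-adjusted=${adjusted:.4f}")
```

In this example a 5% overhead of billed failures lifts the cost per successful request from $0.0020 to $0.0021.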

Why use this formula?

It gives API, managed inference, and self-hosted GPU a common unit for product margin and infrastructure decisions.
