AI inference cost / RunPlacement framework

AI Inference Cost Model

Direct answer: AI inference cost should be compared as effective cost per successful request and monthly serving cost, not just token price or GPU hourly rate.

Decision rule
  • Use APIs when usage is uncertain or ops tolerance is low; consider managed or self-hosted GPUs when volume, latency, data control, or model requirements justify the overhead.
  • Use provider pricing pages and your own bill or quote before making a purchase or migration decision.
By Andrew Cooper, Founder of RunPlacement Updated May 2026 Provider-neutral, estimate-labeled guidance Verify current provider pricing

Definition

AI inference cost

AI inference cost is the total cost of serving model outputs, including token/API charges or GPU time plus idle capacity, storage, networking, observability, reliability, and engineering overhead.

Effective inference cost = total monthly serving cost / successful inference requests.
Infographic showing inference cost per request as total monthly serving cost divided by successful inference requests, with API, managed, GPU, retries, shared infrastructure, warm capacity, and operations overhead in the numerator and successful useful outputs in the denominator.
Inference cost per request is a reusable comparison unit for API, managed inference, and self-hosted GPU serving.

Key idea

How to use the formula

AI inference cost should be compared as effective cost per successful request and monthly serving cost, not just token price or GPU hourly rate.

Use APIs when usage is uncertain or ops tolerance is low; consider managed or self-hosted GPUs when volume, latency, data control, or model requirements justify the overhead.

Example scenarios

Low-traffic prototype

An API can be the right starting point because there is no idle GPU baseline to carry.

High-volume batch inference

Batch jobs can justify GPU or managed batch serving when utilization is high and latency is flexible.

Production realtime endpoint

A self-hosted endpoint can lose if it keeps expensive warm capacity idle for traffic spikes.

Hybrid routing and caching

Cache hits, small-model routing, and batch queues can reduce cost without forcing every request through the largest model.

Decision Table

OptionBest useRisk
API inferenceFast start, no GPU operations, clear usage billingCan become expensive at scale or with long outputs
Managed inferenceAutoscaling, batching, reliability helpCan hide platform premium and utilization assumptions
Self-hosted GPUMore control and possible scale economicsAdds idle capacity, networking, storage, and engineering work
Batch inferenceHigher utilization and flexible schedulingNot suitable for strict realtime latency

Practical companion

Turn the model into worksheet fields

The framework defines the comparison unit. The checklist captures the request, latency, utilization, warm-capacity, and operations inputs that make the formula usable.

AI inference cost quiz

Get an AI compute cost read

Use APIs when usage is uncertain or ops tolerance is low; consider managed or self-hosted GPUs when volume, latency, data control, or model requirements justify the overhead.

Uses actual request volume, latency, GPU need, data movement, priority, and ops tolerance.
Start the AI compute read

Related decisions

Apply the framework

Use these long-tail decision pages when a specific cost driver or provider choice is already visible.

AI inference costAPI vs Self-Hosted Inference: Which Costs Less?Commercial comparison

API inference usually wins for uncertain or low-volume workloads; self-hosted inference can win when volume, utilization, latency, or control needs justify GPU operations.

AI inference costSelf-Hosted LLM Inference Cost: What to IncludeCost estimation

The GPU hourly rate is only the starting point for self-hosted LLM inference cost; warm capacity, utilization, storage, networking, monitoring, reliability, upgrades, and team time all belong in the estimate.

AI inference costLLM API Bill Too High? What to Check FirstCost triage

A high LLM API bill is usually a triage problem first: check whether output size, retries, tool calls, caching gaps, routing, or batchable work are driving the increase.

AI inference costInference Cost Per Request: Simple FormulaFormula

A useful inference cost per request starts with total monthly serving cost divided by successful inference requests, with failed calls and retries handled explicitly.

AI inference costBatch vs Realtime Inference Cost: How to ChooseCost estimation

Batch inference is often cheaper when latency is flexible because work can be queued for higher utilization; realtime inference costs more when warm capacity and strict latency are required.

AI inference costManaged Inference vs GPU Cloud: Cost and Control TradeoffsCommercial comparison

Managed inference can cost more on paper but win when autoscaling, batching, reliability, and lower ops burden reduce effective inference cost.

GPU pricingGPU Cloud Idle Cost: How to Price Wasted Accelerator TimeCost estimation

GPU cloud idle cost is the gap between paid accelerator time and useful workload progress. It matters most for training retries, batch queues, and inference fleets with low baseline utilization.

Related resources

Turn the framework into a worksheet

These checklists make the concept easier to share and apply.

FAQ

How should I estimate AI inference cost?

Estimate input and output usage, traffic pattern, latency needs, idle capacity, storage, networking, observability, and engineering overhead, then divide total monthly serving cost by successful requests.

When is API inference cheaper than self-hosting?

API inference is often cheaper when traffic is uncertain, volume is low, operations tolerance is low, or the team does not need deep model/runtime control.

When can self-hosted inference make sense?

Self-hosted inference can make sense when volume is high, utilization is predictable, model/runtime control matters, and the team can operate the serving stack safely.

Sources

AI inference cost quiz

Get an AI compute cost read

Use APIs when usage is uncertain or ops tolerance is low; consider managed or self-hosted GPUs when volume, latency, data control, or model requirements justify the overhead.

Uses actual request volume, latency, GPU need, data movement, priority, and ops tolerance.
Start the AI compute read