AI inference cost / RunPlacement framework

AI Inference Cost Model

Direct answer: AI inference cost should be compared as effective cost per successful request and monthly serving cost, not just token price or GPU hourly rate.

Decision rule

Use APIs when usage is uncertain or ops tolerance is low; consider managed or self-hosted GPUs when volume, latency, data control, or model requirements justify the overhead.
Use provider pricing pages and your own bill or quote before making a purchase or migration decision.

Next action

Use the formula

Start with total monthly serving cost divided by successful requests. Then compare API, managed inference, and self-hosted GPU only after idle capacity and operations are included.

Open calculator

By Andrew Cooper, Founder of RunPlacement Updated May 2026 Provider-neutral, estimate-labeled guidance Verify current provider pricing

Definition

AI inference cost

AI inference cost is the total cost of serving model outputs, including token/API charges or GPU time plus idle capacity, storage, networking, observability, reliability, and engineering overhead.

Effective inference cost = total monthly serving cost / successful inference requests.

Infographic showing inference cost per request as total monthly serving cost divided by successful inference requests, with API, managed, GPU, retries, shared infrastructure, warm capacity, and operations overhead in the numerator and successful useful outputs in the denominator. — Inference cost per request is a reusable comparison unit for API, managed inference, and self-hosted GPU serving.

Key idea

How to use the formula

AI inference cost should be compared as effective cost per successful request and monthly serving cost, not just token price or GPU hourly rate.

Use APIs when usage is uncertain or ops tolerance is low; consider managed or self-hosted GPUs when volume, latency, data control, or model requirements justify the overhead.

Example scenarios

Low-traffic prototype

An API can be the right starting point because there is no idle GPU baseline to carry.

High-volume batch inference

Batch jobs can justify GPU or managed batch serving when utilization is high and latency is flexible.

Production realtime endpoint

A self-hosted endpoint can lose if it keeps expensive warm capacity idle for traffic spikes.

Hybrid routing and caching

Cache hits, small-model routing, and batch queues can reduce cost without forcing every request through the largest model.

Decision Table

Option	Best use	Risk
API inference	Fast start, no GPU operations, clear usage billing	Can become expensive at scale or with long outputs
Managed inference	Autoscaling, batching, reliability help	Can hide platform premium and utilization assumptions
Self-hosted GPU	More control and possible scale economics	Adds idle capacity, networking, storage, and engineering work
Batch inference	Higher utilization and flexible scheduling	Not suitable for strict realtime latency

Practical companion

Turn the model into worksheet fields

The framework defines the comparison unit. The checklist captures the request, latency, utilization, warm-capacity, and operations inputs that make the formula usable.

WorksheetAI Inference Cost ChecklistCollect the fields before comparing API, managed inference, and self-hosted GPU. Research indexAssumptions IndexDefine which workload assumptions feed the formula before comparing serving modes. Source auditProvider Field AuditUse official pricing pages to replace defaults without copying stale rate tables. Interactive toolAI Inference Cost CalculatorUse the checklist fields to estimate directional monthly cost and cost per successful request.

AI inference cost quiz

Get an AI compute cost read

Use APIs when usage is uncertain or ops tolerance is low; consider managed or self-hosted GPUs when volume, latency, data control, or model requirements justify the overhead.

Uses actual request volume, latency, GPU need, data movement, priority, and ops tolerance.

Start the AI compute read

Related decisions

Apply the framework

Use these long-tail decision pages when a specific cost driver or provider choice is already visible.

AI inference costAI Cost Comparison: API, Managed Inference, GPU Cloud, and BatchCommercial comparison

A useful AI cost comparison compares serving categories by monthly cost, cost per successful request, latency, utilization, and operations burden, not by provider ranking.

AI inference costAI Cost Per Token: When Token Price Helps and When It MisleadsFormula guide

AI cost per token is useful for API estimates, but it can mislead when output length, retries, multi-step workflows, failed calls, or fixed serving capacity dominate cost.

AI inference costAI Costs Increasing? A Triage Checklist Before You MigrateCost triage

When AI costs increase, first separate normal usage growth from waste: longer outputs, retries, failed calls, tool loops, poor routing, missing caching, and always-warm capacity.

AI inference costAI Cost Optimization: Practical Levers Before Rebuilding InferenceOptimization guide

AI cost optimization usually starts with usage shape: reduce avoidable output, retries, failed calls, over-large prompts, expensive routing, and low utilization before changing infrastructure.

AI inference costAPI vs Self-Hosted Inference: Which Costs Less?Commercial comparison

API inference usually wins for uncertain or low-volume workloads; self-hosted inference can win when volume, utilization, latency, or control needs justify GPU operations.

AI inference costSelf-Hosted LLM Inference Cost: What to IncludeCost estimation

The GPU hourly rate is only the starting point for self-hosted LLM inference cost; warm capacity, utilization, storage, networking, monitoring, reliability, upgrades, and team time all belong in the estimate.

AI inference costSelf-Hosted Inference Break-Even: Directional FrameworkBreak-even framework

Self-hosted inference reaches break-even only when optimized API or managed cost is higher than fully loaded GPU serving cost at realistic utilization.

AI inference costLLM API Bill Too High? What to Check FirstCost triage

A high LLM API bill is usually a triage problem first: check whether output size, retries, tool calls, caching gaps, routing, or batchable work are driving the increase.

AI inference costInference Cost Per Request: Simple FormulaFormula

A useful inference cost per request starts with total monthly serving cost divided by successful inference requests, with failed calls and retries handled explicitly.

AI inference costBatch vs Realtime Inference Cost: How to ChooseCost estimation

Batch inference is often cheaper when latency is flexible because work can be queued for higher utilization; realtime inference costs more when warm capacity and strict latency are required.

AI inference costBatch Inference Cost Savings: When Queueing HelpsCost optimization

Batch inference can reduce cost when the work can wait, queueing raises utilization, and the system avoids always-warm realtime capacity.

AI inference costManaged Inference vs GPU Cloud: Cost and Control TradeoffsCommercial comparison

Managed inference can cost more on paper but win when autoscaling, batching, reliability, and lower ops burden reduce effective inference cost.

GPU pricingGPU Cloud Idle Cost: How to Price Wasted Accelerator TimeCost estimation

GPU cloud idle cost is the gap between paid accelerator time and useful workload progress. It matters most for training retries, batch queues, and inference fleets with low baseline utilization.

AI inference cost

When the GPU question is really serving cost

Use these pages when the same GPU quote, idle-cost, or useful GPU-hour question is about production inference rather than one-off training.

Estimator landing pageAI Cost CalculatorStart with broad AI cost, then narrow to API, managed inference, GPU, batch, realtime, or hybrid serving. Interactive calculatorAI Inference Cost CalculatorCompare API, managed inference, and self-hosted GPU cost per successful request. Decision pageAI Cost ComparisonCompare serving categories before ranking providers or quotes. Decision treeAPI vs Self-Hosted InferenceDecide when API simplicity, managed serving, or self-hosted GPU control fits. Optimization guideAI Cost OptimizationCheck output length, retries, routing, caching, batching, and utilization before rebuilding inference. Triage pageAI Costs IncreasingFind the driver before moving off APIs, switching platforms, or buying GPUs. Research guideRealtime vs Batch ResearchDecide when queueing, delay tolerance, and avoided warm capacity can change inference cost. Formula pageInference Cost Per RequestUse monthly serving cost divided by successful requests as the common comparison unit. FrameworkAI Inference Cost ModelNormalize serving options by monthly cost and successful requests.

Related resources

Turn the framework into a worksheet

These checklists make the concept easier to share and apply.

AI inference costAI Inference Cost ChecklistChecklist / 8 sections / source-linked

A practical checklist for estimating AI inference cost across APIs, managed inference, self-hosted GPUs, batch jobs, realtime endpoints, and hybrid routing.

AI inference costAI Inference Cost Assumptions IndexResearch index / 4 sections / source-linked

A source-backed index of the workload assumptions to collect before estimating API, managed inference, batch, GPU cloud, or self-hosted GPU cost.

AI inference costRealtime vs Batch Inference Cost Research GuideResearch guide / 7 sections / source-linked

A source-backed guide to deciding when realtime, asynchronous, batch, or hybrid inference changes effective AI serving cost.

FAQ

How should I estimate AI inference cost?

Estimate AI inference cost by calculating total monthly serving cost and cost per successful request. Include input and output usage, traffic pattern, retries, latency needs, idle capacity, storage, networking, observability, reliability work, and engineering overhead. Replace hypothetical defaults with current provider pricing, bills, logs, and quotes.

When is API inference cheaper than self-hosting?

API inference is usually cheaper when traffic is low, uncertain, bursty, or still changing. It also fits when the team has low operations tolerance or does not need deep model/runtime control. Self-hosting only becomes plausible after optimized API cost is compared with fully loaded GPU serving cost.

When can self-hosted inference make sense?

Self-hosted inference can make sense when volume is high, utilization is predictable, control matters, and the team can operate deployment, monitoring, upgrades, rollback, and incidents. The estimate should include idle warm capacity, shared infrastructure, storage, networking, observability, and engineering time before calling it cheaper.

Sources

AI inference cost quiz

Get an AI compute cost read

Use APIs when usage is uncertain or ops tolerance is low; consider managed or self-hosted GPUs when volume, latency, data control, or model requirements justify the overhead.

Uses actual request volume, latency, GPU need, data movement, priority, and ops tolerance.

Start the AI compute read