AI compute cost tool

AI inference cost calculator

Short answer: Use this calculator to normalize API inference, managed inference, and self-hosted GPU serving into monthly cost and cost per successful request.

Estimate only
  • Defaults are hypothetical placeholders.
  • Replace prices with provider pages, bills, logs, or quotes before deciding.
  • No model quality, latency, throughput, or benchmark claims.
By Andrew Cooper, Founder of RunPlacement Updated May 2026 Provider-neutral, estimate-labeled guidance Verify current provider pricing

Scenario presets

Start from a common workload shape

Pick one, then replace the defaults with your current pricing, logs, bills, and quotes. The URL updates as you edit, so a scenario can be shared and reopened.

Traffic and usage
API inference
Managed inference
Self-hosted GPU

Sample output

Sample AI Compute Cost Read

Hypothetical production realtime LLM feature. This is the kind of directional read the calculator and worksheet are designed to produce, not a quote or benchmark.

Directional read

API may still be simplest until usage is less uncertain.

Managed or self-hosted serving only becomes plausible after volume, utilization, latency, and operations ownership are clearer.

Sensitive variables

  • Request volume and peak-to-average traffic.
  • Output size, retries, and failed calls.
  • Warm hours, useful utilization, and ops overhead.

Missing assumptions

  • Current provider pricing and API invoice data.
  • Latency target, real logs, failure rate, and quote terms.
  • Who owns deploys, monitoring, rollback, and incidents.

Next data to collect

  • 7-day request sample and p95 output size.
  • Retry/timeouts, managed serving quote, and GPU quote.
  • Ops owner and minimum reliability requirement.

Start here

Start with your situation

Pick the situation closest to what is happening right now. Each path leads to a calculator, worksheet, or decision page with the next useful step.

Use this calculator

How to treat the estimate

This is a directional calculator for API, managed inference, and self-hosted GPU serving cost. Default values are hypothetical placeholders, not current provider pricing or benchmark claims.

The formula visual below explains why API, managed inference, and self-hosted GPU should be normalized to successful requests. For a serious estimate, replace defaults with current provider pricing docs, observed request logs, bill data, and quotes. Useful references include OpenAI pricing, AWS SageMaker deployment docs, Vertex AI prediction docs, and NVIDIA DCGM docs.

Infographic showing inference cost per request as total monthly serving cost divided by successful inference requests, with API, managed, GPU, retries, shared infrastructure, warm capacity, and operations overhead in the numerator and successful useful outputs in the denominator.
Inference cost per request is a reusable comparison unit for API, managed inference, and self-hosted GPU serving.

How to use the calculator

Start with logs or a realistic launch forecast. Put current provider pricing into the API fields, quote terms into the managed fields, and the actual GPU offer into the self-hosted fields. Treat the result as a shortlist filter, not a final procurement model.

What the comparison includes

  • API token-style usage with retry allowance.
  • Managed inference minimum serving hours, instance count, and platform fees.
  • Self-hosted GPU warm capacity, utilization, storage, networking, observability, and engineering overhead.

What can change the answer

  • Batching, caching, small-model routing, or prompt reduction can lower API spend before infrastructure changes.
  • Higher utilization can make direct GPU capacity more plausible.
  • Strict realtime latency can make warm capacity unavoidable.
  • Low operations tolerance can make managed inference cheaper in practice even when raw infrastructure looks cheaper.

Useful sources to replace defaults

Next pages

Turn the estimate into a decision

Use the model and decision pages once the calculator exposes the sensitive variables.

RunPlacement quiz

Pressure-test this workload

Use the calculator to find the sensitive variables, then run the quiz with the actual workload shape.

Uses workload type, budget, GPU need, data movement, priority, and ops tolerance.
Use the quiz