Provider Pricing Page Field Audit

Short answer: Provider pricing pages are useful source material, but a placement decision still needs workload, capacity, data movement, and operations fields that may live outside the headline rate.

Estimate only

This is a decision checklist, not a final price quote.
Verify final numbers against provider pricing pages and your own bill or quote.

Next action

Replace defaults with source fields

Use official pricing and deployment pages to replace placeholders, then keep the estimate directional until it is checked against logs, bills, and quotes.

By Andrew Cooper, Founder of RunPlacement
Updated May 2026
Provider-neutral, estimate-labeled guidance
Verify current provider pricing

Use this when

A team is replacing calculator defaults with official source material.
A provider quote or pricing page is being compared against API, managed inference, batch, GPU cloud, or self-hosted GPU options.
The visible unit price does not explain storage, transfer, minimum capacity, provisioned throughput, support, or operations.

Not for

Provider rankings.
Live rate tables.
Benchmark or throughput claims.
Contract review or procurement approval.

Worksheet Fields

Use this as the working version before copying the decision into a doc, ticket, or vendor email.

Field	Capture	Serving modes affected
Token or request usage	Input units, output units, cached input, failed or filtered request rules, modality, and batch discounts.	API inference, serverless models, batch APIs.
Endpoint or compute time	Billable endpoint hours, instance hours, serverless duration, managed compute hours, and batch job duration.	Managed inference, SageMaker, GPU cloud, self-hosted GPU.
Capacity commitments	Minimum instances, provisioned throughput units, reservations, quota, region capacity, and commitment windows.	Provisioned throughput, managed platforms, self-hosted or direct GPU capacity.
Latency and batch mode	Standard, priority, realtime, async, batch, and queueing options.	Realtime inference, batch inference, managed inference, API inference.
Storage and networking	Payload storage, model artifacts, logs, persistent volumes, data transfer, egress, cross-region, and private endpoints.	Managed inference, GPU cloud, batch jobs, self-hosted serving.
Operations and support	Monitoring, rollback, support tier, incident owner, autoscaling, quotas, billing controls, and deployment limits.	Managed inference, GPU cloud, self-hosted GPU, production API workflows.

Source audit

Copy The Provider Field Audit

Use this when replacing calculator defaults with official provider pages, bills, logs, or quotes. The goal is to verify fields, not rank providers.

Field	Why it matters	Where it appears or may need verification	Serving modes affected	RunPlacement next step
Token or request units	Turns usage into billable API or model calls	Official pricing pages, API docs, tokenizer docs, logs	API inference, serverless models, batch APIs	AI cost per token
Input, cached input, and output	Input and output may price differently; caching can change the estimate	Official model pricing pages, prompt caching docs, request logs	API inference, managed model APIs	Inference cost per request
Failed or filtered request treatment	Not every paid attempt becomes useful output	Provider pricing notes, API logs, error handling docs	API inference, managed inference	LLM API bill too high
Endpoint or compute hours	Always-on endpoints and jobs may bill by time rather than token	Deployment docs, pricing pages, cloud bills	Managed inference, SageMaker, GPU cloud, self-hosted GPU	GPU utilization for inference
Serverless duration or request processing	Intermittent workloads may avoid idle capacity but still have duration and request rules	Serverless inference docs, pricing pages	Serverless inference, API serving	AI inference cost calculator
Batch or async mode	Queueing can change utilization, latency, and pricing mechanics	Batch API docs, asynchronous inference docs, batch transform docs	Batch inference, async inference, offline jobs	Batch inference cost savings
Minimum capacity	Minimum instances or deployment sizes can create a fixed baseline	Deployment docs, provider quote, quota pages	Managed inference, provisioned throughput, GPU cloud	Self-hosted inference break-even
Provisioned throughput or reservation	Reserved capacity is billed whether or not it is used, so the estimate depends on how much sustained traffic fills it	Provisioned throughput docs, reservations, quota pages	Azure Foundry PTU, managed inference, high-scale production	Managed inference vs GPU cloud
Region or global scope	Region, global routing, quota, and currency can change availability or costs	Pricing page selectors, quota pages, deployment docs	Cloud API, managed inference, GPU cloud	Provider quote checklist
Storage and artifacts	Payloads, model artifacts, logs, snapshots, and volumes sit outside headline inference rates	Cloud storage docs, endpoint docs, bills, quotes	Managed inference, batch jobs, GPU cloud, self-hosted GPU	AI inference cost checklist
Data transfer and networking	Ingress, egress, cross-region, private networking, and payload movement can change placement	Cloud networking docs, bills, quotes	Managed inference, GPU cloud, cloud migration	Workload placement worksheet
Support and operations	Support tier, incident ownership, monitoring, rollback, and billing controls affect production cost	Provider docs, support plans, quote terms, team ownership	Managed inference, GPU cloud, self-hosted GPU	API vs self-hosted inference
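
The token and failed-request rows above can be combined into one number. The following is a minimal sketch, not a provider formula: the rates are placeholders rather than real prices, and the names `TokenRates` and `cost_per_successful_request` are hypothetical. It assumes failed attempts consume the same tokens as successful ones, which should be checked against provider pricing notes and your own logs.

```python
from dataclasses import dataclass


@dataclass
class TokenRates:
    """Placeholder rates in currency units per 1M tokens (NOT real prices)."""
    input_per_m: float
    cached_input_per_m: float
    output_per_m: float


def cost_per_successful_request(
    rates: TokenRates,
    input_tokens: int,          # uncached input tokens per attempt
    cached_input_tokens: int,   # cached input tokens per attempt
    output_tokens: int,         # output tokens per attempt
    attempts: int,
    successes: int,
    failed_attempts_billed: bool = True,  # assumption: verify with the provider
) -> float:
    """Spread billable spend over successful requests only."""
    if successes <= 0 or successes > attempts:
        raise ValueError("successes must be between 1 and attempts")
    per_attempt = (
        input_tokens * rates.input_per_m
        + cached_input_tokens * rates.cached_input_per_m
        + output_tokens * rates.output_per_m
    ) / 1_000_000
    billable_attempts = attempts if failed_attempts_billed else successes
    return per_attempt * billable_attempts / successes


# Placeholder inputs for illustration only.
rates = TokenRates(input_per_m=1.0, cached_input_per_m=0.5, output_per_m=4.0)
billed = cost_per_successful_request(rates, 1000, 2000, 500, attempts=1000, successes=900)
not_billed = cost_per_successful_request(
    rates, 1000, 2000, 500, attempts=1000, successes=900, failed_attempts_billed=False
)
```

Comparing `billed` with `not_billed` shows how much the failed or filtered request rule alone moves the estimate for a given success rate.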

AI prompt

Prompt To Review Provider Pricing Fields

Paste provider pricing-page notes or quote excerpts into this prompt. It should find missing fields without making a provider recommendation.

You are helping me audit provider pricing-page fields for an AI inference cost estimate. Do not rank providers. Do not assume current rates. Do not make benchmark, latency, throughput, or reliability claims unless I provide source data.

Here are the provider source notes:
[Paste official pricing-page notes, deployment docs, quote terms, bill details, or logs here]

Please:
1. Identify which fields are present: token/request units, input/output, endpoint hours, batch/async mode, minimum capacity, provisioned throughput, region, storage, networking, support, and operations.
2. Identify which fields are missing or need direct provider verification.
3. Explain which missing fields could change the calculator result or serving-mode decision.
4. Separate official pricing-page facts, provider quote terms, workload assumptions, and unknowns.
5. Recommend the next RunPlacement page to use: assumptions index, calculator, checklist, cost model, managed inference vs GPU cloud, batch vs realtime, or API vs self-hosted.
6. Keep the output provider-neutral and avoid current pricing, benchmark, or provider-ranking claims unless supplied in the source notes.

Short Answer

Official provider pricing pages are necessary source material, but they do not always expose every field needed for a placement decision.
A neutral field audit asks what must be verified, not which provider is cheapest.
Use this page to replace calculator defaults with current provider pages, bills, logs, and quotes before making a buying decision.

What This Audit Compares

It compares fields users should verify: token or request units, endpoint or compute time, capacity commitments, latency or batch mode, storage and networking, and operations or support.
It does not compare current prices, publish a rate snapshot, or rank providers.
It treats official pages as starting points and reminds teams to validate missing fields with bills, logs, quotes, and provider conversations.

Official Source Notes

OpenAI publishes model pricing in token-based units, with fields such as input, cached input, and output, plus batch or priority modes, on its pricing page.
AWS SageMaker documentation separates real-time, serverless, asynchronous, and batch inference patterns; the right field depends on latency, payload size, and whether a persistent endpoint is needed.
Google Cloud Generative AI pricing includes request, token, modality, cached-input, and batch-related fields; the source page also notes currency and SKU considerations.
Azure Foundry separates serverless/pay-as-you-go model access, managed compute, batch options, and provisioned throughput; provisioned deployments reserve capacity whether or not requests are being made.
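
Because provisioned or always-on capacity is billed whether or not requests arrive, it helps to know the request volume at which a fixed monthly charge matches usage-based billing. The sketch below is a rough illustration under stated assumptions: the hourly rate, the 730-hour month, and the per-request cost are placeholders, the function names are hypothetical, and it ignores whether the reserved capacity can actually serve that volume.

```python
def monthly_fixed_cost(hourly_rate: float, units: int, hours_per_month: float = 730.0) -> float:
    """Fixed monthly charge for capacity that bills by time, used or not."""
    return hourly_rate * units * hours_per_month


def break_even_requests(fixed_monthly_cost: float, usage_cost_per_request: float) -> float:
    """Monthly requests at which usage-based billing equals the fixed charge."""
    if usage_cost_per_request <= 0:
        raise ValueError("usage_cost_per_request must be positive")
    return fixed_monthly_cost / usage_cost_per_request


# Placeholder inputs, not provider rates.
fixed = monthly_fixed_cost(hourly_rate=2.0, units=1)
threshold = break_even_requests(fixed, usage_cost_per_request=0.004)
```

Below `threshold` requests per month, the usage-billed option costs less in this simplified model; above it, the fixed option does. Storage, networking, support, and operations fields still sit outside both sides of the comparison.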

Trust Boundary

Verify current rates directly on provider pages before buying, migrating, or presenting a business case.
Use provider docs for source fields, not for a RunPlacement ranking.
Use calculator results as directional estimates until populated with current source data and workload-specific logs.

FAQ

Does this audit say which provider is cheapest?

No. It identifies fields to verify on official pricing and deployment pages so teams can build their own current, workload-specific estimate.

Why avoid publishing provider rate tables?

Rates, model availability, regions, quotas, and terms change. A durable RunPlacement page should explain what to verify and where the field affects the estimate.

How should I use provider pricing pages with the calculator?

Use official pages to replace defaults for token usage, endpoint hours, batch mode, provisioned capacity, storage, network, and support assumptions, then verify the result against your own logs, bills, and quotes.
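
One way to keep track of which calculator inputs are still defaults is to store a source label next to each value. This is a minimal sketch with hypothetical names (`Assumption`, `apply_sources`, `estimate_label`) and made-up example values; it is not how the RunPlacement calculator works internally.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Assumption:
    value: float
    source: str = "default"  # e.g. "pricing page", "bill", "log", "quote"


def apply_sources(
    defaults: dict[str, float],
    sourced: dict[str, tuple[float, str]],
) -> dict[str, Assumption]:
    """Overlay sourced values on calculator defaults, keeping provenance."""
    merged = {name: Assumption(value) for name, value in defaults.items()}
    for name, (value, source) in sourced.items():
        if name not in merged:
            raise KeyError(f"unknown assumption: {name}")
        merged[name] = Assumption(value, source)
    return merged


def estimate_label(assumptions: dict[str, Assumption]) -> str:
    """Label the estimate as directional while any default remains."""
    remaining = sorted(n for n, a in assumptions.items() if a.source == "default")
    if remaining:
        return "directional (defaults remain: " + ", ".join(remaining) + ")"
    return "sourced (still verify against logs, bills, and quotes)"


# Hypothetical example values.
defaults = {"endpoint_hours": 730.0, "output_tokens_per_request": 500.0, "storage_gb": 50.0}
sourced = {"endpoint_hours": (400.0, "bill"), "output_tokens_per_request": (620.0, "log")}
assumptions = apply_sources(defaults, sourced)
label = estimate_label(assumptions)
```

Keeping the source label with the value makes it easier to show, in a doc or ticket, which numbers came from a pricing page, a bill, a log, or a quote.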
