AI inference cost
Provider Pricing Page Field Audit
Short answer: Provider pricing pages are useful source material, but a placement decision still needs workload, capacity, data movement, and operations fields that may live outside the headline rate.
- This is a decision checklist, not a final price quote.
- Verify final numbers against provider pricing pages and your own bill or quote.
Next action
Replace defaults with source fields
Use official pricing and deployment pages to replace placeholders, then keep the estimate directional until it is checked against logs, bills, and quotes.
Use this when
- A team is replacing calculator defaults with official source material.
- A provider quote or pricing page is being compared against API, managed inference, batch, GPU cloud, or self-hosted GPU options.
- The visible unit price does not explain storage, transfer, minimum capacity, provisioned throughput, support, or operations.
Not for
- Provider rankings.
- Live rate tables.
- Benchmark or throughput claims.
- Contract review or procurement approval.
Worksheet Fields
Use this as the working version before copying the decision into a doc, ticket, or vendor email.
| Field | Capture | Serving modes affected |
|---|---|---|
| Token or request usage | Input units, output units, cached input, failed or filtered request rules, modality, and batch discounts. | API inference, serverless models, batch APIs. |
| Endpoint or compute time | Billable endpoint hours, instance hours, serverless duration, managed compute hours, and batch job duration. | Managed inference, SageMaker, GPU cloud, self-hosted GPU. |
| Capacity commitments | Minimum instances, provisioned throughput units, reservations, quota, region capacity, and commitment windows. | Provisioned throughput, managed platforms, self-hosted or direct GPU capacity. |
| Latency and batch mode | Standard, priority, realtime, async, batch, and queueing options. | Realtime inference, batch inference, managed inference, API inference. |
| Storage and networking | Payload storage, model artifacts, logs, persistent volumes, data transfer, egress, cross-region, and private endpoints. | Managed inference, GPU cloud, batch jobs, self-hosted serving. |
| Operations and support | Monitoring, rollback, support tier, incident owner, autoscaling, quotas, billing controls, and deployment limits. | Managed inference, GPU cloud, self-hosted GPU, production API workflows. |
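One way to keep the worksheet honest is to track which of the six fields are still empty before the decision is copied anywhere. The following is a minimal Python sketch under stated assumptions: the `WorksheetEntry` class, its attribute names, and the sample notes are hypothetical and only mirror the table rows above; they are not part of any RunPlacement tool.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class WorksheetEntry:
    """One entry per provider option; None means the field is not yet captured."""
    token_or_request_usage: Optional[str] = None
    endpoint_or_compute_time: Optional[str] = None
    capacity_commitments: Optional[str] = None
    latency_and_batch_mode: Optional[str] = None
    storage_and_networking: Optional[str] = None
    operations_and_support: Optional[str] = None


def missing_fields(entry: WorksheetEntry) -> list[str]:
    """Return the names of worksheet fields that are still empty."""
    return [f.name for f in fields(entry) if not getattr(entry, f.name)]


# Hypothetical notes for illustration only.
entry = WorksheetEntry(
    token_or_request_usage="input and output units noted; cached input not yet checked",
    capacity_commitments="minimum instance count taken from the quote",
)
print(missing_fields(entry))
```

Anything the function returns is a field to chase in provider docs, bills, logs, or quotes before the estimate is treated as more than directional.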
Source audit
Copy The Provider Field Audit
Use this when replacing calculator defaults with official provider pages, bills, logs, or quotes. The goal is to verify fields, not rank providers.
| Field | Why it matters | Where it appears or may need verification | Serving modes affected | RunPlacement next step |
|---|---|---|---|---|
| Token or request units | Turns usage into billable API or model calls | Official pricing pages, API docs, tokenizer docs, logs | API inference, serverless models, batch APIs | AI cost per token |
| Input, cached input, and output | Input and output may price differently; caching can change the estimate | Official model pricing pages, prompt caching docs, request logs | API inference, managed model APIs | Inference cost per request |
| Failed or filtered request treatment | Not every paid attempt becomes useful output | Provider pricing notes, API logs, error handling docs | API inference, managed inference | LLM API bill too high |
| Endpoint or compute hours | Always-on endpoints and jobs may bill by time rather than token | Deployment docs, pricing pages, cloud bills | Managed inference, SageMaker, GPU cloud, self-hosted GPU | GPU utilization for inference |
| Serverless duration or request processing | Intermittent workloads may avoid idle capacity but still have duration and request rules | Serverless inference docs, pricing pages | Serverless inference, API serving | AI inference cost calculator |
| Batch or async mode | Queueing can change utilization, latency, and pricing mechanics | Batch API docs, asynchronous inference docs, batch transform docs | Batch inference, async inference, offline jobs | Batch inference cost savings |
| Minimum capacity | Minimum instances or deployment sizes can create a fixed baseline | Deployment docs, provider quote, quota pages | Managed inference, provisioned throughput, GPU cloud | Self-hosted inference break-even |
| Provisioned throughput or reservation | Reserved capacity changes cost tolerance and traffic requirements | Provisioned throughput docs, reservations, quota pages | Azure Foundry PTU, managed inference, high-scale production | Managed inference vs GPU cloud |
| Region or global scope | Region, global routing, quota, and currency can change availability or costs | Pricing page selectors, quota pages, deployment docs | Cloud API, managed inference, GPU cloud | Provider quote checklist |
| Storage and artifacts | Payloads, model artifacts, logs, snapshots, and volumes sit outside headline inference rates | Cloud storage docs, endpoint docs, bills, quotes | Managed inference, batch jobs, GPU cloud, self-hosted GPU | AI inference cost checklist |
| Data transfer and networking | Ingress, egress, cross-region, private networking, and payload movement can change placement | Cloud networking docs, bills, quotes | Managed inference, GPU cloud, cloud migration | Workload placement worksheet |
| Support and operations | Support tier, incident ownership, monitoring, rollback, and billing controls affect production cost | Provider docs, support plans, quote terms, team ownership | Managed inference, GPU cloud, self-hosted GPU | API vs self-hosted inference |
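The audit can also be read in the other direction: start from a serving mode and list the fields that need verification. A minimal Python sketch, assuming a hand-copied subset of the audit rows; `AUDIT_FIELDS` and `fields_to_verify` are illustrative names, not an existing API.

```python
# Subset of the audit rows: field name -> serving modes it affects.
# Copied by hand for illustration; extend with the remaining rows as needed.
AUDIT_FIELDS = {
    "Token or request units": ["API inference", "serverless models", "batch APIs"],
    "Endpoint or compute hours": ["managed inference", "SageMaker", "GPU cloud", "self-hosted GPU"],
    "Minimum capacity": ["managed inference", "provisioned throughput", "GPU cloud"],
    "Data transfer and networking": ["managed inference", "GPU cloud", "cloud migration"],
}


def fields_to_verify(serving_mode: str) -> list[str]:
    """Return audit fields whose affected serving modes include the given mode."""
    mode = serving_mode.lower()
    return [
        field
        for field, modes in AUDIT_FIELDS.items()
        if mode in (m.lower() for m in modes)
    ]


print(fields_to_verify("GPU cloud"))
print(fields_to_verify("API inference"))
```

The lookup is exact-match on the mode name, so it only works if the mode labels are kept consistent with the audit rows.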
AI prompt
Prompt To Review Provider Pricing Fields
Paste provider pricing-page notes or quote excerpts into this prompt. It should find missing fields without making a provider recommendation.
You are helping me audit provider pricing-page fields for an AI inference cost estimate. Do not rank providers. Do not assume current rates. Do not make benchmark, latency, throughput, or reliability claims unless I provide source data.

Here are the provider source notes:

[Paste official pricing-page notes, deployment docs, quote terms, bill details, or logs here]

Please:

1. Identify which fields are present: token/request units, input/output, endpoint hours, batch/async mode, minimum capacity, provisioned throughput, region, storage, networking, support, and operations.
2. Identify which fields are missing or need direct provider verification.
3. Explain which missing fields could change the calculator result or serving-mode decision.
4. Separate official pricing-page facts, provider quote terms, workload assumptions, and unknowns.
5. Recommend the next RunPlacement page to use: assumptions index, calculator, checklist, cost model, managed inference vs GPU cloud, batch vs realtime, or API vs self-hosted.
6. Keep the output provider-neutral and avoid current pricing, benchmark, or provider-ranking claims unless supplied in the source notes.
Short Answer
- Official provider pricing pages are necessary source material, but they do not always expose every field needed for a placement decision.
- A neutral field audit asks what must be verified, not which provider is cheapest.
- Use this page to replace calculator defaults with current provider pages, bills, logs, and quotes before making a buying decision.
What This Audit Compares
- It compares fields users should verify: token or request units, endpoint or compute time, capacity commitments, latency or batch mode, storage and networking, and operations or support.
- It does not compare current prices, publish a rate snapshot, or rank providers.
- It treats official pages as starting points and reminds teams to validate missing fields with bills, logs, quotes, and provider conversations.
Official Source Notes
- OpenAI exposes model pricing by token-like units and includes fields such as input, cached input, output, and batch or priority modes on its pricing surface.
- AWS SageMaker documentation separates real-time, serverless, asynchronous, and batch inference patterns; which pattern applies, and therefore which billing fields to capture, depends on latency, payload size, and whether a persistent endpoint is needed.
- Google Cloud Generative AI pricing includes request, token, modality, cached-input, and batch-related fields; the source page also notes currency and SKU considerations.
- Azure Foundry separates serverless/pay-as-you-go model access, managed compute, batch options, and provisioned throughput; provisioned deployments reserve capacity whether or not requests are being made.
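The difference between usage-billed and reserved capacity is easiest to see with traffic set to zero. The sketch below is illustrative only: every rate is a placeholder rather than a provider price, the function names are hypothetical, and it assumes provisioned capacity bills per unit-hour over a 730-hour month, which should be checked against the provider's own terms.

```python
def pay_as_you_go_monthly(input_tokens, output_tokens,
                          input_rate_per_million, output_rate_per_million):
    """Usage-billed cost: scales with tokens and is zero when there is no traffic."""
    return ((input_tokens / 1_000_000) * input_rate_per_million
            + (output_tokens / 1_000_000) * output_rate_per_million)


def provisioned_monthly(units, hourly_rate_per_unit, hours_per_month=730):
    """Reserved-capacity cost: billed for every hour, regardless of traffic."""
    return units * hourly_rate_per_unit * hours_per_month


# Placeholder rates for illustration only; replace with verified source fields.
INPUT_RATE = 1.0    # hypothetical cost per million input tokens
OUTPUT_RATE = 2.0   # hypothetical cost per million output tokens
UNIT_RATE = 1.0     # hypothetical cost per provisioned unit-hour

busy_month = pay_as_you_go_monthly(100_000_000, 50_000_000, INPUT_RATE, OUTPUT_RATE)
idle_month = pay_as_you_go_monthly(0, 0, INPUT_RATE, OUTPUT_RATE)
reserved = provisioned_monthly(units=2, hourly_rate_per_unit=UNIT_RATE)

print(busy_month, idle_month, reserved)
```

With these placeholders the reserved figure is the same in a busy month and an idle one, which is why minimum capacity and commitment windows belong in the audit alongside the unit rate.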
Trust Boundary
- Verify current rates directly on provider pages before buying, migrating, or presenting a business case.
- Use provider docs for source fields, not for a RunPlacement ranking.
- Use calculator results as directional estimates until populated with current source data and workload-specific logs.
FAQ
Does this audit say which provider is cheapest?
No. It identifies fields to verify on official pricing and deployment pages so teams can build their own current, workload-specific estimate.
Why avoid publishing provider rate tables?
Rates, model availability, regions, quotas, and terms change. A durable RunPlacement page should explain what to verify and how each field affects the estimate.
How should I use provider pricing pages with the calculator?
Use official pages to replace defaults for token usage, endpoint hours, batch mode, provisioned capacity, storage, network, and support assumptions, then verify the result against your own logs, bills, and quotes.
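Replacing defaults is easier to audit when each input carries a note on where it came from. This is a minimal Python sketch, not the calculator's actual logic; `DEFAULTS`, `apply_sources`, `remaining_defaults`, and all values are hypothetical placeholders.

```python
# Hypothetical calculator defaults; names and values are placeholders.
DEFAULTS = {
    "monthly_requests": 1_000_000,
    "endpoint_hours": 730,
    "storage_gb": 100,
}


def apply_sources(defaults, sourced):
    """Overlay sourced values on defaults and record where each value came from."""
    merged = {}
    for name, default in defaults.items():
        if name in sourced:
            value, source = sourced[name]
            merged[name] = {"value": value, "source": source}
        else:
            merged[name] = {"value": default, "source": "default"}
    return merged


def remaining_defaults(merged):
    """List inputs that still rely on a calculator default."""
    return [name for name, item in merged.items() if item["source"] == "default"]


# Hypothetical sourced values: (value, where it was verified).
sourced = {
    "monthly_requests": (2_400_000, "request logs"),
    "endpoint_hours": (730, "cloud bill"),
}
estimate_inputs = apply_sources(DEFAULTS, sourced)
print(remaining_defaults(estimate_inputs))
```

As long as `remaining_defaults` returns anything, the result is best treated as directional.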
Sources
- https://platform.openai.com/docs/pricing
- https://aws.amazon.com/sagemaker/ai/pricing/
- https://docs.aws.amazon.com/sagemaker/latest/dg/async-inference.html
- https://docs.aws.amazon.com/sagemaker/latest/dg/inference-cost-optimization.html
- https://docs.aws.amazon.com/sagemaker/latest/dg/hosting-faqs.html
- https://cloud.google.com/gemini-enterprise-agent-platform/generative-ai/pricing
- https://azure.microsoft.com/en-us/pricing/details/ai-foundry-models/aoai/
- https://learn.microsoft.com/en-us/azure/foundry/openai/concepts/provisioned-throughput
AI inference cost quiz
Get an AI compute cost read
Use official pricing and deployment pages to collect source fields, then normalize the decision around monthly serving cost and cost per successful request.
Uses actual request volume, latency, GPU need, data movement, priority, and ops tolerance.
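Cost per successful request can be sketched as monthly serving cost divided by the requests that produced usable output. The function name and the figures below are hypothetical, and whether failed or filtered requests are billed at all is a field to verify with the provider.

```python
def cost_per_successful_request(monthly_cost, total_requests, failed_or_filtered):
    """Divide monthly serving cost by requests that produced usable output."""
    successful = total_requests - failed_or_filtered
    if successful <= 0:
        raise ValueError("no successful requests to normalize against")
    return monthly_cost / successful


# Placeholder figures for illustration only; take real ones from bills and logs.
unit_cost = cost_per_successful_request(
    monthly_cost=1200.0,
    total_requests=100_000,
    failed_or_filtered=4_000,
)
print(unit_cost)
```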