Self-Hosted LLM Inference Cost: What to Include

Short answer: The GPU hourly rate is only the starting point for self-hosted LLM inference cost; warm capacity, utilization, storage, networking, monitoring, reliability, upgrades, and team time all belong in the estimate.

Decision rule

Call self-hosting cheaper only when total monthly serving cost per successful request beats API or managed inference after operations are included.
Verify current provider pricing directly before buying or migrating.
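The decision rule can be written as a small comparison. This is a minimal sketch: the function names and the monthly figures are hypothetical placeholders, not provider prices, and each option gets its own successful-request count in case failure rates differ.

```python
def cost_per_successful_request(total_monthly_cost: float, successful_requests: int) -> float:
    """Total monthly serving cost divided by successful requests."""
    if successful_requests <= 0:
        raise ValueError("successful_requests must be positive")
    return total_monthly_cost / successful_requests


def self_hosting_is_cheaper(
    self_hosted_monthly_cost: float,
    self_hosted_successful_requests: int,
    alternative_monthly_cost: float,
    alternative_successful_requests: int,
) -> bool:
    """Apply the decision rule: compare fully loaded cost per successful request.

    self_hosted_monthly_cost must already include GPU spend, shared
    infrastructure, and operations overhead.
    """
    self_hosted = cost_per_successful_request(
        self_hosted_monthly_cost, self_hosted_successful_requests
    )
    alternative = cost_per_successful_request(
        alternative_monthly_cost, alternative_successful_requests
    )
    return self_hosted < alternative


# Hypothetical monthly figures in USD, for illustration only.
fully_loaded_case = self_hosting_is_cheaper(12_000, 2_000_000, 9_000, 2_000_000)
lean_case = self_hosting_is_cheaper(6_000, 2_000_000, 9_000, 2_000_000)
print(fully_loaded_case, lean_case)
```

The comparison is only as good as its inputs: if the self-hosted figure leaves out operations, the function will answer a different question than the rule asks.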

Next action

Include the full serving stack

Use this page when GPU hourly rate is only one line in the estimate. Then run the calculator with warm hours, utilization, shared infrastructure, and operations overhead included.

By Andrew Cooper, Founder of RunPlacement. Updated May 2026. Provider-neutral, estimate-labeled guidance. Verify current provider pricing.

Right fit

You have high or steady inference volume.
Model/runtime control, data control, or latency requirements matter.
The team can own deployment, monitoring, rollback, and incidents.

Quick checks

Estimate warm GPU hours and useful utilization.
Include storage, network transfer, observability, and support tools.
Assign real engineering hours for serving operations and incidents.

Rough math

Self-hosted monthly cost = GPU spend (warm GPU hours × hourly rate) + shared infrastructure + operations overhead.
Useful GPU-hour cost = GPU spend / useful inference GPU-hours.
Effective request cost = total monthly serving cost / successful requests.
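The three formulas above can be worked through in a few lines. This is a rough sketch under stated assumptions: the class name, field names, and every input value are hypothetical, GPU spend is taken as warm GPU hours times hourly rate, useful GPU-hours as warm hours times utilization, and operations overhead as engineering hours times a loaded hourly cost.

```python
from dataclasses import dataclass


@dataclass
class SelfHostedEstimate:
    warm_gpu_hours: float         # hours per month the GPUs are paid for
    gpu_hourly_rate: float        # USD per GPU-hour (verify with the provider)
    utilization: float            # fraction of warm hours doing useful inference
    shared_infrastructure: float  # monthly storage, network, observability, tools
    ops_hours: float              # monthly engineering hours for serving and incidents
    ops_hourly_cost: float        # loaded cost per engineering hour
    successful_requests: int      # requests that returned a usable result

    @property
    def gpu_spend(self) -> float:
        return self.warm_gpu_hours * self.gpu_hourly_rate

    @property
    def operations_overhead(self) -> float:
        return self.ops_hours * self.ops_hourly_cost

    @property
    def monthly_cost(self) -> float:
        # Self-hosted monthly cost = GPU spend + shared infrastructure + operations overhead
        return self.gpu_spend + self.shared_infrastructure + self.operations_overhead

    @property
    def useful_gpu_hour_cost(self) -> float:
        # Useful GPU-hour cost = GPU spend / useful inference GPU-hours
        return self.gpu_spend / (self.warm_gpu_hours * self.utilization)

    @property
    def effective_request_cost(self) -> float:
        # Effective request cost = total monthly serving cost / successful requests
        return self.monthly_cost / self.successful_requests


# Hypothetical inputs, for illustration only.
estimate = SelfHostedEstimate(
    warm_gpu_hours=720,
    gpu_hourly_rate=2.50,
    utilization=0.40,
    shared_infrastructure=600,
    ops_hours=40,
    ops_hourly_cost=100,
    successful_requests=1_500_000,
)
print(f"Monthly cost: ${estimate.monthly_cost:,.2f}")
print(f"Useful GPU-hour cost: ${estimate.useful_gpu_hour_cost:.2f}")
print(f"Effective request cost: ${estimate.effective_request_cost:.5f}")
```

With these placeholder inputs, operations overhead is larger than GPU spend, which is the pattern the estimate is meant to surface.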

Red flags

The estimate compares API token spend to GPU hourly rate alone.
Utilization is unknown or low.
No one owns model serving incidents, upgrades, or rollback.

What to do next

Open the AI inference cost calculator.
Read the AI inference cost model.
Compare against API versus self-hosted inference.
Use the AI inference cost checklist before buying capacity.

Related resources

Use a worksheet before making the call

These supporting pages turn the decision into fields a buyer, engineer, or founder can actually compare.

AI inference cost: AI Inference Cost Checklist (Checklist / 8 sections / source-linked)

A practical checklist for estimating AI inference cost across APIs, managed inference, self-hosted GPUs, batch jobs, realtime endpoints, and hybrid routing.

AI inference cost: AI Inference Cost Assumptions Index (Research index / 4 sections / source-linked)

A source-backed index of the workload assumptions to collect before estimating API, managed inference, batch, GPU cloud, or self-hosted GPU cost.

AI inference cost: Realtime vs Batch Inference Cost Research Guide (Research guide / 7 sections / source-linked)

A source-backed guide to deciding when realtime, asynchronous, batch, or hybrid inference changes effective AI serving cost.

Related decisions

Keep narrowing the placement question

Follow the adjacent pages when the first answer exposes a deeper cost driver or operating constraint.

AI inference cost: API vs Self-Hosted Inference: Which Costs Less? (Commercial comparison)

API inference usually wins for uncertain or low-volume workloads; self-hosted inference can win when volume, utilization, latency, or control needs justify GPU operations.

AI inference cost: Self-Hosted Inference Break-Even: Directional Framework (Break-even framework)

Self-hosted inference reaches break-even only when optimized API or managed cost is higher than fully loaded GPU serving cost at realistic utilization.

AI inference cost: GPU Utilization for Inference: Why Useful Hours Matter (Cost explanation)

GPU utilization matters for inference because paid warm capacity can sit idle between requests, peaks, batches, deploys, or failures.

AI inference cost

When the GPU question is really serving cost

Use these pages when the same GPU quote, idle-cost, or useful GPU-hour question is about production inference rather than one-off training.

AI Cost Calculator (Estimator landing page): Start with broad AI cost, then narrow to API, managed inference, GPU, batch, realtime, or hybrid serving.
AI Inference Cost Calculator (Interactive calculator): Compare API, managed inference, and self-hosted GPU cost per successful request.
AI Cost Comparison (Decision page): Compare serving categories before ranking providers or quotes.
API vs Self-Hosted Inference (Decision tree): Decide when API simplicity, managed serving, or self-hosted GPU control fits.
AI Cost Optimization (Optimization guide): Check output length, retries, routing, caching, batching, and utilization before rebuilding inference.
AI Costs Increasing (Triage page): Find the driver before moving off APIs, switching platforms, or buying GPUs.
Realtime vs Batch Research (Research guide): Decide when queueing, delay tolerance, and avoided warm capacity can change inference cost.
Inference Cost Per Request (Formula page): Use monthly serving cost divided by successful requests as the common comparison unit.
AI Inference Cost Model (Framework): Normalize serving options by monthly cost and successful requests.

Framework

Use the underlying decision model

These framework pages define the terms and formulas behind this specific decision.

AI Inference Cost Model (AI inference cost)

AI inference cost should be compared as effective cost per successful request and monthly serving cost, not just token price or GPU hourly rate.

AI inference cost quiz

Get an AI compute cost read

Call self-hosting cheaper only when total monthly serving cost per successful request beats API or managed inference after operations are included.

Uses actual request volume, latency, GPU need, data movement, priority, and ops tolerance.

Start the AI compute read

FAQ

What should self-hosted LLM inference cost include?

Self-hosted LLM inference cost should include GPU hours, idle capacity, utilization, storage, networking, observability, support tooling, deployment work, upgrades, rollback, and incident ownership. It should also include shared infrastructure and engineering time, because those costs often decide whether self-hosting is actually cheaper.

When is self-hosted LLM inference worth considering?

Self-hosted LLM inference is worth considering when volume is high, utilization is predictable, control matters, and the team can operate the serving stack safely. It is less compelling when demand is uncertain, outputs are still changing, or the cost issue can be reduced with caching, routing, or batching.

What is the common self-hosting cost mistake?

The common self-hosting cost mistake is excluding idle GPU capacity and engineering overhead. A GPU hourly rate can look cheaper than API usage while warm capacity, low utilization, storage, networking, monitoring, upgrades, and incident response make the cost per successful request higher.
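To see why idle capacity matters, divide the hourly rate by utilization. This is a sketch with a hypothetical rate and hypothetical utilization levels. It follows from the useful GPU-hour formula if useful hours are assumed to equal warm hours times utilization, so the warm hours cancel.

```python
def useful_gpu_hour_cost(gpu_hourly_rate: float, utilization: float) -> float:
    """Cost of one GPU-hour that actually served inference.

    Derived from GPU spend / useful GPU-hours, where
    GPU spend = warm hours * rate and useful hours = warm hours * utilization.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in the range (0, 1]")
    return gpu_hourly_rate / utilization


# Hypothetical quoted rate in USD per GPU-hour, for illustration only.
HYPOTHETICAL_RATE = 2.00

for utilization in (0.9, 0.5, 0.2, 0.1):
    cost = useful_gpu_hour_cost(HYPOTHETICAL_RATE, utilization)
    print(f"{utilization:.0%} utilization: ${cost:.2f} per useful GPU-hour")
```

The quoted rate never changes in this loop; only the share of paid hours doing useful work does. Engineering overhead and shared infrastructure still have to be added on top before comparing against API spend.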
