Self-Hosted LLM Inference Cost: What to Include

Short answer: The GPU hourly rate is only the starting point for self-hosted LLM inference cost; warm capacity, utilization, storage, networking, monitoring, reliability, upgrades, and team time all belong in the estimate.

Decision rule
  • Call self-hosting cheaper only when total monthly serving cost per successful request beats API or managed inference after operations are included.
  • Verify current provider pricing directly before buying or migrating.

By Andrew Cooper, Founder of RunPlacement. Updated May 2026. Provider-neutral, estimate-labeled guidance.

Right fit

  • You have high or steady inference volume.
  • Model/runtime control, data control, or latency requirements matter.
  • The team can own deployment, monitoring, rollback, and incidents.

Quick checks

  • Estimate warm GPU hours and useful utilization.
  • Include storage, network transfer, observability, and support tools.
  • Assign real engineering hours for serving operations and incidents.
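
The quick checks above can be turned into three numbers before any pricing is looked up. The following is a minimal sketch, not a definitive method: the class and function names, the 730-hour month, and every input value are illustrative assumptions, not figures from a provider.

```python
from dataclasses import dataclass

HOURS_PER_MONTH = 730  # assumption: average month length (8760 hours / 12)


@dataclass
class WarmCapacity:
    """Hypothetical description of the GPUs kept running to serve traffic."""
    replicas: int                      # serving replicas kept warm
    gpus_per_replica: int              # GPUs each replica occupies
    hours_warm: float = HOURS_PER_MONTH

    def warm_gpu_hours(self) -> float:
        # Every warm GPU is billed whether or not it is serving requests.
        return self.replicas * self.gpus_per_replica * self.hours_warm


def useful_utilization(busy_gpu_hours: float, warm_gpu_hours: float) -> float:
    """Share of warm GPU-hours actually spent on inference."""
    if warm_gpu_hours <= 0:
        raise ValueError("warm_gpu_hours must be positive")
    return busy_gpu_hours / warm_gpu_hours


def operations_cost(engineering_hours: float, loaded_hourly_rate: float) -> float:
    """Engineering time for serving operations and incidents, priced in money."""
    return engineering_hours * loaded_hourly_rate


# Illustrative inputs only; replace with measured values.
capacity = WarmCapacity(replicas=2, gpus_per_replica=1)
warm_hours = capacity.warm_gpu_hours()
utilization = useful_utilization(busy_gpu_hours=438, warm_gpu_hours=warm_hours)
ops_cost = operations_cost(engineering_hours=20, loaded_hourly_rate=100.0)

print(f"warm GPU-hours: {warm_hours:.0f}")
print(f"useful utilization: {utilization:.0%}")
print(f"operations cost: {ops_cost:.2f}")
```

Busy GPU-hours would normally come from serving metrics rather than a guess; if that number is not available, the utilization figure is unknown and the estimate should be treated as incomplete.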

Rough math

  • Self-hosted monthly cost = GPU spend (warm GPU hours × hourly rate) + shared infrastructure + operations overhead.
  • Useful GPU-hour cost = GPU spend / useful inference GPU-hours.
  • Effective request cost = total monthly serving cost / successful requests.
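
As a rough illustration, the three formulas above can be written as small functions. This sketch assumes GPU spend means warm GPU hours multiplied by an hourly rate; the function names and all input values are hypothetical placeholders, not real provider prices.

```python
def monthly_serving_cost(gpu_hours: float, gpu_hourly_rate: float,
                         shared_infrastructure: float,
                         operations_overhead: float) -> float:
    """Self-hosted monthly cost: GPU spend plus shared infrastructure plus operations."""
    gpu_spend = gpu_hours * gpu_hourly_rate
    return gpu_spend + shared_infrastructure + operations_overhead


def useful_gpu_hour_cost(gpu_spend: float, useful_gpu_hours: float) -> float:
    """GPU spend divided by the GPU-hours that actually served inference."""
    if useful_gpu_hours <= 0:
        raise ValueError("useful_gpu_hours must be positive")
    return gpu_spend / useful_gpu_hours


def effective_request_cost(total_monthly_cost: float,
                           successful_requests: int) -> float:
    """Total monthly serving cost divided by successful requests."""
    if successful_requests <= 0:
        raise ValueError("successful_requests must be positive")
    return total_monthly_cost / successful_requests


# Illustrative inputs only (all assumptions).
gpu_hours = 1460            # warm GPU-hours billed in the month
rate = 2.00                 # hypothetical hourly price per GPU
gpu_spend = gpu_hours * rate

total = monthly_serving_cost(gpu_hours, rate,
                             shared_infrastructure=300.0,
                             operations_overhead=2000.0)
useful_cost = useful_gpu_hour_cost(gpu_spend, useful_gpu_hours=438)
per_request = effective_request_cost(total, successful_requests=1_000_000)

print(f"total monthly serving cost: {total:.2f}")
print(f"cost per useful GPU-hour:   {useful_cost:.2f}")
print(f"cost per successful request: {per_request:.5f}")
```

With these placeholder inputs, the cost per useful GPU-hour comes out at more than three times the nominal hourly rate, which is why the hourly rate alone understates the real cost when utilization is low.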

Red flags

  • The estimate compares API token spend to GPU hourly rate alone.
  • Utilization is unknown or low.
  • No one owns model serving incidents, upgrades, or rollback.
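
The first two red flags can be shown with a short calculation. The sketch below is built on assumptions: the helper names, the API price per request, and every other figure are invented for illustration. It compares a GPU-only estimate with a full estimate against the same API price, then shows how the effective hourly rate moves with utilization.

```python
def per_request(monthly_cost: float, successful_requests: int) -> float:
    """Monthly cost divided by successful requests."""
    if successful_requests <= 0:
        raise ValueError("successful_requests must be positive")
    return monthly_cost / successful_requests


def effective_hourly_rate(gpu_hourly_rate: float, utilization: float) -> float:
    """Price of one useful GPU-hour when only a fraction of warm time is used."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return gpu_hourly_rate / utilization


# Illustrative inputs only (all assumptions).
GPU_SPEND = 2920.0             # warm GPU hours x hypothetical hourly rate
SHARED_INFRA = 300.0           # storage, networking, observability, tooling
OPS_OVERHEAD = 2000.0          # engineering hours priced in money
SUCCESSFUL_REQUESTS = 1_000_000
API_COST_PER_REQUEST = 0.004   # hypothetical API or managed price

gpu_only = per_request(GPU_SPEND, SUCCESSFUL_REQUESTS)
full = per_request(GPU_SPEND + SHARED_INFRA + OPS_OVERHEAD, SUCCESSFUL_REQUESTS)

gpu_only_says_cheaper = gpu_only < API_COST_PER_REQUEST
full_says_cheaper = full < API_COST_PER_REQUEST

print(f"GPU-only estimate: {gpu_only:.5f} per request, cheaper: {gpu_only_says_cheaper}")
print(f"Full estimate:     {full:.5f} per request, cheaper: {full_says_cheaper}")

# Low or unknown utilization inflates the price of each useful GPU-hour.
for u in (0.1, 0.3, 0.6):
    print(f"utilization {u:.0%}: {effective_hourly_rate(2.00, u):.2f} per useful GPU-hour")
```

In this made-up case the GPU-only estimate suggests self-hosting is cheaper, while the full estimate reverses the conclusion. Real inputs may point either way, which is the reason to run both.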

What to do next

  • Open the AI inference cost calculator.
  • Read the AI inference cost model.
  • Compare against API versus self-hosted inference.
  • Use the AI inference cost checklist before buying capacity.

FAQ

What should self-hosted LLM inference cost include?

Include GPU hours, idle capacity, utilization, storage, networking, observability, support tooling, deployment work, upgrades, rollback, and incident ownership.

When is self-hosted LLM inference worth considering?

Consider it when volume is high, utilization is predictable, control matters, and the team can operate the serving stack safely.

What is the common self-hosting cost mistake?

The common mistake is excluding idle GPU capacity and engineering overhead from the comparison.
