LLM API Cost Calculator

Model	Provider · Tier	Per call	Daily	Monthly	Annual	vs cheapest
DeepSeek V4 Pro (cheapest)	DeepSeek · Efficient frontier	$0.0009	$0.893	$26.80	$322	—
GPT-5.4 Mini	OpenAI · Efficient	$0.0031	$3.07	$92.14	$1,106	3.4×
Claude Haiku 4.5	Anthropic · Efficient	$0.0036	$3.60	$108	$1,294	4.0×
Gemini 3.5 Flash	Google · Efficient	$0.0061	$6.14	$184	$2,211	6.9×
GPT-5.4	OpenAI · Multimodal	$0.010	$10.24	$307	$3,686	11.5×
Claude Sonnet 4.6 (highest)	Anthropic · Balanced	$0.011	$10.79	$324	$3,883	12.1×

What this calculator actually does

This tool takes four inputs (input tokens per call, output tokens per call, calls per day, and cached input ratio) and returns the monthly cost of running that workload on every major LLM API we track. It is built for the operator who is about to ship a feature into production and needs an honest number, not a vendor's single-prompt estimate.

Most vendor calculators show you the cost of a single prompt and stop there. The interesting question for a real product is what the cost looks like at your monthly volume, with your input shape, after prompt caching kicks in. That is what this calculator returns.

How LLM pricing actually works

Every major API charges by the token, a unit roughly equal to 0.75 English words or 4 characters of typical English text. A short conversational reply might be 50–150 tokens; a long technical answer might be 1,500–3,000.

Vendors price input and output tokens separately, and output is almost always more expensive, typically 3–5× the input rate. Output costs more because each output token is generated one at a time, whereas the input prompt can be processed in parallel. For workloads with long prompts and short outputs (e.g., classification or Q&A over a long context), input cost dominates. For workloads with short prompts and long outputs (e.g., long-form generation), output cost dominates.
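The input/output split can be sketched as follows. The rates used here ($1.00 and $4.00 per million tokens) are hypothetical placeholders chosen to fall inside the 3–5× range, not any vendor's actual price, and the function name is illustrative only.

```python
def call_cost_breakdown(input_tokens: int, output_tokens: int,
                        input_usd_per_m: float, output_usd_per_m: float) -> tuple[float, float]:
    """Return (input_cost, output_cost) in USD for a single call."""
    input_cost = input_tokens * input_usd_per_m / 1_000_000
    output_cost = output_tokens * output_usd_per_m / 1_000_000
    return input_cost, output_cost


# Hypothetical rates in USD per million tokens (output priced at 4x input).
INPUT_RATE, OUTPUT_RATE = 1.00, 4.00

# Long prompt, short output (classification-style): input cost dominates.
classify_in, classify_out = call_cost_breakdown(4_000, 50, INPUT_RATE, OUTPUT_RATE)

# Short prompt, long output (long-form generation): output cost dominates.
longform_in, longform_out = call_cost_breakdown(200, 2_000, INPUT_RATE, OUTPUT_RATE)

print(f"classification: input ${classify_in:.4f}, output ${classify_out:.4f}")
print(f"long-form:      input ${longform_in:.4f}, output ${longform_out:.4f}")
```

Substituting the currently tracked rates for a given model shows which side of the bill to optimize first.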

The price spread between frontier and efficiency-focused models can be substantial. The live result table quantifies that spread for the token mix you enter; the remaining question is whether the quality difference is worth it for your use case.
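One way to express that spread is as a multiple of the cheapest option. The sketch below uses the monthly totals from the result table at the top of this page and assumes the "vs cheapest" column is each monthly cost divided by the lowest one, rounded to one decimal; the helper name is illustrative.

```python
# Monthly totals (USD) copied from the result table at the top of the page.
monthly_usd = {
    "DeepSeek V4 Pro": 26.80,
    "GPT-5.4 Mini": 92.14,
    "Claude Haiku 4.5": 108.00,
    "Gemini 3.5 Flash": 184.00,
    "GPT-5.4": 307.00,
    "Claude Sonnet 4.6": 324.00,
}


def vs_cheapest(costs: dict[str, float]) -> dict[str, float]:
    """Return each cost as a multiple of the lowest cost, rounded to 0.1."""
    floor = min(costs.values())
    return {name: round(cost / floor, 1) for name, cost in costs.items()}


multiples = vs_cheapest(monthly_usd)
for name, multiple in multiples.items():
    print(f"{name}: {multiple}x")
```

Under that assumption the computed multiples agree with the published column for every row.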

The four levers that change cost dramatically

1. Choice of model

The single biggest cost lever is model selection. For most production workloads in 2026, you have three reasonable tiers:

Frontier reasoning — reserved for hard tasks such as complex code reasoning, multi-step planning, and research-grade analysis.
Balanced general — a practical starting point for production workloads that need stronger reasoning without automatically paying for the highest tier.
High-efficiency — candidates for customer support, classification, structured extraction, and routing, subject to quality testing.

A common mistake is to use a frontier model where a balanced or efficient model would produce equivalent output. Test on your actual data before committing.

2. Prompt caching

Anthropic, OpenAI, Google, and DeepSeek all support some form of prompt caching, where repeated input tokens (typically a long system prompt or a stable knowledge base) are billed at a model-specific discounted rate.

If your workload has any of the following patterns, caching is likely to be one of your largest cost levers:

A long system prompt you send on every call.
A fixed knowledge base you send before each user query.
A few-shot examples block.
Multi-turn conversation where the early turns are stable.

The calculator above lets you set a "cached input ratio." If 60% of your input tokens are repeated, set it to 0.60 — the math then applies the cached rate to that portion.
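The cached-ratio math can be sketched as a blended input cost. The rates below ($1.00 per million fresh input tokens, $0.10 per million cached tokens) are hypothetical placeholders, and the sketch ignores any cache-write surcharge or minimum cacheable length a provider may apply.

```python
def blended_input_cost(input_tokens: int, cached_ratio: float,
                       input_usd_per_m: float, cached_usd_per_m: float) -> float:
    """Input cost in USD for one call when part of the input is billed at the cached rate."""
    if not 0.0 <= cached_ratio <= 1.0:
        raise ValueError("cached_ratio must be between 0 and 1")
    cached_tokens = input_tokens * cached_ratio
    fresh_tokens = input_tokens - cached_tokens
    return (fresh_tokens * input_usd_per_m + cached_tokens * cached_usd_per_m) / 1_000_000


# Hypothetical rates: $1.00/M for fresh input, $0.10/M for cached input.
cost_with_cache = blended_input_cost(4_000, 0.60, 1.00, 0.10)
cost_without_cache = blended_input_cost(4_000, 0.0, 1.00, 0.10)

print(f"with 60% cached: ${cost_with_cache:.5f} per call")
print(f"no caching:      ${cost_without_cache:.5f} per call")
```

With these placeholder rates, a 60% cached share cuts the input portion of the bill by a little more than half; the real saving depends on each model's tracked cached rate.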

3. Batch APIs

For workloads that don't require real-time response (overnight processing, content generation pipelines, evaluation runs), most providers offer a batch API at 50% of base price. OpenAI's batch API returns results within a 24-hour completion window; Anthropic and Gemini are similar. If your workload tolerates async, batch is the cheapest path that doesn't involve a model change.
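The effect of moving part of a workload to batch can be sketched as below. The 0.5 multiplier reflects the 50% figure above; the per-call cost, call volume, and batch share are hypothetical inputs, and the function name is illustrative.

```python
BATCH_MULTIPLIER = 0.5  # batch calls billed at 50% of base price


def daily_cost_with_batch(calls_per_day: int, cost_per_call_usd: float,
                          batch_share: float) -> float:
    """Daily cost in USD when `batch_share` of calls are sent through a batch API."""
    if not 0.0 <= batch_share <= 1.0:
        raise ValueError("batch_share must be between 0 and 1")
    batch_calls = calls_per_day * batch_share
    realtime_calls = calls_per_day - batch_calls
    return (realtime_calls * cost_per_call_usd
            + batch_calls * cost_per_call_usd * BATCH_MULTIPLIER)


# Hypothetical workload: 1,000 calls/day at $0.01 each, 70% tolerating async.
all_realtime = daily_cost_with_batch(1_000, 0.01, 0.0)
mostly_batch = daily_cost_with_batch(1_000, 0.01, 0.7)

print(f"all real-time: ${all_realtime:.2f}/day")
print(f"70% batched:   ${mostly_batch:.2f}/day")
```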

4. Self-hosting vs API

For sustained high-volume workloads on open models, self-hosting on rented or owned GPUs can beat a per-token API after utilization is high enough. The comparison must include idle capacity, engineering time, monitoring, and failover rather than GPU rent alone. If LLM cost is part of a broader infrastructure budget, compare it with the GPU Price Index and your own production token logs before switching architecture.
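A rough break-even check can be sketched as follows. Every number here is a hypothetical placeholder: the monthly fixed cost is meant to bundle GPU rent, engineering time, monitoring, and failover, and the peak throughput and API rate would come from your own measurements and the tracked pricing.

```python
def self_host_usd_per_m_tokens(monthly_fixed_usd: float,
                               peak_tokens_per_month: float,
                               utilization: float) -> float:
    """Effective self-hosted cost per million tokens at a given utilization (0-1]."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    served_tokens = peak_tokens_per_month * utilization
    return monthly_fixed_usd / served_tokens * 1_000_000


def break_even_utilization(monthly_fixed_usd: float,
                           peak_tokens_per_month: float,
                           api_usd_per_m: float) -> float:
    """Utilization at which self-hosting matches the API's per-token price."""
    api_cost_at_peak = peak_tokens_per_month / 1_000_000 * api_usd_per_m
    return monthly_fixed_usd / api_cost_at_peak


# Hypothetical: $6,000/month all-in, 10B tokens/month at full load, API at $2.00/M blended.
low_util = self_host_usd_per_m_tokens(6_000, 10e9, 0.20)
high_util = self_host_usd_per_m_tokens(6_000, 10e9, 0.50)
threshold = break_even_utilization(6_000, 10e9, 2.00)

print(f"20% utilized: ${low_util:.2f}/M, 50% utilized: ${high_util:.2f}/M")
print(f"break-even utilization: {threshold:.0%}")
```

Idle capacity is what moves the answer: the same hardware is more expensive than the API at low utilization and cheaper once utilization passes the break-even point.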

Model a real workload without stale price examples

Fixed dollar examples age as soon as a provider changes a rate. Use the presets above as reproducible workload definitions, then adjust volume and token counts to match your logs:

Customer support: 800 input, 200 output, 40% cached.
Long-form content: 1,500 input, 2,000 output, no cache assumed.
RAG / Q&A: 4,000 input, 400 output, 60% cached.

The result table recalculates per-call, daily, monthly, and annual totals from the current dataset. Treat it as a budget baseline, then compare it with one week of production token logs.
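The presets and the roll-up can be sketched as below. The three workload definitions come from the list above; the rates and the 1,000 calls per day are hypothetical placeholders. The 30-day month and 12-month year are assumptions that appear consistent with the figures in the result table, not a documented formula.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    input_tokens: int
    output_tokens: int
    cached_ratio: float


@dataclass(frozen=True)
class Rates:
    """USD per million tokens. Values used below are hypothetical."""
    input_usd_per_m: float
    cached_usd_per_m: float
    output_usd_per_m: float


PRESETS = {
    "customer_support": Workload(800, 200, 0.40),
    "long_form_content": Workload(1_500, 2_000, 0.0),
    "rag_qa": Workload(4_000, 400, 0.60),
}


def per_call_usd(w: Workload, r: Rates) -> float:
    """Cost of one call: fresh input + cached input + output."""
    cached = w.input_tokens * w.cached_ratio
    fresh = w.input_tokens - cached
    total = (fresh * r.input_usd_per_m
             + cached * r.cached_usd_per_m
             + w.output_tokens * r.output_usd_per_m)
    return total / 1_000_000


def rollup(per_call: float, calls_per_day: int, days_per_month: int = 30) -> dict[str, float]:
    """Scale a per-call cost to daily, monthly, and annual totals."""
    daily = per_call * calls_per_day
    monthly = daily * days_per_month
    return {"per_call": per_call, "daily": daily, "monthly": monthly, "annual": monthly * 12}


example_rates = Rates(input_usd_per_m=1.00, cached_usd_per_m=0.10, output_usd_per_m=4.00)
rag_totals = rollup(per_call_usd(PRESETS["rag_qa"], example_rates), calls_per_day=1_000)

for period, usd in rag_totals.items():
    print(f"{period}: ${usd:,.4f}")
```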

Methodology & sources

All pricing is sourced from each vendor's public pricing page and reviewed on a seven-day cadence. The last completed verification date is shown next to the calculator. Material changes are published in our changelog.

We do not track every model offered by every provider — only the models we judge to be relevant for production workload selection in 2026. If a model is missing that you think should be included, email us via the contact page.

Scope limitations

Honest disclosure: this tool focuses on the per-token cost of text-in / text-out workloads. It does not currently model:

Vision input pricing (most providers price images per-image or per-megapixel, not per-token).
Audio / speech input and output pricing (Whisper, Realtime API, Gemini audio).
Fine-tuning and training cost.
Embedding generation and vector-database cost.
Tool use / function call overhead (negligible for most workloads but can matter at extreme scale).

Frequently asked questions

How much does the OpenAI API cost per call?

It depends on the selected model, input tokens, output tokens, and cached-input share. Enter those values above to apply the currently tracked rate.

Which LLM is cheapest for high volume?

There is no universal cheapest model. The answer changes with workload shape, caching, and the minimum quality the task needs. The table ranks the selected models for the exact inputs above.

Does cached input really save money?

It can when a meaningful share of input is repeated. Discounts and eligibility rules vary by provider and model, so this calculator uses the tracked model-specific cached-input rate.

Is Claude API cheaper than OpenAI?

It depends on the exact models and workload. Compare current input, output, and cached-input rates, then test both candidates on representative production prompts before switching.

How much does it cost to run an AI chatbot for 1,000 users per day?

The total depends on messages per user, growing conversation history, response length, model choice, and caching. Model those inputs above; a single fixed monthly figure is not reliable across chatbot designs.
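The effect of growing conversation history can be sketched as below, assuming each turn resends the system prompt plus all earlier messages and no caching is applied. All token counts are hypothetical, and the function name is illustrative.

```python
def conversation_tokens(system_tokens: int, user_tokens: int,
                        reply_tokens: int, turns: int) -> tuple[int, int]:
    """Total (input, output) tokens billed over one multi-turn conversation."""
    total_input = 0
    total_output = 0
    history = 0  # tokens from earlier user messages and replies
    for _ in range(turns):
        total_input += system_tokens + history + user_tokens
        total_output += reply_tokens
        history += user_tokens + reply_tokens
    return total_input, total_output


# Hypothetical chat: 500-token system prompt, 50-token messages, 150-token replies.
conv_in, conv_out = conversation_tokens(500, 50, 150, turns=5)

# Naive estimate that ignores history: five independent single-turn calls.
naive_in = 5 * (500 + 50)

print(f"with history: {conv_in} input tokens, {conv_out} output tokens")
print(f"naive input estimate: {naive_in} tokens")
```

Input tokens grow roughly with the square of the turn count, which is why a per-message estimate tends to understate the bill for long conversations.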

What is a token in GPT-4 or Claude?

One token ≈ 0.75 English words or 4 characters of typical text. Rough rules: 1,000 words ≈ 1,333 tokens; 1-page PDF ≈ 400–700 tokens; 100-line Python file ≈ 500–800 tokens. Multilingual text tokenizes differently: Japanese, Chinese, and Korean typically need more tokens than English to express the same content, which raises costs for non-English workloads.
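These rules of thumb can be turned into a quick estimator. This is only a heuristic for English text, not a tokenizer; for billing-grade counts, use the provider's own tokenizer. The function names are illustrative.

```python
WORDS_PER_TOKEN = 0.75  # rule of thumb for English text
CHARS_PER_TOKEN = 4     # rule of thumb for typical text


def tokens_from_words(word_count: int) -> int:
    """Estimate tokens from an English word count."""
    return round(word_count / WORDS_PER_TOKEN)


def tokens_from_chars(char_count: int) -> int:
    """Estimate tokens from a character count."""
    return round(char_count / CHARS_PER_TOKEN)


print(tokens_from_words(1_000))   # 1333
print(tokens_from_chars(4_000))   # 1000
```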

How do I reduce my OpenAI or Anthropic API bill?

The five most effective tactics:

Switch to a smaller model for tasks that don't require frontier quality (most don't).
Implement prompt caching for repeated system prompts or stable context blocks.
Use the batch API for non-real-time workloads: a 50% discount in exchange for results within a 24-hour window.
Trim output tokens: prompt for concise answers and set a strict max_tokens limit.
Cache LLM responses at the application layer: an exact-match cache for repeated identical queries, or a semantic cache for near-duplicate ones.
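An application-layer cache for repeated identical queries can be sketched as below. `call_model` is a hypothetical stand-in for whatever client function your application uses, the store is an in-memory dict with no expiry, and matching on similar (rather than identical) queries would need an embedding lookup that is not shown.

```python
import hashlib
from typing import Callable


class ExactMatchCache:
    """Return a stored response when the same (model, prompt) pair is seen again."""

    def __init__(self, call_model: Callable[[str, str], str]):
        self._call_model = call_model
        self._store: dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(model: str, prompt: str) -> str:
        # Hash model and prompt together so different models never share entries.
        return hashlib.sha256(f"{model}\x00{prompt}".encode("utf-8")).hexdigest()

    def complete(self, model: str, prompt: str) -> str:
        key = self._key(model, prompt)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        response = self._call_model(model, prompt)  # the only billed path
        self._store[key] = response
        return response


def fake_model(model: str, prompt: str) -> str:
    """Stand-in for a real API call, used only to make the sketch runnable."""
    return f"[{model}] answer to: {prompt}"


cache = ExactMatchCache(fake_model)
first = cache.complete("example-model", "What is your refund policy?")
second = cache.complete("example-model", "What is your refund policy?")
print(cache.hits, cache.misses)
```

A production version would also need an eviction policy and a rule for when cached answers go stale.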

Should I switch from GPT-5.4 to a cheaper model?

Test on your actual workload. Run 100–200 representative examples through both models and have a human (or a stricter evaluator model) grade the output. If quality is equivalent at your bar, switch. If not, the cost gap may still be justified — but you should have a number for the tradeoff, not a hunch.
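One way to put a number on the tradeoff is cost per acceptable output: the per-call cost divided by the share of graded examples that meet your bar. The grades and per-call costs below are made-up placeholders to show the arithmetic, not evaluation results for any model.

```python
def cost_per_accepted_output(per_call_usd: float, grades: list[bool]) -> float:
    """Per-call cost divided by pass rate; infinite if nothing passes."""
    if not grades:
        raise ValueError("grades must not be empty")
    pass_rate = sum(grades) / len(grades)
    if pass_rate == 0:
        return float("inf")
    return per_call_usd / pass_rate


# Placeholder grading of 200 examples per model (True = meets the quality bar).
larger_model_grades = [True] * 190 + [False] * 10
smaller_model_grades = [True] * 170 + [False] * 30

larger = cost_per_accepted_output(0.010, larger_model_grades)
smaller = cost_per_accepted_output(0.003, smaller_model_grades)

print(f"larger model:  ${larger:.5f} per accepted output")
print(f"smaller model: ${smaller:.5f} per accepted output")
```

This metric treats a failed output as simply wasted spend; if failures carry a downstream cost (escalations, retries, lost users), that cost belongs in the comparison as well.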

How accurate is the cost estimate?

Within 1–3% of actual billing for the workloads we've validated (cross-checked against real OpenAI and Anthropic invoices). The main source of error is token counting — vendor tokenizers produce slightly different counts for the same input. The calculator assumes typical English text token density.

Last updated: 2026-06-28. Found a pricing error? Email support@smartcloudsuites.com and we'll usually fix it within 48 hours.