What this calculator actually does
This tool takes four inputs — input tokens per call, output tokens per call, calls per day, and cached input ratio — and returns the true monthly cost of running that workload on every major LLM API we track. It is built for the operator who is about to ship a feature into production and needs an honest number, not a vendor estimator.
Most vendor calculators show you the cost of a single prompt and stop there. The interesting question for a real product is what the cost looks like at your monthly volume, with your input shape, after prompt caching kicks in. That is what this calculator returns.
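The calculation described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the calculator's actual implementation: the function name, the 30-day month, and the 10% cached-rate default are all assumed, and the rates in the usage line are the GPT-4o prices quoted in the FAQ.

```python
def monthly_cost(input_tokens, output_tokens, calls_per_day, cached_ratio,
                 input_rate, output_rate, cached_discount=0.10, days=30):
    """Estimate monthly spend in dollars for one workload on one model.

    input_rate / output_rate are dollars per million tokens.
    cached_discount is the fraction of the input rate billed for cached
    tokens (0.10 is an assumed default; check each vendor's pricing page).
    """
    if not 0.0 <= cached_ratio <= 1.0:
        raise ValueError("cached_ratio must be between 0 and 1")
    calls = calls_per_day * days
    fresh_in = input_tokens * (1 - cached_ratio)
    cached_in = input_tokens * cached_ratio
    per_call = (
        fresh_in * input_rate
        + cached_in * input_rate * cached_discount
        + output_tokens * output_rate
    ) / 1_000_000
    return calls * per_call


# 800 input / 200 output tokens, 1,000 calls per day, no caching,
# at $2.50 input and $10 output per million tokens.
print(round(monthly_cost(800, 200, 1000, 0.0, 2.50, 10.00), 2))
```

With those inputs the sketch gives about $120 per month. A 30-day month is a simplification; a calendar-average month (about 30.4 days) shifts the result by roughly 1%.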
How LLM pricing actually works
Every major API charges by the token, a unit roughly equal to 0.75 English words or 4 characters of code. A short conversational reply might be 50–150 tokens; a long technical answer might be 1,500–3,000.
Vendors price input and output tokens separately, and output is almost always more expensive, typically 3–5× the input rate. This is because generating output is more compute-intensive than reading input. For workloads with long prompts and short outputs (e.g., classification or Q&A), input cost dominates. For workloads with short prompts and long outputs (e.g., long-form generation), output cost dominates.
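To see which side of the bill dominates for a given workload shape, a small sketch helps. The two workload shapes below are illustrative assumptions, not measured data, and the rates are the GPT-4o prices quoted in the FAQ ($2.50 input, $10 output per million tokens).

```python
def cost_split(input_tokens, output_tokens, input_rate, output_rate):
    """Return (input_cost, output_cost) in dollars for a single call.

    Rates are dollars per million tokens.
    """
    input_cost = input_tokens * input_rate / 1_000_000
    output_cost = output_tokens * output_rate / 1_000_000
    return input_cost, output_cost


# Illustrative workload shapes (assumed, not measured).
for name, n_in, n_out in [("classification", 2000, 50),
                          ("long-form generation", 200, 2000)]:
    cost_in, cost_out = cost_split(n_in, n_out, 2.50, 10.00)
    share = cost_in / (cost_in + cost_out)
    print(f"{name}: input is {share:.0%} of the per-call cost")
```

Even with output priced at 4× input, the long-prompt workload is overwhelmingly an input bill, which is why caching matters more there than trimming output.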
A frontier-class call (Claude Opus 4.7 or OpenAI o1) can cost 100× more than the same call on a high-efficiency model (DeepSeek V3 or Gemini 2.0 Flash). The question is whether the quality difference is worth it for your use case.
The four levers that change cost dramatically
1. Choice of model
The single biggest cost lever is model selection. For most production workloads in 2026, you have three reasonable tiers:
- Frontier reasoning (GPT-5, Claude Opus 4.7, o1) — $15–60+ per million output tokens. Reserved for the hardest tasks: complex code reasoning, multi-step planning, research-grade analysis.
- Balanced general (GPT-4o, Claude Sonnet 4.6, Gemini 2.0 Pro) — $5–15 per million output. Most production workloads belong here.
- High-efficiency (GPT-4o mini, Claude Haiku 4.5, Gemini 2.0 Flash, DeepSeek V3) — $0.30–4 per million output. Customer support, classification, structured extraction, routing.
A common mistake is to use a frontier model where a balanced or efficient model would produce equivalent output. Test on your actual data before committing.
2. Prompt caching
Anthropic, OpenAI, Google, and DeepSeek all support some form of prompt caching, where repeated input tokens (typically a long system prompt or a stable knowledge base) are billed at a fraction of the base rate. The fraction is often around 10%, but it varies by vendor and model, so check the pricing page for the model you use.
If your workload has any of the following patterns, caching is a huge cost lever:
- A long system prompt you send on every call.
- A fixed knowledge base you send before each user query.
- A few-shot examples block.
- Multi-turn conversation where the early turns are stable.
The calculator above lets you set a "cached input ratio." If 60% of your input tokens are repeated, set it to 0.60 — the math then applies the cached rate to that portion.
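The cached-ratio math can be sketched as a blended input rate. This is an illustration, not the calculator's code: the function name and the $2.50 base rate are hypothetical, and the 10% cached rate is an assumption that differs between vendors.

```python
def blended_input_rate(base_rate, cached_ratio, cached_discount=0.10):
    """Effective input price per million tokens after caching.

    cached_discount is the fraction of base_rate charged for cached tokens;
    0.10 is an assumption and varies by vendor and model.
    """
    if not 0.0 <= cached_ratio <= 1.0:
        raise ValueError("cached_ratio must be between 0 and 1")
    fresh = (1 - cached_ratio) * base_rate
    cached = cached_ratio * base_rate * cached_discount
    return fresh + cached


base = 2.50  # hypothetical base input rate, dollars per million tokens
effective = blended_input_rate(base, cached_ratio=0.60)
print(f"effective input rate: ${effective:.2f} per million tokens")
print(f"input-side saving: {1 - effective / base:.0%}")
```

The sketch ignores any surcharge for writing to the cache, which some vendors bill separately, so treat the result as an upper bound on the saving.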
3. Batch APIs
For workloads that don't require real-time response (overnight processing, content generation pipelines, evaluation runs), most providers offer a batch API at 50% of base price. OpenAI's batch API has a 24-hour SLA; Anthropic and Gemini are similar. If your workload tolerates async, batch is the cheapest path that doesn't involve a model change.
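A quick way to size the batch lever is to estimate what share of spend can tolerate async delivery. The sketch below assumes the 50% discount described above applies uniformly; the function name and the dollar figures are hypothetical.

```python
def cost_with_batch(monthly_cost, batch_share, batch_discount=0.50):
    """Monthly cost after routing a share of traffic through a batch API.

    batch_share is the fraction of spend that can tolerate async delivery.
    batch_discount=0.50 reflects the 50% figure above; confirm per vendor.
    """
    if not 0.0 <= batch_share <= 1.0:
        raise ValueError("batch_share must be between 0 and 1")
    return monthly_cost * (1 - batch_share * batch_discount)


# Hypothetical: a $1,000/month workload where 70% of calls can run overnight.
print(round(cost_with_batch(1000.0, 0.70), 2))
```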
4. Self-hosting vs API
For sustained high-volume workloads on open models (Llama 3.1 405B, Mistral Large), self-hosting on rented or owned GPUs can beat the per-token API at a high enough volume. Our self-host vs API breakeven calculator (in build) models this directly. For most teams under 100M tokens/month, the API is cheaper after accounting for ops overhead. If LLM cost is part of a broader SaaS audit, the SaaS subscription audit puts it in context alongside your other tool spend.
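The breakeven logic can be sketched as a fixed cost divided by the per-token margin. Every number below is hypothetical, and the model is deliberately simple: it treats self-hosting as a flat monthly cost plus an optional marginal rate, and ignores utilization, latency, and quality differences.

```python
def breakeven_tokens_per_month(fixed_monthly_cost, api_rate_per_million,
                               self_host_rate_per_million=0.0):
    """Monthly token volume above which self-hosting is cheaper than the API.

    fixed_monthly_cost: GPUs plus ops overhead, in dollars per month.
    api_rate_per_million: blended API price per million tokens.
    self_host_rate_per_million: marginal self-hosted cost per million tokens;
    0 treats the cluster as a pure fixed cost.
    """
    margin = api_rate_per_million - self_host_rate_per_million
    if margin <= 0:
        return float("inf")  # the API is never more expensive per token
    return fixed_monthly_cost / margin * 1_000_000


# Hypothetical inputs: $12,000/month for hardware and on-call time,
# against a blended API price of $3 per million tokens.
tokens = breakeven_tokens_per_month(12_000, 3.00)
print(f"breakeven at about {tokens / 1e9:.1f}B tokens per month")
```

Under these assumed inputs the breakeven sits in the billions of tokens per month, which is consistent with the API winning for teams well below that volume.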
Real-world cost examples
To ground the numbers, here are three workloads we've modeled from real operator data, all at 1,000 calls per day:
A. Customer-support chat (800-token input, 200-token output)
- GPT-4o mini: $5/month
- Claude Haiku 4.5: $45/month
- Gemini 2.0 Flash: $3.6/month
- GPT-4o: $120/month
- Claude Sonnet 4.6: $162/month
For this workload, Gemini Flash and GPT-4o mini are roughly 24–45× cheaper than GPT-4o or Claude Sonnet. The right choice depends on whether your customer-support replies need frontier quality (usually they don't).
B. Long-form content generation (1,500-token input, 2,000-token output)
- GPT-4o: $721/month
- Claude Sonnet 4.6: $1,036/month
- DeepSeek V3: $78/month
- Claude Opus 4.7: $4,725/month
The output-heavy nature of this workload makes the high-efficiency tier dramatically cheaper. Test quality carefully: for some creative work the stronger model is worth paying 9× (GPT-4o) to 60× (Claude Opus 4.7) more than DeepSeek V3; for structured content it usually isn't.
C. RAG with 60% cached context (4,000-token input, 400-token output)
- Claude Sonnet 4.6 (no cache): $432/month
- Claude Sonnet 4.6 (60% cache): $216/month
- Gemini 2.0 Flash (no cache): $13/month
Prompt caching cuts Sonnet's cost roughly in half. But Gemini Flash is still ~16× cheaper than even cached Sonnet. The question is whether the answer quality justifies the gap.
Methodology & sources
All pricing is sourced from each vendor's public pricing page and re-verified every Monday. The current verification date is shown next to the calculator. If a vendor changes pricing mid-week, we publish the change in our changelog.
Sources used in this calculator:
- https://openai.com/api/pricing/
- https://www.anthropic.com/pricing#api
- https://ai.google.dev/pricing
- https://together.ai/pricing
- https://mistral.ai/technology/#pricing
- https://api-docs.deepseek.com/quick_start/pricing
- https://cohere.com/pricing
- https://www.ai21.com/jamba
- https://x.ai/api
- https://docs.perplexity.ai/guides/pricing
- https://fireworks.ai/pricing
- https://groq.com/pricing/
We do not track every model offered by every provider — only the models we judge to be relevant for production workload selection in 2026. If a model is missing that you think should be included, email us via the contact page.
What this calculator does not yet model
Honest disclosure: this tool focuses on the per-token cost of text-in / text-out workloads. It does not currently model:
- Vision input pricing (most providers price images per-image or per-megapixel, not per-token).
- Audio / speech input and output pricing (Whisper, Realtime API, Gemini audio).
- Fine-tuning cost (covered in a separate tool, in build).
- Embedding generation cost (covered in our embedding & RAG cost estimator, planned).
- Tool use / function call overhead (negligible for most workloads but can matter at extreme scale).
Frequently asked questions
How much does the OpenAI API cost per call?
It depends on the model and the size of the input and output. GPT-4o is $2.50 per million input tokens and $10 per million output tokens. A typical 1,500-token input with 500-token output costs about $0.009 per call. GPT-4o mini is roughly 17× cheaper.
Which LLM is cheapest for high volume?
For most workloads in mid-2026, DeepSeek V3 and Gemini 2.0 Flash are the cheapest options we track, often 10–30× cheaper than GPT-4o or Claude Sonnet, with comparable quality on many tasks.
Does cached input really save money?
Yes — significantly. Anthropic and DeepSeek charge ~10% of base rate for cached input. For workloads where 50%+ of the input is repeated (system prompts, knowledge bases, long context windows), cached input alone can cut total cost by 30–45%.
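Where a range like 30–45% comes from can be shown with a short sketch: the total saving is the input share of the bill, times the cached ratio, times the discount on cached tokens. The function name and the workload figures are hypothetical, and the 10% cached rate is an assumption.

```python
def total_saving_from_cache(input_share, cached_ratio, cached_discount=0.10):
    """Fraction of the total bill removed by prompt caching.

    input_share: fraction of the uncached bill that comes from input tokens.
    cached_ratio: fraction of input tokens served from cache.
    cached_discount: fraction of the base rate charged for cached tokens
    (0.10 matches the ~10% figure above; treat it as an assumption).
    """
    return input_share * cached_ratio * (1 - cached_discount)


# Hypothetical workloads where input is 70-85% of the uncached bill.
for input_share, cached_ratio in [(0.70, 0.50), (0.85, 0.60)]:
    saving = total_saving_from_cache(input_share, cached_ratio)
    print(f"input share {input_share:.0%}, cached {cached_ratio:.0%}: "
          f"saves {saving:.0%} of total cost")
```

The saving shrinks quickly for output-heavy workloads, because caching only touches the input side of the bill.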
Is Claude API cheaper than OpenAI?
It depends on the tier. Claude Haiku 4.5 is priced competitively with GPT-4o mini for the lightweight tier. Claude Sonnet 4.6 and GPT-4o are broadly similar in the balanced tier. Claude Opus 4.7 and OpenAI's frontier models (GPT-5, o1) are both frontier-priced. For most cost-sensitive workloads, open-weight alternatives (DeepSeek V3, Llama 3 via Together AI) are cheaper than both.
How much does it cost to run an AI chatbot for 1,000 users per day?
Assuming each user sends 3 messages averaging 200-token inputs with 150-token responses, you're processing 600,000 input tokens and 450,000 output tokens per day. On GPT-4o ($2.50 input and $10 output per million tokens), that is about $6/day or ~$180/month. GPT-4o mini, at roughly 17× cheaper, runs about $11/month. On Gemini 2.0 Flash, expect ~$6/month. The roughly 30× cost gap between the cheapest and most expensive option is the case for careful model selection.
What is a token in GPT-4 or Claude?
One token ≈ 0.75 English words or 4 characters of typical text. Rough rules: 1,000 words ≈ 1,333 tokens; 1-page PDF ≈ 400–700 tokens; 100-line Python file ≈ 500–800 tokens. Multilingual text tokenizes differently — Japanese, Chinese, and Korean typically use more tokens per word than English, which raises costs for non-English workloads.
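These rules of thumb translate directly into a rough estimator. The sketch below is a heuristic for English text only, not a tokenizer; the function names are hypothetical, and a vendor's own tokenizer is the right tool for billing-grade counts.

```python
def tokens_from_words(word_count):
    """Estimate tokens from an English word count (1 token per 0.75 words)."""
    return round(word_count / 0.75)


def tokens_from_chars(char_count):
    """Estimate tokens from a character count (1 token per 4 characters)."""
    return round(char_count / 4)


sample = "Prompt caching bills repeated input tokens at a reduced rate."
print(tokens_from_words(len(sample.split())), tokens_from_chars(len(sample)))
```

The two estimates rarely agree exactly; the gap between them is a reasonable reminder of how approximate the rules are.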
How do I reduce my OpenAI or Anthropic API bill?
The five most effective tactics:
- Switch to a smaller model for tasks that don't require frontier quality (most don't).
- Implement prompt caching for repeated system prompts or stable context blocks.
- Use the batch API for non-real-time workloads — 50% discount with a 24-hour SLA.
- Trim output tokens: prompt for concise answers and set a strict max_tokens limit.
- Cache LLM responses at the application layer for repeated identical queries (semantic cache).
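The last two tactics can be sketched together. This is a minimal in-memory example, assuming an exact-match cache rather than a true semantic cache (which would match on embedding similarity). The `call_model` callable and `fake_model` stand-in are hypothetical placeholders for a real API client, and the 256-token default is an arbitrary illustration.

```python
import hashlib
import json


class ResponseCache:
    """In-memory exact-match cache keyed on model, prompt, and max_tokens."""

    def __init__(self, call_model):
        # call_model is a hypothetical callable: (model, prompt, max_tokens) -> str.
        self._call_model = call_model
        self._store = {}
        self.hits = 0
        self.misses = 0

    def _key(self, model, prompt, max_tokens):
        payload = json.dumps([model, prompt, max_tokens])
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def complete(self, model, prompt, max_tokens=256):
        key = self._key(model, prompt, max_tokens)
        if key in self._store:
            self.hits += 1  # served locally, no API spend
            return self._store[key]
        self.misses += 1
        reply = self._call_model(model, prompt, max_tokens)
        self._store[key] = reply
        return reply


def fake_model(model, prompt, max_tokens):
    # Stand-in for a real API client so the sketch runs offline.
    return f"[{model}] reply to: {prompt}"


cache = ResponseCache(fake_model)
cache.complete("example-model", "What is your refund policy?")
cache.complete("example-model", "What is your refund policy?")
print(cache.hits, cache.misses)
```

A production version would add expiry and a shared store, and would need care around prompts that contain user-specific data.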
Should I switch from GPT-4o to a cheaper model?
Test on your actual workload. Run 100–200 representative examples through both models and have a human (or a stricter evaluator model) grade the output. If quality is equivalent at your bar, switch. If not, the cost gap may still be justified — but you should have a number for the tradeoff, not a hunch.
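One way to turn that comparison into a number is to record a pass/fail grade per example for each model and set the quality drop against the monthly saving. The sketch below assumes the grading has already been done; the function name, the grades, and the dollar figures are all hypothetical.

```python
def compare_models(grades_current, grades_candidate, cost_current, cost_candidate):
    """Summarize a side-by-side evaluation.

    grades_*: lists of booleans, one per example, True if the output met
    your quality bar. cost_*: projected monthly cost in dollars.
    """
    if len(grades_current) != len(grades_candidate) or not grades_current:
        raise ValueError("need the same non-empty example set for both models")
    n = len(grades_current)
    pass_current = sum(grades_current) / n
    pass_candidate = sum(grades_candidate) / n
    return {
        "pass_rate_current": pass_current,
        "pass_rate_candidate": pass_candidate,
        "quality_drop": pass_current - pass_candidate,
        "monthly_saving": cost_current - cost_candidate,
    }


# Hypothetical grades for 150 examples: 141 pass on the current model,
# 135 pass on the cheaper candidate.
current = [True] * 141 + [False] * 9
candidate = [True] * 135 + [False] * 15
print(compare_models(current, candidate, cost_current=120.0, cost_candidate=5.0))
```

Whether a drop of a few points is acceptable depends on what a failed response costs you, which is a product decision rather than something the arithmetic can settle.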
How accurate is the cost estimate?
Within 1–3% of actual billing for the workloads we've validated (cross-checked against real OpenAI and Anthropic invoices). The main source of error is token counting — vendor tokenizers produce slightly different counts for the same input. The calculator assumes typical English text token density.
Last updated: 2026-05-19. Found a pricing error? Email support@smartcloudsuites.com and we'll usually fix within 48 hours.