Every model, ranked by cost with caching
Same workload, all models side by side. "Cached read" is the price of a reused token; "write" is the one-time cost to store it.
| Model | Input | Cached read | No cache /mo | With cache /mo | Saved |
|---|
What prompt caching is
Most LLM apps send the same big chunk of text on every call — a long system prompt, tool definitions, a style guide, or retrieved documents in a RAG pipeline. Normally you pay full input price for those tokens every single time. Prompt caching lets the provider store that fixed prefix after the first call and charge a heavily discounted rate to reuse it. On Claude a cached token reads at roughly 10% of the normal input price; on GPT and Gemini cached input is about 25–50% cheaper. The only requirement is that the cached part stays identical and at the front of the prompt — put the variable part (the user's question) last.
How the math works
The calculator splits each request into two parts. The reused prompt tokens can be cached: on a cache hit they bill at the cheap read rate, on a miss they bill at the (slightly more expensive) write rate to refresh the cache. The fresh tokens — the user's actual input that's different every time — always pay the full input price and can't be cached. Your cache hit rate decides the mix: caches expire after a few minutes of inactivity (Anthropic's default TTL is ~5 minutes), so steady high-traffic apps keep the cache warm and hit 90%+, while bursty or low-volume apps refresh more often and hit less. Try dropping the hit rate to 50% and watch the savings shrink — caching mainly rewards volume and a stable prefix.
When caching is worth it (and when it isn't)
Caching pays off hardest when a large, unchanging context repeats across many calls: RAG chatbots re-sending the same documents, agents replaying long tool definitions and histories, or any product with a heavy system prompt. In those cases it routinely cuts input cost by 50–90%. It does nothing for one-off prompts that never repeat, and little if your reused prefix is tiny. It also doesn't touch output token cost — caching only discounts input. To model the full picture including output, use the AI app cost estimator or the per-model Claude / GPT-4o / Gemini calculators.
FAQ
How much does prompt caching actually save?
If 80–90% of your input is a fixed prompt or context that repeats, caching commonly cuts total input cost by 50–90%. Savings scale with volume and how stable the reused prefix is, and are near zero for prompts that never repeat.
Does it cost extra to write the cache?
On Claude, writing the cache costs ~25% more than a normal input token once, but each later read is ~90% cheaper — it pays for itself after about two hits. OpenAI caching is automatic with no write surcharge. Gemini adds a small per-hour storage fee for the cached content.
What is a cache hit rate?
The share of requests that find your prompt already stored. Caches expire after a few minutes of inactivity, so high-traffic apps keep them warm (90%+) while bursty or low-volume apps refresh more often and hit less.
Does caching reduce output token cost?
No. Prompt caching only discounts the input (prompt) side. Output / completion tokens are always billed at the normal rate, so a chatty, long-answer app saves less overall than a heavy-context, short-answer one.
Related tools & guides
AI app cost estimator · RAG cost calculator · Claude cost calculator · GPT-4o cost calculator · OpenAI vs Claude vs Gemini pricing · the API costs that double your bill