RAG Cost Calculator — Embeddings + Vector DB + LLM / Month

Last verified: 2026-07-15. Source: official provider pricing page.

Retrieval-augmented generation has three bills: embedding your docs once, the vector DB every month, and the LLM on every query. Set your numbers and see the one-time cost, the monthly cost, and the cost per question.

Where RAG money actually goes

People assume embeddings are the expensive part of RAG. They're usually the cheapest. Embedding 10,000 short documents is often a few cents, paid once. The real, recurring cost is the LLM call on every query, and specifically the retrieved context you feed it. Pull 5 chunks of 400 tokens and you've added 2,000 input tokens to every single question, on top of the question itself. Multiply by your query volume and that's most of your bill.

The big levers, in order: retrieve fewer or smaller chunks (top-k and chunk size), use a cheaper answer model for routine queries, and cap answer length. Re-embedding only changed documents (not the whole corpus) keeps the one-time cost from recurring.

Building the rest of the app? Price a support bot with the chatbot cost calculator, a whole feature with the AI app cost estimator, or the full backend with the API stack cost calculator. Just sizing the storage layer? See the vector database cost calculator for Pinecone vs Qdrant vs pgvector.
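One way to re-embed only changed documents is to store a content hash next to each document's vectors and compare it on the next run. The sketch below is a minimal illustration under that assumption; the function names (content_hash, docs_to_reembed) and the in-memory dictionaries are hypothetical stand-ins for whatever your pipeline and metadata store actually use.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def docs_to_reembed(docs: dict[str, str], stored_hashes: dict[str, str]) -> list[str]:
    """Return IDs of documents that are new or whose text changed.

    docs maps document ID -> current text.
    stored_hashes maps document ID -> hash recorded at the last embedding run.
    """
    return [
        doc_id
        for doc_id, text in docs.items()
        if stored_hashes.get(doc_id) != content_hash(text)
    ]


# Hashes saved after the previous embedding run (illustrative data).
stored = {"a": content_hash("alpha"), "b": content_hash("beta")}

# Current corpus: "a" is unchanged, "b" was edited, "c" is new.
current = {"a": "alpha", "b": "beta v2", "c": "gamma"}

changed = docs_to_reembed(current, stored)
print(changed)
```

Only the IDs in changed need to go back through the embedding model, so the embedding bill on later runs tracks how much of the corpus was edited, not its total size.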

How this calculator works

The RAG Cost Calculator estimates the cost of running a retrieval-augmented generation pipeline by combining three separate bills. First, it computes a one-time embedding cost from your document count, average tokens per document, and chunk size, priced by the selected embedding model. Second, it accounts for the monthly cost of the vector database that stores those chunk embeddings. Third, and usually the largest, it computes the recurring per-query LLM cost: for each query, the input tokens of the retrieved chunks (top-k times chunk size) priced at the answer model's input rate, plus the generated answer length priced at its output rate, all multiplied by your monthly query volume. Query volume, top-k, and the answer model are the main drivers.
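The three bills described above can be sketched in a few lines of Python. This is an illustration of the arithmetic, not the calculator's actual code: every price below is a made-up placeholder, the field names are hypothetical, the question_tokens term is an added assumption, and chunks are assumed not to overlap (so chunk size changes how many vectors you store, not how many tokens you embed).

```python
import math
from dataclasses import dataclass


@dataclass
class RagInputs:
    num_docs: int
    tokens_per_doc: int
    chunk_tokens: int
    embed_price_per_mtok: float       # USD per 1M embedding tokens (placeholder)
    vector_db_monthly: float          # USD per month (placeholder)
    queries_per_month: int
    top_k: int
    question_tokens: int              # assumption: the user's question is also billed as input
    answer_tokens: int
    llm_input_price_per_mtok: float   # USD per 1M input tokens (placeholder)
    llm_output_price_per_mtok: float  # USD per 1M output tokens (placeholder)


def num_chunks(x: RagInputs) -> int:
    """Vectors to store, assuming non-overlapping chunks."""
    return x.num_docs * math.ceil(x.tokens_per_doc / x.chunk_tokens)


def one_time_embedding_cost(x: RagInputs) -> float:
    """Bill 1: embed every document once."""
    return x.num_docs * x.tokens_per_doc / 1_000_000 * x.embed_price_per_mtok


def cost_per_query(x: RagInputs) -> float:
    """Bill 3, per question: retrieved context + question in, answer out."""
    input_tokens = x.top_k * x.chunk_tokens + x.question_tokens
    return (
        input_tokens * x.llm_input_price_per_mtok
        + x.answer_tokens * x.llm_output_price_per_mtok
    ) / 1_000_000


def monthly_cost(x: RagInputs) -> float:
    """Bills 2 and 3 together: vector DB plus all queries for the month."""
    return x.vector_db_monthly + x.queries_per_month * cost_per_query(x)


example = RagInputs(
    num_docs=10_000, tokens_per_doc=200, chunk_tokens=400,
    embed_price_per_mtok=0.02, vector_db_monthly=25.0,
    queries_per_month=10_000, top_k=5, question_tokens=50, answer_tokens=300,
    llm_input_price_per_mtok=1.00, llm_output_price_per_mtok=4.00,
)

print(f"chunks stored:      {num_chunks(example)}")
print(f"one-time embedding: ${one_time_embedding_cost(example):.2f}")
print(f"cost per question:  ${cost_per_query(example):.5f}")
print(f"monthly total:      ${monthly_cost(example):.2f}")
```

With these placeholder numbers the embedding bill is a few cents while the monthly total is tens of dollars, which matches the pattern described above: the query path, not the embedding step, carries the cost.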

Watch the top-k retrieval setting closely. Each retrieved chunk is added to every query's prompt, so raising top-k or chunk size directly inflates input tokens on every query for the whole month. Retrieving more context can improve answer quality, but its cost scales linearly with both top-k and query volume. A practical tip: retrieve fewer, tighter chunks and reserve a larger, pricier answer model only where accuracy justifies it. Embedding is a fixed upfront cost, whereas the per-query LLM bill grows with usage, so optimize the query path first.
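To see the linear scaling concretely, the sketch below sweeps top-k while holding everything else fixed. The function name monthly_context_cost and the input price of 1.00 USD per million tokens are illustrative assumptions, not figures from any provider.

```python
def monthly_context_cost(
    top_k: int,
    chunk_tokens: int,
    queries_per_month: int,
    input_price_per_mtok: float,
) -> float:
    """Monthly cost of the retrieved context alone (no question, no answer)."""
    context_tokens_per_query = top_k * chunk_tokens
    return context_tokens_per_query * queries_per_month * input_price_per_mtok / 1_000_000


# Hold chunk size, volume, and price fixed; vary only top-k.
for k in (3, 5, 10):
    cost = monthly_context_cost(
        top_k=k, chunk_tokens=400, queries_per_month=10_000, input_price_per_mtok=1.00
    )
    print(f"top_k={k:>2}: ${cost:.2f} per month for retrieved context")
```

Doubling top-k from 5 to 10 doubles the context bill, and halving chunk size has the same effect as halving top-k, so the two settings are worth tuning together.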

Frequently asked questions

What makes up the cost of a RAG system?

Three parts. One-time: embedding all your documents into vectors. Ongoing monthly: the vector database that stores and searches those vectors, plus the per-query LLM cost — which includes the retrieved chunks you stuff into the prompt as context. For most RAG apps the per-query LLM cost dominates, because retrieved context can be far larger than the user's actual question.

Why is retrieved context the biggest RAG cost?

Every query sends the top-k retrieved chunks to the model as input tokens. Retrieve 5 chunks of 400 tokens and you are paying for 2,000 input tokens per query on top of the question itself. Retrieving fewer or smaller chunks, or using a cheaper model, cuts the bill far more than changing embedding models does.
