RAG Pipeline Cost Annotations
About This MicroSim
A horizontal RAG pipeline (query → embed → vector search → rerank → prune → inject → generate) with live cost annotations at every stage. The injected-token count and total per-query cost recompute on every slider move so the highest-leverage stages — context injection and main-LLM generation — are visually obvious.
Toggle the reranker or pruning off to see the cost penalty (typically 4–10× more injected tokens). The "$" annotations on each node show where the bill actually accumulates: vector search and embedding are nearly free, the reranker costs a fraction of a cent per query, and LLM generation dominates.
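The cost accumulation described above can be made concrete with a toy per-query cost model. Every price, token count, and the pruning factor below is an illustrative assumption, not the MicroSim's exact internals; with these particular values the default configuration happens to land near the $0.014/query figure quoted later.

```python
# Sketch of a per-query RAG cost model. All constants are assumptions
# chosen for illustration only.
CHUNK_TOKENS = 500            # assumed tokens per retrieved chunk
QUERY_TOKENS = 50             # assumed tokens in the user query
OUTPUT_TOKENS = 500           # assumed generation length
PRICE_EMBED = 0.02 / 1e6      # $/token, small embedding model (assumed)
PRICE_RERANK = 0.001          # $/query, cross-encoder reranker (assumed)
PRICE_LLM_IN = 3.00 / 1e6     # $/input token, main LLM (assumed)
PRICE_LLM_OUT = 15.00 / 1e6   # $/output token, main LLM (assumed)

def query_cost(k, n, reranker_on=True, pruning_on=True):
    """Per-stage dollar costs for one query through the pipeline."""
    embed = QUERY_TOKENS * PRICE_EMBED
    rerank = PRICE_RERANK if reranker_on else 0.0
    chunks = n if reranker_on else k       # reranker narrows K candidates to N
    injected = chunks * CHUNK_TOKENS
    if pruning_on:
        injected = int(injected * 0.7)     # assume pruning trims ~30% of tokens
    llm = (QUERY_TOKENS + injected) * PRICE_LLM_IN + OUTPUT_TOKENS * PRICE_LLM_OUT
    return {"embed": embed, "rerank": rerank, "llm": llm,
            "total": embed + rerank + llm}

print(query_cost(k=20, n=5))                      # defaults, both toggles on
print(query_cost(k=20, n=5, reranker_on=False))   # all K chunks get injected
```

Whatever the exact prices, the shape is the same: embedding is microscopic, the reranker is a small fixed fee, and the LLM line dominates the total.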
How to Use
- Start at defaults (K=20, N=5, reranker on, pruning on). Read the per-stage cost and the total. The total is roughly $0.014/query — typical for production RAG.
- Sweep K from 5 to 50. Watch the candidate token count grow while the injected token count stays roughly constant (the reranker still narrows the candidates down to N=5).
- Sweep N from 1 to 20. This is the slider that actually changes injected tokens — and therefore the LLM input cost.
- Disable reranker. The full K chunks now flow through to context injection. Watch total cost roughly quadruple at K=20.
- Disable pruning (with reranker still on). Less dramatic but still measurable. Pruning is the secondary tuning lever.
- Find your team's sweet spot. The default settings produce ~$0.014/query. Could your team accept a higher K with a tighter reranker N, trading vector-search cost for LLM cost?
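The two sweeps above have very different effects because, with the reranker on, injected tokens depend only on N. A minimal sketch (the per-chunk token count and the 30% pruning factor are assumptions):

```python
# With the reranker on, injected tokens are a function of N, not K.
CHUNK_TOKENS = 500   # assumed tokens per retrieved chunk

def injected_tokens(k, n, reranker_on=True, pruning_on=True):
    chunks = n if reranker_on else k   # reranker narrows K candidates to N
    tokens = chunks * CHUNK_TOKENS
    return int(tokens * 0.7) if pruning_on else tokens

# Sweep K with N fixed: injected tokens are flat.
print([injected_tokens(k, 5) for k in (5, 10, 20, 50)])   # all equal
# Sweep N with K fixed: injected tokens scale linearly with N.
print([injected_tokens(20, n) for n in (1, 5, 10, 20)])
```

This is why N (and the reranker toggle) are the levers that move the LLM input cost, while K mainly affects the nearly-free vector-search stage.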
Bloom Level
Analyze (L4) — examine where in the RAG pipeline retrieved tokens accumulate and identify the highest-leverage stages for cost reduction.
Iframe Embed Code
Lesson Plan
Audience
Engineers building or tuning RAG systems on top of LLM APIs.
Duration
15–20 minutes inside Chapter 15.
Prerequisites
- Chapter 15 sections on chunking, top-K retrieval, reranking, and context pruning
- Chapter 3 sections on per-million-token pricing
Activities
- Identify cost dominance (5 min). At defaults, which stage costs the most? (Answer: LLM generation, dominating ~95% of the bill.)
- Reranker payoff exercise (5 min). Disable the reranker. Compute the cost increase. Then ask: at what query volume does the reranker pay for itself if it costs $0.001/query but saves you $0.005/query in LLM input cost?
- K vs N sensitivity (5 min). Set K=50, N=5. Compare to K=10, N=5. Discuss: how much does K really matter when reranker is on?
- Practice scenarios.
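The reranker-payoff exercise reduces to a one-line computation. The dollar figures come from the activity prompt; the monthly query volume is an assumption for illustration:

```python
# Worked version of the reranker-payoff question.
RERANK_COST = 0.001    # $/query to run the cross-encoder (from the prompt)
LLM_SAVINGS = 0.005    # $/query saved in LLM input cost (from the prompt)

net = LLM_SAVINGS - RERANK_COST   # positive => pays for itself on every query
queries_per_month = 1_000_000     # assumed volume
print(f"net savings: ${net:.3f}/query, ${net * queries_per_month:,.0f}/month")
```

With these numbers there is no break-even volume to wait for: the reranker is net-positive from the very first query, and volume only scales the absolute savings.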
Practice Scenarios
| # | Configuration | Predict total $/query | Highest-leverage knob? |
|---|---|---|---|
| 1 | K=20, N=5, both on | ? | ? |
| 2 | K=20, N=5, reranker off | ? | ? |
| 3 | K=10, N=3, both on | ? | ? |
| 4 | K=50, N=10, both on | ? | ? |
| 5 | K=50, N=10, reranker off | ? | ? |
Assessment
A learner has met the objective when they can identify (without the calculator) which stage of a given RAG configuration produces the most cost, predict the cost penalty of disabling reranker or pruning, and choose K/N values that match a target cost-per-query budget.
References
- Anthropic Cookbook — RAG patterns — discussion of chunking and reranking trade-offs.
- Lewis et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks — original RAG paper.
- Cohere Reranker documentation — cross-encoder reranker pricing referenced here.
- Pinecone Learning Center: Optimizing RAG Performance — practical operations guide.
Senior Instructional Designer Quality Review
Reviewer perspective: 15+ years designing engineering and data-science curricula for adult professional learners; expertise in Bloom's revised taxonomy, evidence-based assessment design, and accessibility of technical content.
Overall verdict
Strong fit for the stated learning objective. Approve as-is for Chapter 15. Score: 89/100 (B+). The "live cost annotations on every stage" design choice is exactly what L4 "examine" demands — the learner cannot help but see where the cost is, and that's the key analytical skill for RAG cost tuning.
What works (the pedagogy)
- Bloom alignment is correct. L4 "examine" requires the learner to break apart and see relationships. The cost annotations make the per-stage decomposition unavoidable.
- Cost dominates visibility, not pipeline structure. Most RAG diagrams show the structure first and bill second. This one inverts: the dollar amounts on each node make the cost lever immediately obvious.
- Disable-toggles for reranker and pruning. Showing the cost penalty of disabling each is the comparison move L4 needs. The toggles are pedagogically equivalent to the natural question "what if I just skip this step?"
- K and N as separate sliders. Most RAG demos conflate the two. Splitting them lets the learner see that K barely matters once the reranker is on.
What needs follow-up (the gaps)
- No total-bill panel showing monthly cost. Per-query cost is the local concern; the monthly bill is what gets a budget approved or rejected. A "× 1M queries/month = $X/month" annotation would translate the per-query cost into budget language. Score impact: −3.
- No comparison to no-RAG baseline. A learner could ask: how much does RAG cost compared to just feeding the whole corpus into the system prompt? A toggle to "show no-RAG baseline" would make the savings of retrieval visible. Score impact: −2.
- Embedding cost is not adjustable. Real production decisions include "should we use a cheaper embedding model?" — the current design treats embedding as a single fixed line. Score impact: −1.
- No latency annotations. RAG is also a latency story (each stage adds delay). The cost-only framing is the right primary lens for this textbook, but a secondary "show latency overlay" toggle would round out the picture. Score impact: −2.
Accessibility and clarity
- The cost annotations inside Mermaid nodes are 12px — readable but at the lower edge.
- Color choices (blue / purple / russet / green) are color-blind safe.
- Slider labels show numeric values at all times. ✓
Cognitive load assessment
- 8 nodes + total. Tractable.
- 4 controls in the toolbar. At the edge of "no instructions needed."
- Live update on slider input means the learner gets immediate feedback — good for L4.
Recommendation
Approve for use in Chapter 15. The four gaps are real but none are blockers; consider opening a follow-up ticket for the monthly-bill annotation (item 1) since it's the highest-impact pedagogical add. Ship.