Chapters

This textbook is organized into 20 chapters covering 475 concepts.

Chapter Overview

  1. LLMs, Tokens, and Generation Basics (26 concepts) — The foundational chapter establishing the core vocabulary used throughout the rest of the book: what an LLM is, what tokens are, how prompts and conversations are structured, and the basic generation parameters that control model output.
  2. Sampling, Tokenization, and Embeddings (21 concepts) — A deeper look at how text becomes numbers: byte-pair encoding mechanics, special tokens, multilingual and code tokenization, the embedding concept that underpins both retrieval and similarity, and the sampling parameters (temperature, top-p, logprobs) that control generation.
  3. Pricing, Economics, and Async API Modes (27 concepts) — The financial framing of LLM usage: per-million-token pricing, unit economics, cost-per-feature/user/outcome attribution, token budgets, the cost-quality and cost-latency frontiers, rate limits and quotas, plus the generic Batch API and Asynchronous API patterns that vendors will specialize.
  4. The Anthropic Claude Ecosystem (26 concepts) — Anthropic's Messages API in depth: the Claude Opus/Sonnet/Haiku family, the Claude tokenizer, prompt caching with cache control parameters, extended thinking and thinking token budgets, tool use, streaming, batch, and Claude vision input.
  5. The OpenAI Ecosystem (26 concepts) — OpenAI's API surface: Chat Completions and the newer Responses API, the GPT and o-series model lines, the tiktoken library, function calling and JSON mode, structured outputs, OpenAI batch and streaming, and the precise shape of OpenAI's token usage object.
  6. The Google Gemini Ecosystem (22 concepts) — Google's Gemini API: the Pro/Flash/Ultra lineup, the long context window and the one-million-token mode, Vertex AI and AI Studio surfaces, Gemini caching and grounding, plus the synthesis concept of cross-vendor tokenizer drift now that all three vendors have been introduced.
  7. AI Coding Harnesses and Agentic Loops (29 concepts) — How harness tools (Claude Code, OpenAI Codex CLI, Google Antigravity) accumulate tokens across multi-turn sessions: agentic and tool-use loops, conversation compaction and summarization, agent memory, multi-step reasoning, subagent patterns, and the cost difference between serial and parallel execution including the parallel token penalty.
  8. The Skills System (25 concepts) — Skills as a token-optimization primitive: the anatomy of a Skill (description, body, frontmatter, bundle, scripts), trigger design and precision, lazy loading versus eager listing, task decomposition and task-skill binding, skill misfires, and the practice of refactoring prose-heavy skills into script-backed versions for substantial token savings.
  9. Structured Logging for LLM Calls (27 concepts) — The instrumentation foundation that the rest of the book depends on: log schema design, JSON log fields, the standard set of LLM call fields (model, prompt hash, token counts, cost, latency, feature, user, outcome), session and trace identifiers, log sampling and retention, and the privacy primitives (data privacy, PII detection, PII redaction) that any LLM logging system must include from day one.
  10. Observability, Dashboards, and Alerting (20 concepts) — From raw logs to actionable signals: OpenTelemetry and OTel LLM conventions, metrics (counter, histogram), dashboards for cost/hit-rate/latency, time-series aggregation, anomaly detection, alerting rules and thresholds, cardinality concerns, and cross-service tracing across LLM calls.
  11. Log File Analysis and Cost Hotspots (20 concepts) — How to find the money: log aggregation, top-N cost drivers, Pareto analysis, outlier detection, runaway prompts, pathological agent loops, cost roll-ups by feature/user/model, prompt template grouping, percentile analysis (P50/P95/P99), and the analysis-notebook workflow that ties them together.
  12. A/B Testing Methodology for LLMs (25 concepts) — Distinguishing real cost reductions from noise: hypothesis design, control and treatment groups, traffic split, primary and guardrail metrics (cost, quality, latency, satisfaction), sample size and statistical power, statistical significance and effect size, sequential testing, multi-armed bandits, CUPED, and novelty effects.
  13. Prompt Engineering for Token Efficiency (25 concepts) — Reducing tokens at their source: system prompt hygiene, instruction compression, few-shot pruning, chain-of-thought tradeoffs, prompt templates and variable interpolation, prompt and output length budgets, token-aware rewriting, whitespace and comment stripping, and the discipline of reusable prompt blocks.
  14. Prompt Caching Patterns (20 concepts) — The single highest-leverage cost optimization: cache keys, hits, misses, hit-rate metrics, warming, invalidation, the stable-prefix and volatile-suffix structure that cache mechanics demand, cross-vendor caching differences, implicit versus explicit caching, eviction, the cache stampede problem, and cache-aware routing.
  15. Retrieval-Augmented Generation Optimization (24 concepts) — Tuning RAG so retrieved context earns its tokens: vector databases, chunking strategies, top-K retrieval and reranking, hybrid (BM25 + dense) retrieval, query rewriting and HyDE, document compression, summarization-based RAG, citation of sources, RAG cost analysis, and the precision/recall tradeoffs of retrieval.
  16. Context Window Management (14 concepts) — Keeping long-running sessions affordable: sliding windows, hierarchical summarization, memory files, the long-term/short-term memory split, context truncation strategies, the lost-in-the-middle effect, context reordering, and eviction policies for context that has stopped earning its keep.
  17. Model Routing and Output Control (30 concepts) — Spending output tokens deliberately: cheap-first cascades, escalation triggers and confidence thresholds, fallback models, cross-vendor routing, vendor-neutral abstractions, plus the full output-control toolkit (max tokens, stop sequences, JSON schema enforcement, reasoning effort and thinking-token settings).
  18. Agent Budget Policies and Session Limits (22 concepts) — Bounding what an autonomous harness can spend: per-session token and tool-call budgets, loop iteration limits, wall-clock limits, cost caps and graceful degradation, runaway detection, circuit breakers, tool-call throttling, per-engineer and per-PR budgets, the vendor-imposed 5-hour and weekly session limits, and budget-versus-outcome reporting.
  19. Batch Job Operations, Privacy, and Compliance (26 concepts) — Operating batch workloads safely: job submission, status, windows and discount rates, idempotency keys, retry and backoff strategies, plus the compliance frameworks (GDPR, HIPAA, SOC2), data residency, vendor data retention, opt-out of training, hashing of sensitive strings, anonymization, and audit trails that production deployments require.
  20. Capstone Projects and Continuous Practice (20 concepts) — Putting it all together: baseline cost measurement, optimization hypotheses, before-after reports, optimization backlogs, canary and pilot rollouts, the three capstone projects (token dashboard, vendor-neutral logging, skill refactor), eval suites, golden test sets, regression test loops, continuous cost monitoring, and the long-term token-efficiency roadmap.
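Much of the book builds on the per-million-token pricing arithmetic introduced in Chapter 3. As a taste of what follows, here is a minimal sketch of that arithmetic; the function name, token counts, and dollar rates are all illustrative placeholders, not any vendor's published pricing:

```python
def call_cost(input_tokens: int, output_tokens: int,
              input_price_per_mtok: float,
              output_price_per_mtok: float) -> float:
    """Dollar cost of one LLM call at per-million-token rates."""
    return ((input_tokens / 1_000_000) * input_price_per_mtok
            + (output_tokens / 1_000_000) * output_price_per_mtok)

# Example: 12,000 input tokens and 800 output tokens at
# hypothetical rates of $3.00 in / $15.00 out per million tokens.
cost = call_cost(12_000, 800, 3.00, 15.00)
print(f"${cost:.4f}")  # $0.0480
```

Note the asymmetry this sketch encodes: output tokens are typically priced several times higher than input tokens, which is why later chapters treat output-length control as a first-class optimization.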

How to Use This Textbook

Read the chapters in order. The dependency graph used to generate this structure guarantees that no chapter introduces a concept whose prerequisites have not yet been covered. Experienced readers may instead skim earlier chapters and jump to specific topics — the per-chapter concept lists tell you exactly which concepts are introduced where.


Note: Each chapter's index lists the concepts that chapter introduces. The full learning graph and its dependency relationships are available in the Learning Graph section.