Token Optimization

Title: Token Optimization: Measuring, Analyzing, and Reducing the Cost of Generative AI

Target Audience: Professional development / adult continuing education. Primary learners are software engineers, machine learning engineers, platform engineers, technical leads, FinOps practitioners, and engineering managers responsible for the cost and performance of generative AI systems in production. Secondary learners include graduate students in computer science, data science, or information systems who want practical exposure to the economics of large language models.

Prerequisites:

  • Working knowledge of at least one programming language (Python preferred)
  • Familiarity with REST APIs and JSON
  • Basic command-line and Git skills
  • Conceptual exposure to large language models (LLMs) at the level of "I have used ChatGPT, Claude, or a similar tool"
  • Helpful but not required: experience with logging frameworks, cloud cost dashboards, or experiment tracking tools

Course Overview

In many organizations today, the token usage costs of generative AI tools are becoming a dominant factor in operating expenses. A single poorly designed prompt, a verbose system message, an unbounded context window, or an over-eager agent loop can multiply costs tenfold or a hundredfold without producing better outcomes. Yet very few engineers and managers have a rigorous, end-to-end understanding of where tokens come from, how they are billed, how to measure them, and how to systematically drive them down without hurting quality.

This course closes that gap. It begins with a clear, practical mental model of how large language models consume input tokens and produce output tokens, including how tokenization works, how context windows are billed, and how features like prompt caching, batch APIs, streaming, and tool use change the cost profile. From this foundation, the course builds a complete toolkit for measuring and optimizing token usage in real systems: structured logging of every request and response, dashboards and cost attribution, controlled A/B testing of prompts and models, regression analysis of log files, and patterns for prompt compression, retrieval-augmented generation tuning, model routing, and agent loop budgeting.

The course is deliberately vendor-pluralistic. It covers the three dominant ecosystems that engineering teams encounter today: Anthropic Claude (including the Claude API, prompt caching, and the Claude Code harness), OpenAI (including the Chat Completions and Responses APIs and the Codex coding harness), and Google Gemini (including the Gemini API and the Antigravity agentic coding harness). Learners finish the course able to instrument any of these systems, compare them on cost-quality tradeoffs, and design a token-aware architecture that survives contact with production traffic.

Main Topics Covered

  • A high-level mental model of LLMs: tokens, tokenizers, context windows, input vs. output token pricing, and why output tokens usually cost more
  • Tokenization deep dive: byte-pair encoding, how to count tokens before sending a request, and per-vendor tokenizer differences
  • The economics of generative AI: per-million-token pricing, cached vs. uncached input, batch discounts, and how to model unit economics for a feature (a worked cost sketch follows this list)
  • Anthropic Claude ecosystem: the Messages API, system prompts, prompt caching, extended thinking, tool use, and the Claude Code harness as a token consumer
  • OpenAI ecosystem: Chat Completions and Responses APIs, function calling, structured outputs, and the Codex coding harness
  • Google Gemini ecosystem: the Gemini API, long-context behavior, and the Antigravity agentic coding harness
  • Cross-vendor comparison: how to fairly benchmark the same task across Claude, OpenAI, and Gemini on cost, latency, and quality
  • Structured logging for LLM calls: what to log per request (model, input tokens, output tokens, cached tokens, latency, cost, prompt hash, user, feature, outcome) and how to keep logs privacy-safe
  • Observability and dashboards: building cost-per-feature, cost-per-user, and cost-per-outcome views from raw logs
  • Log file analysis: aggregating and slicing usage logs to find the top cost drivers, runaway prompts, and pathological agent loops
  • A/B testing methodology for prompts and models: hypothesis design, traffic splitting, metric selection (cost, quality, latency, satisfaction), sample size, and statistical significance
  • Prompt engineering for token efficiency: system prompt hygiene, instruction compression, few-shot pruning, and removing dead context
  • Prompt caching patterns: what to cache, cache key design, hit-rate measurement, and invariants that break caching silently
  • Retrieval-Augmented Generation (RAG) tuning for cost: chunk size, top-k selection, reranking, and dropping retrieved context that does not improve answers
  • Context window management: summarization, sliding windows, and compaction strategies for long conversations and agent sessions
  • Model routing and cascades: cheap-model-first patterns, escalation triggers, and confidence-based fallbacks
  • Output control: max_tokens, stop sequences, JSON-mode constraints, and reasoning-budget controls for thinking models
  • Agent and harness cost control: tool-call budgets, loop limits, and detecting agentic runaway in Claude Code, Codex, and Antigravity
  • Skills as a token-optimization primitive: what a Skill is; how harness tools decompose a large problem into tasks and bind each task to the appropriate Skill; why only a Skill's short trigger description sits in the context window while its body is loaded on demand; how Skills delegate work to shell scripts and Python code instead of generating it token-by-token; and how moving a Skill's deterministic portions from prose into scripts can typically cut per-invocation token usage by roughly 30 percent, with worked before/after examples
  • Batch and asynchronous APIs: when to move workloads off the synchronous path for large discounts
  • Capstone project: instrument a real or simulated application end-to-end and demonstrate a measurable token reduction without quality loss
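
To make the unit-economics topic above concrete, here is a minimal sketch of the kind of cost model the course builds. All prices are illustrative placeholders, not published vendor rates, and the traffic numbers are invented:

    # Illustrative unit-economics model for one LLM-backed feature.
    # Prices are hypothetical placeholders in USD per million tokens;
    # substitute the current published rates for your chosen model.
    PRICE_PER_MTOK = {
        "input": 3.00,         # uncached input tokens
        "cached_input": 0.30,  # cache-read input tokens
        "output": 15.00,       # output tokens
    }

    def request_cost(input_tokens: int, cached_tokens: int, output_tokens: int) -> float:
        """Cost in USD for one request, splitting cached from uncached input."""
        uncached = input_tokens - cached_tokens
        return (
            uncached * PRICE_PER_MTOK["input"]
            + cached_tokens * PRICE_PER_MTOK["cached_input"]
            + output_tokens * PRICE_PER_MTOK["output"]
        ) / 1_000_000

    cost = request_cost(input_tokens=2_000, cached_tokens=1_500, output_tokens=300)
    print(f"per request: ${cost:.6f}, per day at 50k requests: ${cost * 50_000:,.2f}")

With these placeholder rates, the same request costs roughly 60 percent more when nothing is cached, which is exactly the kind of lever the measurement material teaches you to find in your own logs.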

Topics Not Covered

  • Training, fine-tuning, or pre-training of foundation models from scratch
  • Low-level GPU, CUDA, or inference-server optimization (vLLM, TensorRT, kernel-level work)
  • General cloud FinOps beyond what touches LLM spend (e.g., Kubernetes right-sizing, storage tiering)
  • The internal mathematics of transformer architectures beyond what is needed to reason about token economics
  • Image, audio, and video generation cost optimization (the focus is text and code tokens)
  • Vendor-specific enterprise procurement and contract negotiation
  • Building a foundation model company or going beyond the public APIs of the three covered vendors
  • Prompt injection and adversarial security (touched on only where it intersects with cost, e.g., abuse-driven token spikes)

Learning Outcomes

After completing this course, students will be able to:

Remember

Retrieving, recognizing, and recalling relevant knowledge from long-term memory.

  • Define the terms token, tokenizer, context window, input token, output token, cached token, and reasoning token
  • Recall the published per-million-token prices for the major Anthropic, OpenAI, and Google models for input, cached input, and output
  • List the headers and response fields that report token usage in the Anthropic, OpenAI, and Google APIs
  • Identify the standard fields that belong in a structured LLM call log (model, prompt hash, input tokens, output tokens, cached tokens, latency, cost, feature, user, outcome)
  • Name the primary harness tools for each vendor: Claude Code (Anthropic), Codex (OpenAI), and Antigravity (Google)
  • Recognize common token-cost antipatterns such as unbounded context, redundant system prompts, and runaway agent loops
  • Recall the major levers for reducing token cost: prompt caching, model routing, output limits, batch APIs, and RAG tuning
  • Define a Skill and recall how a harness binds a task to a Skill via its short trigger description
  • Recognize the parts of a Skill (trigger description, body, bundled scripts and assets) and which parts are loaded eagerly versus on demand

Understand

Constructing meaning from instructional messages, including oral, written, and graphic communication.

  • Explain why output tokens are typically billed at a higher rate than input tokens, and why cached input is cheaper still
  • Describe how byte-pair encoding produces tokens and why the same English text can produce different token counts across vendors
  • Explain how prompt caching works in the Claude API and how cache hits change the effective cost of a request
  • Summarize the difference between synchronous, streaming, and batch API modes and the cost implications of each
  • Describe how an agentic harness like Claude Code, Codex, or Antigravity accumulates tokens across a multi-turn session (illustrated in the sketch after this list)
  • Explain the relationship between context window size, latency, and cost for long-context workloads
  • Interpret a token usage log and explain what each field means and why it is needed for cost attribution
  • Describe the role of A/B testing in distinguishing real cost reductions from noise
  • Explain how a harness decomposes a large user request into discrete tasks and selects a Skill for each task based solely on the Skill's short trigger description
  • Explain why keeping only a Skill's trigger description in the context window — and lazy-loading the body — reduces baseline token consumption per session
  • Describe how a Skill can delegate deterministic work to a shell script or Python program rather than emitting equivalent tokens through the model
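
The accumulation effect described above is easiest to grasp with arithmetic. The sketch below uses illustrative numbers and deliberately ignores caching and compaction, which is precisely why those techniques matter:

    # Why multi-turn agent sessions get expensive: each turn re-sends the
    # entire conversation so far, so cumulative billed input grows roughly
    # quadratically with turn count. All numbers are illustrative.
    system_tokens = 1_000    # system prompt + tool definitions, sent every turn
    tokens_per_turn = 500    # new content appended per turn (messages, tool results)

    cumulative_input = 0
    context = system_tokens
    for turn in range(1, 21):
        cumulative_input += context   # the whole context is billed as input
        context += tokens_per_turn    # this turn's output and results are appended

    print(f"context at turn 20: {context:,} tokens")
    print(f"cumulative input billed: {cumulative_input:,} tokens")

After twenty turns the live context holds 11,000 tokens, yet the session has billed 115,000 input tokens, roughly ten times as many, which is why per-session budgets and compaction recur throughout the harness material.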

Apply

Carrying out or using a procedure in a given situation.

  • Use a vendor tokenizer to count tokens in a candidate prompt before sending it to the API
  • Instrument an existing application so that every LLM call writes a structured log line containing model, token counts, cost, and feature (a minimal sketch follows this list)
  • Configure prompt caching on the Anthropic Claude API and verify cache hits using the response usage fields
  • Apply max_tokens, stop sequences, and JSON-mode constraints to bound output token usage on each vendor's API
  • Apply a model-routing pattern that sends easy requests to a cheaper model and escalates only when needed
  • Run a batch job using each vendor's batch or asynchronous API and compute the realized discount versus synchronous calls
  • Use the Claude Code, Codex, or Antigravity harness in a way that respects a configured per-session token or tool-call budget
  • Apply summarization or compaction to keep a long-running agent session within a target context size
  • Convert prose-heavy steps inside an existing Skill into a shell or Python script and measure the per-invocation token reduction (target: ~30% on representative tasks)
  • Author a Skill trigger description that is specific enough to fire reliably and short enough to keep baseline context cost low
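
As a starting point for the instrumentation outcomes above, here is a minimal sketch. The call_llm client and the usage field names are stand-ins, since each vendor reports usage under different names; tiktoken is OpenAI's tokenizer library, and the other vendors expose token-counting endpoints instead:

    import hashlib
    import json
    import time

    import tiktoken  # OpenAI tokenizer; adapt for other vendors

    # Encoding name is an assumption; choose the one matching your model.
    ENC = tiktoken.get_encoding("cl100k_base")

    def instrumented_call(call_llm, *, model: str, prompt: str, feature: str):
        # Pre-count tokens so oversized prompts can be caught before they bill.
        estimated_input = len(ENC.encode(prompt))
        start = time.monotonic()
        response = call_llm(model=model, prompt=prompt)  # hypothetical client call
        usage = response["usage"]                        # adapt per vendor
        record = {
            "ts": time.time(),
            "model": model,
            "feature": feature,
            "prompt_hash": hashlib.sha256(prompt.encode()).hexdigest()[:16],
            "estimated_input_tokens": estimated_input,
            "input_tokens": usage.get("input_tokens", 0),
            "cached_tokens": usage.get("cached_tokens", 0),
            "output_tokens": usage.get("output_tokens", 0),
            "latency_ms": round((time.monotonic() - start) * 1000, 1),
        }
        print(json.dumps(record))  # in production, ship to your log pipeline
        return response

The record stores a prompt hash rather than the prompt text, which keeps the log privacy-safe while still letting you group requests by template.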

Analyze

Breaking material into constituent parts and determining how the parts relate to one another and to an overall structure or purpose.

  • Analyze a log file of LLM calls to identify the top cost drivers by feature, user, model, and prompt template (see the analysis sketch after this list)
  • Decompose a single expensive request into its system prompt, few-shot examples, retrieved context, user message, and output, and attribute cost to each component
  • Diagnose why a prompt cache hit rate is lower than expected by examining cache key invariants
  • Compare cost-per-successful-outcome across two prompt variants, two models, or two vendors using A/B test data
  • Trace an agent harness session and identify which tool calls or loop iterations contributed disproportionately to token cost
  • Analyze the cost-quality-latency tradeoff curve for a given task across Claude, OpenAI, and Gemini models
  • Distinguish between cost reductions that improve unit economics and those that hide quality regressions
  • Decompose a Skill into the portions that must remain as model-readable prose and the portions that can be moved into deterministic scripts
  • Analyze a harness session log to identify Skills whose trigger descriptions misfire (loaded but not used, or needed but not loaded) and quantify the token cost of each misfire
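
A minimal sketch of the log-slicing outcomes above, assuming JSON-lines records shaped like the logging sketch earlier plus precomputed cost_usd and session_id fields; the file name is a placeholder:

    import pandas as pd

    df = pd.read_json("llm_calls.jsonl", lines=True)

    # Rank features by total spend to surface the top cost drivers.
    top_features = (
        df.groupby("feature")["cost_usd"]
          .agg(total="sum", calls="count", mean="mean")
          .sort_values("total", ascending=False)
          .head(10)
    )
    print(top_features)

    # Flag candidate runaway sessions: call counts far above the typical session.
    calls_per_session = df.groupby("session_id").size()
    print(calls_per_session[calls_per_session > calls_per_session.quantile(0.99)])

The same groupby pattern extends to user, model, and prompt_hash, which together cover the attribution dimensions listed above.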

Evaluate

Making judgments based on criteria and standards through checking and critiquing.

  • Evaluate whether an A/B test result is statistically significant given the observed sample size and variance (see the sketch after this list)
  • Critique a proposed prompt change for its likely impact on cost, latency, quality, and cache hit rate
  • Judge whether a workload should run on the synchronous, streaming, or batch path based on latency requirements and cost targets
  • Assess the privacy and compliance risk of an LLM logging schema and recommend redaction strategies
  • Evaluate vendor lock-in risk when adopting a vendor-specific feature such as Anthropic prompt caching or Gemini long context
  • Recommend, with justification, a model and harness combination for a given engineering team based on cost, capability, and ecosystem fit
  • Critique an existing observability dashboard for missing dimensions needed to attribute cost to business outcomes
  • Judge which portions of a Skill belong in the model-facing description versus an external script, weighing token cost, correctness, and maintainability
  • Evaluate a candidate Skill trigger description for invocation precision (false positives and false negatives) and recommend revisions
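
For the significance judgment above, here is a minimal sketch using a Welch t-test on per-request costs. The cost samples are simulated placeholders; in practice you would pull them from A/B-bucketed usage logs:

    import numpy as np
    from scipy import stats

    # Simulated per-request costs for two prompt variants (placeholders).
    rng = np.random.default_rng(0)
    variant_a = rng.gamma(shape=2.0, scale=0.0050, size=4_000)  # baseline prompt
    variant_b = rng.gamma(shape=2.0, scale=0.0045, size=4_000)  # compressed prompt

    # Welch's t-test: does the observed cost difference exceed sampling noise?
    t_stat, p_value = stats.ttest_ind(variant_a, variant_b, equal_var=False)
    print(f"mean A: ${variant_a.mean():.5f}  mean B: ${variant_b.mean():.5f}")
    print(f"Welch t = {t_stat:.2f}, p = {p_value:.4g}")

A small p-value on cost alone is never a win: the evaluation framework also requires showing that the cheaper variant's quality metric has not regressed.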

Create

Putting elements together to form a coherent or functional whole; reorganizing elements into a new pattern or structure.

  • Design a structured logging schema for LLM calls that supports cost attribution by feature, user, model, and outcome
  • Design and run a controlled A/B test comparing two prompt or model variants, including hypothesis, metrics, traffic split, and stopping rule
  • Build an analysis notebook that ingests raw LLM call logs and produces a ranked list of cost-reduction opportunities
  • Construct a model-routing layer that selects between Claude, OpenAI, and Gemini models based on task type and budget (a routing sketch follows this list)
  • Design a prompt caching strategy for a real application, including cache key boundaries and a hit-rate monitoring plan
  • Develop an agent-harness budget policy that bounds per-session tool calls and tokens and gracefully degrades when the budget is exhausted
  • Design a Skill from scratch with an optimized split between a short trigger description, a concise body, and bundled scripts that handle deterministic work
  • Refactor an existing prose-heavy Skill into a script-backed version and produce a before/after token report demonstrating the reduction
  • Capstone project: Take a real or realistic application that calls an LLM, instrument it with structured logging, run a baseline cost measurement, propose at least three optimizations spanning prompt, caching, and routing, A/B test them against the baseline, and produce a final report demonstrating a measurable token-cost reduction with no significant quality regression
  • Alternative capstone: Build a vendor-neutral token observability dashboard that ingests logs from Claude, OpenAI, and Gemini calls and exposes cost-per-feature, cost-per-user, and cache-hit-rate views
  • Alternative capstone: Design and document a token budget policy for an autonomous coding harness (Claude Code, Codex, or Antigravity), including loop limits, escalation rules, and a reporting format that an engineering manager can review weekly
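
Finally, as one concrete anchor for the model-routing outcome above, a minimal cascade sketch. The model names, the call_llm client, and the confidence check are all placeholders; real escalation triggers might use a validator, log probabilities, or task metadata:

    # Cheap-model-first cascade: try inexpensive models and escalate only
    # when a confidence check fails. Every name here is a placeholder.
    ROUTE = ["small-model", "medium-model", "large-model"]  # cheapest first

    def route_request(call_llm, confident, prompt: str) -> str:
        for model in ROUTE[:-1]:
            answer = call_llm(model=model, prompt=prompt)
            if confident(answer):  # e.g., parses as valid JSON, passes a rubric
                return answer
        # Fall back to the most capable (and most expensive) model.
        return call_llm(model=ROUTE[-1], prompt=prompt)

The escalation rate then becomes a first-class metric: if most traffic ends up on the expensive model anyway, the cascade is adding latency without saving tokens.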