
The Google Gemini Ecosystem

Summary

Google's Gemini API: the Pro/Flash/Ultra lineup, the long context window and the one-million-token mode, Vertex AI and AI Studio surfaces, Gemini caching and grounding, plus the synthesis concept of cross-vendor tokenizer drift now that all three vendors have been introduced.

Concepts Covered

This chapter covers the following 22 concepts from the learning graph:

  1. Gemini Tokenizer
  2. Cross-Vendor Tokenizer Drift
  3. Google Gemini API
  4. Gemini Model Family
  5. Gemini Pro
  6. Gemini Flash
  7. Gemini Ultra
  8. Gemini SDK
  9. Long Context Window
  10. One Million Context
  11. Gemini Function Calling
  12. Gemini Tool Config
  13. Gemini Streaming
  14. Gemini Batch Mode
  15. Gemini Caching
  16. Vertex AI
  17. Google AI Studio
  18. Gemini Safety Settings
  19. Gemini Multimodal Input
  20. Gemini Code Execution
  21. Gemini Grounding
  22. Gemini Token Counting

Prerequisites

This chapter builds on concepts from:


Vendor #3: Gemini, Plus the Cross-Vendor View

Pemba waving welcome

Gemini is the third major vendor and the one with the most distinctive cost shape — the longest context window in the industry (1M tokens), aggressive Flash pricing, and built-in grounding to Google Search. Once we finish here, you'll have all three vendors mapped, and we'll close with a cross-vendor view: how the same prompt costs different amounts of money on each, and what to do about it.

The Gemini API and Two Surfaces

The Google Gemini API is the HTTPS interface to Gemini models, reachable through two distinct product surfaces with subtly different defaults and pricing.

Google AI Studio is the developer-facing surface — a free playground at aistudio.google.com and a public API at generativelanguage.googleapis.com. AI Studio targets individual developers and small teams; it has a generous free tier, simple API key authentication, and the same model lineup as Vertex.

Vertex AI is the enterprise surface — the Gemini models exposed through Google Cloud at <region>-aiplatform.googleapis.com. Vertex uses GCP IAM for authentication (no separate API key), supports VPC-SC for private networking, offers data-residency guarantees, and bills through your existing GCP project. Vertex pricing per token is similar to AI Studio but unlocks enterprise features like grounding, customer-managed encryption keys, and committed-use discounts.

The decision rule: prototype on AI Studio, ship to production on Vertex if your organization is GCP-shaped or has compliance requirements; stay on AI Studio if you're a small team without those constraints. Both surfaces use the same model names and tokenize text identically, so moving between them is straightforward.

The Gemini SDK is the official client — google-genai on PyPI is the modern unified client that works against both AI Studio and Vertex, configured by which credentials you provide. The older google-generativeai and google-cloud-aiplatform packages still exist; for new code, use google-genai.

from google import genai
client = genai.Client(api_key="...")  # AI Studio mode
# or
client = genai.Client(vertexai=True, project="...", location="us-central1")  # Vertex mode

resp = client.models.generate_content(
    model="gemini-2.0-flash",
    contents="What is the capital of France?",
)
print(resp.text)
print(resp.usage_metadata)

The Gemini Model Family

The Gemini model family spans three capability tiers, following a naming pattern similar to the Claude family:

  • Gemini Pro is the high-capability tier — the model to reach for when reasoning quality matters. Comparable in role to Claude Sonnet or GPT-4o. Supports the long context window mode.
  • Gemini Flash is the high-throughput tier — fast, cheap, and surprisingly capable. The default first choice for production workloads, especially anything with high request volume. Often costs less than $0.20/MTok input on the smaller Flash variants.
  • Gemini Ultra is the largest, slowest, and most capable tier (when available; not all generations ship Ultra). Reserve for genuinely hard tasks where the cost is justified.

The pricing differential between tiers is the most aggressive of the three vendors — Flash is often 10–20× cheaper than Pro on the same generation. That makes Gemini particularly well-suited to routing patterns (Chapter 17) where the cheap-first cascade really pays off.

| Tier  | Typical Use                                 | Relative Cost  | Long Context Support |
|-------|---------------------------------------------|----------------|----------------------|
| Flash | High-throughput, classification, extraction | 1× (baseline)  | Yes                  |
| Pro   | General-purpose reasoning                   | ~5×–10× Flash  | Yes (1M-token mode)  |
| Ultra | Hardest reasoning tasks                     | ~5× Pro        | Yes                  |

The Long Context Window

Gemini's defining feature is the long context window. Pro and Flash both support context lengths well beyond what other vendors offer at production-grade quality, and one million context mode (1M tokens) is available on Pro and selected Flash variants. For comparison, Claude's standard context is 200K and OpenAI's flagship models top out at a few hundred thousand.

A million tokens is roughly 750,000 words — about 10 average novels, or the entire codebase of a medium-sized project, or thousands of pages of legal documents. The cost shape this enables:

  • Single-shot document Q&A on documents that would otherwise require RAG (Chapter 15) — you can paste the whole document into the prompt and ask questions directly
  • Whole-codebase code search and explanation — a single prompt can include an entire repository
  • Long-form analysis where prior context matters — multi-hour conversations stay coherent without external memory

The catch is twofold. First, you pay for every input token, so a 500K-token request is genuinely expensive even if the answer is one sentence — the per-call cost can reach several dollars on Pro. Second, context quality is not uniform across long windows; the lost-in-the-middle effect (Chapter 16) means content buried in the middle of a 500K-token prompt may be retrieved less reliably than content near the start or end. Long context is a tool, not a panacea — pair it with attention to placement.
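
To make the first catch concrete, here is a back-of-the-envelope sketch of the cost of one long-context call. The prices are placeholders rather than published rates; substitute the current rate card for the model you actually use.

# Rough cost arithmetic for a single long-context request.
# PRICE_* values are illustrative placeholders, not real Gemini prices.
INPUT_TOKENS = 500_000          # a 500K-token document pasted into the prompt
OUTPUT_TOKENS = 500             # a one-paragraph answer
PRICE_IN_PER_MTOK = 2.50        # assumed $/MTok for long-context input
PRICE_OUT_PER_MTOK = 10.00      # assumed $/MTok for output

cost = (INPUT_TOKENS / 1_000_000) * PRICE_IN_PER_MTOK \
     + (OUTPUT_TOKENS / 1_000_000) * PRICE_OUT_PER_MTOK
print(f"~${cost:.2f} per call")  # input dominates: almost the entire cost is the document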

Gemini Caching

Gemini caching is Google's prompt-caching mechanism, structurally similar to Anthropic's but with a different lifecycle and pricing model. There are two flavors:

  • Implicit caching is automatic — the API caches stable prefixes of repeated requests without any opt-in, and applies a discount to subsequent hits. You get savings without code changes; you also have less control over what gets cached and when.
  • Explicit caching is opt-in via a caches.create() call that creates a named cache resource with a TTL. Your subsequent generation requests reference the cache by ID. This is the version to use for predictable, large savings on long stable prefixes — typical savings are 50–75% off the cached portion.

The Gemini caching minimum is meaningful: a cache must contain at least 1,024 tokens (Flash) or 4,096 tokens (Pro), with TTLs starting at one minute and configurable up to several hours. There is also a storage cost for cached content (charged per million tokens per hour) on top of the read discount — unlike Anthropic, which only charges at write/read time. For long-lived caches with many readers, the storage cost is well worth it; for short-lived or low-hit-rate caches, it can erode the savings.
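
A minimal sketch of explicit caching with the google-genai SDK, assuming a long stable document (long_document is a placeholder you supply, and it must clear the token minimum above). Explicit caches are created against a specific model version; exact field names may vary between SDK releases.

from google import genai
from google.genai import types

client = genai.Client(api_key="...")

# Create a named cache holding the stable prefix, with a one-hour TTL.
cache = client.caches.create(
    model="gemini-2.0-flash-001",           # explicit caches pin a model version
    config=types.CreateCachedContentConfig(
        contents=[long_document],           # placeholder: your large stable prefix
        ttl="3600s",
    ),
)

# Subsequent requests reference the cache instead of resending the prefix.
resp = client.models.generate_content(
    model="gemini-2.0-flash-001",
    contents="Summarize the termination clauses.",
    config=types.GenerateContentConfig(cached_content=cache.name),
)
print(resp.usage_metadata)  # cached tokens are reported separately from new input tokens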

The diagram below contrasts the three vendor caching models:

Diagram: Cross-Vendor Caching Comparison

Cross-Vendor Caching Comparison

Type: chart
sim-id: cross-vendor-caching-comparison
Library: Chart.js
Status: Specified

Purpose: Plot the cumulative cost over a sequence of N requests for the same workload on Anthropic, OpenAI (when prompt caching is enabled), and Gemini, so the structural differences in caching mechanics become visible.

Bloom Level: Analyze (L4)
Bloom Verb: compare

Learning objective: Compare the caching cost models of three vendors and recommend a vendor for a workload based on prefix length, request volume, and cache lifetime.

Chart type: Multi-line chart
  • X-axis: Request number (1 to 50)
  • Y-axis: Cumulative cost ($)

Data series:
  1. Anthropic (cache write at request 1 with 1.25× premium, then 0.10× reads)
  2. OpenAI (when enabled, similar structure with different multipliers)
  3. Gemini implicit (slow ramp, less control)
  4. Gemini explicit (large drop after cache creation, plus per-hour storage cost adding small linear growth)
  5. No caching baseline (linear, full price every request)

Interactive controls:
  • Slider: Stable prefix size (500 to 50,000 tokens)
  • Slider: Cache lifetime (1 minute to 24 hours)
  • Slider: Request frequency (per minute)
  • Toggle: "Show storage cost on Gemini explicit"

Data Visibility Requirements:
  • Stage 1: Show all five lines
  • Stage 2: Highlight the crossover points where vendor A becomes cheaper than vendor B
  • Stage 3: Show a summary table at bottom: total cost over 50 requests, % savings vs no-cache

Default values:
  • Prefix: 5,000 tokens
  • Lifetime: 1 hour
  • Frequency: 1/min
  • Storage cost shown

Implementation: Chart.js multi-line, responsive width

Function Calling, Tools, and Code Execution

Gemini function calling matches the pattern from Claude and OpenAI: declare tools with name, description, and JSON-schema parameters; the model can emit structured tool calls; your application executes them and returns results. The implementation details:

  • Tools are declared via the tools parameter on generate_content
  • Each tool is wrapped as a FunctionDeclaration
  • The response includes function_calls when the model decides to use one

Gemini tool config is the equivalent of OpenAI's tool_choice parameter — it controls whether the model is forced to call a function, forbidden from doing so, or left to decide. The function_calling_config.mode field accepts AUTO, ANY (must call something), or NONE (must not call).
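
A minimal sketch of both ideas together, assuming the google-genai SDK; the get_weather tool and its schema are hypothetical, and mode="ANY" forces the model to emit a tool call:

from google import genai
from google.genai import types

client = genai.Client(api_key="...")

# Declare a (hypothetical) tool with JSON-schema-style parameters.
get_weather = types.FunctionDeclaration(
    name="get_weather",
    description="Look up the current weather for a city.",
    parameters=types.Schema(
        type="OBJECT",
        properties={"city": types.Schema(type="STRING")},
        required=["city"],
    ),
)

resp = client.models.generate_content(
    model="gemini-2.0-flash",
    contents="What's the weather in Paris right now?",
    config=types.GenerateContentConfig(
        tools=[types.Tool(function_declarations=[get_weather])],
        tool_config=types.ToolConfig(
            function_calling_config=types.FunctionCallingConfig(mode="ANY"),
        ),
    ),
)

# The model's structured calls; your application executes them and returns results.
for call in resp.function_calls or []:
    print(call.name, call.args)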

Gemini code execution is a built-in tool that lets the model execute Python code in a sandbox and use the results in its response. Enable it by including tools=[{"code_execution": {}}]. The model can write code, see the output, and incorporate it — useful for math problems, data manipulation, or chart generation. The execution tokens count toward the response total, so a single code-execution turn can cost more than a normal turn.

Gemini grounding is Google's native integration with Google Search. Enable it by including tools=[{"google_search": {}}] and the model can issue search queries and cite sources in the response. This is genuinely differentiated — neither Anthropic nor OpenAI offers comparable first-party search integration. Grounding adds tokens for the retrieved snippets and for source citations, but for any task that benefits from current web information, the cost is justified.
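
Both built-in tools are enabled the same way, as sketched below with the google-genai SDK. The prompts are illustrative, and the tool names assume the Gemini 2.0-era API; older models used a different google_search_retrieval variant.

from google import genai
from google.genai import types

client = genai.Client(api_key="...")

# Built-in code execution: the model writes and runs Python in a sandbox.
calc = client.models.generate_content(
    model="gemini-2.0-flash",
    contents="What is the sum of the first 200 prime numbers? Compute it, don't estimate.",
    config=types.GenerateContentConfig(
        tools=[types.Tool(code_execution=types.ToolCodeExecution())],
    ),
)
print(calc.text)

# Built-in grounding: the model can issue Google Search queries and cite sources.
grounded = client.models.generate_content(
    model="gemini-2.0-flash",
    contents="What changed in Gemini API pricing this quarter? Cite sources.",
    config=types.GenerateContentConfig(
        tools=[types.Tool(google_search=types.GoogleSearch())],
    ),
)
print(grounded.text)
# Citation details, when returned, live in grounded.candidates[0].grounding_metadata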

Multimodal Input and Safety Settings

Gemini multimodal input is the broadest of the three vendors: text, images, audio, video, and PDF documents are all supported as input content types. Video in particular is distinctive — Gemini Pro can process several minutes of video in a single request, sampling frames and audio together. Token costs vary widely by modality:

  • Images: ~250–1,000 tokens per image depending on resolution
  • Audio: ~32 tokens per second
  • Video: ~250 tokens per frame, sampled at ~1fps by default
  • PDF: text-extracted at the standard text rate, plus image rates for embedded images

Always log multimodal token costs separately by modality — they have very different cost-per-content profiles.

Gemini safety settings are configurable thresholds for content categories (harassment, hate speech, sexually explicit, dangerous content). Each category has thresholds from BLOCK_NONE to BLOCK_LOW_AND_ABOVE. The default thresholds are conservative; adjust per use case. Safety filters that fire return a finish_reason of SAFETY instead of generating output — these failed requests still cost input tokens but not output tokens.
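
A sketch combining an image input with an adjusted safety threshold, assuming the google-genai SDK; the filename and the chosen category/threshold are illustrative:

from google import genai
from google.genai import types

client = genai.Client(api_key="...")

with open("chart.png", "rb") as f:          # placeholder: any local image
    image_bytes = f.read()

resp = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=[
        types.Part.from_bytes(data=image_bytes, mime_type="image/png"),
        "Describe the trend shown in this chart.",
    ],
    config=types.GenerateContentConfig(
        safety_settings=[
            types.SafetySetting(
                category="HARM_CATEGORY_DANGEROUS_CONTENT",
                threshold="BLOCK_ONLY_HIGH",   # relax one category relative to the defaults
            ),
        ],
    ),
)
print(resp.text)
print(resp.usage_metadata)  # image tokens are included in the prompt token count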

Gemini Streaming, Batch, and Token Counting

Gemini streaming uses the generate_content_stream() method (or equivalent in your language) and yields chunks as the model generates them. The final chunk includes the full usage_metadata for cost accounting.

Gemini batch mode is the Gemini specialization of the Batch API pattern from Chapter 3. Submit a batch job (typically via Cloud Storage on Vertex), receive results within a 24-hour window, pay 50% of the synchronous price. Standard pattern; standard discount.

Gemini token counting is provided via a count_tokens() method on the SDK, which returns the exact token count for a given content blob without running the model. It is free or extremely cheap. The Google-published tokenizer is also available for fully-local counting in environments where the API call is undesirable.
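
A sketch of counting and streaming together, assuming the google-genai SDK; long_prompt is a placeholder for whatever you are about to send:

from google import genai

client = genai.Client(api_key="...")
long_prompt = "..."  # placeholder: the content you are about to send

# Exact token count without running the model (free or near-free).
count = client.models.count_tokens(model="gemini-2.0-flash", contents=long_prompt)
print(count.total_tokens)

# Streaming: print chunks as they arrive; the last chunk carries usage_metadata.
last_chunk = None
for chunk in client.models.generate_content_stream(
    model="gemini-2.0-flash",
    contents=long_prompt,
):
    print(chunk.text or "", end="", flush=True)
    last_chunk = chunk
print()
print(last_chunk.usage_metadata)  # use this for per-request cost accounting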

The Gemini Tokenizer

The Gemini tokenizer is a SentencePiece-based tokenizer (Chapter 2) — distinct from Claude's BPE and OpenAI's tiktoken. Practical implications:

  • Slightly more efficient on multilingual content than English-trained BPE tokenizers
  • Slightly less efficient on long English prose than tiktoken's o200k_base
  • Different vocabulary, so the same string produces a different token count than Claude or OpenAI

Which brings us to the synthesis concept this chapter has been building toward.

Cross-Vendor Tokenizer Drift

Cross-vendor tokenizer drift is the observation that the same string of text produces different token counts on each vendor's tokenizer — sometimes by 20–30%. This is the single biggest gotcha in cross-vendor cost comparison.

Concrete example: the sentence "Pemba audited the cache hit rate and recommended a 90% reduction." tokenizes to roughly:

  • Claude tokenizer: ~17 tokens
  • tiktoken (cl100k): ~16 tokens
  • tiktoken (o200k): ~14 tokens
  • Gemini tokenizer: ~15 tokens

A 21% spread on a one-line sentence. Multiplied across millions of requests, the per-vendor token-count differences cascade into per-vendor cost differences that are not visible from the rate card alone.

The implication for cross-vendor benchmarking: never compare $/MTok rates directly. Always:

  1. Take a representative sample of your real production traffic
  2. Tokenize it with each vendor's tokenizer
  3. Multiply each tokenization by the vendor's input price
  4. Compare the resulting per-million-real-requests cost — not the headline price

A vendor that looks 20% cheaper per token may actually be 5% more expensive on your traffic shape. Or 40% cheaper, depending. Tokenizer drift is invisible unless you measure for it.
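
Steps 3 and 4 of the procedure above fit in a few lines. The token counts below are stand-ins for what you would measure by running your own traffic sample through each vendor's tokenizer or counting endpoint, and the prices are placeholders for the current rate cards:

# Stand-in measurements: total input tokens for the SAME traffic sample,
# counted with each vendor's own tokenizer (step 2). Replace with real numbers.
measured_tokens = {
    "anthropic": 1_240_000,
    "openai":    1_080_000,
    "gemini":    1_150_000,
}

# Placeholder input prices in $/MTok. Replace with the rates you actually pay.
input_price_per_mtok = {
    "anthropic": 3.00,
    "openai":    2.50,
    "gemini":    1.25,
}

for vendor, tokens in measured_tokens.items():
    cost = tokens / 1_000_000 * input_price_per_mtok[vendor]
    print(f"{vendor:>9}: {tokens:>9,} tokens  ->  ${cost:.2f} for the sample")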

Diagram: Cross-Vendor Token Count Drift

Cross-Vendor Token Count Drift

Type: microsim
sim-id: cross-vendor-token-drift
Library: p5.js
Status: Specified

Purpose: Let learners type or paste text and see the same content tokenized by each vendor's tokenizer side by side, with the resulting token counts and a per-vendor cost estimate.

Bloom Level: Evaluate (L5)
Bloom Verb: assess

Learning objective: Assess which vendor offers the lowest cost for a given content shape by comparing tokenizations across all three vendors.

Canvas layout:
  • Top: Multi-line text input (default: a paragraph mixing English, code, and a Japanese sentence)
  • Middle: Three columns side by side, one per vendor (Claude, OpenAI, Gemini), each showing token chips and a token count
  • Bottom: A bar chart showing per-vendor cost for the input text at each vendor's published input price

Interactive controls:
  • Text area
  • Buttons: "Load: English prose" / "Load: Code snippet" / "Load: Multilingual" / "Load: Long document (~10K tokens)"
  • Dropdown per column: which model tier to use for cost calculation (e.g. Claude Sonnet, GPT-4o, Gemini Pro)

Data Visibility Requirements:
  • Stage 1: Show three side-by-side tokenizations with chips
  • Stage 2: Show the count delta below each column ("Claude: 17 tokens", etc.)
  • Stage 3: Show the per-vendor cost calculation at the bottom
  • Stage 4: Highlight the cheapest vendor in green for the current input

Default state: English prose example, all three columns visible

Implementation: p5.js with responsive width; uses simplified tokenization simulators for each vendor (illustrative, not bit-exact)

Compare on Real Traffic, Not on Rate Cards

Pemba cautioning

The most expensive lesson in vendor selection is "we picked Vendor A because it's $2/MTok cheaper, then our bill came in higher than the alternative." Always tokenize a real sample of your traffic with each vendor's actual tokenizer before deciding. Where did all the tokens go? Sometimes: into a tokenizer mismatch you didn't measure for.

Putting It All Together

You can now build Gemini applications and you have all three vendors mapped. You authenticate against the Google Gemini API through either the Google AI Studio surface (developer-friendly, simple API keys) or Vertex AI (enterprise, GCP IAM, data residency), using the Gemini SDK (google-genai). You select a model from the Gemini model family: Gemini Flash for high-throughput, Gemini Pro for general reasoning, Gemini Ultra for the hardest tasks. You exploit the long context window and the one million context mode when document scale demands it. You use Gemini caching (implicit for free savings, explicit for predictable savings on large prefixes). You add Gemini function calling with Gemini tool config, and the differentiated capabilities Gemini code execution and Gemini grounding. You handle Gemini multimodal input (text, image, audio, video, PDF) with appropriate per-modality token accounting and configure Gemini safety settings for your use case. You stream with Gemini streaming, batch with Gemini batch mode, and count tokens with Gemini token counting and the Gemini tokenizer. And — most importantly across all three vendors — you account for cross-vendor tokenizer drift when comparing costs.

Chapter 7 leaves the API world and enters the harness world: how Claude Code, Codex CLI, and Antigravity accumulate tokens across multi-turn agent sessions.

Quick Self-Check
  1. What's the main practical difference between AI Studio and Vertex AI? AI Studio is developer-friendly with simple API keys; Vertex is enterprise-grade with GCP IAM, data residency, and VPC-SC support.
  2. What does "1M context" enable that 200K does not? Single-shot whole-document or whole-codebase analysis without an external retrieval system.
  3. What's the difference between implicit and explicit Gemini caching? Implicit is automatic with no code changes; explicit creates a named cache resource you reference, with predictable larger savings but added storage cost.
  4. What is cross-vendor tokenizer drift? The same text produces different token counts on each vendor's tokenizer — sometimes by 20–30% — so per-token rate comparisons are misleading without a real-traffic measurement.
  5. What's distinctive about Gemini grounding? First-party Google Search integration as a built-in tool — neither Anthropic nor OpenAI offer comparable native search.

End of Chapter 6

Pemba celebrating

All three vendors mapped: Claude, OpenAI, Gemini. You can read any of their APIs, structure prompts to cache, and compare costs on real traffic. The next chapter shifts gears — from raw API calls to agentic harnesses, where token costs accumulate across whole sessions instead of single requests.

