Large Language Models and Tokenization
Summary
This chapter introduces large language models (LLMs), the powerful AI systems that enable modern conversational agents to understand and generate human-like text. You will learn about transformer architecture, the attention mechanism that makes LLMs effective, and the critical process of tokenization that converts text into units processable by neural networks. These concepts form the foundation for understanding how chatbots generate intelligent responses.
Concepts Covered
This chapter covers the following 7 concepts from the learning graph:
- Large Language Model
- Transformer Architecture
- Attention Mechanism
- Token
- Tokenization
- Subword Tokenization
- Byte Pair Encoding
Prerequisites
This chapter builds on concepts from:
Introduction: The Revolution in Language Understanding
When you interact with ChatGPT, Claude, or modern chatbots, you're experiencing technology that would have seemed like science fiction just a decade ago. These systems don't just match keywords or follow scripts—they genuinely understand context, generate coherent paragraphs, answer follow-up questions, and adapt their responses to your needs. The technology powering this capability is called large language models (LLMs), neural networks trained on vast amounts of text that have learned remarkably sophisticated patterns of human language.
Understanding LLMs is essential for building modern conversational AI systems. While you likely won't train an LLM from scratch (that requires millions in compute resources), you will use pre-trained LLMs as components in your chatbot architecture. Knowing how they work—from the tokenization that converts text into processable units, through the transformer architecture that processes those tokens, to the attention mechanism that enables contextual understanding—allows you to use these tools effectively, debug issues, and make informed architectural decisions.
This chapter focuses on the foundational concepts: what LLMs are, how the transformer architecture that powers them works, and how tokenization prepares text for processing. These concepts form the bedrock for understanding retrieval-augmented generation (RAG), embeddings, and other advanced techniques covered in later chapters.
What Are Large Language Models?
A large language model is a neural network trained on enormous text corpora (often hundreds of billions or trillions of words) to predict what comes next in a sequence. At its core, an LLM is doing something conceptually simple: given text like "The capital of France is," it predicts the next word should probably be "Paris." However, the scale of training data and model parameters (often hundreds of billions of parameters) allows these models to learn incredibly nuanced patterns about grammar, facts, reasoning, and even writing style.
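To make next-token prediction concrete, here is a minimal sketch using the Hugging Face transformers library with the small GPT-2 model (the model choice is an illustrative assumption; any causal language model exposes the same idea). It prints the model's top five candidates for the token that follows "The capital of France is".

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits          # shape: [batch, seq_len, vocab_size]

# Probability distribution over the *next* token, taken from the last position
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=5)
for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(token_id.item())!r}: {prob:.3f}")
```

Running this shows the model assigning most of its probability mass to a handful of plausible continuations; everything an LLM does is built from this one operation repeated at scale.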
What makes modern LLMs "large"? Three dimensions of scale:
- Parameter count: GPT-3 has 175 billion parameters, current frontier models are reported to have hundreds of billions, and some exceed a trillion parameters. Each parameter is a learned weight in the neural network.
- Training data: Models are trained on datasets containing hundreds of billions to trillions of words scraped from the internet, books, articles, and code repositories.
- Compute resources: Training state-of-the-art LLMs requires thousands of GPUs running for weeks or months, costing millions of dollars in compute time.
The "large" aspect isn't just about bragging rights—larger models demonstrably exhibit emergent capabilities that smaller models lack. GPT-2 (1.5 billion parameters) struggles with multi-step reasoning; GPT-3 (175 billion) can solve many reasoning problems; GPT-4 and Claude Sonnet show even stronger reasoning, planning, and instruction-following capabilities. This scaling phenomenon, where quantitative increases in size lead to qualitative improvements in capability, has driven the recent AI revolution.
For conversational AI applications, LLMs provide several critical capabilities:
- Natural language understanding: Interpreting user questions even when phrased ambiguously or colloquially
- Context retention: Maintaining conversational context across multiple turns
- Knowledge access: Retrieving factual information encoded during training (though with limitations on recency and accuracy)
- Text generation: Producing fluent, contextually appropriate responses
- Instruction following: Adhering to system prompts that define chatbot behavior and personality
However, LLMs also have important limitations you must understand to use them effectively. They have knowledge cutoff dates (training data only goes up to a certain point in time), they can "hallucinate" plausible-sounding but false information, they struggle with precise arithmetic, and they can't access real-time information or private organizational data unless explicitly provided through techniques like RAG (covered in Chapter 8).
Understanding Tokens: The Building Blocks of Language Processing
Before an LLM can process text, it must convert that text into numbers—neural networks operate on numerical tensors, not characters or words. The fundamental unit of text that LLMs work with is called a token. A token might be a whole word, part of a word, a punctuation mark, or even individual characters, depending on the tokenization scheme.
Consider the sentence "Database administrators use backup tools." Different tokenization approaches might split this into tokens differently:
- Word-based tokenization: ["Database", "administrators", "use", "backup", "tools", "."]
- Character-based tokenization: ["D", "a", "t", "a", "b", "a", "s", "e", " ", "a", "d", "m", "i", "n", "i", "s", "t", "r", "a", "t", "o", "r", "s", ...]
- Subword tokenization (typical for modern LLMs): ["Database", " administrators", " use", " backup", " tools", "."]
- More aggressive subword: ["Data", "base", " admin", "istr", "ators", " use", " back", "up", " tools", "."]
Why does this matter? Token choice has major implications:
- Vocabulary size: Word-based tokenization requires huge vocabularies (100,000+ entries) to cover all words including rare ones. Character-based needs tiny vocabularies (~100 characters) but creates very long sequences.
- Sequence length: Character tokenization turns a 10-word sentence into 50+ tokens; subword tokenization typically produces 15-20 tokens. Longer sequences require more computation and memory.
- Out-of-vocabulary handling: Word tokenization struggles with new words, names, or typos. Character and subword approaches handle arbitrary text.
- Semantic granularity: Word tokens preserve semantic units; character tokens lose word boundaries; subword strikes a balance.
Modern LLMs almost universally use subword tokenization, which splits text into frequently-occurring chunks that may be whole words, common prefixes/suffixes, or individual characters for rare sequences. This approach combines the advantages of word and character tokenization: reasonable vocabulary size (30,000-100,000 tokens), manageable sequence lengths, and robust handling of rare words.
Understanding tokens is crucial for practical LLM usage because:
- API pricing is often per token (e.g., $0.01 per 1000 tokens), so knowing that "internationalization" might be 4 tokens while "i18n" is 3 tokens affects cost
- Context windows are measured in tokens (e.g., 8k, 32k, 128k tokens), limiting how much text you can process at once
- Token limits affect both input (how much context you provide) and output (how long responses can be)
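To make the first point concrete, the sketch below uses OpenAI's tiktoken library with the cl100k_base encoding (an assumption; your model's tokenizer may differ), counts the tokens in a sentence, and applies the example rate of $0.01 per 1,000 tokens used above.

```python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")     # one of OpenAI's BPE encodings

text = "Database administrators use backup tools."
token_ids = enc.encode(text)
tokens = [enc.decode([tid]) for tid in token_ids]

print(tokens)                                  # subword pieces, e.g. ['Database', ' administrators', ...]
print(f"{len(text)} characters -> {len(token_ids)} tokens")

# Cost estimate at the chapter's example rate of $0.01 per 1,000 tokens
print(f"Estimated input cost: ${len(token_ids) / 1000 * 0.01:.6f}")
```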
The Tokenization Process
Tokenization is the process of converting raw text strings into sequences of tokens that can be mapped to numerical IDs for neural network processing. For modern LLMs, this is a multi-step pipeline:
- Normalization: Clean and standardize input text (handle Unicode, normalize whitespace, optionally lowercase)
- Pre-tokenization: Split text into rough chunks (often by whitespace and punctuation)
- Subword segmentation: Apply the learned tokenization algorithm (like BPE) to split chunks into tokens
- Mapping to IDs: Convert each token string to its corresponding integer ID in the vocabulary
- Adding special tokens: Insert tokens like [START], [END], or [SEP] that mark boundaries
The process is deterministic and reversible (with some caveats around normalization). Given text, you always get the same token sequence. Given token IDs, you can decode back to (approximately) the original text.
Different LLMs use different tokenizers, which means the same text tokenizes differently across models:
- GPT models (OpenAI): Use Byte Pair Encoding with vocabulary ~50,000-100,000 tokens
- BERT models: Use WordPiece tokenization with vocabulary ~30,000 tokens
- LLaMA models: Use SentencePiece with a BPE model and a vocabulary of ~32,000 tokens
- Claude models: Use Byte Pair Encoding with vocabulary optimized for code and multilingual text
This lack of standardization means you can't directly transfer tokenized data between models—each requires its own tokenization using its specific vocabulary.
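To see this divergence directly, here is a small sketch that runs the same sentence through a GPT-2 (BPE) tokenizer and a BERT (WordPiece) tokenizer via the Hugging Face transformers library; the specific model names are illustrative assumptions.

```python
from transformers import AutoTokenizer

text = "Database administrators use PostgreSQL for backup and recovery."

gpt2 = AutoTokenizer.from_pretrained("gpt2")               # BPE tokenizer
bert = AutoTokenizer.from_pretrained("bert-base-uncased")  # WordPiece tokenizer

print("GPT-2:", gpt2.tokenize(text))   # 'Ġ' marks a leading space
print("BERT :", bert.tokenize(text))   # '##' marks a word continuation

# Same text, different IDs and different counts; BERT also adds [CLS]/[SEP] when encoding
gpt2_ids = gpt2.encode(text)
bert_ids = bert.encode(text)
print(len(gpt2_ids), "GPT-2 tokens vs.", len(bert_ids), "BERT tokens (incl. special tokens)")
print("Round trip:", bert.decode(bert_ids))
```

The token strings, IDs, and counts all differ, which is exactly why tokenized data cannot be transferred between models.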
MicroSim: Interactive Tokenization Explorer
Learning objective: Understand how text is split into tokens and visualize differences between tokenization approaches
Canvas layout (1000x700px): - Top section (1000x200): Text input area and tokenization method selection - Middle section (1000x400): Tokenized output visualization with color-coded tokens - Bottom section (1000x100): Statistics and metrics display
Visual elements: - Large text input box (expandable) - Tokenized text display showing each token as a colored box with the token text inside - Token boundaries clearly visible - Hovering over token shows its ID and position in sequence - Statistics panel showing: * Total characters in input * Total tokens produced * Average characters per token * Vocabulary coverage (% of tokens that are whole words vs. subwords)
Sample input text (default): "The database administrators use PostgreSQL for backup and recovery. They're implementing continuous archiving."
Interactive controls: - Text input: User can type or paste any text - Radio buttons: Select tokenization method * Word-based (split on whitespace/punctuation) * Character-based (one char = one token) * Subword (simulated BPE-like behavior) * GPT-style (approximation of GPT tokenizer) - Checkbox: "Show token IDs" (displays numeric ID under each token) - Checkbox: "Highlight special characters" (shows spaces, punctuation specially) - Slider: Subword aggressiveness (for subword mode: 1=conservative/mostly words, 10=aggressive/more splits)
Example tokenization outputs for default text:
Word-based (15 tokens): ["The", "database", "administrators", "use", "PostgreSQL", "for", "backup", "and", "recovery", ".", "They're", "implementing", "continuous", "archiving", "."]
Character-based (~110 tokens): ["T", "h", "e", " ", "d", "a", "t", "a", "b", "a", "s", "e", ...] (too many to show)
Subword (aggressiveness=5, 21 tokens): ["The", " database", " admin", "istr", "ators", " use", " Post", "gre", "SQL", " for", " backup", " and", " recovery", ".", " They", "'re", " implementing", " continuous", " arch", "iving", "."]
GPT-style (~16 tokens): ["The", " database", " administrators", " use", " PostgreSQL", " for", " backup", " and", " recovery", ".", " They", "'re", " implementing", " continuous", " archiving", "."]
Behavior: - As user types in text box, tokenization updates in real-time - Each token rendered as a colored rounded rectangle - Color scheme: * Whole words (common): Light blue * Subwords (prefixes/suffixes): Light green * Punctuation/special: Light orange * Whitespace tokens: Very light gray with visible space marker - Clicking a token highlights it and shows detailed info: * Token text * Token ID (simulated: 0-49999) * Position in sequence * Character span in original text * Token type (word/subword/punctuation/special) - Statistics update automatically: * Show tokens-to-characters ratio * Compare across tokenization methods (show all 4 counts side-by-side)
Educational features: - Preset example buttons: * "Short sentence" (5-10 tokens) * "Technical jargon" (shows how rare terms split) * "Multilingual" (shows how non-English text tokenizes) * "Code snippet" (shows tokenization of programming code) * "Very long word" (e.g., "internationalization") - Comparison mode: Split screen showing two tokenization methods side-by-side - Token economy calculator: Shows estimated API cost based on token count at $0.01/1k tokens
Educational annotations: - When character-based selected: "Notice: ~110 tokens for just two sentences! Character tokenization creates very long sequences." - When word-based selected: "Word tokenization treats 'They're' as one token, but can't handle 'unknown_word_xyz'" - When subword selected: "Subword balances sequence length and vocabulary coverage. Common words stay whole; rare words split." - When user enters a very long word: "Long/rare words split into subwords: 'inter-national-ization' → ['inter', 'national', 'ization']"
Special demonstration: - "Token boundary impact" button: Shows how changing one character can affect entire tokenization * Before: "The administrators use tools" → ["The", " administrators", " use", " tools"] * After: "The adminstrators use tools" (typo: removed 'i') → ["The", " admin", "str", "ators", " use", " tools"] * Annotation: "Notice how the typo changed tokenization! This affects model processing."
Implementation notes: - Use p5.js for rendering - Implement simplified BPE algorithm for subword tokenization - Use word boundary regex for word tokenization - Character tokenization is straightforward array split - Store pre-defined vocabularies for realistic token ID assignment - Use rect() with rounded corners and text() for token visualization - Color code based on token type detection (heuristics: length, position, characters) - Implement hover tooltips with token details
The tokenization process is largely invisible to end users but critical for developers. When you send a request to an LLM API, the first thing that happens is tokenization. Understanding this process helps you optimize prompts (shorter tokens = lower cost), debug issues (why did the model treat "New York" as 2 tokens or 3?), and architect systems that stay within token limits.
Subword Tokenization and Byte Pair Encoding
Subword tokenization represents the dominant approach in modern NLP, splitting text into units smaller than words but larger than characters. The core idea: frequently-occurring sequences (like common words) should be single tokens, while rare sequences (like uncommon words or names) can be split into smaller pieces that appear more frequently.
The most popular subword tokenization algorithm is Byte Pair Encoding (BPE), originally a data compression technique adapted for NLP. BPE learns which character sequences to merge based on frequency in the training corpus:
BPE Algorithm:
1. Start with a vocabulary containing all individual characters (base vocabulary)
2. Count all adjacent character pairs in the training corpus
3. Merge the most frequent pair into a new token and add it to the vocabulary
4. Repeat steps 2-3 for a fixed number of iterations (e.g., 30,000-50,000 merges)
5. The resulting vocabulary contains characters + learned subword units
Example BPE learning process:
Starting corpus: "low", "lower", "lowest", "newer", "wider"
Initial vocabulary: [l, o, w, e, r, n, i, d, s, t]
Iteration 1: Most frequent pair is "e r" (appears in "lower", "newer", "wider") - Merge "e r" → "er" - Vocabulary: [l, o, w, e, r, n, i, d, s, t, er]
Iteration 2: Most frequent pair is now "l o" (appears in "low", "lower", "lowest") - Merge "l o" → "lo" - Vocabulary: [l, o, w, e, r, n, i, d, s, t, er, lo]
Iteration 3: Most frequent pair is "lo w" - Merge "lo w" → "low" - Vocabulary: [l, o, w, e, r, n, i, d, s, t, er, lo, low]
After many iterations: [l, o, w, e, r, n, i, d, s, t, er, lo, low, lower, est, lowest, new, newer, wid, wider, ...]
Now when tokenizing new text, words that made it into the vocabulary stay whole ("lowest" → ["lowest"]), while unseen words fall back to learned subword pieces ("newest" → ["new", "est"], "widest" → ["wid", "est"]).
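The merge loop is simple enough to implement directly. The following toy sketch (an illustration, not a production tokenizer: it treats every corpus word as equally frequent and breaks frequency ties arbitrarily) learns merges from the same five-word corpus and prints the segmentation it produces.

```python
from collections import Counter

def learn_bpe(corpus, num_merges):
    """Learn BPE merges from a list of words (toy version)."""
    words = [tuple(w) for w in corpus]        # start with character-level symbols
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs across the corpus
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)      # most frequent pair
        merges.append(best)
        merged = best[0] + best[1]
        # Replace every occurrence of the chosen pair with the merged symbol
        new_words = []
        for w in words:
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            new_words.append(tuple(out))
        words = new_words
    return merges, words

corpus = ["low", "lower", "lowest", "newer", "wider"]
merges, segmented = learn_bpe(corpus, num_merges=10)
print("Merges:", merges)
print("Segmented corpus:", segmented)
```

After enough merges, frequent words collapse into single symbols while rarer words remain split into reusable pieces, mirroring the worked example above.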
The beauty of BPE is that it automatically learns useful subword units from the training data:
- Common words like "the", "and", "database" become single tokens
- Common prefixes/suffixes like "un-", "re-", "-ing", "-tion" become tokens
- Rare words split into recognizable pieces: "PostgreSQL" → ["Post", "gre", "SQL"]
- Unknown words can always be represented using character fallback
For multilingual models, BPE learns useful subwords across languages. The token "ation" appears in English words ("demonstration") and French words ("nation"), while related Spanish words split into shared pieces ("nación" → ["naci", "ón"]), enabling some cross-linguistic knowledge transfer.
Diagram: Byte Pair Encoding Merge Process
Purpose: Illustrate how BPE iteratively merges character pairs to build subword vocabulary
Components:
- Initial State (left side):
- Training corpus display showing example words:
- "database" (repeated 100 times in corpus - shown as "database × 100")
- "data" (repeated 80 times)
- "backup" (repeated 90 times)
- "based" (repeated 70 times)
- Character-level tokenization shown:
- "database" → [d, a, t, a, b, a, s, e]
- "data" → [d, a, t, a]
- "backup" → [b, a, c, k, u, p]
- "based" → [b, a, s, e, d]
- Initial vocabulary box (bottom): [a, b, c, d, e, k, p, s, t, u]
- Pair Frequency Analysis (middle section):
- Table showing the most frequent character pairs:

| Pair | Frequency | Source Words |
|------|-----------|--------------|
| "da" | 300 | database(100), data(200) |
| "ta" | 300 | database(100), data(200) |
| "ba" | 260 | database(100), backup(90), based(70) |
| "se" | 170 | database(100), based(70) |

- Highlight the most frequent pair: "da" or "ta" (tied at 300)
- Merge Operation (iteration arrows):
- Iteration 1: Merge "d" + "a" → "da"
  - New vocabulary: [a, b, c, d, e, k, p, s, t, u, da]
  - Updated tokenization:
    - "database" → [da, t, a, b, a, s, e]
    - "data" → [da, t, a]
- Iteration 2: Merge "t" + "a" → "ta"
  - New vocabulary: [..., da, ta]
  - Updated tokenization:
    - "database" → [da, ta, b, a, s, e]
    - "data" → [da, ta]
- Iteration 3: Merge "da" + "ta" (now an adjacent pair!) → "data"
  - New vocabulary: [..., da, ta, data]
  - Updated tokenization:
    - "database" → [data, b, a, s, e]
    - "data" → [data]
- Iteration 4: Merge "b" + "a" → "ba"
- Iteration 5 and beyond: Merge "s" + "e" → "se", then "ba" + "se" → "base", then "data" + "base" → "database", ...
- ...continue
- Final State (right side):
- Learned vocabulary (after N iterations):
- Character tokens: [a, b, c, d, e, k, p, s, t, u]
- Subword tokens: [da, ta, data, ba, se, base, database, back, up, backup, ...]
- Final tokenization examples:
- "database" → [database] (one token!)
- "data" → [data] (one token!)
- "backup" → [backup] (one token!)
- "databases" → [database, s] (unknown suffix splits)
- Visual Flow (arrows):
- Top-to-bottom flow showing progression through iterations
- Each iteration box shows:
- Which pair is being merged
- Updated vocabulary size
- Sample tokenizations after merge
Layout: Left-to-right flow with vertical iteration steps
Visual style: Flowchart with boxes for vocabulary states, tables for frequency analysis, and arrows showing merges
Color scheme: - Characters: Light gray boxes - Subword tokens learned in early iterations: Light blue - Subword tokens learned in later iterations: Darker blue - Complete words that became tokens: Dark green - Arrows showing merges: Orange with merge symbol
Labels: - "Initial Vocabulary (10 characters)" - "Iteration 1: Merge 'da' (freq=180)" - "Iteration 2: Merge 'ta' (freq=180)" - "After N iterations: Vocabulary size = 30,000" - "Common words = single tokens" - "Rare words = split into learned subwords"
Annotations: - "BPE automatically learns useful subwords from corpus statistics" - "Frequency-based merging ensures common patterns become tokens" - "Unknown words can always be represented using character fallback"
Implementation: SVG diagram created with flowchart/diagram tools (draw.io, Lucidchart) or generated programmatically
Variants of BPE exist:
- WordPiece: Used by BERT, similar to BPE but merges based on likelihood rather than frequency
- SentencePiece: Treats input as raw Unicode, doesn't require pre-tokenization, handles any language
- Unigram Language Model: Probabilistic approach that starts with large vocabulary and prunes
For chatbot developers, the key insight is that BPE tokenization is already done for you by the LLM provider. You don't train your own tokenizer. However, understanding BPE helps you:
- Predict how text will tokenize (estimate token counts)
- Understand why certain inputs behave unexpectedly
- Optimize prompts to minimize token usage
- Debug issues where model behavior depends on tokenization boundaries
The Transformer Architecture: The Foundation of Modern LLMs
The transformer architecture, introduced in the 2017 paper "Attention Is All You Need," revolutionized natural language processing and enabled the current generation of large language models. Unlike earlier recurrent neural networks (RNNs) that processed text sequentially word-by-word, transformers process entire sequences in parallel using attention mechanisms—making them both more powerful and more efficient to train.
The transformer architecture consists of several key components working together:
Core Components:
- Input Embedding Layer: Converts token IDs to high-dimensional vectors (typically 768, 1024, or higher dimensions)
- Positional Encoding: Adds information about token position in the sequence (since attention doesn't inherently understand order)
- Multi-Head Self-Attention Layers: Allow each token to attend to all other tokens in the sequence, building contextual understanding
- Feed-Forward Neural Networks: Process each token's representation independently after attention
- Layer Normalization: Stabilizes training by normalizing activations
- Residual Connections: Allow gradients to flow through deep networks effectively
The original transformer had two parts: an encoder (for understanding input) and a decoder (for generating output). Modern LLMs use different variants:
- Encoder-only (like BERT): Good for understanding and classification tasks
- Decoder-only (like GPT, Claude): Good for generation tasks, used for chatbots
- Encoder-decoder (like T5): Good for translation and summarization tasks
Most conversational AI systems use decoder-only transformers because chatbot applications focus on generating responses given conversational context. These models are trained with a "causal" or "autoregressive" approach: predict the next token given all previous tokens.
The architecture allows stacking many layers (GPT-3 has 96 layers, some models exceed 100 layers). Each layer refines the representation of each token based on context from other tokens. Early layers learn simple patterns (syntax, basic word relationships); deeper layers learn complex patterns (reasoning, world knowledge, nuanced semantics).
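The components above compose into a repeating block. Below is a compact PyTorch sketch of one decoder-style block (an illustration under simplifying assumptions: post-layer-norm placement, no dropout, and PyTorch's built-in nn.MultiheadAttention rather than a hand-rolled implementation), showing causal self-attention, the feed-forward expansion, residual connections, and layer normalization.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """One decoder-only transformer block: causal self-attention + feed-forward,
    each wrapped in a residual connection and layer normalization."""
    def __init__(self, d_model=768, n_heads=12, d_ff=3072):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        seq_len = x.size(1)
        # Causal mask: True marks positions a token is NOT allowed to attend to (the future)
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        attn_out, _ = self.attn(x, x, x, attn_mask=mask)
        x = self.norm1(x + attn_out)          # residual connection + layer norm
        x = self.norm2(x + self.ff(x))        # residual connection + layer norm
        return x

# Three tokens, batch of 1, 768-dimensional embeddings
x = torch.randn(1, 3, 768)
block = DecoderBlock()
print(block(x).shape)   # torch.Size([1, 3, 768])
```

A full model stacks many such blocks between the embedding layer and the output projection; the input and output shapes stay the same at every layer, which is what makes deep stacking straightforward.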
Diagram: Transformer Architecture for Language Models
Purpose: Show the flow of information through a decoder-only transformer architecture used in modern LLMs
Components (vertical stack, bottom to top):
- Input Layer (bottom):
- Input text: "The database is"
- Token IDs: [464, 14983, 318]
- Dimension: [sequence_length × 1]
- Token Embedding Layer:
- Lookup table converting token IDs to vectors
- Each token becomes a vector (e.g., 768 dimensions)
- Visual: Show 3 token IDs expanding to 3 dense vectors
- Dimension: [sequence_length × embedding_dim]
- Positional Encoding:
- Add position information to embeddings
- Visual: Position vectors [0], [1], [2] added to token embeddings
- Formula shown: PE(pos, i) = sin/cos based encoding
- Dimension: [sequence_length × embedding_dim]
- Transformer Block 1 (first of N blocks):
4a. Multi-Head Self-Attention: - Show "The" attending to ["The", "database", "is"] - Show "database" attending to ["The", "database", "is"] - Show "is" attending to ["The", "database", "is"] - Multiple attention heads (e.g., 12 heads) shown in parallel - Visual: Arrows from each token to all previous tokens (causal masking) - Dimension: [sequence_length × embedding_dim]
4b. Add & Normalize: - Residual connection (skip connection shown as curved arrow) - Layer normalization
4c. Feed-Forward Network: - Two-layer MLP - Expansion: 768 → 3072 → 768 (typical 4× expansion) - ReLU/GELU activation
4d. Add & Normalize: - Another residual connection - Layer normalization
Output: Refined token representations
- Transformer Block 2 ... Block N:
- Show vertical stack with "..." indicating many layers
- Label: "Repeated N times (e.g., N=96 for GPT-3)"
- Note: "Each layer refines representations"
- Final Layer Normalization:
- Normalize final hidden states
- Dimension: [sequence_length × embedding_dim]
- Output Projection Layer (Language Model Head):
- Linear layer projecting to vocabulary size
- Dimension: [sequence_length × vocab_size]
- Example: 768 → 50,000 (for each token position)
- Softmax & Sampling (top):
- Softmax over vocabulary for last position
- Probability distribution: P("backup"|"The database is") = 0.23, P("offline"|...) = 0.15, ...
- Sample or take argmax to select next token
- Visual: Bar chart of top 5 token probabilities
Visual Flow: - Arrows showing upward flow of information - Highlight the autoregressive property: "is" only attends to "The" and "database" (not future tokens) - Show residual connections as curved arrows bypassing blocks
Detailed callout boxes:
- Self-Attention Detail (expandable):
- Query, Key, Value matrices
- Attention formula: Attention(Q,K,V) = softmax(QK^T / √d_k)V
- Visual matrix multiplication diagram
- Positional Encoding Detail (expandable):
- Why needed: "Attention is order-agnostic without position info"
- Sinusoidal encoding visualization
- Alternative: Learned positional embeddings
- Causal Masking (expandable):
- Attention mask matrix showing which positions can attend to which
- Lower triangular matrix (can only attend to current and previous positions)
- Why: Ensures autoregressive property (no "cheating" by looking ahead)
Color scheme: - Input/Output layers: Light yellow - Embedding layers: Light blue - Attention mechanisms: Green (the key innovation) - Feed-forward networks: Purple - Normalization layers: Light gray - Residual connections: Orange curved arrows
Annotations: - "Parallel processing: All tokens processed simultaneously (unlike RNNs)" - "Self-attention: Each token attends to context from other tokens" - "Deep stacking: GPT-3 uses 96 layers; Claude uses 100+ layers" - "Causal masking: Token N can only see tokens 1..N, not future tokens" - "Output: Probability distribution over next token"
Dimensions shown: - Sequence length: 3 (in example) - Embedding dimension: 768 - Feed-forward hidden: 3072 - Number of heads: 12 - Number of layers: N (e.g., 96) - Vocabulary size: 50,000
Implementation: Create as detailed architecture diagram using draw.io, Lucidchart, or similar tools. Include matrix dimensions at each stage to help understanding.
For chatbot developers, you don't need to implement transformer architecture yourself—you use pre-trained models through APIs or libraries. However, understanding the architecture helps you:
- Understand context windows: The attention mechanism processes all tokens simultaneously, but this requires O(n²) memory and compute in sequence length. This is why models have context limits (8k, 32k, 128k tokens).
- Appreciate why LLMs are expensive: Each layer does massive matrix multiplications. A single forward pass through GPT-3 involves trillions of arithmetic operations.
- Debug behavior: Understanding that models are autoregressive (generate one token at a time) explains why they can get stuck in loops or gradually drift off-topic in long generations.
- Optimize performance: Knowing that longer contexts require quadratic computation explains why you should keep prompts concise.
The Attention Mechanism: Learning What Matters
The attention mechanism is the key innovation that makes transformers powerful. At its core, attention allows the model to dynamically focus on different parts of the input when processing each token. When processing the word "it" in "The database crashed and it needs recovery," attention allows the model to focus on "database" to understand what "it" refers to—even though they're separated by other words.
How Attention Works:
For each token, the attention mechanism computes three vectors:
- Query (Q): "What am I looking for?"
- Key (K): "What do I represent?"
- Value (V): "What information do I contain?"
The attention score between two tokens is computed by:
- Dot product of Query of token A with Key of token B
- Scale by √(dimension) to prevent large values
- Apply softmax to get attention weights (sum to 1)
- Weighted sum of Values using attention weights
Mathematically:
Attention(Q, K, V) = softmax(QK^T / √d_k) V
Where: - Q is the query matrix - K is the key matrix - V is the value matrix - d_k is the dimension of keys (used for scaling)
Multi-Head Attention runs multiple attention operations in parallel (e.g., 12 or 16 heads), each focusing on different aspects of the relationships:
- Head 1 might learn syntactic relationships (subject-verb-object)
- Head 2 might learn coreference (what pronouns refer to)
- Head 3 might learn semantic relationships (related concepts)
- Head 4 might learn positional patterns (nearby words)
Each head learns its own Query, Key, Value projections during training, allowing specialization. The outputs from all heads are concatenated and projected back to the original dimension.
Causal (Masked) Attention for language models ensures that when predicting token N, the model can only attend to tokens 1 through N-1, not future tokens. This is implemented by masking out (setting to -∞ before softmax) attention scores to future positions. Without this masking, the model could "cheat" during training by looking at the answer.
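The steps above translate almost line-for-line into code. This NumPy sketch computes single-head scaled dot-product attention with a causal mask (a simplified illustration: real models use learned Q/K/V projections, multiple heads, and batched tensors).

```python
import numpy as np

def causal_attention(Q, K, V):
    """Scaled dot-product attention with a causal mask.
    Q, K, V: arrays of shape [seq_len, d_k] (single head, single sequence)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                          # [seq_len, seq_len]
    # Causal mask: token i may not attend to tokens j > i (set those scores to -inf)
    seq_len = scores.shape[0]
    mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    # Softmax over each row so the weights for every query position sum to 1
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_k = 4, 8
Q, K, V = (rng.standard_normal((seq_len, d_k)) for _ in range(3))
output, weights = causal_attention(Q, K, V)
print(np.round(weights, 2))   # lower-triangular: no attention to future tokens
```

The printed weight matrix is lower-triangular, which is the causal mask in action: each row is a query token's attention distribution over itself and earlier tokens only.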
MicroSim: Attention Mechanism Visualizer
Learning objective: Visualize how attention weights distribute across tokens and understand multi-head attention
Canvas layout (1200x800px): - Top section (1200x150): Input sentence with selectable tokens - Middle section (1200x500): Attention visualization matrix and head selector - Bottom section (1200x150): Attention score details and controls
Visual elements: - Input sentence displayed with each token in a box - Attention heatmap showing attention weights from each token to all others - Multiple attention heads (selectable tabs or dropdown) - Attention weight values displayed on hover - Color gradient from white (low attention) to dark blue (high attention)
Sample input sentence: "The database administrator restored the backup because the system crashed"
Tokens (10 tokens): ["The", " database", " administrator", " restored", " the", " backup", " because", " the", " system", " crashed"]
Interactive controls: - Click any token to see its attention distribution - Radio buttons: Select attention head (Head 1, Head 2, ..., Head 12, or "Average All Heads") - Checkbox: "Show causal mask" (grays out future token attention) - Slider: "Attention temperature" (sharpens or smooths attention distribution) - Button: Preset sentences: * "Simple subject-verb-object" * "Pronoun resolution example" * "Long-distance dependency" * "Complex nested sentence"
Attention visualization modes:
- Matrix View (default):
- 10×10 grid (for 10-token sentence)
- Rows: Query tokens (which token is attending)
- Columns: Key tokens (which token is being attended to)
- Cell color intensity: Attention weight (0=white, 1=dark blue)
- Hover over cell: Show exact attention score
- Row sums to 1.0 (softmax normalization)
- Arc Diagram View:
- Sentence displayed horizontally
- Curved arcs connecting tokens
- Arc thickness proportional to attention weight
- Selected token shows all its outgoing attention arcs
- Color: Blue arcs for strong attention (>0.2), gray for weak
- Attention Flow Animation:
- Animated particles flowing from query token to key tokens
- Particle count proportional to attention weight
- Helps visualize "where attention flows"
Example attention patterns to demonstrate:
Head 1 (Syntactic head - learns subject-verb relationships): - "administrator" attends strongly to "The" and "database" (its modifiers) - "restored" attends strongly to "administrator" (subject) - "crashed" attends strongly to "system" (subject)
Head 2 (Coreference head - learns pronoun resolution): - "the" (second occurrence, before "backup") attends to "restored" (verb determining definiteness) - "the" (third occurrence, before "system") attends to "because" and "crashed" (determining which system)
Head 3 (Positional head - learns nearby word relationships): - Each token attends strongly to immediately adjacent tokens - Smooth decay in attention with distance
Head 4 (Semantic head - learns meaning relationships): - "backup" attends to "database", "restored" (semantically related) - "crashed" attends to "system", "database" (failure context) - "because" attends to both clauses (causal relationship)
Specific demonstration for selected token "restored" (index 3):
Attention distribution (Head 1): - "The" (index 0): 0.05 - "database" (index 1): 0.15 - "administrator" (index 2): 0.45 (strong - subject of verb) - "restored" (index 3): 0.10 (self-attention) - "the" (index 4): 0.08 - "backup" (index 5): 0.12 (object of verb) - "because" (index 6): 0.05 - (indices 7-9 masked to 0 if causal mask enabled)
Visual display: - Bar chart showing attention weights for selected token - Heatmap row highlighted for selected token - Top-3 attended tokens highlighted in sentence
Educational features:
- Causal Mask Demonstration:
- Toggle "Show causal mask" on/off
- When enabled, grays out upper-right triangle of matrix
- Annotation: "Causal masking prevents attending to future tokens during training"
- Show how this affects attention distribution (attention redistributes to available tokens)
- Multi-Head Comparison:
- Side-by-side view of 2-3 attention heads for same query token
- Highlight how different heads learn different patterns
- Annotation: "Head 1 focuses on syntax, Head 2 on semantics, Head 3 on position"
- Temperature Effect:
- Slider adjusts softmax temperature
- Low temp (<1.0): Sharper attention (focuses on few tokens)
- High temp (>1.0): Smoother attention (distributes more evenly)
- Formula shown: softmax(scores / temperature)
- Attention Score Calculation Display:
- When token clicked, show step-by-step calculation:
```
Token: "restored" (position 3)
Step 1: Compute the Query vector Q[3] for this head (64-dim per head, shown as [0.23, -0.45, ...])
Step 2: Compute dot products with all Key vectors K[0]...K[9]
  Q[3] · K[0] = 12.4
  Q[3] · K[1] = 18.7
  Q[3] · K[2] = 45.2  (highest - "administrator")
  ...
Step 3: Scale by √d_k = √64 = 8
  Scores: [1.55, 2.34, 5.65, ...]
Step 4: Apply softmax → [0.05, 0.15, 0.45, ...]
Step 5: Weighted sum of Values using attention weights
```
Behavior: - Real-time updates as controls change - Smooth transitions between attention heads - Tooltips explaining each component - Responsive highlighting when hovering over tokens or attention cells
Implementation notes: - Use p5.js for rendering - Pre-compute realistic attention patterns for demo heads (don't need real transformer) - Implement attention score calculation with simplified Q, K, V vectors - Use color interpolation (lerp) for heatmap gradient - Draw arcs using bezier() for arc diagram view - Implement softmax function for attention weight calculation - Store attention patterns for multiple heads and sentence examples
Understanding attention is crucial for working with LLMs because:
- Context limits exist: Attention requires O(n²) memory/compute, limiting how many tokens models can process
- Long-range dependencies work: Unlike RNNs that struggle with distant relationships, attention can connect tokens regardless of distance
- Interpretability: Attention weights can sometimes (though not always) reveal what the model is "focusing on"
- Prompt design matters: The model attends to your entire prompt when generating each token, so prompt structure affects output
Putting It All Together: From Text to Intelligence
Modern conversational AI systems combine tokenization, transformer architecture, and attention mechanisms into a complete pipeline:
1. User input: "How do I restore a PostgreSQL database from backup?"
2. Tokenization: Convert to tokens → ["How", " do", " I", " restore", " a", " Post", "gre", "SQL", " database", " from", " backup", "?"] (12 tokens)
3. Embedding: Each token → 768-dimensional vector
4. Positional encoding: Add position information to embeddings
5. Transformer layers (e.g., 96 layers): Each token's representation is refined by attending to all previous tokens and passing through feed-forward networks
6. Output layer: Project final hidden state to vocabulary size, producing probability distribution over next tokens
7. Sampling/Generation: Select next token (e.g., "To"), append to sequence, repeat steps 3-7 until complete response generated
8. Detokenization: Convert token IDs back to text for display to user
This process happens for every token generated. A 200-token response requires 200 forward passes through the entire transformer architecture. This is why:
- Latency varies with response length: Longer responses take longer to generate (roughly linear in output length)
- Streaming is possible: Models can output tokens as they're generated rather than waiting for the complete response
- Costs scale with tokens: Both input (context) and output (generation) tokens consume compute
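Here is a minimal sketch of that token-by-token loop using GPT-2 via Hugging Face transformers (the model choice is an assumption, and greedy argmax decoding is used for simplicity; production systems add sampling strategies and key-value caching to avoid recomputing earlier positions). It also shows why streaming is natural: each token can be printed as soon as it is chosen.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "How do I restore a PostgreSQL database from backup?"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

for _ in range(20):                                    # generate 20 tokens
    with torch.no_grad():
        logits = model(input_ids).logits               # one full forward pass per generated token
    next_id = logits[0, -1].argmax()                   # greedy: most likely next token
    input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=1)
    print(tokenizer.decode(next_id.item()), end="", flush=True)   # stream tokens as they arrive
print()
```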
For building conversational AI applications:
- Use pre-trained LLMs: Training from scratch costs millions and requires massive datasets; use models from OpenAI, Anthropic, Google, Meta, etc.
- Fine-tune when needed: For specialized domains, fine-tuning pre-trained models on your data can improve performance
- Combine with retrieval: LLMs have knowledge limits; RAG (Chapter 8) combines LLM generation with information retrieval from your knowledge base
- Monitor token usage: Both for cost management and to stay within context windows
- Understand limitations: LLMs can hallucinate, have knowledge cutoffs, and struggle with precise arithmetic or recent events
Key Takeaways
Large language models and tokenization form the foundation of modern conversational AI:
- Large Language Models (LLMs) are neural networks with billions of parameters trained on massive text corpora to predict next tokens, exhibiting emergent capabilities like reasoning and instruction-following at scale
- Tokens are the fundamental units of text that LLMs process, typically subwords that balance vocabulary size and sequence length
- Tokenization converts raw text into token sequences through normalization, segmentation, and mapping to vocabulary IDs—a process that's model-specific and affects costs and context limits
- Subword tokenization splits text into frequently-occurring chunks (whole words for common terms, pieces for rare terms), handling arbitrary text while maintaining reasonable vocabulary size
- Byte Pair Encoding (BPE) is the dominant subword tokenization algorithm, iteratively merging frequent character pairs to learn useful subword units from training data
- Transformer architecture processes all tokens in parallel using self-attention and feed-forward layers stacked in many layers (often 50-100+), enabling powerful context understanding
- Attention mechanism allows each token to dynamically focus on relevant context from other tokens by computing query-key-value interactions and softmax-weighted combinations
- Multi-head attention runs multiple attention operations in parallel, each learning different types of relationships (syntactic, semantic, positional)
- Causal masking ensures autoregressive generation by preventing tokens from attending to future positions
- Modern chatbots use decoder-only transformers that generate one token at a time, with each token attending to all previous context
Understanding these concepts enables you to effectively use LLM APIs, optimize prompts and costs, debug unexpected behavior, and architect systems that combine LLMs with retrieval and other components covered in upcoming chapters.