Attention Mechanism Step-by-Step
Run the Attention Mechanism Visualizer Fullscreen
Edit the MicroSim with the p5.js editor
About This MicroSim
This step-by-step visualization demonstrates how the attention mechanism works in transformers. Walk through each stage of the computation (a code sketch of these stages follows the list):
- Input: Token embeddings as vectors
- Project Q,K,V: Linear projections create Query, Key, Value matrices
- Compute Scores: Query-Key dot products measure compatibility
- Softmax: Normalize scores to attention weights (probabilities)
- Weighted Sum: Combine Value vectors using attention weights
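The same five stages can be traced in plain NumPy. This is a minimal sketch, not the MicroSim's actual code: the matrix sizes, random projection weights, and variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8              # illustrative sizes (assumed)

# Stage 1 - Input: one embedding vector per token, stacked as rows
X = rng.normal(size=(seq_len, d_model))

# Stage 2 - Project Q, K, V: learned linear maps (random stand-ins here)
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v

# Stage 3 - Compute scores: query-key dot products, scaled by sqrt(d_k)
S = Q @ K.T / np.sqrt(d_k)                   # shape (seq_len, seq_len)

# Stage 4 - Softmax: normalize each row into attention weights
A = np.exp(S - S.max(axis=-1, keepdims=True))
A /= A.sum(axis=-1, keepdims=True)

# Stage 5 - Weighted sum: mix the value vectors using the weights
output = A @ V                               # shape (seq_len, d_k)

print(A.round(2))         # each row sums to 1
print(output.shape)
```

Reading one row of `A` corresponds to choosing a query position in the MicroSim: it shows how strongly that token attends to every position in the sequence.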
How to Use
- Step Slider: Move through the 5 stages of attention computation
- Query Position: Select which token's attention to visualize
- Observe: Watch how attention weights determine which positions to focus on
Key Concepts
The Attention Formula
\[\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V\]
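As a direct translation of the formula, the whole computation fits in one function. This is a sketch under assumed 2-D (single-head, unbatched) shapes; the function name is illustrative.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for 2-D arrays Q, K, V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # QK^T / sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ V                                 # weighted sum of values
```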
Query-Key-Value Intuition
- Query: "What am I looking for?"
- Key: "What do I contain?"
- Value: "What's my actual content?"
High query-key compatibility (a large dot product) means that token's value contributes more to the output.
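A tiny numeric illustration of this idea, with all numbers invented for the example: the key most aligned with the query receives the larger softmax weight, so its value dominates the weighted sum.

```python
import numpy as np

q = np.array([1.0, 0.0])                   # query: "looking for" the (1, 0) direction
K = np.array([[0.9, 0.1],                  # key 0: closely aligned with the query
              [0.0, 1.0]])                 # key 1: nearly orthogonal to it
V = np.array([[10.0, 0.0],                 # value 0
              [ 0.0, 10.0]])               # value 1

scores = K @ q / np.sqrt(2)                # dot-product compatibility per key
w = np.exp(scores) / np.exp(scores).sum()  # softmax over the two positions
print(w.round(3))                          # ~[0.654, 0.346]: key 0 dominates
print((w @ V).round(2))                    # output is pulled toward value 0
```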
Softmax Normalization
Attention weights in each row sum to 1, creating a probability distribution over positions. With the scaled scores \(S = QK^T / \sqrt{d_k}\), each weight is
\[A_{ij} = \frac{\exp(S_{ij})}{\sum_k \exp(S_{ik})}\]
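A row-wise softmax can be sketched as follows; subtracting each row's maximum before exponentiating is a standard numerical-stability trick that cancels in the ratio, and the example scores are made up for illustration.

```python
import numpy as np

def row_softmax(S):
    """Turn a score matrix into attention weights, one row at a time."""
    S = S - S.max(axis=-1, keepdims=True)   # stability shift; cancels in the ratio
    E = np.exp(S)
    return E / E.sum(axis=-1, keepdims=True)

S = np.array([[2.0, 1.0, 0.1],
              [0.5, 0.5, 0.5]])
A = row_softmax(S)
print(A.round(3))              # second row is uniform: equal scores give equal weights
print(A.sum(axis=-1))          # each row sums to 1
```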
Lesson Plan
Learning Objectives:
- Understand the role of Query, Key, and Value matrices
- Trace the flow of information through attention computation
- Interpret attention weights as a soft addressing mechanism
Activities:
- Step through all 5 stages and describe what happens at each
- Change the query position and observe how attention patterns change
- Identify which tokens attend most strongly to each other
Assessment:
- Why do we scale by √d_k in the score computation?
- What does a uniform attention distribution (all weights equal) mean?
- How would masking affect the attention computation?
References
- Attention Is All You Need (Vaswani et al., 2017) - the original transformer paper
- Chapter 11: Generative AI and LLMs
- The Illustrated Transformer (Jay Alammar) - a visual walkthrough of the transformer architecture