Transformer Block Visualizer
Run the Transformer Block Visualizer Fullscreen
Edit the MicroSim with the p5.js editor
About This MicroSim
This visualization shows the architecture of a transformer neural network, the foundation of modern large language models like GPT and BERT.
Each transformer block contains:
- Layer Normalization - Stabilizes training
- Multi-Head Attention - Models relationships between positions
- Feed Forward Network - Applies position-wise transformations
- Residual Connections - Enable gradient flow through deep networks
How to Use
- Number of Blocks: Adjust to see 1-6 stacked transformer blocks
- Show Dimensions: Toggle to see tensor shapes at each stage
- Highlight Residuals: Toggle to emphasize residual connection paths
Key Concepts
Residual Connections
Residual connections add the input directly to the output:
\[\text{output} = x + \text{SubLayer}(x)\]
Benefits:

- Enable gradient flow through very deep networks
- Allow layers to learn "refinements" rather than complete transformations
- Stabilize training
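As a rough illustration of the identity above, here is a minimal NumPy sketch in which the sublayer is a stand-in linear map, not the MicroSim's actual attention or feed-forward code:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8

def sublayer(x):
    # Placeholder for attention or the feed-forward network: any map
    # from (seq_len, d_model) to (seq_len, d_model) works here.
    W = rng.standard_normal((d_model, d_model)) * 0.02
    return x @ W

x = rng.standard_normal((4, d_model))   # (seq_len, d_model)
output = x + sublayer(x)                 # residual: input added to sublayer output
print(output.shape)                      # (4, 8) -- shape is preserved
```

Because the input term passes through unchanged, the gradient always has an identity path back to earlier layers, which is what keeps very deep stacks trainable.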
Layer Normalization
Normalizes across features (not batch):
\[\text{LayerNorm}(x) = \gamma \cdot \frac{x - \mu}{\sigma} + \beta\]
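A NumPy sketch of this formula, assuming the mean and standard deviation are taken over the feature (last) axis; the small eps added to the denominator is a numerical-stability detail not shown in the equation:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize over the feature axis: each position is normalized
    # independently of the rest of the batch (unlike batch norm).
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return gamma * (x - mu) / (sigma + eps) + beta

d_model = 8
x = np.random.randn(2, 4, d_model)                     # (batch, seq_len, d_model)
y = layer_norm(x, np.ones(d_model), np.zeros(d_model))
print(y.mean(axis=-1))                                 # ~0 at every position
print(y.std(axis=-1))                                  # ~1 at every position
```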
Transformer Block Structure
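The listing below is a minimal PyTorch-style sketch of one pre-norm transformer block, the ordering this MicroSim draws; the module sizes, head count, and GELU activation are illustrative assumptions rather than the visualizer's own code:

```python
import torch
import torch.nn as nn

class PreNormTransformerBlock(nn.Module):
    """One pre-norm block: x + Attention(LN(x)), then x + FFN(LN(x))."""
    def __init__(self, d_model=64, n_heads=4, d_ff=256):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):                    # x: (batch, seq_len, d_model)
        h = self.norm1(x)                    # layer norm before attention
        attn_out, _ = self.attn(h, h, h)     # multi-head self-attention
        x = x + attn_out                     # residual connection 1
        x = x + self.ffn(self.norm2(x))      # residual connection 2
        return x                             # same shape as the input

# Stacking blocks, as the "Number of Blocks" slider does:
blocks = nn.Sequential(*[PreNormTransformerBlock() for _ in range(3)])
x = torch.randn(2, 10, 64)                   # (batch, seq_len, d_model)
print(blocks(x).shape)                       # torch.Size([2, 10, 64])
```

Note that the tensor shape (batch, seq_len, d_model) is unchanged from input to output, which is what allows blocks to be stacked.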
Lesson Plan
Learning Objectives:
- Understand the structure of a transformer block
- Explain the role of residual connections
- Trace data flow through stacked transformer layers
Activities:
- Count total operations per block and multiply by block count
- Explain why residual connections help with deep networks
- Compare pre-norm vs post-norm architectures (this shows pre-norm)
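For the pre-norm vs post-norm comparison, the only difference is where layer normalization sits relative to the residual addition. A schematic sketch, where attn, ffn, norm1, and norm2 are assumed callables rather than the MicroSim's code:

```python
# Pre-norm (what this MicroSim shows): normalize before each sublayer.
def pre_norm_block(x, attn, ffn, norm1, norm2):
    x = x + attn(norm1(x))
    x = x + ffn(norm2(x))
    return x

# Post-norm (original "Attention Is All You Need" layout): normalize after the residual add.
def post_norm_block(x, attn, ffn, norm1, norm2):
    x = norm1(x + attn(x))
    x = norm2(x + ffn(x))
    return x
```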
Assessment:
- What would happen without residual connections in a 12-layer model?
- Why use layer norm instead of batch norm in transformers?
- How do dimensions change (or not change) through a transformer block?
References
- Attention Is All You Need - Original transformer paper
- Chapter 11: Generative AI and LLMs
- The Annotated Transformer