
Transformer Block Visualizer

Run the Transformer Block Visualizer Fullscreen

Edit the MicroSim with the p5.js editor

About This MicroSim

This visualization shows the architecture of a transformer neural network, the foundation of modern large language models like GPT and BERT.

Each transformer block contains:

  1. Layer Normalization - Stabilizes training
  2. Multi-Head Attention - Models relationships between positions
  3. Feed Forward Network - Applies position-wise transformations
  4. Residual Connections - Enable gradient flow through deep networks
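
These four components map directly onto standard neural-network modules. Below is a minimal, hypothetical PyTorch sketch of each piece (the module choices and sizes such as d_model = 64 are assumptions for illustration, not taken from the MicroSim); the residual connections appear later as plain additions in the forward pass.

```python
import torch.nn as nn

d_model, n_heads = 64, 4   # assumed example sizes, not from the MicroSim

# 1. Layer Normalization - stabilizes training
norm = nn.LayerNorm(d_model)

# 2. Multi-Head Attention - models relationships between positions
attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

# 3. Feed Forward Network - the same small MLP applied at every position
ffn = nn.Sequential(
    nn.Linear(d_model, 4 * d_model),
    nn.GELU(),
    nn.Linear(4 * d_model, d_model),
)

# 4. Residual Connections - not modules at all; they are the "x + SubLayer(x)"
#    additions written out in the forward pass (see the block sketch below)
```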

How to Use

  1. Number of Blocks: Adjust to see 1-6 stacked transformer blocks
  2. Show Dimensions: Toggle to see tensor shapes at each stage
  3. Highlight Residuals: Toggle to emphasize residual connection paths

Key Concepts

Residual Connections

Residual connections add the input directly to the output:

\[\text{output} = x + \text{SubLayer}(x)\]

In the pre-norm layout shown here, the sublayer operates on the normalized input, so each stage computes \(x + \text{SubLayer}(\text{LayerNorm}(x))\).

Benefits:

  • Enable gradient flow through very deep networks
  • Allow layers to learn "refinements" rather than complete transformations
  • Stabilize training
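
A minimal sketch of the first benefit (assuming PyTorch; the linear layer is just a stand-in for attention or the feed-forward network). The residual add gives gradients an identity path back to the input, regardless of what the sublayer does:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(2, 5, 64, requires_grad=True)   # (batch, seq, d_model), sizes assumed
sublayer = nn.Linear(64, 64)                    # stand-in for attention or the FFN

out = x + sublayer(x)       # residual connection: output = x + SubLayer(x)
out.sum().backward()

# The identity branch contributes directly to d(out)/d(x), so gradients reach x
# even when the sublayer's own gradient is tiny.
print(x.grad.abs().mean() > 0)   # tensor(True)
```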

Layer Normalization

Layer normalization normalizes each token across the feature dimension (not across the batch), using that token's mean \(\mu\) and standard deviation \(\sigma\):

\[\text{LayerNorm}(x) = \gamma \cdot \frac{x - \mu}{\sigma} + \beta\]
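
A minimal numerical sketch (assuming PyTorch; standard implementations also add a small epsilon inside the square root for stability, which the formula above omits):

```python
import torch
import torch.nn.functional as F

x = torch.randn(2, 5, 64)                  # (batch, seq, features), sizes assumed
gamma, beta, eps = torch.ones(64), torch.zeros(64), 1e-5

mu = x.mean(dim=-1, keepdim=True)          # per-token mean over the feature dimension
var = x.var(dim=-1, keepdim=True, unbiased=False)
manual = gamma * (x - mu) / torch.sqrt(var + eps) + beta   # sigma = sqrt(var)

reference = F.layer_norm(x, (64,), gamma, beta, eps)       # library implementation
print(torch.allclose(manual, reference, atol=1e-5))        # True
```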

Transformer Block Structure

```
Input x
    │
    ├──────────────────┐
    ▼                  │
LayerNorm              │
    │                  │
MultiHeadAttn          │
    │                  │
    ▼                  │
    + ◄────────────────┘  (residual)
    │
    ├──────────────────┐
    ▼                  │
LayerNorm              │
    │                  │
FeedForward            │
    │                  │
    ▼                  │
    + ◄────────────────┘  (residual)
    │
Output
```
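
The diagram maps almost line for line onto a forward pass. The following is a hedged PyTorch sketch of one pre-norm block (the class name, head count, and layer sizes are assumptions for illustration, not the MicroSim's code); notice that the output shape matches the input shape, which is what allows blocks to be stacked.

```python
import torch
import torch.nn as nn

class PreNormTransformerBlock(nn.Module):
    """One pre-norm transformer block, mirroring the diagram above."""

    def __init__(self, d_model=64, n_heads=4, d_ff=256):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):
        # x -> LayerNorm -> MultiHeadAttn -> add back to x (first residual)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out
        # x -> LayerNorm -> FeedForward -> add back to x (second residual)
        x = x + self.ffn(self.norm2(x))
        return x

# Stack a few blocks; shapes are preserved end to end
blocks = nn.Sequential(*[PreNormTransformerBlock() for _ in range(3)])
tokens = torch.randn(2, 10, 64)       # (batch, seq_len, d_model)
print(blocks(tokens).shape)           # torch.Size([2, 10, 64])
```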

Lesson Plan

Learning Objectives:

  • Understand the structure of a transformer block
  • Explain the role of residual connections
  • Trace data flow through stacked transformer layers

Activities:

  1. Count total operations per block and multiply by block count
  2. Explain why residual connections help with deep networks
  3. Compare pre-norm vs post-norm architectures (this MicroSim shows pre-norm; see the sketch below)
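
For the pre-norm vs post-norm comparison, the only difference is where normalization sits relative to the residual add. A minimal, hypothetical sketch (sublayer stands in for either attention or the feed-forward network):

```python
def pre_norm_step(x, sublayer, norm):
    # Pre-norm (as in this MicroSim): normalize first; the residual path stays untouched
    return x + sublayer(norm(x))

def post_norm_step(x, sublayer, norm):
    # Post-norm (original Transformer paper): apply the sublayer, then normalize the sum
    return norm(x + sublayer(x))
```

Keeping the residual path free of normalization is generally credited with making deep pre-norm stacks easier to train.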

Assessment:

  • What would happen without residual connections in a 12-layer model?
  • Why use layer norm instead of batch norm in transformers?
  • How do dimensions change (or not change) through a transformer block?
