Multi-Head Attention Visualizer

Run the Multi-Head Attention Visualizer Fullscreen

Edit the MicroSim with the p5.js editor

About This MicroSim

This visualization demonstrates how multi-head attention captures diverse relationship patterns by running multiple attention operations in parallel.

Each attention head can learn to focus on different types of relationships:

  • Position proximity: Nearby tokens attend to each other
  • Semantic similarity: Words with similar meanings connect
  • Syntactic structure: Subject-verb, modifier-noun relationships
  • Long-range dependencies: Connections across the sequence

How to Use

  1. Number of Heads: Adjust the slider to see 1-8 attention heads
  2. Hover: Move over any head to see what pattern type it has learned
  3. Show Concatenation: Toggle to see how head outputs combine

Key Concepts

Why Multiple Heads?

A single attention head might focus on only one type of relationship. Multiple heads allow the model to:

  • Capture syntactic AND semantic relationships
  • Attend to both local and global context
  • Learn diverse, complementary patterns

The Multi-Head Formula

\[\text{MultiHead}(Q,K,V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O\]

where each head is:

\[\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)\]
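To make these formulas concrete, here is a minimal NumPy sketch of multi-head self-attention (\(Q = K = V = X\)). The function names, toy shapes, and random weights are illustrative assumptions, not the visualizer's actual source:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)
    return softmax(scores) @ V

def multi_head(X, W_q, W_k, W_v, W_o):
    """MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O.

    W_q, W_k, W_v are lists of h per-head projections, each (d_model, d_k);
    W_o is the output projection, (h * d_v, d_model).
    """
    heads = [attention(X @ wq, X @ wk, X @ wv)
             for wq, wk, wv in zip(W_q, W_k, W_v)]
    return np.concatenate(heads, axis=-1) @ W_o

# Toy sizes: 4 tokens, d_model = 8, h = 2 heads, so d_k = d_v = 4
rng = np.random.default_rng(0)
d_model, h = 8, 2
d_k = d_model // h
X = rng.standard_normal((4, d_model))
W_q = [rng.standard_normal((d_model, d_k)) for _ in range(h)]
W_k = [rng.standard_normal((d_model, d_k)) for _ in range(h)]
W_v = [rng.standard_normal((d_model, d_k)) for _ in range(h)]
W_o = rng.standard_normal((h * d_k, d_model))

print(multi_head(X, W_q, W_k, W_v, W_o).shape)  # (4, 8)
```

Note that the output has the same shape as the input, which is what allows attention blocks to be stacked.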

Dimension Management

With \(d_{model} = 512\) and \(h = 8\) heads:

  • Each head uses \(d_k = d_v = d_{model}/h = 64\)
  • Total computation is comparable to a single full-dimension head (see the check below)
  • But the model learns 8 distinct, complementary attention patterns instead of one
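As a quick check of this arithmetic, here is a short sketch (assuming the standard sizes quoted above) comparing the projection-parameter counts of 8 narrow heads against one full-width head:

```python
d_model, h = 512, 8
d_k = d_v = d_model // h                    # 64 dimensions per head

# Q, K, V projections for h narrow heads: h * 3 * (d_model * d_k)
multi_head_params = h * 3 * d_model * d_k   # 786,432

# Q, K, V projections for one full-width head: 3 * (d_model * d_model)
single_head_params = 3 * d_model * d_model  # 786,432

assert multi_head_params == single_head_params
print(d_k, multi_head_params)               # 64 786432
```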

Lesson Plan

Learning Objectives:

  • Understand why multiple attention heads improve model expressiveness
  • Visualize how different heads learn different patterns
  • Trace the concatenation and projection flow

Activities:

  1. Compare attention patterns across all 8 heads
  2. Identify which heads capture local vs. global patterns
  3. Explain why the output projection \(W^O\) is necessary

Assessment:

  • Why not just use one head with larger dimension?
  • How does head count affect model capacity vs. computation?
  • What would happen if all heads learned the same pattern?
