Normalization Comparison
Run the Normalization Comparison Fullscreen
Edit the MicroSim with the p5.js editor
About This MicroSim
This visualization clarifies the difference between batch normalization and layer normalization by showing exactly which dimensions of a tensor each technique normalizes over.
Batch Normalization
Normalizes across the batch dimension for each channel:
\(\hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}\)
where \(\mu_B\) and \(\sigma_B^2\) are computed over all samples in the batch (and, for convolutional feature maps, over the spatial positions as well), separately for each channel.
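A minimal NumPy sketch of this reduction (the tensor shape, variable names, and epsilon value below are illustrative, not taken from the MicroSim's code):

```python
import numpy as np

# Toy activation tensor: batch of 4 samples, 3 channels, 8x8 feature maps
x = np.random.randn(4, 3, 8, 8)
eps = 1e-5

# Batch norm statistics: average over the batch and spatial axes (0, 2, 3),
# giving one mean/variance pair per channel
mu_B = x.mean(axis=(0, 2, 3), keepdims=True)    # shape (1, 3, 1, 1)
var_B = x.var(axis=(0, 2, 3), keepdims=True)    # shape (1, 3, 1, 1)
x_hat = (x - mu_B) / np.sqrt(var_B + eps)       # same shape as x
```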
Layer Normalization
Normalizes across the feature dimension for each sample:
\(\hat{x} = \frac{x - \mu_L}{\sqrt{\sigma_L^2 + \epsilon}}\)
where \(\mu_L\) and \(\sigma_L^2\) are computed over all features in a single sample.
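The corresponding sketch for layer normalization, on the same illustrative (B, C, H, W) tensor; each sample is normalized entirely by its own statistics, so no other sample in the batch is involved:

```python
import numpy as np

x = np.random.randn(4, 3, 8, 8)   # same toy (B, C, H, W) tensor
eps = 1e-5

# Layer norm statistics: average over all features of each sample (axes 1, 2, 3),
# giving one mean/variance pair per sample
mu_L = x.mean(axis=(1, 2, 3), keepdims=True)    # shape (4, 1, 1, 1)
var_L = x.var(axis=(1, 2, 3), keepdims=True)    # shape (4, 1, 1, 1)
x_hat = (x - mu_L) / np.sqrt(var_L + eps)       # same shape as x
```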
Key Differences
| Aspect | Batch Norm | Layer Norm |
|---|---|---|
| Normalizes | Per channel | Per sample |
| Batch dependency | Yes | No |
| Train vs. inference | Different (uses running statistics at inference) | Same |
| Best for | CNNs, large batches | Transformers, RNNs |
| Batch size 1 | Unreliable statistics | Works fine |
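The batch-size-1 row can be made concrete with a small sketch (assuming a 2-D (batch, features) activation, as in a fully connected layer): with a single sample, batch norm's per-feature variance over the batch axis is exactly zero, so the normalized output collapses to zero, while layer norm's per-sample statistics are unaffected.

```python
import numpy as np

# Fully connected activations: one sample, 5 features
x = np.random.randn(1, 5)
eps = 1e-5

# Batch norm statistics collapse when B = 1: the mean equals the sample
# itself and the variance is exactly zero, so every output is ~0
mu_B = x.mean(axis=0, keepdims=True)
var_B = x.var(axis=0, keepdims=True)            # all zeros
print((x - mu_B) / np.sqrt(var_B + eps))        # ~[[0. 0. 0. 0. 0.]]

# Layer norm statistics over the feature axis still make sense
mu_L = x.mean(axis=1, keepdims=True)
var_L = x.var(axis=1, keepdims=True)
print((x - mu_L) / np.sqrt(var_L + eps))        # meaningful normalized values
```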
Interactive Features
- View Selector: Compare both techniques or focus on one
- Animated Highlight: Shows which cells are normalized together
- Properties Panel: Details about each technique
Lesson Plan
Learning Objectives
Students will be able to:
- Distinguish which tensor dimensions batch norm and layer norm operate over
- Explain why batch normalization needs different behavior during training vs inference
- Choose the appropriate normalization for different architectures
- Describe the role of the learnable parameters γ and β (see the sketch after this list)
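Continuing the NumPy sketches above, γ and β are a learnable per-channel scale and shift applied after normalization, so the network can restore any mean and variance that turn out to be useful. The shapes and identity initialization below are the conventional choice, shown for illustration:

```python
import numpy as np

x = np.random.randn(4, 3, 8, 8)
eps = 1e-5

# Normalize (batch-norm style, per channel)
mu = x.mean(axis=(0, 2, 3), keepdims=True)
var = x.var(axis=(0, 2, 3), keepdims=True)
x_hat = (x - mu) / np.sqrt(var + eps)

# Learnable per-channel scale (gamma) and shift (beta), initialized to the
# identity so the layer starts out as a pure normalization
gamma = np.ones((1, 3, 1, 1))
beta = np.zeros((1, 3, 1, 1))
y = gamma * x_hat + beta
```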
Suggested Activities
- Dimension Tracing: For a tensor of shape (B, C, H, W), identify which values batch norm averages together
- Architecture Matching: Why do transformers prefer layer norm while CNNs use batch norm?
- Small Batch Problem: What happens to batch norm statistics with batch size 1?
- Running Statistics: Why does batch norm track a running mean and variance during training? (A sketch after this list illustrates the mechanism.)
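For the running-statistics activity, the sketch below shows the usual exponential-moving-average scheme in NumPy (the momentum value and shapes are illustrative): batch statistics are used and accumulated during training, and the frozen running statistics are used at inference, which is why batch norm behaves differently in the two modes.

```python
import numpy as np

eps, momentum = 1e-5, 0.1               # momentum value is illustrative
running_mean = np.zeros((1, 3, 1, 1))
running_var = np.ones((1, 3, 1, 1))

# Training: normalize with the current batch's statistics and keep an
# exponential moving average of them for later use
for step in range(100):
    x = np.random.randn(4, 3, 8, 8)
    mu = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + eps)
    running_mean = (1 - momentum) * running_mean + momentum * mu
    running_var = (1 - momentum) * running_var + momentum * var

# Inference: a single example is normalized with the frozen running
# statistics, so the output no longer depends on the rest of the batch
x_test = np.random.randn(1, 3, 8, 8)
x_hat_test = (x_test - running_mean) / np.sqrt(running_var + eps)
```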
Discussion Questions
- Why can't batch normalization be used effectively with batch size 1?
- How do the learnable parameters γ and β help after normalization?
- Why is layer normalization independent of batch size?
- When might you use instance normalization or group normalization? (See the sketch after this list for the axes each one reduces over.)
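For the last question, this sketch (with an illustrative shape and group count) shows the axes instance normalization and group normalization reduce over, so they can be compared directly with the batch and layer norm sketches above:

```python
import numpy as np

x = np.random.randn(4, 8, 16, 16)    # (B, C, H, W); C divisible by the group count
eps, num_groups = 1e-5, 4            # illustrative group count

# Instance norm: statistics per sample AND per channel (spatial axes only)
mu_in = x.mean(axis=(2, 3), keepdims=True)          # shape (4, 8, 1, 1)
x_in = (x - mu_in) / np.sqrt(x.var(axis=(2, 3), keepdims=True) + eps)

# Group norm: split channels into groups, then normalize each sample's group
# over its channels and spatial positions
g = x.reshape(4, num_groups, 8 // num_groups, 16, 16)
mu_gn = g.mean(axis=(2, 3, 4), keepdims=True)       # shape (4, 4, 1, 1, 1)
x_gn = ((g - mu_gn) / np.sqrt(g.var(axis=(2, 3, 4), keepdims=True) + eps)).reshape(x.shape)
```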
References
- Ioffe & Szegedy (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
- Ba, Kiros & Hinton (2016). Layer Normalization
- Wu & He (2018). Group Normalization