Normalization Comparison
Run the Normalization Comparison Fullscreen
Edit the MicroSim with the p5.js editor
About This MicroSim
This visualization clarifies the difference between batch normalization and layer normalization by showing exactly which dimensions of a tensor each technique normalizes over.
Batch Normalization
Normalizes across the batch dimension for each channel:
\(\hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}\)
where \(\mu_B\) and \(\sigma_B^2\) are computed over all samples in the batch (and, for convolutional feature maps, over the spatial positions as well), separately for each channel.
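A minimal NumPy sketch of this reduction (the tensor shape, variable names, and epsilon value below are illustrative, not taken from the MicroSim's code):

```python
import numpy as np

# Toy activation tensor: batch of 4 samples, 3 channels, 8x8 feature maps
x = np.random.randn(4, 3, 8, 8)
eps = 1e-5

# Batch norm statistics: average over the batch and spatial axes (0, 2, 3),
# giving one mean/variance pair per channel
mu_B = x.mean(axis=(0, 2, 3), keepdims=True)    # shape (1, 3, 1, 1)
var_B = x.var(axis=(0, 2, 3), keepdims=True)    # shape (1, 3, 1, 1)
x_hat = (x - mu_B) / np.sqrt(var_B + eps)       # same shape as x
```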
Layer Normalization
Normalizes across the feature dimension for each sample:
\(\hat{x} = \frac{x - \mu_L}{\sqrt{\sigma_L^2 + \epsilon}}\)
where \(\mu_L\) and \(\sigma_L^2\) are computed over all features in a single sample.
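The corresponding sketch for layer normalization, on the same illustrative (B, C, H, W) tensor; each sample is normalized entirely by its own statistics, so no other sample in the batch is involved:

```python
import numpy as np

x = np.random.randn(4, 3, 8, 8)   # same toy (B, C, H, W) tensor
eps = 1e-5

# Layer norm statistics: average over all features of each sample (axes 1, 2, 3),
# giving one mean/variance pair per sample
mu_L = x.mean(axis=(1, 2, 3), keepdims=True)    # shape (4, 1, 1, 1)
var_L = x.var(axis=(1, 2, 3), keepdims=True)    # shape (4, 1, 1, 1)
x_hat = (x - mu_L) / np.sqrt(var_L + eps)       # same shape as x
```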
Key Differences
| Aspect | Batch Norm | Layer Norm |
|---|---|---|
| Normalizes | Per channel | Per sample |
| Batch dependency | Yes | No |
| Train vs. inference | Different (uses running statistics at inference) | Same |
| Best for | CNNs, large batches | Transformers, RNNs |
| Batch size 1 | Unreliable statistics | Works fine |
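The batch-size-1 row can be made concrete with a small sketch (assuming a 2-D (batch, features) activation, as in a fully connected layer): with a single sample, batch norm's per-feature variance over the batch axis is exactly zero, so the normalized output collapses to zero, while layer norm's per-sample statistics are unaffected.

```python
import numpy as np

# Fully connected activations: one sample, 5 features
x = np.random.randn(1, 5)
eps = 1e-5

# Batch norm statistics collapse when B = 1: the mean equals the sample
# itself and the variance is exactly zero, so every output is ~0
mu_B = x.mean(axis=0, keepdims=True)
var_B = x.var(axis=0, keepdims=True)            # all zeros
print((x - mu_B) / np.sqrt(var_B + eps))        # ~[[0. 0. 0. 0. 0.]]

# Layer norm statistics over the feature axis still make sense
mu_L = x.mean(axis=1, keepdims=True)
var_L = x.var(axis=1, keepdims=True)
print((x - mu_L) / np.sqrt(var_L + eps))        # meaningful normalized values
```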
Interactive Features
- View Selector: Compare both techniques or focus on one
- Animated Highlight: Shows which cells are normalized together
- Properties Panel: Details about each technique
Lesson Plan
Learning Objectives
Students will be able to:
- Distinguish which tensor dimensions batch norm and layer norm operate over
- Explain why batch normalization needs different behavior during training vs inference
- Choose the appropriate normalization for different architectures
- Describe the role of the learnable parameters γ and β (see the sketch after this list)
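Continuing the NumPy sketches above, γ and β are a learnable per-channel scale and shift applied after normalization, so the network can restore any mean and variance that turn out to be useful. The shapes and identity initialization below are the conventional choice, shown for illustration:

```python
import numpy as np

x = np.random.randn(4, 3, 8, 8)
eps = 1e-5

# Normalize (batch-norm style, per channel)
mu = x.mean(axis=(0, 2, 3), keepdims=True)
var = x.var(axis=(0, 2, 3), keepdims=True)
x_hat = (x - mu) / np.sqrt(var + eps)

# Learnable per-channel scale (gamma) and shift (beta), initialized to the
# identity so the layer starts out as a pure normalization
gamma = np.ones((1, 3, 1, 1))
beta = np.zeros((1, 3, 1, 1))
y = gamma * x_hat + beta
```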
Suggested Activities
- Dimension Tracing: For a tensor of shape (B, C, H, W), identify which values batch norm averages together
- Architecture Matching: Why do transformers prefer layer norm while CNNs use batch norm?
- Small Batch Problem: What happens to batch norm statistics with batch size 1?
- Running Statistics: Why does batch norm track a running mean and variance during training? (A sketch after this list illustrates the mechanism.)
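For the running-statistics activity, the sketch below shows the usual exponential-moving-average scheme in NumPy (the momentum value and shapes are illustrative): batch statistics are used and accumulated during training, and the frozen running statistics are used at inference, which is why batch norm behaves differently in the two modes.

```python
import numpy as np

eps, momentum = 1e-5, 0.1               # momentum value is illustrative
running_mean = np.zeros((1, 3, 1, 1))
running_var = np.ones((1, 3, 1, 1))

# Training: normalize with the current batch's statistics and keep an
# exponential moving average of them for later use
for step in range(100):
    x = np.random.randn(4, 3, 8, 8)
    mu = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + eps)
    running_mean = (1 - momentum) * running_mean + momentum * mu
    running_var = (1 - momentum) * running_var + momentum * var

# Inference: a single example is normalized with the frozen running
# statistics, so the output no longer depends on the rest of the batch
x_test = np.random.randn(1, 3, 8, 8)
x_hat_test = (x_test - running_mean) / np.sqrt(running_var + eps)
```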
Discussion Questions
- Why can't batch normalization be used effectively with batch size 1?
- How do the learnable parameters γ and β help after normalization?
- Why is layer normalization independent of batch size?
- When might you use instance normalization or group normalization? (See the sketch after this list for the axes each one reduces over.)
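For the last question, this sketch (with an illustrative shape and group count) shows the axes instance normalization and group normalization reduce over, so they can be compared directly with the batch and layer norm sketches above:

```python
import numpy as np

x = np.random.randn(4, 8, 16, 16)    # (B, C, H, W); C divisible by the group count
eps, num_groups = 1e-5, 4            # illustrative group count

# Instance norm: statistics per sample AND per channel (spatial axes only)
mu_in = x.mean(axis=(2, 3), keepdims=True)          # shape (4, 8, 1, 1)
x_in = (x - mu_in) / np.sqrt(x.var(axis=(2, 3), keepdims=True) + eps)

# Group norm: split channels into groups, then normalize each sample's group
# over its channels and spatial positions
g = x.reshape(4, num_groups, 8 // num_groups, 16, 16)
mu_gn = g.mean(axis=(2, 3, 4), keepdims=True)       # shape (4, 4, 1, 1, 1)
x_gn = ((g - mu_gn) / np.sqrt(g.var(axis=(2, 3, 4), keepdims=True) + eps)).reshape(x.shape)
```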
References
- Ioffe & Szegedy (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
- Ba, Kiros & Hinton (2016). Layer Normalization
- Wu & He (2018). Group Normalization