Backpropagation
Run the Backpropagation MicroSim Fullscreen
Edit the MicroSim with the p5.js editor
About This MicroSim
This visualization demonstrates how gradients propagate backward through a neural network using the chain rule. Understanding backpropagation is essential for grasping how neural networks learn.
The Backpropagation Algorithm
Starting from the output layer and working backward, the algorithm computes (see the sketch after this list):
- Output Error: \(\delta^{[L]} = (\hat{y} - y) \cdot \sigma'(z^{[L]})\)
- Hidden Layer Error: \(\delta^{[l]} = (W^{[l+1]})^T \delta^{[l+1]} \odot \sigma'(z^{[l]})\)
- Weight Gradients: \(\frac{\partial \mathcal{L}}{\partial W^{[l]}} = \delta^{[l]} (\mathbf{a}^{[l-1]})^T\)
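The sketch below walks through these three formulas in NumPy for a tiny network with one hidden layer and sigmoid activations. The layer sizes, initialization, and input values are illustrative assumptions, not the exact configuration used in the MicroSim.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Illustrative sizes and values only; the MicroSim may use a different setup.
rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 2, 3, 1
W1, b1 = rng.normal(size=(n_hidden, n_in)), np.zeros((n_hidden, 1))
W2, b2 = rng.normal(size=(n_out, n_hidden)), np.zeros((n_out, 1))
x = np.array([[0.5], [0.8]])   # input column vector
y = np.array([[1.0]])          # target (the slider value in the MicroSim)

# Forward pass: cache pre-activations z and activations a for the backward pass.
z1 = W1 @ x + b1
a1 = sigmoid(z1)
z2 = W2 @ a1 + b2
y_hat = sigmoid(z2)

# Backward pass, mirroring the three formulas above.
delta2 = (y_hat - y) * sigmoid_prime(z2)      # output error
delta1 = (W2.T @ delta2) * sigmoid_prime(z1)  # hidden error via the transpose
dW2 = delta2 @ a1.T                           # same shape as W2
dW1 = delta1 @ x.T                            # same shape as W1

assert dW2.shape == W2.shape and dW1.shape == W1.shape
```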
Key Insight: The Transpose
Notice how \(W^{[l+1]}\) appears transposed when propagating gradients backward. The transpose "distributes" each layer-\((l+1)\) error back to the layer-\(l\) neurons that contributed to it, weighted by the strength of each connection.
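As a quick illustration (with arbitrarily chosen layer sizes, so treat the numbers as assumptions), the snippet below checks that multiplying by \((W^{[l+1]})^T\) gives each layer-\(l\) neuron the sum of the errors of every layer-\((l+1)\) neuron it feeds into, weighted by the connecting weights.

```python
import numpy as np

rng = np.random.default_rng(1)
n_l, n_next = 3, 2                        # arbitrary layer sizes for illustration
W_next = rng.normal(size=(n_next, n_l))   # W[l+1]: row i holds neuron i's incoming weights
delta_next = rng.normal(size=(n_next, 1)) # errors of the layer-(l+1) neurons

# Matrix form: a single multiplication by the transpose.
back = W_next.T @ delta_next

# Explicit form: neuron j in layer l collects the error of each neuron i
# it connects to in layer l+1, scaled by the weight on that connection.
explicit = np.array([[sum(W_next[i, j] * delta_next[i, 0] for i in range(n_next))]
                     for j in range(n_l)])

assert np.allclose(back, explicit)
```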
Interactive Features
- Forward Pass: First compute all activations (required before backprop)
- Backward Step: Step through gradient computation layer by layer
- Target Slider: Change the target value and see how gradients change
- Auto Mode: Watch the full backward pass animate
Visual Indicators
- δ values: Error signals shown above each neuron
- Red/Blue colors: Positive/negative gradients
- Arrows: Direction of gradient flow
- ∂ values: Weight gradients on connections
Lesson Plan
Learning Objectives
Students will be able to:
- Explain how the chain rule enables gradient computation through composed functions
- Describe why the weight matrix transpose appears in backpropagation
- Compute error signals (δ) at each layer
- Calculate weight gradients from error signals and activations
Suggested Activities
- Manual Backprop: Verify the δ and gradient values by hand calculation
- Target Exploration: Change the target from 0 to 1 and observe how gradients flip sign
- Trace the Chain: For one weight, write out the full chain rule expression (a worked example follows this list)
- Dimension Verification: Confirm that \(\delta^{[l]} (a^{[l-1]})^T\) has the same shape as \(W^{[l]}\)
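For the "Trace the Chain" activity, one possible worked answer is shown below. It assumes a single sigmoid output (layer 2), one hidden layer (layer 1), and the weight \(w^{[1]}_{jk}\) connecting input \(x_k\) to hidden neuron \(j\); adapt the indices to whichever weight and architecture you trace in the MicroSim.

\[
\frac{\partial \mathcal{L}}{\partial w^{[1]}_{jk}}
= \frac{\partial \mathcal{L}}{\partial \hat{y}}
  \cdot \frac{\partial \hat{y}}{\partial z^{[2]}}
  \cdot \frac{\partial z^{[2]}}{\partial a^{[1]}_j}
  \cdot \frac{\partial a^{[1]}_j}{\partial z^{[1]}_j}
  \cdot \frac{\partial z^{[1]}_j}{\partial w^{[1]}_{jk}}
= (\hat{y} - y)\,\sigma'(z^{[2]})\, w^{[2]}_{1j}\,\sigma'(z^{[1]}_j)\, x_k
\]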
Discussion Questions
- Why does the transpose of \(W^{[l+1]}\) appear when computing \(\delta^{[l]}\)?
- What happens to gradients when neurons have zero activation (dead ReLU)?
- How does the magnitude of the output error affect all gradients in the network?
- Why is it important that gradient dimensions match weight dimensions?
Mathematical Details
For MSE loss: \(\mathcal{L} = \frac{1}{2}(\hat{y} - y)^2\)
The derivative with respect to the prediction: \(\frac{\partial \mathcal{L}}{\partial \hat{y}} = \hat{y} - y\)
Applying the chain rule through the output activation \(\hat{y} = \sigma(z^{[L]})\) recovers the output error used above: \(\delta^{[L]} = \frac{\partial \mathcal{L}}{\partial z^{[L]}} = \frac{\partial \mathcal{L}}{\partial \hat{y}} \cdot \sigma'(z^{[L]}) = (\hat{y} - y)\,\sigma'(z^{[L]})\)
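A minimal numeric sanity check (assuming a sigmoid output unit and arbitrarily chosen values) confirms that this analytic output error matches a central-difference estimate of \(\partial \mathcal{L} / \partial z^{[L]}\):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(z, y):
    # MSE loss written as a function of the output pre-activation z.
    return 0.5 * (sigmoid(z) - y) ** 2

z, y = 0.3, 1.0                      # arbitrary pre-activation and target
y_hat = sigmoid(z)

analytic = (y_hat - y) * sigmoid(z) * (1.0 - sigmoid(z))     # delta at the output
eps = 1e-6
numeric = (loss(z + eps, y) - loss(z - eps, y)) / (2 * eps)  # central difference

print(analytic, numeric)             # the two values should agree closely
assert abs(analytic - numeric) < 1e-8
```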
References
- Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning Representations by Back-propagating Errors. Nature, 323, 533–536.
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. Section 6.5.