Backpropagation
Run the Backpropagation MicroSim Fullscreen
Edit the MicroSim with the p5.js editor
About This MicroSim
This visualization demonstrates how gradients propagate backward through a neural network using the chain rule. Understanding backpropagation is essential for grasping how neural networks learn.
The Backpropagation Algorithm
Starting from the output layer and working backward, the algorithm computes (see the sketch after this list):
- Output Error: \(\delta^{[L]} = (\hat{y} - y) \cdot \sigma'(z^{[L]})\)
- Hidden Layer Error: \(\delta^{[l]} = (W^{[l+1]})^T \delta^{[l+1]} \odot \sigma'(z^{[l]})\)
- Weight Gradients: \(\frac{\partial \mathcal{L}}{\partial W^{[l]}} = \delta^{[l]} (\mathbf{a}^{[l-1]})^T\)
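The sketch below walks through these three formulas in NumPy for a tiny network with one hidden layer and sigmoid activations. The layer sizes, initialization, and input values are illustrative assumptions, not the exact configuration used in the MicroSim.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Illustrative sizes and values only; the MicroSim may use a different setup.
rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 2, 3, 1
W1, b1 = rng.normal(size=(n_hidden, n_in)), np.zeros((n_hidden, 1))
W2, b2 = rng.normal(size=(n_out, n_hidden)), np.zeros((n_out, 1))
x = np.array([[0.5], [0.8]])   # input column vector
y = np.array([[1.0]])          # target (the slider value in the MicroSim)

# Forward pass: cache pre-activations z and activations a for the backward pass.
z1 = W1 @ x + b1
a1 = sigmoid(z1)
z2 = W2 @ a1 + b2
y_hat = sigmoid(z2)

# Backward pass, mirroring the three formulas above.
delta2 = (y_hat - y) * sigmoid_prime(z2)      # output error
delta1 = (W2.T @ delta2) * sigmoid_prime(z1)  # hidden error via the transpose
dW2 = delta2 @ a1.T                           # same shape as W2
dW1 = delta1 @ x.T                            # same shape as W1

assert dW2.shape == W2.shape and dW1.shape == W1.shape
```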
Key Insight: The Transpose
Notice how \(W^{[l+1]}\) appears transposed when propagating gradients backward. The transpose "distributes" each layer-\((l+1)\) error back to the layer-\(l\) neurons that contributed to it, weighted by the strength of each connection.
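As a quick illustration (with arbitrarily chosen layer sizes, so treat the numbers as assumptions), the snippet below checks that multiplying by \((W^{[l+1]})^T\) gives each layer-\(l\) neuron the sum of the errors of every layer-\((l+1)\) neuron it feeds into, weighted by the connecting weights.

```python
import numpy as np

rng = np.random.default_rng(1)
n_l, n_next = 3, 2                        # arbitrary layer sizes for illustration
W_next = rng.normal(size=(n_next, n_l))   # W[l+1]: row i holds neuron i's incoming weights
delta_next = rng.normal(size=(n_next, 1)) # errors of the layer-(l+1) neurons

# Matrix form: a single multiplication by the transpose.
back = W_next.T @ delta_next

# Explicit form: neuron j in layer l collects the error of each neuron i
# it connects to in layer l+1, scaled by the weight on that connection.
explicit = np.array([[sum(W_next[i, j] * delta_next[i, 0] for i in range(n_next))]
                     for j in range(n_l)])

assert np.allclose(back, explicit)
```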
Interactive Features
- Forward Pass: First compute all activations (required before backprop)
- Backward Step: Step through gradient computation layer by layer
- Target Slider: Change the target value and see how gradients change
- Auto Mode: Watch the full backward pass animate
Visual Indicators
- δ values: Error signals shown above each neuron
- Red/Blue colors: Positive/negative gradients
- Arrows: Direction of gradient flow
- ∂ values: Weight gradients on connections
Lesson Plan
Learning Objectives
Students will be able to:
- Explain how the chain rule enables gradient computation through composed functions
- Describe why the weight matrix transpose appears in backpropagation
- Compute error signals (δ) at each layer
- Calculate weight gradients from error signals and activations
Suggested Activities
- Manual Backprop: Verify the δ and gradient values by hand calculation
- Target Exploration: Change the target from 0 to 1 and observe how gradients flip sign
- Trace the Chain: For one weight, write out the full chain rule expression (a worked example follows this list)
- Dimension Verification: Confirm that \(\delta^{[l]} (a^{[l-1]})^T\) has the same shape as \(W^{[l]}\)
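For the "Trace the Chain" activity, one possible worked answer is shown below. It assumes a single sigmoid output (layer 2), one hidden layer (layer 1), and the weight \(w^{[1]}_{jk}\) connecting input \(x_k\) to hidden neuron \(j\); adapt the indices to whichever weight and architecture you trace in the MicroSim.

\[
\frac{\partial \mathcal{L}}{\partial w^{[1]}_{jk}}
= \frac{\partial \mathcal{L}}{\partial \hat{y}}
  \cdot \frac{\partial \hat{y}}{\partial z^{[2]}}
  \cdot \frac{\partial z^{[2]}}{\partial a^{[1]}_j}
  \cdot \frac{\partial a^{[1]}_j}{\partial z^{[1]}_j}
  \cdot \frac{\partial z^{[1]}_j}{\partial w^{[1]}_{jk}}
= (\hat{y} - y)\,\sigma'(z^{[2]})\, w^{[2]}_{1j}\,\sigma'(z^{[1]}_j)\, x_k
\]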
Discussion Questions
- Why does the transpose of \(W^{[l+1]}\) appear when computing \(\delta^{[l]}\)?
- What happens to gradients when neurons have zero activation (dead ReLU)?
- How does the magnitude of the output error affect all gradients in the network?
- Why is it important that gradient dimensions match weight dimensions?
Mathematical Details
For MSE loss: \(\mathcal{L} = \frac{1}{2}(\hat{y} - y)^2\)
The derivative with respect to the prediction: \(\frac{\partial \mathcal{L}}{\partial \hat{y}} = \hat{y} - y\)
Applying the chain rule through the output activation \(\hat{y} = \sigma(z^{[L]})\) recovers the output error used above: \(\delta^{[L]} = \frac{\partial \mathcal{L}}{\partial z^{[L]}} = \frac{\partial \mathcal{L}}{\partial \hat{y}} \cdot \sigma'(z^{[L]}) = (\hat{y} - y)\,\sigma'(z^{[L]})\)
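A minimal numeric sanity check (assuming a sigmoid output unit and arbitrarily chosen values) confirms that this analytic output error matches a central-difference estimate of \(\partial \mathcal{L} / \partial z^{[L]}\):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(z, y):
    # MSE loss written as a function of the output pre-activation z.
    return 0.5 * (sigmoid(z) - y) ** 2

z, y = 0.3, 1.0                      # arbitrary pre-activation and target
y_hat = sigmoid(z)

analytic = (y_hat - y) * sigmoid(z) * (1.0 - sigmoid(z))     # delta at the output
eps = 1e-6
numeric = (loss(z + eps, y) - loss(z - eps, y)) / (2 * eps)  # central difference

print(analytic, numeric)             # the two values should agree closely
assert abs(analytic - numeric) < 1e-8
```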
References
- Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning Representations by Back-propagating Errors. Nature, 323, 533–536.
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. Section 6.5.