Learning Rate Effect on Convergence
Run the Learning Rate Effect MicroSim Fullscreen
Edit the MicroSim with the p5.js editor
About This MicroSim
This visualization demonstrates how the choice of learning rate dramatically affects gradient descent optimization. By running three optimizations simultaneously with different learning rates, students can directly observe and compare convergence behaviors.
Learning Objective: Understand how learning rate choice affects optimization behavior through side-by-side comparison.
How to Use
- Click "Run All" to start the optimization on all three panels simultaneously
- Adjust individual learning rates using the sliders below each panel
- Use preset buttons to quickly set typical scenarios:
    - Too Small: Very slow convergence
    - Just Right: Efficient convergence
    - Too Large: Oscillation or divergence
- Adjust animation speed to slow down or speed up the visualization
- Click "Reset All" to restart from the initial position
Key Concepts Demonstrated
The Learning Rate Tradeoff
The learning rate \(\eta\) (eta) controls the step size in gradient descent:
- Too small (\(\eta < 0.01\)): Safe but slow, may never reach optimum in reasonable time
- Just right (\(\eta \approx 0.1\)): Fast and stable convergence
- Too large (\(\eta > 0.3\)): May overshoot, oscillate, or diverge entirely
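The three regimes can be reproduced numerically. This is a minimal sketch, assuming the sim's quadratic loss \(L(x, y) = x^2 + 3y^2\) described below; the starting point \((2, 2)\), tolerance, and step cap are illustrative choices, not values taken from the MicroSim.

```python
def gradient_descent(eta, steps=200, start=(2.0, 2.0), tol=1e-3):
    """Plain gradient descent on L(x, y) = x^2 + 3y^2; gradient is (2x, 6y)."""
    x, y = start
    history = []
    for _ in range(steps):
        history.append(x**2 + 3 * y**2)
        x, y = x - eta * 2 * x, y - eta * 6 * y
        if x**2 + 3 * y**2 < tol:          # close enough to the minimum at (0, 0)
            break
    return (x, y), history

for eta in (0.005, 0.1, 0.4):              # too small / just right / too large
    _, hist = gradient_descent(eta)
    trend = "diverging" if hist[-1] > hist[0] else "converging"
    print(f"eta={eta}: {len(hist)} steps recorded, final loss {hist[-1]:.3g} ({trend})")
```

With \(\eta = 0.005\) the loss is still far from the tolerance after 200 steps; with \(\eta = 0.1\) it converges in under 30 steps; with \(\eta = 0.4\) the y-coordinate grows by a factor of 1.4 per step and the loss explodes.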
Visual Indicators
Each panel shows:
- Contour plot: Elliptical contours of the loss function
- Path trace: The trajectory taken by gradient descent
- Status indicator:
    - Converging (blue): Making steady progress
    - Converged (green): Reached the minimum
    - Oscillating (yellow): Bouncing around the minimum
    - Diverging (red): Moving away from the minimum
- Loss curve: Real-time plot of loss value over iterations
- Step count: Number of iterations taken
Loss Function
The visualization uses a quadratic loss function:

\[L(x, y) = x^2 + 3y^2\]

This creates elliptical contours where:
- The minimum is at the origin \((0, 0)\)
- The y-direction has a steeper gradient (factor of 3)
- Different eigenvalues create the classic "elongated bowl" optimization challenge
Why Large Learning Rates Cause Problems
For quadratic functions, stability requires:

\[\eta < \frac{2}{\lambda_{max}}\]

where \(\lambda_{max}\) is the largest eigenvalue of the Hessian. In our case:
- Hessian eigenvalues: 2 and 6
- Maximum stable learning rate: \(\eta < 2/6 \approx 0.33\)
Beyond this threshold, the optimizer overshoots and may diverge.
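A quick numeric check of the bound, assuming the eigenvalues 2 and 6 stated above: each coordinate of the iterate is scaled by \((1 - \eta\lambda)\) per step, so the method diverges as soon as \(|1 - \eta\lambda| > 1\) for either eigenvalue.

```python
def contraction_factors(eta, eigenvalues=(2, 6)):
    """Per-eigendirection scaling factor |1 - eta*lambda| of one descent step."""
    return [abs(1 - eta * lam) for lam in eigenvalues]

# Just below, exactly at, and above the eta < 2/6 threshold.
for eta in (0.30, 1 / 3, 0.40):
    factors = contraction_factors(eta)
    stable = max(factors) <= 1            # at exactly 1, it oscillates forever
    print(f"eta={eta:.3f}: factors {factors} -> {'stable' if stable else 'divergent'}")
```

At \(\eta = 1/3\) the largest factor is exactly 1: the y-coordinate flips sign each step without shrinking, which is the undamped oscillation shown in the "Too Large" panel just before outright divergence.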
Mathematical Details
Gradient Descent Update
At each step, the algorithm computes the gradient of the loss:

\[\nabla L(x, y) = \begin{pmatrix} 2x \\ 6y \end{pmatrix}\]

And updates the parameters:

\[\begin{pmatrix} x \\ y \end{pmatrix} \leftarrow \begin{pmatrix} x \\ y \end{pmatrix} - \eta \nabla L(x, y)\]
Convergence Rate
For quadratic functions, the error in each eigendirection shrinks by a factor of \(|1 - \eta\lambda_i|\) per step, so the convergence rate is:

\[\rho = \max_i \, |1 - \eta\lambda_i|\]

The optimal learning rate that minimizes this bound is:

\[\eta^* = \frac{2}{\lambda_{min} + \lambda_{max}}\]
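Plugging in this sim's eigenvalues: the optimal step size equates the contraction factors of the slow and fast directions, giving \(\eta^* = 2/(2+6) = 0.25\) and a per-step contraction of 0.5.

```python
# Optimal learning rate for the sim's loss L(x, y) = x^2 + 3y^2.
l_min, l_max = 2, 6                       # Hessian eigenvalues
eta_star = 2 / (l_min + l_max)            # 0.25
kappa = l_max / l_min                     # condition number = 3
rate = (kappa - 1) / (kappa + 1)          # per-step error contraction = 0.5

# At eta*, both eigendirections shrink at the same rate.
assert abs(1 - eta_star * l_min) == abs(1 - eta_star * l_max) == rate
print(eta_star, rate)                     # 0.25 0.5
```

Note that \(\eta^* = 0.25\) is larger than the "Just Right" preset of 0.1: the elongated bowl tolerates fairly aggressive steps, but only up to the 1/3 stability limit.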
Embedding This MicroSim
Lesson Plan
Grade Level
Undergraduate machine learning or optimization course
Duration
20-25 minutes
Prerequisites
- Understanding of gradient descent
- Basic calculus (partial derivatives)
- Familiarity with loss functions
Learning Activities
1. Initial Exploration (5 min):
    - Run with default settings
    - Observe the three different behaviors
    - Note which optimizer reaches the minimum first
2. Learning Rate Sensitivity (5 min):
    - Use "Too Large" preset
    - Watch oscillation and divergence
    - Identify the threshold where behavior changes
3. Finding Optimal Rate (5 min):
    - Manually adjust sliders to find the fastest convergence
    - Compare step counts to convergence
    - Discuss the tradeoff between speed and stability
4. Analysis Discussion (5 min):
    - Why does the y-direction cause more oscillation?
    - How do eigenvalues relate to the optimal learning rate?
    - What happens at the stability boundary?
5. Real-World Connection (5 min):
    - How do modern optimizers (Adam, RMSprop) handle this?
    - Why is learning rate scheduling important?
    - Connection to neural network training
Discussion Questions
- Why does the optimizer oscillate more in the y-direction with large learning rates?
- What is the relationship between the loss function's curvature and the optimal learning rate?
- How would you design an adaptive learning rate algorithm based on these observations?
- Why might a learning rate that's "just right" for one problem be wrong for another?
- How do momentum-based optimizers help with oscillation?
Assessment Ideas
- Predict the behavior given a specific learning rate
- Calculate the maximum stable learning rate for a given Hessian
- Design an experiment to find the optimal learning rate empirically
- Explain the convergence/divergence criteria mathematically
Connections to Machine Learning
Neural Network Training
- Learning rate is one of the most important hyperparameters
- Too small: training takes forever, may get stuck
- Too large: loss explodes, training fails
- Common strategy: start larger, decay over time
Learning Rate Schedules
- Step decay: Reduce by factor after fixed epochs
- Exponential decay: \(\eta_t = \eta_0 e^{-kt}\)
- Cosine annealing: Smooth decrease following cosine curve
- Warmup: Start small, increase, then decay
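The listed schedules can be sketched as simple functions of the step count. The base rate, decay constants, and horizon below are illustrative values, not ones used by any particular framework or by the MicroSim.

```python
import math

def step_decay(t, eta0=0.1, factor=0.5, every=10):
    """Cut the rate by `factor` after every `every` epochs."""
    return eta0 * factor ** (t // every)

def exponential_decay(t, eta0=0.1, k=0.05):
    """eta_t = eta_0 * e^{-kt}"""
    return eta0 * math.exp(-k * t)

def cosine_annealing(t, eta0=0.1, T=100):
    """Smooth decrease from eta0 to 0 over T steps, following a cosine curve."""
    return 0.5 * eta0 * (1 + math.cos(math.pi * min(t, T) / T))

for t in (0, 10, 50, 100):
    print(t, step_decay(t), round(exponential_decay(t), 4), round(cosine_annealing(t), 4))
```

All three start at the same base rate and end small; they differ in how abruptly the rate drops, which is the knob a practitioner tunes against the stability/speed tradeoff demonstrated above.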
Adaptive Methods
Modern optimizers adapt the learning rate per-parameter:

- AdaGrad: Accumulates squared gradients
- RMSprop: Exponential moving average of squared gradients
- Adam: Combines momentum with adaptive rates
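As a sketch of the last recipe, here is a single-parameter Adam update: momentum on the gradient plus an RMS-normalized step. The hyperparameters are the common published defaults; the learning rate and step count in the usage example are illustrative.

```python
import math

def adam_step(theta, grad, m, v, t, eta=0.01, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t starts at 1."""
    m = b1 * m + (1 - b1) * grad           # momentum: moving average of gradients
    v = b2 * v + (1 - b2) * grad**2        # moving average of squared gradients
    m_hat = m / (1 - b1**t)                # bias correction for the warm-up phase
    v_hat = v / (1 - b2**t)
    theta = theta - eta * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize f(x) = x^2 (gradient 2x) from x = 2.
x, m, v = 2.0, 0.0, 0.0
for t in range(1, 1001):
    x, m, v = adam_step(x, 2 * x, m, v, t)
print(x)   # approaches 0
```

Because the step is normalized by \(\sqrt{\hat{v}}\), Adam moves at roughly \(\eta\) per step regardless of the gradient's raw scale, which is why it is far less sensitive to curvature differences like the factor-of-3 asymmetry in this sim's loss.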
References
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. Chapter 8.
- Bottou, L. (2010). Large-scale machine learning with stochastic gradient descent. COMPSTAT.
- Jordan, J. Why Learning Rate is So Important (blog post).
- Ruder, S. (2016). An overview of gradient descent optimization algorithms. arXiv:1609.04747.