# SGD Trajectory Visualizer

- Run the SGD Visualizer Fullscreen
- Edit the MicroSim with the p5.js editor
## About This MicroSim
This visualization demonstrates how batch size affects the behavior of Stochastic Gradient Descent (SGD). Small batch sizes lead to noisy, erratic optimization paths, while larger batch sizes produce smoother convergence.
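For readers who want the mechanics behind the animation, here is a minimal sketch of the update rule being visualized, written in plain Python/NumPy rather than the MicroSim's p5.js. The toy quadratic loss and all names are illustrative assumptions, not the simulator's internals:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset: sample i has its own quadratic loss 0.5 * ||w - centers[i]||^2,
# so per-sample gradients disagree and a small batch gives a noisy average.
N = 128
centers = rng.normal(0.0, 1.0, size=(N, 2))

def sgd_step(w, batch_size, lr):
    """One mini-batch SGD step: average the gradients of a random batch."""
    batch = rng.choice(N, size=batch_size, replace=False)
    grad = np.mean(w - centers[batch], axis=0)  # gradient of the batch-average loss
    return w - lr * grad

w = np.array([2.0, -1.5])  # starting point, like clicking on the plot
for _ in range(50):
    w = sgd_step(w, batch_size=8, lr=0.1)
print(w)  # drifts toward centers.mean(axis=0), with batch-size-dependent jitter
```

Setting `batch_size=1` reproduces pure SGD, while setting it to `N` recovers deterministic full-batch gradient descent.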
## How to Use
- Adjust Batch Size: Slide from 1 (pure SGD) to 128 (full batch)
- Adjust Learning Rate: Control step size
- Step/Run: Execute optimization steps manually or automatically
- Show Noise: Toggle visualization of gradient variance
- Set Start Point: Click anywhere on the plot to set a new starting point
## Key Observations
| Batch Size | Behavior |
|---|---|
| 1 (Pure SGD) | Very noisy, erratic path |
| 8-32 | Moderate noise, faster per-step |
| 128 (Full Batch) | Smooth path, slower per-step |
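You can reproduce these behaviors outside the simulator. The sketch below (same toy quadratic setup as above; the "noisiness" score is an illustrative choice, not the MicroSim's code) runs one trajectory per batch size and measures how sharply the path changes direction from step to step:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 128
centers = rng.normal(0.0, 1.0, size=(N, 2))  # per-sample optima

def run_trajectory(batch_size, lr=0.1, steps=200):
    """Run SGD on the toy quadratic and record every iterate."""
    w = np.array([2.0, -1.5])
    path = [w.copy()]
    for _ in range(steps):
        batch = rng.choice(N, size=batch_size, replace=False)
        w = w - lr * np.mean(w - centers[batch], axis=0)
        path.append(w.copy())
    return np.array(path)

for b in (1, 8, 32, 128):
    path = run_trajectory(b)
    steps_taken = np.diff(path, axis=0)
    # Mean change between consecutive steps: near zero for a smooth path,
    # large when the trajectory zig-zags
    noisiness = np.linalg.norm(np.diff(steps_taken, axis=0), axis=1).mean()
    print(f"batch={b:3d}  noisiness={noisiness:.4f}")
```

The noisiness score should fall steadily as the batch size grows, mirroring the table above.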
## Understanding the Visualization
- Blue Path: The optimization trajectory
- Orange Cloud: Represents gradient variance (uncertainty)
- Gray Arrow: True gradient direction
- Orange Arrows: Sample stochastic gradients
## Why Batch Size Matters
For a mini-batch \(B\) of independently drawn samples, the variance of the mini-batch gradient estimate is inversely proportional to the batch size:

\[\text{Var}[\nabla f_B] = \frac{\text{Var}[\nabla f_i]}{|B|}\]

where \(\nabla f_i\) is the gradient computed from a single sample and \(|B|\) is the number of samples in the batch.
Larger batches give more accurate gradient estimates but require more computation per step.
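A quick Monte Carlo check of this scaling law. The per-sample gradients here are synthetic one-dimensional values, an illustrative assumption rather than anything measured from the simulator; batches are drawn with replacement so the \(1/|B|\) law holds exactly:

```python
import numpy as np

rng = np.random.default_rng(2)

# Fixed set of per-sample gradients at some point w (1-D for simplicity)
per_sample_grads = rng.normal(loc=1.0, scale=2.0, size=128)
sigma2 = per_sample_grads.var()  # variance across individual samples

for B in (1, 4, 16, 64):
    # Empirical variance of the mini-batch gradient over many random batches
    batch_means = [
        per_sample_grads[rng.choice(128, size=B, replace=True)].mean()
        for _ in range(20_000)
    ]
    print(f"|B|={B:2d}  empirical={np.var(batch_means):.4f}  sigma2/|B|={sigma2 / B:.4f}")
```

The empirical column should track `sigma2/|B|` closely, halving each time the batch size doubles... at the cost of computing twice as many per-sample gradients.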
## Lesson Plan
### Learning Objectives
- Understand the trade-off between gradient accuracy and computation
- Visualize how noise affects optimization paths
- Compare pure SGD, mini-batch, and full-batch gradient descent
### Suggested Activities
- Pure SGD: Set batch size to 1 and observe the noisy trajectory
- Mini-Batch: Try batch sizes of 8, 16, 32 and compare smoothness
- Full Batch: Set to 128 and see the deterministic path
- Learning Rate Interaction: Combine a high learning rate with a small batch size and watch the trajectory become unstable (see the sketch below)
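For the last activity, the same toy quadratic setup (illustrative only, not the simulator's code) shows the interaction numerically: at a high learning rate, full-batch descent still converges on this loss, but batch size 1 leaves the iterate bouncing far from the optimum because gradient noise dominates:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 128
centers = rng.normal(0.0, 1.0, size=(N, 2))
optimum = centers.mean(axis=0)  # minimizer of the average toy loss

def final_distance(batch_size, lr, steps=200):
    """Distance from the optimum after `steps` SGD steps on the toy quadratic."""
    w = np.array([2.0, -1.5])
    for _ in range(steps):
        batch = rng.choice(N, size=batch_size, replace=False)
        w = w - lr * np.mean(w - centers[batch], axis=0)
    return np.linalg.norm(w - optimum)

for lr in (0.1, 1.0, 1.9):
    for b in (1, 128):
        print(f"lr={lr:.1f}  batch={b:3d}  final distance={final_distance(b, lr):.4f}")
```

At `lr=1.9` the full-batch run still contracts toward the optimum, while the batch-size-1 run settles into a wide noise ball, the same instability you see in the visualizer.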
## References
- Ian Goodfellow, Yoshua Bengio, and Aaron Courville, *Deep Learning*, MIT Press, 2016, Chapter 8 (Optimization for Training Deep Models)
- Wikipedia: [Stochastic gradient descent](https://en.wikipedia.org/wiki/Stochastic_gradient_descent)