# Activation Functions
## About This MicroSim

This visualization compares common neural network activation functions, showing both their shape and derivative behavior—crucial for understanding gradient flow during backpropagation.
## Activation Functions Included

| Function | Formula | Range | Key Property |
|---|---|---|---|
| ReLU | max(0, x) | [0, ∞) | Efficient, sparse |
| Sigmoid | 1/(1+e⁻ˣ) | (0, 1) | Probability output |
| Tanh | (eˣ-e⁻ˣ)/(eˣ+e⁻ˣ) | (-1, 1) | Zero-centered |
| Leaky ReLU | max(0.1x, x) | (-∞, ∞) | No dead neurons |
| Softplus | log(1+eˣ) | (0, ∞) | Smooth ReLU |
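Below is a minimal sketch, not the MicroSim's actual source, of how the five functions and their derivatives in the table above could be computed in plain JavaScript. The object name `activations` and the 0.1 Leaky ReLU slope follow the table; everything else is illustrative.

```javascript
// Sketch of the five activations and their derivatives.
// Note: the derivative of softplus is exactly the sigmoid.
const activations = {
  relu:      { f: x => Math.max(0, x),            df: x => (x > 0 ? 1 : 0) },
  sigmoid:   { f: x => 1 / (1 + Math.exp(-x)),    df: x => { const s = 1 / (1 + Math.exp(-x)); return s * (1 - s); } },
  tanh:      { f: x => Math.tanh(x),              df: x => 1 - Math.tanh(x) ** 2 },
  leakyRelu: { f: x => Math.max(0.1 * x, x),      df: x => (x > 0 ? 1 : 0.1) },
  softplus:  { f: x => Math.log(1 + Math.exp(x)), df: x => 1 / (1 + Math.exp(-x)) }
};

// Print the values an info panel might show for a given input.
const xInput = -3;
for (const [name, { f, df }] of Object.entries(activations)) {
  console.log(`${name}: f(${xInput}) = ${f(xInput).toFixed(4)}, f'(${xInput}) = ${df(xInput).toFixed(4)}`);
}
```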
## Interactive Features

- Function Selector: Choose which activation to examine
- Show Derivative: Toggle to display f'(x) as a dashed line
- Compare All: Overlay all functions for comparison
- Input Slider: Trace along the curve to see exact values
- Info Panel: Shows f(x), f'(x), range, and gradient status
## Visual Indicators

- Yellow regions: Low-gradient areas (|f'(x)| < 0.1) where vanishing gradients can occur (see the sketch after this list)
- Solid line: The activation function f(x)
- Dashed line: The derivative f'(x)
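One way to see where the yellow low-gradient regions come from is to sample x values and flag those where |f'(x)| falls below the 0.1 threshold. The sketch below is an illustration, not the MicroSim's code; the function name `lowGradientXs` and the sampling range are assumptions, and sigmoid is used as the example.

```javascript
// Flag x values where the derivative magnitude is below a threshold.
function lowGradientXs(df, threshold = 0.1, xMin = -6, xMax = 6, step = 0.5) {
  const flagged = [];
  for (let x = xMin; x <= xMax; x += step) {
    if (Math.abs(df(x)) < threshold) flagged.push(x);
  }
  return flagged;
}

// Sigmoid's derivative s(1 - s) drops below 0.1 once |x| exceeds roughly 2.1.
const sigmoidDeriv = x => {
  const s = 1 / (1 + Math.exp(-x));
  return s * (1 - s);
};
console.log(lowGradientXs(sigmoidDeriv));  // e.g. [-6, -5.5, ..., -2.5, 2.5, ..., 6]
```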
## Lesson Plan

### Learning Objectives

Students will be able to:
- Describe the shape and range of common activation functions
- Explain why nonlinear activations are necessary (see the sketch after this list)
- Identify regions where gradients vanish
- Choose appropriate activations for different use cases
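To support the objective on why nonlinear activations are necessary, here is a minimal one-dimensional sketch with made-up weights showing that stacking linear layers without an activation collapses to a single linear layer.

```javascript
// Two linear layers with no activation between them.
const layer1 = x => 2 * x + 1;            // w1 = 2, b1 = 1 (made-up numbers)
const layer2 = h => -3 * h + 4;           // w2 = -3, b2 = 4
const stacked   = x => layer2(layer1(x)); // -3 * (2x + 1) + 4 = -6x + 1
const collapsed = x => -6 * x + 1;        // the equivalent single linear layer
console.log(stacked(2.5), collapsed(2.5));  // both print -14
// Inserting ReLU between the layers, layer2(Math.max(0, layer1(x))),
// breaks this equivalence: the composite map is no longer a straight line.
```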
### Suggested Activities

- Gradient Exploration: Move the slider to x = -3 for sigmoid. What happens to f'(x)? (A worked calculation follows this list.)
- Compare ReLU Family: Look at ReLU, Leaky ReLU, and Softplus side by side
- Saturation Investigation: Find where sigmoid and tanh have near-zero gradients
- Zero-Centered Discussion: Compare sigmoid (not zero-centered) with tanh (zero-centered)
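For the Gradient Exploration activity, here is a worked calculation, independent of the MicroSim, of roughly what the info panel should show for sigmoid at x = -3:

```javascript
// Sigmoid at x = -3 sits in the saturated (low-gradient) region.
const sigmoid = x => 1 / (1 + Math.exp(-x));
const x = -3;
const fx  = sigmoid(x);       // ≈ 0.0474
const dfx = fx * (1 - fx);    // ≈ 0.0452, well under the 0.1 threshold
console.log(fx.toFixed(4), dfx.toFixed(4));  // "0.0474 0.0452"
```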
### Discussion Questions

- Why does ReLU dominate modern deep learning despite having a discontinuous derivative?
- What does "vanishing gradient" mean and why is it a problem? (A numeric illustration follows this list.)
- When would you choose sigmoid over tanh for an output layer?
- Why might Leaky ReLU be preferred over standard ReLU?
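For the vanishing-gradient question, the toy calculation below (an illustration, not the MicroSim's code) shows how gradients shrink when sigmoid's derivative, which never exceeds 0.25, is multiplied layer after layer during backpropagation.

```javascript
// Multiply the best-case sigmoid derivative (0.25 at x = 0) across layers.
const sigmoidDeriv = x => {
  const s = 1 / (1 + Math.exp(-x));
  return s * (1 - s);
};

let gradientFactor = 1.0;
for (let layer = 1; layer <= 10; layer++) {
  gradientFactor *= sigmoidDeriv(0);  // 0.25 per layer, at best
  console.log(`after layer ${layer}: ${gradientFactor.toExponential(2)}`);
}
// After 10 layers the factor is 0.25^10 ≈ 9.5e-7, so early layers barely learn.
```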
## References

- Nair, V., & Hinton, G. E. (2010). Rectified Linear Units Improve Restricted Boltzmann Machines. ICML 2010.
- Glorot, X., Bordes, A., & Bengio, Y. (2011). Deep Sparse Rectifier Neural Networks. AISTATS 2011.