Quiz: Optimization and Learning Algorithms

Test your understanding of optimization methods for machine learning.


1. The Hessian matrix contains:

  A. First-order partial derivatives
  B. Second-order partial derivatives
  C. The gradient vector
  D. The loss function values

The correct answer is B. The Hessian matrix contains all second-order partial derivatives: \(H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}\). It captures the curvature of the function.
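
For illustration, here is a minimal finite-difference sketch of the Hessian; the test function `f` and step size `eps` are assumptions made up for this example, not part of the quiz.

```python
import numpy as np

def hessian(f, x, eps=1e-5):
    """Finite-difference estimate of H_ij = d^2 f / (dx_i dx_j)."""
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n); ei[i] = eps
            ej = np.zeros(n); ej[j] = eps
            # central second-order difference
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * eps**2)
    return H

# f(x, y) = x^2 + 3xy + y^2 has the constant Hessian [[2, 3], [3, 2]]
f = lambda v: v[0]**2 + 3 * v[0] * v[1] + v[1]**2
print(hessian(f, np.array([1.0, 2.0])))
```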

Concept Tested: Hessian Matrix


2. A function is convex if:

  A. It has multiple local minima
  B. Any chord lies above or on the function graph
  C. Its gradient is always zero
  D. It is defined only for positive inputs

The correct answer is B. A convex function lies on or below any chord connecting two points on its graph: \(f(\lambda x + (1-\lambda)y) \leq \lambda f(x) + (1-\lambda)f(y)\) for all \(\lambda \in [0, 1]\). As a consequence, every local minimum of a convex function is a global minimum.
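
As a quick illustration, here is a small numerical check of the chord inequality; the test functions, endpoints, and tolerance are arbitrary choices for this sketch.

```python
import numpy as np

def satisfies_chord_inequality(f, x, y, num=11, tol=1e-12):
    """Check f(λx + (1-λ)y) <= λ f(x) + (1-λ) f(y) on a grid of λ values."""
    lambdas = np.linspace(0.0, 1.0, num)
    return all(f(l * x + (1 - l) * y) <= l * f(x) + (1 - l) * f(y) + tol
               for l in lambdas)

print(satisfies_chord_inequality(lambda x: x**2, -2.0, 3.0))  # True: x^2 is convex
print(satisfies_chord_inequality(np.sin, 0.0, np.pi))         # False: sin is concave on [0, π]
```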

Concept Tested: Convex Function


3. Newton's method uses:

  A. Only first-order gradient information
  B. Second-order Hessian information for faster convergence
  C. Random search
  D. No derivatives at all

The correct answer is B. Newton's method uses both the gradient and the Hessian: \(\mathbf{x}_{k+1} = \mathbf{x}_k - H(\mathbf{x}_k)^{-1}\nabla f(\mathbf{x}_k)\). The Hessian supplies curvature information, which enables quadratic convergence near the solution.
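
A minimal sketch of the update on a toy quadratic; the objective, starting point, and tolerances are invented for this example.

```python
import numpy as np

def newton(grad, hess, x0, tol=1e-8, max_iter=50):
    """Newton iteration: x_{k+1} = x_k - H(x_k)^{-1} grad f(x_k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        x = x - np.linalg.solve(hess(x), g)  # solve H p = g rather than forming H^{-1}
    return x

# f(x, y) = (x - 1)^2 + 10(y + 2)^2 is quadratic, so Newton lands on [1, -2] in one step
grad = lambda v: np.array([2 * (v[0] - 1), 20 * (v[1] + 2)])
hess = lambda v: np.array([[2.0, 0.0], [0.0, 20.0]])
print(newton(grad, hess, [5.0, 5.0]))
```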

Concept Tested: Newton's Method


4. Stochastic Gradient Descent (SGD) differs from batch gradient descent by:

  A. Never converging
  B. Using gradients from random subsets of data
  C. Requiring second-order derivatives
  D. Only working on convex functions

The correct answer is B. SGD computes gradients using random mini-batches rather than the full dataset. This introduces noise but enables much faster iteration, especially with large datasets.
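
A rough sketch of mini-batch SGD on synthetic linear-regression data; the dataset, batch size, learning rate, and step count are placeholders for this illustration, not values from the quiz.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=1000)

w = np.zeros(3)
lr, batch_size = 0.1, 32
for step in range(500):
    idx = rng.choice(len(X), size=batch_size, replace=False)  # random subset of the data
    Xb, yb = X[idx], y[idx]
    grad = 2 * Xb.T @ (Xb @ w - yb) / batch_size              # MSE gradient on the mini-batch
    w -= lr * grad
print(w)  # noisy, but close to [2.0, -1.0, 0.5]
```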

Concept Tested: SGD


5. Momentum in optimization:

  A. Slows down convergence
  B. Accumulates gradient information to accelerate and dampen oscillations
  C. Eliminates the learning rate
  D. Only works for linear functions

The correct answer is B. Momentum maintains a velocity vector that accumulates past gradients, helping to accelerate convergence in consistent directions while dampening oscillations in others.
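
A minimal sketch of the update rule on a one-dimensional quadratic; the learning rate and momentum coefficient are typical values assumed for this example.

```python
def momentum_step(w, v, grad, lr=0.01, beta=0.9):
    """One momentum update: the velocity v accumulates past gradients."""
    v = beta * v + grad
    return w - lr * v, v

w, v = 5.0, 0.0
for _ in range(300):
    w, v = momentum_step(w, v, 2 * w)  # gradient of f(w) = w^2 is 2w
print(w)  # approaches 0
```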

Concept Tested: Momentum


6. The Adam optimizer combines:

  A. Newton's method and SGD
  B. Momentum and adaptive learning rates (RMSprop-like)
  C. L1 and L2 regularization
  D. Batch and online learning

The correct answer is B. Adam combines momentum (first moment) with RMSprop-style adaptive learning rates (second moment), plus bias correction. This makes it robust across many problem types.
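
A compact sketch of one Adam step; the \(\beta_1\), \(\beta_2\), and \(\epsilon\) values are the commonly cited defaults, and the learning rate and toy objective are assumptions chosen so the example converges quickly.

```python
import numpy as np

def adam_step(w, m, v, grad, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: momentum (m), RMSprop-like scaling (v), and bias correction."""
    m = beta1 * m + (1 - beta1) * grad       # first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad**2    # second moment (adaptive scaling)
    m_hat = m / (1 - beta1**t)               # bias correction
    v_hat = v / (1 - beta2**t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

w, m, v = 5.0, 0.0, 0.0
for t in range(1, 1001):
    w, m, v = adam_step(w, m, v, 2 * w, t)   # gradient of f(w) = w^2
print(w)  # has moved from 5.0 to near the minimum at 0 (small oscillations may remain)
```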

Concept Tested: Adam Optimizer


7. RMSprop adapts learning rates by:

  A. Keeping them constant
  B. Dividing by the running average of squared gradients
  C. Multiplying by the gradient magnitude
  D. Using second-order derivatives

The correct answer is B. RMSprop divides each parameter's update by the square root of a running (exponentially decaying) average of its squared gradients, plus a small constant for numerical stability. This shrinks the effective step size for parameters with consistently large gradients and enlarges it for those with small gradients.
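
A small sketch of this adaptive scaling; the two-parameter objective with wildly different gradient scales is invented for the example, and the hyperparameters are typical assumed values.

```python
import numpy as np

def rmsprop_step(w, s, grad, lr=0.01, beta=0.9, eps=1e-8):
    """One RMSprop update: divide the step by the RMS of recent gradients."""
    s = beta * s + (1 - beta) * grad**2      # running average of squared gradients
    return w - lr * grad / (np.sqrt(s) + eps), s

w, s = np.array([5.0, 5.0]), np.zeros(2)
for _ in range(600):
    grad = np.array([100.0, 0.01]) * w       # coordinates with very different gradient scales
    w, s = rmsprop_step(w, s, grad)
print(w)  # both coordinates head toward 0 at comparable rates
```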

Concept Tested: RMSprop


8. Lagrange multipliers are used to:

  A. Increase the number of variables
  B. Convert constrained optimization to unconstrained
  C. Compute the Hessian
  D. Initialize weights randomly

The correct answer is B. Lagrange multipliers transform constrained optimization problems into unconstrained ones by incorporating constraints into the objective function through the Lagrangian.
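
A short worked example (the objective and constraint are chosen purely for illustration): to maximize \(f(x, y) = xy\) subject to \(x + y = 1\), form the Lagrangian \(\mathcal{L}(x, y, \lambda) = xy - \lambda(x + y - 1)\). Setting its partial derivatives to zero gives \(y = \lambda\), \(x = \lambda\), and \(x + y = 1\), so \(x = y = \tfrac{1}{2}\) and the constrained maximum is \(f(\tfrac{1}{2}, \tfrac{1}{2}) = \tfrac{1}{4}\).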

Concept Tested: Lagrange Multiplier


9. The KKT conditions generalize Lagrange multipliers to handle:

  A. Only equality constraints
  B. Inequality constraints
  C. Unconstrained problems
  D. Non-differentiable functions

The correct answer is B. The Karush-Kuhn-Tucker (KKT) conditions extend Lagrange multipliers to problems with inequality constraints, introducing complementary slackness conditions.
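
As a brief summary for a problem of the form minimize \(f(\mathbf{x})\) subject to \(g(\mathbf{x}) \leq 0\): the KKT conditions are stationarity \(\nabla f(\mathbf{x}^*) + \mu \nabla g(\mathbf{x}^*) = 0\), primal feasibility \(g(\mathbf{x}^*) \leq 0\), dual feasibility \(\mu \geq 0\), and complementary slackness \(\mu \, g(\mathbf{x}^*) = 0\), which forces \(\mu = 0\) whenever the constraint is inactive.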

Concept Tested: KKT Conditions


10. Mini-batch SGD is preferred over pure SGD (batch size 1) because:

  A. It never converges
  B. It reduces gradient variance while maintaining computational efficiency
  C. It requires more memory than full-batch
  D. It eliminates the need for a learning rate

The correct answer is B. Mini-batches reduce gradient variance (averaging over multiple samples) while still being much faster than full-batch. They also enable efficient GPU parallelization.
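
A rough numerical sketch of the variance-reduction point; the synthetic dataset, fixed evaluation point, and batch sizes are made up for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.5 * rng.normal(size=10_000)
w = np.zeros(5)                                # gradients are evaluated at a fixed point

def batch_gradient(idx):
    Xb, yb = X[idx], y[idx]
    return 2 * Xb.T @ (Xb @ w - yb) / len(idx)  # MSE gradient on the sampled subset

for b in (1, 32, 1024):
    estimates = np.stack([batch_gradient(rng.choice(len(X), size=b, replace=False))
                          for _ in range(200)])
    print(b, estimates.std(axis=0).mean())      # spread of the estimates shrinks as batch size grows
```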

Concept Tested: Mini-Batch SGD