Optimizer Selection Guide
Definition
Optimizer selection is the process of choosing the most appropriate optimization algorithm for a given machine learning problem based on dataset characteristics, model architecture, computational constraints, and desired convergence properties. With dozens of optimizers available, from simple SGD to adaptive methods like Adam and second-order techniques, understanding their trade-offs is crucial for efficient training. The right optimizer can mean the difference between converging in minutes and converging in hours, between reaching 95% and 99% accuracy, or between stable training and divergence. This guide provides a systematic framework for selecting optimizers based on problem type, scale, convexity, and practical constraints. Selection criteria include convergence speed, memory requirements, sensitivity to hyperparameters, robustness to gradient noise, and suitability for specific architectures.
Intuition
Think of choosing an optimizer like selecting a vehicle for a journey. SGD is like walking—simple, works anywhere, but slow. SGD with momentum is like a bicycle—adds speed with minimal complexity. Adam is like a modern electric car—adaptive, easy to drive, works well in most conditions without much tuning. L-BFGS is like a sports car—fast and efficient on highways (convex problems) but terrible off-road (stochastic). Newton's method is like a rocket—incredibly fast but requires perfect conditions and is prohibitively expensive for most trips. The terrain matters: smooth roads (convex problems) favor different vehicles than rocky trails (non-convex deep learning). Distance matters: short trips can use any method, but cross-country journeys need efficiency. Cargo matters: carrying heavy loads (large models) limits your options. The best optimizer isn't universally 'best'—it's the one that matches your specific journey's constraints.
Mathematical Formula
Step-by-Step Explanation:
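- Convergence rate indicates how quickly error decreases: quadratic (Newton's method) is fastest, followed by superlinear (quasi-Newton methods such as L-BFGS) and linear (gradient descent on strongly convex problems); sublinear rates, typical of SGD, are slowest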
- Memory complexity shows per-iteration storage requirements for optimizer state
- Per-step cost indicates computational expense of each parameter update
- Trade-off: faster convergence often requires more memory and computation per step
- No free lunch: best optimizer depends on problem characteristics
- Practical selection balances theoretical guarantees with empirical performance
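The trade-off between convergence rate and per-step cost can be seen on a toy problem. The sketch below is only an illustration under stated assumptions: the diagonal quadratic, the `curvatures` values, the learning rate, and all function names are hypothetical and chosen so the arithmetic is easy to follow. Gradient descent shrinks the error by a roughly constant factor per step (linear convergence), while a Newton step, which needs the Hessian, lands on the minimum of a quadratic in one step.

```python
# Toy objective: f(theta) = 0.5 * sum(a_i * theta_i^2), minimized at theta = 0.
# The curvatures a_i are hypothetical values picked for illustration.
curvatures = [3.0, 0.5]


def grad(theta):
    """Gradient of the diagonal quadratic."""
    return [a * t for a, t in zip(curvatures, theta)]


def norm(theta):
    """Euclidean distance from the minimum at the origin."""
    return sum(t * t for t in theta) ** 0.5


def gd_errors(theta, lr=0.2, n_steps=10):
    """Run plain gradient descent and record the error after each step."""
    errors = []
    for _ in range(n_steps):
        theta = [t - lr * g for t, g in zip(theta, grad(theta))]
        errors.append(norm(theta))
    return errors


def newton_step(theta):
    """One Newton step: scale each gradient entry by the inverse Hessian (1 / a_i)."""
    return [t - g / a for t, g, a in zip(theta, grad(theta), curvatures)]


start = [1.0, 1.0]
gd_err = gd_errors(start)
newton_err = norm(newton_step(start))

ratios = [round(b / a, 3) for a, b in zip(gd_err, gd_err[1:])]
print("GD error ratio per step:", ratios)
print("Newton error after one step:", newton_err)
```

On real models the Hessian is dense, so Newton's method needs memory quadratic in the parameter count and a linear solve per step, which is why it is rarely practical at deep learning scale even though it converges in far fewer steps.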
Real-World Use Cases
Training ResNet/EfficientNet on ImageNet: SGD+momentum with cosine annealing is standard; Adam sometimes used but may generalize worse. Momentum coefficient 0.9, initial LR 0.1-0.4.
Training BERT/GPT: Adam with β=(0.9, 0.999), ε=1e-8, warmup + linear decay. AdamW preferred for decoupled weight decay. Peak LR 1e-4 to 5e-4.
Matrix factorization for collaborative filtering: L-BFGS for small-to-medium scale; SGD with an adaptive LR for large scale, since it handles sparse gradients efficiently.
Training shallow networks: Adam works well out of the box; SGD+momentum with careful tuning can match it. (Gradient-boosted trees are fit by their own boosting procedure, so these optimizers do not apply to them.)
Training PPO/SAC policies: Adam is standard due to noisy gradients; a larger ε (1e-5 instead of the default 1e-8) often helps. LR is typically 3e-4.
Fine-tuning pretrained models: smaller LR (1e-5 to 1e-3), often SGD or Adam with cosine decay. Layer-wise LR decay common.
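The learning-rate schedules named in these use cases (cosine annealing, warmup followed by linear decay, and layer-wise LR decay) can be sketched in a few lines of plain Python. This is a minimal sketch, not any library's implementation: the function names, argument names, and the example step counts are assumptions made for illustration, and real frameworks may differ in details such as where warmup starts.

```python
import math


def cosine_annealing_lr(step, total_steps, lr_max, lr_min=0.0):
    """Cosine annealing: decay smoothly from lr_max to lr_min over total_steps."""
    progress = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


def warmup_linear_decay_lr(step, total_steps, warmup_steps, peak_lr):
    """Linear warmup from 0 to peak_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)


def layerwise_lrs(base_lr, n_layers, decay=0.9):
    """Layer-wise LR decay: the top layer gets base_lr, each lower layer gets
    the LR of the layer above it multiplied by `decay`."""
    return [base_lr * decay ** (n_layers - 1 - i) for i in range(n_layers)]


# Vision-style schedule: initial LR 0.1 annealed over 100 epochs.
print([round(cosine_annealing_lr(e, 100, 0.1), 4) for e in (0, 25, 50, 75, 100)])

# Transformer-style schedule: peak LR 1e-4, 1,000 warmup steps out of 10,000.
print([warmup_linear_decay_lr(s, 10_000, 1_000, 1e-4) for s in (0, 500, 1_000, 10_000)])

# Fine-tuning: three layers, lower layers updated more conservatively.
print(layerwise_lrs(1e-3, 3, decay=0.5))
```

Each function returns the LR for one step, so it can be called once per epoch or per batch and the result written into whatever optimizer is in use.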
Implementation
Manual Implementation (No Libraries)
import time

import numpy as np


def compare_optimizers(X, y, X_val, y_val, optimizers, n_epochs=100):
    """
    Compare multiple optimizers on the same least-squares problem.

    Args:
        X, y: Training data
        X_val, y_val: Validation data
        optimizers: Dict of {name: {'init': fn, 'update': fn}}
        n_epochs: Number of full-batch epochs to train
    """
    results = {}
    for name, opt_fn in optimizers.items():
        print(f'\n=== Training with {name} ===')
        # Initialize model from a fixed seed so every optimizer starts at the same point
        n_features = X.shape[1]
        theta = np.random.default_rng(0).standard_normal(n_features) * 0.01
        # Optimizer state
        state = opt_fn['init'](theta)
        train_losses = []
        val_losses = []
        start_time = time.time()
        for epoch in range(n_epochs):
            # Compute full-batch gradient of 0.5 * mean squared error
            y_pred = X @ theta
            grad = X.T @ (y_pred - y) / len(X)
            # Update using optimizer
            theta, state = opt_fn['update'](theta, grad, state)
            # Track metrics
            train_loss = np.mean((X @ theta - y) ** 2)
            val_loss = np.mean((X_val @ theta - y_val) ** 2)
            train_losses.append(train_loss)
            val_losses.append(val_loss)
        elapsed = time.time() - start_time
        results[name] = {
            'final_train_loss': train_losses[-1],
            'final_val_loss': val_losses[-1],
            'train_losses': train_losses,
            'val_losses': val_losses,
            'time': elapsed,
            # First epoch whose validation loss is within 10% of the best one seen
            'convergence_epoch': next((i for i, l in enumerate(val_losses)
                                       if l < 1.1 * min(val_losses)), n_epochs)
        }
        print(f'Final train loss: {train_losses[-1]:.6f}')
        print(f'Final val loss: {val_losses[-1]:.6f}')
        print(f'Time: {elapsed:.3f}s')
    return results
# Define optimizers: each entry has 'init' (build state) and 'update' (return new theta, new state)
optimizers = {
    'SGD': {
        'init': lambda theta: {'lr': 0.01},
        'update': lambda t, g, s: (t - s['lr'] * g, s)
    },
    'SGD+Momentum': {
        'init': lambda theta: {'lr': 0.01, 'beta': 0.9, 'v': np.zeros_like(theta)},
        # Velocity v accumulates past gradients: v <- beta * v + g
        'update': lambda t, g, s: (
            (t - s['lr'] * (s['v'] * s['beta'] + g),
             {**s, 'v': s['beta'] * s['v'] + g})
        )
    },
    'RMSprop': {
        'init': lambda theta: {'lr': 0.01, 'beta': 0.9, 'eps': 1e-8,
                               'v': np.zeros_like(theta)},
        # v is a running average of squared gradients used to scale each step
        'update': lambda t, g, s: (
            (t - s['lr'] * g / (np.sqrt(s['v'] * s['beta'] + (1-s['beta'])*g**2) + s['eps']),
             {**s, 'v': s['beta'] * s['v'] + (1-s['beta'])*g**2})
        )
    },
    'Adam': {
        'init': lambda theta: {'lr': 0.01, 'beta1': 0.9, 'beta2': 0.999,
                               'eps': 1e-8, 'm': np.zeros_like(theta),
                               'v': np.zeros_like(theta), 't': 0},
        # Bias-corrected moments: m_hat = m / (1 - beta1^t), v_hat = v / (1 - beta2^t)
        'update': lambda t, g, s: (
            lambda m, v, step: (
                t - s['lr'] * (m / (1 - s['beta1']**step))
                / (np.sqrt(v / (1 - s['beta2']**step)) + s['eps']),
                {**s, 'm': m, 'v': v, 't': step}
            )
        )(s['beta1']*s['m'] + (1-s['beta1'])*g,
          s['beta2']*s['v'] + (1-s['beta2'])*g**2,
          s['t'] + 1)
    }
}
# Generate data
np.random.seed(42)
n_samples, n_features = 1000, 10
X = np.random.randn(n_samples, n_features)
true_theta = np.random.randn(n_features)
y = X @ true_theta + np.random.randn(n_samples) * 0.1
# Split train/val
split = int(0.8 * n_samples)
X_train, X_val = X[:split], X[split:]
y_train, y_val = y[:split], y[split:]
# Compare
results = compare_optimizers(X_train, y_train, X_val, y_val, optimizers, n_epochs=100)
# Summary
print('\n=== Optimizer Comparison Summary ===')
for name, res in results.items():
    print(f'{name:15s}: Val Loss={res["final_val_loss"]:.6f}, ' +
          f'Time={res["time"]:.3f}s, Converged@E{res["convergence_epoch"]}')
Using Libraries (torch.optim.*, tensorflow.keras.optimizers.*)
import time

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset


def benchmark_optimizer(model_class, dataset, optimizer_class,
                        optimizer_kwargs, epochs=10):
    """Benchmark a specific optimizer configuration."""
    model = model_class()
    optimizer = optimizer_class(model.parameters(), **optimizer_kwargs)
    criterion = nn.MSELoss()
    dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
    losses = []
    start = time.time()
    for epoch in range(epochs):
        epoch_loss = 0
        for X, y in dataloader:
            optimizer.zero_grad()
            pred = model(X)
            loss = criterion(pred, y)
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()
        # Mean mini-batch loss for this epoch
        losses.append(epoch_loss / len(dataloader))
    elapsed = time.time() - start
    return losses, elapsed


# Common optimizer configurations: name -> (optimizer class, keyword arguments)
configs = {
    'SGD': (torch.optim.SGD, {'lr': 0.01}),
    'SGD+Momentum': (torch.optim.SGD, {'lr': 0.01, 'momentum': 0.9}),
    'RMSprop': (torch.optim.RMSprop, {'lr': 0.001}),
    'Adam': (torch.optim.Adam, {'lr': 0.001}),
    'AdamW': (torch.optim.AdamW, {'lr': 0.001, 'weight_decay': 0.01}),
    'Adagrad': (torch.optim.Adagrad, {'lr': 0.01})
}
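# TensorFlow
import tensorflow as tf


def get_tf_optimizer(name):
    """Return a freshly constructed Keras optimizer for `name`, or None if unknown."""
    # Factories defer construction so only the requested optimizer is instantiated
    factories = {
        'SGD': lambda: tf.keras.optimizers.SGD(learning_rate=0.01),
        'SGD+Momentum': lambda: tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9),
        'RMSprop': lambda: tf.keras.optimizers.RMSprop(learning_rate=0.001),
        'Adam': lambda: tf.keras.optimizers.Adam(learning_rate=0.001),
        # AdamW is only available in newer TensorFlow/Keras releases
        'AdamW': lambda: tf.keras.optimizers.AdamW(learning_rate=0.001, weight_decay=0.01)
    }
    factory = factories.get(name)
    return factory() if factory is not None else None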
When to Use
✅ Appropriate Use Cases:
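- SGD: simple baselines and convex problems where memory is tight, since it keeps no optimizer state; expect slower convergence and more LR tuning
- SGD+Momentum: convolutional vision models such as ResNet/EfficientNet, and final training runs where generalization matters and there is budget for tuning
- Adam: a sensible default for fast iteration, noisy gradients (for example reinforcement learning), and problems where little tuning is possible
- AdamW: Transformer models such as BERT/GPT, and any Adam setup that uses weight decay
- RMSprop: an adaptive alternative to Adam, commonly used for recurrent networks and non-stationary objectives
- L-BFGS: small-to-medium, smooth, ideally convex problems trained full-batch
- Adagrad: sparse features or gradients (for example embeddings); note that its effective learning rate only shrinks over time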
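The recommendations in this guide can be collected into a small lookup helper. The sketch below is purely illustrative: `suggest_optimizer`, its arguments, the architecture labels, and the 100,000-parameter threshold are all hypothetical, and the rules are rough heuristics drawn from this guide rather than a definitive procedure. The returned name is a starting point for experiments, not a guarantee.

```python
def suggest_optimizer(architecture="other", convex=False, full_batch=False,
                      n_params=0, sparse_gradients=False):
    """Return a starting-point optimizer name using this guide's rules of thumb.

    All thresholds and labels here are assumptions made for illustration.
    """
    # Small convex problems trained on the full dataset suit quasi-Newton methods.
    if convex and full_batch and n_params <= 100_000:
        return "L-BFGS"
    # Sparse features or gradients benefit from per-parameter learning rates.
    if sparse_gradients:
        return "Adagrad"
    # Architecture-specific defaults described in the use cases.
    by_architecture = {
        "cnn": "SGD+Momentum",     # e.g. ResNet/EfficientNet on ImageNet
        "transformer": "AdamW",    # e.g. BERT/GPT with decoupled weight decay
        "rl": "Adam",              # e.g. PPO/SAC with noisy gradients
    }
    # Fall back to Adam, which usually works without much tuning.
    return by_architecture.get(architecture, "Adam")


print(suggest_optimizer(architecture="cnn"))
print(suggest_optimizer(convex=True, full_batch=True, n_params=500))
print(suggest_optimizer(architecture="transformer"))
```

Whatever the helper returns, the comparison advice still applies: for a final model it is worth benchmarking the suggestion against at least one alternative.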
Common Pitfalls
- Always using Adam without trying SGD: Adam converges faster initially but may reach worse final minima than well-tuned SGD+momentum. Solution: for final production models, compare Adam vs SGD+momentum. Try Adam for fast iteration, SGD for final training.
- Wrong learning rate for batch size: increasing batch size without scaling LR causes slower convergence; the linear scaling rule applies. Solution: when batch size increases by k, multiply LR by k (with warmup). Use an LR finder to find a good initial value.
- Using Adam with L2 regularization: standard weight decay (L2) doesn't work correctly with Adam's adaptive learning rates. Solution: use AdamW, which properly decouples weight decay from gradient updates.
- Not using learning rate scheduling: a fixed LR throughout training is almost always suboptimal. Solution: always use some form of decay: cosine annealing for vision, linear for NLP, step decay as a simple alternative.
- Inappropriate optimizer for problem type: using L-BFGS with mini-batches or Adam for small convex problems wastes resources. Solution: match the optimizer to the problem: L-BFGS for convex full-batch, SGD/Adam for stochastic deep learning.
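The Adam-with-L2 pitfall above is easier to see with numbers. The sketch below isolates only the weight-decay part of a single update and holds Adam's second-moment estimate `v` fixed, which is a deliberate simplification (in a real run the L2 term also feeds into the moment estimates). The function names and the values of `lr`, `wd`, and `v` are hypothetical. With L2 folded into the gradient, the decay term is divided by `sqrt(v)`, so parameters with a history of large gradients are barely regularized; the decoupled form used by AdamW shrinks every weight by the same fraction.

```python
def l2_decay_step(theta, v, lr=1e-3, wd=0.01, eps=1e-8):
    """Decay part of an Adam step when L2 is added to the gradient.

    The penalty gradient wd * theta is rescaled by Adam's denominator
    sqrt(v) + eps, so its size depends on the gradient history.
    """
    return lr * (wd * theta) / (v ** 0.5 + eps)


def decoupled_decay_step(theta, lr=1e-3, wd=0.01):
    """Decay part of an AdamW step: applied directly, independent of v."""
    return lr * wd * theta


theta = 1.0
for label, v in [("small past gradients", 1e-4), ("large past gradients", 100.0)]:
    print(label,
          "| Adam+L2 decay:", l2_decay_step(theta, v),
          "| AdamW decay:", decoupled_decay_step(theta))
```

Under these assumed values, the same weight is decayed about a thousand times more strongly in the small-gradient case than in the large-gradient case with Adam+L2, while the decoupled decay is identical in both.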