Optimizer Selection Guide


Definition

Optimizer selection is the process of choosing an optimization algorithm for a given machine learning problem based on dataset characteristics, model architecture, computational constraints, and desired convergence properties. Dozens of optimizers are available, from plain SGD to adaptive methods like Adam and second-order techniques, so understanding their trade-offs is crucial for efficient training. The choice can decide whether training converges in minutes or hours, whether the model reaches, say, 95% or 99% accuracy, and whether training succeeds or diverges. This guide provides a systematic framework for selecting optimizers based on problem type, scale, convexity, and practical constraints. Selection criteria include convergence speed, memory requirements, sensitivity to hyperparameters, robustness to gradient noise, and suitability for specific architectures.

Intuition

Think of choosing an optimizer like selecting a vehicle for a journey. SGD is like walking—simple, works anywhere, but slow. SGD with momentum is like a bicycle—adds speed with minimal complexity. Adam is like a modern electric car—adaptive, easy to drive, works well in most conditions without much tuning. L-BFGS is like a sports car—fast and efficient on highways (convex problems) but terrible off-road (stochastic). Newton's method is like a rocket—incredibly fast but requires perfect conditions and is prohibitively expensive for most trips. The terrain matters: smooth roads (convex problems) favor different vehicles than rocky trails (non-convex deep learning). Distance matters: short trips can use any method, but cross-country journeys need efficiency. Cargo matters: carrying heavy loads (large models) limits your options. The best optimizer isn't universally 'best'—it's the one that matches your specific journey's constraints.

Mathematical Formula

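\[
\begin{aligned}
& \text{Selection Criteria} \\
\text{Convergence:} \quad & O(1/t) \text{ (GD)}, \quad O(1/\sqrt{t}) \text{ (SGD)}, \quad O(\rho^{2^t}) \text{ (Newton)} \\
\text{Memory:} \quad & O(d) \text{ (1st order)}, \quad O(d) \text{ (momentum)}, \quad O(2d) \text{ (Adam)}, \quad O(md) \text{ (L-BFGS)}, \quad O(d^2) \text{ (Newton)} \\
\text{Per-step cost:} \quad & O(d) \text{ (GD/SGD)}, \quad O(d) \text{ (Adam)}, \quad O(md) \text{ (L-BFGS)}, \quad O(d^3) \text{ (Newton)} \\
& \text{where } d = \text{parameters}, \; m = \text{L-BFGS history size}, \; t = \text{iterations}, \; 0 < \rho < 1
\end{aligned}
\]

Per-step cost counts only the parameter update, not the gradient evaluation. The Newton cost assumes a dense Hessian solve.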

Step-by-Step Explanation:

  1. Convergence rate indicates how quickly error decreases: Newton's quadratic rate is fastest, while the sublinear rates O(1/t) and O(1/√t) of GD and SGD are slowest
  2. Memory complexity shows per-iteration storage requirements for optimizer state
  3. Per-step cost indicates computational expense of each parameter update
  4. Trade-off: faster convergence often requires more memory and computation per step
  5. No free lunch: best optimizer depends on problem characteristics
  6. Practical selection balances theoretical guarantees with empirical performance
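To make the memory trade-off concrete, here is a minimal sketch that estimates the extra optimizer state each method keeps beyond the parameters themselves. The function name `optimizer_state_bytes`, the float32 assumption (4 bytes per value), the L-BFGS history size of 10, and the 25M-parameter model are all illustrative assumptions, not values from a specific library.

```python
def optimizer_state_bytes(n_params, optimizer, history=10, bytes_per_value=4):
    """Rough size of the optimizer state, excluding parameters and gradients.

    Assumes dense float32 buffers; `history` is the number of (s, y)
    vector pairs an L-BFGS implementation keeps (hypothetical default).
    """
    if optimizer == "newton":
        # A dense Hessian has n_params * n_params entries.
        return n_params * n_params * bytes_per_value

    buffers_per_param = {
        "sgd": 0,               # no extra state
        "momentum": 1,          # one velocity buffer
        "rmsprop": 1,           # one squared-gradient average
        "adam": 2,              # first and second moment buffers
        "lbfgs": 2 * history,   # `history` pairs of (s, y) vectors
    }
    return buffers_per_param[optimizer] * n_params * bytes_per_value


d = 25_000_000  # hypothetical model size
for name in ["sgd", "momentum", "adam", "lbfgs", "newton"]:
    gib = optimizer_state_bytes(d, name) / 2**30
    print(f"{name:9s}: {gib:,.2f} GiB of optimizer state")
```

The quadratic growth of the Newton entry is what rules out exact second-order methods for large models, while the first-order methods differ only by a small constant factor.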

Real-World Use Cases

Computer Vision

Training ResNet/EfficientNet on ImageNet: SGD+momentum with cosine annealing is standard; Adam sometimes used but may generalize worse. Momentum coefficient 0.9, initial LR 0.1-0.4.
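The cosine annealing schedule mentioned above can be sketched in a few lines. This is a single-cycle version without restarts; the function name `cosine_annealing_lr` and the defaults (`base_lr=0.1`, `min_lr=0.0`) are assumptions chosen to match the LR range quoted here, not a particular framework's API.

```python
import math


def cosine_annealing_lr(step, total_steps, base_lr=0.1, min_lr=0.0):
    """Learning rate at `step` under a single-cycle cosine schedule."""
    # Clamp so that steps past the end stay at min_lr.
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# LR at the start, middle, and end of a hypothetical 90-step run.
for step in (0, 45, 90):
    print(step, cosine_annealing_lr(step, total_steps=90))
```

The schedule starts at `base_lr`, passes through the midpoint between `base_lr` and `min_lr` halfway through training, and ends at `min_lr`.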

NLP Pretraining

Training BERT/GPT: Adam with β=(0.9, 0.999), ε=1e-8, warmup + linear decay. AdamW preferred for decoupled weight decay. Peak LR 1e-4 to 5e-4.
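As a rough illustration of "warmup + linear decay", the sketch below ramps the learning rate linearly from zero to a peak and then decays it linearly back to zero. The function name `warmup_linear_decay_lr` and the step counts in the usage lines are hypothetical; the peak of 1e-4 is taken from the range quoted above.

```python
def warmup_linear_decay_lr(step, total_steps, warmup_steps, peak_lr=1e-4):
    """Linear warmup to `peak_lr`, then linear decay to zero at `total_steps`."""
    if step < warmup_steps:
        # Warmup phase: 0 at step 0, approaching peak_lr at warmup_steps.
        return peak_lr * step / warmup_steps
    # Decay phase: peak_lr at warmup_steps, 0 at total_steps and beyond.
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / max(total_steps - warmup_steps, 1)


# Hypothetical 10,000-step run with 1,000 warmup steps.
for step in (0, 500, 1000, 5500, 10000):
    print(step, warmup_linear_decay_lr(step, total_steps=10000, warmup_steps=1000))
```

Warmup keeps early Adam updates small while the moment estimates are still unreliable, which is one common motivation for pairing it with large peak learning rates.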

Recommendation Systems

Matrix factorization for collaborative filtering: L-BFGS for small-to-medium scale; SGD with adaptive LR for large scale, since per-parameter adaptive rates handle sparse gradients efficiently.

Tabular Data

Training shallow networks on tabular data: Adam works well out of the box; SGD+momentum with careful tuning can match it. Gradient-boosted trees are fit by their own boosting procedure rather than by these optimizers.

Reinforcement Learning

Training PPO/SAC policies: Adam is standard due to noisy gradients; a larger ε than the 1e-8 default (e.g. 1e-5) often helps. LR typically 3e-4.

Transfer Learning

Fine-tuning pretrained models: smaller LR (1e-5 to 1e-3), often SGD or Adam with cosine decay. Layer-wise LR decay common.
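Layer-wise LR decay can be sketched as a geometric schedule over depth: the layer closest to the output gets the full fine-tuning LR, and each layer below it gets a fraction of the rate above. The function name `layerwise_lrs` and the decay factor of 0.9 are illustrative assumptions.

```python
def layerwise_lrs(n_layers, top_lr=1e-4, decay=0.9):
    """Per-layer learning rates; index 0 is the layer closest to the input.

    The last layer receives `top_lr`, and each earlier layer receives
    `decay` times the rate of the layer after it.
    """
    return [top_lr * decay ** (n_layers - 1 - i) for i in range(n_layers)]


# Hypothetical 12-layer encoder.
for i, lr in enumerate(layerwise_lrs(12)):
    print(f"layer {i:2d}: lr = {lr:.2e}")
```

In a framework such as PyTorch these values would typically be passed as one parameter group per layer, so that early layers, which hold more general pretrained features, change more slowly than the task-specific layers near the output.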

Implementation

Manual Implementation (No Libraries)

The implementation is a small framework for comparing optimizers on the same linear regression problem with full-batch gradients. It tracks final loss, convergence speed, and wall-clock time, so the empirical comparison can complement the theoretical analysis. Different optimizers produce different convergence curves: Adam is typically fastest initially, while SGD+momentum often reaches the best final performance.
import numpy as np
import time

def compare_optimizers(X, y, X_val, y_val, optimizers, n_epochs=100):
    """
    Compare multiple optimizers on the same problem.
    
    Args:
        X, y: Training data
        X_val, y_val: Validation data
        optimizers: Dict of {name: optimizer_fn}
        n_epochs: Number of epochs to train
    """
    results = {}
    
    for name, opt_fn in optimizers.items():
        print(f'\n=== Training with {name} ===')
        
        # Initialize model
        n_features = X.shape[1]
        theta = np.random.randn(n_features) * 0.01
        
        # Optimizer state
        state = opt_fn['init'](theta)
        
        train_losses = []
        val_losses = []
        start_time = time.time()
        
        for epoch in range(n_epochs):
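            # Compute the full-batch gradient of the MSE loss tracked below
            # (the factor 2 comes from differentiating the squared error)
            y_pred = X @ theta
            grad = 2 * X.T @ (y_pred - y) / len(X)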
            
            # Update using optimizer
            theta, state = opt_fn['update'](theta, grad, state)
            
            # Track metrics
            train_loss = np.mean((X @ theta - y) ** 2)
            val_loss = np.mean((X_val @ theta - y_val) ** 2)
            train_losses.append(train_loss)
            val_losses.append(val_loss)
        
        elapsed = time.time() - start_time
        
        results[name] = {
            'final_train_loss': train_losses[-1],
            'final_val_loss': val_losses[-1],
            'train_losses': train_losses,
            'val_losses': val_losses,
            'time': elapsed,
            'convergence_epoch': next((i for i, l in enumerate(val_losses)
                                      if l < 1.1 * min(val_losses)), n_epochs)
        }
        
        print(f'Final train loss: {train_losses[-1]:.6f}')
        print(f'Final val loss: {val_losses[-1]:.6f}')
        print(f'Time: {elapsed:.3f}s')
    
    return results

# Define optimizers
optimizers = {
    'SGD': {
        'init': lambda theta: {'lr': 0.01},
        'update': lambda t, g, s: (t - s['lr'] * g, s)
    },
    'SGD+Momentum': {
        'init': lambda theta: {'lr': 0.01, 'beta': 0.9, 'v': np.zeros_like(theta)},
        'update': lambda t, g, s: (
            (t - s['lr'] * (s['v'] * s['beta'] + g),
             {**s, 'v': s['beta'] * s['v'] + g})
        )
    },
    'RMSprop': {
        'init': lambda theta: {'lr': 0.01, 'beta': 0.9, 'eps': 1e-8,
                               'v': np.zeros_like(theta)},
        'update': lambda t, g, s: (
            (t - s['lr'] * g / (np.sqrt(s['v'] * s['beta'] + (1-s['beta'])*g**2) + s['eps']),
             {**s, 'v': s['beta'] * s['v'] + (1-s['beta'])*g**2})
        )
    },
    'Adam': {
        'init': lambda theta: {'lr': 0.01, 'beta1': 0.9, 'beta2': 0.999,
                               'eps': 1e-8, 'm': np.zeros_like(theta),
                               'v': np.zeros_like(theta), 't': 0},
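        'update': lambda t, g, s: (
            lambda m, v, step: (
                # Bias-corrected moments: m / (1 - beta1^step), v / (1 - beta2^step)
                t - s['lr'] * (m / (1 - s['beta1']**step))
                    / (np.sqrt(v / (1 - s['beta2']**step)) + s['eps']),
                {**s, 'm': m, 'v': v, 't': step}
            )
        )(s['beta1']*s['m']+(1-s['beta1'])*g,
          s['beta2']*s['v']+(1-s['beta2'])*g**2,
          s['t']+1)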
    }
}

# Generate data
np.random.seed(42)
n_samples, n_features = 1000, 10
X = np.random.randn(n_samples, n_features)
true_theta = np.random.randn(n_features)
y = X @ true_theta + np.random.randn(n_samples) * 0.1

# Split train/val
split = int(0.8 * n_samples)
X_train, X_val = X[:split], X[split:]
y_train, y_val = y[:split], y[split:]

# Compare
results = compare_optimizers(X_train, y_train, X_val, y_val, optimizers, n_epochs=100)

# Summary
print('\n=== Optimizer Comparison Summary ===')
for name, res in results.items():
    print(f'{name:15s}: Val Loss={res["final_val_loss"]:.6f}, ' +
          f'Time={res["time"]:.3f}s, Converged@E{res["convergence_epoch"]}')

Using Libraries (torch.optim.*, tensorflow.keras.optimizers.*)

import time

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def benchmark_optimizer(model_class, dataset, optimizer_class, 
                        optimizer_kwargs, epochs=10):
    """Benchmark a specific optimizer configuration."""
    model = model_class()
    optimizer = optimizer_class(model.parameters(), **optimizer_kwargs)
    criterion = nn.MSELoss()
    
    dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
    
    losses = []
    start = time.time()
    
    for epoch in range(epochs):
        epoch_loss = 0
        for X, y in dataloader:
            optimizer.zero_grad()
            pred = model(X)
            loss = criterion(pred, y)
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()
        losses.append(epoch_loss / len(dataloader))
    
    elapsed = time.time() - start
    return losses, elapsed

# Common optimizer configurations
configs = {
    'SGD': (torch.optim.SGD, {'lr': 0.01}),
    'SGD+Momentum': (torch.optim.SGD, {'lr': 0.01, 'momentum': 0.9}),
    'RMSprop': (torch.optim.RMSprop, {'lr': 0.001}),
    'Adam': (torch.optim.Adam, {'lr': 0.001}),
    'AdamW': (torch.optim.AdamW, {'lr': 0.001, 'weight_decay': 0.01}),
    'Adagrad': (torch.optim.Adagrad, {'lr': 0.01})
}

# TensorFlow
import tensorflow as tf

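def get_tf_optimizer(name):
    """Return a freshly built Keras optimizer for `name`, or None if unknown."""
    # Factories, so only the requested optimizer is instantiated.
    factories = {
        'SGD': lambda: tf.keras.optimizers.SGD(learning_rate=0.01),
        'SGD+Momentum': lambda: tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9),
        'RMSprop': lambda: tf.keras.optimizers.RMSprop(learning_rate=0.001),
        'Adam': lambda: tf.keras.optimizers.Adam(learning_rate=0.001),
        # AdamW is only available under this name in recent TensorFlow releases.
        'AdamW': lambda: tf.keras.optimizers.AdamW(learning_rate=0.001, weight_decay=0.01)
    }
    factory = factories.get(name)
    return factory() if factory is not None else None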

When to Use

✅ Appropriate Use Cases:

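  • SGD: simple baselines and memory-constrained training, since it keeps no extra optimizer state
  • SGD+Momentum: vision models such as ResNet/EfficientNet, and final training runs where careful tuning is affordable
  • Adam: fast iteration with little tuning, and noisy-gradient settings such as reinforcement learning
  • AdamW: training with weight decay, e.g. BERT/GPT-style pretraining
  • RMSprop: adaptive per-parameter step sizes with one state buffer instead of Adam's two
  • L-BFGS: small-to-medium convex problems trained on the full batch
  • Adagrad: problems with sparse gradients, such as models with large sparse feature sets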
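The recommendations scattered through this guide can be condensed into a rule-of-thumb selector. The sketch below is only one possible encoding: the function name `suggest_optimizer`, its arguments, and the order of the checks are assumptions made for illustration, and real projects should still compare candidates empirically.

```python
def suggest_optimizer(convex, full_batch, uses_weight_decay=False,
                      sparse_gradients=False, tuning_budget="low"):
    """Return a starting-point optimizer name from coarse problem traits.

    A heuristic sketch, not a guarantee: it only encodes the rules of
    thumb from this guide.
    """
    if convex and full_batch:
        # Deterministic gradients on a convex problem suit quasi-Newton methods.
        return "L-BFGS"
    if sparse_gradients:
        # Per-parameter adaptive rates cope with rarely updated features.
        return "Adagrad"
    if tuning_budget == "high":
        # Well-tuned SGD+momentum often reaches the best final performance.
        return "SGD+Momentum"
    # Adaptive default; AdamW whenever weight decay is part of the recipe.
    return "AdamW" if uses_weight_decay else "Adam"


print(suggest_optimizer(convex=True, full_batch=True))
print(suggest_optimizer(convex=False, full_batch=False, uses_weight_decay=True))
print(suggest_optimizer(convex=False, full_batch=False, tuning_budget="high"))
```

Treat the returned name as the first configuration to try, then benchmark it against at least one alternative on the actual task.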

Common Pitfalls

  • Always using Adam without trying SGD: Adam converges faster initially but may reach worse final minima than well-tuned SGD+momentum. Solution: For final production models, compare Adam vs SGD+momentum. Try Adam for fast iteration, SGD for final training.
  • Wrong learning rate for batch size: Increasing batch size without scaling LR causes slower convergence; the linear scaling rule applies. Solution: When batch size increases by k, multiply LR by k (with warmup). Use an LR finder to find a good initial value.
  • Using Adam with L2 regularization: Standard weight decay implemented as an L2 penalty doesn't work correctly with Adam's adaptive learning rates. Solution: Use AdamW, which properly decouples weight decay from gradient updates.
  • Not using learning rate scheduling: Fixed LR throughout training is almost always suboptimal. Solution: Always use some form of decay: cosine annealing for vision, linear for NLP, step decay as a simple alternative.
  • Inappropriate optimizer for problem type: Using L-BFGS with mini-batches or Adam for small convex problems wastes resources. Solution: Match optimizer to problem: L-BFGS for convex full-batch, SGD/Adam for stochastic deep learning.
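The Adam-with-L2 pitfall can be seen in a single update step. The sketch below is a scalar toy, assuming the first step from zero-initialized moments with bias correction; the function name `adam_step` and the values `lr=0.001`, `wd=0.01` are illustrative. It compares how much extra shrinkage the weight receives from an L2 penalty folded into the gradient versus from decoupled (AdamW-style) decay.

```python
import math


def adam_step(w, g, lr=0.001, wd=0.01, eps=1e-8, decoupled=False):
    """First Adam step (t = 1) for one scalar weight with zero initial moments."""
    beta1, beta2 = 0.9, 0.999
    if not decoupled:
        g = g + wd * w              # L2 penalty folded into the gradient
    m = (1 - beta1) * g
    v = (1 - beta2) * g * g
    m_hat = m / (1 - beta1)         # bias correction at t = 1
    v_hat = v / (1 - beta2)
    w_new = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    if decoupled:
        w_new -= lr * wd * w        # AdamW: decay applied directly to the weight
    return w_new


for g in (10.0, 0.01):
    no_decay = adam_step(1.0, g, wd=0.0)
    with_l2 = adam_step(1.0, g, decoupled=False)
    with_adamw = adam_step(1.0, g, decoupled=True)
    print(f"g={g:>5}: extra shrink from L2 = {no_decay - with_l2:.2e}, "
          f"from AdamW = {no_decay - with_adamw:.2e}")
```

With the L2 variant, the penalty passes through Adam's normalization and contributes almost nothing beyond the ordinary step, whereas the decoupled variant shrinks the weight by `lr * wd * w` regardless of the gradient scale.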