Learning Rate Scheduling

Definition

Learning rate scheduling is the practice of systematically adjusting the learning rate during training to improve convergence and final model performance. The learning rate is one of the most critical hyperparameters in gradient-based optimization: too high a value causes divergence or oscillation, while too low a value leads to slow convergence or leaves the model stuck in poor local minima. Rather than using a fixed learning rate, scheduling strategies vary it according to predefined rules or training progress, in most cases reducing it over time. Common schedules include step decay (reducing by a factor every N epochs), exponential decay (continuous geometric decrease), cosine annealing (decay along a smooth cosine curve), and warmup (starting small and increasing). Proper scheduling can mean the difference between a model that converges poorly and one that achieves state-of-the-art performance.

Intuition

Think of learning rate as step size when hiking down a mountain. At the beginning, you're far from the destination and the terrain is steep—you want large steps to cover ground quickly (high learning rate). As you approach the valley floor, large steps cause you to overshoot and oscillate around the minimum—you need smaller, more careful steps (lower learning rate). Warmup is like starting your hike slowly to avoid tripping before you find your footing. Step decay is like deciding to take half-size steps after reaching certain milestones. Cosine annealing is like smoothly transitioning from big strides to tiny steps following a natural deceleration curve. Cyclical schedules are like periodically taking bigger steps again to escape local potholes you might have settled into. The goal is to move fast when far away and precisely when close, avoiding both sluggish progress and unstable oscillations.

Mathematical Formula

\text{Step Decay:} \quad \eta_t = \eta_0 \times \gamma^{\lfloor t / s \rfloor}

\text{Exponential Decay:} \quad \eta_t = \eta_0 \times e^{-kt}

\text{Cosine Annealing:} \quad \eta_t = \eta_{min} + \frac{1}{2}(\eta_{max} - \eta_{min})\left(1 + \cos\left(\frac{t}{T}\pi\right)\right)

\text{Warmup:} \quad \eta_t = \eta_{max} \times \min\left(1, \frac{t}{w}\right)

\text{Polynomial:} \quad \eta_t = \eta_0 \times \left(1 - \frac{t}{T}\right)^p

Step-by-Step Explanation:

Step Decay: Initial rate \(\eta_0\) multiplied by decay factor \(\gamma\) (typically 0.1-0.5) every s steps/epochs
Exponential Decay: Continuous decay with rate k; learning rate decreases smoothly as \(e^{-kt}\)
Cosine Annealing: Follows half-cosine curve from \(\eta_{max}\) to \(\eta_{min}\) over T steps; creates smooth deceleration
Warmup: Linearly increases from 0 to \(\eta_{max}\) over first w warmup steps; prevents early instability
Polynomial: Decay following polynomial curve with power p (typically 1 for linear, 2 for quadratic)
All schedules can be combined: warmup + cosine is common for transformers
Schedules apply to the base learning rate; adaptive optimizers apply their own per-parameter scaling
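
To make the formulas above concrete, the following sketch evaluates each one at a single point in training. The specific values (\(\eta_0 = 0.1\), T = 100, \(\gamma = 0.5\), s = 30, k = 0.05, w = 10, p = 2) are illustrative assumptions, not recommendations.

```python
import math

def step_decay(eta0, t, gamma, s):
    # eta_t = eta0 * gamma^floor(t / s)
    return eta0 * gamma ** (t // s)

def exponential_decay(eta0, t, k):
    # eta_t = eta0 * e^(-k t)
    return eta0 * math.exp(-k * t)

def cosine_annealing(eta_max, eta_min, t, T):
    # eta_t = eta_min + 0.5 * (eta_max - eta_min) * (1 + cos(pi * t / T))
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t / T))

def warmup(eta_max, t, w):
    # eta_t = eta_max * min(1, t / w)
    return eta_max * min(1.0, t / w)

def polynomial_decay(eta0, t, T, p):
    # eta_t = eta0 * (1 - t / T)^p
    return eta0 * (1 - t / T) ** p

eta0, T = 0.1, 100  # assumed values, for illustration only
values = {
    "step": step_decay(eta0, 50, gamma=0.5, s=30),        # floor(50/30) = 1, so 0.1 * 0.5 = 0.05
    "exponential": exponential_decay(eta0, 50, k=0.05),   # 0.1 * e^(-2.5), about 0.00821
    "cosine": cosine_annealing(eta0, 0.0, 50, T),         # halfway point of the curve: 0.05
    "warmup": warmup(eta0, 5, w=10),                      # halfway through warmup: 0.05
    "polynomial": polynomial_decay(eta0, 50, T, p=2),     # 0.1 * 0.5^2 = 0.025
}

for name, lr in values.items():
    print(f"{name:12s} {lr:.5f}")
```

At the midpoint of training, step decay, cosine annealing, and warmup happen to give the same rate under these settings, while exponential and quadratic polynomial decay are already much lower.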

Real-World Use Cases

Computer Vision

Training ResNet on ImageNet typically uses step decay (reduce LR by 0.1 at epochs 30, 60, 90) or cosine annealing. Proper scheduling is crucial for reaching competitive accuracy.
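
A milestone-based schedule like the one described above can be sketched in a few lines. The function name `multistep_lr` and the base rate of 0.1 are assumptions made for illustration; only the milestones (30, 60, 90) and the factor 0.1 come from the description.

```python
from bisect import bisect_right

def multistep_lr(base_lr, epoch, milestones=(30, 60, 90), gamma=0.1):
    """Return the LR after every milestone drop that has occurred by `epoch`."""
    # Number of milestones that are <= epoch
    drops = bisect_right(sorted(milestones), epoch)
    return base_lr * gamma ** drops

for epoch in (0, 29, 30, 60, 90):
    print(epoch, multistep_lr(0.1, epoch))
```

In PyTorch, `torch.optim.lr_scheduler.MultiStepLR` provides the same behavior when given a list of milestones and a `gamma`.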

Natural Language Processing

Training BERT/GPT uses linear warmup (first few thousand steps) followed by linear or cosine decay. Warmup prevents early training instability in transformers.
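
As a rough sketch of this pattern, the function below implements linear warmup followed by linear decay to zero, indexed by optimizer step rather than epoch. The function name and the numbers used (10,000 total steps, 1,000 warmup steps, peak rate 1e-4) are hypothetical and chosen only to make the shape easy to check.

```python
def warmup_linear_decay(step, total_steps, warmup_steps, peak_lr):
    """Linear ramp from 0 to peak_lr, then linear decay from peak_lr to 0."""
    if not 0 <= warmup_steps < total_steps:
        raise ValueError("need 0 <= warmup_steps < total_steps")
    if step < warmup_steps:
        # Warmup phase: fraction of the ramp completed so far
        return peak_lr * step / warmup_steps
    # Decay phase: fraction of the post-warmup steps still remaining
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)

for step in (0, 500, 1000, 5500, 10000):
    print(step, warmup_linear_decay(step, 10000, 1000, 1e-4))
```

The rate peaks exactly at the end of warmup and reaches zero at the final step. Swapping the decay branch for a cosine term gives the warmup + cosine variant.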

Transfer Learning

Fine-tuning pretrained models often uses smaller initial LR with aggressive decay, as the model starts near a good solution and only needs refinement.

Reinforcement Learning

Training policies with PPO often uses linear decay to zero over training to ensure stable final convergence.

Generative Models

Training GANs or diffusion models typically relies on carefully tuned schedules; for GANs, a smooth decay such as cosine annealing can help stabilize the adversarial training process.

Large-Scale Training

Training models on TPUs/GPUs for days/weeks uses sophisticated schedules combining warmup, decay, and sometimes restarts to maximize final performance.
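
One way to picture the "restarts" mentioned above is cosine annealing that resets to the maximum rate at the start of each cycle. The sketch below assumes fixed-length cycles; the function name and the cycle length of 100 steps are illustrative choices, not taken from any particular training recipe.

```python
import math

def cosine_with_restarts(step, cycle_steps, eta_max, eta_min=0.0):
    """Cosine annealing that jumps back to eta_max every `cycle_steps` steps."""
    t = step % cycle_steps  # position inside the current cycle
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t / cycle_steps))

for step in (0, 50, 99, 100, 150):
    print(step, round(cosine_with_restarts(step, 100, 0.1), 6))
```

PyTorch offers a related scheduler, `torch.optim.lr_scheduler.CosineAnnealingWarmRestarts`, which additionally lets cycle lengths grow over time.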

Implementation

Manual Implementation (No Libraries)

The implementation below shows several scheduling strategies. Step decay drops the LR at fixed intervals. Exponential decay provides a continuous decrease. Cosine annealing follows a smooth cosine curve. Warmup starts small and increases, which is essential when training large models. Cyclical LR periodically raises and then lowers the LR. Each schedule has different convergence properties and suits different scenarios.

import numpy as np
import matplotlib.pyplot as plt

class LearningRateScheduler:
    """Collection of learning rate scheduling strategies."""
    
    @staticmethod
    def step_decay(initial_lr, epoch, drop_every=10, drop_factor=0.5):
        """Step decay: reduce LR by factor every N epochs."""
        return initial_lr * (drop_factor ** (epoch // drop_every))
    
    @staticmethod
    def exponential_decay(initial_lr, epoch, decay_rate=0.1):
        """Exponential decay: continuous geometric decrease."""
        return initial_lr * np.exp(-decay_rate * epoch)
    
    @staticmethod
    def cosine_annealing(initial_lr, epoch, total_epochs, eta_min=0):
        """Cosine annealing: smooth cosine curve decay."""
        return eta_min + 0.5 * (initial_lr - eta_min) * \
               (1 + np.cos(np.pi * epoch / total_epochs))
    
    @staticmethod
    def linear_warmup_then_cosine(epoch, warmup_epochs, total_epochs, 
                                   eta_max, eta_min=0):
        """Linear warmup followed by cosine annealing."""
        if epoch < warmup_epochs:
            return eta_max * (epoch / warmup_epochs)
        else:
            progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
            return eta_min + 0.5 * (eta_max - eta_min) * (1 + np.cos(np.pi * progress))
    
    @staticmethod
    def polynomial_decay(initial_lr, epoch, total_epochs, power=2):
        """Polynomial decay with specified power."""
        return initial_lr * (1 - epoch / total_epochs) ** power
    
    @staticmethod
    def cyclical_lr(epoch, cycle_length, eta_min, eta_max):
        """Cyclical learning rate (triangular policy)."""
        cycle = epoch % cycle_length
        return eta_min + (eta_max - eta_min) * (1 - abs(cycle / (cycle_length/2) - 1))

# Visualization
epochs = np.arange(100)
initial_lr = 0.1

schedules = {
    'Step Decay': [LearningRateScheduler.step_decay(initial_lr, e, 30, 0.5) 
                   for e in epochs],
    'Exponential Decay': [LearningRateScheduler.exponential_decay(initial_lr, e, 0.05) 
                          for e in epochs],
    'Cosine Annealing': [LearningRateScheduler.cosine_annealing(initial_lr, e, 100) 
                         for e in epochs],
    'Warmup + Cosine': [LearningRateScheduler.linear_warmup_then_cosine(
                        e, 10, 100, initial_lr) for e in epochs],
    'Polynomial (p=2)': [LearningRateScheduler.polynomial_decay(initial_lr, e, 100, 2) 
                         for e in epochs]
}

plt.figure(figsize=(12, 6))
for name, lrs in schedules.items():
    plt.plot(epochs, lrs, label=name, linewidth=2)
plt.xlabel('Epoch')
plt.ylabel('Learning Rate')
plt.title('Learning Rate Schedules Comparison')
plt.legend()
plt.grid(True, alpha=0.3)
plt.savefig('lr_schedules.png', dpi=150, bbox_inches='tight')
plt.show()

# Usage in training loop
def train_with_scheduler(model, X, y, scheduler_fn, initial_lr, n_epochs):
    """Training loop with learning rate scheduling."""
    theta = np.random.randn(X.shape[1]) * 0.01
    loss_history = []
    lr_history = []
    
    for epoch in range(n_epochs):
        # Get scheduled learning rate
        lr = scheduler_fn(initial_lr, epoch)
        lr_history.append(lr)
        
        # Training step with current LR
        # (simplified - normally would use batches)
        y_pred = X @ theta
        gradient = X.T @ (y_pred - y)
        theta = theta - lr * gradient / len(X)
        
        loss = np.mean((y_pred - y) ** 2)
        loss_history.append(loss)
    
    return theta, loss_history, lr_history
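
The training loop above calls `scheduler_fn(initial_lr, epoch)`, but some schedules need extra arguments such as `total_epochs`. One possible way to bridge the gap is to bind those arguments ahead of time with `functools.partial`. The sketch below is self-contained and uses stand-ins: its own `cosine_annealing` with the same signature as the one above, and a hypothetical one-parameter regression `fit_slope` in place of the full training loop.

```python
import math
from functools import partial

def cosine_annealing(initial_lr, epoch, total_epochs, eta_min=0.0):
    # Same signature as the cosine schedule above, written with the math module
    return eta_min + 0.5 * (initial_lr - eta_min) * (1 + math.cos(math.pi * epoch / total_epochs))

def fit_slope(xs, ys, scheduler_fn, initial_lr, n_epochs):
    """Hypothetical stand-in for the training loop: fit y = w * x by gradient descent."""
    w = 0.0
    lr_history = []
    for epoch in range(n_epochs):
        lr = scheduler_fn(initial_lr, epoch)  # same call shape as the loop above
        lr_history.append(lr)
        grad = sum((w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= lr * grad
    return w, lr_history

xs = [0.0, 1.0, 2.0, 3.0]
ys = [3.0 * x for x in xs]  # true slope is 3
n_epochs = 50

# Bind total_epochs so the result accepts just (initial_lr, epoch)
schedule = partial(cosine_annealing, total_epochs=n_epochs)
w, lr_history = fit_slope(xs, ys, schedule, initial_lr=0.1, n_epochs=n_epochs)
print(f"fitted slope: {w:.4f}, first LR: {lr_history[0]:.4f}, last LR: {lr_history[-1]:.6f}")
```

For schedules whose first argument is not the initial rate, such as the warmup + cosine function above, a small `lambda initial_lr, epoch: ...` wrapper serves the same purpose.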

Using Libraries (torch.optim.lr_scheduler, tf.keras.optimizers.schedules, tf.keras.callbacks.LearningRateScheduler)

import torch
from torch.optim.lr_scheduler import StepLR, ExponentialLR, CosineAnnealingLR
from torch.optim.lr_scheduler import LambdaLR, OneCycleLR
import torch.nn as nn

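# Setup
# Note: the schedulers below share one optimizer only to show their constructors.
# In real training, attach a single scheduler (or chain them deliberately), because
# constructing a scheduler can overwrite the optimizer's current learning rate.
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)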

# Step Decay: reduce LR by 0.5 every 30 epochs
scheduler_step = StepLR(optimizer, step_size=30, gamma=0.5)

# Exponential Decay: multiply LR by 0.95 each epoch
scheduler_exp = ExponentialLR(optimizer, gamma=0.95)

# Cosine Annealing: decay from 0.1 to 0 over 100 epochs
scheduler_cosine = CosineAnnealingLR(optimizer, T_max=100, eta_min=0)

# Custom scheduler (warmup + decay)
def lr_lambda(epoch):
    if epoch < 10:
        return epoch / 10  # Linear warmup
    else:
        return 0.95 ** (epoch - 10)  # Exponential decay

scheduler_custom = LambdaLR(optimizer, lr_lambda)

# One Cycle LR: increases then decreases in one cycle
scheduler_onecycle = OneCycleLR(optimizer, max_lr=0.1, 
                                total_steps=1000,
                                pct_start=0.3,  # 30% warmup
                                anneal_strategy='cos')

# Training loop with scheduler
print('=== Training with Step Decay ===')
for epoch in range(100):
    # ... training code ...
    optimizer.step()
    scheduler_step.step()  # Update learning rate
    
    if epoch % 10 == 0:
        current_lr = optimizer.param_groups[0]['lr']
        print(f'Epoch {epoch}: LR = {current_lr:.6f}')

# TensorFlow/Keras
import tensorflow as tf

# Step decay
def step_decay_schedule(epoch, lr):
    if epoch < 30:
        return 0.1
    elif epoch < 60:
        return 0.01
    else:
        return 0.001

scheduler_tf = tf.keras.callbacks.LearningRateScheduler(step_decay_schedule)

# Exponential decay
exponential_decay = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.1,
    decay_steps=1000,
    decay_rate=0.96,
    staircase=True
)

# Cosine decay
cosine_decay = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=0.1,
    decay_steps=1000,
    alpha=0.0  # Minimum LR as fraction of initial
)

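# Warmup + Cosine: ramp linearly from initial_learning_rate to warmup_target,
# then cosine-decay from warmup_target over decay_steps
warmup_cosine = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=0.0,  # starting LR for the warmup phase
    decay_steps=1000,
    warmup_target=0.1,  # peak LR reached after warmup_steps
    warmup_steps=100
)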

# Usage
model_tf = tf.keras.Sequential([tf.keras.layers.Dense(1)])
model_tf.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=cosine_decay),
    loss='mse'
)

When to Use

✅ Appropriate Use Cases:

Step Decay: Traditional computer vision training (ImageNet, CIFAR)
Step Decay: When you have clear training phases and want control over exact LR drops
Exponential Decay: Continuous adaptation needs, online learning, streaming data
Cosine Annealing: Smooth, theoretically motivated decay; modern alternative to step decay
Cosine Annealing: When you want to avoid abrupt LR changes that can destabilize training
Warmup: Training large models (transformers, deep ResNets) to prevent early divergence
Warmup: When using large batch sizes with correspondingly scaled-up learning rates, which can destabilize the first updates
Cyclical: Exploring loss landscape, hyperparameter search, escaping local minima
Polynomial: When you want controlled, smooth decay with adjustable curve shape

❌ Avoid When:

Step Decay: When smooth, continuous adjustment is preferred (use exponential or cosine)
Exponential Decay: When you need precise LR control at specific epochs
Cosine Annealing: Short training runs where cosine curve doesn't fully develop
Warmup: Small models or when using small batch sizes (unnecessary overhead)
Cyclical: Production training where you want convergence to a specific minimum
Any scheduling: Very short training (<10 epochs) where fixed LR suffices
Aggressive decay: Early in hyperparameter search when you want fast iteration

Common Pitfalls

Learning rate drops too early: Step decay or aggressive exponential decay reduces the LR before the model has converged, causing premature stagnation. Solution: Monitor validation loss and delay decay until the loss plateaus. Use cosine annealing for a smoother transition.
Insufficient warmup: Large models with large learning rates diverge immediately without adequate warmup. Solution: Use a longer warmup (up to 5-10% of total steps for transformers). Start with a very small LR.
Learning rate too low at end: Some schedules decay to near zero, making final fine-tuning impossible. Solution: Set \(\eta_{min}\) to a small non-zero value (1e-6 or 1e-7) to allow continued learning.
Schedule/optimizer mismatch: Learning rate schedules designed for SGD may not work well with Adam, which has adaptive rates. Solution: Adam typically needs less aggressive scheduling. Linear decay is often sufficient.
Wrong schedule for batch size: Large-batch training needs different scheduling than small-batch training; linear scaling rules apply. Solution: Scale the schedule length with batch size. Larger batches can use a more aggressive initial LR with proper warmup.
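
The advice to delay decay until the validation loss plateaus can be sketched as a small stateful helper. `PlateauDecay` is a hypothetical class written for illustration, and its defaults (`factor=0.5`, `patience=3`, `min_lr=1e-6`) are assumptions. The non-zero floor also reflects the advice above about keeping \(\eta_{min}\) above zero.

```python
class PlateauDecay:
    """Hypothetical helper: shrink the LR when validation loss stops improving."""

    def __init__(self, lr, factor=0.5, patience=3, min_lr=1e-6):
        self.lr = lr
        self.factor = factor      # multiplier applied on each decay
        self.patience = patience  # non-improving epochs tolerated before decaying
        self.min_lr = min_lr      # floor so the LR never reaches zero
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss and return the LR to use next."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.bad_epochs = 0
        return self.lr

# Made-up validation losses that stall twice
scheduler = PlateauDecay(lr=0.1, factor=0.5, patience=2)
losses = [1.0, 0.8, 0.8, 0.8, 0.7, 0.7, 0.7]
lrs = [scheduler.step(loss) for loss in losses]
print(lrs)
```

The rate is halved only after two consecutive epochs without improvement, so it stays at 0.1 while the loss is still falling. PyTorch's `torch.optim.lr_scheduler.ReduceLROnPlateau` implements the same idea with more options.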