Learning Rate Scheduling
Definition
Learning rate scheduling is the practice of systematically adjusting the learning rate during training to improve convergence and final model performance. The learning rate is one of the most critical hyperparameters in gradient-based optimization: a value that is too high causes divergence or oscillation, while one that is too low leads to slow convergence or leaves the model stuck in poor local minima. Rather than keeping the learning rate fixed, a schedule changes it according to predefined rules or based on training progress, usually lowering it over time. Common schedules include step decay (reducing by a factor every N epochs), exponential decay (continuous geometric decrease), cosine annealing (decay along a smooth cosine curve), and warmup (starting small and increasing). Proper scheduling can mean the difference between a model that converges poorly and one that achieves state-of-the-art performance.
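A schedule driven by training progress, rather than by a fixed rule, can be sketched as a small plateau detector. The class name `PlateauDecay`, its parameters, and the validation losses below are hypothetical and chosen only for illustration; this is a minimal sketch, not any library's implementation.

```python
class PlateauDecay:
    """Reduce the learning rate when a monitored loss stops improving."""

    def __init__(self, initial_lr, factor=0.5, patience=3, min_lr=1e-6):
        self.lr = initial_lr
        self.factor = factor        # multiplier applied on each reduction
        self.patience = patience    # non-improving epochs tolerated before reducing
        self.min_lr = min_lr        # floor so the rate never reaches zero
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one validation loss and return the learning rate to use next."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs > self.patience:
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.bad_epochs = 0
        return self.lr


# Made-up validation losses: improvement stalls at 0.7 for several epochs.
val_losses = [1.0, 0.8, 0.7, 0.7, 0.7, 0.7, 0.7, 0.65]
scheduler = PlateauDecay(initial_lr=0.1, factor=0.5, patience=3)
lr_history = [scheduler.step(loss) for loss in val_losses]
print(lr_history)
```

The rate stays put while the loss is improving and is only cut once the stall has lasted longer than `patience` epochs.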
Intuition
Think of learning rate as step size when hiking down a mountain. At the beginning, you're far from the destination and the terrain is steep—you want large steps to cover ground quickly (high learning rate). As you approach the valley floor, large steps cause you to overshoot and oscillate around the minimum—you need smaller, more careful steps (lower learning rate). Warmup is like starting your hike slowly to avoid tripping before you find your footing. Step decay is like deciding to take half-size steps after reaching certain milestones. Cosine annealing is like smoothly transitioning from big strides to tiny steps following a natural deceleration curve. Cyclical schedules are like periodically taking bigger steps again to escape local potholes you might have settled into. The goal is to move fast when far away and precisely when close, avoiding both sluggish progress and unstable oscillations.
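The trade-off between overshooting and crawling can be seen on a toy problem. The sketch below minimizes f(x) = x² with plain gradient descent; the function, the starting point, and all three learning-rate choices are illustrative assumptions rather than values from a real training setup.

```python
def gradient_descent(lr_fn, x0=5.0, n_steps=50):
    """Minimize f(x) = x**2 using a step-dependent learning rate lr_fn(step)."""
    x = x0
    for step in range(n_steps):
        grad = 2.0 * x                  # derivative of x**2
        x = x - lr_fn(step) * grad
    return x


# Fixed rate that is too high: every step overshoots and |x| grows.
x_high = gradient_descent(lambda step: 1.1)

# Fixed rate that is too low: stable, but barely moves in 50 steps.
x_low = gradient_descent(lambda step: 0.001)

# Scheduled: start with large steps, then halve the rate every 10 steps.
x_sched = gradient_descent(lambda step: 0.9 * 0.5 ** (step // 10))

print(f"too high:  |x| = {abs(x_high):.3e}")
print(f"too low:   |x| = {abs(x_low):.3e}")
print(f"scheduled: |x| = {abs(x_sched):.3e}")
```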
Mathematical Formula
Step-by-Step Explanation:
- Step Decay: Initial rate \(\eta_0\) multiplied by decay factor \(\gamma\) (typically 0.1-0.5) every s steps/epochs
- Exponential Decay: Continuous decay with rate k; learning rate decreases smoothly as \(e^{-kt}\)
- Cosine Annealing: Follows half-cosine curve from \(\eta_{max}\) to \(\eta_{min}\) over T steps; creates smooth deceleration
- Warmup: Linearly increases from 0 to \(\eta_{max}\) over first w warmup steps; prevents early instability
- Polynomial: Decay following polynomial curve with power p (typically 1 for linear, 2 for quadratic)
- All schedules can be combined: warmup + cosine is common for transformers
- Schedules apply to the base learning rate; adaptive optimizers apply their own per-parameter scaling
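Written out, one common form of each schedule described above is given below, with \(t\) the current step or epoch. These are standard forms reconstructed from the descriptions in this section; individual libraries may parameterize them differently.

```latex
\begin{aligned}
\text{Step decay:} \quad & \eta_t = \eta_0 \, \gamma^{\lfloor t/s \rfloor} \\
\text{Exponential decay:} \quad & \eta_t = \eta_0 \, e^{-kt} \\
\text{Cosine annealing:} \quad & \eta_t = \eta_{min} + \tfrac{1}{2}\,(\eta_{max} - \eta_{min})\left(1 + \cos\frac{\pi t}{T}\right) \\
\text{Linear warmup:} \quad & \eta_t = \eta_{max} \, \frac{t}{w}, \qquad 0 \le t < w \\
\text{Polynomial decay:} \quad & \eta_t = \eta_0 \left(1 - \frac{t}{T}\right)^{p}
\end{aligned}
```

As a quick check of the cosine form: at \(t = 0\) the cosine term equals 1, giving \(\eta_t = \eta_{max}\); at \(t = T\) it equals \(-1\), giving \(\eta_t = \eta_{min}\).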
Real-World Use Cases
Training ResNet on ImageNet typically uses step decay (reduce LR by 0.1 at epochs 30, 60, 90) or cosine annealing. Proper scheduling is crucial for reaching competitive accuracy.
Training BERT/GPT uses linear warmup (first few thousand steps) followed by linear or cosine decay. Warmup prevents early training instability in transformers.
Fine-tuning pretrained models often uses smaller initial LR with aggressive decay, as the model starts near a good solution and only needs refinement.
Training policies with PPO often uses linear decay to zero over training to ensure stable final convergence.
Training GANs or diffusion models uses carefully tuned schedules; for GANs in particular, a smooth schedule such as cosine annealing can help stabilize the adversarial training process.
Training models on TPUs/GPUs for days/weeks uses sophisticated schedules combining warmup, decay, and sometimes restarts to maximize final performance.
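Two of the recipes above can be sketched as plain functions: the ResNet-style drop by 0.1 at epochs 30, 60, and 90, and the transformer-style linear warmup followed by linear decay. The milestones and the 0.1 factor come from the text; the base rate of 0.1, the peak rate of 1e-4, and the step counts are assumptions chosen for illustration.

```python
def multistep_lr(epoch, base_lr=0.1, milestones=(30, 60, 90), gamma=0.1):
    """Step decay at fixed milestones: multiply by gamma once per milestone passed."""
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** drops


def warmup_linear_decay(step, peak_lr=1e-4, warmup_steps=2000, total_steps=100_000):
    """Linear warmup from 0 to peak_lr, then linear decay to 0 at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)


for epoch in (0, 30, 60, 90):
    print(f"epoch {epoch:3d}: lr = {multistep_lr(epoch):.4f}")
for step in (0, 1000, 2000, 51_000, 100_000):
    print(f"step {step:6d}: lr = {warmup_linear_decay(step):.2e}")
```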
Implementation
Manual Implementation (No Libraries)
import numpy as np
import matplotlib.pyplot as plt


class LearningRateScheduler:
    """Collection of learning rate scheduling strategies."""

    @staticmethod
    def step_decay(initial_lr, epoch, drop_every=10, drop_factor=0.5):
        """Step decay: reduce LR by factor every N epochs."""
        return initial_lr * (drop_factor ** (epoch // drop_every))

    @staticmethod
    def exponential_decay(initial_lr, epoch, decay_rate=0.1):
        """Exponential decay: continuous geometric decrease."""
        return initial_lr * np.exp(-decay_rate * epoch)

    @staticmethod
    def cosine_annealing(initial_lr, epoch, total_epochs, eta_min=0):
        """Cosine annealing: smooth cosine curve decay."""
        return eta_min + 0.5 * (initial_lr - eta_min) * \
            (1 + np.cos(np.pi * epoch / total_epochs))

    @staticmethod
    def linear_warmup_then_cosine(epoch, warmup_epochs, total_epochs,
                                  eta_max, eta_min=0):
        """Linear warmup followed by cosine annealing."""
        if epoch < warmup_epochs:
            return eta_max * (epoch / warmup_epochs)
        else:
            progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
            return eta_min + 0.5 * (eta_max - eta_min) * (1 + np.cos(np.pi * progress))

    @staticmethod
    def polynomial_decay(initial_lr, epoch, total_epochs, power=2):
        """Polynomial decay with specified power."""
        return initial_lr * (1 - epoch / total_epochs) ** power

    @staticmethod
    def cyclical_lr(epoch, cycle_length, eta_min, eta_max):
        """Cyclical learning rate (triangular policy)."""
        cycle = epoch % cycle_length
        return eta_min + (eta_max - eta_min) * (1 - abs(cycle / (cycle_length / 2) - 1))
# Visualization
epochs = np.arange(100)
initial_lr = 0.1
schedules = {
    'Step Decay': [LearningRateScheduler.step_decay(initial_lr, e, 30, 0.5)
                   for e in epochs],
    'Exponential Decay': [LearningRateScheduler.exponential_decay(initial_lr, e, 0.05)
                          for e in epochs],
    'Cosine Annealing': [LearningRateScheduler.cosine_annealing(initial_lr, e, 100)
                         for e in epochs],
    'Warmup + Cosine': [LearningRateScheduler.linear_warmup_then_cosine(
                            e, 10, 100, initial_lr) for e in epochs],
    'Polynomial (p=2)': [LearningRateScheduler.polynomial_decay(initial_lr, e, 100, 2)
                         for e in epochs]
}

plt.figure(figsize=(12, 6))
for name, lrs in schedules.items():
    plt.plot(epochs, lrs, label=name, linewidth=2)
plt.xlabel('Epoch')
plt.ylabel('Learning Rate')
plt.title('Learning Rate Schedules Comparison')
plt.legend()
plt.grid(True, alpha=0.3)
plt.savefig('lr_schedules.png', dpi=150, bbox_inches='tight')
plt.show()
# Usage in training loop
def train_with_scheduler(model, X, y, scheduler_fn, initial_lr, n_epochs):
    """Training loop with learning rate scheduling.

    `model` is unused in this simplified linear-regression example.
    """
    theta = np.random.randn(X.shape[1]) * 0.01
    loss_history = []
    lr_history = []
    for epoch in range(n_epochs):
        # Get scheduled learning rate
        lr = scheduler_fn(initial_lr, epoch)
        lr_history.append(lr)
        # Training step with current LR
        # (simplified - normally would use batches)
        y_pred = X @ theta
        gradient = X.T @ (y_pred - y)
        theta = theta - lr * gradient / len(X)
        # Loss is measured on the predictions made before this update
        loss = np.mean((y_pred - y) ** 2)
        loss_history.append(loss)
    return theta, loss_history, lr_history
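A minimal, self-contained variant of the loop above shows how a scheduler function plugs in. It assumes a one-parameter model y ≈ theta · x and synthetic noiseless data; both are hypothetical and chosen so the result is easy to check. It uses only the standard library instead of NumPy.

```python
def step_decay(initial_lr, epoch, drop_every=10, drop_factor=0.5):
    """Same step-decay rule as in the scheduler class above."""
    return initial_lr * drop_factor ** (epoch // drop_every)


def fit_slope(xs, ys, scheduler_fn, initial_lr, n_epochs):
    """Full-batch gradient descent on squared error for the model y = theta * x."""
    theta = 0.0
    loss_history, lr_history = [], []
    n = len(xs)
    for epoch in range(n_epochs):
        lr = scheduler_fn(initial_lr, epoch)
        lr_history.append(lr)
        errors = [theta * x - y for x, y in zip(xs, ys)]
        loss_history.append(sum(e * e for e in errors) / n)
        # Gradient of half the mean squared error, matching the 1/len(X) scaling above.
        grad = sum(e * x for e, x in zip(errors, xs)) / n
        theta -= lr * grad
    return theta, loss_history, lr_history


# Synthetic, noiseless data with true slope 3.0 (an assumption for this demo).
xs = [0.5, 1.0, 1.5, 2.0]
ys = [3.0 * x for x in xs]

theta, losses, lrs = fit_slope(xs, ys, step_decay, initial_lr=0.5, n_epochs=40)
print(f"theta = {theta:.6f}")
print(f"first loss = {losses[0]:.4f}, last loss = {losses[-1]:.3e}")
print(f"lr at epochs 0, 10, 20, 30: {[lrs[e] for e in (0, 10, 20, 30)]}")
```

Any function with the signature `scheduler_fn(initial_lr, epoch)` can be passed in, so swapping schedules does not require touching the loop itself.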
Using Libraries (torch.optim.lr_scheduler, tf.keras.optimizers.schedules, tf.keras.callbacks.LearningRateScheduler)
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import StepLR, ExponentialLR, CosineAnnealingLR
from torch.optim.lr_scheduler import LambdaLR, OneCycleLR

# Setup
model = nn.Linear(10, 1)

def make_optimizer():
    # Each scheduler gets its own optimizer: constructing a scheduler can
    # overwrite the optimizer's learning rate, so sharing one would mix them up.
    return torch.optim.SGD(model.parameters(), lr=0.1)

optimizer = make_optimizer()

# Step Decay: multiply LR by 0.5 every 30 epochs
scheduler_step = StepLR(optimizer, step_size=30, gamma=0.5)

# Exponential Decay: multiply LR by 0.95 each epoch
scheduler_exp = ExponentialLR(make_optimizer(), gamma=0.95)

# Cosine Annealing: decay from 0.1 to 0 over 100 epochs
scheduler_cosine = CosineAnnealingLR(make_optimizer(), T_max=100, eta_min=0)

# Custom scheduler (warmup + decay); the function returns a multiplier on the base LR
def lr_lambda(epoch):
    if epoch < 10:
        return epoch / 10  # Linear warmup
    else:
        return 0.95 ** (epoch - 10)  # Exponential decay

scheduler_custom = LambdaLR(make_optimizer(), lr_lambda)

# One Cycle LR: increases then decreases in one cycle; step it once per batch
scheduler_onecycle = OneCycleLR(make_optimizer(), max_lr=0.1,
                                total_steps=1000,
                                pct_start=0.3,  # first 30% of steps increase the LR
                                anneal_strategy='cos')

# Training loop with scheduler
print('=== Training with Step Decay ===')
for epoch in range(100):
    # ... training code ...
    optimizer.step()
    scheduler_step.step()  # Update learning rate
    if epoch % 10 == 0:
        current_lr = optimizer.param_groups[0]['lr']
        print(f'Epoch {epoch}: LR = {current_lr:.6f}')
# TensorFlow/Keras
import tensorflow as tf
# Step decay
def step_decay_schedule(epoch, lr):
    if epoch < 30:
        return 0.1
    elif epoch < 60:
        return 0.01
    else:
        return 0.001

# Pass this callback to model.fit(..., callbacks=[scheduler_tf])
scheduler_tf = tf.keras.callbacks.LearningRateScheduler(step_decay_schedule)
# Exponential decay
exponential_decay = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.1,
    decay_steps=1000,
    decay_rate=0.96,
    staircase=True
)

# Cosine decay
cosine_decay = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=0.1,
    decay_steps=1000,
    alpha=0.0  # Minimum LR as fraction of initial
)
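# Warmup + Cosine: linear warmup from 0 to 0.1 over 100 steps, then cosine decay
# (warmup_target and warmup_steps require a recent TensorFlow release)
warmup_cosine = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=0.0,  # LR at step 0, before warmup
    decay_steps=1000,
    warmup_target=0.1,  # Peak LR reached at the end of warmup
    warmup_steps=100
)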
# Usage
model_tf = tf.keras.Sequential([tf.keras.layers.Dense(1)])
model_tf.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=cosine_decay),
    loss='mse'
)
When to Use
✅ Appropriate Use Cases:
- Step Decay: Traditional computer vision training (ImageNet, CIFAR)
- Step Decay: When you have clear training phases and want control over exact LR drops
- Exponential Decay: Continuous adaptation needs, online learning, streaming data
- Cosine Annealing: Smooth, theoretically motivated decay; modern alternative to step decay
- Cosine Annealing: When you want to avoid abrupt LR changes that can destabilize training
- Warmup: Training large models (transformers, deep ResNets) to prevent early divergence
- Warmup: When using large batch sizes with a correspondingly scaled-up learning rate, which can be unstable in the first steps of training
- Cyclical: Exploring loss landscape, hyperparameter search, escaping local minima
- Polynomial: When you want controlled, smooth decay with adjustable curve shape
❌ Avoid When:
- Step Decay: When smooth, continuous adjustment is preferred (use exponential or cosine)
- Exponential Decay: When you need precise LR control at specific epochs
- Cosine Annealing: Short training runs where cosine curve doesn't fully develop
- Warmup: Small models or small batch sizes, where it often adds overhead without a measurable benefit
- Cyclical: Production training where you want convergence to a specific minimum
- Any scheduling: Very short training (<10 epochs) where fixed LR suffices
- Aggressive decay: Early in hyperparameter search when you want fast iteration
Common Pitfalls
- Learning rate drops too early: Step decay or aggressive exponential decay reduces the LR before the model has converged, causing premature stagnation. Solution: Monitor validation loss and delay decay until the loss plateaus. Use cosine annealing for a smoother transition.
- Insufficient warmup: Large models with large learning rates diverge immediately without adequate warmup. Solution: Use a longer warmup (up to 5-10% of total steps for transformers). Start with a very small LR.
- Learning rate too low at end: Some schedules decay to near zero, making final fine-tuning impossible. Solution: Set \(\eta_{min}\) to a small non-zero value (1e-6 or 1e-7) to allow continued learning.
- Schedule/optimizer mismatch: Learning rate schedules designed for SGD may not work well with Adam, which has adaptive rates. Solution: Adam typically needs less aggressive scheduling. Linear decay is often sufficient.
- Wrong schedule for batch size: Large-batch training needs different scheduling than small-batch training; linear scaling rules apply. Solution: Scale the schedule length with batch size. Larger batches can use a more aggressive initial LR with proper warmup.
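The linear scaling rule mentioned in the last pitfall can be sketched as follows. The reference point (a base rate of 0.1 at batch size 256) and the five warmup epochs are illustrative assumptions, and the function names are hypothetical; the right values depend on the model and optimizer.

```python
def scaled_lr(batch_size, base_lr=0.1, base_batch_size=256):
    """Linear scaling rule: grow the learning rate in proportion to batch size."""
    return base_lr * batch_size / base_batch_size


def warmup_to_scaled_lr(epoch, batch_size, warmup_epochs=5,
                        base_lr=0.1, base_batch_size=256):
    """Ramp linearly from base_lr to the scaled rate over warmup_epochs, then hold."""
    target = scaled_lr(batch_size, base_lr, base_batch_size)
    if epoch >= warmup_epochs:
        return target
    return base_lr + (target - base_lr) * epoch / warmup_epochs


for bs in (256, 1024, 8192):
    print(f"batch size {bs:5d}: target lr = {scaled_lr(bs):.2f}")
print([round(warmup_to_scaled_lr(e, 1024), 3) for e in range(7)])
```

The warmup here starts from the unscaled base rate rather than from zero, so the early steps behave like small-batch training before the larger rate takes over.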