Gradient Descent
Definition
Gradient Descent is the fundamental optimization algorithm used to minimize differentiable loss functions in machine learning. It iteratively adjusts model parameters in the direction opposite to the gradient of the loss function with respect to the parameters. The core principle is simple yet powerful: compute the gradient (the direction of steepest ascent), then move in the opposite direction by a step size called the learning rate. This process continues until convergence to a local minimum is reached. Gradient descent forms the backbone of modern deep learning optimization, with variants like Stochastic Gradient Descent and Adam building upon this foundational concept. The algorithm is applicable whenever the loss function is differentiable and provides a systematic way to optimize millions or even billions of parameters in neural networks.
Intuition
Imagine standing on a foggy mountainside and wanting to reach the valley floor. You cannot see the entire landscape, but you can feel the slope beneath your feet. Gradient descent works exactly like this: at each position, you determine which way is downhill (the negative gradient) and take a step in that direction. The size of your step is the learning rate—too small and you descend slowly; too large and you might overshoot the valley or even climb back up. In machine learning, the mountain represents the loss landscape where elevation corresponds to error, and your position represents the model parameters. The valleys are regions of low loss where the model makes accurate predictions. Batch gradient descent uses the entire dataset to compute the gradient at each step, giving you the truest sense of the overall slope but requiring significant computation per step.
Mathematical Formula
Step-by-Step Explanation:
- Step 1: Initialize parameters \( h\(\eta_0\)\) randomly or with a specific strategy
- Step 2: Compute the gradient \nabla_\theta L(\theta_t) - the vector of partial derivatives of the loss with respect to each parameter
- Step 3: Scale the gradient by learning rate \(\(\eta\)\) to control step size
- Step 4: Update parameters by subtracting the scaled gradient from current parameters
- Step 5: Repeat until convergence criteria met (gradient magnitude < threshold, max iterations reached, or loss stops decreasing)
Real-World Use Cases
Finding optimal weights that minimize the sum of squared errors between predicted and actual house prices based on features like square footage, bedrooms, and location.
Training a binary classifier for spam detection by optimizing weights to minimize cross-entropy loss on labeled email data.
Training the weights and biases of a feedforward neural network for image classification, where gradient descent propagates error signals backward through the network.
Optimizing matrix factorization models for collaborative filtering to predict user ratings for movies they haven't watched.
Training word embeddings using techniques like Word2Vec or GloVe by minimizing reconstruction loss over large text corpora.
Implementation
Manual Implementation (No Libraries)
import numpy as np
def gradient_descent(X, y, learning_rate=0.01, n_iterations=1000, tolerance=1e-6):
"""
Batch Gradient Descent for linear regression.
Args:
X: Feature matrix (n_samples, n_features)
y: Target vector (n_samples,)
learning_rate: Step size for parameter updates
n_iterations: Maximum number of iterations
tolerance: Convergence threshold for gradient magnitude
Returns:
theta: Optimized parameters
loss_history: Loss at each iteration
"""
n_samples, n_features = X.shape
# Initialize parameters randomly
theta = np.random.randn(n_features) * 0.01
loss_history = []
for iteration in range(n_iterations):
# Compute predictions
y_pred = X @ theta
# Compute loss (Mean Squared Error)
loss = np.mean((y_pred - y) ** 2)
loss_history.append(loss)
# Compute gradient: (2/n) * X^T * (X*theta - y)
gradient = (2 / n_samples) * X.T @ (y_pred - y)
# Check convergence
gradient_norm = np.linalg.norm(gradient)
if gradient_norm < tolerance:
print(f'Converged at iteration {iteration}')
break
# Update parameters
theta = theta - learning_rate * gradient
return theta, loss_history
# Example usage
np.random.seed(42)
X = np.random.randn(100, 3)
true_theta = np.array([2.0, -1.0, 0.5])
y = X @ true_theta + np.random.randn(100) * 0.1
theta_opt, losses = gradient_descent(X, y, learning_rate=0.1, n_iterations=1000)
print(f'Optimized parameters: {theta_opt}')
print(f'Final loss: {losses[-1]:.6f}')
Using Libraries (torch.optim.SGD, tensorflow.keras.optimizers.SGD)
import torch
import torch.nn as nn
# Define model, loss, and optimizer
model = nn.Linear(3, 1) # Linear regression with 3 features
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# Training loop
X = torch.randn(100, 3)
y = torch.randn(100, 1)
losses = []
for epoch in range(1000):
# Forward pass
outputs = model(X)
loss = criterion(outputs, y)
# Backward pass
optimizer.zero_grad()
loss.backward()
# Parameter update
optimizer.step()
losses.append(loss.item())
print(f'Final loss: {losses[-1]:.6f}')
# TensorFlow/Keras equivalent
import tensorflow as tf
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(3,))])
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.1),
loss='mse')
history = model.fit(X.numpy(), y.numpy(), epochs=1000, verbose=0)
print(f'Final loss: {history.history["loss"][-1]:.6f}')
When to Use
✅ Appropriate Use Cases:
- When the dataset is small to medium-sized (fits in memory)
- When deterministic, stable convergence is preferred over speed
- For convex optimization problems where global minimum is guaranteed
- When you need the true gradient for theoretical analysis
- As a baseline to compare more complex optimizers against
- When training simple models like linear/logistic regression
- For educational purposes to understand optimization fundamentals
❌ Avoid When:
- When working with large datasets that don't fit in memory
- When training deep neural networks (use adaptive methods instead)
- When fast iteration and experimentation are priorities
- When the loss landscape is highly non-convex with many saddle points
- When real-time training updates are needed (streaming data)
- When memory is constrained (batch gradient requires storing all data)
Common Pitfalls
- {'pitfall': 'Learning rate too high', 'description': 'Causes divergence where loss increases instead of decreasing. The optimizer overshoots the minimum and bounces around or diverges to infinity.', 'solution': 'Start with small learning rate (0.001-0.01), use learning rate scheduling, or implement learning rate decay.'}
- {'pitfall': 'Learning rate too low', 'description': 'Causes extremely slow convergence, requiring many iterations to reach the minimum. May get stuck in flat regions.', 'solution': 'Use larger initial learning rate with decay, or implement adaptive learning rate methods.'}
- {'pitfall': 'Convergence to local minimum', 'description': 'In non-convex landscapes, gradient descent may converge to a local minimum instead of the global minimum.', 'solution': 'Use multiple random initializations, add momentum, or use stochastic methods that can escape local minima.'}
- {'pitfall': 'Feature scaling issues', 'description': 'Features with different scales cause the loss landscape to be elongated, making convergence slow along certain directions.', 'solution': 'Always standardize or normalize features (zero mean, unit variance) before applying gradient descent.'}
- {'pitfall': 'Saddle points in high dimensions', 'description': 'In high-dimensional spaces, saddle points are more common than local minima and can trap gradient descent.', 'solution': 'Add momentum or use second-order methods that can escape saddle points more effectively.'}