Quasi-Newton Methods (L-BFGS)

Definition

Quasi-Newton methods are a family of optimization algorithms that approximate Newton's method without explicitly computing or storing the Hessian matrix. Instead of the expensive O(n²) Hessian storage and O(n³) inversion, these methods build an approximation of the inverse Hessian (or Hessian itself) using only gradient information from previous iterations. The BFGS algorithm (Broyden-Fletcher-Goldfarb-Shanno) is the most popular quasi-Newton method, maintaining a dense approximation with O(n²) storage. L-BFGS (Limited-memory BFGS) is its scalable variant that uses only O(mn) memory where m is a small constant (typically 10-20), making it suitable for problems with thousands to millions of parameters. These methods achieve superlinear convergence—faster than gradient descent but without the prohibitive costs of Newton's method—making them the method of choice for medium-scale optimization problems.

Intuition

💡

Imagine you're navigating a complex landscape and want to know which way to step, but calculating the full curvature map (Hessian) is impossibly expensive. Quasi-Newton methods are like being a smart hiker who remembers how the slope changed between your previous positions. By remembering just a few past locations and gradients (like 'at the ridge, slope was steep east; now in valley, slope is gentle'), you can infer the shape of the terrain without mapping every detail. L-BFGS is the ultra-lightweight version—instead of remembering the entire history, you only keep the last 10-20 positions, like having a short-term memory that forgets old information to save space. This approximation gets you nearly the same benefit as knowing the full curvature but with minimal memory. The 'secant equation' ensures your approximate curvature matches what you've actually observed on your path.

Mathematical Formula

\[ \text{Secant Condition:} \quad B_{k+1} s_k = y_k \text{where:} \quad s_k = \theta_{k+1} - \theta_k, \quad y_k =

abla L_{k+1} -

abla L_k \]

\text{BFGS Update:} \quad B_{k+1} = B_k - \frac{B_k s_k s_k^T B_k}{s_k^T B_k s_k} + \frac{y_k y_k^T}{y_k^T s_k}

\text{L-BFGS Two-Loop Recursion:}

\[ \quad q =

abla L_k \]

\quad \text{for } i = k-1, ..., k-m:

\quad \quad \alpha_i = \rho_i s_i^T q, \quad q = q - \alpha_i y_i

\quad r = H_k^0 q

\quad \text{for } i = k-m, ..., k-1:

\quad \quad \beta = \rho_i y_i^T r, \quad r = r + s_i (\alpha_i - \beta)

Step-by-Step Explanation:

Secant condition: Approximate Hessian B should map step s_k to gradient change y_k
s_k (step): Difference between current and previous parameter values
y_k (gradient difference): Change in gradient between iterations
BFGS update: Low-rank update to Hessian approximation satisfying secant condition
First term: Removes old curvature information along step direction
Second term: Adds new curvature information from gradient change
L-BFGS: Instead of storing \(B_k\) explicitly, stores last m pairs (s_i, y_i)
Two-loop recursion: Computes H_k * g efficiently using stored vector pairs
ρ_i = 1/(\(y_i^T s_i\)): Scaling factor for each stored pair

Real-World Use Cases

Logistic Regression and GLMs

Training large-scale logistic regression with L-BFGS converges faster than SGD for datasets up to millions of samples, commonly used in LIBLINEAR and scikit-learn.

Multiclass Classification

Training multinomial logistic regression for document classification with thousands of classes (like news categorization) where L-BFGS efficiently handles the parameter space.

Conditional Random Fields

Training CRFs for sequence labeling (NER, POS tagging) where the objective is convex but high-dimensional; L-BFGS is standard in CRF++ and similar tools.

Matrix Factorization

Non-negative matrix factorization and collaborative filtering where alternating least squares or direct optimization with L-BFGS is preferred over SGD.

Hyperparameter Optimization

Optimizing hyperparameters of ML models where each evaluation is expensive; L-BFGS efficiently explores the search space.

Engineering Design

Structural optimization, aerodynamic design, and other engineering problems with expensive simulations where gradient information is available but Hessian is not.

Implementation

Manual Implementation (No Libraries)

The implementation shows the core L-BFGS two-loop recursion. It stores m pairs of (s, y) vectors instead of the full Hessian. The first loop computes intermediate values going backward through history; the second loop going forward produces the final approximate inverse-Hessian-vector product. Memory is O(mn) instead of O(n²). Line search ensures sufficient decrease.

import numpy as np

def lbfgs_optimizer(grad_func, theta0, m=10, max_iter=100, tol=1e-5, c1=1e-4, c2=0.9):
    """
    L-BFGS optimizer with two-loop recursion and Wolfe line search.
    
    Args:
        grad_func: Function returning (loss, gradient) tuple
        theta0: Initial parameters
        m: Memory size (number of vector pairs to store)
        max_iter: Maximum iterations
        tol: Convergence tolerance
        c1, c2: Wolfe condition constants
    """
    theta = theta0.copy()
    loss, grad = grad_func(theta)
    
    # Storage for L-BFGS
    s_list = []  # Step differences
    y_list = []  # Gradient differences
    rho_list = []  # Scaling factors
    
    history = [(loss, np.linalg.norm(grad))]
    
    for k in range(max_iter):
        # Two-loop recursion to compute search direction
        q = grad.copy()
        alphas = []
        
        # First loop (reverse)
        for i in range(len(s_list) - 1, -1, -1):
            rho = rho_list[i]
            s = s_list[i]
            y = y_list[i]
            alpha = rho * np.dot(s, q)
            alphas.append(alpha)
            q = q - alpha * y
        
        # Initial Hessian approximation (scaled identity)
        if len(s_list) > 0:
            \(\gamma\) = np.dot(s_list[-1], y_list[-1]) / np.dot(y_list[-1], y_list[-1])
            r = \(\gamma\) * q
        else:
            r = q  # Steepest descent for first iteration
        
        # Second loop (forward)
        for i in range(len(s_list)):
            rho = rho_list[i]
            s = s_list[i]
            y = y_list[i]
            alpha = alphas[len(alphas) - 1 - i]
            beta = rho * np.dot(y, r)
            r = r + s * (alpha - beta)
        
        # r is now the approximate inverse Hessian times gradient
        p = -r  # Search direction
        
        # Wolfe line search
        alpha = 1.0
        loss_new, grad_new = grad_func(theta + alpha * p)
        
        # Simple backtracking if Wolfe not satisfied
        while loss_new > loss + c1 * alpha * np.dot(grad, p) and alpha > 1e-10:
            alpha *= 0.5
            loss_new, grad_new = grad_func(theta + alpha * p)
        
        # Update parameters
        theta_new = theta + alpha * p
        s = theta_new - theta
        y = grad_new - grad
        
        # Update storage (maintain limited memory)
        if len(s_list) >= m:
            s_list.pop(0)
            y_list.pop(0)
            rho_list.pop(0)
        
        ys = np.dot(y, s)
        if ys > 1e-10:  # Only update if curvature condition satisfied
            s_list.append(s)
            y_list.append(y)
            rho_list.append(1.0 / ys)
        
        theta = theta_new
        loss = loss_new
        grad = grad_new
        
        history.append((loss, np.linalg.norm(grad)))
        
        # Check convergence
        if np.linalg.norm(grad) < tol:
            print(f'Converged at iteration {k+1}')
            break
        
        if (k + 1) % 10 == 0:
            print(f'Iter {k+1}: Loss={loss:.6f}, |grad|={np.linalg.norm(grad):.6f}')
    
    return theta, history

# Test on Rosenbrock function (classic optimization test)
def rosenbrock(x):
    """Rosenbrock function (banana function)."""
    return (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2

def rosenbrock_grad(x):
    """Gradient of Rosenbrock function."""
    loss = rosenbrock(x)
    dfdx0 = -2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0]**2)
    dfdx1 = 200 * (x[1] - x[0]**2)
    return loss, np.array([dfdx0, dfdx1])

print('=== L-BFGS on Rosenbrock Function ===')
x0 = np.array([-1.0, 2.0])  # Starting far from optimum at (1, 1)
x_opt, history = lbfgs_optimizer(rosenbrock_grad, x0, m=10, max_iter=100)
print(f'
Optimal x: {x_opt}')
print(f'Function value: {rosenbrock(x_opt):.10f}')
print(f'True optimum: (1, 1), f=0')

# Quadratic function test
def quadratic(x):
    A = np.array([[4, 1], [1, 3]])
    b = np.array([-1, 2])
    loss = 0.5 * x @ A @ x + b @ x
    grad = A @ x + b
    return loss, grad

print('
=== L-BFGS on Quadratic Function ===')
x0 = np.array([0.0, 0.0])
x_opt, history = lbfgs_optimizer(quadratic, x0, m=5, max_iter=50)
print(f'Optimal x: {x_opt}')
print(f'Expected: {-np.linalg.inv(np.array([[4,1],[1,3]])) @ np.array([-1, 2])}')

Using Libraries (scipy.optimize.minimize (L-BFGS-B), sklearn.linear_model.LogisticRegression, torch.optim.LBFGS)

import numpy as np
from scipy.optimize import minimize

# scipy provides excellent L-BFGS-B implementation
def rosenbrock(x):
    return (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2

def rosenbrock_grad(x):
    dfdx0 = -2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0]**2)
    dfdx1 = 200 * (x[1] - x[0]**2)
    return np.array([dfdx0, dfdx1])

# L-BFGS-B (B = Bounded, supports box constraints)
result = minimize(rosenbrock, x0=[-1, 2], method='L-BFGS-B',
                  jac=rosenbrock_grad,
                  options={'maxiter': 100, 'disp': True})
print(f'
scipy L-BFGS-B result: {result.x}')
print(f'Function value: {result.fun:.10f}')
print(f'Iterations: {result.nit}, Function evals: {result.nfev}')

# For machine learning: sklearn uses L-BFGS for some models
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)

# Use L-BFGS solver
lr = LogisticRegression(solver='lbfgs', max_iter=1000, verbose=1)
lr.fit(X, y)
print(f'
Logistic Regression converged in {lr.n_iter_[0]} iterations')

# PyTorch L-BFGS (limited support)
import torch

def pytorch_lbfgs_example():
    """PyTorch L-BFGS optimizer example."""
    x = torch.tensor([-1.0, 2.0], requires_grad=True)
    
    # L-BFGS requires closure that recomputes loss
    def closure():
        optimizer.zero_grad()
        loss = (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2
        loss.backward()
        return loss
    
    optimizer = torch.optim.LBFGS([x], max_iter=20, line_search_fn='strong_wolfe')
    
    for i in range(10):
        loss = optimizer.step(closure)
        if i % 2 == 0:
            print(f'Iter {i}: x={x.detach().numpy()}, loss={loss.item():.6f}')
    
    return x

print('
=== PyTorch L-BFGS ===')
x_pytorch = pytorch_lbfgs_example()

When to Use

✅ Appropriate Use Cases:

For medium-scale problems (100 to 100,000+ parameters)
When the objective function is smooth and gradients are available
Logistic regression, multinomial regression, and convex GLMs
Problems where memory permits O(mn) storage but not O(n²)
When faster convergence than gradient descent is needed
Full-batch training when entire dataset fits in memory
Non-linear least squares problems
When line search is feasible (not mini-batch setting)

❌ Avoid When:

For stochastic/mini-batch training (line search doesn't work well with noisy gradients)
Deep neural networks (millions of parameters, non-convex, stochastic)
Online learning where data arrives sequentially
When function evaluations are cheap but gradient computation is expensive
Very large-scale problems (n > 10^7) where even O(mn) memory is too much
Problems with non-smooth objectives (L1 regularization, ReLU without smoothing)
When you need real-time updates with streaming data

Common Pitfalls

{'pitfall': 'Using L-BFGS with mini-batches', 'description': 'L-BFGS assumes deterministic gradients; stochastic gradients break the curvature approximation.', 'solution': 'Use SGD, Adam, or other stochastic optimizers. L-BFGS is for full-batch optimization.'}
{'pitfall': 'Memory limit exceeded', 'description': 'For very high-dimensional problems, even O(mn) storage can be prohibitive.', 'solution': 'Reduce memory parameter m, or use first-order methods like SGD or conjugate gradient.'}
{'pitfall': 'Poor line search performance', 'description': 'Inexact line search can lead to poor convergence or excessive function evaluations.', 'solution': 'Tune Wolfe condition parameters or use strong Wolfe line search. Monitor line search iterations.'}
{'pitfall': 'Negative curvature ignored', 'description': 'Skipping updates when y^T s <= 0 discards valid information in non-convex problems.', 'solution': 'Use damped BFGS variants or SR1 update for indefinite Hessians.'}
{'pitfall': 'Incorrect gradient computation', 'description': 'L-BFGS is sensitive to gradient accuracy; numerical errors can cause divergence.', 'solution': 'Verify gradients with finite differences, use automatic differentiation.'}