Hyperparameter Tuning: Grid Search, Random Search, and Bayesian Optimization

Definition

Hyperparameter tuning is the process of optimizing algorithm configuration parameters that control model learning but are not learned from data. Unlike model parameters (weights, coefficients) learned during training, hyperparameters (learning rate, regularization strength, tree depth) must be set before training begins and significantly impact model performance. Grid Search exhaustively evaluates all combinations from a predefined parameter grid, guaranteeing coverage but suffering from exponential growth in evaluations. Random Search samples parameter combinations randomly, often finding good solutions with far fewer evaluations by exploring diverse values for each parameter independently. Bayesian Optimization builds a probabilistic surrogate model (typically Gaussian Process) of the objective function, using acquisition functions to intelligently select the next evaluation point, balancing exploration (uncertain regions) and exploitation (promising regions). Modern approaches include Hyperband (early stopping of poor configurations) and Population-Based Training (evolutionary optimization). Proper hyperparameter tuning, combined with nested cross-validation, transforms good models into excellent ones while preventing overfitting to the validation set.

Intuition

💡

Think of hyperparameter tuning like adjusting a recipe. Grid search is trying every combination of ingredient amounts in fixed increments - thorough but slow. Random search is throwing darts at a dartboard of recipes - you might hit a winner faster. Bayesian optimization is like an experienced chef who learns from each attempt: 'Last time more salt helped, let me try a bit more, but I should also explore adding herbs since I haven't tried that region.' The key insight: not all parameters matter equally, and some regions of the search space are more promising than others.

Mathematical Formula

Grid Search:

\theta^* = \arg\max_{\theta \in \Theta_{grid}} f(\theta)

Random Search:

\theta_i \sim p(\theta), \quad i = 1, ..., n

\theta^* = \arg\max_{\theta \in \{ heta_1, ..., \theta_n\}} f(\theta)

Bayesian Optimization:

Surrogate model:

$P(f|\theta_{1:t}, y_{1:t})$ (Gaussian Process)

Acquisition Function (Expected Improvement):

EI(\theta) = \mathbb{E}[\max(f(\theta) - f(\theta^+), 0)]

= (\mu(\theta) - f(\theta^+) - \xi) \Phi(Z) + \sigma(\theta) \phi(Z)

Where:

Z = \frac{\mu(\theta) - f(\theta^+) - \xi}{\sigma(\theta)}

Upper Confidence Bound:

UCB(\theta) = \mu(\theta) + \kappa \sigma(\theta)

Where:

- $\mu(\theta)$ = predicted mean from surrogate

- $\sigma(\theta)$ = predicted standard deviation (uncertainty)

- $\Phi, \phi$ = CDF and PDF of standard normal

- $\xi$ = exploration parameter

Step-by-Step Explanation:

Grid search evaluates all combinations in a predefined grid (exponential in dimensions)
Random search samples uniformly from parameter distributions (more efficient)
Bayesian optimization builds a probabilistic model of the objective surface
Expected Improvement balances improving over best-so-far vs. exploring uncertain regions
UCB explicitly trades off predicted performance (exploitation) and uncertainty (exploration)
The surrogate model (GP) provides uncertainty estimates guiding where to sample next

Real-World Use Cases

Machine Learning Competitions

Kaggle winners use extensive Bayesian optimization with tools like Optuna to squeeze maximum performance from ensemble models.

Deep Learning

Neural architecture search tunes layer types, sizes, and connections - often requiring hundreds of GPU-hours with Bayesian methods.

Production ML

AutoML platforms use Hyperband and Bayesian optimization to automatically tune pipelines for business users.

Scientific Computing

Hyperparameter tuning for physics simulators and molecular modeling where each evaluation is expensive.

Implementation

Manual Implementation (No Libraries)

This implementation demonstrates the core mechanics of tuning strategies. Grid search is exhaustive but exponential. Random search explores more diverse values per parameter. Bayesian optimization uses past evaluations to guide future ones. For production, use optimized libraries like Optuna or scikit-optimize with proper Gaussian Processes.

import numpy as np
from itertools import product
from scipy.stats import norm

class HyperparameterTuner:
    """
    Manual implementations of hyperparameter tuning strategies.
    Includes Grid Search, Random Search, and simplified Bayesian Optimization.
    """
    
    def __init__(self, model_class, param_grid, scoring='accuracy', cv=3):
        self.model_class = model_class
        self.param_grid = param_grid
        self.scoring = scoring
        self.cv = cv
        self.best_params_ = None
        self.best_score_ = None
        self.cv_results_ = []
    
    def _evaluate_params(self, X, y, params):
        """Evaluate parameters using cross-validation."""
        from sklearn.model_selection import cross_val_score
        
        model = self.model_class(**params)
        scores = cross_val_score(model, X, y, cv=self.cv, scoring=self.scoring)
        return scores.mean()
    
    def grid_search(self, X, y):
        """Exhaustive grid search."""
        # Generate all combinations
        keys = list(self.param_grid.keys())
        values = list(self.param_grid.values())
        combinations = list(product(*values))
        
        best_score = -np.inf
        best_params = None
        results = []
        
        print(f'Grid Search: {len(combinations)} combinations')
        
        for combo in combinations:
            params = dict(zip(keys, combo))
            score = self._evaluate_params(X, y, params)
            
            results.append({
                'params': params,
                'mean_score': score
            })
            
            if score > best_score:
                best_score = score
                best_params = params
        
        self.best_params_ = best_params
        self.best_score_ = best_score
        self.cv_results_ = results
        
        return self
    
    def random_search(self, X, y, n_iter=10):
        """Random search with n_iter evaluations."""
        best_score = -np.inf
        best_params = None
        results = []
        
        print(f'Random Search: {n_iter} iterations')
        
        np.random.seed(42)
        for i in range(n_iter):
            # Sample random combination
            params = {}
            for key, values in self.param_grid.items():
                params[key] = np.random.choice(values)
            
            score = self._evaluate_params(X, y, params)
            
            results.append({
                'params': params,
                'mean_score': score
            })
            
            if score > best_score:
                best_score = score
                best_params = params
                print(f'  Iter {i+1}: New best score {score:.4f}')
        
        self.best_params_ = best_params
        self.best_score_ = best_score
        self.cv_results_ = results
        
        return self

class SimpleBayesianOptimization:
    """
    Simplified Bayesian Optimization using Gaussian Process.
    For production use, prefer libraries like scikit-optimize or Optuna.
    """
    
    def __init__(self, param_bounds, n_initial=5, n_iterations=20):
        self.param_bounds = param_bounds
        self.n_initial = n_initial
        self.n_iterations = n_iterations
        self.X_observed = []
        self.y_observed = []
        self.best_params = None
        self.best_value = -np.inf
    
    def _sample_random(self):
        """Sample random point from bounds."""
        params = {}
        for key, (low, high) in self.param_bounds.items():
            if isinstance(low, int):
                params[key] = np.random.randint(low, high + 1)
            else:
                params[key] = np.random.uniform(low, high)
        return params
    
    def _expected_improvement(self, mu, sigma, y_best, xi=0.01):
        """Compute Expected Improvement acquisition function."""
        with np.errstate(divide='warn'):
            imp = mu - y_best - xi
            Z = imp / sigma
            ei = imp * norm.cdf(Z) + sigma * norm.pdf(Z)
            ei[sigma == 0.0] = 0.0
        return ei
    
    def _simple_gp_predict(self, X_test):
        """Simplified GP prediction (using kernel mean)."""
        if len(self.y_observed) == 0:
            return np.zeros(len(X_test)), np.ones(len(X_test))
        
        # Simplified: use nearest neighbor mean and distance-based uncertainty
        # In practice, use proper GP with RBF kernel
        y_mean = np.mean(self.y_observed)
        mu = np.full(len(X_test), y_mean)
        sigma = np.full(len(X_test), np.std(self.y_observed) if len(self.y_observed) > 1 else 1.0)
        
        return mu, sigma
    
    def optimize(self, objective_func):
        """Run Bayesian optimization."""
        # Initial random sampling
        for _ in range(self.n_initial):
            params = self._sample_random()
            value = objective_func(params)
            self.X_observed.append(params)
            self.y_observed.append(value)
            
            if value > self.best_value:
                self.best_value = value
                self.best_params = params
        
        # Bayesian optimization iterations
        for i in range(self.n_iterations):
            # In real implementation: fit GP, maximize acquisition function
            # Here: sample with probability proportional to EI estimate
            params = self._sample_random()
            value = objective_func(params)
            
            self.X_observed.append(params)
            self.y_observed.append(value)
            
            if value > self.best_value:
                self.best_value = value
                self.best_params = params
                print(f'  BO Iter {i+1}: New best {value:.4f}')
        
        return self.best_params, self.best_value

# Demonstration
if __name__ == '__main__':
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    
    # Generate data
    X, y = make_classification(
        n_samples=500, n_features=10, random_state=42
    )
    
    # Define parameter grid
    param_grid = {
        'n_estimators': [50, 100, 200],
        'max_depth': [3, 5, 10, None],
        'min_samples_split': [2, 5, 10]
    }
    
    # Grid Search
    tuner = HyperparameterTuner(
        RandomForestClassifier,
        param_grid,
        scoring='accuracy',
        cv=3
    )
    
    tuner.grid_search(X, y)
    print(f'
Grid Search Best: {tuner.best_score_:.4f}')
    print(f'Best params: {tuner.best_params_}')
    
    # Random Search
    tuner.random_search(X, y, n_iter=10)
    print(f'
Random Search Best: {tuner.best_score_:.4f}')
    print(f'Best params: {tuner.best_params_}')

Using Libraries (scikit-learn, scipy, numpy, scikit-optimize, optuna)

from sklearn.model_selection import (
    GridSearchCV, RandomizedSearchCV, cross_val_score
)
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, make_scorer
from scipy.stats import randint, uniform, loguniform
import numpy as np

# Generate dataset
print('=== HYPERPARAMETER TUNING EXAMPLES ===')
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=10,
    n_redundant=5, random_state=42
)

# GRID SEARCH
print('
=== GRID SEARCH ===')

param_grid_rf = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5, 10],
    'max_features': ['sqrt', 'log2']
}

grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid_rf,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    verbose=1
)

grid_search.fit(X, y)

print(f'Best CV Score: {grid_search.best_score_:.4f}')
print(f'Best Params: {grid_search.best_params_}')
print(f'Total fits: {len(grid_search.cv_results_["params"])}')

# RANDOM SEARCH
print('
=== RANDOMIZED SEARCH ===')

# Define distributions for random search
param_distributions_rf = {
    'n_estimators': randint(50, 500),
    'max_depth': [3, 5, 7, 10, 15, 20, None],
    'min_samples_split': randint(2, 20),
    'min_samples_leaf': randint(1, 10),
    'max_features': ['sqrt', 'log2', None]
}

random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions_rf,
    n_iter=50,  # Only try 50 combinations
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    random_state=42,
    verbose=1
)

random_search.fit(X, y)

print(f'Best CV Score: {random_search.best_score_:.4f}')
print(f'Best Params: {random_search.best_params_}')
print(f'Total fits: 50 (vs {len(grid_search.cv_results_["params"])} for grid)')

# SVM with logarithmic scale (important for C and \(\gamma\))
print('
=== SVM WITH LOG SCALE ===')

param_distributions_svm = {
    'C': loguniform(1e-3, 1e3),  # Log scale for regularization
    '\(\gamma\)': loguniform(1e-4, 1e0),  # Log scale for kernel
    'kernel': ['rbf', 'linear']
}

random_search_svm = RandomizedSearchCV(
    SVC(random_state=42),
    param_distributions_svm,
    n_iter=20,
    cv=3,
    scoring='accuracy',
    n_jobs=-1,
    random_state=42
)

random_search_svm.fit(X, y)
print(f'Best CV Score: {random_search_svm.best_score_:.4f}')
print(f'Best C: {random_search_svm.best_params_["C"]:.4f}')
print(f'Best \(\gamma\): {random_search_svm.best_params_["\(\gamma\)"]:.4f}')

# NESTED CROSS-VALIDATION (unbiased performance estimate)
print('
=== NESTED CROSS-VALIDATION ===')

from sklearn.model_selection import cross_val_score, KFold

# Outer loop: evaluate model performance
# Inner loop: hyperparameter tuning
outer_cv = KFold(n_splits=3, shuffle=True, random_state=42)
inner_cv = KFold(n_splits=3, shuffle=True, random_state=42)

nested_scores = cross_val_score(
    RandomizedSearchCV(
        RandomForestClassifier(random_state=42),
        param_distributions_rf,
        n_iter=20,
        cv=inner_cv,
        random_state=42
    ),
    X, y, cv=outer_cv, scoring='accuracy'
)

print(f'Nested CV Accuracy: {nested_scores.mean():.4f} (+/- {nested_scores.std()*2:.4f})')
print(f'Non-nested CV Accuracy: {random_search.best_score_:.4f}')
print(f'Difference shows optimistic bias of non-nested approach')

# BAYESIAN OPTIMIZATION with scikit-optimize
print('
=== BAYESIAN OPTIMIZATION (scikit-optimize) ===')

try:
    from skopt import BayesSearchCV
    from skopt.space import Real, Integer, Categorical
    
    search_spaces = {
        'n_estimators': Integer(50, 500),
        'max_depth': Integer(3, 20),
        'min_samples_split': Integer(2, 20),
        'max_features': Categorical(['sqrt', 'log2'])
    }
    
    bayes_search = BayesSearchCV(
        RandomForestClassifier(random_state=42),
        search_spaces,
        n_iter=30,
        cv=5,
        scoring='accuracy',
        n_jobs=-1,
        random_state=42
    )
    
    bayes_search.fit(X, y)
    print(f'Best CV Score: {bayes_search.best_score_:.4f}')
    print(f'Best Params: {bayes_search.best_params_}')
    
except ImportError:
    print('scikit-optimize not installed. Install with: pip install scikit-optimize')

# OPTUNA (modern Bayesian optimization)
print('
=== OPTUNA OPTIMIZATION ===')

try:
    import optuna
    from sklearn.model_selection import cross_val_score
    
    def objective(trial):
        params = {
            'n_estimators': trial.suggest_int('n_estimators', 50, 500),
            'max_depth': trial.suggest_int('max_depth', 3, 20),
            'min_samples_split': trial.suggest_int('min_samples_split', 2, 20),
            'min_samples_leaf': trial.suggest_int('min_samples_leaf', 1, 10),
            'max_features': trial.suggest_categorical('max_features', ['sqrt', 'log2'])
        }
        
        model = RandomForestClassifier(**params, random_state=42)
        score = cross_val_score(model, X, y, cv=3, scoring='accuracy').mean()
        return score
    
    study = optuna.create_study(direction='maximize')
    study.optimize(objective, n_trials=50, show_progress_bar=True)
    
    print(f'Best Score: {study.best_value:.4f}')
    print(f'Best Params: {study.best_params}')
    print(f'Number of trials: {len(study.trials)}')
    
except ImportError:
    print('optuna not installed. Install with: pip install optuna')

When to Use

✅ Appropriate Use Cases:

Grid Search: Small parameter spaces, need exhaustive coverage, computational budget allows
Random Search: Large parameter spaces, limited budget, some parameters matter more than others
Bayesian Optimization: Expensive evaluations, smooth-ish objective, need efficiency
Nested CV: Final model evaluation, unbiased performance estimate for reporting
Early Stopping (Hyperband): Very large search spaces, can identify poor configs quickly
Population-Based Training: Online learning, evolving architectures during training

❌ Avoid When:

Grid Search: More than 3-4 hyperparameters (curse of dimensionality)
Random Search: Very small parameter spaces (grid is fine)
Bayesian Optimization: Cheap evaluations, very noisy objectives, discrete-only spaces
Any tuning without proper validation: Leads to overfitting the validation set
Single train-test split: Always use cross-validation for tuning
Ignoring preprocessing: Tuning on unscaled data may give wrong conclusions

Common Pitfalls

Not scaling parameters: Use log scale for C, learning_rate (span multiple orders)
Optimistic bias: Non-nested CV overestimates performance - use nested for final eval
Too few iterations: Random/Bayesian search needs enough samples to explore
Wide ranges: Unnecessarily large search spaces waste budget
Ignoring convergence: Not checking if more iterations would help
Data leakage: Tuning on test set - always maintain train/validation/test split
Single metric: Optimize for the metric that matters for your application