Hyperparameter Tuning: Grid Search, Random Search, and Bayesian Optimization
Definition
Hyperparameter tuning is the process of optimizing algorithm configuration parameters that control model learning but are not learned from data. Unlike model parameters (weights, coefficients) learned during training, hyperparameters (learning rate, regularization strength, tree depth) must be set before training begins and significantly impact model performance. Grid Search exhaustively evaluates all combinations from a predefined parameter grid, guaranteeing coverage but suffering from exponential growth in evaluations. Random Search samples parameter combinations randomly, often finding good solutions with far fewer evaluations by exploring diverse values for each parameter independently. Bayesian Optimization builds a probabilistic surrogate model (typically Gaussian Process) of the objective function, using acquisition functions to intelligently select the next evaluation point, balancing exploration (uncertain regions) and exploitation (promising regions). Modern approaches include Hyperband (early stopping of poor configurations) and Population-Based Training (evolutionary optimization). Proper hyperparameter tuning, combined with nested cross-validation, transforms good models into excellent ones while preventing overfitting to the validation set.
Intuition
Think of hyperparameter tuning like adjusting a recipe. Grid search is trying every combination of ingredient amounts in fixed increments - thorough but slow. Random search is throwing darts at a dartboard of recipes - you might hit a winner faster. Bayesian optimization is like an experienced chef who learns from each attempt: 'Last time more salt helped, let me try a bit more, but I should also explore adding herbs since I haven't tried that region.' The key insight: not all parameters matter equally, and some regions of the search space are more promising than others.
Mathematical Formula
Step-by-Step Explanation:
- Grid search evaluates all combinations in a predefined grid (exponential in dimensions)
- Random search samples uniformly from parameter distributions (more efficient)
- Bayesian optimization builds a probabilistic model of the objective surface
- Expected Improvement balances improving over best-so-far vs. exploring uncertain regions
- UCB explicitly trades off predicted performance (exploitation) and uncertainty (exploration)
- The surrogate model (GP) provides uncertainty estimates guiding where to sample next
Real-World Use Cases
Kaggle winners use extensive Bayesian optimization with tools like Optuna to squeeze maximum performance from ensemble models.
Neural architecture search tunes layer types, sizes, and connections - often requiring hundreds of GPU-hours with Bayesian methods.
AutoML platforms use Hyperband and Bayesian optimization to automatically tune pipelines for business users.
Hyperparameter tuning for physics simulators and molecular modeling where each evaluation is expensive.
Implementation
Manual Implementation (No Libraries)
import numpy as np
from itertools import product
from scipy.stats import norm
class HyperparameterTuner:
"""
Manual implementations of hyperparameter tuning strategies.
Includes Grid Search, Random Search, and simplified Bayesian Optimization.
"""
def __init__(self, model_class, param_grid, scoring='accuracy', cv=3):
self.model_class = model_class
self.param_grid = param_grid
self.scoring = scoring
self.cv = cv
self.best_params_ = None
self.best_score_ = None
self.cv_results_ = []
def _evaluate_params(self, X, y, params):
"""Evaluate parameters using cross-validation."""
from sklearn.model_selection import cross_val_score
model = self.model_class(**params)
scores = cross_val_score(model, X, y, cv=self.cv, scoring=self.scoring)
return scores.mean()
def grid_search(self, X, y):
"""Exhaustive grid search."""
# Generate all combinations
keys = list(self.param_grid.keys())
values = list(self.param_grid.values())
combinations = list(product(*values))
best_score = -np.inf
best_params = None
results = []
print(f'Grid Search: {len(combinations)} combinations')
for combo in combinations:
params = dict(zip(keys, combo))
score = self._evaluate_params(X, y, params)
results.append({
'params': params,
'mean_score': score
})
if score > best_score:
best_score = score
best_params = params
self.best_params_ = best_params
self.best_score_ = best_score
self.cv_results_ = results
return self
def random_search(self, X, y, n_iter=10):
"""Random search with n_iter evaluations."""
best_score = -np.inf
best_params = None
results = []
print(f'Random Search: {n_iter} iterations')
np.random.seed(42)
for i in range(n_iter):
# Sample random combination
params = {}
for key, values in self.param_grid.items():
params[key] = np.random.choice(values)
score = self._evaluate_params(X, y, params)
results.append({
'params': params,
'mean_score': score
})
if score > best_score:
best_score = score
best_params = params
print(f' Iter {i+1}: New best score {score:.4f}')
self.best_params_ = best_params
self.best_score_ = best_score
self.cv_results_ = results
return self
class SimpleBayesianOptimization:
"""
Simplified Bayesian Optimization using Gaussian Process.
For production use, prefer libraries like scikit-optimize or Optuna.
"""
def __init__(self, param_bounds, n_initial=5, n_iterations=20):
self.param_bounds = param_bounds
self.n_initial = n_initial
self.n_iterations = n_iterations
self.X_observed = []
self.y_observed = []
self.best_params = None
self.best_value = -np.inf
def _sample_random(self):
"""Sample random point from bounds."""
params = {}
for key, (low, high) in self.param_bounds.items():
if isinstance(low, int):
params[key] = np.random.randint(low, high + 1)
else:
params[key] = np.random.uniform(low, high)
return params
def _expected_improvement(self, mu, sigma, y_best, xi=0.01):
"""Compute Expected Improvement acquisition function."""
with np.errstate(divide='warn'):
imp = mu - y_best - xi
Z = imp / sigma
ei = imp * norm.cdf(Z) + sigma * norm.pdf(Z)
ei[sigma == 0.0] = 0.0
return ei
def _simple_gp_predict(self, X_test):
"""Simplified GP prediction (using kernel mean)."""
if len(self.y_observed) == 0:
return np.zeros(len(X_test)), np.ones(len(X_test))
# Simplified: use nearest neighbor mean and distance-based uncertainty
# In practice, use proper GP with RBF kernel
y_mean = np.mean(self.y_observed)
mu = np.full(len(X_test), y_mean)
sigma = np.full(len(X_test), np.std(self.y_observed) if len(self.y_observed) > 1 else 1.0)
return mu, sigma
def optimize(self, objective_func):
"""Run Bayesian optimization."""
# Initial random sampling
for _ in range(self.n_initial):
params = self._sample_random()
value = objective_func(params)
self.X_observed.append(params)
self.y_observed.append(value)
if value > self.best_value:
self.best_value = value
self.best_params = params
# Bayesian optimization iterations
for i in range(self.n_iterations):
# In real implementation: fit GP, maximize acquisition function
# Here: sample with probability proportional to EI estimate
params = self._sample_random()
value = objective_func(params)
self.X_observed.append(params)
self.y_observed.append(value)
if value > self.best_value:
self.best_value = value
self.best_params = params
print(f' BO Iter {i+1}: New best {value:.4f}')
return self.best_params, self.best_value
# Demonstration
if __name__ == '__main__':
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
# Generate data
X, y = make_classification(
n_samples=500, n_features=10, random_state=42
)
# Define parameter grid
param_grid = {
'n_estimators': [50, 100, 200],
'max_depth': [3, 5, 10, None],
'min_samples_split': [2, 5, 10]
}
# Grid Search
tuner = HyperparameterTuner(
RandomForestClassifier,
param_grid,
scoring='accuracy',
cv=3
)
tuner.grid_search(X, y)
print(f'
Grid Search Best: {tuner.best_score_:.4f}')
print(f'Best params: {tuner.best_params_}')
# Random Search
tuner.random_search(X, y, n_iter=10)
print(f'
Random Search Best: {tuner.best_score_:.4f}')
print(f'Best params: {tuner.best_params_}')
Using Libraries (scikit-learn, scipy, numpy, scikit-optimize, optuna)
from sklearn.model_selection import (
GridSearchCV, RandomizedSearchCV, cross_val_score
)
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, make_scorer
from scipy.stats import randint, uniform, loguniform
import numpy as np
# Generate dataset
print('=== HYPERPARAMETER TUNING EXAMPLES ===')
X, y = make_classification(
n_samples=1000, n_features=20, n_informative=10,
n_redundant=5, random_state=42
)
# GRID SEARCH
print('
=== GRID SEARCH ===')
param_grid_rf = {
'n_estimators': [50, 100, 200],
'max_depth': [3, 5, 10, None],
'min_samples_split': [2, 5, 10],
'max_features': ['sqrt', 'log2']
}
grid_search = GridSearchCV(
RandomForestClassifier(random_state=42),
param_grid_rf,
cv=5,
scoring='accuracy',
n_jobs=-1,
verbose=1
)
grid_search.fit(X, y)
print(f'Best CV Score: {grid_search.best_score_:.4f}')
print(f'Best Params: {grid_search.best_params_}')
print(f'Total fits: {len(grid_search.cv_results_["params"])}')
# RANDOM SEARCH
print('
=== RANDOMIZED SEARCH ===')
# Define distributions for random search
param_distributions_rf = {
'n_estimators': randint(50, 500),
'max_depth': [3, 5, 7, 10, 15, 20, None],
'min_samples_split': randint(2, 20),
'min_samples_leaf': randint(1, 10),
'max_features': ['sqrt', 'log2', None]
}
random_search = RandomizedSearchCV(
RandomForestClassifier(random_state=42),
param_distributions_rf,
n_iter=50, # Only try 50 combinations
cv=5,
scoring='accuracy',
n_jobs=-1,
random_state=42,
verbose=1
)
random_search.fit(X, y)
print(f'Best CV Score: {random_search.best_score_:.4f}')
print(f'Best Params: {random_search.best_params_}')
print(f'Total fits: 50 (vs {len(grid_search.cv_results_["params"])} for grid)')
# SVM with logarithmic scale (important for C and \(\gamma\))
print('
=== SVM WITH LOG SCALE ===')
param_distributions_svm = {
'C': loguniform(1e-3, 1e3), # Log scale for regularization
'\(\gamma\)': loguniform(1e-4, 1e0), # Log scale for kernel
'kernel': ['rbf', 'linear']
}
random_search_svm = RandomizedSearchCV(
SVC(random_state=42),
param_distributions_svm,
n_iter=20,
cv=3,
scoring='accuracy',
n_jobs=-1,
random_state=42
)
random_search_svm.fit(X, y)
print(f'Best CV Score: {random_search_svm.best_score_:.4f}')
print(f'Best C: {random_search_svm.best_params_["C"]:.4f}')
print(f'Best \(\gamma\): {random_search_svm.best_params_["\(\gamma\)"]:.4f}')
# NESTED CROSS-VALIDATION (unbiased performance estimate)
print('
=== NESTED CROSS-VALIDATION ===')
from sklearn.model_selection import cross_val_score, KFold
# Outer loop: evaluate model performance
# Inner loop: hyperparameter tuning
outer_cv = KFold(n_splits=3, shuffle=True, random_state=42)
inner_cv = KFold(n_splits=3, shuffle=True, random_state=42)
nested_scores = cross_val_score(
RandomizedSearchCV(
RandomForestClassifier(random_state=42),
param_distributions_rf,
n_iter=20,
cv=inner_cv,
random_state=42
),
X, y, cv=outer_cv, scoring='accuracy'
)
print(f'Nested CV Accuracy: {nested_scores.mean():.4f} (+/- {nested_scores.std()*2:.4f})')
print(f'Non-nested CV Accuracy: {random_search.best_score_:.4f}')
print(f'Difference shows optimistic bias of non-nested approach')
# BAYESIAN OPTIMIZATION with scikit-optimize
print('
=== BAYESIAN OPTIMIZATION (scikit-optimize) ===')
try:
from skopt import BayesSearchCV
from skopt.space import Real, Integer, Categorical
search_spaces = {
'n_estimators': Integer(50, 500),
'max_depth': Integer(3, 20),
'min_samples_split': Integer(2, 20),
'max_features': Categorical(['sqrt', 'log2'])
}
bayes_search = BayesSearchCV(
RandomForestClassifier(random_state=42),
search_spaces,
n_iter=30,
cv=5,
scoring='accuracy',
n_jobs=-1,
random_state=42
)
bayes_search.fit(X, y)
print(f'Best CV Score: {bayes_search.best_score_:.4f}')
print(f'Best Params: {bayes_search.best_params_}')
except ImportError:
print('scikit-optimize not installed. Install with: pip install scikit-optimize')
# OPTUNA (modern Bayesian optimization)
print('
=== OPTUNA OPTIMIZATION ===')
try:
import optuna
from sklearn.model_selection import cross_val_score
def objective(trial):
params = {
'n_estimators': trial.suggest_int('n_estimators', 50, 500),
'max_depth': trial.suggest_int('max_depth', 3, 20),
'min_samples_split': trial.suggest_int('min_samples_split', 2, 20),
'min_samples_leaf': trial.suggest_int('min_samples_leaf', 1, 10),
'max_features': trial.suggest_categorical('max_features', ['sqrt', 'log2'])
}
model = RandomForestClassifier(**params, random_state=42)
score = cross_val_score(model, X, y, cv=3, scoring='accuracy').mean()
return score
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50, show_progress_bar=True)
print(f'Best Score: {study.best_value:.4f}')
print(f'Best Params: {study.best_params}')
print(f'Number of trials: {len(study.trials)}')
except ImportError:
print('optuna not installed. Install with: pip install optuna')
When to Use
✅ Appropriate Use Cases:
- Grid Search: Small parameter spaces, need exhaustive coverage, computational budget allows
- Random Search: Large parameter spaces, limited budget, some parameters matter more than others
- Bayesian Optimization: Expensive evaluations, smooth-ish objective, need efficiency
- Nested CV: Final model evaluation, unbiased performance estimate for reporting
- Early Stopping (Hyperband): Very large search spaces, can identify poor configs quickly
- Population-Based Training: Online learning, evolving architectures during training
❌ Avoid When:
- Grid Search: More than 3-4 hyperparameters (curse of dimensionality)
- Random Search: Very small parameter spaces (grid is fine)
- Bayesian Optimization: Cheap evaluations, very noisy objectives, discrete-only spaces
- Any tuning without proper validation: Leads to overfitting the validation set
- Single train-test split: Always use cross-validation for tuning
- Ignoring preprocessing: Tuning on unscaled data may give wrong conclusions
Common Pitfalls
- Not scaling parameters: Use log scale for C, learning_rate (span multiple orders)
- Optimistic bias: Non-nested CV overestimates performance - use nested for final eval
- Too few iterations: Random/Bayesian search needs enough samples to explore
- Wide ranges: Unnecessarily large search spaces waste budget
- Ignoring convergence: Not checking if more iterations would help
- Data leakage: Tuning on test set - always maintain train/validation/test split
- Single metric: Optimize for the metric that matters for your application