Model Evaluation: Metrics, Cross-Validation, and ROC-AUC

Definition

Model evaluation is the cornerstone of machine learning practice: it provides rigorous methods to assess how well a model generalizes to unseen data. The evaluation process encompasses several techniques: hold-out validation for a quick assessment, cross-validation for more robust performance estimates, and metrics tailored to the problem. Classification metrics include accuracy (overall correctness), precision (positive predictive value), recall (sensitivity, or true positive rate), F1-score (harmonic mean of precision and recall), and ROC-AUC (area under the receiver operating characteristic curve, which measures discrimination ability). Regression metrics include Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and R² (coefficient of determination). Beyond single metrics, practitioners must understand class imbalance effects, threshold selection for probabilistic classifiers, calibration of predicted probabilities, and the bias-variance tradeoff. Sound evaluation practice guards against overfitting to the test set, helps detect data leakage, and gives better evidence that a model will perform reliably in production.

Intuition

Think of model evaluation like test-driving cars. You wouldn't buy a car after only driving on a straight, empty road - you need highways, city traffic, and rough terrain. Similarly, we evaluate models on several different splits of the data (cross-validation) and under different conditions (thresholds, class distributions). Just as fuel efficiency and safety ratings measure different aspects of cars, precision and recall measure different aspects of classifier performance. And just as a car's test performance might not match real-world driving, a model's training score often overestimates real performance - hence the need for proper validation.

Mathematical Formula

Accuracy:

\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}

Precision (Positive Predictive Value):

\text{Precision} = \frac{TP}{TP + FP}

Recall (Sensitivity, True Positive Rate):

\text{Recall} = \frac{TP}{TP + FN}

F1-Score (Harmonic Mean):

F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}

Specificity (True Negative Rate):

\text{Specificity} = \frac{TN}{TN + FP}

ROC-AUC:

\text{AUC} = \int_{0}^{1} TPR(FPR^{-1}(x)) \, dx
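
The integral above has an equivalent ranking interpretation: AUC is the probability that a randomly chosen positive receives a higher score than a randomly chosen negative, with ties counted as half. The following is a minimal plain-Python sketch of that reading; the function name `pairwise_auc` and the toy data are illustrative assumptions, and the quadratic loop is meant for clarity rather than speed.

```python
def pairwise_auc(y_true, y_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, y_scores) if t == 1]
    neg = [s for t, s in zip(y_true, y_scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Toy data (assumed for illustration): 2 positives, 2 negatives
y_true = [0, 0, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8]
auc = pairwise_auc(y_true, y_scores)
print(f"AUC = {auc:.2f}")
```

Three of the four positive-negative pairs are ordered correctly here, so the sketch reports 0.75.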

Matthews Correlation Coefficient:

MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}

R² Score (Coefficient of Determination):

R^2 = 1 - \frac{\sum(y_i - \hat{y}_i)^2}{\sum(y_i - \bar{y})^2}

Step-by-Step Explanation:

TP = True Positives, TN = True Negatives, FP = False Positives, FN = False Negatives
Precision: Of predicted positives, what fraction are actually positive?
Recall: Of actual positives, what fraction did we correctly identify?
F1 balances precision and recall; useful when you need both metrics high
Specificity: Of actual negatives, what fraction did we correctly identify?
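ROC-AUC measures the classifier's ability to rank positives above negatives across all thresholds (0.5 = random, 1.0 = perfect)
MCC is a balanced measure for binary classification, ranging from -1 (total disagreement) to +1 (perfect prediction), with 0 meaning no better than random
R² shows what fraction of the variance in y is explained by the model; it is at most 1 and can be negative when the model predicts worse than the mean of y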
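
To make the formulas above concrete, here is a small worked sketch that plugs one set of confusion-matrix counts into each definition. The counts (70 TP, 880 TN, 20 FP, 30 FN) and the helper name `metrics_from_counts` are assumptions chosen for illustration, not results from any real model.

```python
import math


def metrics_from_counts(tp, tn, fp, fn):
    """Apply the formulas above to raw confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    # MCC denominator is zero when any row or column of the matrix is empty;
    # returning 0.0 in that case is a convention assumed here.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
        "mcc": mcc,
    }


# Illustrative imbalanced case: 100 actual positives, 900 actual negatives
m = metrics_from_counts(tp=70, tn=880, fp=20, fn=30)
for name, value in m.items():
    print(f"{name:12s} {value:.3f}")
```

With these counts accuracy is 0.95 while recall is only 0.70, which shows how accuracy alone can hide weak performance on the minority class.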

Real-World Use Cases

Medical Diagnosis

Cancer screening prioritizes recall (don't miss sick patients) over precision. False negatives are costlier than false positives.
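
One way to act on a recall-first requirement is to lower the decision threshold until a target recall is met, then inspect how many false positives that costs. The sketch below assumes scores where higher means "more likely positive"; the helper names and the toy data are hypothetical.

```python
def counts_at(y_true, y_scores, threshold):
    """Return (tp, fp, fn) when predicting positive for score >= threshold."""
    tp = fp = fn = 0
    for t, s in zip(y_true, y_scores):
        predicted_positive = s >= threshold
        if predicted_positive and t == 1:
            tp += 1
        elif predicted_positive and t == 0:
            fp += 1
        elif not predicted_positive and t == 1:
            fn += 1
    return tp, fp, fn


def recall_at(y_true, y_scores, threshold):
    tp, _, fn = counts_at(y_true, y_scores, threshold)
    return tp / (tp + fn) if (tp + fn) else 0.0


def highest_threshold_for_recall(y_true, y_scores, target_recall):
    """Largest observed score that, used as threshold, meets the recall target."""
    for thr in sorted(set(y_scores), reverse=True):
        if recall_at(y_true, y_scores, thr) >= target_recall:
            return thr
    return None  # only reached when there are no positives at all


# Toy screening data (assumed): 4 sick patients, 6 healthy ones
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_scores = [0.9, 0.7, 0.45, 0.3, 0.6, 0.4, 0.35, 0.2, 0.1, 0.05]

default_recall = recall_at(y_true, y_scores, 0.5)
thr = highest_threshold_for_recall(y_true, y_scores, target_recall=1.0)
tp, fp, fn = counts_at(y_true, y_scores, thr)
print(f"recall at 0.5: {default_recall:.2f}")
print(f"threshold for full recall: {thr} (tp={tp}, fp={fp}, fn={fn})")
```

On this toy data the default threshold of 0.5 catches only half of the positives, while dropping to 0.3 catches all of them at the price of three false alarms.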

Fraud Detection

Credit card fraud detection needs both precision and recall, so F1 is a natural target: flagging too many legitimate transactions (low precision) annoys customers, while missing fraud (low recall) costs money.
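
When F1 is the target, the decision threshold can be chosen by sweeping candidate values and keeping the one with the best F1. This is a minimal sketch on hypothetical scores; in practice the sweep should run on validation data rather than the final test set, and the function names are illustrative.

```python
def f1_at(y_true, y_scores, threshold):
    """F1 when predicting positive for score >= threshold."""
    tp = fp = fn = 0
    for t, s in zip(y_true, y_scores):
        pred = 1 if s >= threshold else 0
        if pred == 1 and t == 1:
            tp += 1
        elif pred == 1 and t == 0:
            fp += 1
        elif pred == 0 and t == 1:
            fn += 1
    # 2TP / (2TP + FP + FN) is algebraically the harmonic mean of
    # precision and recall
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_f1_threshold(y_true, y_scores):
    """Try every observed score as a threshold and keep the best one."""
    candidates = sorted(set(y_scores))
    return max(candidates, key=lambda thr: f1_at(y_true, y_scores, thr))


# Toy fraud scores (assumed): 3 fraudulent, 7 legitimate transactions
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_scores = [0.9, 0.6, 0.4, 0.55, 0.3, 0.2, 0.15, 0.1, 0.05, 0.02]

best = best_f1_threshold(y_true, y_scores)
print(f"F1 at 0.5:  {f1_at(y_true, y_scores, 0.5):.3f}")
print(f"F1 at {best}: {f1_at(y_true, y_scores, best):.3f}")
```

Here the tuned threshold (0.4) recovers the third fraud case and lifts F1 from about 0.667 to about 0.857 on the toy data.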

Search Engines

Precision at k measures relevance of top-k results. Users care about first page quality, not total recall.
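
Precision at k can be sketched in a few lines. The example assumes the results are already sorted by the ranker and labeled with 0/1 relevance flags; dividing by k even when fewer than k results exist is one common convention, assumed here.

```python
def precision_at_k(relevance, k):
    """Fraction of the top-k ranked results that are relevant.

    relevance: list of 0/1 flags, ordered from best-ranked to worst-ranked.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(relevance[:k]) / k


# Hypothetical relevance judgments for the first 10 results of a query
ranked_relevance = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]

for k in (3, 5, 10):
    print(f"P@{k} = {precision_at_k(ranked_relevance, k):.2f}")
```

The value drops from about 0.67 at k=3 to 0.40 at k=10, reflecting that this hypothetical ranker places most relevant results near the top.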

Manufacturing

Predictive maintenance uses ROC-AUC to evaluate models detecting equipment failures before they occur.

Implementation

Manual Implementation (No Libraries)

This implementation covers the fundamental evaluation metrics for binary classification. The confusion matrix forms the basis for all derived metrics. ROC-AUC measures classifier discrimination across thresholds. Cross-validation provides robust estimates by averaging performance across multiple train-test splits.

import numpy as np

class ModelEvaluator:
    """
    Manual implementation of common classification metrics.
    """
    
    @staticmethod
    def confusion_matrix(y_true, y_pred):
        """
        Compute confusion matrix.
        Returns: [[TN, FP], [FN, TP]] for binary case
        """
        classes = sorted(set(y_true) | set(y_pred))
        n_classes = len(classes)
        cm = np.zeros((n_classes, n_classes), dtype=int)
        
        class_to_idx = {c: i for i, c in enumerate(classes)}
        
        for true, pred in zip(y_true, y_pred):
            cm[class_to_idx[true], class_to_idx[pred]] += 1
        
        return cm
    
    @staticmethod
    def classification_metrics(y_true, y_pred):
        """Compute comprehensive classification metrics."""
        cm = ModelEvaluator.confusion_matrix(y_true, y_pred)
        
        if cm.shape == (2, 2):  # Binary classification
            tn, fp = cm[0, 0], cm[0, 1]
            fn, tp = cm[1, 0], cm[1, 1]
            
            metrics = {
                'accuracy': (tp + tn) / (tp + tn + fp + fn),
                'precision': tp / (tp + fp) if (tp + fp) > 0 else 0,
                'recall': tp / (tp + fn) if (tp + fn) > 0 else 0,
                'specificity': tn / (tn + fp) if (tn + fp) > 0 else 0,
                'f1_score': 0,
                'true_positives': tp,
                'true_negatives': tn,
                'false_positives': fp,
                'false_negatives': fn
            }
            
            # F1 score
            if metrics['precision'] + metrics['recall'] > 0:
                metrics['f1_score'] = 2 * metrics['precision'] * metrics['recall'] / \
                                      (metrics['precision'] + metrics['recall'])
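        else:
            # Multi-class: compute per-class metrics, then macro-average them
            metrics = {'per_class': {}, 'macro': {}}
            
            for i in range(cm.shape[0]):
                tp = cm[i, i]
                fp = np.sum(cm[:, i]) - tp  # predicted as class i, but wrong
                fn = np.sum(cm[i, :]) - tp  # actually class i, but missed
                
                precision = tp / (tp + fp) if (tp + fp) > 0 else 0
                recall = tp / (tp + fn) if (tp + fn) > 0 else 0
                f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0
                
                # Keys are confusion-matrix row indices (classes in sorted order)
                metrics['per_class'][i] = {
                    'precision': precision,
                    'recall': recall,
                    'f1': f1
                }
            
            # Macro average: unweighted mean of the per-class values
            for name in ('precision', 'recall', 'f1'):
                metrics['macro'][name] = float(np.mean(
                    [m[name] for m in metrics['per_class'].values()]
                ))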
        
        return metrics
    
    @staticmethod
    def roc_curve(y_true, y_scores):
        """
        Compute ROC curve points for binary labels in {0, 1}.
        Returns: fpr_list, tpr_list, thresholds (thresholds in descending order)
        """
        y_true = np.asarray(y_true)
        y_scores = np.asarray(y_scores)
        
        # Use each unique score as a threshold, highest first, so that FPR and
        # TPR are non-decreasing; the extra leading threshold gives the (0, 0) point
        thresholds = np.unique(y_scores)[::-1]
        thresholds = np.concatenate([[thresholds[0] + 1], thresholds])
        
        tprs = []
        fprs = []
        
        n_pos = np.sum(y_true == 1)
        n_neg = len(y_true) - n_pos
        
        for threshold in thresholds:
            y_pred = (y_scores >= threshold).astype(int)
            tp = np.sum((y_true == 1) & (y_pred == 1))
            fp = np.sum((y_true == 0) & (y_pred == 1))
            
            tpr = tp / n_pos if n_pos > 0 else 0
            fpr = fp / n_neg if n_neg > 0 else 0
            
            tprs.append(tpr)
            fprs.append(fpr)
        
        return np.array(fprs), np.array(tprs), thresholds
    
    @staticmethod
    def auc_roc(fpr, tpr):
        """Compute AUC using the trapezoidal rule."""
        fpr, tpr = np.asarray(fpr), np.asarray(tpr)
        # Sort by FPR so the area is positive regardless of input order
        order = np.argsort(fpr, kind='stable')
        fpr, tpr = fpr[order], np.maximum.accumulate(tpr[order])
        return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))
    
    @staticmethod
    def cross_validation_split(X, y, n_folds=5, shuffle=True, random_state=None):
        """Generate k-fold cross-validation indices."""
        n_samples = len(y)
        indices = np.arange(n_samples)
        
        if shuffle:
            # A local generator avoids mutating NumPy's global random state
            rng = np.random.default_rng(random_state)
            rng.shuffle(indices)
        
        fold_sizes = np.full(n_folds, n_samples // n_folds)
        fold_sizes[:n_samples % n_folds] += 1
        
        current = 0
        for fold_size in fold_sizes:
            start, stop = current, current + fold_size
            val_idx = indices[start:stop]
            train_idx = np.concatenate([indices[:start], indices[stop:]])
            yield train_idx, val_idx
            current = stop

# Demonstration
if __name__ == '__main__':
    from sklearn.datasets import make_classification
    
    # Generate imbalanced binary classification data
    X, y = make_classification(
        n_samples=1000, n_classes=2, weights=[0.9, 0.1],
        n_features=10, random_state=42
    )
    
    # Simulate predictions (some bias toward majority class)
    y_pred = y.copy()
    # Introduce errors: miss 30% of positive class
    pos_idx = np.where(y == 1)[0]
    np.random.seed(42)
    false_neg_idx = np.random.choice(pos_idx, size=int(0.3 * len(pos_idx)), replace=False)
    y_pred[false_neg_idx] = 0
    # Add some false positives
    neg_idx = np.where(y == 0)[0]
    false_pos_idx = np.random.choice(neg_idx, size=20, replace=False)
    y_pred[false_pos_idx] = 1
    
    # Evaluate
    evaluator = ModelEvaluator()
    metrics = evaluator.classification_metrics(y, y_pred)
    
    print('Classification Metrics:')
    print(f'  Accuracy:  {metrics["accuracy"]:.3f}')
    print(f'  Precision: {metrics["precision"]:.3f}')
    print(f'  Recall:    {metrics["recall"]:.3f}')
    print(f'  F1-Score:  {metrics["f1_score"]:.3f}')
    print(f'  Specificity: {metrics["specificity"]:.3f}')

Using Libraries (scikit-learn, numpy, matplotlib)

from sklearn.model_selection import (
    train_test_split, cross_val_score, cross_validate,
    StratifiedKFold, GridSearchCV, learning_curve, validation_curve
)
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report, roc_auc_score,
    roc_curve, precision_recall_curve, average_precision_score,
    mean_squared_error, mean_absolute_error, r2_score,
    matthews_corrcoef, cohen_kappa_score, log_loss,
    balanced_accuracy_score
)
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification, make_regression
import numpy as np
import matplotlib.pyplot as plt

# Generate classification dataset
print('=== CLASSIFICATION METRICS ===')
X, y = make_classification(
    n_samples=2000, n_classes=2, weights=[0.8, 0.2],
    n_features=20, n_informative=10, random_state=42
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Train model
clf = LogisticRegression(random_state=42)
clf.fit(X_train, y_train)

# Predictions
y_pred = clf.predict(X_test)
y_prob = clf.predict_proba(X_test)[:, 1]

# Basic metrics
print(f'Accuracy: {accuracy_score(y_test, y_pred):.3f}')
print(f'Balanced Accuracy: {balanced_accuracy_score(y_test, y_pred):.3f}')
print(f'Precision: {precision_score(y_test, y_pred):.3f}')
print(f'Recall: {recall_score(y_test, y_pred):.3f}')
print(f'F1-Score: {f1_score(y_test, y_pred):.3f}')
print(f'ROC-AUC: {roc_auc_score(y_test, y_prob):.3f}')
print(f'Average Precision: {average_precision_score(y_test, y_prob):.3f}')

# Advanced metrics
print(f'\nMatthews Correlation Coefficient: {matthews_corrcoef(y_test, y_pred):.3f}')
print(f'Cohen Kappa: {cohen_kappa_score(y_test, y_pred):.3f}')
print(f'Log Loss: {log_loss(y_test, y_prob):.3f}')

# Classification report
print('\n=== CLASSIFICATION REPORT ===')
print(classification_report(y_test, y_pred, target_names=['Class 0', 'Class 1']))

# Confusion matrix
print('\nConfusion Matrix:')
cm = confusion_matrix(y_test, y_pred)
print(cm)

# CROSS-VALIDATION
print('\n=== CROSS-VALIDATION ===')

# Simple k-fold CV
cv_scores = cross_val_score(clf, X, y, cv=5, scoring='roc_auc')
print(f'5-Fold CV ROC-AUC: {cv_scores.mean():.3f} (+/- {cv_scores.std()*2:.3f})')

# Multiple metrics with cross_validate
scoring = ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']
cv_results = cross_validate(clf, X, y, cv=StratifiedKFold(n_splits=5), scoring=scoring)

print('\nStratified CV Results:')
for metric in scoring:
    scores = cv_results[f'test_{metric}']
    print(f'  {metric}: {scores.mean():.3f} (+/- {scores.std()*2:.3f})')

# REGRESSION METRICS
print('\n=== REGRESSION METRICS ===')
X_reg, y_reg = make_regression(
    n_samples=1000, n_features=10, noise=10, random_state=42
)

X_train_r, X_test_r, y_train_r, y_test_r = train_test_split(
    X_reg, y_reg, test_size=0.2, random_state=42
)

from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(X_train_r, y_train_r)
y_pred_r = reg.predict(X_test_r)

mse = mean_squared_error(y_test_r, y_pred_r)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test_r, y_pred_r)
r2 = r2_score(y_test_r, y_pred_r)

print(f'MSE: {mse:.2f}')
print(f'RMSE: {rmse:.2f}')
print(f'MAE: {mae:.2f}')
print(f'R²: {r2:.3f}')

# LEARNING CURVES
print('\n=== LEARNING CURVES ===')

train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(random_state=42), X, y,
    cv=5, train_sizes=np.linspace(0.1, 1.0, 10),
    scoring='roc_auc'
)

print(f'Training sizes: {train_sizes}')
print(f'Mean train scores: {train_scores.mean(axis=1)}')
print(f'Mean val scores: {val_scores.mean(axis=1)}')

# VALIDATION CURVE (effect of hyperparameter)
print('\n=== VALIDATION CURVE ===')
param_range = [0.001, 0.01, 0.1, 1, 10, 100]
train_scores_vc, val_scores_vc = validation_curve(
    LogisticRegression(random_state=42), X, y,
    param_name='C', param_range=param_range,
    cv=5, scoring='roc_auc'
)

print('C parameter effect on ROC-AUC:')
for c, train_mean, val_mean in zip(param_range, train_scores_vc.mean(axis=1), val_scores_vc.mean(axis=1)):
    print(f'  C={c}: Train={train_mean:.3f}, Val={val_mean:.3f}')

When to Use

✅ Appropriate Use Cases:

Accuracy: Balanced classes, equal misclassification costs
Precision: When false positives are costly (spam detection)
Recall: When false negatives are costly (medical screening)
F1-Score: Need balance between precision and recall
ROC-AUC: Comparing models, threshold-independent evaluation
MCC: Imbalanced data, need balanced measure
Cross-validation: Small datasets, robust performance estimation
Stratified CV: Imbalanced classes, maintain class proportions
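
To illustrate what "maintain class proportions" means for stratified cross-validation, here is a minimal standard-library sketch that deals the indices of each class round-robin across folds. It is a simplified stand-in, not scikit-learn's `StratifiedKFold` algorithm; the function name and the 10%-positive toy labels are assumptions.

```python
import random
from collections import defaultdict


def stratified_folds(y, n_folds=5, seed=0):
    """Assign sample indices to folds so each fold keeps similar class ratios."""
    rng = random.Random(seed)

    # Group sample indices by class label
    by_class = defaultdict(list)
    for idx, label in enumerate(y):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal this class's indices round-robin so every fold gets a
        # near-equal share of the class
        for position, idx in enumerate(indices):
            folds[position % n_folds].append(idx)
    return folds


# Imbalanced toy labels: 10 positives, 90 negatives
y = [1] * 10 + [0] * 90
folds = stratified_folds(y, n_folds=5, seed=42)
positives_per_fold = [sum(y[i] for i in fold) for fold in folds]
print("fold sizes:        ", [len(fold) for fold in folds])
print("positives per fold:", positives_per_fold)
```

Each fold serves once as the validation set while the remaining folds form the training set. With plain shuffling, a fold could end up with very few or no positives, which makes recall and ROC-AUC on that fold unstable or undefined.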

❌ Avoid When:

Accuracy: Imbalanced classes (can be misleadingly high)
Precision alone: When missing positives is critical
Recall alone: When false alarm rate matters
ROC-AUC: Severely imbalanced data where the rare positive class matters most (use PR-AUC instead)
Single train-test split: Small datasets (high variance)
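R²: As the only measure of fit (a linear model on a non-linear relationship can score low, and R² turns negative when predictions are worse than always predicting the mean)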
MSE: When outliers should not be heavily penalized (use MAE)
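
The last two points can be checked with a few lines of arithmetic. The sketch below uses made-up numbers to show how a single large error inflates MSE far more than MAE, and how R² becomes negative for predictions that are worse than the mean baseline; the helper names are illustrative.

```python
def mse(y_true, y_pred):
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    ss_tot = sum((a - mean_y) ** 2 for a in y_true)
    return 1 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0, 5.0]
small_errors = [1.5, 2.5, 3.5, 4.5, 5.5]   # every prediction off by 0.5
one_outlier = [1.0, 2.0, 3.0, 4.0, 10.0]   # exact, except one error of 5
reversed_pred = [5.0, 4.0, 3.0, 2.0, 1.0]  # worse than predicting the mean

print("small errors: MSE", mse(y_true, small_errors), "MAE", mae(y_true, small_errors))
print("one outlier:  MSE", mse(y_true, one_outlier), "MAE", mae(y_true, one_outlier))
print("R² of reversed predictions:", r2(y_true, reversed_pred))
```

Moving from uniform small errors to a single large one multiplies MSE by 20 (0.25 to 5.0) but MAE only by 2 (0.5 to 1.0), and the reversed predictions give an R² of -3.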

Common Pitfalls

Optimizing for wrong metric: Choose metric aligned with business goal
Not using stratification: Imbalanced classes need stratified splits
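Data leakage: Fitting preprocessing (scaling, imputation, feature selection) on the full data before splitting lets test information leak into training
Overfitting the validation set: Repeated tuning against the same validation data makes its score optimistic; keep a final untouched test set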
Single metric focus: Use multiple metrics for complete picture
Ignoring calibration: Good discrimination doesn't mean good probabilities
Threshold default: Default 0.5 may not be optimal - tune for your metric
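
The data leakage pitfall is easiest to see with feature scaling. In this minimal sketch, the scaling statistics are computed once on all data (leaky) and once on the training portion only (correct); the tiny dataset and helper names are assumptions for illustration. With scikit-learn, wrapping the preprocessing and the model in a `Pipeline` and passing that to cross-validation achieves the same separation.

```python
import statistics


def fit_scaler(values):
    """Learn standardization parameters (mean, std) from the given values."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return mean, (std if std > 0 else 1.0)


def transform(values, mean, std):
    return [(v - mean) / std for v in values]


train = [1.0, 2.0, 3.0, 4.0]
test = [100.0]  # an extreme value the model should know nothing about

# Leaky: statistics computed on train + test, so the test point
# shifts the mean used to scale the training data
leaky_mean, leaky_std = fit_scaler(train + test)

# Correct: fit on the training portion only, then reuse for the test data
mean, std = fit_scaler(train)
train_scaled = transform(train, mean, std)
test_scaled = transform(test, mean, std)

print(f"leaky mean:   {leaky_mean:.1f}")
print(f"correct mean: {mean:.1f}")
print(f"test point after correct scaling: {test_scaled[0]:.1f}")
```

The same rule applies inside cross-validation: the scaler is refit on the training folds of every split.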