Logistic Regression: Classification, Sigmoid, and Odds Ratios
Definition
Logistic regression is a foundational classification algorithm in machine learning, despite a name that suggests regression. It models the probability that an input belongs to a particular class using the logistic (sigmoid) function, which maps any real number into the range (0, 1). The coefficients are estimated by Maximum Likelihood Estimation (MLE): the fitted parameters are those that maximize the probability of observing the training labels. For binary classification with the usual 0.5 threshold, the decision boundary lies where the predicted probability equals 0.5, which is a linear separator (a hyperplane) in feature space. Because it is trained on the log-likelihood, logistic regression tends to produce reasonably well-calibrated probabilities when the model is well specified, which makes it useful for applications that need confidence scores. The log-odds (logit) of the predicted probability is a linear function of the features, which enables interpretation through odds ratios: the multiplicative change in the odds for a one-unit increase in a feature, holding the other features fixed.
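To make the link between probability, odds, and log-odds concrete, here is a minimal sketch in plain Python. The coefficient `beta = 0.7` and the starting log-odds `z_before = -1.0` are made-up values chosen for illustration, not estimates from any dataset.

```python
import math

def sigmoid(z):
    """Map a real-valued score (log-odds) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Inverse of the sigmoid: probability -> log-odds."""
    return math.log(p / (1.0 - p))

# Hypothetical coefficient for one feature (illustrative value only)
beta = 0.7
odds_ratio = math.exp(beta)

# Log-odds before and after a one-unit increase in that feature
z_before = -1.0
z_after = z_before + beta

p_before = sigmoid(z_before)
p_after = sigmoid(z_after)
odds_before = p_before / (1.0 - p_before)
odds_after = p_after / (1.0 - p_after)

print(f"sigmoid(0) = {sigmoid(0.0):.2f}")  # a score of 0 sits on the decision boundary
print(f"odds ratio exp(beta) = {odds_ratio:.3f}")
print(f"odds_after / odds_before = {odds_after / odds_before:.3f}")
print(f"probability moved from {p_before:.3f} to {p_after:.3f}")
```

The ratio of the odds always equals `exp(beta)`, whereas the change in probability depends on where on the sigmoid curve the starting point sits.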
Intuition
Imagine a dimmer switch that gradually transitions from off to on. Logistic regression is like finding the position where the switch flips, but instead of a sharp cutoff, it gives you a smooth probability curve. Think of it as drawing a line through your data that doesn't just predict categories, but tells you how confident it is - 'I'm 90% sure this is spam' versus 'This could go either way at 51%'.
Mathematical Formula
Step-by-Step Explanation:
- The sigmoid function squishes any value into a probability between 0 and 1
- Log-odds (logit) is linear in features, making it interpretable like linear regression
- Cross-entropy loss penalizes confident wrong predictions more heavily than uncertain ones
- Coefficients \(\beta\) represent the change in log-odds per one-unit change in the corresponding feature, holding the other features fixed
- The odds ratio \(e^{\beta}\) tells us by what factor the odds multiply for a one-unit increase in that feature
- The decision boundary is linear - a hyperplane in feature space
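The quantities described in the list above can be written compactly as follows. The notation is assumed here rather than taken from the original: \(\beta_0\) is the intercept, \(\beta\) the coefficient vector, \(\hat{p}_i\) the predicted probability for sample \(i\), and \(n\) the number of samples.

```latex
\begin{align}
\sigma(z) &= \frac{1}{1 + e^{-z}}, \qquad z = \beta_0 + \beta^\top x \\
P(y = 1 \mid x) &= \sigma(\beta_0 + \beta^\top x) \\
\operatorname{logit}(p) &= \log\frac{p}{1 - p} = \beta_0 + \beta^\top x \\
\mathcal{L}(\beta_0, \beta) &= -\frac{1}{n} \sum_{i=1}^{n}
  \left[ y_i \log \hat{p}_i + (1 - y_i) \log(1 - \hat{p}_i) \right] \\
\mathrm{OR}_j &= e^{\beta_j}
\end{align}
```

Setting \(P(y = 1 \mid x) = 0.5\) gives \(\beta_0 + \beta^\top x = 0\), which is the equation of the hyperplane that forms the decision boundary.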
Real-World Use Cases
- Predicting diabetes risk based on age, BMI, blood pressure, and glucose levels. Odds ratios show which factors most increase risk.
- Credit default prediction using income, debt-to-income ratio, and credit history. Output probabilities inform interest rate pricing.
- Customer churn prediction identifying subscribers likely to cancel. Probabilities prioritize retention efforts.
- Fraud detection flagging suspicious transactions. The probability threshold can be tuned to balance missed fraud against false alarms.
Implementation
Manual Implementation (No Libraries)
import numpy as np


class LogisticRegression:
    """
    Manual implementation of Logistic Regression using gradient descent.
    Demonstrates the core mathematics: sigmoid, log-loss, and optimization.
    """

    def __init__(self, learning_rate=0.1, max_iter=1000, tol=1e-6):
        self.lr = learning_rate
        self.max_iter = max_iter
        self.tol = tol
        self.weights = None
        self.bias = None
        self.loss_history = []

    def _sigmoid(self, z):
        """Numerically stable sigmoid function."""
        # Clip z to prevent overflow
        z = np.clip(z, -500, 500)
        return 1 / (1 + np.exp(-z))

    def _compute_loss(self, y_true, y_pred):
        """Binary cross-entropy loss."""
        # Add epsilon to prevent log(0)
        epsilon = 1e-15
        y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
        return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

    def fit(self, X, y):
        """Train using gradient descent."""
        n_samples, n_features = X.shape
        # Initialize parameters
        self.weights = np.zeros(n_features)
        self.bias = 0
        # Gradient descent
        for i in range(self.max_iter):
            # Forward pass
            linear_model = np.dot(X, self.weights) + self.bias
            y_pred = self._sigmoid(linear_model)
            # Compute loss
            loss = self._compute_loss(y, y_pred)
            self.loss_history.append(loss)
            # Backward pass (compute gradients)
            dw = (1 / n_samples) * np.dot(X.T, (y_pred - y))
            db = (1 / n_samples) * np.sum(y_pred - y)
            # Update parameters
            self.weights -= self.lr * dw
            self.bias -= self.lr * db
            # Check convergence
            if i > 0 and abs(self.loss_history[-2] - loss) < self.tol:
                break
        return self

    def predict_proba(self, X):
        """Predict class probabilities."""
        linear_model = np.dot(X, self.weights) + self.bias
        return self._sigmoid(linear_model)

    def predict(self, X):
        """Predict class labels (0 or 1)."""
        return (self.predict_proba(X) >= 0.5).astype(int)

    def get_odds_ratios(self):
        """Return odds ratios (exp of coefficients)."""
        return np.exp(self.weights)


# Demonstration
if __name__ == '__main__':
    np.random.seed(42)
    # Generate synthetic binary classification data
    n_samples = 200
    X = np.random.randn(n_samples, 2)
    # True decision boundary: 2*x1 + 3*x2 > 0
    y = (2 * X[:, 0] + 3 * X[:, 1] > 0).astype(int)
    # Split data
    split = int(0.8 * n_samples)
    X_train, X_test = X[:split], X[split:]
    y_train, y_test = y[:split], y[split:]
    # Train model
    model = LogisticRegression(learning_rate=0.5, max_iter=1000)
    model.fit(X_train, y_train)
    # Evaluate
    y_pred = model.predict(X_test)
    accuracy = np.mean(y_pred == y_test)
    print(f'Accuracy: {accuracy:.3f}')
    print(f'Weights: {model.weights}')
    print(f'Bias: {model.bias:.3f}')
    print(f'Odds Ratios: {model.get_odds_ratios()}')
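One way to gain confidence in the backward pass above is a finite-difference gradient check. The sketch below is standalone and uses hypothetical helper names (`log_loss`, `analytic_gradient`, `numeric_gradient`) together with random data; it compares the same closed-form expressions used for `dw` and `db` against numerically estimated gradients.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(w, b, X, y):
    """Mean binary cross-entropy for weights w and bias b."""
    p = np.clip(sigmoid(X @ w + b), 1e-15, 1 - 1e-15)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def analytic_gradient(w, b, X, y):
    """Same expressions as the backward pass: X^T (p - y) / n and mean(p - y)."""
    residual = sigmoid(X @ w + b) - y
    return X.T @ residual / len(y), residual.mean()

def numeric_gradient(w, b, X, y, h=1e-6):
    """Central finite differences, one parameter at a time."""
    dw = np.zeros_like(w)
    for j in range(len(w)):
        step = np.zeros_like(w)
        step[j] = h
        dw[j] = (log_loss(w + step, b, X, y) - log_loss(w - step, b, X, y)) / (2 * h)
    db = (log_loss(w, b + h, X, y) - log_loss(w, b - h, X, y)) / (2 * h)
    return dw, db

# Random data and parameters, used only to exercise the check
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = (rng.random(50) < 0.5).astype(float)
w = rng.normal(size=3)
b = 0.1

dw_a, db_a = analytic_gradient(w, b, X, y)
dw_n, db_n = numeric_gradient(w, b, X, y)
max_diff = max(np.max(np.abs(dw_a - dw_n)), abs(db_a - db_n))
print(f"largest analytic-vs-numeric difference: {max_diff:.2e}")
```

A difference many orders of magnitude smaller than the gradient itself suggests the derivative of the cross-entropy loss has been implemented correctly.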
Using Libraries (scikit-learn, numpy, pandas, matplotlib)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_breast_cancer, make_classification
from sklearn.metrics import (
    accuracy_score, classification_report, confusion_matrix,
    roc_auc_score, roc_curve, precision_recall_curve
)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Load real-world dataset (Breast Cancer Wisconsin)
data = load_breast_cancer()
X, y = data.data, data.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scale features (essential for regularization)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train Logistic Regression
# lbfgs (used here) supports L2; liblinear supports L1 and L2; saga supports all penalties
clf = LogisticRegression(
    penalty='l2',      # L2 regularization
    C=1.0,             # Inverse of regularization strength
    solver='lbfgs',    # Optimization algorithm
    max_iter=1000,
    random_state=42
)
clf.fit(X_train_scaled, y_train)

# Predictions
y_pred = clf.predict(X_test_scaled)
y_prob = clf.predict_proba(X_test_scaled)[:, 1]

# Evaluation
print('=== Classification Performance ===')
print(f'Accuracy: {accuracy_score(y_test, y_pred):.3f}')
print(f'ROC-AUC: {roc_auc_score(y_test, y_prob):.3f}')
print('\nConfusion Matrix:')
print(confusion_matrix(y_test, y_pred))
print('\nClassification Report:')
print(classification_report(y_test, y_pred, target_names=data.target_names))

# Feature importance via coefficients
feature_df = pd.DataFrame({
    'feature': data.feature_names,
    'coefficient': clf.coef_[0],
    'abs_coefficient': np.abs(clf.coef_[0]),
    'odds_ratio': np.exp(clf.coef_[0])
}).sort_values('abs_coefficient', ascending=False)
print('\n=== Top 10 Most Important Features (by |coefficient|) ===')
print(feature_df.head(10)[['feature', 'coefficient', 'odds_ratio']])

# Cross-validation
cv_scores = cross_val_score(clf, X_train_scaled, y_train, cv=5, scoring='roc_auc')
print('\n=== Cross-Validation ROC-AUC ===')
print(f'{cv_scores.mean():.3f} (+/- {cv_scores.std() * 2:.3f})')

# Hyperparameter tuning
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100], 'penalty': ['l1', 'l2']}
# Note: saga solver supports both L1 and L2
grid_search = GridSearchCV(
    LogisticRegression(solver='saga', max_iter=2000, random_state=42),
    param_grid,
    cv=5,
    scoring='roc_auc'
)
grid_search.fit(X_train_scaled, y_train)
print('\n=== Best Parameters ===')
print(grid_search.best_params_)
print(f'Best CV ROC-AUC: {grid_search.best_score_:.3f}')
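In the script above the scaler is fitted once on the full training set before cross-validation, so each validation fold has already contributed to the scaling statistics. A common alternative, sketched here as one possible variant rather than a required change, wraps the scaler and the classifier in a scikit-learn pipeline so that scaling is refit inside every fold. The grid of `C` values is an arbitrary choice for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# The scaler is refit on the training portion of every fold,
# so held-out fold statistics never influence the scaling.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# make_pipeline names each step after its lowercased class name,
# so the regularization strength is addressed as "logisticregression__C".
search = GridSearchCV(
    pipeline,
    param_grid={"logisticregression__C": [0.01, 0.1, 1, 10]},
    cv=5,
    scoring="roc_auc",
)
search.fit(X, y)

print(search.best_params_)
print(f"Best CV ROC-AUC: {search.best_score_:.3f}")
```

On a dataset of this size the difference in scores is usually small, but the pipeline form keeps the cross-validation estimate free of preprocessing leakage.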
When to Use
✅ Appropriate Use Cases:
- Binary classification with need for probability estimates
- Interpretable model requirements (coefficients show feature impact on log-odds)
- Baseline for classification tasks before trying complex models
- When you need odds ratios (healthcare, social sciences)
- Classes that are approximately separable by a linear boundary (with perfectly separable data, use regularization so the coefficients stay finite)
- Well-calibrated probability outputs needed (medical diagnosis, risk scoring)
❌ Avoid When:
- Complex non-linear decision boundaries (use SVM with kernels or tree-based models)
- High-dimensional sparse data with many irrelevant features, unless the model is regularized (try L1 regularization first)
- Regression problems (despite the name, this is for classification)
- Severe class imbalance without proper weighting or resampling
- When features have complex interactions (neural networks or ensembles may work better)
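The limitation on non-linear boundaries and feature interactions can be seen on a small synthetic example. The sketch below builds XOR-like data (an assumption made purely for illustration) where the class depends on the product of two features, then fits scikit-learn's `LogisticRegression` with and without a hand-built interaction column.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
# XOR-like labels: class 1 when both features share the same sign
y = (X[:, 0] * X[:, 1] > 0).astype(int)

# Plain logistic regression: only a straight-line boundary is available
linear_acc = LogisticRegression().fit(X, y).score(X, y)

# Same model with a hand-built interaction column x1 * x2
X_inter = np.column_stack([X, X[:, 0] * X[:, 1]])
inter_acc = LogisticRegression().fit(X_inter, y).score(X_inter, y)

print(f"raw features:    {linear_acc:.3f}")
print(f"with x1*x2 term: {inter_acc:.3f}")
```

On the raw features the model does little better than guessing, while the engineered term makes the problem linear in the expanded feature space. When the right interactions are not known in advance, tree-based models or neural networks may be the more practical choice.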
Common Pitfalls
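- Not scaling features: Regularization penalizes all coefficients equally, so features on very different scales are penalized unevenly
- Complete separation: When the classes can be predicted perfectly, unregularized coefficient estimates diverge toward infinity
- Multicollinearity: Highly correlated features inflate standard errors and make individual coefficients and odds ratios unstable
- Class imbalance: Skewed classes bias predictions toward the majority class
- Ignoring probability calibration: Predicted probabilities may not match observed frequencies, especially under strong regularization or class reweighting
- Linear assumption: The model assumes the log-odds are a linear function of the features
- Convergence warnings: Scale the features, increase max_iter, or check for perfect separation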
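The complete-separation pitfall is easy to reproduce. The following sketch fits an unregularized one-feature model by gradient descent on four hand-picked, perfectly separable points; the helper `fit_weight` and the data are hypothetical and exist only for this demonstration.

```python
import numpy as np

def fit_weight(x, y, n_iter, lr=0.5):
    """Unregularized 1-D logistic regression (no bias) by gradient descent."""
    w = 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-w * x))
        w -= lr * np.mean((p - y) * x)
    return w

# Toy data that a threshold at x = 0 separates perfectly
x = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([0.0, 0.0, 1.0, 1.0])

weights = {n: fit_weight(x, y, n) for n in (100, 1_000, 10_000)}
for n, w in weights.items():
    print(f"{n:>6} iterations -> weight {w:.2f}, odds ratio {np.exp(w):.1f}")
```

The weight never settles: every extra iteration pushes it higher, so the reported odds ratio reflects how long the optimizer ran rather than a property of the data. Adding an L2 penalty gives the objective a finite minimizer.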