K-Nearest Neighbors: Distance Metrics, Choosing K, and Curse of Dimensionality
Definition
K-Nearest Neighbors (KNN) is a non-parametric, instance-based learning algorithm that classifies or predicts based on the k closest training examples in feature space. Unlike model-based approaches that learn explicit parameters, KNN stores the entire training dataset and defers computation until prediction time - a 'lazy learning' approach. The algorithm's behavior is governed entirely by three choices: the distance metric (typically Euclidean, Manhattan, or Minkowski), the number of neighbors k, and the weighting scheme (uniform or distance-weighted). KNN's simplicity belies its power: it can approximate arbitrarily complex decision boundaries given sufficient data, requires no training phase, and naturally supports multi-class problems. However, it suffers from the curse of dimensionality - as feature space grows, distance becomes less meaningful, and computational costs explode. Despite these limitations, KNN remains valuable as a baseline, for recommendation systems, and in domains where decision boundaries are highly irregular.
Intuition
Imagine you're new to a neighborhood and want to know if it's safe. You ask your 5 closest neighbors about their experiences and go with the majority opinion. KNN works the same way: to classify a new point, find its k nearest neighbors in the training data and let them vote. The key is defining 'near' through distance metrics - just like deciding whether 'close' means walking distance, driving distance, or time to commute.
Mathematical Formula
Step-by-Step Explanation:
- Euclidean distance is the straight-line distance in n-dimensional space (p=2)
- Manhattan distance sums absolute differences - useful for grid-like spaces
- Minkowski with p=1 is Manhattan, p=2 is Euclidean, p→∞ is Chebyshev
- Cosine similarity measures angle between vectors, ignoring magnitude
- Weighted voting gives closer neighbors more influence via inverse distance
- The curse of dimensionality: as n increases, all distances become similar
Real-World Use Cases
Collaborative filtering: recommend movies based on users with similar rating patterns (user-user KNN) or similar movies rated by the user (item-item KNN).
Handwritten digit classification: simple baseline using pixel-wise distance; works surprisingly well for clean images.
Disease prediction based on similarity to historical patient cases with known outcomes.
Fraud detection: transactions far from normal behavior clusters flagged as suspicious.
Implementation
Manual Implementation (No Libraries)
import numpy as np
from collections import Counter
from scipy import stats
class KNN:
"""
Manual implementation of K-Nearest Neighbors classifier.
Supports multiple distance metrics and weighted voting.
"""
def __init__(self, k=5, distance_metric='euclidean', weights='uniform'):
self.k = k
self.distance_metric = distance_metric
self.weights = weights
self.X_train = None
self.y_train = None
def _compute_distance(self, x1, x2):
"""Compute distance between two vectors."""
if self.distance_metric == 'euclidean':
return np.sqrt(np.sum((x1 - x2) ** 2))
elif self.distance_metric == 'manhattan':
return np.sum(np.abs(x1 - x2))
elif self.distance_metric == 'cosine':
# Cosine distance = 1 - cosine similarity
dot_product = np.dot(x1, x2)
norm_x1 = np.linalg.norm(x1)
norm_x2 = np.linalg.norm(x2)
if norm_x1 == 0 or norm_x2 == 0:
return 1.0
return 1 - (dot_product / (norm_x1 * norm_x2))
elif self.distance_metric == 'minkowski':
# Default p=3 for Minkowski
return np.power(np.sum(np.abs(x1 - x2) ** 3), 1/3)
else:
raise ValueError(f'Unknown distance metric: {self.distance_metric}')
def fit(self, X, y):
"""Store training data (lazy learning)."""
self.X_train = X
self.y_train = y
return self
def _get_neighbors(self, x):
"""Find k nearest neighbors for a single sample."""
distances = []
for i, x_train in enumerate(self.X_train):
dist = self._compute_distance(x, x_train)
distances.append((dist, self.y_train[i], i))
# Sort by distance and return k nearest
distances.sort(key=lambda x: x[0])
return distances[:self.k]
def predict_single(self, x):
"""Predict class for a single sample."""
neighbors = self._get_neighbors(x)
if self.weights == 'uniform':
# Simple majority vote
labels = [label for _, label, _ in neighbors]
return stats.mode(labels, keepdims=True).mode[0]
else:
# Distance-weighted voting
votes = Counter()
for dist, label, _ in neighbors:
if dist == 0:
# Exact match - return immediately
return label
weight = 1 / dist
votes[label] += weight
return votes.most_common(1)[0][0]
def predict(self, X):
"""Predict classes for multiple samples."""
return np.array([self.predict_single(x) for x in X])
def predict_proba(self, X):
"""Predict class probabilities."""
probabilities = []
classes = np.unique(self.y_train)
for x in X:
neighbors = self._get_neighbors(x)
class_votes = Counter()
if self.weights == 'uniform':
for _, label, _ in neighbors:
class_votes[label] += 1
else:
for dist, label, _ in neighbors:
weight = 1 / dist if dist > 0 else float('inf')
class_votes[label] += weight
# Normalize to probabilities
total = sum(class_votes.values())
probas = [class_votes.get(c, 0) / total for c in classes]
probabilities.append(probas)
return np.array(probabilities)
# Demonstration
if __name__ == '__main__':
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Generate data
X, y = make_classification(
n_samples=300, n_features=5, n_redundant=0,
n_informative=5, n_clusters_per_class=1, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Scale features (important for KNN)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Test different configurations
for metric in ['euclidean', 'manhattan']:
for weights in ['uniform', 'distance']:
knn = KNN(k=5, distance_metric=metric, weights=weights)
knn.fit(X_train_scaled, y_train)
acc = np.mean(knn.predict(X_test_scaled) == y_test)
print(f'{metric.title()}, {weights}: {acc:.3f}')
Using Libraries (scikit-learn, numpy, matplotlib)
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor, NearestNeighbors
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris, load_breast_cancer, make_classification
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import numpy as np
import matplotlib.pyplot as plt
# Load datasets
print('=== KNN CLASSIFICATION ===')
iris = load_iris()
X_iris, y_iris = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(
X_iris, y_iris, test_size=0.2, random_state=42, stratify=y_iris
)
# ALWAYS scale for KNN
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Basic KNN
knn = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
knn.fit(X_train_scaled, y_train)
print(f'K=5 Accuracy: {knn.score(X_test_scaled, y_test):.3f}')
# CHOOSING K: Cross-validation
print('
=== CHOOSING OPTIMAL K ===')
k_range = range(1, 31)
cv_scores = []
for k in k_range:
knn = KNeighborsClassifier(n_neighbors=k)
scores = cross_val_score(knn, X_train_scaled, y_train, cv=5, scoring='accuracy')
cv_scores.append(scores.mean())
optimal_k = k_range[np.argmax(cv_scores)]
print(f'Optimal K from CV: {optimal_k}')
print(f'Best CV Accuracy: {max(cv_scores):.3f}')
# Different distance metrics
print('
=== DISTANCE METRICS COMPARISON ===')
metrics = ['euclidean', 'manhattan', 'chebyshev', 'minkowski']
for metric in metrics:
if metric == 'minkowski':
knn = KNeighborsClassifier(n_neighbors=5, metric=metric, p=3)
else:
knn = KNeighborsClassifier(n_neighbors=5, metric=metric)
knn.fit(X_train_scaled, y_train)
acc = knn.score(X_test_scaled, y_test)
print(f'{metric.capitalize()}: {acc:.3f}')
# Weighted vs Uniform
print('
=== WEIGHTED VOTING ===')
for weights in ['uniform', 'distance']:
knn = KNeighborsClassifier(n_neighbors=5, weights=weights)
knn.fit(X_train_scaled, y_train)
acc = knn.score(X_test_scaled, y_test)
print(f'{weights.capitalize()}: {acc:.3f}')
# Grid Search for all hyperparameters
print('
=== GRID SEARCH ===')
param_grid = {
'n_neighbors': range(1, 30),
'weights': ['uniform', 'distance'],
'metric': ['euclidean', 'manhattan', 'minkowski'],
'p': [1, 2, 3] # For Minkowski
}
grid_search = GridSearchCV(
KNeighborsClassifier(),
param_grid,
cv=5,
scoring='accuracy',
n_jobs=-1
)
grid_search.fit(X_train_scaled, y_train)
print(f'Best parameters: {grid_search.best_params_}')
print(f'Best CV Accuracy: {grid_search.best_score_:.3f}')
print(f'Test Accuracy: {grid_search.score(X_test_scaled, y_test):.3f}')
# Finding nearest neighbors (for recommendation systems)
print('
=== NEAREST NEIGHBORS (Recommendation) ===')
neigh = NearestNeighbors(n_neighbors=3, metric='cosine')
neigh.fit(X_train_scaled)
# Find 3 most similar samples to first test sample
distances, indices = neigh.kneighbors([X_test_scaled[0]])
print(f'Sample 0 most similar to training indices: {indices[0]}')
print(f'Distances: {distances[0]}')
When to Use
✅ Appropriate Use Cases:
- Baseline classifier for comparison with complex models
- Multi-class problems without modification
- Non-linear decision boundaries without model assumptions
- Recommendation systems (collaborative filtering)
- Small to medium datasets (computation scales with n)
- When interpretability of individual predictions matters
- Anomaly detection using distance thresholds
❌ Avoid When:
- Large datasets: O(nd) prediction time per sample
- High-dimensional data (>20 features): curse of dimensionality
- When you need fast predictions at scale (no training but slow inference)
- Imbalanced datasets: majority class dominates neighbor votes
- When memory is constrained: stores entire training set
- When feature scales vary greatly: requires careful preprocessing
Common Pitfalls
- Not scaling features: Features with larger ranges dominate distance
- Wrong choice of k: Too small = noise sensitive, too large = over-smoothing
- Not handling ties: Use odd k for binary classification
- Ignoring dimensionality: KNN degrades in high-dimensional spaces
- Using default Euclidean without considering data nature
- Not preprocessing categorical variables: Distance undefined for categories
- Storing redundant data: Consider approximate methods (LSH, KD-trees) for large n