Principal Component Analysis: Dimensionality Reduction and Variance Explained
Definition
Principal Component Analysis (PCA) is a fundamental unsupervised learning technique for dimensionality reduction that transforms high-dimensional data into a lower-dimensional representation while preserving as much variance as possible. PCA finds orthogonal directions (principal components) in which the data varies most, projecting observations onto these axes to create uncorrelated features ordered by importance. Mathematically, PCA performs an eigendecomposition of the covariance matrix (or SVD of the centered data matrix), where eigenvectors define principal component directions and eigenvalues quantify the variance explained by each component. The first principal component captures the maximum variance; subsequent components capture remaining variance orthogonally. PCA serves multiple purposes: reducing computational cost, mitigating the curse of dimensionality, removing multicollinearity, denoising data, and visualizing high-dimensional data. It assumes linear relationships and that variance corresponds to signal (not noise). While powerful, PCA is sensitive to feature scaling and outliers, and the resulting components may lack interpretability.
Intuition
Imagine taking a photograph of a 3D statue from different angles. Some angles reveal the statue's form better than others. PCA finds the 'best angles' to view your data - the directions that show the most spread and variation. If you squash a shadow onto a wall, PCA finds the wall orientation that keeps the shadow as spread out as possible. The components are like new coordinate axes rotated to align with your data's natural spread.
Mathematical Formula
Step-by-Step Explanation:
- Centering removes the mean - PCA finds directions relative to data center
- Covariance matrix captures how features vary together (linear relationships)
- Eigenvectors are the principal component directions (orthogonal axes)
- Eigenvalues measure variance along each principal component direction
- Variance explained shows what fraction of total variance each PC captures
- Projection transforms data to new coordinate system (component scores)
- Reconstruction error equals sum of discarded eigenvalues
Real-World Use Cases
Face recognition (Eigenfaces): Reduce pixel dimensions while preserving facial features. Compression and denoising.
Portfolio optimization: Identify latent factors driving asset returns (market, size, value factors).
Gene expression analysis: Reduce thousands of genes to principal components representing biological processes.
Exploring high-dimensional data by projecting to 2D/3D for plotting while preserving structure.
Implementation
Manual Implementation (No Libraries)
import numpy as np
class PCA:
"""
Manual implementation of Principal Component Analysis.
Uses eigendecomposition of covariance matrix.
"""
def __init__(self, n_components=None):
self.n_components = n_components
self.components_ = None
self.explained_variance_ = None
self.explained_variance_ratio_ = None
self.mean_ = None
self.singular_values_ = None
def fit(self, X):
"""Fit PCA to data X."""
# Center the data
self.mean_ = np.mean(X, axis=0)
X_centered = X - self.mean_
# Compute covariance matrix
# Using n-1 for sample covariance (unbiased)
n_samples = X.shape[0]
cov_matrix = (X_centered.T @ X_centered) / (n_samples - 1)
# Eigendecomposition
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)
# Sort in descending order
idx = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[idx]
eigenvectors = eigenvectors[:, idx]
# Store components (eigenvectors)
self.components_ = eigenvectors.T
# Explained variance
self.explained_variance_ = eigenvalues
total_var = np.sum(eigenvalues)
self.explained_variance_ratio_ = eigenvalues / total_var
# Singular values (for compatibility)
self.singular_values_ = np.sqrt(eigenvalues * (n_samples - 1))
return self
def transform(self, X):
"""Transform data to principal component space."""
X_centered = X - self.mean_
# Select top n_components
if self.n_components is not None:
components = self.components_[:self.n_components]
else:
components = self.components_
# Project: Z = X_centered @ V_k
return X_centered @ components.T
def fit_transform(self, X):
"""Fit and transform in one step."""
self.fit(X)
return self.transform(X)
def inverse_transform(self, X_transformed):
"""Reconstruct original data from transformed data."""
if self.n_components is not None:
components = self.components_[:self.n_components]
else:
components = self.components_
# Reconstruct: X_hat = Z @ V_k^T + mean
return X_transformed @ components + self.mean_
def get_explained_variance_cumulative(self):
"""Get cumulative explained variance ratio."""
return np.cumsum(self.explained_variance_ratio_)
# Demonstration
if __name__ == '__main__':
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
# Generate high-dimensional data
X, y = make_classification(
n_samples=300, n_features=20, n_informative=5,
n_redundant=5, random_state=42
)
# Standardize (important for PCA)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Apply PCA
pca = PCA(n_components=10)
X_pca = pca.fit_transform(X_scaled)
print(f'Original shape: {X_scaled.shape}')
print(f'Transformed shape: {X_pca.shape}')
print(f'
Explained variance by component:')
for i, var in enumerate(pca.explained_variance_ratio_[:5]):
print(f' PC{i+1}: {var:.3f}')
print(f'
Cumulative variance (first 5): {pca.get_explained_variance_cumulative()[4]:.3f}')
# Reconstruction
X_reconstructed = pca.inverse_transform(X_pca)
reconstruction_error = np.mean((X_scaled - X_reconstructed) ** 2)
print(f'
Reconstruction MSE: {reconstruction_error:.4f}')
Using Libraries (scikit-learn, numpy, matplotlib)
from sklearn.decomposition import PCA, KernelPCA, IncrementalPCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_digits, fetch_olivetti_faces, make_classification
from sklearn.model_selection import train_test_split
import numpy as np
import matplotlib.pyplot as plt
# Load high-dimensional dataset
print('=== PCA ON DIGITS DATASET ===')
digits = load_digits()
X, y = digits.data, digits.target
print(f'Original shape: {X.shape}')
print(f'Features (pixels): {X.shape[1]}')
# Standardize
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Determine optimal number of components
pca_full = PCA()
pca_full.fit(X_scaled)
cumulative_variance = np.cumsum(pca_full.explained_variance_ratio_)
n_components_95 = np.argmax(cumulative_variance >= 0.95) + 1
n_components_99 = np.argmax(cumulative_variance >= 0.99) + 1
print(f'
Components for 95% variance: {n_components_95}')
print(f'Components for 99% variance: {n_components_99}')
print(f'Reduction: {X.shape[1]} → {n_components_95} ({n_components_95/X.shape[1]*100:.1f}%)')
# PCA with chosen components
pca = PCA(n_components=n_components_95)
X_pca = pca.fit_transform(X_scaled)
print(f'
Transformed shape: {X_pca.shape}')
print(f'Explained variance: {pca.explained_variance_ratio_.sum():.3f}')
# Variance explained by first few components
print(f'
First 5 components explain: {pca.explained_variance_ratio_[:5].sum():.3f}')
for i in range(5):
print(f' PC{i+1}: {pca.explained_variance_ratio_[i]:.3f}')
# Reconstruction
X_reconstructed = pca.inverse_transform(X_pca)
reconstruction_error = np.mean((X_scaled - X_reconstructed) ** 2)
print(f'
Reconstruction MSE: {reconstruction_error:.4f}')
# FACES EXAMPLE (Eigenfaces)
print('
=== EIGENFACES EXAMPLE ===')
# Generate synthetic faces data (or use real if available)
faces = fetch_olivetti_faces()
X_faces = faces.data
print(f'Faces shape: {X_faces.shape}')
print(f'Image size: {faces.images.shape[1:]}')
# Standardize
X_faces_centered = X_faces - X_faces.mean(axis=0)
# PCA on faces
pca_faces = PCA(n_components=150)
pca_faces.fit(X_faces_centered)
print(f'
150 components explain: {pca_faces.explained_variance_ratio_.sum():.3f} of variance')
# Visualize mean face and first components
print(f'
Mean face shape: {pca_faces.mean_.shape}')
print(f'Component shape: {pca_faces.components_.shape}')
# INCREMENTAL PCA (for large datasets)
print('
=== INCREMENTAL PCA ===')
# Simulate streaming data
ipca = IncrementalPCA(n_components=20)
# Process in batches
batch_size = 100
for i in range(0, len(X_scaled), batch_size):
batch = X_scaled[i:min(i+batch_size, len(X_scaled))]
ipca.partial_fit(batch)
print(f'Incremental PCA components: {ipca.n_components_}')
print(f'Explained variance: {ipca.explained_variance_ratio_.sum():.3f}')
# KERNEL PCA (for non-linear relationships)
print('
=== KERNEL PCA ===')
from sklearn.datasets import make_circles
# Generate non-linear data
X_circles, y_circles = make_circles(n_samples=400, factor=0.3, noise=0.05)
kpca = KernelPCA(n_components=2, kernel='rbf', \(\gamma\)=10)
X_kpca = kpca.fit_transform(X_circles)
print(f'Original shape: {X_circles.shape}')
print(f'Kernel PCA shape: {X_kpca.shape}')
print(f'Eigenvalues: {kpca.eigenvalues_[:5]}')
# PCA FOR VISUALIZATION
print('
=== PCA FOR VISUALIZATION ===')
# Reduce to 2D for plotting
pca_2d = PCA(n_components=2)
X_2d = pca_2d.fit_transform(X_scaled)
print(f'2D projection explains: {pca_2d.explained_variance_ratio_.sum():.3f} of variance')
print(f'PC1 explains: {pca_2d.explained_variance_ratio_[0]:.3f}')
print(f'PC2 explains: {pca_2d.explained_variance_ratio_[1]:.3f}')
# PCA PIPELINE WITH CLASSIFICATION
print('
=== PCA + CLASSIFICATION ===')
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
# Split data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Pipeline: Scale → PCA → Classify
pipeline = Pipeline([
('scaler', StandardScaler()),
('pca', PCA(n_components=0.95)), # Keep 95% variance
('clf', RandomForestClassifier(n_estimators=100, random_state=42))
])
pipeline.fit(X_train, y_train)
print(f'Test accuracy: {pipeline.score(X_test, y_test):.3f}')
print(f'PCA reduced to: {pipeline.named_steps["pca"].n_components_} components')
When to Use
✅ Appropriate Use Cases:
- Dimensionality reduction before machine learning (reduce overfitting)
- Visualization of high-dimensional data in 2D/3D
- Removing multicollinearity among features
- Denoising data by keeping only high-variance components
- Feature extraction for downstream tasks
- Compression for storage or transmission
- When features have high redundancy
❌ Avoid When:
- When interpretability of individual features is required
- Non-linear relationships (use Kernel PCA or t-SNE instead)
- When variance doesn't correspond to signal (e.g., noisy features vary most)
- Classification tasks where class separation is not aligned with variance
- Sparse data (consider TruncatedSVD instead)
- When you need to preserve pairwise distances exactly
Common Pitfalls
- Not standardizing features: Features with larger scales dominate components
- Arbitrary component selection: Use explained variance threshold or elbow method
- Interpreting components: PCs are linear combinations - hard to interpret
- Assuming PCs are meaningful: First PC might capture noise
- Forgetting PCA is unsupervised: May not align with classification labels
- Outliers: PCA is sensitive to outliers - robust scaling recommended
- Using PCA on test data separately: Fit on train only, transform both