Principal Component Analysis: Dimensionality Reduction and Variance Explained

Intermediate Ml
~6 min read Ml

Definition

Principal Component Analysis (PCA) is a fundamental unsupervised learning technique for dimensionality reduction that transforms high-dimensional data into a lower-dimensional representation while preserving as much variance as possible. PCA finds orthogonal directions (principal components) in which the data varies most, projecting observations onto these axes to create uncorrelated features ordered by importance. Mathematically, PCA performs an eigendecomposition of the covariance matrix (or SVD of the centered data matrix), where eigenvectors define principal component directions and eigenvalues quantify the variance explained by each component. The first principal component captures the maximum variance; subsequent components capture remaining variance orthogonally. PCA serves multiple purposes: reducing computational cost, mitigating the curse of dimensionality, removing multicollinearity, denoising data, and visualizing high-dimensional data. It assumes linear relationships and that variance corresponds to signal (not noise). While powerful, PCA is sensitive to feature scaling and outliers, and the resulting components may lack interpretability.

Intuition

💡

Imagine taking a photograph of a 3D statue from different angles. Some angles reveal the statue's form better than others. PCA finds the 'best angles' to view your data - the directions that show the most spread and variation. If you squash a shadow onto a wall, PCA finds the wall orientation that keeps the shadow as spread out as possible. The components are like new coordinate axes rotated to align with your data's natural spread.

Mathematical Formula

Data Centering:
\[ \tilde{X} = X - \bar{X} \]
Covariance Matrix:
\[ \Sigma = \frac{1}{n-1} \tilde{X}^T \tilde{X} \]
Eigendecomposition:
\[ \Sigma v_i = \lambda_i v_i \]
Where:
- $\lambda_i$ = eigenvalue (variance in direction $v_i$)
- $v_i$ = eigenvector (principal component direction)
Variance Explained:
\[ \text{Variance Explained}_i = \frac{\lambda_i}{\sum_{j=1}^{p} \lambda_j} \]
Projection onto k components:
\[ Z = \tilde{X} V_k \]
Where $V_k = [v_1, v_2, ..., v_k]$ contains top k eigenvectors
Reconstruction:
\[ \hat{X} = Z V_k^T + \bar{X} \]
Reconstruction Error:
\[ \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} ||x_i - \hat{x}_i||^2 = \sum_{j=k+1}^{p} \lambda_j \]

Step-by-Step Explanation:

  1. Centering removes the mean - PCA finds directions relative to data center
  2. Covariance matrix captures how features vary together (linear relationships)
  3. Eigenvectors are the principal component directions (orthogonal axes)
  4. Eigenvalues measure variance along each principal component direction
  5. Variance explained shows what fraction of total variance each PC captures
  6. Projection transforms data to new coordinate system (component scores)
  7. Reconstruction error equals sum of discarded eigenvalues

Real-World Use Cases

Image Processing

Face recognition (Eigenfaces): Reduce pixel dimensions while preserving facial features. Compression and denoising.

Finance

Portfolio optimization: Identify latent factors driving asset returns (market, size, value factors).

Genomics

Gene expression analysis: Reduce thousands of genes to principal components representing biological processes.

Visualization

Exploring high-dimensional data by projecting to 2D/3D for plotting while preserving structure.

Implementation

Manual Implementation (No Libraries)

This implementation demonstrates PCA's core mechanics: centering, covariance computation, eigendecomposition, and projection. Components are sorted by eigenvalue magnitude. The transform projects data to the new basis; inverse_transform reconstructs it. Reconstruction error quantifies information loss from dimensionality reduction.
import numpy as np

class PCA:
    """
    Manual implementation of Principal Component Analysis.
    Uses eigendecomposition of covariance matrix.
    """
    
    def __init__(self, n_components=None):
        self.n_components = n_components
        self.components_ = None
        self.explained_variance_ = None
        self.explained_variance_ratio_ = None
        self.mean_ = None
        self.singular_values_ = None
    
    def fit(self, X):
        """Fit PCA to data X."""
        # Center the data
        self.mean_ = np.mean(X, axis=0)
        X_centered = X - self.mean_
        
        # Compute covariance matrix
        # Using n-1 for sample covariance (unbiased)
        n_samples = X.shape[0]
        cov_matrix = (X_centered.T @ X_centered) / (n_samples - 1)
        
        # Eigendecomposition
        eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)
        
        # Sort in descending order
        idx = np.argsort(eigenvalues)[::-1]
        eigenvalues = eigenvalues[idx]
        eigenvectors = eigenvectors[:, idx]
        
        # Store components (eigenvectors)
        self.components_ = eigenvectors.T
        
        # Explained variance
        self.explained_variance_ = eigenvalues
        total_var = np.sum(eigenvalues)
        self.explained_variance_ratio_ = eigenvalues / total_var
        
        # Singular values (for compatibility)
        self.singular_values_ = np.sqrt(eigenvalues * (n_samples - 1))
        
        return self
    
    def transform(self, X):
        """Transform data to principal component space."""
        X_centered = X - self.mean_
        
        # Select top n_components
        if self.n_components is not None:
            components = self.components_[:self.n_components]
        else:
            components = self.components_
        
        # Project: Z = X_centered @ V_k
        return X_centered @ components.T
    
    def fit_transform(self, X):
        """Fit and transform in one step."""
        self.fit(X)
        return self.transform(X)
    
    def inverse_transform(self, X_transformed):
        """Reconstruct original data from transformed data."""
        if self.n_components is not None:
            components = self.components_[:self.n_components]
        else:
            components = self.components_
        
        # Reconstruct: X_hat = Z @ V_k^T + mean
        return X_transformed @ components + self.mean_
    
    def get_explained_variance_cumulative(self):
        """Get cumulative explained variance ratio."""
        return np.cumsum(self.explained_variance_ratio_)

# Demonstration
if __name__ == '__main__':
    from sklearn.datasets import make_classification
    from sklearn.preprocessing import StandardScaler
    
    # Generate high-dimensional data
    X, y = make_classification(
        n_samples=300, n_features=20, n_informative=5,
        n_redundant=5, random_state=42
    )
    
    # Standardize (important for PCA)
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    
    # Apply PCA
    pca = PCA(n_components=10)
    X_pca = pca.fit_transform(X_scaled)
    
    print(f'Original shape: {X_scaled.shape}')
    print(f'Transformed shape: {X_pca.shape}')
    print(f'
Explained variance by component:')
    for i, var in enumerate(pca.explained_variance_ratio_[:5]):
        print(f'  PC{i+1}: {var:.3f}')
    print(f'
Cumulative variance (first 5): {pca.get_explained_variance_cumulative()[4]:.3f}')
    
    # Reconstruction
    X_reconstructed = pca.inverse_transform(X_pca)
    reconstruction_error = np.mean((X_scaled - X_reconstructed) ** 2)
    print(f'
Reconstruction MSE: {reconstruction_error:.4f}')

Using Libraries (scikit-learn, numpy, matplotlib)

from sklearn.decomposition import PCA, KernelPCA, IncrementalPCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_digits, fetch_olivetti_faces, make_classification
from sklearn.model_selection import train_test_split
import numpy as np
import matplotlib.pyplot as plt

# Load high-dimensional dataset
print('=== PCA ON DIGITS DATASET ===')
digits = load_digits()
X, y = digits.data, digits.target

print(f'Original shape: {X.shape}')
print(f'Features (pixels): {X.shape[1]}')

# Standardize
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Determine optimal number of components
pca_full = PCA()
pca_full.fit(X_scaled)

cumulative_variance = np.cumsum(pca_full.explained_variance_ratio_)
n_components_95 = np.argmax(cumulative_variance >= 0.95) + 1
n_components_99 = np.argmax(cumulative_variance >= 0.99) + 1

print(f'
Components for 95% variance: {n_components_95}')
print(f'Components for 99% variance: {n_components_99}')
print(f'Reduction: {X.shape[1]} → {n_components_95} ({n_components_95/X.shape[1]*100:.1f}%)')

# PCA with chosen components
pca = PCA(n_components=n_components_95)
X_pca = pca.fit_transform(X_scaled)
print(f'
Transformed shape: {X_pca.shape}')
print(f'Explained variance: {pca.explained_variance_ratio_.sum():.3f}')

# Variance explained by first few components
print(f'
First 5 components explain: {pca.explained_variance_ratio_[:5].sum():.3f}')
for i in range(5):
    print(f'  PC{i+1}: {pca.explained_variance_ratio_[i]:.3f}')

# Reconstruction
X_reconstructed = pca.inverse_transform(X_pca)
reconstruction_error = np.mean((X_scaled - X_reconstructed) ** 2)
print(f'
Reconstruction MSE: {reconstruction_error:.4f}')

# FACES EXAMPLE (Eigenfaces)
print('
=== EIGENFACES EXAMPLE ===')

# Generate synthetic faces data (or use real if available)
faces = fetch_olivetti_faces()
X_faces = faces.data

print(f'Faces shape: {X_faces.shape}')
print(f'Image size: {faces.images.shape[1:]}')

# Standardize
X_faces_centered = X_faces - X_faces.mean(axis=0)

# PCA on faces
pca_faces = PCA(n_components=150)
pca_faces.fit(X_faces_centered)

print(f'
150 components explain: {pca_faces.explained_variance_ratio_.sum():.3f} of variance')

# Visualize mean face and first components
print(f'
Mean face shape: {pca_faces.mean_.shape}')
print(f'Component shape: {pca_faces.components_.shape}')

# INCREMENTAL PCA (for large datasets)
print('
=== INCREMENTAL PCA ===')

# Simulate streaming data
ipca = IncrementalPCA(n_components=20)

# Process in batches
batch_size = 100
for i in range(0, len(X_scaled), batch_size):
    batch = X_scaled[i:min(i+batch_size, len(X_scaled))]
    ipca.partial_fit(batch)

print(f'Incremental PCA components: {ipca.n_components_}')
print(f'Explained variance: {ipca.explained_variance_ratio_.sum():.3f}')

# KERNEL PCA (for non-linear relationships)
print('
=== KERNEL PCA ===')

from sklearn.datasets import make_circles

# Generate non-linear data
X_circles, y_circles = make_circles(n_samples=400, factor=0.3, noise=0.05)

kpca = KernelPCA(n_components=2, kernel='rbf', \(\gamma\)=10)
X_kpca = kpca.fit_transform(X_circles)

print(f'Original shape: {X_circles.shape}')
print(f'Kernel PCA shape: {X_kpca.shape}')
print(f'Eigenvalues: {kpca.eigenvalues_[:5]}')

# PCA FOR VISUALIZATION
print('
=== PCA FOR VISUALIZATION ===')

# Reduce to 2D for plotting
pca_2d = PCA(n_components=2)
X_2d = pca_2d.fit_transform(X_scaled)

print(f'2D projection explains: {pca_2d.explained_variance_ratio_.sum():.3f} of variance')
print(f'PC1 explains: {pca_2d.explained_variance_ratio_[0]:.3f}')
print(f'PC2 explains: {pca_2d.explained_variance_ratio_[1]:.3f}')

# PCA PIPELINE WITH CLASSIFICATION
print('
=== PCA + CLASSIFICATION ===')
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Pipeline: Scale → PCA → Classify
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=0.95)),  # Keep 95% variance
    ('clf', RandomForestClassifier(n_estimators=100, random_state=42))
])

pipeline.fit(X_train, y_train)
print(f'Test accuracy: {pipeline.score(X_test, y_test):.3f}')
print(f'PCA reduced to: {pipeline.named_steps["pca"].n_components_} components')

When to Use

✅ Appropriate Use Cases:

  • Dimensionality reduction before machine learning (reduce overfitting)
  • Visualization of high-dimensional data in 2D/3D
  • Removing multicollinearity among features
  • Denoising data by keeping only high-variance components
  • Feature extraction for downstream tasks
  • Compression for storage or transmission
  • When features have high redundancy

❌ Avoid When:

  • When interpretability of individual features is required
  • Non-linear relationships (use Kernel PCA or t-SNE instead)
  • When variance doesn't correspond to signal (e.g., noisy features vary most)
  • Classification tasks where class separation is not aligned with variance
  • Sparse data (consider TruncatedSVD instead)
  • When you need to preserve pairwise distances exactly

Common Pitfalls

  • Not standardizing features: Features with larger scales dominate components
  • Arbitrary component selection: Use explained variance threshold or elbow method
  • Interpreting components: PCs are linear combinations - hard to interpret
  • Assuming PCs are meaningful: First PC might capture noise
  • Forgetting PCA is unsupervised: May not align with classification labels
  • Outliers: PCA is sensitive to outliers - robust scaling recommended
  • Using PCA on test data separately: Fit on train only, transform both