Outlier Handling: Detection Methods and Treatment Strategies

Definition

Outliers are data points that deviate significantly from other observations in a dataset. They can arise from measurement errors, data entry mistakes, natural variation, or novel phenomena. Outlier handling is critical because extreme values can distort statistical analyses, bias model training, reduce predictive accuracy, and lead to incorrect conclusions. The process involves two main stages: detection (identifying outliers using statistical or machine learning methods) and treatment (deciding whether to remove, transform, or cap extreme values). Detection methods range from simple statistical approaches like Z-score and Interquartile Range (IQR) to advanced techniques like Isolation Forest, Local Outlier Factor (LOF), and DBSCAN clustering. The choice of detection method depends on data dimensionality, distribution, and whether outliers are global (unusual across all data) or contextual (unusual within a specific context). Treatment strategies must balance the need to reduce outlier impact against the risk of losing valuable information—some outliers represent important rare events rather than errors.

Intuition

Think of outliers like unusual events in a city. A temperature of 100°F in Alaska is a global outlier: it is far from almost every other reading, and the first suspects are a faulty thermometer or a data entry error. A temperature of 60°F in summer is normal for the city overall but may be unusual for one particular neighborhood near the coast (contextual outlier). A celebrity visiting a small town is also an outlier, not an error but a rare event that could be interesting. The IQR method is like saying 'if you're outside the typical range of normal days, you're an outlier.' Z-score is like 'if you're more than 3 standard deviations from average, you're unusual.' Isolation Forest is like asking 'how many questions would I need to isolate this point from the rest?' Weird points are easier to isolate. Deciding what to do depends on whether the outlier is a mistake to fix, noise to remove, or a signal to investigate.
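The "how many questions" idea can be illustrated with a small simulation. This is a minimal sketch of the intuition only, not the full Isolation Forest algorithm: it works on one feature, uses no subsampling, and the temperature values and helper names (`isolation_depth`, `average_depth`) are made up for illustration.

```python
import random

def isolation_depth(values, target, rng):
    """Count random splits until `target` is alone in its partition."""
    current = list(values)
    depth = 0
    while len(current) > 1:
        lo, hi = min(current), max(current)
        if lo == hi:
            break  # remaining values are identical and cannot be separated
        split = rng.uniform(lo, hi)
        # Keep only the side of the split that contains the target
        if target < split:
            current = [v for v in current if v < split]
        else:
            current = [v for v in current if v >= split]
        depth += 1
    return depth

def average_depth(values, target, trials=500, seed=0):
    """Average isolation depth over many random split sequences."""
    rng = random.Random(seed)
    total = sum(isolation_depth(values, target, rng) for _ in range(trials))
    return total / trials

# Hypothetical daily temperatures with one extreme reading
temps = [68, 70, 71, 72, 73, 74, 75, 76, 78, 120]
typical_depth = average_depth(temps, 73)
extreme_depth = average_depth(temps, 120)

print(f"Average splits to isolate 73:  {typical_depth:.2f}")
print(f"Average splits to isolate 120: {extreme_depth:.2f}")
```

The extreme reading is usually separated by the first or second split, while a value in the middle of the cluster needs several more.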

Mathematical Formula

\[ \text{Z-Score:} \quad z = \frac{x - \mu}{\sigma}, \quad |z| > 3 \Rightarrow \text{outlier} \]
\[ \text{IQR Method:} \quad \text{IQR} = Q_3 - Q_1 \]
\[ \text{Lower Bound} = Q_1 - 1.5 \times \text{IQR} \]
\[ \text{Upper Bound} = Q_3 + 1.5 \times \text{IQR} \]
\[ \text{Modified Z-Score:} \quad M_i = \frac{0.6745(x_i - \tilde{x})}{\text{MAD}}, \quad \text{MAD} = \text{median}(|x_i - \tilde{x}|) \]
\[ \text{Isolation Forest Score:} \quad s(x, n) = 2^{-\frac{E(h(x))}{c(n)}} \]
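The Isolation Forest formula uses two quantities that are not defined here: E(h(x)), the average path length of x across the trees, and c(n), a normalizing constant. The sketch below assumes the definition of c(n) from the original Isolation Forest paper (the average path length of an unsuccessful binary search tree lookup over n points); the function names are illustrative.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant

def average_path_length(n):
    """c(n) = 2*H(n-1) - 2*(n-1)/n, with H(i) approximated by ln(i) + gamma."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(expected_depth, n):
    """s(x, n) = 2^(-E(h(x)) / c(n)); values near 1 suggest an outlier."""
    return 2.0 ** (-expected_depth / average_path_length(n))

n = 256                                # points used to build each tree
c_n = average_path_length(n)
shallow = anomaly_score(3.0, n)        # isolated quickly
average = anomaly_score(c_n, n)        # isolated at the typical depth
deep = anomaly_score(15.0, n)          # hard to isolate

print(f"c({n}) = {c_n:.3f}")
print(f"score at depth 3:    {shallow:.3f}")
print(f"score at depth c(n): {average:.3f}")
print(f"score at depth 15:   {deep:.3f}")
```

A point whose average depth equals c(n) scores exactly 0.5. Shorter paths push the score toward 1, and longer paths push it toward 0.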

Step-by-Step Explanation:

  1. Z-Score: Standardized distance from the mean; points with |z| > 2 or |z| > 3 are flagged as outliers (for normally distributed data, about 95% and 99.7% of values fall within those limits)
  2. IQR Method: Uses Q1 (25th percentile) and Q3 (75th percentile); values outside [Q1-1.5×IQR, Q3+1.5×IQR] are outliers
  3. Modified Z-Score: Uses median and MAD instead of mean/std; more robust to existing outliers
  4. Isolation Forest: Average path length in random trees; shorter paths indicate outliers
  5. LOF: Local density deviation; points with substantially lower density than neighbors are outliers
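To see why step 3 calls the modified Z-score "more robust to existing outliers", consider a small sample in which the outlier itself inflates the mean and standard deviation. The following is a minimal sketch using only the standard library; the readings are made up, and it assumes the MAD is nonzero.

```python
import statistics

def z_scores(values):
    """Classic z-scores using the sample standard deviation (ddof=1)."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return [(v - mean) / std for v in values]

def modified_z_scores(values):
    """Modified z-scores: 0.6745 * (x - median) / MAD (assumes MAD > 0)."""
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    return [0.6745 * (v - median) / mad for v in values]

# Seven ordinary readings and one obvious outlier (hypothetical data)
readings = [10, 11, 12, 11, 10, 12, 11, 100]

z = z_scores(readings)
mz = modified_z_scores(readings)

z_flagged = [v for v, s in zip(readings, z) if abs(s) > 3]
mz_flagged = [v for v, s in zip(readings, mz) if abs(s) > 3.5]

print(f"z-score of 100:          {z[-1]:.2f}")
print(f"modified z-score of 100: {mz[-1]:.2f}")
print(f"flagged by |z| > 3:      {z_flagged}")
print(f"flagged by |M| > 3.5:    {mz_flagged}")
```

With only eight points, the value 100 drags the mean and standard deviation so far that its own z-score stays below 3. The median and MAD barely move, so the modified score flags it clearly.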

Real-World Use Cases

Healthcare

Lab results with impossible values (negative white blood cell count). IQR for normal ranges, Z-score for population outliers. Contextual: unusual vital signs for specific age groups. Modified Z-score for skewed biomarker distributions.
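The healthcare case mixes two different checks: a hard domain rule (a count cannot be negative) and a statistical rule (a value is unusual relative to the rest). A minimal sketch of applying them in that order follows. The white blood cell values are invented for illustration, not clinical reference data, and the helper names are hypothetical.

```python
import statistics

def split_impossible(values, valid_min, valid_max):
    """Separate values that violate a hard domain limit from the rest."""
    impossible = [v for v in values if not (valid_min <= v <= valid_max)]
    plausible = [v for v in values if valid_min <= v <= valid_max]
    return impossible, plausible

def iqr_flags(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Made-up white blood cell counts (thousands of cells per microliter)
wbc = [6.1, 7.4, 5.8, -2.0, 8.9, 6.7, 7.0, 35.5, 6.4, 7.9]

# Step 1: domain rule. A negative count is an error, not a rare event.
impossible, plausible = split_impossible(wbc, valid_min=0.0, valid_max=float("inf"))

# Step 2: statistical rule, computed only on values that passed step 1
unusual = iqr_flags(plausible)

print(f"Impossible (fix or drop): {impossible}")
print(f"Unusual (investigate):    {unusual}")
```

Running the domain check first keeps impossible values from distorting the quartiles used by the statistical check.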

Finance

Fraud detection uses Isolation Forest for unusual transaction patterns. Z-score for price movements (identify flash crashes). Contextual outliers: unusual trades for a specific client's history. Winsorization for portfolio optimization.
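For price movements, a Z-score is usually computed against recent history rather than the whole series. Below is a minimal sketch of a trailing-window Z-score; the return series, the window length, and the threshold of 3 are assumptions chosen for illustration.

```python
import statistics

def rolling_z_scores(values, window):
    """Z-score of each value against the `window` values that precede it."""
    scores = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mean = statistics.mean(history)
        std = statistics.stdev(history)
        z = (values[i] - mean) / std if std > 0 else 0.0
        scores.append((i, z))
    return scores

# Hypothetical per-minute returns in percent, with one sudden drop
returns = [0.2, -0.1, 0.3, 0.0, -0.2, 0.1, 0.2, -0.3, 0.1, -6.5, 0.4, 0.0]

scores = rolling_z_scores(returns, window=8)
flagged = [i for i, z in scores if abs(z) > 3]

for i, z in scores:
    print(f"index {i}: return {returns[i]:+.1f}%, z = {z:+.2f}")
print(f"Flagged indices: {flagged}")
```

Once the extreme return enters the trailing window it inflates the standard deviation, so the points right after it look unremarkable. A median/MAD window would reduce that effect.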

Retail

Returns with negative quantities (data errors). Unusual purchase amounts (fraud). IQR for normal order values. Contextual: unusual purchases for customer segments. Seasonal outliers in demand forecasting.

Manufacturing

Sensor readings during equipment malfunction. IQR for normal operating ranges. Isolation Forest for multivariate anomaly detection (temperature + pressure + vibration). Contextual: unusual readings during startup vs steady-state.

Tech

Bot detection via unusual click patterns (LOF). Session duration outliers (DDoS attacks). Contextual: unusual behavior for user segments or times of day. Modified Z-score for skewed engagement metrics.

Implementation

Manual Implementation (No Libraries)

import numpy as np
import pandas as pd

# Create sample data with outliers
np.random.seed(42)
# Normal data
normal_data = np.random.normal(100, 15, 100)
# Add outliers
outliers = [200, 5, 180, 20, 210]
data = np.concatenate([normal_data, outliers])

df = pd.DataFrame({
    'value': data,
    'category': ['A'] * 50 + ['B'] * 50 + ['outlier'] * 5
})

print("Dataset Statistics:")
print(df['value'].describe())
print(f"\nData range: [{df['value'].min():.1f}, {df['value'].max():.1f}]")

# 1. Z-SCORE METHOD (Manual)
def zscore_outliers_manual(series, threshold=3):
    """
    Manual Z-score outlier detection.
    z = (x - mean) / std
    Flag if |z| > threshold
    """
    mean_val = np.mean(series)
    std_val = np.std(series, ddof=1)
    
    z_scores = (series - mean_val) / std_val
    outlier_mask = np.abs(z_scores) > threshold
    
    return outlier_mask, z_scores, mean_val, std_val

print("\n=== 1. Z-SCORE METHOD (Manual) ===")
z_outliers, z_scores, z_mean, z_std = zscore_outliers_manual(df['value'], threshold=3)
print(f"Mean: {z_mean:.2f}, Std: {z_std:.2f}")
print(f"Outliers detected (|z| > 3): {z_outliers.sum()}")
print(f"Outlier indices: {df[z_outliers].index.tolist()}")
print(f"Outlier values: {df[z_outliers]['value'].values}")
print(f"Z-scores of outliers: {z_scores[z_outliers]}")

# 2. IQR METHOD (Manual)
def iqr_outliers_manual(series, k=1.5):
    """
    Manual IQR outlier detection.
    Lower bound = Q1 - k * IQR
    Upper bound = Q3 + k * IQR
    """
    # Sort for percentile calculation
    sorted_vals = np.sort(series)
    n = len(sorted_vals)
    
    # Calculate Q1 (25th percentile) and Q3 (75th percentile)
    q1_idx = int(0.25 * n)
    q3_idx = int(0.75 * n)
    q1 = sorted_vals[q1_idx]
    q3 = sorted_vals[q3_idx]
    iqr = q3 - q1
    
    # Calculate bounds
    lower_bound = q1 - k * iqr
    upper_bound = q3 + k * iqr
    
    # Detect outliers
    outlier_mask = (series < lower_bound) | (series > upper_bound)
    
    return outlier_mask, lower_bound, upper_bound, q1, q3, iqr

print("\n=== 2. IQR METHOD (Manual) ===")
iqr_outliers, lb, ub, q1, q3, iqr = iqr_outliers_manual(df['value'], k=1.5)
print(f"Q1: {q1:.2f}, Q3: {q3:.2f}, IQR: {iqr:.2f}")
print(f"Lower bound: {lb:.2f}, Upper bound: {ub:.2f}")
print(f"Outliers detected: {iqr_outliers.sum()}")
print(f"Outlier values: {df[iqr_outliers]['value'].values}")

# Compare Z-score vs IQR
print("\n=== COMPARISON ===")
print(f"Z-score detected: {z_outliers.sum()} outliers")
print(f"IQR detected: {iqr_outliers.sum()} outliers")
print(f"Intersection: {(z_outliers & iqr_outliers).sum()}")

# 3. MODIFIED Z-SCORE (Using MAD)
def modified_zscore_outliers(series, threshold=3.5):
    """
    Modified Z-score using Median Absolute Deviation (MAD).
    More robust to existing outliers.
    M_i = 0.6745 * (x_i - median) / MAD
    """
    median_val = np.median(series)
    mad = np.median(np.abs(series - median_val))
    
    if mad == 0:
        mad = np.mean(np.abs(series - median_val)) * 1.253
    
    modified_z = 0.6745 * (series - median_val) / mad
    outlier_mask = np.abs(modified_z) > threshold
    
    return outlier_mask, modified_z, median_val, mad

print("\n=== 3. MODIFIED Z-SCORE (MAD-based) ===")
mz_outliers, mz_scores, median_val, mad = modified_zscore_outliers(df['value'], threshold=3.5)
print(f"Median: {median_val:.2f}, MAD: {mad:.2f}")
print(f"Outliers detected: {mz_outliers.sum()}")
print(f"Modified Z-scores: {mz_scores.round(3)}")

# 4. PERCENTILE METHOD
def percentile_outliers(series, lower_pct=1, upper_pct=99):
    """
    Flag values outside specified percentiles.
    """
    lower_bound = np.percentile(series, lower_pct)
    upper_bound = np.percentile(series, upper_pct)
    outlier_mask = (series < lower_bound) | (series > upper_bound)
    
    return outlier_mask, lower_bound, upper_bound

print("\n=== 4. PERCENTILE METHOD ===")
pct_outliers, p_lower, p_upper = percentile_outliers(df['value'], 1, 99)
print(f"1st percentile: {p_lower:.2f}")
print(f"99th percentile: {p_upper:.2f}")
print(f"Outliers detected: {pct_outliers.sum()}")

# 5. WINSORIZATION (Capping)
def winsorize_manual(series, limits=(0.05, 0.05)):
    """
    Manual winsorization - cap extreme values at percentiles.
    """
    lower_limit, upper_limit = limits
    lower_pct = lower_limit * 100
    upper_pct = (1 - upper_limit) * 100
    
    lower_bound = np.percentile(series, lower_pct)
    upper_bound = np.percentile(series, upper_pct)
    
    winsorized = series.copy()
    winsorized = np.clip(winsorized, lower_bound, upper_bound)
    
    return winsorized, lower_bound, upper_bound

print("\n=== 5. WINSORIZATION (Capping) ===")
winsorized, win_lower, win_upper = winsorize_manual(df['value'], limits=(0.05, 0.05))
print(f"Capped to [{win_lower:.2f}, {win_upper:.2f}]")
print(f"Original range: [{df['value'].min():.1f}, {df['value'].max():.1f}]")
print(f"Winsorized range: [{winsorized.min():.1f}, {winsorized.max():.1f}]")
print(f"Values capped: {(df['value'] != winsorized).sum()}")

# 6. TRIMMING (Removal)
def trim_outliers(series, method='iqr', **kwargs):
    """
    Remove outliers and return clean data.
    """
    if method == 'iqr':
        outlier_mask, _, _, _, _, _ = iqr_outliers_manual(series, **kwargs)
    elif method == 'zscore':
        # zscore_outliers_manual returns four values: mask, z-scores, mean, std
        outlier_mask, _, _, _ = zscore_outliers_manual(series, **kwargs)
    else:
        raise ValueError("Method must be 'iqr' or 'zscore'")
    
    clean_data = series[~outlier_mask]
    return clean_data, outlier_mask

print("\n=== 6. TRIMMING (Removal) ===")
clean_iqr, mask_iqr = trim_outliers(df['value'], method='iqr', k=1.5)
clean_zscore, mask_zscore = trim_outliers(df['value'], method='zscore', threshold=3)

print(f"Original data size: {len(df)}")
print(f"After IQR trimming: {len(clean_iqr)} (removed {mask_iqr.sum()})")
print(f"After Z-score trimming: {len(clean_zscore)} (removed {mask_zscore.sum()})")

# 7. MULTIVARIATE OUTLIER (Mahalanobis distance - simplified)
def mahalanobis_outliers(df, cols, threshold=None):
    """
    Simplified Mahalanobis distance for multivariate outlier detection.
    """
    from scipy.stats import chi2
    
    # Get numeric data
    X = df[cols].values
    
    # Calculate mean and covariance
    mean = np.mean(X, axis=0)
    cov = np.cov(X.T)
    
    # Handle singular covariance
    try:
        cov_inv = np.linalg.inv(cov)
    except np.linalg.LinAlgError:
        cov_inv = np.linalg.pinv(cov)
    
    # Calculate Mahalanobis distance
    diff = X - mean
    md = np.sqrt(np.sum(diff @ cov_inv * diff, axis=1))
    
    # Set threshold based on chi-square distribution
    if threshold is None:
        threshold = np.sqrt(chi2.ppf(0.975, df=len(cols)))
    
    outlier_mask = md > threshold
    
    return outlier_mask, md

print("\n=== 7. MAHALANOBIS DISTANCE (Multivariate) ===")
# Create 2D data for demonstration
df_2d = pd.DataFrame({
    'x': np.concatenate([np.random.normal(10, 2, 95), [25, 3, 22, 4, 20]]),
    'y': np.concatenate([np.random.normal(5, 1, 95), [2, 12, 3, 10, 2]])
})

mahal_outliers, mahal_dist = mahalanobis_outliers(df_2d, ['x', 'y'])
print(f"Outliers detected: {mahal_outliers.sum()}")
print(f"Mahalanobis distances (first 10): {mahal_dist[:10].round(2)}")

# 8. Summary comparison
print("\n=== METHOD COMPARISON SUMMARY ===")
methods = {
    'Z-Score (|z|>3)': z_outliers.sum(),
    'IQR (k=1.5)': iqr_outliers.sum(),
    'Modified Z-Score': mz_outliers.sum(),
    'Percentile (1-99%)': pct_outliers.sum()
}
for method, count in methods.items():
    print(f"  {method}: {count} outliers")

print("\n=== TREATMENT SUMMARY ===")
print(f"Winsorized: {(df['value'] != winsorized).sum()} values capped")
print(f"Trimmed (IQR): {mask_iqr.sum()} values removed")
print(f"Trimmed (Z-score): {mask_zscore.sum()} values removed")

Using Libraries

import pandas as pd
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.covariance import EllipticEnvelope
from scipy import stats
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

# Create sample data with various outliers
np.random.seed(42)

# Normal cluster
normal_data = np.random.multivariate_normal([10, 10], [[1, 0.5], [0.5, 1]], 100)

# Global outliers (far from everything)
global_outliers = np.array([[20, 20], [2, 2], [20, 5]])

# Contextual outliers (normal in one dimension, not in context)
contextual_outliers = np.array([[15, 5], [5, 15]])

# Combine
X = np.vstack([normal_data, global_outliers, contextual_outliers])
df = pd.DataFrame(X, columns=['feature1', 'feature2'])

# Add 1D data for univariate methods
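# The column needs 105 values to match the 105 rows of X (100 + 3 + 2)
df['value'] = np.concatenate([
    np.random.normal(100, 15, 100),
    [200, 50, 180],  # Global outliers
    [150, 45]  # Two more outliers so the length matches the DataFrame
])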

print("Dataset shape:", df.shape)
print("\nStatistics:")
print(df.describe())

# 1. SCIPY STATS METHODS
print("\n" + "="*60)
print("1. SCIPY STATS METHODS")
print("="*60)

# Z-score
z_scores = np.abs(stats.zscore(df['value']))
z_outliers = z_scores > 3
print(f"Z-score outliers (|z| > 3): {z_outliers.sum()}")
print(f"Outlier values: {df[z_outliers]['value'].values}")

# Modified Z-score (MAD-based)
median = np.median(df['value'])
mad = np.median(np.abs(df['value'] - median))
modified_z = 0.6745 * (df['value'] - median) / mad
modified_z_outliers = np.abs(modified_z) > 3.5
print(f"\nModified Z-score outliers: {modified_z_outliers.sum()}")

# 2. ISOLATION FOREST
print("\n" + "="*60)
print("2. ISOLATION FOREST")
print("="*60)

# Univariate
iso_forest_1d = IsolationForest(contamination=0.1, random_state=42)
outliers_1d = iso_forest_1d.fit_predict(df[['value']])
iso_outliers_1d = outliers_1d == -1
print(f"Isolation Forest (1D) outliers: {iso_outliers_1d.sum()}")

# Multivariate
iso_forest = IsolationForest(contamination=0.1, random_state=42, n_estimators=100)
outliers_iso = iso_forest.fit_predict(df[['feature1', 'feature2']])
iso_outliers = outliers_iso == -1
print(f"Isolation Forest (2D) outliers: {iso_outliers.sum()}")
print(f"Outlier scores (first 10): {iso_forest.score_samples(df[['feature1', 'feature2']])[:10].round(3)}")

# Decision function (anomaly score)
decision_scores = iso_forest.decision_function(df[['feature1', 'feature2']])
print(f"Decision scores range: [{decision_scores.min():.3f}, {decision_scores.max():.3f}]")

# 3. LOCAL OUTLIER FACTOR (LOF)
print("\n" + "="*60)
print("3. LOCAL OUTLIER FACTOR (LOF)")
print("="*60)

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.1)
outliers_lof = lof.fit_predict(df[['feature1', 'feature2']])
lof_outliers = outliers_lof == -1
print(f"LOF outliers: {lof_outliers.sum()}")

# negative_outlier_factor_ is lower for more outlying points; negate it so that
# higher = more outlying (inliers score close to 1)
lof_scores = -lof.negative_outlier_factor_
print(f"LOF scores (first 10): {lof_scores[:10].round(3)}")
print(f"Outlier score threshold: {np.percentile(lof_scores, 90):.3f}")

# 4. ELLIPTIC ENVELOPE (Gaussian)
print("\n" + "="*60)
print("4. ELLIPTIC ENVELOPE (Gaussian)")
print("="*60)

elliptic = EllipticEnvelope(contamination=0.1, random_state=42)
outliers_elliptic = elliptic.fit_predict(df[['feature1', 'feature2']])
elliptic_outliers = outliers_elliptic == -1
print(f"Elliptic Envelope outliers: {elliptic_outliers.sum()}")

# Squared Mahalanobis distances (EllipticEnvelope.mahalanobis returns d^2, not d)
mahal_distances = elliptic.mahalanobis(df[['feature1', 'feature2']])
print(f"Squared Mahalanobis distances (first 5): {mahal_distances[:5].round(3)}")

# 5. PANDAS/NUMPY QUANTILE METHODS
print("\n" + "="*60)
print("5. QUANTILE-BASED METHODS")
print("="*60)

# IQR method
Q1 = df['value'].quantile(0.25)
Q3 = df['value'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
iqr_outliers = (df['value'] < lower_bound) | (df['value'] > upper_bound)

print(f"IQR bounds: [{lower_bound:.2f}, {upper_bound:.2f}]")
print(f"IQR outliers: {iqr_outliers.sum()}")

# Custom percentiles
p01 = df['value'].quantile(0.01)
p99 = df['value'].quantile(0.99)
percentile_outliers = (df['value'] < p01) | (df['value'] > p99)
print(f"\n1st-99th percentile bounds: [{p01:.2f}, {p99:.2f}]")
print(f"Percentile outliers: {percentile_outliers.sum()}")

# 6. TREATMENT METHODS
print("\n" + "="*60)
print("6. OUTLIER TREATMENT METHODS")
print("="*60)

# Method 1: Removal
df_clean = df[~iqr_outliers].copy()
print(f"1. Removal: {len(df)} → {len(df_clean)} rows")

# Method 2: Winsorization (capping)
from sklearn.preprocessing import FunctionTransformer

def winsorize_transform(X, limits=(0.05, 0.05)):
    lower_pct = limits[0] * 100
    upper_pct = (1 - limits[1]) * 100
    lower = np.percentile(X, lower_pct, axis=0)
    upper = np.percentile(X, upper_pct, axis=0)
    return np.clip(X, lower, upper)

winsorizer = FunctionTransformer(winsorize_transform, validate=True)
df_winsorized = df.copy()
df_winsorized[['value']] = winsorizer.fit_transform(df[['value']])
print("\n2. Winsorization:")
print(f"   Original range: [{df['value'].min():.1f}, {df['value'].max():.1f}]")
print(f"   Winsorized range: [{df_winsorized['value'].min():.1f}, {df_winsorized['value'].max():.1f}]")

# Method 3: Log transformation
# Shift so the minimum maps to 0; log1p already adds the 1 needed to avoid log(0)
df_log = df.copy()
df_log['value_log'] = np.log1p(df['value'] - df['value'].min())
print("\n3. Log transform:")
print(f"   Original std: {df['value'].std():.2f}")
print(f"   Log std: {df_log['value_log'].std():.2f}")

# Method 4: Robust scaling (less sensitive to outliers)
from sklearn.preprocessing import RobustScaler
robust = RobustScaler()
df_robust = df.copy()
df_robust[['value']] = robust.fit_transform(df[['value']])
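print("\n4. Robust scaling:")
print(f"   Original mean: {df['value'].mean():.2f}, std: {df['value'].std():.2f}")
# RobustScaler centers on the median and divides by the IQR
scaled_iqr = df_robust['value'].quantile(0.75) - df_robust['value'].quantile(0.25)
print(f"   Scaled median: {df_robust['value'].median():.2f}, scaled IQR: {scaled_iqr:.2f}")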

# 7. COMPARISON OF METHODS
print("\n" + "="*60)
print("7. METHOD COMPARISON")
print("="*60)

methods_comparison = pd.DataFrame({
    'Method': ['Z-Score', 'IQR', 'Modified Z', 'Isolation Forest', 'LOF', 'Elliptic Envelope'],
    'Type': ['Statistical', 'Statistical', 'Statistical', 'ML', 'ML', 'Statistical'],
    'Dimensions': ['Univariate', 'Univariate', 'Univariate', 'Multivariate', 'Multivariate', 'Multivariate'],
    'Outliers Detected': [
        z_outliers.sum(),
        iqr_outliers.sum(),
        modified_z_outliers.sum(),
        iso_outliers.sum(),
        lof_outliers.sum(),
        elliptic_outliers.sum()
    ],
    'Robust': ['No', 'Yes', 'Yes', 'Yes', 'Yes', 'No']
})

print(methods_comparison.to_string(index=False))

# 8. CONTEXTUAL OUTLIER DETECTION
print("\n" + "="*60)
print("8. CONTEXTUAL OUTLIER DETECTION")
print("="*60)

# Group-based outlier detection
df['group'] = np.random.choice(['A', 'B', 'C'], len(df))

def detect_group_outliers(df, value_col, group_col, method='iqr'):
    """Detect outliers within each group; the returned mask is aligned to df's rows"""
    # Start with all False and fill in each group's flags at that group's row positions
    outlier_mask = pd.Series(False, index=df.index)
    
    for group in df[group_col].unique():
        group_mask = df[group_col] == group
        group_data = df.loc[group_mask, value_col]
        
        if method == 'iqr':
            Q1 = group_data.quantile(0.25)
            Q3 = group_data.quantile(0.75)
            IQR = Q3 - Q1
            lower = Q1 - 1.5 * IQR
            upper = Q3 + 1.5 * IQR
            group_outliers = (group_data < lower) | (group_data > upper)
        else:
            z = np.abs(stats.zscore(group_data))
            group_outliers = z > 2
        
        # np.asarray handles both Series and ndarray results
        outlier_mask.loc[group_mask] = np.asarray(group_outliers)
    
    return outlier_mask.to_numpy()

contextual_outliers = detect_group_outliers(df, 'value', 'group', method='iqr')
print(f"Contextual outliers (by group): {contextual_outliers.sum()}")
print(f"Global outliers (all data): {iqr_outliers.sum()}")
print(f"Difference shows context matters: {abs(contextual_outliers.sum() - iqr_outliers.sum())} points")

# 9. PRODUCTION PIPELINE
print("\n" + "="*60)
print("9. PRODUCTION OUTLIER HANDLER")
print("="*60)

from sklearn.base import BaseEstimator, TransformerMixin

class OutlierHandler(BaseEstimator, TransformerMixin):
    """
    Custom transformer for outlier handling in production pipelines.
    """
    def __init__(self, method='iqr', treatment='clip', contamination=0.1):
        self.method = method
        self.treatment = treatment
        self.contamination = contamination
        self.bounds_ = {}
        self.model_ = None
    
    def fit(self, X, y=None):
        if isinstance(X, pd.DataFrame):
            X = X.values
        
        if self.method == 'iqr':
            for i in range(X.shape[1]):
                Q1 = np.percentile(X[:, i], 25)
                Q3 = np.percentile(X[:, i], 75)
                IQR = Q3 - Q1
                self.bounds_[i] = {
                    'lower': Q1 - 1.5 * IQR,
                    'upper': Q3 + 1.5 * IQR
                }
        elif self.method == 'isolation_forest':
            self.model_ = IsolationForest(
                contamination=self.contamination,
                random_state=42
            )
            self.model_.fit(X)
        
        return self
    
    def transform(self, X):
        X = X.copy()  # works for both DataFrames and NumPy arrays
        
        if self.method == 'iqr' and self.treatment == 'clip':
            for i in range(X.shape[1] if isinstance(X, np.ndarray) else len(X.columns)):
                if isinstance(X, pd.DataFrame):
                    col = X.columns[i]
                    X[col] = X[col].clip(
                        self.bounds_[i]['lower'],
                        self.bounds_[i]['upper']
                    )
                else:
                    X[:, i] = np.clip(
                        X[:, i],
                        self.bounds_[i]['lower'],
                        self.bounds_[i]['upper']
                    )
        
        return X
    
    def predict_outliers(self, X):
        """Predict which samples are outliers"""
        if self.method == 'isolation_forest' and self.model_ is not None:
            return self.model_.predict(X) == -1
        return None

# Test the handler
handler = OutlierHandler(method='iqr', treatment='clip')
handler.fit(df[['value']])
df_handled = handler.transform(df[['value']])

print(f"Handler bounds for 'value':")
print(f"  Lower: {handler.bounds_[0]['lower']:.2f}")
print(f"  Upper: {handler.bounds_[0]['upper']:.2f}")
print("\nOriginal vs handled (first 5 outliers):")
outlier_idx = df[iqr_outliers].index[:5]
print(pd.DataFrame({
    'Original': df.loc[outlier_idx, 'value'].values,
    'Handled': df_handled.loc[outlier_idx, 'value'].values
}))

# 10. SUMMARY
print("\n" + "="*60)
print("10. BEST PRACTICES SUMMARY")
print("="*60)

best_practices = [
    "1. ALWAYS visualize data before outlier detection",
    "2. Use robust methods (IQR, MAD) for skewed distributions",
    "3. For multivariate outliers, use Isolation Forest or LOF",
    "4. Document outlier treatment decisions and thresholds",
    "5. Consider contextual outliers within groups/categories",
    "6. Prefer capping (winsorization) over removal for small datasets",
    "7. Validate that outlier treatment improves model performance",
    "8. Be cautious about removing target-variable outliers; they may be the cases the model must predict",
    "9. Choose univariate thresholds per feature rather than reusing one global cutoff",
    "10. Retain outlier flags as features (the outlier status may be predictive)"
]

for practice in best_practices:
    print(practice)

When to Use

✅ Appropriate Use Cases:

  • Z-score/IQR: Use for univariate detection, normally distributed data, clear statistical definition needed
  • Isolation Forest: Use for multivariate outliers, high-dimensional data, mixed data types
  • LOF: Use when local density matters, clusters of different densities, contextual outliers
  • Winsorization: Use when you want to retain all observations but limit extreme influence
  • Removal: Use when outliers are clearly errors (impossible values), the dataset is large, and the flagged points are a small share of the rows
  • Contextual detection: Use when outliers depend on subgroup (e.g., outliers within each product category)

❌ Avoid When:

  • Don't remove outliers without investigation—they may be valuable signals or errors to correct
  • Avoid Z-score on highly skewed data—use modified Z-score or IQR instead
  • Don't use univariate methods when outliers exist in multivariate combinations
  • Avoid aggressive outlier removal on small datasets—reduces statistical power
  • Don't treat outliers differently in train vs test—apply consistent handling
  • Avoid Elliptic Envelope with non-Gaussian data—assumes normal distribution
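Consistent handling across train and test usually means learning the thresholds from the training data only and reusing them unchanged on new data. Here is a minimal sketch with IQR fences and capping; the data and function names are hypothetical.

```python
import statistics

def fit_iqr_bounds(train_values, k=1.5):
    """Learn capping bounds from training data only."""
    q1, _, q3 = statistics.quantiles(train_values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def clip_to_bounds(values, bounds):
    """Cap values to previously learned bounds (no refitting)."""
    lower, upper = bounds
    return [min(max(v, lower), upper) for v in values]

# Hypothetical feature values
train = [12, 15, 14, 13, 16, 15, 14, 13, 15, 14]
test = [14, 90, 13, -40]

bounds = fit_iqr_bounds(train)                 # fitted once, on train
train_clipped = clip_to_bounds(train, bounds)
test_clipped = clip_to_bounds(test, bounds)    # same bounds reused on test

print(f"Bounds learned from train: {bounds}")
print(f"Test before: {test}")
print(f"Test after:  {test_clipped}")
```

Capping, unlike removal, can be applied to test or production rows without dropping any of them, so every incoming row still receives a prediction.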

Common Pitfalls

  • Removing outliers before train/test split—information leakage from test set
  • Using the same threshold for all features—different features need different treatments
  • Not investigating why outliers occur—missed data quality issues
  • Over-removing outliers (>10%)—destroys data distribution and model validity
  • Forgetting to handle outliers in production data—model sees unexpected values
  • Treating multivariate outliers as univariate—some points are normal individually but outlying together
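The last pitfall can be shown with two strongly correlated features. The sketch below uses NumPy only and synthetic data (the 0.95 correlation and the appended point are assumptions for illustration). It compares per-feature Z-scores with the Mahalanobis distance; the cutoff uses the closed form of the chi-square quantile for 2 degrees of freedom, -2 ln(1 - p).

```python
import numpy as np

rng = np.random.default_rng(0)

# Two strongly correlated synthetic features
cov = np.array([[1.0, 0.95], [0.95, 1.0]])
X = rng.multivariate_normal([0.0, 0.0], cov, size=500)

# Append a point that is ordinary on each axis but breaks the correlation
X = np.vstack([X, [2.0, -2.0]])

# Univariate view: per-feature z-scores, flag if any feature exceeds 3
z = np.abs((X - X.mean(axis=0)) / X.std(axis=0, ddof=1))
univariate_flag = (z > 3).any(axis=1)

# Multivariate view: Mahalanobis distance using the sample covariance
diff = X - X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
md = np.sqrt(np.einsum('ij,jk,ik->i', diff, cov_inv, diff))

# 97.5% chi-square cutoff for 2 degrees of freedom, on the distance scale
threshold = np.sqrt(-2.0 * np.log(1.0 - 0.975))
multivariate_flag = md > threshold

print(f"z-scores of appended point: {z[-1].round(2)}")
print(f"Flagged by univariate z-score: {univariate_flag[-1]}")
print(f"Mahalanobis distance: {md[-1]:.2f} (cutoff {threshold:.2f})")
print(f"Flagged by Mahalanobis: {multivariate_flag[-1]}")
```

Each coordinate of the appended point is only about two standard deviations from its mean, so a per-feature check passes it. The combination contradicts the correlation between the features, which the Mahalanobis distance picks up.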