Distribution Plots: Visualizing Data Shape and Spread

Definition

Distribution plots are visualizations designed to reveal the shape, central tendency, and spread of data. Unlike summary statistics alone, they show the full picture: modes, skewness, outliers, gaps, and clusters. The four primary distribution plots are histograms (frequency counts in bins), kernel density estimation (KDE) plots (a smoothed estimate of the probability density), box plots (a summary of quartiles and outliers), and Q-Q (quantile-quantile) plots (sample quantiles plotted against the quantiles of a theoretical distribution).

Intuition

Think of distribution plots as different ways to photograph a landscape. A histogram is like a pixelated photo: each bin is a coarse block, and the block size decides how much detail you see. A KDE plot is like a smooth panoramic photo. A box plot is like a topographic map showing only the key landmarks. A Q-Q plot is like comparing your landscape to a reference photo.

Mathematical Formula

Sturges' rule (number of bins \(k\) for a sample of size \(n\)):
\[ k = \lceil \log_2(n) \rceil + 1 \]
Box plot fences (with \(IQR = Q_3 - Q_1\)):
\[ \text{Lower fence} = Q_1 - 1.5 \times IQR \]
\[ \text{Upper fence} = Q_3 + 1.5 \times IQR \]

Step-by-Step Explanation:

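  1. Sturges' rule: Estimates the number of histogram bins k from the sample size n. It assumes roughly normal data and tends to suggest too few bins for large or strongly skewed samples.
  2. Box plot fences: With IQR = Q3 - Q1 (the interquartile range), values below the lower fence or above the upper fence are flagged as potential outliers.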
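
As a quick worked example, both formulas can be evaluated directly. This is a minimal sketch; the function names `sturges_bins` and `box_plot_fences` and the quartile values 90 and 110 are made up for illustration.

```python
import math

def sturges_bins(n):
    """Sturges' rule: k = ceil(log2(n)) + 1 for a sample of size n."""
    return math.ceil(math.log2(n)) + 1

def box_plot_fences(q1, q3):
    """Lower and upper fences at 1.5 * IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

print(sturges_bins(1000))            # 11, since log2(1000) is about 9.97
print(box_plot_fences(90.0, 110.0))  # (60.0, 140.0)
```

With Q1 = 90 and Q3 = 110 the IQR is 20, so any value below 60 or above 140 would be flagged.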

Real-World Use Cases

Quality Control

Manufacturing engineers use histograms to monitor product dimensions. A shift in the distribution center indicates a process change.

Finance

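Risk analysts examine return distributions using histograms and KDE plots to identify fat tails, that is, extreme gains or losses occurring more often than a normal distribution would predict.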

Healthcare

Clinicians use box plots to compare biomarker levels between healthy and diseased populations.

Implementation

Manual Implementation (NumPy Only)

A histogram divides the data range into bins and counts the observations in each bin. A box plot uses the quartiles and the 1.5*IQR fences to identify outliers.

import numpy as np

def create_histogram(data, bins=10):
    # np.histogram returns the count per bin and the (bins + 1) bin edges
    counts, edges = np.histogram(data, bins=bins)
    return counts, edges

def box_plot_stats(data):
    # Quartiles via NumPy's default (linear interpolation) percentile method
    q1, q2, q3 = np.percentile(data, [25, 50, 75])
    iqr = q3 - q1
    # Fences: points outside [lower, upper] are flagged as outliers
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    return q1, q2, q3, lower, upper

rng = np.random.default_rng(42)  # fixed seed so the output is reproducible
data = rng.normal(100, 15, 1000)
counts, edges = create_histogram(data)
q1, q2, q3, lower, upper = box_plot_stats(data)
print(f'Median: {q2:.2f}, IQR: {q3 - q1:.2f}')
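
The listing above leans on NumPy for both the binning and the percentiles. To see the arithmetic those calls hide, here is a sketch in pure Python. The names `histogram_counts` and `percentile` are illustrative, equal-width bins are assumed, and the percentile uses linear interpolation between neighbouring order statistics, which is one of several conventions in use.

```python
def histogram_counts(data, bins=10):
    """Count observations in equal-width bins spanning min(data)..max(data)."""
    lo, hi = min(data), max(data)
    if hi == lo:
        raise ValueError("all values are identical; bin width would be zero")
    width = (hi - lo) / bins
    counts = [0] * bins
    for x in data:
        # The maximum value belongs to the last bin (closed right edge)
        i = min(int((x - lo) / width), bins - 1)
        counts[i] += 1
    edges = [lo + i * width for i in range(bins + 1)]
    return counts, edges

def percentile(sorted_data, p):
    """p-th percentile (0..100) by linear interpolation on sorted data."""
    pos = (len(sorted_data) - 1) * p / 100
    below = int(pos)
    above = min(below + 1, len(sorted_data) - 1)
    frac = pos - below
    return sorted_data[below] + (sorted_data[above] - sorted_data[below]) * frac

values = sorted([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
counts, edges = histogram_counts(values, bins=3)
print(counts)  # [3, 3, 4]
print(edges)   # [1.0, 4.0, 7.0, 10.0]
print(percentile(values, 25), percentile(values, 50), percentile(values, 75))
# 3.25 5.5 7.75
```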

Using Libraries (numpy, matplotlib, seaborn, scipy)

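import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

rng = np.random.default_rng(42)  # fixed seed so the plots are reproducible
data = rng.normal(100, 15, 1000)

# Histogram with a KDE curve overlaid
sns.histplot(data, bins=30, kde=True)
plt.title('Histogram with KDE')
plt.show()

# Box plot: the box spans Q1 to Q3, with the median marked inside
sns.boxplot(y=data)
plt.title('Box Plot')
plt.show()

# Q-Q plot against a normal distribution; points close to the line suggest normality
stats.probplot(data, dist='norm', plot=plt)
plt.title('Q-Q Plot')
plt.show()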
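
To make clear what a normal Q-Q plot actually draws, the points can be computed by hand: sort the sample, then pair each value with the standard normal quantile at a matching probability. The sketch below uses only the standard library. The name `qq_points` and the sample values are made up, and the plotting positions (i - 0.5) / n are one common convention; library routines may use a slightly different formula.

```python
from statistics import NormalDist

def qq_points(data):
    """Return (theoretical, observed) quantile pairs for a normal Q-Q plot."""
    observed = sorted(data)
    n = len(observed)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # Plotting positions (i - 0.5) / n stay strictly inside (0, 1)
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return theoretical, observed

theoretical, observed = qq_points([102.0, 87.5, 110.3, 95.1, 99.8])
for t, o in zip(theoretical, observed):
    print(f"{t:+.3f}  {o:.1f}")
```

If the sample is roughly normal, plotting `observed` against `theoretical` gives points close to a straight line whose slope and intercept approximate the standard deviation and mean.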

When to Use

✅ Appropriate Use Cases:

  • Histogram: When you want to see the actual frequency distribution
  • KDE: When you want a smooth representation of probability density
  • Box plot: When comparing distributions across multiple groups
  • Q-Q plot: When testing if data follows a theoretical distribution

❌ Avoid When:

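  • Do not rely on histograms for very small samples (n < 30); the apparent shape then depends heavily on where the bin edges fall
  • Do not compare histograms drawn with different bin widths or ranges; use the same bin edges for every group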
  • Do not use box plots alone when distribution shape matters
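
One way to keep histograms comparable is to compute a single set of bin edges from all groups combined and count every group against those edges. The following is a rough sketch in pure Python; `shared_edges`, `count_in_bins`, and the two small groups are hypothetical, and equal-width bins are assumed.

```python
def shared_edges(samples, bins=10):
    """Equal-width bin edges covering the combined range of all samples."""
    lo = min(min(s) for s in samples)
    hi = max(max(s) for s in samples)
    width = (hi - lo) / bins
    return [lo + i * width for i in range(bins + 1)]

def count_in_bins(data, edges):
    """Count values per bin for equal-width edges; the last bin is closed."""
    bins = len(edges) - 1
    lo, width = edges[0], edges[1] - edges[0]
    counts = [0] * bins
    for x in data:
        counts[min(int((x - lo) / width), bins - 1)] += 1
    return counts

group_a = [1, 2, 2, 3, 3, 3, 4]
group_b = [3, 4, 4, 5, 5, 5, 9]
edges = shared_edges([group_a, group_b], bins=4)
print(edges)                          # [1.0, 3.0, 5.0, 7.0, 9.0]
print(count_in_bins(group_a, edges))  # [3, 4, 0, 0]
print(count_in_bins(group_b, edges))  # [0, 3, 3, 1]
```

When the groups differ in size, dividing each count by the group total puts the bars on the same scale.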

Common Pitfalls

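  • Bin size selection: Too many bins show noise; too few obscure patterns.
  • KDE bandwidth selection: Too small a bandwidth produces spurious bumps; too large a bandwidth smooths away real modes.
  • Ignoring outliers flagged by box plots: Points beyond the fences may be data errors or genuine extreme values, so investigate them before deciding to keep or drop them.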
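
The bandwidth pitfall is easy to demonstrate with a hand-written Gaussian KDE. The sketch below uses made-up data with two clusters and counts the local maxima of the estimate at three bandwidths. The names `gaussian_kde` and `count_modes`, the grid, and the bandwidth values are assumptions chosen for illustration, not recommended defaults.

```python
import math

def gaussian_kde(data, x, bandwidth):
    """Gaussian kernel density estimate at point x."""
    n = len(data)
    scale = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))
    return scale * sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data)

def count_modes(data, bandwidth, grid):
    """Count strict local maxima of the KDE evaluated on a grid."""
    y = [gaussian_kde(data, x, bandwidth) for x in grid]
    return sum(1 for i in range(1, len(y) - 1) if y[i - 1] < y[i] > y[i + 1])

# Two clusters, one near 1 and one near 5 (made-up numbers)
data = [0.8, 1.0, 1.2, 4.9, 5.0, 5.1]
grid = [-2.0 + 0.01 * i for i in range(1001)]  # covers -2.0 to 8.0

for h in (0.02, 0.5, 3.0):
    print(f"bandwidth={h}: {count_modes(data, h, grid)} mode(s)")
```

The smallest bandwidth should report a separate bump for nearly every data point, the middle one should recover the two clusters, and the largest should merge everything into a single hump.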