Distribution Plots: Visualizing Data Shape and Spread
Definition
Distribution plots are visualizations designed to reveal the shape, central tendency, and spread of data. Unlike summary statistics alone, they show the full picture: modes, skewness, outliers, gaps, and clusters. The four primary distribution plots are histograms (frequency counts in bins), kernel density estimation (KDE) plots (a smoothed estimate of the probability density), box plots (a summary of quartiles and outliers), and Q-Q (quantile-quantile) plots (sample quantiles plotted against those of a reference distribution).
Intuition
Think of distribution plots as different ways to photograph a landscape. A histogram is like taking multiple snapshots from different angles. A KDE plot is like a smooth panoramic photo. A box plot is like a topographic map showing key landmarks. A Q-Q plot is like comparing your landscape to a reference photo.
Mathematical Formula
Step-by-Step Explanation:
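- Sturges' rule: k = ⌈log2(n)⌉ + 1, where n is the sample size and k is the suggested number of histogram bins. It is a rule of thumb that works best for roughly normal data.
- Box plot fences: lower = Q1 - 1.5 × IQR and upper = Q3 + 1.5 × IQR, where IQR = Q3 - Q1. Values outside these fences are considered outliers.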
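The two rules above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: it takes Sturges' rule as ceil(log2(n)) + 1 and uses the 1.5 × IQR multiplier that also appears in the Implementation section. The function names `sturges_bins` and `tukey_fences` and the small sample are made up for this example.

```python
import math
import statistics


def sturges_bins(n):
    """Number of histogram bins suggested by Sturges' rule: ceil(log2(n)) + 1."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return math.ceil(math.log2(n)) + 1


def tukey_fences(data, k=1.5):
    """Return (lower, upper) fences: Q1 - k * IQR and Q3 + k * IQR."""
    # method="inclusive" interpolates linearly between the sorted data points
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical sample with one obvious extreme value
sample = [2, 4, 4, 5, 6, 7, 8, 9, 30]

print(sturges_bins(1000))  # 11
lower, upper = tukey_fences(sample)
print(lower, upper)  # -2.0 14.0
outliers = [x for x in sample if x < lower or x > upper]
print(outliers)  # [30]
```

Different quartile conventions give slightly different fences on small samples, so a value close to a fence may be flagged by one tool and not by another.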
Real-World Use Cases
Manufacturing engineers use histograms to monitor product dimensions. A shift in the distribution center indicates a process change.
Risk analysts examine return distributions using histograms and KDE to identify fat tails.
Clinicians use box plots to compare biomarker levels between healthy and diseased populations.
Implementation
Manual Implementation (NumPy Only)
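import numpy as np

def create_histogram(data, bins=10):
    """Return bin counts and bin edges for equal-width bins."""
    counts, edges = np.histogram(data, bins=bins)
    return counts, edges

def box_plot_stats(data):
    """Return the quartiles and the 1.5 * IQR fences used by box plots."""
    q1 = np.percentile(data, 25)
    q2 = np.percentile(data, 50)  # median
    q3 = np.percentile(data, 75)
    iqr = q3 - q1
    # Points beyond these fences are drawn as outliers
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    return q1, q2, q3, lower, upper

# Simulated sample: 1,000 draws from a normal distribution (mean 100, sd 15)
data = np.random.normal(100, 15, 1000)
counts, edges = create_histogram(data)
q1, q2, q3, lower, upper = box_plot_stats(data)
print(f'Median: {q2:.2f}, IQR: {q3-q1:.2f}')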
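The block above still relies on NumPy for the binning itself. As a rough sketch of what that binning does, the following version uses only built-in Python. The name `manual_histogram` and the choice to clamp the maximum value into the last bin are assumptions made for illustration; the aim is to mirror equal-width binning, not to reproduce `np.histogram` exactly.

```python
def manual_histogram(data, bins=10):
    """Count values into `bins` equal-width bins spanning [min, max]."""
    lo, hi = min(data), max(data)
    if lo == hi:
        # Degenerate case: widen the range so the bin width is not zero
        lo, hi = lo - 0.5, hi + 0.5
    width = (hi - lo) / bins
    edges = [lo + i * width for i in range(bins + 1)]
    counts = [0] * bins
    for x in data:
        # Clamp so the maximum value lands in the last bin (closed right edge)
        idx = min(int((x - lo) / width), bins - 1)
        counts[idx] += 1
    return counts, edges


counts, edges = manual_histogram([1, 2, 2, 3, 3, 3, 4, 4, 5], bins=4)
print(counts)  # [1, 2, 3, 3]
print(edges)   # [1.0, 2.0, 3.0, 4.0, 5.0]
```

Each bin is half-open on the right except the last, which also includes the maximum. With floating-point data, values that sit almost exactly on an edge can fall on either side because of rounding.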
Using Libraries (numpy, matplotlib, seaborn, scipy)
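import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Simulated sample: 1,000 draws from a normal distribution (mean 100, sd 15)
data = np.random.normal(100, 15, 1000)

# Histogram with a KDE curve overlaid
sns.histplot(data, bins=30, kde=True)
plt.title('Histogram with KDE')
plt.show()

# Box plot: median, quartiles, whiskers, and outlier points
sns.boxplot(y=data)
plt.title('Box Plot')
plt.show()

# Q-Q plot against a normal distribution; points near the line suggest normality
stats.probplot(data, dist='norm', plot=plt)
plt.title('Q-Q Plot')
plt.show()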
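To see roughly what a normal Q-Q plot is built from, the points can be computed by hand with the standard library. This is a sketch under stated assumptions: the helper name `normal_qq_points` is hypothetical, and the plotting positions (i + 0.5) / n are one common convention, not necessarily the one `scipy.stats.probplot` uses, so the coordinates may differ slightly from the library plot.

```python
import random
from statistics import NormalDist, mean, stdev


def normal_qq_points(data):
    """Pair each sorted observation with a standard-normal quantile for its rank."""
    ordered = sorted(data)
    n = len(ordered)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # Assumed plotting positions: (i + 0.5) / n for i = 0 .. n-1
    theoretical = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return theoretical, ordered


random.seed(0)
sample = [random.gauss(100, 15) for _ in range(1000)]
theo, obs = normal_qq_points(sample)

# For roughly normal data the points fall near the line y = mean + stdev * x
m, s = mean(sample), stdev(sample)
worst = max(abs(o - (m + s * t)) for t, o in zip(theo, obs))
print(f"largest deviation from the reference line: {worst:.2f}")
```

Plotting `theo` on the x-axis against `obs` on the y-axis gives the Q-Q scatter. Systematic curvature away from the reference line, especially at the ends, points to skew or heavy tails.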
When to Use
✅ Appropriate Use Cases:
- Histogram: When you want to see the actual frequency distribution
- KDE: When you want a smooth representation of probability density
- Box plot: When comparing distributions across multiple groups
- Q-Q plot: When checking whether data follows a theoretical distribution
❌ Avoid When:
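- Do not rely on histograms for very small samples (<30): the apparent shape then depends heavily on the choice of bins
- Do not compare histograms drawn with different bin widths or axis ranges
- Do not use box plots alone when distribution shape matters: they hide features such as multiple modes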
Common Pitfalls
- Bin size selection: Too many bins show noise; too few obscure patterns.
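- KDE bandwidth selection: Too small a bandwidth produces spurious bumps; too large a bandwidth smooths away real modes.
- Ignoring outliers flagged by box plots: Flagged points may be data errors or genuine extreme values, so investigate them before deciding how to treat them.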
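The bandwidth pitfall is easy to see in a small experiment. The sketch below hand-rolls a Gaussian KDE in plain Python and counts the peaks it produces at three bandwidths. The helper names, the bimodal test sample, and the specific bandwidth values are all assumptions chosen for illustration.

```python
import math
import random


def gaussian_kde(data, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate at each point in `grid`."""
    n = len(data)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data)
        for x in grid
    ]


def count_peaks(density):
    """Count strict local maxima in a list of density values."""
    return sum(
        1
        for i in range(1, len(density) - 1)
        if density[i - 1] < density[i] > density[i + 1]
    )


random.seed(1)
# Hypothetical bimodal sample: two normal clusters centred at 0 and 6
data = [random.gauss(0, 1) for _ in range(150)]
data += [random.gauss(6, 1) for _ in range(150)]
grid = [-4 + 0.1 * i for i in range(141)]  # -4.0 to 10.0 in steps of 0.1

peaks = {h: count_peaks(gaussian_kde(data, grid, h)) for h in (0.05, 0.5, 5.0)}
for h, k in peaks.items():
    print(f"bandwidth={h}: {k} peak(s)")
```

On a sample like this, the smallest bandwidth should produce many spurious peaks, while the largest should blur the two clusters together. A middle value is the one most likely to show the two real modes.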