Distribution Plots: Visualizing Data Shape and Spread

Beginner · EDA · ~2 min read

Prerequisites:

Descriptive Statistics: Measures of Central Tendency and Dispersion
Distribution Analysis: Understanding Data Shape and Patterns
matplotlib-basics

Definition

Distribution plots are visualizations designed to reveal the shape, central tendency, and spread of data. Unlike summary statistics alone, they show the full picture: modes, skewness, outliers, gaps, and clusters. The four primary distribution plots are histograms (frequency counts in bins), kernel density estimation (KDE) plots (a smoothed estimate of the probability density), box plots (a summary of quartiles and outliers), and Q-Q (quantile-quantile) plots (sample quantiles plotted against the quantiles of a reference distribution).

Intuition

Think of distribution plots as different ways to photograph a landscape. A histogram is like taking multiple snapshots from different angles. A KDE plot is like a smooth panoramic photo. A box plot is like a topographic map showing key landmarks. A Q-Q plot is like comparing your landscape to a reference photo.

Mathematical Formula

Sturges' Rule:

\[ k = \lceil \log_2(n) \rceil + 1 \]

Box Plot Fences:

\[ \text{Lower fence} = Q_1 - 1.5 \times IQR \]

\[ \text{Upper fence} = Q_3 + 1.5 \times IQR \]

Step-by-Step Explanation:

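Sturges' rule: Suggests a number of histogram bins k for a sample of n observations. It assumes roughly normal data and tends to oversmooth large or strongly skewed samples.
Box plot fences: With IQR = Q_3 - Q_1 (the interquartile range), values below the lower fence or above the upper fence are flagged as potential outliers.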
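
As a worked example of the two formulas above, the following minimal sketch applies them to a small made-up sample. The helper names `sturges_bins` and `iqr_fences` and the sample values are illustrative assumptions, not part of the original lesson; quartiles are computed with NumPy's default linear interpolation.

```python
import math
import numpy as np

def sturges_bins(n):
    """Number of histogram bins suggested by Sturges' rule for n observations."""
    return math.ceil(math.log2(n)) + 1

def iqr_fences(values):
    """Return (lower_fence, upper_fence) using the 1.5 * IQR convention."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Illustrative sample (hypothetical values) with one unusually large point
sample = [12, 14, 15, 15, 16, 17, 18, 19, 20, 45]

k = sturges_bins(len(sample))
lower, upper = iqr_fences(sample)
flagged = [v for v in sample if v < lower or v > upper]

print(f"Sturges bins for n={len(sample)}: {k}")
print(f"Fences: [{lower:.3f}, {upper:.3f}], flagged: {flagged}")
```

For this sample Q_1 = 15 and Q_3 = 18.75, so IQR = 3.75 and the fences are 9.375 and 24.375; only the value 45 falls outside them.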

Real-World Use Cases

Quality Control

Manufacturing engineers use histograms to monitor product dimensions. A shift in the distribution center indicates a process change.

Finance

Risk analysts examine return distributions using histograms and KDE to identify fat tails, that is, extreme gains or losses occurring more often than a normal distribution would predict.

Healthcare

Clinicians use box plots to compare biomarker levels between healthy and diseased populations.

Implementation

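Manual Implementation (NumPy Only, No Plotting Libraries)

Histograms divide data into bins and count the occurrences in each bin. Box plots use the quartiles and 1.5*IQR fences to identify outliers. The code below computes these underlying numbers with NumPy rather than drawing the plots.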

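import numpy as np

def create_histogram(data, bins=10):
    """Return bin counts and bin edges (len(edges) == len(counts) + 1)."""
    counts, edges = np.histogram(data, bins=bins)
    return counts, edges

def box_plot_stats(data):
    """Return the quartiles and the 1.5 * IQR fences used to flag outliers."""
    q1, q2, q3 = np.percentile(data, [25, 50, 75])
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    return q1, q2, q3, lower, upper

# Seeded generator so the example is reproducible
rng = np.random.default_rng(42)
data = rng.normal(100, 15, 1000)

counts, edges = create_histogram(data)
q1, q2, q3, lower, upper = box_plot_stats(data)

# Count the points a box plot would draw as individual outliers
n_outliers = int(np.sum((data < lower) | (data > upper)))
print(f'Median: {q2:.2f}, IQR: {q3-q1:.2f}')
print(f'Fences: [{lower:.2f}, {upper:.2f}], points outside: {n_outliers}')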
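
For a version with no third-party libraries at all, the binning and quartile steps might be sketched in plain Python as follows. The function names `manual_histogram` and `linear_percentile` are hypothetical, and the conventions (equal-width bins with the last bin closed on the right, linear interpolation between order statistics) are chosen to mirror NumPy's defaults. This is a minimal sketch, not a replacement for the library routines; values lying exactly on a bin edge may land differently because of floating-point rounding.

```python
def manual_histogram(values, bins=10):
    """Count values in equal-width bins spanning [min, max]."""
    lo, hi = min(values), max(values)
    if lo == hi:
        # Degenerate case: widen the range so the bin width is not zero
        lo, hi = lo - 0.5, hi + 0.5
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The maximum value belongs to the last bin, not a new one
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(bins + 1)]
    return counts, edges

def linear_percentile(values, p):
    """Percentile p (0-100) using linear interpolation between sorted values."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * p / 100
    low = int(pos)
    high = min(low + 1, len(ordered) - 1)
    frac = pos - low
    return ordered[low] + frac * (ordered[high] - ordered[low])

values = [1, 2, 2, 3, 3, 3, 4, 4, 5]
counts, edges = manual_histogram(values, bins=4)
q1 = linear_percentile(values, 25)
q3 = linear_percentile(values, 75)
print(counts, edges)
print(f"Q1={q1}, Q3={q3}, IQR={q3 - q1}")
```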

Using Libraries (numpy, matplotlib, seaborn, scipy)

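import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Seeded generator so the plots are reproducible
rng = np.random.default_rng(42)
data = rng.normal(100, 15, 1000)

# Histogram with a KDE curve overlaid
sns.histplot(data, bins=30, kde=True)
plt.title('Histogram with KDE')
plt.show()

# Box plot: box spans Q1 to Q3, points beyond the whiskers are drawn individually
sns.boxplot(y=data)
plt.title('Box Plot')
plt.show()

# Q-Q plot against a normal distribution; points close to the line suggest normality
stats.probplot(data, dist='norm', plot=plt)
plt.title('Q-Q Plot')
plt.show()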

When to Use

✅ Appropriate Use Cases:

Histogram: When you want to see the actual frequency distribution
KDE: When you want a smooth representation of probability density
Box plot: When comparing distributions across multiple groups
Q-Q plot: When checking visually whether data follows a theoretical distribution (for example, the normal)
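
To illustrate the Q-Q plot entry in the list above without drawing anything, the sketch below uses `scipy.stats.probplot`, which also returns a least-squares fit of the ordered sample against the theoretical quantiles. The two synthetic samples and the variable names are assumptions made for illustration. The correlation coefficient `r` is only a rough numeric summary of how straight the Q-Q line is, not a formal normality test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
normal_sample = rng.normal(100, 15, 1000)   # should hug the reference line
skewed_sample = rng.exponential(15, 1000)   # right-skewed, should bend away

# probplot returns ((theoretical quantiles, ordered values), (slope, intercept, r))
_, (slope_n, intercept_n, r_normal) = stats.probplot(normal_sample, dist='norm')
_, (_, _, r_skewed) = stats.probplot(skewed_sample, dist='norm')

# For a normal reference, slope and intercept roughly estimate scale and location
print(f"normal sample: r={r_normal:.4f}, slope={slope_n:.1f}, intercept={intercept_n:.1f}")
print(f"skewed sample: r={r_skewed:.4f}")
```

A value of `r` close to 1 corresponds to points lying near the straight line on the plot; the skewed sample gives a visibly lower value.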

❌ Avoid When:

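Do not rely on histograms for very small samples (as a rule of thumb, fewer than about 30 points), because the apparent shape depends heavily on where the bin edges fall
Do not compare histograms drawn with different bin widths or ranges; use the same bin edges for every group
Do not use box plots alone when distribution shape matters, because they hide features such as multiple modes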
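
The point about comparing histograms can be sketched as follows: compute one set of bin edges from the pooled data, then reuse it for each group. The two synthetic groups and their sizes are assumptions for illustration; `np.histogram_bin_edges` and the `density` argument of `np.histogram` are standard NumPy features.

```python
import numpy as np

rng = np.random.default_rng(1)
group_a = rng.normal(100, 15, 500)   # hypothetical group A
group_b = rng.normal(110, 20, 300)   # hypothetical group B (smaller sample)

# One shared set of edges covering both groups
edges = np.histogram_bin_edges(np.concatenate([group_a, group_b]), bins=20)

counts_a, _ = np.histogram(group_a, bins=edges)
counts_b, _ = np.histogram(group_b, bins=edges)

# density=True rescales each histogram to integrate to 1,
# which makes groups of different sizes comparable
dens_a, _ = np.histogram(group_a, bins=edges, density=True)
dens_b, _ = np.histogram(group_b, bins=edges, density=True)

print("shared edges:", len(edges), "-> bins:", len(counts_a))
print("counts sum:", counts_a.sum(), counts_b.sum())
```

The same `edges` array can be passed as `bins=` to plotting functions such as `plt.hist` so that the drawn bars line up across groups.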

Common Pitfalls

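Bin size selection: Too many bins show noise; too few obscure patterns.
KDE bandwidth selection: Too small a bandwidth produces spurious bumps; too large a bandwidth smooths away real modes.
Ignoring outliers flagged by box plots: Flagged points may be data errors or genuine extreme values, so investigate them rather than dropping or overlooking them.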
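
The bandwidth pitfall can be made concrete with a small experiment. The sketch below fits `scipy.stats.gaussian_kde` to a synthetic two-cluster sample at three bandwidth factors and counts the local maxima of each estimated density. The sample, the factor values, and the helper `count_peaks` are illustrative assumptions; a scalar `bw_method` is used by SciPy directly as the bandwidth factor that multiplies the sample standard deviation.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
# Synthetic bimodal sample: two clusters centred at 0 and 5
data = np.concatenate([rng.normal(0, 1, 200), rng.normal(5, 1, 200)])
grid = np.linspace(-5, 10, 500)

def count_peaks(y):
    """Count strict local maxima of a curve sampled on a grid."""
    return int(np.sum((y[1:-1] > y[:-2]) & (y[1:-1] > y[2:])))

peaks = {}
for bw in (0.05, 0.3, 1.0):
    density = gaussian_kde(data, bw_method=bw)(grid)
    peaks[bw] = count_peaks(density)
    print(f"bandwidth factor {bw}: {peaks[bw]} local maxima")
```

A very small factor tends to produce many spurious bumps, while a very large one can merge the two real clusters into a single hump; plotting `density` against `grid` for each factor shows the same effect visually.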
