Distribution Analysis: Understanding Data Shape and Patterns
Definition
Distribution analysis examines how data values are spread across the range of possible values. A distribution describes the frequency or probability of different outcomes occurring in a dataset. Understanding distributions is fundamental to statistics because the shape of the data determines which analytical methods are appropriate and how their results should be interpreted. The most famous distribution is the normal (Gaussian) distribution, characterized by its bell-shaped curve, but real-world data often follows other patterns such as skewed, log-normal, or power-law distributions.
Intuition
Think of a distribution as a snapshot of where data likes to 'hang out.' Imagine dropping thousands of balls onto a pegboard - the normal distribution is what you see at the bottom: most balls cluster in the center, with fewer at the extremes. Now imagine income distribution: most people earn modest amounts, but a long tail stretches to the right with a few high earners - this is right-skewed.
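The pegboard picture can be checked with a quick simulation. The sketch below assumes an idealized board in which every ball bounces left or right with equal probability at each row; the function name, the 12 rows, and the 10,000 balls are illustrative choices, not part of the original text.

```python
import numpy as np

def simulate_galton_board(n_balls=10_000, n_rows=12, seed=0):
    """Return the final bin index (number of rightward bounces) for each ball."""
    rng = np.random.default_rng(seed)
    # Each ball bounces left (0) or right (1) with equal probability at every row of pegs.
    bounces = rng.integers(0, 2, size=(n_balls, n_rows))
    return bounces.sum(axis=1)

bins = simulate_galton_board()
counts = np.bincount(bins, minlength=13)  # 12 rows give 13 possible bins (0..12)

# Crude text histogram: one '#' per 50 balls in each bin.
for position, count in enumerate(counts):
    bar = '#' * (int(count) // 50)
    print(f'{position:2d} | {bar}')
```

The printed bars should pile up around the middle bins and thin out toward both edges, which is the bell shape described above.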
Mathematical Formula
Step-by-Step Explanation:
- Normal PDF: The bell curve formula where mu is the mean and sigma is the standard deviation.
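- Skewness: Measures asymmetry. Positive values indicate right skew (a longer right tail), negative values indicate left skew, and values near zero indicate approximate symmetry.
- Z-score: Standardizes a value by subtracting the mean and dividing by the standard deviation, so the standardized data have mean 0 and standard deviation 1.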
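The formulas themselves do not appear in this section, so the standard textbook forms are given here on the assumption that they are what the three bullets describe. The skewness shown is the population (biased) version, which is also the default in scipy.stats.skew; other definitions apply a small-sample correction.

```latex
% Normal PDF with mean \mu and standard deviation \sigma
f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)

% Sample skewness (population form), with \bar{x} the sample mean
g_1 = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{s}\right)^{3},
\qquad s = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}

% Z-score of a value x
z = \frac{x - \mu}{\sigma}
```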
Real-World Use Cases
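Stock returns over short time periods are often approximated as normally distributed, while asset prices are commonly modeled as log-normal. Observed returns tend to have heavier tails than a normal distribution predicts, so the approximation should be checked rather than assumed.
Income distributions are typically right-skewed with long tails. A few very high earners pull the mean upward, so mean income overestimates typical earnings; the median is usually the better summary.
Manufacturing tolerance analysis commonly assumes that measurements are normally distributed around the target value.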
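The income example can be made concrete with synthetic data. This is a minimal sketch that assumes incomes are log-normal; the parameters (mean=10.5, sigma=0.8 on the log scale) are invented for illustration and are not fitted to any real dataset.

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical incomes: log-normal, so the median is about exp(10.5).
incomes = rng.lognormal(mean=10.5, sigma=0.8, size=50_000)

mean_income = incomes.mean()
median_income = np.median(incomes)
# Fraction of earners whose income falls below the mean.
share_below_mean = (incomes < mean_income).mean()

print(f'Mean:   {mean_income:,.0f}')
print(f'Median: {median_income:,.0f}')
print(f'Share of earners below the mean: {share_below_mean:.1%}')
```

With right-skewed data the mean sits above the median, and well over half of the simulated earners fall below the mean, which is why the mean is a poor description of a typical income.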
Implementation
Manual Implementation (No Libraries)
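import math
import random

def calculate_skewness(data):
    """Population (biased) skewness: the mean of the cubed z-scores."""
    n = len(data)
    mean = sum(data) / n
    # Population standard deviation (divide by n), consistent with the 1/n factor below
    std = math.sqrt(sum((x - mean) ** 2 for x in data) / n)
    if std == 0:
        return 0.0  # all values identical: no asymmetry to measure
    return sum(((x - mean) / std) ** 3 for x in data) / n

# Draw 1000 values from a normal distribution (mean 100, standard deviation 15)
data = [random.gauss(100, 15) for _ in range(1000)]
print(f'Skewness: {calculate_skewness(data):.3f}')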
Using Libraries (numpy, scipy)
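import numpy as np
from scipy import stats

# Draw 1000 values from a normal distribution (mean 100, standard deviation 15)
data = np.random.normal(100, 15, 1000)

# stats.skew uses the biased (population) estimator by default
print(f'Skewness: {stats.skew(data):.3f}')
# stats.kurtosis returns excess (Fisher) kurtosis by default: 0 for a normal distribution
print(f'Excess kurtosis: {stats.kurtosis(data):.3f}')

# Standardize to mean 0, standard deviation 1
z_scores = stats.zscore(data)
print(f'Z-scores range: {z_scores.min():.2f} to {z_scores.max():.2f}')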
When to Use
✅ Appropriate Use Cases:
- Normal distribution: Parametric tests, control charts, process capability analysis.
- Log-normal: Income, stock prices, particle sizes, city populations.
- Power law: Node degrees in large networks, energy released by earthquakes, word frequencies.
❌ Avoid When:
- Do not assume normality without checking it, for example with a histogram, a Q-Q plot, or a formal test such as Shapiro-Wilk. Many real-world datasets are not normal.
- Do not apply normal-based methods to heavily skewed data without transformation.
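Both points above can be sketched in a few lines: test for normality first, then see whether a transformation helps. The example assumes synthetic log-normal data with made-up parameters and uses scipy.stats.shapiro as the normality check; other tests or a Q-Q plot would serve the same purpose.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic right-skewed data (log-normal); parameters are illustrative only.
skewed = rng.lognormal(mean=3.0, sigma=0.6, size=500)
# The log of log-normal data is normal by construction.
logged = np.log(skewed)

for label, sample in [('raw', skewed), ('log-transformed', logged)]:
    statistic, p_value = stats.shapiro(sample)
    print(f'{label:>16}: skew={stats.skew(sample):+.2f}, '
          f'Shapiro-Wilk W={statistic:.3f}, p={p_value:.4f}')
```

A small p-value for the raw data signals a departure from normality, while the log-transformed data should show skewness close to zero. A log transform only works for strictly positive data, and real datasets will rarely be as well behaved as this constructed one.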
Common Pitfalls
- Assuming normality: Many analysts default to normal distribution assumptions without verification.
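- Ignoring sample size: Small samples may appear normal by chance, and normality tests have little power to detect departures when data are scarce. With very large samples the opposite happens: tests flag even trivial deviations as statistically significant.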
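The sample-size pitfall can be illustrated by repeatedly testing data that is known to be non-normal. This sketch assumes exponential data and the Shapiro-Wilk test at a 0.05 threshold; the helper name rejection_rate, the trial count, and the sample sizes are arbitrary choices made for the illustration.

```python
import numpy as np
from scipy import stats

def rejection_rate(sample_size, n_trials=200, alpha=0.05, seed=1):
    """Fraction of exponential samples that Shapiro-Wilk flags as non-normal."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_trials):
        # Exponential data is strongly right-skewed, so it is never truly normal.
        sample = rng.exponential(scale=1.0, size=sample_size)
        if stats.shapiro(sample).pvalue < alpha:
            rejections += 1
    return rejections / n_trials

for n in (10, 30, 200):
    print(f'n={n:3d}: non-normality detected in {rejection_rate(n):.0%} of samples')
```

The detection rate rises with sample size, so a small sample that passes a normality test is weak evidence that the underlying data is normal.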