Distribution Analysis: Understanding Data Shape and Patterns

Beginner · EDA

Prerequisites:

Descriptive Statistics: Measures of Central Tendency and Dispersion; basic Python

Definition

Distribution analysis examines how data values are spread across the range of possible values. A distribution describes the frequency or probability of different outcomes occurring in a dataset. Understanding distributions is fundamental to statistics because it determines which analytical methods are appropriate and how to interpret results. The most famous distribution is the normal (Gaussian) distribution, characterized by its bell-shaped curve, but real-world data often follows other patterns like skewed distributions, log-normal distributions, or power law distributions.
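To make "frequency of different outcomes" concrete, here is a minimal sketch that bins a synthetic sample and prints the share of values in each bin. The sample, the seed, and the choice of 8 bins are illustrative assumptions, not part of the original material.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed is arbitrary, chosen for reproducibility
values = rng.normal(loc=100, scale=15, size=1000)  # hypothetical sample

# Count how many values fall into each of 8 equal-width bins.
counts, edges = np.histogram(values, bins=8)

# Relative frequencies: the empirical probability of landing in each bin.
rel_freq = counts / counts.sum()

for left, right, share in zip(edges[:-1], edges[1:], rel_freq):
    print(f"{left:7.1f} to {right:7.1f}: {share:6.1%} {'#' * int(share * 100)}")
```

The printed bars form a rough text histogram: the empirical distribution of the sample.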

Intuition

Think of a distribution as a snapshot of where data likes to 'hang out.' Imagine dropping thousands of balls onto a pegboard - the normal distribution is what you see at the bottom: most balls cluster in the center, with fewer at the extremes. Now imagine income distribution: most people earn modest amounts, but a long tail stretches to the right with a few high earners - this is right-skewed.
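The pegboard picture can be simulated directly. The sketch below assumes a board with 12 rows of pegs and 5000 balls (both numbers are made up) and treats every bounce as a fair coin flip. Strictly speaking the slot counts follow a binomial distribution, which approaches the bell shape as the number of rows grows.

```python
import numpy as np

rng = np.random.default_rng(1)
n_balls, n_pegs = 5000, 12  # hypothetical board size

# Each ball bounces right (1) or left (0) at every peg with equal probability.
bounces = rng.integers(0, 2, size=(n_balls, n_pegs))
slots = bounces.sum(axis=1)  # final slot = number of rightward bounces

# Tally the balls per slot; slots run from 0 to n_pegs.
tally = np.bincount(slots, minlength=n_pegs + 1)
for slot, count in enumerate(tally):
    print(f"slot {slot:2d}: {'#' * (count // 25)}")
```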

Mathematical Formula

Normal Distribution PDF:

\quad f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}

Skewness:

\quad \text{Skewness} = \frac{1}{n} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s}\right)^3

Z-score:

\quad z = \frac{x - \mu}{\sigma}

Step-by-Step Explanation:

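Normal PDF: The bell curve formula, where mu is the mean, which sets the center of the curve, and sigma is the standard deviation, which sets its width.
Skewness: Measures asymmetry as the average cubed standardized deviation, where x-bar is the sample mean and s is the sample standard deviation. Positive values indicate right skew, negative values indicate left skew, and values near 0 indicate a roughly symmetric distribution.
Z-score: Expresses a value as its distance from the mean in units of standard deviation. Applied to every value, it rescales the data to mean 0 and standard deviation 1 without changing the shape of the distribution.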
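The PDF and z-score formulas translate almost symbol for symbol into code. The following sketch uses only the standard library; the function names `normal_pdf` and `z_score` and the parameters 100 and 15 are illustrative choices, and `statistics.NormalDist` serves as a cross-check.

```python
import math
from statistics import NormalDist

def normal_pdf(x, mu, sigma):
    """Density of the normal distribution at x."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

def z_score(x, mu, sigma):
    """Number of standard deviations between x and the mean."""
    return (x - mu) / sigma

mu, sigma = 100.0, 15.0  # illustrative parameters
print(f"pdf at the mean: {normal_pdf(mu, mu, sigma):.5f}")
print(f"z-score of 130:  {z_score(130.0, mu, sigma):.2f}")

# Cross-check the hand-written density against the standard library.
print(math.isclose(normal_pdf(130.0, mu, sigma), NormalDist(mu, sigma).pdf(130.0)))
```

A value of 130 lies two standard deviations above a mean of 100, so its z-score is 2.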

Real-World Use Cases

Finance

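Stock returns are often modeled as approximately normal, although observed returns tend to have heavier tails than the normal distribution predicts. If log returns are normally distributed, asset prices follow a log-normal distribution.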
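The link between normal returns and log-normal prices can be illustrated with a small simulation. This is a sketch under stated assumptions: daily log returns are drawn from a normal distribution with made-up parameters, which is a modeling convenience rather than a claim about real markets.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# Assumed daily log returns: mean 0.05%, standard deviation 2% (made-up numbers).
log_returns = rng.normal(0.0005, 0.02, size=(10_000, 250))

# Price after 250 days for each simulated path, starting from 100.
final_prices = 100 * np.exp(log_returns.sum(axis=1))

print(f"Skewness of final prices:      {stats.skew(final_prices):.3f}")
print(f"Skewness of log(final prices): {stats.skew(np.log(final_prices)):.3f}")
```

The prices come out clearly right-skewed, while their logarithms are close to symmetric, which is the signature of a log-normal distribution.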

Income Economics

Income distributions are typically right-skewed with long tails. A few very high earners pull the mean above the median, so the mean overestimates typical earnings and the median is usually the better summary.
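A quick way to see the gap between mean and typical earnings is to compare mean and median on right-skewed data. The incomes below are synthetic, drawn from a log-normal distribution with arbitrary parameters, so the numbers are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(3)
# Synthetic incomes drawn from a log-normal distribution (parameters are made up).
incomes = rng.lognormal(mean=10.5, sigma=0.8, size=10_000)

mean_income = incomes.mean()
median_income = np.median(incomes)
share_below_mean = (incomes < mean_income).mean()

print(f"Mean:   {mean_income:,.0f}")
print(f"Median: {median_income:,.0f}")
print(f"Share of people earning less than the mean: {share_below_mean:.1%}")
```

In a symmetric distribution about half the values fall below the mean; here the share is noticeably larger.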

Quality Control

Control charts and process capability indices typically assume that measurements from a stable process are approximately normally distributed.
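As a rough illustration of how the normality assumption is used in quality control, the sketch below computes 3-sigma limits and a capability index (Cpk) for simulated measurements. The part, the specification limits, and the process parameters are all hypothetical, and estimating sigma from the overall sample standard deviation is a simplification of what production control charts usually do.

```python
import numpy as np

rng = np.random.default_rng(4)
# Hypothetical shaft diameters in mm from an in-control process.
diameters = rng.normal(loc=25.00, scale=0.02, size=500)
lsl, usl = 24.92, 25.08  # assumed lower/upper specification limits

mean, std = diameters.mean(), diameters.std(ddof=1)
lcl, ucl = mean - 3 * std, mean + 3 * std  # 3-sigma control limits

# Cpk: distance from the mean to the nearest spec limit, in units of 3 sigma.
cpk = min(usl - mean, mean - lsl) / (3 * std)

outside = int(np.sum((diameters < lcl) | (diameters > ucl)))
print(f"Control limits: {lcl:.3f} to {ucl:.3f} mm")
print(f"Cpk: {cpk:.2f}")
print(f"Points outside control limits: {outside} of {diameters.size}")
```

Under normality only about 0.27% of points should fall outside the 3-sigma limits, which is why a cluster of such points is read as a signal that the process has shifted.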

Implementation

Manual Implementation (No Libraries)

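The skewness calculation uses the standardized third moment: the average of the cubed deviations from the mean, each divided by the sample standard deviation. A positive result indicates a longer right tail, and a negative result indicates a longer left tail.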

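import math
import random

def calculate_skewness(data):
    """Standardized third moment, using the sample standard deviation (n - 1)."""
    n = len(data)
    if n < 2:
        raise ValueError('need at least two values')
    mean = sum(data) / n
    # Sample standard deviation (divides by n - 1), matching s in the formula.
    std = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
    if std == 0:
        return 0.0  # all values identical: no asymmetry to measure
    return sum(((x - mean) / std) ** 3 for x in data) / n

random.seed(42)  # fixed seed so the output is reproducible
data = [random.gauss(100, 15) for _ in range(1000)]
print(f'Skewness: {calculate_skewness(data):.3f}')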

Using Libraries (numpy, scipy)

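import numpy as np
from scipy import stats

rng = np.random.default_rng(42)  # seeded generator for reproducible output
data = rng.normal(100, 15, 1000)

# stats.skew defaults to the biased estimator (population standard deviation).
print(f'Skewness: {stats.skew(data):.3f}')
# stats.kurtosis returns excess kurtosis by default, so a normal sample is near 0.
print(f'Kurtosis: {stats.kurtosis(data):.3f}')
# stats.zscore standardizes with the population standard deviation (ddof=0).
z_scores = stats.zscore(data)
print(f'Z-scores range: {z_scores.min():.2f} to {z_scores.max():.2f}')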
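The manual formula and `stats.skew` will not print exactly the same number, because they standardize with different standard deviations. The sketch below, run on a synthetic exponential sample, shows the three common variants side by side; the sample and seed are arbitrary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
data = rng.exponential(scale=2.0, size=200)  # right-skewed sample
n = data.size

# Manual formula from this section: standardized by the sample std (ddof=1).
manual = np.mean(((data - data.mean()) / data.std(ddof=1)) ** 3)

# scipy's default (bias=True) standardizes by the population std (ddof=0).
g1 = stats.skew(data)

# bias=False applies a small-sample correction.
G1 = stats.skew(data, bias=False)

print(f"manual (s with n-1): {manual:.4f}")
print(f"scipy default:       {g1:.4f}")
print(f"scipy bias=False:    {G1:.4f}")

# The first two differ only by the factor ((n - 1) / n) ** 1.5.
print(np.isclose(manual, g1 * ((n - 1) / n) ** 1.5))
```

For large samples the three values are nearly identical; the choice matters mainly for small n, so it is worth stating which variant a reported skewness uses.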

When to Use

✅ Appropriate Use Cases:

Normal distribution: Parametric tests, control charts, process capability analysis.
Log-normal: Income, stock prices, particle sizes, city populations.
Power law: Node degrees in networks, earthquake energy release, word frequencies.

❌ Avoid When:

Do not assume normality without testing - many real-world datasets are not normal.
Do not apply normal-based methods to heavily skewed data without transformation.
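One way to act on both points is to test for normality and, if the data is right-skewed, test again after a log transform. This is a minimal sketch on synthetic log-normal data using the Shapiro-Wilk test from scipy; in practice a histogram or Q-Q plot should accompany any such test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=300)  # synthetic right-skewed data

# Shapiro-Wilk: the null hypothesis is that the sample comes from a normal distribution.
stat_raw, p_raw = stats.shapiro(skewed)
stat_log, p_log = stats.shapiro(np.log(skewed))

print(f"raw data:        W={stat_raw:.3f}, p={p_raw:.4f}")
print(f"log-transformed: W={stat_log:.3f}, p={p_log:.4f}")
```

A small p-value is evidence against normality. The raw data should be rejected decisively, while the log-transformed data should have a W statistic much closer to 1.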

Common Pitfalls

Assuming normality: Many analysts default to normal distribution assumptions without verification.
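Ignoring sample size: Small samples may appear normal by chance, and normality tests have little power to detect departures in them. With very large samples, the same tests flag deviations too small to matter in practice.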
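The sample-size pitfall can be demonstrated by repeatedly testing samples that are known to be non-normal. The sketch below draws exponential samples of different sizes and counts how often Shapiro-Wilk rejects normality at the 5% level; the helper name `rejection_rate`, the sizes, and the trial count are illustrative choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

def rejection_rate(sample_size, n_trials=500, alpha=0.05):
    """Share of exponential samples that Shapiro-Wilk flags as non-normal."""
    rejections = 0
    for _ in range(n_trials):
        sample = rng.exponential(scale=1.0, size=sample_size)
        if stats.shapiro(sample).pvalue < alpha:
            rejections += 1
    return rejections / n_trials

rates = {n: rejection_rate(n) for n in (10, 30, 100)}
for n, rate in rates.items():
    print(f"n={n:3d}: non-normality detected in {rate:.0%} of samples")
```

Even though every sample comes from a strongly skewed distribution, a sizeable share of the smallest samples pass the test, so a non-significant result on a small sample is weak evidence of normality.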
