Descriptive Statistics: Measures of Central Tendency and Dispersion

Definition

Descriptive statistics are methods for organizing, summarizing, and presenting data in a convenient and informative way. They provide simple summaries of a sample and its measurements, and they form the foundation of quantitative data analysis. Unlike inferential statistics, which draw conclusions about a population from a sample, descriptive statistics only describe the data at hand. The two main types are measures of central tendency (where the center of the data lies) and measures of dispersion (how spread out the data is). Central tendency includes the mean, median, and mode; dispersion includes the range, variance, and standard deviation. These measures matter because they provide the vocabulary for describing any dataset, whether it is customer purchase amounts, test scores, temperature readings, or stock prices.

Intuition

Imagine you are describing a group of people to someone who cannot see them. You might say 'they are mostly in their 30s' (central tendency) and 'ages range from 25 to 45' (dispersion). Descriptive statistics give us precise ways to make these descriptions. The mean is like the balance point of a seesaw - if you placed all data points on a number line with equal weight, the mean is where the fulcrum would balance. The median is the middle person when everyone lines up in order - half are older, half are younger. The standard deviation tells us how 'clumped' or 'spread out' the group is - a small standard deviation means everyone is similar, while a large one means wide diversity.
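A minimal sketch of the seesaw picture, using only Python's standard library. The ages are made-up illustration values, not data from this article. It checks that deviations from the mean cancel out (the balance point) and that the median leaves as many people below it as above it.

```python
import statistics

# Hypothetical ages, chosen only for illustration
ages = [25, 30, 32, 34, 36, 38, 43]

mean_age = statistics.mean(ages)
median_age = statistics.median(ages)

# Balance point: deviations below and above the mean cancel out
deviations = [age - mean_age for age in ages]
balance = sum(deviations)

# Middle person: as many people are younger as are older
younger = sum(1 for age in ages if age < median_age)
older = sum(1 for age in ages if age > median_age)

print(f"mean={mean_age}, median={median_age}")
print(f"sum of deviations={balance}, younger={younger}, older={older}")
```

With floating-point data the sum of deviations may come out as a tiny nonzero number rather than exactly zero.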

Mathematical Formula

Population Mean:
\[ \mu = \frac{1}{N} \sum_{i=1}^{N} x_i \]
Sample Mean:
\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \]
Population Variance:
\[ \sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 \]
Sample Variance (unbiased):
\[ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \]
Standard Deviation:
\[ s = \sqrt{s^2} \]
Median:
For ordered data
\[ x_{(1)} \leq x_{(2)} \leq \dots \leq x_{(n)} \]
If n is odd:
\[ \text{Median} = x_{((n+1)/2)} \]
If n is even:
\[ \text{Median} = \frac{x_{(n/2)} + x_{(n/2+1)}}{2} \]
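
Worked Example:
To see the formulas above in action, take the eight illustrative values 2, 4, 4, 4, 5, 5, 7, 9 (picked for round numbers, not taken from any dataset in this article).
\[ \bar{x} = \frac{2+4+4+4+5+5+7+9}{8} = \frac{40}{8} = 5 \]
\[ \sum_{i=1}^{8} (x_i - \bar{x})^2 = 9+1+1+1+0+0+4+16 = 32 \]
\[ s^2 = \frac{32}{8-1} \approx 4.57, \qquad s = \sqrt{s^2} \approx 2.14 \]
If the same eight values are treated as the whole population instead:
\[ \sigma^2 = \frac{32}{8} = 4, \qquad \sigma = 2 \]
Since n = 8 is even, the median averages the 4th and 5th ordered values:
\[ \text{Median} = \frac{x_{(4)} + x_{(5)}}{2} = \frac{4 + 5}{2} = 4.5 \]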

Step-by-Step Explanation:

  1. Mean: Sum all values and divide by the count.
  2. Variance: Calculate the squared distance of each point from the mean, then average these squared distances. Squaring ensures all contributions are positive. For a sample, divide by n-1 instead of n (Bessel's correction) to get an unbiased estimator of the population variance.
  3. Standard Deviation: Take the square root of variance to return to the original units of measurement.
  4. Median: Sort the data and find the middle value. Unlike the mean, it is barely affected by extreme outliers.
  5. Mode: The most frequently occurring value. A dataset can have multiple modes or no mode at all.
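
The mode is the only measure in the list above that can return several values or none. One possible sketch is shown here; the function name find_modes is hypothetical, and treating "every value appears once" as "no mode" is a convention assumed for this example, not a universal rule.

```python
from collections import Counter

def find_modes(data):
    """Return all most-frequent values, or [] when no value repeats."""
    if not data:
        return []
    counts = Counter(data)
    top = max(counts.values())
    # Assumed convention: if nothing repeats, report no mode at all
    if top == 1:
        return []
    # Several values can tie for the highest count (multimodal data)
    return sorted(value for value, count in counts.items() if count == top)

print(find_modes([1, 2, 2, 3, 3]))  # two modes
print(find_modes([1, 2, 3]))        # no mode
```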

Real-World Use Cases

Finance

Portfolio managers track the mean return and standard deviation of investments to assess risk-return profiles. A stock with high mean return but also high standard deviation is considered volatile.
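
A rough sketch of that comparison, using Python's standard library. The two return series are invented for illustration and do not describe real assets; the names "steady" and "volatile" are placeholders.

```python
import statistics

# Hypothetical monthly returns (as fractions), for illustration only
returns = {
    "steady": [0.010, 0.012, 0.008, 0.011, 0.009],
    "volatile": [0.080, -0.050, 0.100, -0.040, 0.060],
}

summary = {}
for name, series in returns.items():
    summary[name] = {
        "mean": statistics.mean(series),
        "stdev": statistics.stdev(series),  # sample standard deviation (n - 1)
    }

for name, values in summary.items():
    print(f"{name}: mean={values['mean']:.3f}, stdev={values['stdev']:.3f}")
```

In this made-up data the "volatile" series has the higher mean return and also a much larger standard deviation, which is the trade-off described above.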

Healthcare

Epidemiologists often report median survival time for cancer patients because survival data are typically right-skewed: a few long-term survivors pull the mean upward, so mean survival overstates the typical case.
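
The effect can be illustrated with a small sketch. The survival times below are hypothetical, and the example assumes every time is fully observed; real survival analyses also have to handle censored observations, which this sketch ignores.

```python
import statistics

# Hypothetical survival times in months; the last value is a long-term survivor
survival_months = [6, 8, 9, 10, 11, 12, 14, 120]

mean_survival = statistics.mean(survival_months)
median_survival = statistics.median(survival_months)

print(f"mean={mean_survival:.2f} months, median={median_survival:.1f} months")
```

A single long-term survivor moves the mean far above every other patient's survival time, while the median stays close to the bulk of the data.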

E-commerce

Analysts track average order value (AOV) and its standard deviation to understand customer spending patterns and set appropriate free shipping thresholds.

Quality Control

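Manufacturers monitor the mean and standard deviation of product dimensions to ensure consistency. Six Sigma methodology aims to reduce process variation until the nearest specification limit sits at least 6 standard deviations from the process mean, so that defects become extremely rare.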
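
One way to sketch this check is to measure how many standard deviations separate the process mean from the nearest specification limit. The diameters and the limits lower_spec and upper_spec are hypothetical values invented for this example, and with only eight measurements the estimate is very rough.

```python
import statistics

# Hypothetical shaft diameters in mm and hypothetical specification limits
diameters = [10.01, 9.99, 10.02, 9.98, 10.00, 10.01, 9.99, 10.00]
lower_spec, upper_spec = 9.90, 10.10

mean_d = statistics.mean(diameters)
std_d = statistics.stdev(diameters)  # sample standard deviation (n - 1)

# Distance from the mean to the nearest limit, in standard deviations
sigma_level = min(upper_spec - mean_d, mean_d - lower_spec) / std_d

print(f"mean={mean_d:.3f} mm, std={std_d:.4f} mm, sigma level={sigma_level:.1f}")
```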

Education

Test score analyses often report both the mean and the median; a large gap between the two signals that a few very high or low scores are distorting the picture of the typical student's performance.

Implementation

Manual Implementation (No Libraries)

The manual implementation shows the mathematical foundations. We compute mean by summing and dividing. For variance, we use the sum of squared deviations from the mean.
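import math

def calculate_mean(data):
    """Arithmetic mean; returns None for empty input."""
    if not data:
        return None
    return sum(data) / len(data)

def calculate_median(data):
    """Middle value of the sorted data; returns None for empty input."""
    if not data:
        return None
    sorted_data = sorted(data)
    n = len(sorted_data)
    mid = n // 2
    if n % 2 == 0:
        # Even count: average the two middle values
        return (sorted_data[mid - 1] + sorted_data[mid]) / 2
    # Odd count: the single middle value
    return sorted_data[mid]

def calculate_variance(data, sample=True):
    """Sample variance (n - 1) by default, population variance (n) if sample=False."""
    # The sample formula needs at least two points to avoid dividing by zero
    min_points = 2 if sample else 1
    if len(data) < min_points:
        return None
    mean = calculate_mean(data)
    squared_diff_sum = sum((x - mean) ** 2 for x in data)
    # Bessel's correction: divide by n - 1 for a sample
    divisor = len(data) - 1 if sample else len(data)
    return squared_diff_sum / divisor

heights = [165, 170, 168, 172, 175, 180, 169, 171, 168, 170]
print(f'Mean: {calculate_mean(heights):.2f}')
print(f'Median: {calculate_median(heights):.2f}')
print(f'Std Dev: {math.sqrt(calculate_variance(heights, sample=True)):.2f}')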
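
As a sanity check on hand-written functions like these, the same quantities can be computed with Python's built-in statistics module. This is a cross-check sketch on the same heights list, not part of the manual implementation; statistics.multimode requires Python 3.8 or later.

```python
import statistics

heights = [165, 170, 168, 172, 175, 180, 169, 171, 168, 170]

mean_h = statistics.mean(heights)
median_h = statistics.median(heights)
modes_h = statistics.multimode(heights)      # all most-common values
sample_std = statistics.stdev(heights)       # divides by n - 1
population_std = statistics.pstdev(heights)  # divides by n

print(f"Mean: {mean_h:.2f}, Median: {median_h:.2f}, Modes: {sorted(modes_h)}")
print(f"Sample std: {sample_std:.2f}, Population std: {population_std:.2f}")
```

The sample standard deviation comes out slightly larger than the population one because it divides by n - 1 rather than n.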

Using Libraries (numpy, pandas, scipy)

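import numpy as np
import pandas as pd
from scipy import stats

data = [165, 170, 168, 172, 175, 180, 169, 171, 168, 170]
print(f'Mean: {np.mean(data):.2f}')
print(f'Median: {np.median(data):.2f}')
# ddof=1 gives the sample standard deviation; np.std defaults to ddof=0 (population)
print(f'Std Dev: {np.std(data, ddof=1):.2f}')
# Interquartile range, a spread measure that is robust to outliers
print(f'IQR: {stats.iqr(data):.2f}')
df = pd.DataFrame({'value': data})
# describe() reports count, mean, std (sample, ddof=1), min, quartiles, and max
print(df.describe())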

When to Use

✅ Appropriate Use Cases:

  • Mean: Use when data is symmetric and has no extreme outliers. Best for interval/ratio data.
  • Median: Use when data is skewed or has outliers. Robust against extreme values.
  • Standard Deviation: Use alongside mean to describe spread of symmetric distributions.

❌ Avoid When:

  • Mean: Do not use with heavily skewed data or data with extreme outliers (use median instead).
  • Standard Deviation: Do not use as sole measure of spread for skewed distributions (use IQR).
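
To illustrate why the IQR is preferred for skewed data, here is a small sketch using the standard library (statistics.quantiles needs Python 3.8 or later). The income figures are hypothetical, and method="inclusive" is one of several quartile conventions; other conventions give slightly different quartiles.

```python
import statistics

# Hypothetical incomes in thousands; one extreme value skews the data
incomes = [30, 32, 35, 36, 38, 40, 42, 45, 250]

q1, q2, q3 = statistics.quantiles(incomes, n=4, method="inclusive")
iqr = q3 - q1                            # spread of the middle 50% of values
std_income = statistics.stdev(incomes)   # inflated by the single outlier

print(f"Q1={q1}, median={q2}, Q3={q3}, IQR={iqr}")
print(f"Std Dev={std_income:.2f}")
```

The IQR describes the spread of the typical incomes, while the standard deviation is dominated by the one extreme value.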

Common Pitfalls

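  • Confusing population vs sample statistics: Using n instead of n-1 for sample variance gives a biased estimate that systematically underestimates the population variance. Library defaults differ: np.std uses ddof=0 (divide by n), while pandas .std() uses ddof=1 (divide by n-1).
  • Ignoring outliers: Always check for outliers before choosing between mean and median.
  • Comparing apples to oranges: Standard deviations of variables measured in different units or on different scales cannot be compared directly. Compare relative spread instead, for example with the coefficient of variation (standard deviation divided by the mean).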
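
A short sketch of the last pitfall: the coefficient of variation (standard deviation divided by the mean) puts spread on a unitless scale so that different variables can be compared. The heights are the values used in the Implementation section; the weights are hypothetical, and the helper name coefficient_of_variation is chosen for this example. The measure only makes sense for ratio-scale data with a positive mean.

```python
import statistics

def coefficient_of_variation(data):
    """Sample standard deviation relative to the mean (unitless)."""
    return statistics.stdev(data) / statistics.mean(data)

heights_cm = [165, 170, 168, 172, 175, 180, 169, 171, 168, 170]
weights_kg = [55, 62, 70, 68, 80, 95, 60, 72, 58, 66]  # hypothetical values

cv_height = coefficient_of_variation(heights_cm)
cv_weight = coefficient_of_variation(weights_kg)

print(f"CV height: {cv_height:.3f}")
print(f"CV weight: {cv_weight:.3f}")
```

Comparing the raw standard deviations (in cm and kg) would be meaningless, but the two ratios can be compared directly: in this made-up data, weight varies more relative to its mean than height does.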