Statistics for ML #3 — Measures of Dispersion

Two datasets can have identical means but completely different spreads. Dispersion measures capture how spread out values are around the center.

Range

\(\text{Range} = x_{\max} - x_{\min}\)

Simple but highly sensitive to outliers, since it uses only two data points.
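A quick numeric check of that sensitivity (a sketch with made-up sample values):

```python
import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])  # hypothetical sample

# Range depends only on the two extreme values
data_range = data.max() - data.min()  # np.ptp(data) is equivalent
print(data_range)  # 7

# A single outlier changes it drastically
with_outlier = np.append(data, 90)
print(with_outlier.max() - with_outlier.min())  # 88
```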

Variance

\(\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 \quad \text{(population)}\)

\(s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 \quad \text{(sample, Bessel's correction)}\)

Why n−1? The deviations are measured from the sample mean, which was itself estimated from the same data, so one degree of freedom is lost. Dividing by n−1 instead of n makes s² an unbiased estimator of σ².
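In NumPy the two formulas differ only in the `ddof` argument (a sketch with made-up values):

```python
import numpy as np

x = np.array([2, 4, 4, 4, 5, 5, 7, 9])  # hypothetical sample, mean = 5

pop_var = np.var(x)           # ddof=0: divides by N (population formula)
samp_var = np.var(x, ddof=1)  # ddof=1: divides by n-1 (Bessel's correction)

print(pop_var)   # 4.0
print(samp_var)  # 32/7 ≈ 4.571
```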

Standard Deviation

\(\sigma = \sqrt{\sigma^2}\)

Same units as the original data — interpretable. The most widely used dispersion measure.

Empirical Rule (Normal distribution):

  • ~68% of data within μ ± 1σ
  • ~95% within μ ± 2σ
  • ~99.7% within μ ± 3σ
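The Empirical Rule is easy to verify empirically on simulated normal data (a sketch; the seed and sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(loc=0.0, scale=1.0, size=100_000)

# Fraction of samples within k standard deviations of the mean
for k in (1, 2, 3):
    frac = np.mean(np.abs(x - x.mean()) < k * x.std())
    print(f"within {k} sd: {frac:.3f}")  # ≈ 0.68, 0.95, 0.997
```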

Interquartile Range (IQR)

\(\text{IQR} = Q_3 - Q_1\)

Robust to outliers. Used for outlier detection: values beyond Q1 − 1.5×IQR or Q3 + 1.5×IQR are flagged as outliers.
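The 1.5×IQR fence rule can be sketched with NumPy's percentile function (the data here is a made-up sample with one planted outlier):

```python
import numpy as np

x = np.array([3, 4, 4, 5, 5, 5, 6, 6, 7, 30])  # 30 is a planted outlier

q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = x[(x < lower) | (x > upper)]
print(outliers)  # [30]
```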

Coefficient of Variation (CV)

\(CV = \frac{\sigma}{\mu} \times 100\%\)

Dimensionless — allows comparison across datasets with different units or scales.

Example: Comparing variability of ANC visits (mean=4, SD=1.5) vs. birth weight (mean=3000g, SD=500g):

  • CV(ANC) = 37.5%
  • CV(birth weight) = 16.7% → ANC visits are relatively more variable.
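The arithmetic from the example above, written out:

```python
anc_mean, anc_sd = 4, 1.5     # ANC visits
bw_mean, bw_sd = 3000, 500    # birth weight in grams

cv_anc = anc_sd / anc_mean * 100  # 37.5%
cv_bw = bw_sd / bw_mean * 100     # ≈ 16.7%

# CVs are comparable despite the different units (visits vs. grams)
print(cv_anc > cv_bw)  # True: ANC visits are relatively more variable
```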

ML Relevance

Concept            | ML Application
Variance           | Bias-variance tradeoff; PCA (maximises variance)
Standard deviation | Feature standardisation (Z-score normalisation)
IQR                | Robust scaling: RobustScaler in scikit-learn
CV                 | Comparing feature variability before selection

import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler

X = np.array([[1.0], [2.0], [3.0], [100.0]])  # toy feature with an outlier

# Z-score (mean=0, std=1) — sensitive to outliers
scaler = StandardScaler()

# IQR-based — robust to outliers, better for health data
robust = RobustScaler()  # subtracts median, divides by IQR

X_scaled = robust.fit_transform(X)

Previous: #2 Central Tendency | Next: #4 Skewness & Kurtosis