Statistics for ML #5 — Covariance & Correlation
Covariance and correlation measure the linear relationship between two variables — the foundation of regression, PCA, and feature selection.
Covariance
\[\text{Cov}(X,Y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})\]

- Positive: Both move in the same direction
- Negative: They move in opposite directions
- Zero: No linear relationship (but could be non-linear!)
- Problem: Scale-dependent — hard to compare across variables
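To see the scale dependence concretely, here's a minimal NumPy sketch (the data and variable names are illustrative, not from any real dataset): rescaling one variable rescales the covariance by the same factor, so the raw number tells you little about strength.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2 * x + rng.normal(size=100)

# Sample covariance (ddof=1 matches the n-1 formula above)
cov_xy = np.cov(x, y, ddof=1)[0, 1]

# Measuring x in different units (e.g. metres -> centimetres)
# multiplies the covariance by 100 -- same relationship, different number
cov_scaled = np.cov(x * 100, y, ddof=1)[0, 1]
```

This is exactly the problem correlation solves: dividing by the standard deviations removes the units.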
Pearson Correlation Coefficient
\[r = \frac{\text{Cov}(X,Y)}{\sigma_X \cdot \sigma_Y} \quad \in [-1, 1]\]

Interpretation:
| \|r\| | Strength |
|---|---|
| 0.0–0.1 | Negligible |
| 0.1–0.3 | Weak |
| 0.3–0.5 | Moderate |
| 0.5–0.7 | Strong |
| 0.7–1.0 | Very strong |
Assumptions: Linear relationship, no extreme outliers, approximately normal distributions.
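A quick sanity check that the formula above matches library output (a sketch with synthetic data; note that `scipy.stats.pearsonr` also returns a p-value for the null hypothesis of zero correlation):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(scale=0.5, size=200)

# Library version: returns (r, p-value)
r, p = stats.pearsonr(x, y)

# Manual version: Cov(X, Y) / (sigma_X * sigma_Y), as in the formula above
manual_r = np.cov(x, y, ddof=1)[0, 1] / (np.std(x, ddof=1) * np.std(y, ddof=1))
```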
Spearman Rank Correlation
\[r_s = 1 - \frac{6\sum d_i^2}{n(n^2-1)}\]

Use when: the data are ordinal, the relationship is monotonic but non-linear, or outliers are present.
In DHS data: correlation between wealth quintile (ordinal) and ANC visits → use Spearman, not Pearson.
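To see why Spearman suits monotonic but non-linear relationships, a small sketch with synthetic data (the exponential curve is just an illustration): Spearman works on ranks, so any strictly increasing relationship scores a perfect 1, while Pearson is dragged down by the curvature.

```python
import numpy as np
from scipy import stats

x = np.arange(1, 21)
y = np.exp(x / 4.0)  # strictly increasing, but strongly non-linear

r_pearson, _ = stats.pearsonr(x, y)    # < 1: penalises the curvature
r_spearman, _ = stats.spearmanr(x, y)  # 1.0: ranks are perfectly aligned
```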
Kendall's Tau: preferred for small samples and data with many ties.

Point-Biserial Correlation: for a continuous variable vs. a binary variable.
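Both are available in `scipy.stats`. A sketch with made-up data (the variable names echo the DHS example above but the values are synthetic):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Kendall's tau: ordinal quintiles (1-5) produce many ties,
# which kendalltau handles via the tau-b variant
wealth_q = rng.integers(1, 6, size=30)
visits = wealth_q + rng.integers(0, 3, size=30)
tau, p_tau = stats.kendalltau(wealth_q, visits)

# Point-biserial: binary group membership vs. a continuous outcome
group = rng.integers(0, 2, size=50)
outcome = group * 1.5 + rng.normal(size=50)
r_pb, p_pb = stats.pointbiserialr(group, outcome)
```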
The Anscombe Quartet Warning
Four datasets with identical Pearson r = 0.816, mean, and variance — but completely different visual patterns. Always plot your data before computing correlation.
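You can verify this yourself with Anscombe's original values (Anscombe, 1973): all four datasets yield r ≈ 0.816 despite looking nothing alike when plotted.

```python
from scipy import stats

# Anscombe's quartet: datasets I-III share the same x values
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = [
    (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
]

# Every dataset has (nearly) the same Pearson r -- but the shapes differ wildly
rs = [stats.pearsonr(x, y)[0] for x, y in quartet]
```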
```python
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

# Correlation matrix (df is your DataFrame, numeric_cols its numeric columns)
corr_matrix = df[numeric_cols].corr(method='pearson')

# Visualise
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm',
            center=0, fmt='.2f', square=True)
plt.show()

# Spearman for ordinal variables
df[['wealth_q', 'anc_visits']].corr(method='spearman')
```
```r
# R: multiple correlation types
cor(df$anc_visits, df$wealth_index, method = "pearson")
cor(df$anc_visits, df$wealth_index, method = "spearman")

# Correlation with p-values
library(Hmisc)
rcorr(as.matrix(df[, numeric_vars]), type = "pearson")
```
Previous: #4 Skewness & Kurtosis | Next: #6 Probability Axioms
