Statistics for ML #5 — Covariance & Correlation

Covariance and correlation measure the linear relationship between two variables — the foundation of regression, PCA, and feature selection.

Covariance

\[\text{Cov}(X,Y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})\]
  • Positive: Both move in the same direction
  • Negative: They move in opposite directions
  • Zero: No linear relationship (but could be non-linear!)
  • Problem: Scale-dependent — hard to compare across variables
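The formula and the scale-dependence problem can be sketched in a few lines. This is a minimal illustration, assuming NumPy is available; the helper name `sample_covariance` and the height/weight numbers are made up for the example.

```python
import numpy as np

def sample_covariance(x, y):
    """Sample covariance with the n - 1 denominator, as in the formula above."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)

# Made-up illustrative data: heights in metres, weights in kilograms
height_m = np.array([1.50, 1.60, 1.70, 1.80, 1.90])
weight_kg = np.array([52.0, 58.0, 66.0, 74.0, 85.0])

cov_m = sample_covariance(height_m, weight_kg)
# Same people, but height expressed in centimetres
cov_cm = sample_covariance(height_m * 100, weight_kg)

# np.cov also divides by n - 1 by default, so the two values should agree
print("hand-rolled:", cov_m, "NumPy:", np.cov(height_m, weight_kg)[0, 1])
# Changing the unit of one variable rescales the covariance by the same factor
print("ratio cm / m:", cov_cm / cov_m)
```

The sign of the result is meaningful, but its size depends on the units chosen, which is why covariances are hard to compare across variable pairs.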

Pearson Correlation Coefficient

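\[r = \frac{\text{Cov}(X,Y)}{s_X \cdot s_Y} \quad \in [-1, 1]\]

Here s_X and s_Y are the sample standard deviations, computed with the same n − 1 denominator as the covariance.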
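As a rough sketch of the formula above, the snippet below divides the sample covariance by the two sample standard deviations and checks the result against `np.corrcoef`. The function name `pearson_r` and the data are illustrative assumptions, not part of any library.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson r as covariance divided by the product of standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    cov_xy = np.cov(x, y)[0, 1]  # sample covariance (n - 1 denominator)
    # ddof=1 keeps the standard deviations on the same n - 1 footing
    return cov_xy / (x.std(ddof=1) * y.std(ddof=1))

# Made-up illustrative data: heights in metres, weights in kilograms
height_m = np.array([1.50, 1.60, 1.70, 1.80, 1.90])
weight_kg = np.array([52.0, 58.0, 66.0, 74.0, 85.0])

r_m = pearson_r(height_m, weight_kg)
r_cm = pearson_r(height_m * 100, weight_kg)  # height in centimetres

print("r (metres):     ", r_m)
print("r (centimetres):", r_cm)
print("np.corrcoef:    ", np.corrcoef(height_m, weight_kg)[0, 1])
```

Unlike the covariance, r does not change when a variable is rescaled, which is what makes it comparable across variable pairs.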

Interpretation:

|r|        Strength
0.0–0.1    Negligible
0.1–0.3    Weak
0.3–0.5    Moderate
0.5–0.7    Strong
0.7–1.0    Very strong
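If these rule-of-thumb labels are needed in code, a small lookup is enough. The helper `strength_label` below is hypothetical, and because the ranges in the table share their endpoints, treating each upper bound as exclusive is an assumption.

```python
def strength_label(r):
    """Map a correlation coefficient to the rule-of-thumb labels in the table above."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation coefficient must lie in [-1, 1]")
    magnitude = abs(r)  # the label depends on |r|, not on the sign
    # Assumption: upper bounds are exclusive, so 0.3 counts as "Moderate"
    thresholds = [(0.1, "Negligible"), (0.3, "Weak"),
                  (0.5, "Moderate"), (0.7, "Strong")]
    for upper, label in thresholds:
        if magnitude < upper:
            return label
    return "Very strong"

for value in (0.05, -0.45, 0.816):
    print(value, strength_label(value))
```

These cut-offs are conventions rather than hard rules, so the labels are best read as a rough guide.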

Assumptions: a linear relationship and no extreme outliers. Approximately normal distributions matter for the usual significance test and confidence interval, not for computing r itself.

Spearman Rank Correlation

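\[r_s = 1 - \frac{6\sum d_i^2}{n(n^2-1)}\]

Here d_i is the difference between the rank of x_i and the rank of y_i. The shortcut formula is exact only when there are no ties; with ties, compute Pearson's r on the (average) ranks instead.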

Use when: ordinal data, non-linear monotonic relationship, outliers present.

In DHS data: correlation between wealth quintile (ordinal) and ANC visits → use Spearman, not Pearson.
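The rank-difference formula can be sketched as follows for data without ties. The helper names `ranks_no_ties` and `spearman_no_ties` and the data are made up for illustration; with tied values (as in wealth quintiles) a library routine such as pandas `corr(method='spearman')` is the safer choice.

```python
import numpy as np

def ranks_no_ties(values):
    """Ranks 1..n for data without ties (argsort of argsort)."""
    return np.argsort(np.argsort(values)) + 1

def spearman_no_ties(x, y):
    """Spearman r_s via the rank-difference shortcut; valid only without ties."""
    d = ranks_no_ties(x) - ranks_no_ties(y)
    n = len(d)
    return 1 - 6 * np.sum(d ** 2) / (n * (n ** 2 - 1))

# Made-up data: y grows monotonically but very non-linearly with x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.exp(x)

r_s = spearman_no_ties(x, y)
r_pearson = np.corrcoef(x, y)[0, 1]
# Spearman is Pearson's r applied to the ranks
r_on_ranks = np.corrcoef(ranks_no_ties(x), ranks_no_ties(y))[0, 1]

print("Spearman:", r_s)
print("Pearson on raw values:", r_pearson)
print("Pearson on ranks:", r_on_ranks)
```

Because the relationship is perfectly monotonic, the ranks agree exactly, while Pearson on the raw values is pulled below 1 by the curvature.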

Kendall’s Tau — for small samples with many ties
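To show how ties enter the calculation, here is a minimal pure-Python sketch of the tau-b variant, which counts concordant and discordant pairs and adjusts the denominator for ties. The function name `kendall_tau_b` and the data are illustrative assumptions; the O(n²) pair loop is only sensible for small samples.

```python
from itertools import combinations
from math import sqrt

def kendall_tau_b(x, y):
    """Kendall's tau-b: (C - D) / sqrt((C + D + Tx) * (C + D + Ty))."""
    concordant = discordant = ties_x_only = ties_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue              # tied on both variables: counted nowhere
        elif dx == 0:
            ties_x_only += 1      # tied on x only
        elif dy == 0:
            ties_y_only += 1      # tied on y only
        elif dx * dy > 0:
            concordant += 1       # pair ordered the same way on x and y
        else:
            discordant += 1       # pair ordered in opposite ways
    untied = concordant + discordant
    # Note: the denominator is zero if either variable is constant
    return (concordant - discordant) / sqrt(
        (untied + ties_x_only) * (untied + ties_y_only))

# Made-up small sample with many ties
wealth_q = [1, 1, 2, 2, 3, 3, 4, 5]
anc_visits = [1, 2, 2, 3, 3, 4, 4, 4]
tau = kendall_tau_b(wealth_q, anc_visits)
print("tau-b:", tau)
```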

Point-Biserial Correlation — continuous vs. binary variable
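The point-biserial coefficient is Pearson's r with the binary variable coded 0/1, which the sketch below checks numerically against the group-means formula. The function name `point_biserial`, the variable names, and the data are assumptions made for illustration.

```python
import numpy as np

def point_biserial(binary, values):
    """r_pb = (M1 - M0) / s_n * sqrt(n1 * n0 / n^2), with s_n the population SD."""
    binary = np.asarray(binary)
    values = np.asarray(values, dtype=float)
    group1 = values[binary == 1]
    group0 = values[binary == 0]
    n1, n0, n = len(group1), len(group0), len(values)
    s_n = values.std(ddof=0)  # population SD (divide by n) of all values
    return (group1.mean() - group0.mean()) / s_n * np.sqrt(n1 * n0 / n ** 2)

# Made-up data: binary indicator vs. a count variable
facility_birth = np.array([0, 0, 0, 1, 1, 1, 1, 0])
anc_visits = np.array([1, 2, 2, 4, 5, 3, 6, 3])

r_pb = point_biserial(facility_birth, anc_visits)
r_pearson = np.corrcoef(facility_birth, anc_visits)[0, 1]

print("point-biserial:", r_pb)
print("Pearson with 0/1 coding:", r_pearson)
```

In practice this means no special routine is required: any Pearson implementation gives the point-biserial value once the binary variable is coded as 0 and 1.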

The Anscombe Quartet Warning

Four datasets with nearly identical means, variances, and Pearson r ≈ 0.816, but completely different visual patterns. Always plot your data before computing correlation.
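The summary statistics can be checked directly. The sketch below hard-codes Anscombe's published values and computes the means, sample variances, and Pearson r for each dataset with NumPy; the dictionary layout and variable names are choices made for this example.

```python
import numpy as np

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # shared x for datasets I-III
anscombe = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

summary = {}
for name, (x, y) in anscombe.items():
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    summary[name] = {
        "mean_x": x.mean(),
        "mean_y": y.mean(),
        "var_x": x.var(ddof=1),  # sample variance
        "var_y": y.var(ddof=1),
        "r": np.corrcoef(x, y)[0, 1],
    }
    stats = summary[name]
    print(f"{name:>3}: mean_y={stats['mean_y']:.2f} "
          f"var_y={stats['var_y']:.2f} r={stats['r']:.3f}")
```

The numbers agree to about two decimal places, yet a scatter plot of each pair shows a linear cloud, a curve, a line with one outlier, and a vertical stack with one leverage point.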

# Python: correlation matrix and heatmap
# Assumes df is a pandas DataFrame and numeric_cols lists its numeric columns
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Correlation matrix
corr_matrix = df[numeric_cols].corr(method='pearson')

# Visualise
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm',
            center=0, fmt='.2f', square=True)
plt.show()

# Spearman for ordinal
df[['wealth_q', 'anc_visits']].corr(method='spearman')

# R: Multiple correlation types
# Assumes df is a data frame and numeric_vars names its numeric columns
cor(df$anc_visits, df$wealth_index, method = "pearson")
cor(df$anc_visits, df$wealth_index, method = "spearman")

# Correlation with p-values
library(Hmisc)
rcorr(as.matrix(df[, numeric_vars]), type = "pearson")
