Statistics for ML #5 — Covariance & Correlation
Covariance and correlation measure the linear relationship between two variables — the foundation of regression, PCA, and feature selection.
Covariance
\[\text{Cov}(X,Y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})\]

- Positive: Both move in the same direction
- Negative: They move in opposite directions
- Zero: No linear relationship (but could be non-linear!)
- Problem: Scale-dependent — hard to compare across variables
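To see the scale dependence concretely, here's a minimal NumPy sketch (the data and variable names are illustrative, not from any real dataset): rescaling one variable rescales the covariance by the same factor, so the raw number tells you little about strength.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2 * x + rng.normal(size=100)

# Sample covariance (ddof=1 matches the n-1 formula above)
cov_xy = np.cov(x, y, ddof=1)[0, 1]

# Measuring x in different units (e.g. metres -> centimetres)
# multiplies the covariance by 100 -- same relationship, different number
cov_scaled = np.cov(x * 100, y, ddof=1)[0, 1]
```

This is exactly the problem correlation solves: dividing by the standard deviations removes the units.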
Pearson Correlation Coefficient
\[r = \frac{\text{Cov}(X,Y)}{\sigma_X \cdot \sigma_Y} \quad \in [-1, 1]\]

Interpretation:
| \|r\| | Strength |
|---|---|
| 0.0–0.1 | Negligible |
| 0.1–0.3 | Weak |
| 0.3–0.5 | Moderate |
| 0.5–0.7 | Strong |
| 0.7–1.0 | Very strong |
Assumptions: Linear relationship, no extreme outliers, approximately normal distributions.
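A quick sanity check that the formula above matches library output (a sketch with synthetic data; note that `scipy.stats.pearsonr` also returns a p-value for the null hypothesis of zero correlation):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(scale=0.5, size=200)

# Library version: returns (r, p-value)
r, p = stats.pearsonr(x, y)

# Manual version: Cov(X, Y) / (sigma_X * sigma_Y), as in the formula above
manual_r = np.cov(x, y, ddof=1)[0, 1] / (np.std(x, ddof=1) * np.std(y, ddof=1))
```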
Spearman Rank Correlation
\[r_s = 1 - \frac{6\sum d_i^2}{n(n^2-1)}\]

Use when: the data are ordinal, the relationship is monotonic but non-linear, or outliers are present.
In DHS data: correlation between wealth quintile (ordinal) and ANC visits → use Spearman, not Pearson.
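To see why Spearman suits monotonic but non-linear relationships, a small sketch with synthetic data (the exponential curve is just an illustration): Spearman works on ranks, so any strictly increasing relationship scores a perfect 1, while Pearson is dragged down by the curvature.

```python
import numpy as np
from scipy import stats

x = np.arange(1, 21)
y = np.exp(x / 4.0)  # strictly increasing, but strongly non-linear

r_pearson, _ = stats.pearsonr(x, y)    # < 1: penalises the curvature
r_spearman, _ = stats.spearmanr(x, y)  # 1.0: ranks are perfectly aligned
```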
Kendall's Tau: preferred for small samples and data with many ties.

Point-Biserial Correlation: for a continuous variable vs. a binary variable.
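Both are available in `scipy.stats`. A sketch with made-up data (the variable names echo the DHS example above but the values are synthetic):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Kendall's tau: ordinal quintiles (1-5) produce many ties,
# which kendalltau handles via the tau-b variant
wealth_q = rng.integers(1, 6, size=30)
visits = wealth_q + rng.integers(0, 3, size=30)
tau, p_tau = stats.kendalltau(wealth_q, visits)

# Point-biserial: binary group membership vs. a continuous outcome
group = rng.integers(0, 2, size=50)
outcome = group * 1.5 + rng.normal(size=50)
r_pb, p_pb = stats.pointbiserialr(group, outcome)
```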
The Anscombe Quartet Warning
Four datasets with identical Pearson r = 0.816, mean, and variance — but completely different visual patterns. Always plot your data before computing correlation.
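You can verify this yourself with Anscombe's original values (Anscombe, 1973): all four datasets yield r ≈ 0.816 despite looking nothing alike when plotted.

```python
from scipy import stats

# Anscombe's quartet: datasets I-III share the same x values
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = [
    (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
]

# Every dataset has (nearly) the same Pearson r -- but the shapes differ wildly
rs = [stats.pearsonr(x, y)[0] for x, y in quartet]
```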
```python
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

# Correlation matrix (df is your DataFrame, numeric_cols its numeric columns)
corr_matrix = df[numeric_cols].corr(method='pearson')

# Visualise
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm',
            center=0, fmt='.2f', square=True)
plt.show()

# Spearman for ordinal variables
df[['wealth_q', 'anc_visits']].corr(method='spearman')
```
```r
# R: multiple correlation types
cor(df$anc_visits, df$wealth_index, method = "pearson")
cor(df$anc_visits, df$wealth_index, method = "spearman")

# Correlation with p-values
library(Hmisc)
rcorr(as.matrix(df[, numeric_vars]), type = "pearson")
```
Previous: #4 Skewness & Kurtosis | Next: #6 Probability Axioms
