Statistics for ML #6 — Probability Axioms & Rules

Probability is the mathematical language of uncertainty. Every ML model — from logistic regression to deep neural networks — is built on probability theory.

Kolmogorov’s Three Axioms (1933)

For any event A in sample space Ω:

  1. Non-negativity: P(A) ≥ 0
  2. Normalization: P(Ω) = 1
  3. Additivity: If A ∩ B = ∅, then P(A ∪ B) = P(A) + P(B)

Everything else in probability theory is derived from these three axioms.
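The axioms are easy to check by hand on a toy finite sample space. A minimal sketch with a fair six-sided die (the events A and B are illustrative choices):

```python
from fractions import Fraction

# Uniform probability measure on a fair six-sided die.
omega = {1, 2, 3, 4, 5, 6}

def P(event):
    """Probability of an event (a subset of omega) under the uniform measure."""
    return Fraction(len(event & omega), len(omega))

A = {1, 2}   # roll a 1 or 2
B = {5, 6}   # roll a 5 or 6 — disjoint from A

assert P(A) >= 0                    # Axiom 1: non-negativity
assert P(omega) == 1                # Axiom 2: normalization
assert P(A | B) == P(A) + P(B)      # Axiom 3: additivity for disjoint events
```

Using `Fraction` keeps the probabilities exact, so the additivity check holds with equality rather than up to floating-point error.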

Key Rules

Complement Rule: \(P(A^c) = 1 - P(A)\)

Addition Rule (General): \(P(A \cup B) = P(A) + P(B) - P(A \cap B)\)

Multiplication Rule: \(P(A \cap B) = P(A) \cdot P(B|A)\)

Independence: A and B are independent if: \(P(A \cap B) = P(A) \cdot P(B)\)
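All four rules can be verified by brute-force counting on a small discrete space. A sketch using a standard 52-card deck (ranks 1–13, suits encoded as letters — an illustrative setup):

```python
from fractions import Fraction
from itertools import product

# A 52-card deck: rank 1 is the ace; suits are S, H, D, C.
ranks = range(1, 14)
suits = "SHDC"
deck = set(product(ranks, suits))

def P(event):
    return Fraction(len(event), len(deck))

hearts = {c for c in deck if c[1] == "H"}
aces = {c for c in deck if c[0] == 1}

# Complement rule: P(A^c) = 1 - P(A)
assert P(deck - hearts) == 1 - P(hearts)

# General addition rule: P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
assert P(hearts | aces) == P(hearts) + P(aces) - P(hearts & aces)

# Multiplication rule: P(A ∩ B) = P(A) · P(B|A)
P_ace_given_heart = Fraction(len(aces & hearts), len(hearts))
assert P(aces & hearts) == P(hearts) * P_ace_given_heart

# Independence: rank and suit are independent in a full deck
assert P(aces & hearts) == P(aces) * P(hearts)
```

Note the last check: drawing the ace of hearts has probability 1/52 = (1/13) · (1/4), so "is an ace" and "is a heart" are independent events here.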

Law of Total Probability

If {B₁, B₂, …, Bₙ} is a partition of Ω: \(P(A) = \sum_{i=1}^{n} P(A|B_i) \cdot P(B_i)\)

This is the foundation of marginalisation in Bayesian ML.
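As a numerical sketch: suppose three sources B₁, B₂, B₃ partition the data (the probabilities below are made-up illustrative numbers, not from the post):

```python
# Law of total probability: P(A) = Σ P(A|B_i) · P(B_i)
P_B = [0.6, 0.3, 0.1]             # P(B_i): priors over a partition of Ω
P_A_given_B = [0.01, 0.02, 0.05]  # P(A | B_i): conditional probabilities

P_A = sum(pa * pb for pa, pb in zip(P_A_given_B, P_B))
# P(A) = 0.01·0.6 + 0.02·0.3 + 0.05·0.1 = 0.017
```

This weighted sum over the partition is exactly the marginalisation step that appears in the denominator of Bayes' theorem.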

ML Connections

  • Naive Bayes: Assumes feature independence: P(x₁,…,xₙ|y) = ∏P(xᵢ|y)
  • Logistic Regression: Models P(Y=1|X) directly
  • Random Forest: Each tree gives a probability estimate; the forest averages them
  • Calibration: Is P̂(Y=1|X=x) truly the probability of the event?
```python
# In sklearn, predict_proba() gives probability estimates
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()
model.fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]  # P(Y=1|X)
```
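A quick way to probe the calibration question is a reliability check: bin the predicted probabilities and compare each bin's mean prediction to the observed frequency of the positive class. A minimal dependency-free sketch (the `probs`/`labels` in the usage example are illustrative):

```python
def reliability(probs, labels, n_bins=5):
    """Per bin: (mean predicted probability, observed positive frequency)."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        i = min(int(p * n_bins), n_bins - 1)  # clamp p == 1.0 into the last bin
        bins[i].append((p, y))
    curve = []
    for b in bins:
        if b:  # skip empty bins
            mean_p = sum(p for p, _ in b) / len(b)
            freq = sum(y for _, y in b) / len(b)
            curve.append((mean_p, freq))
    return curve

# Usage: a well-calibrated model has mean_p ≈ freq in every bin
probs = [0.1, 0.2, 0.8, 0.9, 0.85, 0.15]
labels = [0, 0, 1, 1, 1, 0]
curve = reliability(probs, labels)
```

sklearn ships the same idea as `sklearn.calibration.calibration_curve`, which is the better choice in practice.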

Previous: #5 Covariance & Correlation | Next: #7 Conditional Probability