Statistics for ML #6 — Probability Axioms & Rules
Probability is the mathematical language of uncertainty. Every ML model — from logistic regression to deep neural networks — is built on probability theory.
Kolmogorov’s Three Axioms (1933)
For any event A in sample space Ω:
- Non-negativity: P(A) ≥ 0
- Normalization: P(Ω) = 1
- Additivity: If A ∩ B = ∅, then P(A ∪ B) = P(A) + P(B)
Everything else in probability theory is derived from these three axioms.
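The axioms can be checked directly on a finite sample space; a minimal sketch using a fair six-sided die (the event sets are chosen for illustration):

```python
# Sample space for a fair six-sided die
omega = {1, 2, 3, 4, 5, 6}
p = {outcome: 1 / 6 for outcome in omega}

def prob(event):
    """P(A) as the sum of the outcome probabilities in A."""
    return sum(p[o] for o in event)

A = {1, 2}   # event "roll 1 or 2"
B = {5, 6}   # event "roll 5 or 6" (disjoint from A)

assert prob(A) >= 0                                     # non-negativity
assert abs(prob(omega) - 1) < 1e-12                     # normalization
assert abs(prob(A | B) - (prob(A) + prob(B))) < 1e-12   # additivity
```

Any assignment of non-negative numbers to outcomes that sums to 1 satisfies the axioms; the fair die is just the simplest case.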
Key Rules
Complement Rule: \(P(A^c) = 1 - P(A)\)
Addition Rule (General): \(P(A \cup B) = P(A) + P(B) - P(A \cap B)\)
Multiplication Rule: \(P(A \cap B) = P(A) \cdot P(B|A)\)
Independence: A and B are independent if: \(P(A \cap B) = P(A) \cdot P(B)\)
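Each of these rules can be verified numerically on a finite sample space; a sketch with fair-die events (the sets A, B, C are illustrative):

```python
# Finite sample space: a fair six-sided die
omega = {1, 2, 3, 4, 5, 6}

def prob(event):
    """P(A) under the uniform distribution on omega."""
    return len(event) / len(omega)

A = {1, 2, 3}   # "roll at most 3"
B = {2, 4, 6}   # "roll an even number"
C = {1, 4}      # chosen so that A and C come out independent

# Complement rule: P(A^c) = 1 - P(A)
assert abs(prob(omega - A) - (1 - prob(A))) < 1e-12

# General addition rule: P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
assert abs(prob(A | B) - (prob(A) + prob(B) - prob(A & B))) < 1e-12

# Multiplication rule: P(A ∩ B) = P(A) · P(B|A)
p_b_given_a = prob(A & B) / prob(A)
assert abs(prob(A & B) - prob(A) * p_b_given_a) < 1e-12

# Independence: P(A ∩ C) = 1/6 = P(A) · P(C), so A and C are independent;
# but P(A ∩ B) = 1/6 ≠ 1/4 = P(A) · P(B), so A and B are not
assert abs(prob(A & C) - prob(A) * prob(C)) < 1e-12
assert abs(prob(A & B) - prob(A) * prob(B)) > 1e-3
```

Note that independence is a property of the probabilities, not of whether the events "feel" related: A and C above share the outcome 1, yet they are independent.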
Law of Total Probability
If {B₁, B₂, …, Bₙ} is a partition of Ω: \(P(A) = \sum_{i=1}^{n} P(A|B_i) \cdot P(B_i)\)
This is the foundation of marginalisation in Bayesian ML.
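The sum is straightforward to compute; a sketch with a hypothetical two-urn partition (the probabilities below are made up for illustration):

```python
# Partition: pick urn 1 with probability 0.3, urn 2 with probability 0.7
p_b = [0.3, 0.7]            # P(B_i) — hypothetical
p_a_given_b = [0.8, 0.4]    # P(A | B_i), e.g. "draw a red ball" — hypothetical

# Law of total probability: P(A) = Σ P(A|B_i) · P(B_i)
p_a = sum(pa * pb for pa, pb in zip(p_a_given_b, p_b))
# 0.3 · 0.8 + 0.7 · 0.4 = 0.24 + 0.28 = 0.52
assert abs(p_a - 0.52) < 1e-12
```

Marginalising a latent variable in a Bayesian model is exactly this sum (or an integral) over the partition induced by the latent variable's values.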
ML Connections
- Naive Bayes: assumes conditional feature independence: \(P(x_1, \dots, x_n \mid y) = \prod_i P(x_i \mid y)\)
- Logistic Regression: models \(P(Y=1 \mid X)\) directly
- Random Forest: each tree gives a probability estimate; the forest averages them
Calibration: is \(\hat{P}(Y=1 \mid X=x)\) truly the probability of the event?
# In sklearn, predict_proba() returns probability estimates
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()
model.fit(X_train, y_train)  # X_train, y_train: your training data
probs = model.predict_proba(X_test)[:, 1]  # estimated P(Y=1 | X)
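One simple way to probe calibration is to bin the predicted probabilities and compare the mean prediction in each bin with the empirical positive rate. A self-contained sketch on simulated data from a perfectly calibrated "model" (labels drawn as Bernoulli of the predicted probability):

```python
import random

random.seed(0)

# Simulate a perfectly calibrated predictor: for each case, predict p
# and draw the label as Bernoulli(p)
preds, labels = [], []
for _ in range(100_000):
    p = random.random()
    preds.append(p)
    labels.append(1 if random.random() < p else 0)

# Reliability check in one bin: among cases with 0.4 <= p < 0.6,
# the empirical positive rate should be close to the mean predicted p
in_bin = [(p, y) for p, y in zip(preds, labels) if 0.4 <= p < 0.6]
mean_pred = sum(p for p, _ in in_bin) / len(in_bin)
frac_pos = sum(y for _, y in in_bin) / len(in_bin)
assert abs(mean_pred - frac_pos) < 0.02
```

For a real classifier you would apply the same binning to held-out predictions; a large gap between mean prediction and positive rate in a bin signals miscalibration.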
Previous: #5 Covariance & Correlation | Next: #7 Conditional Probability
