Statistics for ML #78 — Imbalanced Data — SMOTE & Class Weights


Post #78/100 in the Statistics for ML series — Md Salek Miah, Statistician & ML Researcher, SUST, Bangladesh.

Imbalanced data is ubiquitous in health outcomes research. In DHS data, rare outcomes like maternal mortality or specific diseases can have extreme class imbalance.

Why Standard Models Fail

A model that always predicts the majority class achieves high accuracy but is clinically useless.

Example: If only 5% of women experienced pregnancy loss, a trivial model predicting “no loss” for everyone achieves 95% accuracy but 0% sensitivity.
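This accuracy paradox is easy to demonstrate. A minimal sketch on synthetic labels (5% positives, mirroring the pregnancy-loss example; the data here is simulated, not from DHS):

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Simulated labels: roughly 5% positives ("pregnancy loss")
rng = np.random.default_rng(0)
y_true = (rng.random(10_000) < 0.05).astype(int)

# Trivial model: always predict the majority class ("no loss")
y_pred = np.zeros_like(y_true)

print(f"Accuracy:    {accuracy_score(y_true, y_pred):.2%}")  # ~95%
print(f"Sensitivity: {recall_score(y_true, y_pred):.2%}")    # 0%
```

High accuracy, zero sensitivity: every actual case is missed, which is exactly the failure mode that matters clinically.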

Solutions

| Method | Description | When to use |
|---|---|---|
| SMOTE | Synthetic Minority Over-sampling Technique | Small-to-medium datasets |
| Class weights | Weight minority-class errors more heavily in the loss | Any size; simpler |
| Threshold tuning | Adjust the decision threshold | Post-training |
| Ensemble methods | e.g. `BalancedBaggingClassifier` | Works well in practice |
| Precision-Recall AUC | Better evaluation metric than ROC-AUC | Severe imbalance |
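For class weights, scikit-learn's `'balanced'` mode sets each class weight to `n_samples / (n_classes * n_c)`. A quick check of what that means for the 5%-positive example (counts here are illustrative):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# 950 negatives, 50 positives: the 5% scenario from above
y = np.array([0] * 950 + [1] * 50)

# 'balanced' weight for class c = n_samples / (n_classes * count_c)
weights = compute_class_weight('balanced', classes=np.array([0, 1]), y=y)
print(dict(zip([0, 1], weights)))  # minority class gets weight 10.0
```

Each minority error therefore costs 10x as much as under uniform weighting, which is what pushes the model away from the trivial majority-class solution.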
```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# SMOTE + Random Forest pipeline (used in my DHS research);
# sampling_strategy=0.5 oversamples the minority class to half
# the majority-class size, rather than to full balance
pipeline = Pipeline([
    ('smote', SMOTE(random_state=42, sampling_strategy=0.5)),
    ('clf', RandomForestClassifier(class_weight='balanced', n_estimators=200))
])

# X_train, y_train, X_test, y_test come from an earlier train/test split
pipeline.fit(X_train, y_train)
print(classification_report(y_test, pipeline.predict(X_test)))
```
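Threshold tuning and PR-AUC from the table can be sketched the same way. This example uses synthetic data from `make_classification` (5% positives) as a stand-in for a DHS-style outcome; the thresholds tried are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: ~5% positives
X, y = make_classification(n_samples=5_000, weights=[0.95], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

clf = RandomForestClassifier(class_weight='balanced', random_state=42)
clf.fit(X_train, y_train)
proba = clf.predict_proba(X_test)[:, 1]

# PR-AUC (average precision) summarises the trade-off better than
# ROC-AUC under severe imbalance
print(f"PR-AUC: {average_precision_score(y_test, proba):.3f}")

# Lowering the default 0.5 threshold trades precision for sensitivity
for threshold in (0.5, 0.3, 0.1):
    pred = (proba >= threshold).astype(int)
    print(f"threshold={threshold}: sensitivity={recall_score(y_test, pred):.2f}")
```

Because lowering the threshold can only add predicted positives, sensitivity is non-decreasing as the threshold drops; the cost is more false positives, so the right operating point depends on the clinical cost of a missed case.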

Series Index | Post #78/100 | Md Salek Miah | saleksta@gmail.com