Use when evaluating model performance or reporting results. Triggers on: accuracy, precision, recall, F1, AUC, ROC, confusion matrix, classification report, log loss, Brier score, calibration curve, SHAP, feature importance, MAE, RMSE, R2, mean absolute error, subgroup, fairness, baseline comparison, predict_proba, score, evaluate, metrics, test set, held-out, explainability, shap.Explainer, shap_values, CalibratedClassifierCV, cross_val_predict.
RULE: Never evaluate on the training set — this is a category error, not a simplification.
RULE: Reserve a final test set untouched until all model development and tuning is complete — evaluating on the test set during development and then reporting that performance as if it were a fresh evaluation is a form of leakage.
RULE: Use cross-validation for model selection and hyperparameter tuning, not the test set.
RULE: No single metric tells the complete story — select a combination appropriate to the problem and interpret them together.
RULE: Never report accuracy alone for classification — it is misleading for imbalanced classes. Report precision, recall, F1, and AUC together at minimum.
RULE: For classifiers that output probability scores, use log loss — it rewards calibrated probabilities, which accuracy and F1 do not.
RULE: Calibrate probability outputs — a model that says 80% probability should be right 80% of the time. Use calibration curves and Brier scores to diagnose this.
RULE: Establish a simple baseline before evaluating any complex model — mean predictor for regression, majority class for classification, logistic regression for tabular data.
RULE: Use SHAP values to explain predictions — prefer them over model-specific explainers. Built-in feature importances are misleading for correlated features.
RULE: For regression, report MAE alongside RMSE — MAE is interpretable in domain units, while RMSE penalises large errors more heavily. Report both and explain the tradeoff.
RULE: Evaluate model performance on meaningful subgroups, not just overall — a model that performs well on average but poorly on a minority subgroup is not a good model.
RULE: Report effect sizes and confidence intervals on evaluation metrics, not just point estimates.
RULE: Document negative results and rejected approaches — what you tried and why it was rejected is as valuable as what worked.
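The baseline rule is mechanical to apply with scikit-learn's `DummyClassifier`. A minimal sketch, using a synthetic imbalanced dataset in place of real data — the dataset, split, and model choice here are illustrative assumptions, not part of the rule:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic 90/10 imbalanced data standing in for a real tabular problem.
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=0)

# Majority-class baseline: any candidate model must beat this to justify its cost.
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

baseline_f1 = f1_score(y_val, baseline.predict(X_val), zero_division=0)
model_f1 = f1_score(y_val, model.predict(X_val), zero_division=0)
print(f"baseline F1={baseline_f1:.3f}  model F1={model_f1:.3f}")
```

`DummyRegressor(strategy="mean")` plays the same role for regression: the mean predictor named in the rule.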
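The "never report accuracy alone" rule can be demonstrated in a few lines. A sketch with hypothetical 95/5 imbalanced labels and a degenerate classifier that always predicts the majority class:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical imbalanced labels: 95 negatives, 5 positives.
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.zeros(100, dtype=int)  # always predicts the majority class

acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred, zero_division=0)
rec = recall_score(y_true, y_pred, zero_division=0)
f1 = f1_score(y_true, y_pred, zero_division=0)
print(acc, prec, rec, f1)  # accuracy is 0.95; precision, recall, and F1 are all 0
```

Accuracy reads as excellent while the classifier never finds a single positive case — exactly why the other metrics must accompany it.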
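The calibration rule can be diagnosed with `brier_score_loss` (and, on larger samples, `sklearn.calibration.calibration_curve` or refitting with `CalibratedClassifierCV`). A minimal sketch with hand-picked toy probabilities, chosen only to show the comparison:

```python
import numpy as np
from sklearn.metrics import brier_score_loss

y_true = np.array([0, 0, 0, 1, 1, 1])
p_calibrated = np.array([0.1, 0.2, 0.1, 0.8, 0.9, 0.8])  # confident and correct
p_hedging = np.full(6, 0.5)  # always says 50%: uninformative, never "wrong"

b_cal = brier_score_loss(y_true, p_calibrated)
b_hedge = brier_score_loss(y_true, p_hedging)  # 0.25 for constant 0.5 predictions
print(b_cal, b_hedge)  # 0.025 vs 0.25 — lower Brier score is better
```

The Brier score is the mean squared error of the probabilities themselves, so it rewards the model that commits to well-placed confident predictions.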
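The MAE/RMSE tradeoff is easiest to see on constructed numbers. A sketch with two hypothetical prediction vectors that share the same total absolute error but distribute it differently:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([10.0, 10.0, 10.0, 10.0])
y_even = np.array([9.0, 11.0, 9.0, 11.0])    # four errors of size 1
y_spiky = np.array([10.0, 10.0, 10.0, 14.0])  # one error of size 4

mae_even = mean_absolute_error(y_true, y_even)      # 1.0
mae_spiky = mean_absolute_error(y_true, y_spiky)    # 1.0 — identical MAE
rmse_even = np.sqrt(mean_squared_error(y_true, y_even))    # 1.0
rmse_spiky = np.sqrt(mean_squared_error(y_true, y_spiky))  # 2.0 — RMSE flags the spike
print(mae_even, mae_spiky, rmse_even, rmse_spiky)
```

MAE reads the two models as equivalent in domain units; RMSE's squaring penalises the single large error, which is the tradeoff the rule asks you to report and explain.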
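The subgroup rule amounts to slicing the evaluation by a group column. A toy sketch — the group labels and predictions here are invented to make the failure mode visible:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical results: perfect on group "a", completely wrong on group "b".
y_true = np.array([1, 0, 1, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 0, 1])
group = np.array(["a", "a", "a", "a", "b", "b", "b", "b"])

overall = accuracy_score(y_true, y_pred)  # 0.5
per_group = {
    g: accuracy_score(y_true[group == g], y_pred[group == g])
    for g in np.unique(group)
}
print(overall, per_group)  # 0.5 {'a': 1.0, 'b': 0.0}
```

The overall number reads as mediocre but salvageable; the per-group view shows the model is unusable for group "b", which is the conclusion that matters.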
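For the confidence-interval rule, a nonparametric bootstrap over the evaluation set is a common, assumption-light approach. A sketch on synthetic predictions — the seed, sample size, and simulated 80% accuracy are illustrative assumptions:

```python
import numpy as np
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(42)
y_true = rng.integers(0, 2, size=200)
# Simulate a classifier that is right about 80% of the time.
flip = rng.random(200) < 0.2
y_pred = np.where(flip, 1 - y_true, y_true)

point = accuracy_score(y_true, y_pred)
n = len(y_true)
# Resample (y_true, y_pred) pairs with replacement and recompute the metric.
boot = np.array([
    accuracy_score(y_true[idx], y_pred[idx])
    for idx in (rng.integers(0, n, size=n) for _ in range(2000))
])
ci_low, ci_high = np.percentile(boot, [2.5, 97.5])
print(f"accuracy {point:.3f} (95% bootstrap CI {ci_low:.3f}-{ci_high:.3f})")
```

Reporting "0.80 (95% CI 0.74–0.85)" instead of a bare point estimate makes the uncertainty from the finite test set explicit.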
RULE: Follow this checklist before reporting any evaluation results:
Save all evaluation outputs to reports/, including visualisations and tables.