Build, review, and debug feature engineering and cross-validation pipelines for tabular ML competitions. Use when: implementing or caching engineered features; choosing and implementing the correct CV split strategy (GroupKFold, StratifiedKFold, TimeSeriesSplit, KFold); preventing target leakage in fold-local encodings; structuring OOF accumulation arrays; diagnosing a train/LB gap caused by leakage. NOT for model training, metrics, or tuning.
This skill covers the two data-layer concerns that must be correct before any model is worth running: feature engineering (building and caching engineered features) and validation strategy (choosing the right CV split, accumulating OOF predictions, and preventing target leakage).

Core invariants:
| Data condition | Split to use |
|---|---|
| Natural integrity unit (user ID, entity ID, session ID) | GroupKFold |
| i.i.d. rows with class imbalance, no group column | StratifiedKFold |
| Rows are truly independent and target is balanced | KFold |
| Temporal ordering matters | TimeSeriesSplit or rolling-window split — never GroupKFold |
```python
oof = np.zeros(len(train_df))  # allocate once, full training length
for fold, (tr_idx, va_idx) in enumerate(splits):
    model.fit(X[tr_idx], y[tr_idx])
    oof[va_idx] = model.predict_proba(X[va_idx])[:, 1]  # assign by index, not stack
```
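When the training frame has been augmented before splitting (for example, pseudo-labeled rows appended after the original `n_train` rows), the OOF array must stay sized to the original train and only original-row predictions may be stored. A runnable sketch under those assumptions; the synthetic data and `LogisticRegression` stand in for the real pipeline:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
n_train, n_pseudo = 100, 20                   # pseudo rows appended AFTER the originals
X = rng.normal(size=(n_train + n_pseudo, 4))
y = rng.integers(0, 2, size=n_train + n_pseudo)

oof = np.zeros(n_train)                       # sized to the ORIGINAL train only
for tr_idx, va_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = LogisticRegression().fit(X[tr_idx], y[tr_idx])
    preds = model.predict_proba(X[va_idx])[:, 1]
    orig = va_idx < n_train                   # mask: which validation rows are original
    oof[va_idx[orig]] = preds[orig]           # never store predictions for appended rows
```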
Assign by index (`oof[va_idx] = preds`), never `oof_list.append(preds.mean())`. If the training frame was augmented before splitting (e.g. pseudo-label rows appended after index `n_train`), restrict assignment to the original rows with `oof[va_idx[va_idx < n_train]]`.

```python
# ✅ CORRECT — computed inside fold loop, only on training portion
for fold, (tr_idx, va_idx) in enumerate(splits):
    te = train_df.iloc[tr_idx].groupby("cat_col")["target"].mean()
    train_df.loc[va_idx, "cat_te"] = train_df.loc[va_idx, "cat_col"].map(te)
```
```python
# ❌ WRONG — computed before split, leaks validation targets into training features
train_df["cat_te"] = train_df.groupby("cat_col")["target"].transform("mean")
```
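A common refinement of the fold-local pattern, shown as a hedged sketch: shrink each category's training-fold mean toward the fold's global mean, so rare categories don't memorize their few targets. The `fold_target_encode` helper and its `smoothing` parameter are illustrative, not part of this skill's pipeline:

```python
import pandas as pd

def fold_target_encode(train_df, tr_idx, va_idx, cat_col, target_col, smoothing=20.0):
    """Fold-local smoothed target encoding: fit on tr_idx only, map onto va_idx."""
    tr = train_df.iloc[tr_idx]
    prior = tr[target_col].mean()  # training-fold global mean
    stats = tr.groupby(cat_col)[target_col].agg(["mean", "count"])
    # shrink each category mean toward the prior in proportion to category size
    smooth = (stats["mean"] * stats["count"] + prior * smoothing) / (stats["count"] + smoothing)
    # categories unseen in the training fold fall back to the prior
    return train_df.iloc[va_idx][cat_col].map(smooth).fillna(prior)
```

Like the plain fold-local mean, this uses only training-fold targets, so it stays leakage-free.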
```python
import os
import pandas as pd

FEAT_CACHE = cfg.feat_cache  # e.g. "cache/features_v3.pkl"

def load_or_build_features(df):
    if os.path.exists(FEAT_CACHE):
        return pd.read_pickle(FEAT_CACHE)
    feats = engineer_features(df)
    feats.to_pickle(FEAT_CACHE)
    return feats
```
Hard rules:
- Bump the cache filename in `config.yaml` (`feat_cache: cache/features_v3.pkl`) on every change to `engineer_features()`, so stale features are never loaded.
- `build_model_matrices()` is called once per process and cached in a module-level variable; do not call it inside the fold loop.

Reference files:

| File | What it covers |
|---|---|
| feature-engineering.md | Encoding strategies, datetime features, aggregations, feature selection, cache discipline |
| validation-strategy.md | GroupKFold / TimeSeriesSplit, OOF accumulation, leakage prevention, leakage checklist |
Related skills:

| Skill | When to use it instead |
|---|---|
| ml-competition | Full pipeline overview, task type decision guide, first-principles checklist |
| ml-competition-setup | Project structure, RunConfig, process management |
| ml-competition-training | Model training, competition metrics, correct output format |
| ml-competition-tuning | Optuna hyperparameter tuning |
| ml-competition-advanced | Pseudo-labeling, ensemble, post-processing, experiment tracking |
| ml-competition-quality | Coding rules, common pitfalls |