Build, review, and debug feature engineering and cross-validation pipelines for tabular ML competitions. Use when: implementing or caching engineered features; choosing and implementing the correct CV split strategy (GroupKFold, StratifiedKFold, TimeSeriesSplit, KFold); preventing target leakage in fold-local encodings; structuring OOF accumulation arrays; diagnosing a train/LB gap caused by leakage. NOT for model training, metrics, or tuning.
This skill covers the two data-layer concerns that must be correct before any model is worth running: feature engineering (building and caching engineered features) and validation strategy (choosing the right CV split, accumulating OOF predictions, and preventing target leakage).

Core invariants:
| Data condition | Split to use |
|---|---|
| Natural integrity unit (user ID, entity ID, session ID) | GroupKFold |
| i.i.d. rows with class imbalance, no group column | StratifiedKFold |
| Rows are truly independent and target is balanced | KFold |
| Temporal ordering matters | TimeSeriesSplit or rolling-window split — never GroupKFold |
```python
oof = np.zeros(len(train_df))  # allocate once, full training length
for fold, (tr_idx, va_idx) in enumerate(splits):
    model.fit(X[tr_idx], y[tr_idx])
    oof[va_idx] = model.predict_proba(X[va_idx])[:, 1]  # assign by index, not stack
```
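When the training frame has been augmented before splitting (for example, pseudo-labeled rows appended after the original `n_train` rows), the OOF array must stay sized to the original train and only original-row predictions may be stored. A runnable sketch under those assumptions; the synthetic data and `LogisticRegression` stand in for the real pipeline:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
n_train, n_pseudo = 100, 20                   # pseudo rows appended AFTER the originals
X = rng.normal(size=(n_train + n_pseudo, 4))
y = rng.integers(0, 2, size=n_train + n_pseudo)

oof = np.zeros(n_train)                       # sized to the ORIGINAL train only
for tr_idx, va_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = LogisticRegression().fit(X[tr_idx], y[tr_idx])
    preds = model.predict_proba(X[va_idx])[:, 1]
    orig = va_idx < n_train                   # mask: which validation rows are original
    oof[va_idx[orig]] = preds[orig]           # never store predictions for appended rows
```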
Assign by index (`oof[va_idx] = preds`), never `oof_list.append(preds.mean())`. If the training frame was augmented before splitting (e.g. pseudo-label rows appended after index `n_train`), restrict assignment to the original rows with `oof[va_idx[va_idx < n_train]]`.

```python
# ✅ CORRECT — computed inside fold loop, only on training portion
for fold, (tr_idx, va_idx) in enumerate(splits):
    te = train_df.iloc[tr_idx].groupby("cat_col")["target"].mean()
    train_df.loc[va_idx, "cat_te"] = train_df.loc[va_idx, "cat_col"].map(te)
```
```python
# ❌ WRONG — computed before split, leaks validation targets into training features
train_df["cat_te"] = train_df.groupby("cat_col")["target"].transform("mean")
```
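A common refinement of the fold-local pattern, shown as a hedged sketch: shrink each category's training-fold mean toward the fold's global mean, so rare categories don't memorize their few targets. The `fold_target_encode` helper and its `smoothing` parameter are illustrative, not part of this skill's pipeline:

```python
import pandas as pd

def fold_target_encode(train_df, tr_idx, va_idx, cat_col, target_col, smoothing=20.0):
    """Fold-local smoothed target encoding: fit on tr_idx only, map onto va_idx."""
    tr = train_df.iloc[tr_idx]
    prior = tr[target_col].mean()  # training-fold global mean
    stats = tr.groupby(cat_col)[target_col].agg(["mean", "count"])
    # shrink each category mean toward the prior in proportion to category size
    smooth = (stats["mean"] * stats["count"] + prior * smoothing) / (stats["count"] + smoothing)
    # categories unseen in the training fold fall back to the prior
    return train_df.iloc[va_idx][cat_col].map(smooth).fillna(prior)
```

Like the plain fold-local mean, this uses only training-fold targets, so it stays leakage-free.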
```python
import os
import pandas as pd

FEAT_CACHE = cfg.feat_cache  # e.g. "cache/features_v3.pkl"

def load_or_build_features(df):
    if os.path.exists(FEAT_CACHE):
        return pd.read_pickle(FEAT_CACHE)
    feats = engineer_features(df)
    feats.to_pickle(FEAT_CACHE)
    return feats
```
Hard rules:
- Bump the cache filename in `config.yaml` (`feat_cache: cache/features_v3.pkl`) on every change to `engineer_features()`, so stale features are never loaded.
- `build_model_matrices()` is called once per process and cached in a module-level variable; do not call it inside the fold loop.

Reference files:

| File | What it covers |
|---|---|
| feature-engineering.md | Encoding strategies, datetime features, aggregations, feature selection, cache discipline |
| validation-strategy.md | GroupKFold / TimeSeriesSplit, OOF accumulation, leakage prevention, leakage checklist |
Related skills:

| Skill | When to use it instead |
|---|---|
| ml-competition | Full pipeline overview, task type decision guide, first-principles checklist |
| ml-competition-setup | Project structure, RunConfig, process management |
| ml-competition-training | Model training, competition metrics, correct output format |
| ml-competition-tuning | Optuna hyperparameter tuning |
| ml-competition-advanced | Pseudo-labeling, ensemble, post-processing, experiment tracking |
| ml-competition-quality | Coding rules, common pitfalls |