Name: Ds Roast
Author: arthurcuri



Preprocessing leakage: scaler.fit_transform(X) on the full dataset before splitting. The scaler has seen test set statistics. Fix: fit only on X_train, apply .transform() on X_val and X_test.
Target leakage: any feature derived from or correlated with the target that would not exist at prediction time (e.g., post-event flags, future aggregates).
Temporal leakage: using data from the future to predict the past. A train_test_split(shuffle=True) on time-series data is an instant credibility destroyer. Fix: always use TimeSeriesSplit or manual chronological cutoffs.
Group leakage: the same user/entity/patient appears in both train and test. Fix: split by group ID, not by row.
SMOTE/augmentation leakage: oversampling applied before the train/test split. Fix: apply SMOTE only inside the training fold, never before splitting.
Feature selection leakage: selecting features based on correlation with target computed on the full dataset. Fix: wrap feature selection inside the CV pipeline.
Target encoding leakage: computing target-encoded values using the full training set within a CV loop. Fix: use TargetEncoder inside a Pipeline so it sees only the current fold's training data.
Imputation leakage: fitting an imputer (mean, median, KNN) on the full dataset. Same fix as scaling — fit on train fold only.
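Normalization before split: any df['col'] = (df['col'] - df['col'].mean()) / df['col'].std() applied to the full dataframe. The mean and standard deviation include validation and test rows. Fix: compute both statistics on the training split only and reuse them for the other splits.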
Prediction window contamination (same-period aggregation leakage): the most insidious leakage in business/time-aggregated problems. If the target is an aggregate over a period (e.g., total revenue in month M), any feature that is also computed over period M is leaked — it did not exist at prediction time. This applies equally to SQL queries, pandas groupbys, and feature engineering pipelines.

Examples of this pattern:
- Predicting total sales value in month M using quantity sold in month M as a feature → the quantity is a direct component of the target, computed over the same window.
- Predicting churn in month M using number of support tickets in month M → tickets in M are not known at the start of M when the prediction is made.
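
Several fixes in the leakage list above (scaling, imputation, feature selection, group splits) share one pattern: every fitted step goes inside a Pipeline, and the splitter decides which rows each fold may see. A minimal sketch with scikit-learn follows. The data is synthetic, and the names leak_free, groups, and no_group_overlap are illustrative assumptions, not part of the original text.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical): 200 rows, 10 features, 40 entities.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)
X[rng.random(X.shape) < 0.05] = np.nan   # sprinkle in missing values
groups = np.repeat(np.arange(40), 5)     # 5 rows per entity

# Every fitted step lives inside the pipeline, so each CV fold refits the
# imputer, scaler, and feature selector on that fold's training rows only.
leak_free = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("select", SelectKBest(f_classif, k=5)),
    ("model", LogisticRegression(max_iter=1000)),
])

# GroupKFold keeps all rows of one entity on the same side of each split.
cv = GroupKFold(n_splits=5)
scores = cross_val_score(leak_free, X, y, groups=groups, cv=cv)

# Sanity check: no entity appears in both train and test within any fold.
no_group_overlap = all(
    set(groups[train_idx]).isdisjoint(groups[test_idx])
    for train_idx, test_idx in cv.split(X, y, groups)
)
```

For time-ordered data, the same pipeline can be evaluated with TimeSeriesSplit in place of GroupKFold, so that each fold trains only on rows that precede its test rows.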

Duplicate rows not checked: duplicate observations silently over-weight certain samples. A df.duplicated().sum() returning anything other than 0 that was not deliberately handled is a bug. For time-series, check duplicates by (entity_id, timestamp) not just by all columns.
Unseen categories at inference: a one-hot encoder or LabelEncoder fitted on training categories will, by default, raise an error when it encounters a new category at serving time. Fix: use handle_unknown='ignore' for OHE (the new category is then encoded as an all-zero vector, so confirm the model tolerates that) or handle_unknown='use_encoded_value' together with unknown_value for ordinal encoding; test explicitly with out-of-vocabulary inputs.
High-cardinality one-hot encoding: OHE on a column with 10,000 unique values produces 10,000 sparse features. This explodes memory, slows training, and introduces near-useless features. Fix: use target encoding, frequency encoding, or embedding for high-cardinality categoricals.
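Column order mismatch between train and inference: model.predict(df) where df columns are in a different order than during training. Any model fed plain arrays matches features by position, so it silently uses the wrong features; a neural net given a tensor of the right shape will not crash either. Recent scikit-learn versions check feature names when the model was fitted on a DataFrame, but do not rely on that check. Fix: enforce column order explicitly; serialize the expected schema alongside the model.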
Unit and scale inconsistencies: mixing km and miles, USD and EUR, grams and kilograms in the same feature column. These produce wildly wrong feature values with no error. Audit units at ingestion, enforce them in the schema.
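Date parsing ambiguity: '01/02/2024' is January 2nd under a month-first convention and February 1st under a day-first one. pd.to_datetime assumes month-first unless told otherwise, regardless of where the data came from. Explicitly pass dayfirst or format; never rely on pandas inference.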
String whitespace and case inconsistencies: 'Male', 'male', ' male ' treated as three distinct categories. Always normalize text fields at ingestion with .str.strip().str.lower().
Implicit zeros from missing join keys: a LEFT JOIN that does not match fills with NULL, which fillna(0) silently converts to a real value. A customer with zero transactions is different from a customer not in the transactions table at all. Model them explicitly.
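
Four of the data quality checks above (duplicates, text normalization, explicit date formats, and unmatched join keys) can be sketched in pandas as follows. The customers and transactions tables, their column names, and the day-first date convention are hypothetical, chosen only to make the example self-contained.

```python
import pandas as pd

# Hypothetical customer and transaction tables for illustration.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 3],
    "gender": ["Male", " male ", "FEMALE", "FEMALE"],
    "signup": ["01/02/2024", "15/02/2024", "03/03/2024", "03/03/2024"],
})
transactions = pd.DataFrame({"customer_id": [1, 1], "amount": [10.0, 5.0]})

# Duplicates: count them first, then handle them deliberately.
n_dupes = int(customers.duplicated().sum())
customers = customers.drop_duplicates().reset_index(drop=True)

# Whitespace and case: collapse 'Male' and ' male ' into one category.
customers["gender"] = customers["gender"].str.strip().str.lower()

# Dates: state the format explicitly (day first here, by assumption).
customers["signup"] = pd.to_datetime(customers["signup"], format="%d/%m/%Y")

# Join keys: keep "no match" distinct from "zero" with an explicit flag
# before filling the missing amounts.
totals = transactions.groupby("customer_id", as_index=False)["amount"].sum()
merged = customers.merge(totals, on="customer_id", how="left", indicator=True)
merged["has_transactions"] = merged["_merge"] == "both"
merged = merged.drop(columns="_merge")
merged["amount"] = merged["amount"].fillna(0.0)
```

The has_transactions flag lets a model, or a reviewer, tell a customer who is absent from the transactions table apart from one whose recorded total is zero.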

Data Pipeline
  ❌ Preprocessing fitted only on training folds
  ❌ Data versioned (DVC / Delta Lake / LakeFS)
  ❌ Schema validated at ingestion (Pandera / Great Expectations)
  ❌ Reproducible data splits (fixed seed + stratification where appropriate)
  ❌ No temporal leakage (check time-series splits)
  ❌ All aggregated features use periods strictly prior to the prediction point (lag >= prediction horizon)
  ❌ Rolling/window features shifted to exclude target period data
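
The last two items in this list can be illustrated with a short pandas sketch: aggregate features are shifted so that the row for month M only contains values from months before M. The sales table and its column names are hypothetical, and a lag of one month assumes a prediction horizon of one month.

```python
import pandas as pd

# Hypothetical monthly sales table: one row per (store, month).
sales = pd.DataFrame({
    "store": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "month": ["2024-01", "2024-02", "2024-03", "2024-04"] * 2,
    "quantity": [10, 12, 9, 14, 3, 4, 6, 5],
    "revenue": [100.0, 125.0, 90.0, 150.0, 30.0, 42.0, 61.0, 48.0],
}).sort_values(["store", "month"]).reset_index(drop=True)

by_store = sales.groupby("store")["quantity"]

# Leaky: quantity in month M is a component of revenue in month M.
# Safe: the quantity from the month strictly before M.
sales["qty_prev_month"] = by_store.shift(1)

# Rolling mean over the 2 months before M: shift first, then roll, so the
# window never includes the target month.
sales["qty_prev_2m_mean"] = by_store.transform(
    lambda s: s.shift(1).rolling(window=2).mean()
)
```

The first month of each store has no history, so its lagged features are NaN. Dropping or imputing those rows is a modeling decision that belongs inside the training pipeline.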

Modeling
  ❌ Baseline model defined and beaten
  ❌ Cross-validation with proper fold strategy
  ❌ Metrics appropriate for problem type
  ❌ Hyperparameters tuned on validation set only
  ❌ Model calibration checked (for probabilistic outputs)
  ❌ Uncertainty quantification (confidence intervals on metrics)
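
One common way to put a confidence interval on a metric is the percentile bootstrap over the held-out set. The sketch below is one option among several, not a prescribed method; bootstrap_ci and accuracy are hypothetical helper names, and the labels are made up.

```python
import numpy as np

def bootstrap_ci(y_true, y_pred, metric, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for metric(y_true, y_pred)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    rng = np.random.default_rng(seed)
    n = len(y_true)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)   # resample rows with replacement
        stats[b] = metric(y_true[idx], y_pred[idx])
    lower, upper = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return float(lower), float(upper)

def accuracy(y_true, y_pred):
    return float(np.mean(y_true == y_pred))

# Hypothetical held-out labels and predictions: 85 of 100 correct.
y_true = np.array([1] * 50 + [0] * 50)
y_pred = y_true.copy()
y_pred[:15] = 0   # make the first 15 predictions wrong

point = accuracy(y_true, y_pred)
low, high = bootstrap_ci(y_true, y_pred, accuracy)
```

Reporting (low, high) next to the point estimate shows whether a claimed improvement over the baseline is larger than the resampling noise of the test set.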

MLOps
  ❌ Experiment tracking with parameters + metrics + artifacts
  ❌ Random seeds fixed globally
  ❌ Dependencies fully pinned (lockfile present)
  ❌ Model artifacts versioned and registered
  ❌ Pipeline orchestrated (not manually executed)
  ❌ Model evaluation gate before promotion

Serving & Monitoring
  ❌ Training and inference use the same serialized preprocessing pipeline
  ❌ Input schema validated at serving time
  ❌ Data drift and concept drift monitoring
  ❌ Prediction distribution monitoring
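
For the drift and prediction-distribution items, one widely used summary statistic is the population stability index (PSI), which compares binned frequencies of a reference sample against a current sample. This is a minimal sketch under the assumption that a single numeric feature (or the model score) is monitored; the function name and the bin count are illustrative choices.

```python
import numpy as np

def population_stability_index(reference, current, n_bins=10, eps=1e-6):
    """PSI between a reference sample and a current sample of one feature."""
    reference = np.asarray(reference, dtype=float)
    current = np.asarray(current, dtype=float)
    # Interior bin edges come from reference quantiles, so each bin holds
    # roughly the same share of the reference data.
    interior = np.quantile(reference, np.linspace(0, 1, n_bins + 1))[1:-1]
    ref_counts = np.bincount(np.searchsorted(interior, reference), minlength=n_bins)
    cur_counts = np.bincount(np.searchsorted(interior, current), minlength=n_bins)
    # Clip the fractions to avoid log(0) when a bin is empty.
    ref_frac = np.clip(ref_counts / len(reference), eps, None)
    cur_frac = np.clip(cur_counts / len(current), eps, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

# Synthetic example: the "serving" sample is shifted by one standard deviation.
rng = np.random.default_rng(0)
train_feature = rng.normal(loc=0.0, scale=1.0, size=5000)
serving_feature = rng.normal(loc=1.0, scale=1.0, size=5000)

psi_same = population_stability_index(train_feature, train_feature)
psi_shifted = population_stability_index(train_feature, serving_feature)
```

The same function can be applied to model outputs to monitor the prediction distribution. Alert thresholds depend on the application and should be calibrated on historical data.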

Code Quality
  ❌ No preprocessing logic duplicated between train and inference
  ❌ No row-wise loops on DataFrames
  ❌ Correct dtypes throughout
  ❌ Logging instead of print
  ❌ Unit tests for feature transforms
  ❌ Configuration externalized (not hardcoded)

1. [CRITICAL / LOW EFFORT]   ...
2. [CRITICAL / LOW EFFORT]   ...
3. [HIGH IMPACT / LOW EFFORT] ...
4. [HIGH IMPACT / MEDIUM EFFORT] ...
5. [HIGH IMPACT / MEDIUM EFFORT] ...
6. [MEDIUM IMPACT / LOW EFFORT] ...
7. [MEDIUM IMPACT / HIGH EFFORT] ...

DS/ML Technical Roast

Step 1: Codebase Reconnaissance

Step 2: The Roast — Structured Technical Review

🔴 Critical Issues — Will Burn You in Production

Data Leakage (the cardinal sin)

Data Quality Issues (Silent Killers)

Other Critical Issues

🟠 Modeling & Methodology Issues

Evaluation

Baselines

Feature Engineering

Modeling Choices

Time-Series Specific

🟡 MLOps & Reproducibility Gaps

Reproducibility

Data & Model Versioning

Experiment Tracking

Pipeline & Orchestration

Serving & Monitoring

🟢 Code Quality & Engineering

Structure & Design

Performance

Error Handling & Observability

Dependencies & Security

💀 The Hall of Shame

Step 3: What Good Looks Like (Reference Checklist)

Step 4: Prioritized Action Plan

Step 5: Verdict

Tone Rules
