Detects and prevents data leakage in machine learning and mathematical modeling. Use after ML tasks involving data cleaning, feature engineering, data augmentation, algorithm development, normalization, missing value imputation, dimensionality reduction, feature selection, or time series modeling. Checks if features/statistics would be available at prediction time.
Automatically detects and prevents data leakage in machine learning workflows by verifying that all preprocessing steps, feature engineering, and statistical computations would be available at prediction time.
Use this skill after work involving:

- Data cleaning and missing value imputation
- Feature engineering and feature selection
- Normalization and other preprocessing
- Data augmentation
- Dimensionality reduction
- Time series modeling

**The Golden Rule:** At the exact moment of prediction in production, can I access this value from the database or compute it using only information available up to that point?

If the answer is "no" or "not completely," data leakage exists.
## Pattern 1: Preprocessing Before Split

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# ❌ WRONG: leakage - scaler is fit on the entire dataset
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # Uses test set statistics
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y)

# ✅ CORRECT: fit only on training data
X_train, X_test, y_train, y_test = train_test_split(X, y)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # Fit on train only
X_test_scaled = scaler.transform(X_test)        # Transform test using train statistics
```
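The difference is easy to see on synthetic data (the distribution and shapes below are illustrative, not from the skill): the leaky scaler learns statistics that include rows which would not exist at training time in production.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 1))
X_train, X_test = train_test_split(X, test_size=0.5, random_state=0)

leaky = StandardScaler().fit(X)        # sees test rows
clean = StandardScaler().fit(X_train)  # train rows only

# The two scalers learn different means; only the train-fit
# statistics are reproducible at prediction time.
print(leaky.mean_[0], clean.mean_[0])
```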
## Pattern 2: Global Missing Value Imputation

```python
# ❌ WRONG: global statistics include the test set
df['age'] = df['age'].fillna(df['age'].mean())  # Global mean includes test rows
X_train, X_test, y_train, y_test = train_test_split(df, y)

# ✅ CORRECT: compute statistics on the training set only
X_train, X_test, y_train, y_test = train_test_split(df, y)
train_mean = X_train['age'].mean()  # Only from training data
X_train['age'] = X_train['age'].fillna(train_mean)
X_test['age'] = X_test['age'].fillna(train_mean)  # Use the train mean for test
```
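When imputation feeds cross-validation, the safest pattern is to put the imputer inside a `Pipeline` so the fill statistic is recomputed from each training fold. A minimal sketch on synthetic data (column count, missingness rate, and label rule are invented for illustration):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
X[rng.random(X.shape) < 0.1] = np.nan  # ~10% of entries missing at random
y = (rng.random(200) > 0.5).astype(int)  # toy labels

pipe = Pipeline([
    ('impute', SimpleImputer(strategy='mean')),  # mean computed per training fold
    ('model', LogisticRegression()),
])
scores = cross_val_score(pipe, X, y, cv=5)  # no fold ever sees another fold's mean
```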
## Pattern 3: PCA/Dimensionality Reduction on Full Dataset

```python
from sklearn.decomposition import PCA

# ❌ WRONG: PCA learns variance structure from the test set
pca = PCA(n_components=10)
X_reduced = pca.fit_transform(X)  # Includes test set variance
X_train, X_test, y_train, y_test = train_test_split(X_reduced, y)

# ✅ CORRECT: fit PCA only on training data
X_train, X_test, y_train, y_test = train_test_split(X, y)
pca = PCA(n_components=10)
X_train_reduced = pca.fit_transform(X_train)  # Learn components from train only
X_test_reduced = pca.transform(X_test)        # Apply the train-learned projection
```
## Pattern 4: Target Encoding with Full Dataset

```python
# ❌ WRONG: uses target values from the test set
category_means = df.groupby('category')['target'].mean()  # Includes test targets
df['category_encoded'] = df['category'].map(category_means)
X_train, X_test, y_train, y_test = train_test_split(df, y)

# ✅ CORRECT: compute the encoding from training targets only
# (here the split keeps the 'target' column inside X_train)
X_train, X_test, y_train, y_test = train_test_split(df, y)
category_means = X_train.groupby('category')['target'].mean()  # Train only
X_train['category_encoded'] = X_train['category'].map(category_means)
X_test['category_encoded'] = X_test['category'].map(category_means)
```
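Even the train-only encoding above leaks slightly within the training set, because each row's own target contributes to its category mean. A common refinement is out-of-fold encoding. The sketch below uses a hypothetical toy frame, not data from the skill:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

train = pd.DataFrame({
    'category': list('ababababab'),
    'target':   [1, 0, 1, 1, 0, 0, 1, 0, 1, 1],
})

encoded = pd.Series(np.nan, index=train.index)
for fit_idx, enc_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(train):
    # Means computed WITHOUT the rows being encoded
    fold_means = train.iloc[fit_idx].groupby('category')['target'].mean()
    encoded.iloc[enc_idx] = train.iloc[enc_idx]['category'].map(fold_means).to_numpy()

# Fall back to the global training mean for categories unseen in a fold
train['category_encoded'] = encoded.fillna(train['target'].mean())
```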
## Pattern 5: Feature Selection on Full Dataset

```python
from sklearn.feature_selection import SelectKBest

# ❌ WRONG: feature selection sees the test set
selector = SelectKBest(k=10)
X_selected = selector.fit_transform(X, y)  # Uses test rows (and their labels) for selection
X_train, X_test, y_train, y_test = train_test_split(X_selected, y)

# ✅ CORRECT: select features using training data only
X_train, X_test, y_train, y_test = train_test_split(X, y)
selector = SelectKBest(k=10)
X_train_selected = selector.fit_transform(X_train, y_train)  # Train only
X_test_selected = selector.transform(X_test)                 # Apply the train-learned selection
```
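The damage from this pattern is easy to demonstrate: with pure-noise features and random labels, leaky selection makes cross-validation report far-above-chance accuracy. A sketch on synthetic data (sizes and seeds are illustrative):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 1000))   # pure noise features
y = rng.integers(0, 2, size=50)   # random labels: true accuracy is ~0.5

# Leaky: features chosen using ALL labels, then cross-validated
X_leaky = SelectKBest(f_classif, k=10).fit_transform(X, y)
leaky_score = cross_val_score(LogisticRegression(), X_leaky, y, cv=5).mean()

# Clean: selection refit inside each training fold
pipe = Pipeline([('select', SelectKBest(f_classif, k=10)),
                 ('model', LogisticRegression())])
clean_score = cross_val_score(pipe, X, y, cv=5).mean()
# leaky_score is typically far above chance; clean_score stays near 0.5
```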
## Pattern 6: Random Split on Temporal Data

```python
# ❌ WRONG: random split on a time series (uses the future to predict the past)
X_train, X_test = train_test_split(df, test_size=0.2, random_state=42)

# ✅ CORRECT: time-based split for temporal data
split_date = '2024-01-01'
X_train = df[df['date'] < split_date]
X_test = df[df['date'] >= split_date]
```
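For cross-validation on temporal data, scikit-learn's `TimeSeriesSplit` enforces the same discipline: assuming rows are already sorted chronologically, every training index precedes every test index. A minimal sketch:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)  # rows assumed to be in time order
splits = list(TimeSeriesSplit(n_splits=3).split(X))
for train_idx, test_idx in splits:
    # Training data always ends before the test window begins
    assert train_idx.max() < test_idx.min()
```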
## Pattern 7: Future Information in Time Series Features

```python
# ❌ WRONG: uses the whole day's data, including later observations, for the current row
df['daily_avg'] = df.groupby('date')['value'].transform('mean')

# ✅ CORRECT: use only past data (expanding window)
df = df.sort_values('timestamp')
df['cumulative_avg'] = (
    df.groupby('user_id')['value']
      .expanding().mean()
      .reset_index(0, drop=True)
)
# Note: the expanding mean still includes the current row's own value
```
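If the current row's value itself would not be known at prediction time, shift before taking the expanding mean so each row sees strictly earlier rows. A small sketch with invented data:

```python
import pandas as pd

df = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2],
    'value':   [10.0, 20.0, 30.0, 5.0, 15.0],
})
# shift(1) drops the current row before the expanding mean,
# so the feature uses strictly past values per user
df['past_avg'] = (
    df.groupby('user_id')['value']
      .transform(lambda s: s.shift(1).expanding().mean())
)
# user 1 -> [NaN, 10.0, 15.0]; user 2 -> [NaN, 5.0]
```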
## Pattern 8: Post-Event Features

```python
# ❌ WRONG: the feature only exists after the outcome.
# Predicting loan default using "number of collection calls" as a feature:
# collection calls only happen AFTER a default occurs.

# ✅ CORRECT: use only pre-event features,
# i.e. features available BEFORE the outcome: credit score, income, debt ratio, etc.
```
## Pattern 9: Leakage in Cross-Validation

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# ❌ WRONG: preprocessing before the CV split
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
scores = cross_val_score(model, X_scaled, y, cv=5)  # Each fold sees the other folds' statistics

# ✅ CORRECT: preprocessing inside a CV pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression()),
])
scores = cross_val_score(pipeline, X, y, cv=5)  # Scaling is refit per fold
```
## Pattern 10: Data Augmentation Leakage

```python
# ❌ WRONG: augment before the split (augmented twins of the same
# original row can land on both sides of the split)
X_augmented = augment_data(X)  # Augmentation sees all data
X_train, X_test, y_train, y_test = train_test_split(X_augmented, y)

# ✅ CORRECT: augment only the training data, after the split
X_train, X_test, y_train, y_test = train_test_split(X, y)
X_train_augmented = augment_data(X_train)  # Augment train only
# X_test remains unchanged
```
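`augment_data` above is a placeholder. As one concrete, hypothetical implementation (Gaussian jitter, applied only to training rows after the split):

```python
import numpy as np
from sklearn.model_selection import train_test_split

def augment_data(X, y, noise=0.05, seed=0):
    """Hypothetical augmentation: append a jittered copy of each row."""
    rng = np.random.default_rng(seed)
    X_aug = np.vstack([X, X + rng.normal(scale=noise, size=X.shape)])
    y_aug = np.concatenate([y, y])
    return X_aug, y_aug

X = np.arange(20, dtype=float).reshape(10, 2)
y = np.array([0, 1] * 5)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
X_train_aug, y_train_aug = augment_data(X_train, y_train)
# The test set is untouched; no augmented twin of a test row can reach training
```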
After any ML preprocessing or feature engineering, verify every feature and preprocessing step. When the model is deployed and receives a new data point at time T, ask:

- Can I query this value from the database?
- Can I compute this statistic using only data available before time T?
- Does this feature require knowing the outcome I'm trying to predict?

If the answer to either of the first two questions is "no," or to the third is "yes," you have data leakage.
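Part of this check can be automated: after fitting, compare a transformer's learned statistics with statistics recomputed from the training set alone. The helper below is hypothetical, not part of any library:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def fit_on_train_only(scaler, X_train, atol=1e-9):
    """Heuristic leak check: a leak-free StandardScaler's learned
    mean must equal the training set's column means."""
    return np.allclose(scaler.mean_, X_train.mean(axis=0), atol=atol)

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
X_train, X_test = train_test_split(X, test_size=0.3, random_state=1)

leaky = StandardScaler().fit(X)        # statistics include test rows
clean = StandardScaler().fit(X_train)  # train rows only
print(fit_on_train_only(leaky, X_train), fit_on_train_only(clean, X_train))
```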
**Input:** Code that normalizes data before the train-test split

```python
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_normalized = scaler.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_normalized, y, test_size=0.2)
```

**Leakage Detection:**
- `fit_transform` is called on the entire dataset `X`, so the scaler's statistics include the test set

**Corrected Code:**

```python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
scaler = StandardScaler()
X_train_normalized = scaler.fit_transform(X_train)  # Fit on train only
X_test_normalized = scaler.transform(X_test)        # Transform using train statistics
```
**Input:** Code that fills missing values with the global mean

```python
df = pd.read_csv('data.csv')
df['income'].fillna(df['income'].mean(), inplace=True)
X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.2)
```

**Leakage Detection:**
- The global mean of `income` is computed over all rows, including the rows that will become the test set

**Corrected Code:**

```python
df = pd.read_csv('data.csv')
X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.2)
train_mean = X_train['income'].mean()  # Compute the mean from training data only
X_train['income'] = X_train['income'].fillna(train_mean)
X_test['income'] = X_test['income'].fillna(train_mean)  # Use the training mean for test
```
**Input:** Stock price prediction with a random split and a rolling-average feature

```python
df['rolling_avg_7d'] = df.groupby('stock')['price'].rolling(7, center=True).mean()
X_train, X_test = train_test_split(df, test_size=0.2, random_state=42)
```

**Leakage Detection:**
- `center=True` makes the rolling window use future prices (up to t+3 days) to compute the feature at time t
- A random split puts future rows into the training set

**Corrected Code:**

```python
# Fix 1: backward-looking rolling window (no center=True)
df = df.sort_values(['stock', 'date'])
df['rolling_avg_7d'] = (
    df.groupby('stock')['price']
      .rolling(7, min_periods=1).mean()
      .reset_index(0, drop=True)
)

# Fix 2: time-based split instead of a random split
split_date = '2024-01-01'
X_train = df[df['date'] < split_date]
X_test = df[df['date'] >= split_date]
```
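To confirm the corrected feature is backward-looking, it can be checked on a tiny hand-made frame (the prices below are invented for illustration):

```python
import pandas as pd

prices = pd.DataFrame({
    'stock': ['A'] * 5,
    'price': [10.0, 12.0, 11.0, 13.0, 14.0],
})
prices['rolling_avg_7d'] = (
    prices.groupby('stock')['price']
          .rolling(7, min_periods=1).mean()
          .reset_index(0, drop=True)
)
# Row t averages only rows 0..t: [10.0, 11.0, 11.0, 11.5, 12.0]
```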
**Input:** Category encoding using the target mean from the full dataset

```python
category_target_mean = df.groupby('category')['target'].mean()
df['category_encoded'] = df['category'].map(category_target_mean)
X_train, X_test, y_train, y_test = train_test_split(df.drop('target', axis=1), df['target'])
```

**Leakage Detection:**
- The per-category target means are computed over all rows, so test-set targets leak into the training features

**Corrected Code:**

```python
X_train, X_test, y_train, y_test = train_test_split(df.drop('target', axis=1), df['target'])

# Compute the target mean only from training data
train_df = X_train.copy()
train_df['target'] = y_train
category_target_mean = train_df.groupby('category')['target'].mean()

X_train['category_encoded'] = X_train['category'].map(category_target_mean)
X_test['category_encoded'] = X_test['category'].map(category_target_mean)
```
**Input:** Predicting customer churn using "number of retention calls" as a feature

```python
features = ['account_age', 'monthly_spend', 'support_tickets', 'retention_calls_count']
X = df[features]
y = df['churned']
```

**Leakage Detection:**
- Retention calls are made in response to churn risk or a cancellation request, so the feature encodes the very outcome the model is supposed to predict

**Corrected Code:**

```python
# Remove post-event features; use only pre-event features
features = ['account_age', 'monthly_spend', 'support_tickets', 'login_frequency',
            'feature_usage_decline', 'payment_delays']
X = df[features]
y = df['churned']
```
Severity levels:

- **CRITICAL** (model is completely invalid): post-event features, target values leaking into features, time series features computed from future data
- **HIGH** (significantly inflated performance): preprocessing, feature selection, or target encoding fit on the full dataset; random splits on temporal data
- **MEDIUM** (moderate performance inflation): global missing-value imputation, data augmentation before the split
- references/leakage-patterns.md: Comprehensive catalog of leakage patterns
- references/temporal-leakage.md: Time series specific leakage issues
- references/detection-strategies.md: How to detect leakage in existing code