Name: ML Pipeline Skill
Author: Phife726

ML Pipeline Skill

Builds reproducible, production-quality machine learning pipelines using scikit-learn's Pipeline and ColumnTransformer. Use this skill whenever the user needs to chain preprocessing and modeling steps together, prevent data leakage, handle mixed feature types (numeric + categorical), serialize a trained model, build a reusable ML workflow, or automate feature engineering within a pipeline. Trigger when the user says "build a pipeline", "automate my preprocessing", "I keep getting data leakage", "make this reproducible", "deploy this model", "save my model", "chain these steps together", or when you notice the user is fitting transformers on the full dataset before splitting (a leakage red flag). Also use when the user wants to do GridSearchCV across both preprocessing and model parameters simultaneously.

Phife726, 0 stars, 2026/03/09

Occupation
Category: Machine Learning

This skill helps build end-to-end machine learning pipelines that are reproducible, leak-free, and production-ready. If the user is running preprocessing and modeling as separate ad-hoc steps, this skill shows how to combine them into a single pipeline object that is fit on training data only.

Why Pipelines Matter

One of the most common and most damaging mistakes in applied ML is data leakage: information from the test set contaminating the training process. It happens silently whenever you:

Scale features using statistics computed on the full dataset
Impute missing values with the overall mean before splitting
Encode categories using frequencies from all rows
Select features based on correlations computed on all data
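
The first item in the list above can be made concrete with a small sketch. It assumes synthetic NumPy data and scikit-learn's StandardScaler; the variable names (leaky_scaler, clean_scaler, mean_gap) are illustrative and not part of the original skill. It only shows that a scaler fit on all rows learns different statistics from one fit on the training rows alone.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data, for illustration only (an assumption, not from the skill).
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y = (X[:, 0] > 5.0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Leaky: the scaler sees every row, including rows that end up in the test set.
leaky_scaler = StandardScaler().fit(X)

# Leak-free: the scaler sees only the training rows.
clean_scaler = StandardScaler().fit(X_train)

# The per-feature gap between the two learned means is information
# that reached the leaky scaler from the test rows.
mean_gap = np.abs(leaky_scaler.mean_ - clean_scaler.mean_)
print(mean_gap)
```

The gap is usually small on well-behaved data, which is part of why this kind of leakage goes unnoticed.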

Pipelines prevent leakage by bundling preprocessing and modeling into a single object. When you call pipeline.fit(X_train, y_train), every step sees only training data. When you call pipeline.predict(X_test), the same transformations are applied using parameters learned from training only.
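
A minimal sketch of the behavior described above, assuming synthetic data and a simple scaler-plus-classifier pipeline. The step names "scaler" and "model" and the choice of LogisticRegression are illustrative assumptions, not the skill's prescribed setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, for illustration only (an assumption, not from the skill).
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Split first, so nothing downstream ever sees the test rows during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Preprocessing and model bundled into one object.
pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("model", LogisticRegression()),
])

# fit() fits the scaler on X_train, transforms X_train, then fits the model.
pipeline.fit(X_train, y_train)

# predict() reuses the training-time scaler parameters on X_test.
predictions = pipeline.predict(X_test)

# The fitted scaler's mean matches the training data, not the full dataset.
fitted_scaler = pipeline.named_steps["scaler"]
scaler_uses_train_only = np.allclose(fitted_scaler.mean_, X_train.mean(axis=0))
test_accuracy = pipeline.score(X_test, y_test)
print(scaler_uses_train_only, test_accuracy)
```

Because the scaler lives inside the pipeline, the same object can be passed to cross-validation utilities and each fold refits the scaler on that fold's training portion only.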

Building a Pipeline Step-by-Step

Step 1: Identify Feature Types

Step 2: Build Preprocessing Transformers

Step 3: Create the Full Pipeline

Step 4: Train and Evaluate

Step 5: Hyperparameter Tuning Through the Pipeline

Step 6: Custom Transformers

Step 7: Serialize the Pipeline

Common Pipeline Patterns

Red Flags to Watch For

Output Checklist
