Complete machine learning pipeline for trading: feature engineering, AutoML, deep learning, and financial RL. Use for automated parameter sweeps, feature creation, model training, and anti-leakage validation.
Unified skill for the complete ML pipeline within a quant trading research system.
Consolidates eight prior skills into a single authoritative reference covering
the full lifecycle: data validation, feature creation, selection,
transformation, anti-leakage checks, pipeline automation, deep learning optimization, and deployment.
1. When to Use
Activate this skill when the task involves any of the following:
Creating, selecting, or transforming features for an ML-driven strategy.
Auditing an existing feature pipeline for data leakage or overfitting risk.
Automating an end-to-end ML pipeline (data prep through model export).
Evaluating feature importance, scaling, encoding, or interaction effects.
Integrating features with a feature store (Feast, Tecton, custom Parquet store).
Explaining core ML concepts (bias-variance, cross-validation, regularisation)
in the context of feature engineering decisions.
2. Inputs to Gather
Before starting work, collect or confirm:
| Input | Details |
| --- | --- |
| Objective | Target metric (Sharpe, accuracy, RMSE, ...), constraints, time horizon. |
| Data | Symbols / instruments, timeframe, bar type, sampling frequency, data sources. |
| Leakage risks | Point-in-time concerns, survivorship bias, look-ahead in labels or features. |
| Compute budget | CPU/GPU limits, wall-clock budget for AutoML search. |
| Latency | Online vs. offline inference, acceptable prediction latency. |
| Interpretability | Regulatory or research need for explainable features / models. |
| Deployment target | Where the model will run (notebook, backtest harness, live engine). |
3. Feature Creation Patterns
3.1 Numerical Features
Interaction terms: price * volume, high / low, close - open.
Rolling statistics: mean, std, skew, kurtosis over configurable windows.
Permutation importance (for feature selection): model-agnostic; compute on out-of-fold predictions.
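The interaction-term and rolling-statistic patterns above can be sketched as follows; the column names and window length are illustrative, not prescribed by this skill:

```python
import pandas as pd
import numpy as np

# Synthetic OHLCV bars (hypothetical data) to illustrate the patterns above.
rng = np.random.default_rng(0)
n = 100
df = pd.DataFrame({
    "open":   100 + rng.normal(0, 1, n).cumsum(),
    "close":  100 + rng.normal(0, 1, n).cumsum(),
    "high":   101 + rng.normal(0, 1, n).cumsum(),
    "low":    99 + rng.normal(0, 1, n).cumsum(),
    "volume": rng.integers(1_000, 10_000, n).astype(float),
})

# Interaction terms: price * volume, high / low, close - open
df["dollar_volume"] = df["close"] * df["volume"]
df["hl_ratio"] = df["high"] / df["low"]
df["co_change"] = df["close"] - df["open"]

# Rolling statistics over a configurable window, shifted one bar
# so the value at t uses data through t-1 only (see Section 4.2).
window = 20
for stat in ("mean", "std", "skew", "kurt"):
    df[f"ret_{stat}_{window}"] = (
        df["close"].pct_change().rolling(window).agg(stat).shift(1)
    )
```

The `.shift(1)` at the end of each rolling feature is what keeps the creation step consistent with the anti-leakage checks in Section 4.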
4. Anti-Leakage Checks
Data leakage is the single most common cause of inflated backtest results.
Apply these checks at every pipeline stage:
4.1 Label Leakage
Labels must be computed from future returns relative to the feature
timestamp. Verify that the label window does not overlap the feature window.
Use purging and embargo when labels span multiple bars.
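A minimal sketch of a leakage-safe label: the label at time t is computed from the forward return over the next `horizon` bars, so it uses only information strictly after t (the horizon and data here are illustrative):

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(1)
close = pd.Series(100 * np.exp(rng.normal(0, 0.01, 50).cumsum()), name="close")

horizon = 5
# Forward return: close[t + horizon] / close[t] - 1
fwd_ret = close.shift(-horizon) / close - 1

# Binary label; bars without a full forward window get NaN, not a guess.
label = (fwd_ret > 0).astype(int).where(fwd_ret.notna())

# The last `horizon` bars have no label -- drop them rather than peeking.
labeled = pd.DataFrame({"close": close, "label": label}).dropna()
```

Dropping the final `horizon` bars is the simplest form of the no-overlap rule; purging and embargo generalize it when labels span multiple bars.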
4.2 Feature Leakage
No feature may use information from time t+1 or later at prediction time t.
Rolling statistics must exclude the current bar so the value at t uses data through t-1 only, e.g. df['feat'].rolling(20).mean().shift(1).
Target-encoded categoricals must be computed on the training fold only.
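Both rules can be sketched together; the split point, column names, and categories below are hypothetical:

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(2)
df = pd.DataFrame({
    "close":  100 + rng.normal(0, 1, 60).cumsum(),
    "sector": rng.choice(["tech", "energy", "util"], 60),
    "y":      rng.normal(0, 1, 60),
})

# Leakage-safe rolling feature: the value at t uses bars up to t-1 only.
df["roll_mean_20"] = df["close"].rolling(20).mean().shift(1)

# Target encoding computed on the training fold only, then mapped onto
# the held-out fold (chronological split, no shuffle).
split = 40
train, test = df.iloc[:split], df.iloc[split:]
enc = train.groupby("sector")["y"].mean()   # fit on train only
global_mean = train["y"].mean()             # fallback for unseen categories
test_encoded = test["sector"].map(enc).fillna(global_mean)
```

Fitting `enc` on the full DataFrame instead of `train` is the classic target-encoding leak: the test fold's own labels would flow into its features.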
4.3 Cross-Validation Leakage
Use purged k-fold or walk-forward CV for time-series. Never use random
k-fold on ordered data.
Insert an embargo gap between train and test folds to prevent bleed-through
from autocorrelation.
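A minimal walk-forward splitter with an embargo gap can be sketched as below; the function name and window sizes are illustrative, not part of any library API:

```python
import numpy as np

def walk_forward_splits(n, train_size, test_size, embargo):
    """Yield (train_idx, test_idx) pairs, rolling a fixed-width training
    window forward with an embargo gap between train and test."""
    start = 0
    while start + train_size + embargo + test_size <= n:
        train_idx = np.arange(start, start + train_size)
        test_start = start + train_size + embargo
        test_idx = np.arange(test_start, test_start + test_size)
        yield train_idx, test_idx
        start += test_size  # roll the window forward by one test period

splits = list(walk_forward_splits(n=100, train_size=40, test_size=10, embargo=5))
```

Because the folds are generated in time order with a gap of `embargo` bars, no test observation is adjacent to (and autocorrelated with) the training window's edge.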
4.4 Survivorship & Selection Bias
Ensure the universe of instruments at time t reflects what was actually
tradable at that time, including stocks that were later delisted or halted.
Backfill from point-in-time databases where available.
4.5 Validation Checklist
Run before every backtest:
[ ] Labels computed strictly from future returns (no overlap with features)
[ ] All rolling features shifted by at least 1 bar
[ ] Target encoding uses in-fold means only
[ ] Walk-forward or purged CV used (no random shuffle on time-series)
[ ] Embargo gap >= max(label_horizon, autocorrelation_lag)
[ ] Universe is point-in-time (no survivorship bias)
[ ] No global scaling fitted on full dataset (fit on train, transform test)
5. Pipeline Automation (AutoML)
5.1 Prerequisites
Python environment with one or more AutoML libraries:
Auto-sklearn, TPOT, H2O AutoML, PyCaret, Optuna, or custom Optuna pipelines.
Training data in CSV / Parquet / database.
Problem type identified: classification, regression, or time-series forecasting.
5.2 Pipeline Steps
| Step | Action |
| --- | --- |
| 1. Define requirements | Problem type, evaluation metric, time/resource budget, interpretability needs. |
deployment/ -- prediction API code, input validation, requirements.txt.
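The search that the libraries in 5.1 automate can be sketched library-agnostically. Below is a hedged, minimal version: a random search over a ridge penalty on toy data, scored on a time-ordered holdout (no shuffle, per Section 4). All names, the data, and the budget of 25 trials are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy regression data standing in for prepared training features/labels.
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 0.0]) + rng.normal(0, 0.1, 200)

def fit_ridge(X, y, alpha):
    """Closed-form ridge regression; stands in for any candidate model."""
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(k), X.T @ y)

def mse(X, y, w):
    return float(np.mean((X @ w - y) ** 2))

# Time-ordered holdout, matching the anti-leakage rules above.
split = 150
X_tr, y_tr, X_va, y_va = X[:split], y[:split], X[split:], y[split:]

# Random search over the regularization strength (log-uniform), keeping
# the configuration with the best validation score.
best = min(
    ((alpha, mse(X_va, y_va, fit_ridge(X_tr, y_tr, alpha)))
     for alpha in 10.0 ** rng.uniform(-4, 2, 25)),
    key=lambda t: t[1],
)
```

An AutoML library replaces the random draw with a smarter sampler (e.g. Optuna's TPE) and the single model family with a model search, but the loop shape, and the obligation to score on out-of-sample data only, is the same.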
6. Core ML Fundamentals (Feature-Engineering Context)
6.1 Bias-Variance Trade-off
More features increase model capacity (lower bias) but risk overfitting (higher variance).
Use regularisation (L1/L2), feature selection, or dimensionality reduction to manage the trade-off.
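As a sketch of L1 regularisation doubling as feature selection (synthetic data; the alpha value is an illustrative assumption, not a recommendation):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 10))
# Only the first three features carry signal; the remaining seven are noise.
y = X[:, 0] * 2 + X[:, 1] - X[:, 2] + rng.normal(0, 0.1, 300)

# L1 penalty drives the coefficients of uninformative features to zero.
lasso = Lasso(alpha=0.1).fit(X, y)
selected = np.flatnonzero(np.abs(lasso.coef_) > 1e-6)
```

In practice, alpha would itself be tuned with the walk-forward or purged CV schemes of 6.2, never on the test period.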
6.2 Evaluation Strategy
Walk-forward validation: the gold standard for time-series strategies.
Roll a fixed-width training window forward; test on the next out-of-sample period.
Monte Carlo permutation tests: shuffle labels and re-evaluate to estimate
the probability that observed performance is due to chance.
Combinatorial purged CV (CPCV): generate many train/test combinations with
purging for more robust performance estimates.
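The Monte Carlo permutation test above can be sketched in a few lines; the predictions here are synthetic stand-ins for a strategy's out-of-sample output:

```python
import numpy as np

rng = np.random.default_rng(5)

# Synthetic out-of-sample predictions with genuine skill (~80% accuracy).
y_true = rng.integers(0, 2, 500)
y_pred = np.where(rng.random(500) < 0.8, y_true, 1 - y_true)

observed = np.mean(y_pred == y_true)

# Null distribution: re-score against shuffled labels many times.
null = np.array([
    np.mean(y_pred == rng.permutation(y_true)) for _ in range(1000)
])

# One-sided p-value: probability of doing this well with no real signal.
p_value = (np.sum(null >= observed) + 1) / (len(null) + 1)
```

A p-value that is not small means the observed performance is consistent with chance, regardless of how good the headline metric looks.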
6.3 Feature Scaling
Fit scalers (StandardScaler, MinMaxScaler, RobustScaler) on the training set only.
Apply the same fitted scaler to validation and test sets.
RobustScaler is often preferred for financial data due to heavy tails.
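The fit-on-train-only rule looks like this in scikit-learn (heavy-tailed synthetic returns; the split point is illustrative):

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(6)
# Student-t draws mimic the heavy tails of financial returns.
X = rng.standard_t(df=3, size=(250, 4))

# Chronological split -- fit the scaler on the training window only.
split = 200
scaler = RobustScaler().fit(X[:split])
X_train = scaler.transform(X[:split])
X_test = scaler.transform(X[split:])   # same fitted scaler, never refit
```

Calling `fit` (or `fit_transform`) on the full dataset would let test-period statistics leak into the training features, which is exactly the last item on the 4.5 checklist.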
6.4 Handling Missing Data
Forward-fill then backward-fill for price data (backward-fill uses future values and therefore leaks; restrict it to gaps at the start of the series).
Indicator column for missingness can itself be informative.
Tree-based models can handle NaN natively; linear models cannot.
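The first two points can be sketched together on a toy price series:

```python
import pandas as pd
import numpy as np

s = pd.Series([100.0, np.nan, 102.0, np.nan, np.nan, 105.0], name="close")

# Capture the missingness indicator BEFORE imputation -- the gap itself
# may carry information (e.g. trading halts).
is_missing = s.isna().astype(int)

# Forward-fill first; the trailing backward-fill only matters for gaps at
# the very start of the series, where no past value exists yet.
filled = s.ffill().bfill()
```

For tree-based models, an alternative is to leave the NaNs in place and feed `is_missing` alongside the raw column.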
7. Workflow
For any feature engineering task, follow this sequence:
Restate the task in measurable terms (metric, constraints, deadline).