Structured checklist for building and verifying ML training pipelines. Every training script must satisfy each applicable item before it ships. Items marked (conditional) apply only when the stated condition is true.
Framework: PyTorch Lightning is the default training framework for all neural network training (SL and RL). Tree-based models (XGBoost, LightGBM, CatBoost) use their native APIs directly.
Field references (read on demand for detailed field tables):
- Config is a @dataclass inheriting from the correct level in the hierarchy (Base → SL Neural / SL Tree / RL → task-specific). See references/config-system.md for the full hierarchy.
- to_dict() / from_dict() round-trip via JSON.
- --config flag selects among config classes by name. Per-field CLI overrides apply on top. Priority: dataclass defaults < config class < CLI flags.
- Wall-clock budget enforced (max_wall_clock_hours).
- num_epochs is set high (effectively unlimited) so early stopping is the binding constraint — not a hard epoch cap.
- Experiment variants are @dataclass classes inheriting from a base, overriding only differing fields. Includes: baseline, single-dimension variants, and at least one aggressive multi-dimension variant.
- partN/ layout: config.py, data.py, model.py, train.py. See references/codebase-structure.md.
- Shared code lives in src/ — config hierarchy, W&B integration (src/wandb_utils.py), system metrics. Never duplicated per-part.
- Each part's train.py entry point is main().

The model is a lightning.LightningModule:
- __init__: store hyperparams via self.save_hyperparameters(), build layers
- forward(): pure forward pass (inference)
- training_step(batch, batch_idx): forward + loss, return loss. Log via self.log()
- validation_step(batch, batch_idx): forward + loss + eval metrics. Log via self.log()
- configure_optimizers(): return optimizer and optional LR scheduler dict

Data handling is a lightning.LightningDataModule:
- setup(stage): load/split datasets
- train_dataloader(): return training DataLoader
- val_dataloader(): return validation DataLoader
- test_dataloader(): return test DataLoader (optional)

model.py additionally exposes save_model() and load_model_from_checkpoint() for standalone checkpoint loading outside Lightning.

Wire up W&B via WandbLogger:
```python
from lightning.pytorch.loggers import WandbLogger

logger = WandbLogger(project="my-project", name=cfg.name, log_model=False)
trainer = Trainer(logger=logger)
```
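The config round-trip requirement above (to_dict() / from_dict() via JSON) can be sketched with stdlib dataclasses. BaseConfig here is a hypothetical example, not the repo's actual class:

```python
import json
from dataclasses import dataclass, asdict, fields

@dataclass
class BaseConfig:
    """Hypothetical config illustrating the JSON round-trip requirement."""
    seed: int = 42
    lr: float = 1e-3

    def to_dict(self):
        return asdict(self)

    @classmethod
    def from_dict(cls, d):
        # Ignore unknown keys so older snapshots still load.
        names = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in d.items() if k in names})

cfg = BaseConfig(lr=3e-4)
restored = BaseConfig.from_dict(json.loads(json.dumps(cfg.to_dict())))
assert restored == cfg
```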
- Call self.save_hyperparameters() in LightningModule.__init__().
- Log metrics with self.log() and self.log_dict() inside training_step / validation_step:
```python
self.log("train_loss", loss, on_step=True, on_epoch=True, prog_bar=True)
self.log_dict({"val_loss": val_loss, "val_rmse": rmse}, on_epoch=True)
```
Metric namespaces:
- Per-epoch core metrics (train_loss, val_loss, eval metrics, lr)
- batch/: per-step metrics (use on_step=True)
- timing/: per-epoch timing breakdown (see Phase 3.5)
- tracking/: per-epoch (best_{metric}, epochs_since_improvement)
- system/: per-epoch (GPU/CPU/RAM — Lightning logs GPU metrics automatically when log_every_n_steps is set)
- Run metadata via logger.experiment.config.update() or self.log_dict() at start: total_params, trainable_params, num_train_samples, num_val_samples, gpu_name.
- Checkpoint artifacts via WandbLogger(log_model="all") or manual wandb.log_artifact() — once per run at end of training.

Profile with Lightning's built-in profilers:

```python
from lightning.pytorch.profilers import SimpleProfiler

trainer = Trainer(profiler=SimpleProfiler())  # or AdvancedProfiler() for per-function breakdown
```
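Beyond the built-in profilers, per-phase wall-clock accounting can be accumulated manually and converted to percentages. An illustrative stdlib sketch (not Lightning code; class and method names are assumptions):

```python
import time
from collections import defaultdict

class PhaseTimer:
    """Accumulates wall-clock seconds per named phase (illustrative sketch)."""
    def __init__(self):
        self.totals = defaultdict(float)
        self._starts = {}

    def start(self, phase):
        self._starts[phase] = time.perf_counter()

    def stop(self, phase):
        self.totals[phase] += time.perf_counter() - self._starts.pop(phase)

    def percentages(self):
        # Share of total iteration time per phase, e.g. for timing/*_pct metrics.
        total = sum(self.totals.values()) or 1.0
        return {k: 100.0 * v / total for k, v in self.totals.items()}
```

In a callback, start()/stop() would bracket each phase and the results would be logged under the timing/ namespace.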
Custom timing callback (hooks: on_train_batch_start/end, on_validation_start/end):
- timing/env_step_seconds — environment stepping (RL: time inside env.step() or collector.rollout())
- timing/inference_seconds — policy forward pass
- timing/backward_seconds — loss computation + backward pass + optimizer step
- timing/data_seconds — data loading, replay buffer sampling
- timing/overhead_seconds — everything else (logging, checkpointing)
- Log under the timing/ namespace via self.log() in callbacks.
- With parallel envs, timing/env_step_seconds measures the full ParallelEnv.step() call including inter-process communication.
- Also log timing/env_step_pct, timing/backward_pct etc. as percentage of total iteration time.

Mixed precision:

```python
trainer = Trainer(precision="bf16-mixed")
```

- Not fp16 — bf16 has the same exponent range as fp32, so no GradScaler is needed. Lightning handles the AMP context automatically.
- No manual torch.amp.autocast() calls in model code — Lightning wraps training_step and validation_step automatically.
- (conditional: hardware lacks bf16 support) precision="32-true" is set instead.

Use ParallelEnv (not SerialEnv) to pipeline CPU env stepping with GPU policy inference:
```python
from torchrl.envs import ParallelEnv

env = ParallelEnv(
    num_workers=config.num_envs,
    create_env_fn=lambda: make_env(config),
)
```
- Wrap non-picklable env factories in CloudpickleWrapper, not a lambda capturing local state.
- Ensure env.close() is called during cleanup to terminate worker processes (prevents zombie processes).
- With physics simulators, ParallelEnv is especially critical — the GPU sits idle during serial physics stepping.
- Append a RewardSum transform: env.append_transform(RewardSum()). Without it, TorchRL collectors never populate episode_reward and monitoring is blind.
- SyncDataCollector silently drops keys not in observation_spec. Add Unbounded specs for diagnostic fields.
- Use env.step_and_maybe_reset() (not env.step() + manual step_mdp()) — ParallelEnv.step() does NOT auto-reset done environments.
- self._device must be a torch.device object, not a string — TorchRL's BatchedEnvBase._reset() calls .type on it.
- Cap BLAS/OpenMP threads:

```python
os.environ.setdefault("OPENBLAS_NUM_THREADS", "1")
os.environ.setdefault("MKL_NUM_THREADS", "1")
os.environ.setdefault("OMP_NUM_THREADS", "1")
```
- Read next-state data from batch["next"], not the batch root.
- Use the current TorchRL spec names: BoundedTensorSpec→Bounded, CompositeSpec→Composite, UnboundedContinuousTensorSpec→Unbounded.
- Keep an actual_updates counter for metric averaging — KL early stopping means actual_updates << num_epochs * num_batches.

Tune batch size with the BatchSizeFinder callback:
```python
from lightning.pytorch.callbacks import BatchSizeFinder

trainer = Trainer(callbacks=[BatchSizeFinder(mode="binsearch")])
```
The LightningModule or LightningDataModule must expose a batch_size attribute that BatchSizeFinder can modify. Typically set in config and passed through:
```python
class MyDataModule(L.LightningDataModule):
    def __init__(self, batch_size=32, ...):
        super().__init__()
        self.batch_size = batch_size

    def train_dataloader(self):
        return DataLoader(self.train_ds, batch_size=self.batch_size, ...)
```
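What binsearch mode does can be illustrated with a pure-Python sketch: grow the batch size until it no longer fits, then binary-search the boundary. Here fits() stands in for an OOM-probing trial run; this is not Lightning's actual code:

```python
def find_max_batch(fits, start=2, limit=2**16):
    """Illustrative binary search for the largest batch size that fits.
    `fits(n)` returns True if batch size n runs without OOM."""
    lo, hi = 0, start
    # Grow geometrically until a batch size fails (or the limit is hit).
    while hi < limit and fits(hi):
        lo, hi = hi, hi * 2
    if hi >= limit:
        return lo
    # Binary search in the open interval (lo, hi).
    while lo + 1 < hi:
        mid = (lo + hi) // 2
        if fits(mid):
            lo = mid
        else:
            hi = mid
    return lo
```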
- No custom probe_auto_batch_size() functions — use BatchSizeFinder instead.
- Keep gradient_accumulation_steps available to extend the effective batch beyond the VRAM ceiling:

```python
trainer = Trainer(accumulate_grad_batches=cfg.gradient_accumulation_steps)
```
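Choosing the accumulation factor for a target effective batch is simple arithmetic (effective batch = per-step batch × accumulation steps); accumulation_steps_for below is a hypothetical helper:

```python
import math

def accumulation_steps_for(target_effective_batch, max_fitting_batch):
    """Smallest number of accumulation steps so that
    max_fitting_batch * steps >= target_effective_batch."""
    return max(1, math.ceil(target_effective_batch / max_fitting_batch))
```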
- cleanup_vram() is called between configs: delete model and trainer → torch.cuda.empty_cache() → gc.collect().

main() follows this sequence:
1. lightning.seed_everything(cfg.seed)
2. Build the LightningDataModule (handles data loading)
3. Build the LightningModule (handles model + optimizer + scheduler)
4. Create the WandbLogger
5. Create the Trainer with all settings
6. trainer.fit(model, datamodule=dm) — handles the entire training loop
7. trainer.test(model, datamodule=dm, ckpt_path="best") — final eval on best checkpoint
8. wandb.finish()

Standard callbacks wired into the Trainer:
```python
from lightning.pytorch.callbacks import (
    EarlyStopping, ModelCheckpoint, BatchSizeFinder,
    LearningRateMonitor, RichProgressBar,
)

callbacks = [
    EarlyStopping(monitor="val_loss", patience=cfg.patience, mode="min"),
    ModelCheckpoint(monitor="val_loss", mode="min", save_top_k=1, save_last=True),
    BatchSizeFinder(mode="binsearch"),
    LearningRateMonitor(logging_interval="epoch"),
]
trainer = Trainer(callbacks=callbacks, ...)
```
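The patience semantics used by EarlyStopping above can be illustrated with a stdlib sketch (not Lightning's implementation; mode="min" assumed):

```python
class EarlyStopper:
    """Illustrative patience-based early stopping for a minimized metric."""
    def __init__(self, patience=5, min_delta=0.0):
        self.patience, self.min_delta = patience, min_delta
        self.best = float("inf")
        self.bad_cycles = 0

    def update(self, val_loss):
        """Call once per eval cycle; returns True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best, self.bad_cycles = val_loss, 0
        else:
            self.bad_cycles += 1
        return self.bad_cycles >= self.patience
```

Patience counts eval cycles without an improvement of at least min_delta, not raw epochs.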
- Early stopping via the EarlyStopping callback. Patience is measured in eval cycles. Optional min_delta sets a minimum improvement threshold.
- Keep both the best model (via ModelCheckpoint(save_top_k=1)) and the last model (via save_last=True).
- Check for a STOP file between epochs via on_train_epoch_end:
```python
from pathlib import Path

class StopFileCallback(L.Callback):
    def on_train_epoch_end(self, trainer, pl_module):
        if Path("STOP").exists():
            trainer.should_stop = True
```
- A graceful stop sets trainer.should_stop = True, which lets Lightning finish the current epoch and run cleanup.

Wrap main() with GpuLock() in the if __name__ == "__main__": block:
```python
if __name__ == "__main__":
    from src.utils.gpu_lock import GpuLock

    with GpuLock():
        main()
```
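A minimal sketch of what such a lock might look like using POSIX flock; this is illustrative only — the real implementation lives in src/utils/gpu_lock.py and is not shown here, and the lock path is an assumption:

```python
import fcntl

class FileLock:
    """Illustrative blocking file lock; concurrent holders queue on flock."""
    def __init__(self, path="/tmp/gpu-task.lock"):
        self.path = path

    def __enter__(self):
        self._fh = open(self.path, "w")
        fcntl.flock(self._fh, fcntl.LOCK_EX)  # blocks until the lock is free
        return self

    def __exit__(self, *exc):
        fcntl.flock(self._fh, fcntl.LOCK_UN)
        self._fh.close()
```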
- GpuLock uses flock on /tmp/gpu-task.lock. Concurrent GPU tasks queue (they don't error).
- Before launching, run ps aux | grep -E "python.*train" | grep -v grep to confirm the GPU is free.
- Launch long runs with nohup and unbuffered output:

```bash
PYTHONUNBUFFERED=1 nohup <command> > output/<descriptive_log>.txt 2>&1 &
```
- Monitor via tail -f output/<log>.txt or /loop 10m /babysit-training.
- Each run writes to {output.base_dir}/{name}_{YYYYMMDD_HHMMSS}/:

```python
trainer = Trainer(default_root_dir=run_dir)
```
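The run-directory naming scheme above can be sketched as follows; make_run_dir is a hypothetical helper, not a function from the repo:

```python
from datetime import datetime
from pathlib import Path

def make_run_dir(base_dir, name):
    """Create {base_dir}/{name}_{YYYYMMDD_HHMMSS}/ and return it."""
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    run_dir = Path(base_dir) / f"{name}_{stamp}"
    run_dir.mkdir(parents=True, exist_ok=True)
    return run_dir
```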
- The run dir contains config.json (full snapshot saved manually), console.log, and Lightning checkpoint files.
- ModelCheckpoint saves to {run_dir}/checkpoints/: best.ckpt, last.ckpt.
- Per-epoch metrics are appended to metrics.jsonl via a custom callback.
- Also saved: metrics.json (final summary) and plots/ (feature importance, pred vs actual, SHAP).

Sweeps (conditional: search space > 3 variants):
- The sweep entry point lives at partN/sweep.py.
- main_with_config(cfg) is extracted from main() so sweeps can call the pipeline programmatically.
- setup_run() is sweep-aware: it detects wandb.run.sweep_id, skips wandb.init(), and updates the config instead.
- An architecture preset parameter is decoded into config fields.
- --max-hours bounds total sweep wall clock (not per trial).
- Launch inside tmux: tmux new -s sweep && python partN/sweep.py --max-hours 12.
- sweep_train() catches exceptions explicitly to unwind the stack before cleanup_vram(). Double gc.collect() pass for reference cycles.

Babysit long runs with /loop 10m /babysit-training:
- Covers: process health, metric trending, GPU/system checks, checkpoint integrity, hung process detection, auto-restart from checkpoint, issue documentation.
- Findings are written to logs/ immediately (not batched).
- Results are recorded under experiments/.
- Issues are documented in issues/ before or alongside the fix (each entry has a type, status, and type-specific properties).
- Numerical guards: isfinite(loss) before backward, isfinite(grad_norm) after clipping, per-batch KL early stopping.
- Verify clip_grad_norm_() is actually called in _update() for both critic and actor — the config field alone is not enough.
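The isfinite guards above can be sketched as follows (stdlib floats for illustration; real training code would use torch.isfinite on tensors):

```python
import math

def check_finite(loss, grad_norm):
    """Abort the update on non-finite loss or post-clipping gradient norm."""
    if not math.isfinite(loss):
        raise FloatingPointError(f"non-finite loss: {loss}")
    if not math.isfinite(grad_norm):
        raise FloatingPointError(f"non-finite grad norm after clipping: {grad_norm}")
```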