Generate and run robust pretraining pipelines with background execution, live monitoring, and checkpoint-safe recovery.
Use this skill for end-to-end model training execution, optimized for from-scratch pretraining.
Generate/adapt training workflows for:
Training scripts should include at minimum:
- Use `ssh` to run training commands on provisioned instances.
- Use `screen` or `tmux` to keep processes alive across disconnects.
- Use `tail -f` to follow training logs.

For Kaggle kernels:

- Create `kernel-metadata.json` with `enable_gpu: true` and `enable_internet: true`.
- Run `kaggle kernels push -p <path>` to start training.
- Check status with `kaggle kernels status <user>/<kernel>`.
- Fetch logs and artifacts with `kaggle kernels output <user>/<kernel>`.

For long runs, prefer background terminal execution:
- Launch with the `exec` tool and `background: true`.
- Set `pty: true` for interactive-safe output.

This keeps the agent responsive while training proceeds asynchronously.
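Outside the agent environment, an equivalent detached launch can be sketched in plain Python (`launch_background` is a hypothetical helper, not part of the skill's tooling):

```python
import subprocess

def launch_background(cmd, log_path):
    """Start a long-running command detached from the current session;
    stdout/stderr go to a log file that can be tailed later."""
    log = open(log_path, "ab")
    proc = subprocess.Popen(
        cmd,
        stdout=log,
        stderr=subprocess.STDOUT,
        start_new_session=True,  # detach from the controlling terminal, like nohup
    )
    log.close()  # the child process holds its own copy of the file descriptor
    return proc
```

The returned `Popen` handle can be polled for liveness while the log file is monitored separately.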
Use `process:log` to stream and inspect ongoing training output.
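Where the `process:log` tool is unavailable, incremental log streaming can be approximated with a plain Python poller (the function name and approach below are illustrative):

```python
def read_new_lines(log_path, offset=0):
    """Return (complete_lines, new_offset) for bytes appended since `offset`.

    Call repeatedly to approximate `tail -f` without blocking.
    """
    with open(log_path, "rb") as f:
        f.seek(offset)
        chunk = f.read()
    cut = chunk.rfind(b"\n") + 1  # hold back a partial trailing line for the next poll
    lines = chunk[:cut].decode("utf-8", errors="replace").splitlines()
    return lines, offset + cut
```

Tracking the byte offset between polls avoids re-reading the whole file as logs grow.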
Parse and report key metrics:
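As a sketch, assuming a hypothetical `step … loss=… lr=…` log format (real trainers vary, so the regex must be adapted to the actual output):

```python
import re

# Hypothetical log-line format; adjust the pattern to the trainer's real output.
LINE_RE = re.compile(
    r"step\s+(?P<step>\d+).*?loss[=:\s]+(?P<loss>[0-9.]+(?:e[-+]?\d+)?|nan)"
    r".*?lr[=:\s]+(?P<lr>[0-9.eE+-]+)",
    re.IGNORECASE,
)

def parse_metrics(line: str):
    """Extract step, loss, and learning rate from one log line, or None."""
    m = LINE_RE.search(line)
    if not m:
        return None
    return {
        "step": int(m.group("step")),
        "loss": float(m.group("loss")),
        "lr": float(m.group("lr")),
    }
```

Parsed records can then feed both `training_status.md` and the divergence checks below.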
If divergence appears (NaN, exploding loss, repeated OOM), propose immediate mitigation and recovery.
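A minimal divergence check along these lines (the thresholds are illustrative assumptions, not values prescribed by the skill):

```python
import math

def check_divergence(loss_history, spike_ratio=3.0, window=20):
    """Return a reason string if training looks divergent, else None.

    Heuristics: any non-finite loss, or the latest loss exceeding
    `spike_ratio` times the median of the recent window.
    """
    if not loss_history:
        return None
    latest = loss_history[-1]
    if math.isnan(latest) or math.isinf(latest):
        return "non-finite loss"
    recent = sorted(loss_history[-window:-1]) if len(loss_history) > 1 else []
    if recent:
        median = recent[len(recent) // 2]
        if median > 0 and latest > spike_ratio * median:
            return f"loss spike: {latest:.3f} vs recent median {median:.3f}"
    return None
```

A positive result would trigger the mitigation path: pause, roll back to the last good checkpoint, and lower the learning rate or adjust the batch size before resuming.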
Support staged curricula:
For each phase, define target steps/tokens, LR policy, and success criteria.
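One way to encode such a phase table — the names, step counts, LR policies, and criteria below are placeholders:

```python
# Illustrative curriculum; every value here is a placeholder to tune per run.
PHASES = [
    {"name": "warmup", "end_step": 2_000,   "lr_policy": "linear-warmup", "success": "loss < 6.0"},
    {"name": "core",   "end_step": 180_000, "lr_policy": "cosine",        "success": "loss < 2.8"},
    {"name": "anneal", "end_step": 200_000, "lr_policy": "linear-decay",  "success": "val loss plateau"},
]

def current_phase(step: int):
    """Map a global step to its curriculum phase (None once all phases end)."""
    for phase in PHASES:
        if step < phase["end_step"]:
            return phase
    return None
```

Keeping the schedule as data makes it easy to report the active phase and its success criterion in status updates.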
Append compute cost entries to the project-level `cost_ledger.json`:
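A minimal append helper, assuming a simple JSON-array schema for `cost_ledger.json` (the project's actual schema may differ):

```python
import json
import os
import time

def append_cost_entry(ledger_path, provider, gpu_type, hours, usd):
    """Append one compute-cost record; the field names are an assumed schema."""
    entries = []
    if os.path.exists(ledger_path):
        with open(ledger_path) as f:
            entries = json.load(f)
    entries.append({
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "provider": provider,
        "gpu_type": gpu_type,
        "hours": hours,
        "usd": usd,
    })
    with open(ledger_path, "w") as f:
        json.dump(entries, f, indent=2)
    return entries
```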
Artifacts to maintain:

- `train.sh` or equivalent launcher script
- `training_config.yaml` (or framework-native config)
- `training_status.md` (current metrics + risk notes)
- `checkpoints_index.json`
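A sketch of maintaining `checkpoints_index.json` with best-plus-recent retention (the schema and pruning policy are assumptions, not mandated by the skill):

```python
import json
import os

def register_checkpoint(index_path, step, path, val_loss, keep_last=5):
    """Record a checkpoint and prune old entries (schema illustrative)."""
    index = []
    if os.path.exists(index_path):
        with open(index_path) as f:
            index = json.load(f)
    index.append({"step": step, "path": path, "val_loss": val_loss})
    index.sort(key=lambda e: e["step"])
    # Keep the best-validation entry plus the `keep_last` most recent checkpoints.
    best = min(index, key=lambda e: e["val_loss"])
    kept = index[-keep_last:]
    if best not in kept:
        kept = [best] + kept
    with open(index_path, "w") as f:
        json.dump(kept, f, indent=2)
    return kept
```

Always retaining the best-validation checkpoint alongside recent ones is what makes recovery after divergence checkpoint-safe.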