Generate and run robust pretraining pipelines with background execution, live monitoring, and checkpoint-safe recovery.
Use this skill for end-to-end model training execution, optimized for from-scratch pretraining.
Generate/adapt training workflows for:
Training scripts should include at minimum:
- Use `ssh` to run training commands on provisioned instances.
- Use `screen` or `tmux` to keep processes alive across disconnects.
- Use `tail -f` to follow training logs.

For Kaggle kernels:

- Create `kernel-metadata.json` with `enable_gpu: true` and `enable_internet: true`.
- Run `kaggle kernels push -p <path>` to start training.
- Check status with `kaggle kernels status <user>/<kernel>`.
- Fetch logs and artifacts with `kaggle kernels output <user>/<kernel>`.

For long runs, prefer background terminal execution:
- Launch with the `exec` tool and `background: true`.
- Set `pty: true` for interactive-safe output.

This keeps the agent responsive while training proceeds asynchronously.
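Outside the agent environment, an equivalent detached launch can be sketched in plain Python (`launch_background` is a hypothetical helper, not part of the skill's tooling):

```python
import subprocess

def launch_background(cmd, log_path):
    """Start a long-running command detached from the current session;
    stdout/stderr go to a log file that can be tailed later."""
    log = open(log_path, "ab")
    proc = subprocess.Popen(
        cmd,
        stdout=log,
        stderr=subprocess.STDOUT,
        start_new_session=True,  # detach from the controlling terminal, like nohup
    )
    log.close()  # the child process holds its own copy of the file descriptor
    return proc
```

The returned `Popen` handle can be polled for liveness while the log file is monitored separately.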
Use `process:log` to stream and inspect ongoing training output.
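Where the `process:log` tool is unavailable, incremental log streaming can be approximated with a plain Python poller (the function name and approach below are illustrative):

```python
def read_new_lines(log_path, offset=0):
    """Return (complete_lines, new_offset) for bytes appended since `offset`.

    Call repeatedly to approximate `tail -f` without blocking.
    """
    with open(log_path, "rb") as f:
        f.seek(offset)
        chunk = f.read()
    cut = chunk.rfind(b"\n") + 1  # hold back a partial trailing line for the next poll
    lines = chunk[:cut].decode("utf-8", errors="replace").splitlines()
    return lines, offset + cut
```

Tracking the byte offset between polls avoids re-reading the whole file as logs grow.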
Parse and report key metrics:
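As a sketch, assuming a hypothetical `step … loss=… lr=…` log format (real trainers vary, so the regex must be adapted to the actual output):

```python
import re

# Hypothetical log-line format; adjust the pattern to the trainer's real output.
LINE_RE = re.compile(
    r"step\s+(?P<step>\d+).*?loss[=:\s]+(?P<loss>[0-9.]+(?:e[-+]?\d+)?|nan)"
    r".*?lr[=:\s]+(?P<lr>[0-9.eE+-]+)",
    re.IGNORECASE,
)

def parse_metrics(line: str):
    """Extract step, loss, and learning rate from one log line, or None."""
    m = LINE_RE.search(line)
    if not m:
        return None
    return {
        "step": int(m.group("step")),
        "loss": float(m.group("loss")),
        "lr": float(m.group("lr")),
    }
```

Parsed records can then feed both `training_status.md` and the divergence checks below.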
If divergence appears (NaN, exploding loss, repeated OOM), propose immediate mitigation and recovery.
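A minimal divergence check along these lines (the thresholds are illustrative assumptions, not values prescribed by the skill):

```python
import math

def check_divergence(loss_history, spike_ratio=3.0, window=20):
    """Return a reason string if training looks divergent, else None.

    Heuristics: any non-finite loss, or the latest loss exceeding
    `spike_ratio` times the median of the recent window.
    """
    if not loss_history:
        return None
    latest = loss_history[-1]
    if math.isnan(latest) or math.isinf(latest):
        return "non-finite loss"
    recent = sorted(loss_history[-window:-1]) if len(loss_history) > 1 else []
    if recent:
        median = recent[len(recent) // 2]
        if median > 0 and latest > spike_ratio * median:
            return f"loss spike: {latest:.3f} vs recent median {median:.3f}"
    return None
```

A positive result would trigger the mitigation path: pause, roll back to the last good checkpoint, and lower the learning rate or adjust the batch size before resuming.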
Support staged curricula:
For each phase, define target steps/tokens, LR policy, and success criteria.
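One way to encode such a phase table — the names, step counts, LR policies, and criteria below are placeholders:

```python
# Illustrative curriculum; every value here is a placeholder to tune per run.
PHASES = [
    {"name": "warmup", "end_step": 2_000,   "lr_policy": "linear-warmup", "success": "loss < 6.0"},
    {"name": "core",   "end_step": 180_000, "lr_policy": "cosine",        "success": "loss < 2.8"},
    {"name": "anneal", "end_step": 200_000, "lr_policy": "linear-decay",  "success": "val loss plateau"},
]

def current_phase(step: int):
    """Map a global step to its curriculum phase (None once all phases end)."""
    for phase in PHASES:
        if step < phase["end_step"]:
            return phase
    return None
```

Keeping the schedule as data makes it easy to report the active phase and its success criterion in status updates.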
Append compute cost entries to the project-level `cost_ledger.json`:
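A minimal append helper, assuming a simple JSON-array schema for `cost_ledger.json` (the project's actual schema may differ):

```python
import json
import os
import time

def append_cost_entry(ledger_path, provider, gpu_type, hours, usd):
    """Append one compute-cost record; the field names are an assumed schema."""
    entries = []
    if os.path.exists(ledger_path):
        with open(ledger_path) as f:
            entries = json.load(f)
    entries.append({
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "provider": provider,
        "gpu_type": gpu_type,
        "hours": hours,
        "usd": usd,
    })
    with open(ledger_path, "w") as f:
        json.dump(entries, f, indent=2)
    return entries
```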
Artifacts to maintain:

- `train.sh` or equivalent launcher script
- `training_config.yaml` (or framework-native config)
- `training_status.md` (current metrics + risk notes)
- `checkpoints_index.json`
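A sketch of maintaining `checkpoints_index.json` with best-plus-recent retention (the schema and pruning policy are assumptions, not mandated by the skill):

```python
import json
import os

def register_checkpoint(index_path, step, path, val_loss, keep_last=5):
    """Record a checkpoint and prune old entries (schema illustrative)."""
    index = []
    if os.path.exists(index_path):
        with open(index_path) as f:
            index = json.load(f)
    index.append({"step": step, "path": path, "val_loss": val_loss})
    index.sort(key=lambda e: e["step"])
    # Keep the best-validation entry plus the `keep_last` most recent checkpoints.
    best = min(index, key=lambda e: e["val_loss"])
    kept = index[-keep_last:]
    if best not in kept:
        kept = [best] + kept
    with open(index_path, "w") as f:
        json.dump(kept, f, indent=2)
    return kept
```

Always retaining the best-validation checkpoint alongside recent ones is what makes recovery after divergence checkpoint-safe.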