Enforce exact-resume support for long-running training jobs. Use when writing or updating any training script, launcher, sbatch file, DeepSpeed/Accelerate/TRL training entrypoint, or checkpoint policy where future runs must resume from the last step with optimizer, scheduler, RNG, and framework state preserved after timeout, preemption, or manual interruption.
Use this skill whenever a task writes or edits training launchers, training entrypoints, or checkpoint logic.
This is not "load model weights and start over." The requirement is exact resume: training continues from the last completed step with optimizer, scheduler, RNG, and framework state intact. Concretely:
- Never set `save_only_model=true` for long-running training unless the user explicitly accepts losing exact resume; it saves weights only and drops optimizer, scheduler, and RNG state.
- Handle SIGTERM and SIGINT so a preempted or manually interrupted job writes a checkpoint before exiting.
- `output_dir` must be on durable storage such as /scratch, not ephemeral local scratch, and not mixed into lightweight repo logs.
- Resume with `resume_from_checkpoint`: call `trainer.train(resume_from_checkpoint=...)` so step counters and all training state are restored.
- `save_model()` alone is not sufficient; it writes model weights only.

A training script is not done until exact resume has been checked explicitly. If the script only reloads model weights, describe it as "restart from weights," not "resume training."