Enforce safe Slurm job templates on Palmetto: keep /home for code and lightweight standard results; keep caches, checkpoints, models, data, and other large files on /scratch; and add preflight safety checks, periodic low-noise GPU/process monitoring, resumable/atomic result saving, and signal-safe shutdown handling. Use whenever writing or updating sbatch scripts or long-running translation/training/inference jobs.
Use this every time you write or edit a Slurm job script.
This is mandatory for long-running jobs (translation/training/inference/eval).
When the job is a training job, also apply exact-training-resume-guard.
Use palmetto-slurm-workflow alongside this skill for allocation/submission workflow details.
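The pieces named in the description (scratch-based caches, preflight checks, periodic monitoring, atomic saves, signal-safe shutdown) can be sketched as one sbatch script. This is a minimal sketch under stated assumptions, not the project's actual template: `run_job.py`, `REQUIRED_GPUS`, the default `OUTPUT_DIR` path, `job_status.json`, the `#SBATCH` resource values, and the `SLURM_JOB_ID` guard are all hypothetical, and the cache preload step is omitted because it is project-specific.

```shell
#!/usr/bin/env bash
# Placeholder resources; logs/ must already exist when sbatch is called.
#SBATCH --job-name=safe-job-sketch
#SBATCH --gres=gpu:1
#SBATCH --time=04:00:00
#SBATCH --signal=B:TERM@120
#SBATCH -o logs/%x-%j.out
#SBATCH -e logs/%x-%j.err
set -euo pipefail

# Override SCRATCH_ROOT when $USER is not the real Palmetto account.
SCRATCH_ROOT="${SCRATCH_ROOT:-/scratch/${USER:-$(id -un)}}"
export HF_HOME="${SCRATCH_ROOT}/.hf_cache"
export PIP_CACHE_DIR="${SCRATCH_ROOT}/.pip_cache"
OUTPUT_DIR="${OUTPUT_DIR:-${SCRATCH_ROOT}/runs/example}"  # hypothetical path
REQUIRED_GPUS="${REQUIRED_GPUS:-1}"                         # hypothetical knob
MONITOR_INTERVAL="${MONITOR_INTERVAL:-120}"
SAVE_EVERY="${SAVE_EVERY:-200}"
SAVE_EVERY_BATCHES="${SAVE_EVERY_BATCHES:-10}"
SAVE_EVERY_SECONDS="${SAVE_EVERY_SECONDS:-120}"

# Write stdin to $1 atomically: fill a .tmp file next to it, then rename.
atomic_write() {
  local target="$1"
  cat > "${target}.tmp"
  mv -f "${target}.tmp" "$target"
}

# Fail early if the GPU tooling or GPU count is not what the job needs.
preflight() {
  command -v nvidia-smi >/dev/null 2>&1 || { echo "preflight: nvidia-smi not found" >&2; return 1; }
  local gpu_count
  gpu_count="$(nvidia-smi --query-gpu=index --format=csv,noheader | wc -l)"
  [ "$gpu_count" -ge "$REQUIRED_GPUS" ] || { echo "preflight: need $REQUIRED_GPUS GPU(s), found $gpu_count" >&2; return 1; }
  mkdir -p "$HF_HOME" "$PIP_CACHE_DIR" "$OUTPUT_DIR"
}

# One compact GPU line every MONITOR_INTERVAL seconds, in the background.
start_monitor() {
  (
    while true; do
      nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used --format=csv,noheader || true
      sleep "$MONITOR_INTERVAL"
    done
  ) &
  MONITOR_PID=$!
}

# Pass SIGTERM/SIGINT on to the workload so it can save and exit.
forward_signal() {
  echo "signal received; forwarding SIGTERM to workload" >&2
  [ -n "${WORK_PID:-}" ] && kill -TERM "$WORK_PID" 2>/dev/null || true
}

# EXIT trap: stop the monitor and record the exit status atomically.
cleanup() {
  local status=$?
  [ -n "${MONITOR_PID:-}" ] && kill "$MONITOR_PID" 2>/dev/null || true
  if [ -d "$OUTPUT_DIR" ]; then
    printf '{"exit_status": %d}\n' "$status" | atomic_write "$OUTPUT_DIR/job_status.json" || true
  fi
}

main() {
  trap cleanup EXIT
  trap forward_signal TERM INT
  preflight
  start_monitor
  # Cache preload is omitted here; go offline only after it has succeeded.
  export HF_HUB_OFFLINE=1 TRANSFORMERS_OFFLINE=1 HF_DATASETS_OFFLINE=1
  python run_job.py \
    --save-every "$SAVE_EVERY" \
    --save-every-batches "$SAVE_EVERY_BATCHES" \
    --save-every-seconds "$SAVE_EVERY_SECONDS" \
    --progress-json "$OUTPUT_DIR/result.progress.json" &
  WORK_PID=$!
  local status=0
  wait "$WORK_PID" || status=$?
  # A trapped signal interrupts wait; wait again so the workload can finish saving.
  if [ "$status" -gt 128 ] && kill -0 "$WORK_PID" 2>/dev/null; then
    status=0
    wait "$WORK_PID" || status=$?
  fi
  return "$status"
}

# Assumption: only launch the workload inside a Slurm allocation.
if [ -n "${SLURM_JOB_ID:-}" ]; then
  main "$@"
fi
```

The workload runs in the background so that the batch shell stays free to handle signals: `--signal=B:TERM@120` asks Slurm to deliver SIGTERM to the batch shell roughly 120 seconds before the time limit, and the trap forwards it to the workload instead of letting the job be killed mid-write.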
Periodic low-noise GPU/process monitoring (nvidia-smi).

/scratch storage policy:
- Keep large files under /scratch/$USER.
- Verify that $USER is the real Palmetto account before using it in scratch/home paths; if not, set an explicit SCRATCH_ROOT=/scratch/<real_user> and use that consistently.
- Keep output_dir under /scratch/$USER (or /scratch/<real_user>), not under /home.
- If checkpoints or other large files are written to output_dir, put that output_dir on scratch and keep separate run logs under project-local logs/.
- Preload the cache into /scratch/$USER/.hf_cache before starting the main workload.
- After preloading, set HF_HUB_OFFLINE=1, TRANSFORMERS_OFFLINE=1, HF_DATASETS_OFFLINE=1.
- If the cache is missing files, fail with "cache preload incomplete" instead of downloading during the run.
- Save results atomically (write .tmp then replace).
- Handle SIGTERM / SIGINT with a signal-safe shutdown.
- Monitor every MONITOR_INTERVAL seconds (default 120).
- Clean up with trap ... EXIT.
- Start the script with set -euo pipefail.
- Preflight: check that nvidia-smi exists and GPU count meets requirement.
- HF_HOME=/scratch/$USER/.hf_cache
- PIP_CACHE_DIR=/scratch/$USER/.pip_cache
- Environments: /scratch/$USER/envs/...
- Containers: /scratch/$USER/containers
- Other large files: /scratch/$USER/...
- Prefer SCRATCH_ROOT=/scratch/<real_user> over raw $USER expansion if $USER may resolve to coder.
- Place output_dir under /scratch/$USER/...
- Export HF_HOME before the main Python command.
- Set HF_HUB_OFFLINE=1, TRANSFORMERS_OFFLINE=1, HF_DATASETS_OFFLINE=1.
- Write #SBATCH -o/-e and code-generated run logs to project-local directories by default.
- Save/progress flags: --save-every, --save-every-batches, --save-every-seconds, --progress-json.
- For training jobs, apply exact-training-resume-guard; model-only saves do not count as resumable training checkpoints.
- Write <output>.progress.json with current language/column/counters.
- Defaults: MONITOR_INTERVAL=120, SAVE_EVERY=200, SAVE_EVERY_BATCHES=10, SAVE_EVERY_SECONDS=120, LOG_EVERY_BATCHES=10.

A job script is only done when all are true: