Enforce exact-resume support for long-running training jobs. Use when writing or updating any training script, launcher, sbatch file, DeepSpeed/Accelerate/TRL training entrypoint, or checkpoint policy where future runs must resume from the last step with optimizer, scheduler, RNG, and framework state preserved after timeout, preemption, or manual interruption.
Use this skill whenever a task writes or edits training launchers, training entrypoints, or checkpoint logic.
This is not "load model weights and start over." The requirement is exact resume: training continues from the last completed step with optimizer, scheduler, RNG, and framework state intact. Concretely:
- Never set `save_only_model=true` for long-running training unless the user explicitly accepts losing exact resume; it saves weights only and drops optimizer, scheduler, and RNG state.
- Handle SIGTERM and SIGINT so a preempted or manually interrupted job writes a checkpoint before exiting.
- `output_dir` must be on durable storage such as /scratch, not ephemeral local scratch, and not mixed into lightweight repo logs.
- Resume with `resume_from_checkpoint`: call `trainer.train(resume_from_checkpoint=...)` so step counters and all training state are restored.
- `save_model()` alone is not sufficient; it writes model weights only.

A training script is not done until exact resume has been checked explicitly. If the script only reloads model weights, describe it as "restart from weights," not "resume training."