Diagnose CUDA RL and GRPO runs that successfully load the model and pass a raw generate canary, but hang at 0% on the first trainer step. Use for first-step stalls, rollout hangs, old-kernel runtime issues, trainer/generate deadlocks, multimodal-to-text training edge cases, and slow-path generation inside TRL.
Use this skill when:
- the model loads and a raw model.generate() canary succeeds, yet the first trainer step hangs at 0%
- a GRPO/RL rollout appears stuck, deadlocked, or pathologically slow inside TRL

Do not use this skill for:
- runs that have already completed at least one trainer step (reward-shaping and convergence questions are out of scope)
- model-stage failures: the model failing to load, or the raw generate canary itself failing
Treat this as a runtime-stage failure, not a model-stage failure.
The system has already proven:
- the model and tokenizer load without error
- a raw model.generate() call returns output (the canary passes)
So the failure is between:
- the last proven stage: a successful raw generate canary, and
- the first stage that never returns: the trainer's first (step-0) rollout/optimization pass.
Always classify the first-step stall into one primary bucket:
- host/runtime environment (driver, old-kernel, or library warnings)
- rollout cost (generation is progressing, but far too slowly to look alive)
- trainer internals (a deadlock or indefinite wait inside the training loop)
- model path (e.g. a multimodal wrapper used for text-only training)
- performance path (generation settings that force a slow decode)
Do not jump to reward-shaping conclusions unless at least one trainer step has completed.
Prove the boundary
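A minimal, stdlib-only sketch of proving the boundary: arm a watchdog that dumps every Python thread's traceback if the process is still stuck after a timeout. The helper names (arm_hang_watchdog, disarm_hang_watchdog) are hypothetical; the mechanism is Python's built-in faulthandler.

```python
import faulthandler
import sys

def arm_hang_watchdog(timeout_s: float = 300.0, stream=sys.stderr) -> None:
    """If the process is still running after `timeout_s` seconds, dump every
    Python thread's traceback to `stream`. The deepest frame in that dump is
    the call that never returned -- i.e. the hang boundary."""
    faulthandler.dump_traceback_later(timeout_s, repeat=False,
                                      file=stream, exit=False)

def disarm_hang_watchdog() -> None:
    """Cancel the pending dump once the step actually completes."""
    faulthandler.cancel_dump_traceback_later()
```

Arm it just before calling trainer.train() and disarm it after step 0 logs; if the dump fires, it shows exactly which call is stuck without killing the run.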
Inspect host/runtime warnings
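One way to inspect host/runtime warnings is to grep the captured launch log for patterns that tend to precede first-step hangs. The pattern list below is illustrative, not exhaustive; extend it with the warnings your own stack actually prints.

```python
import re

# Illustrative hang-correlated patterns only -- extend from your own logs.
SUSPECT_PATTERNS = [
    r"CUDA.*too old",
    r"kernel.*version",
    r"NCCL.*timeout",
    r"falling back to.*slow",
    r"Flash.?Attention.*not available",
]

def scan_log_for_runtime_warnings(log_text: str) -> list[str]:
    """Return log lines matching known hang-correlated warning patterns."""
    hits = []
    for line in log_text.splitlines():
        if any(re.search(p, line, re.IGNORECASE) for p in SUSPECT_PATTERNS):
            hits.append(line)
    return hits
```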
Inspect rollout cost
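Rollout cost can be sanity-checked with arithmetic before blaming a deadlock: a "0% hang" is often just a rollout phase that needs an hour of wall clock. A sketch, assuming sequential decoding at a tokens/sec rate you measured with the canary:

```python
def estimate_rollout_seconds(prompts_per_step: int,
                             generations_per_prompt: int,
                             max_new_tokens: int,
                             tokens_per_second: float) -> float:
    """Rough wall-clock lower bound for one GRPO rollout phase, assuming
    sequential decoding at a measured tokens/sec rate (no batching speedup)."""
    total_tokens = prompts_per_step * generations_per_prompt * max_new_tokens
    return total_tokens / tokens_per_second
```

If the estimate is tens of minutes, the run may not be hung at all; it is in the rollout-cost bucket, and the fix is shrinking the rollout, not debugging a deadlock.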
Inspect trainer internals
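To probe trainer internals, it helps to distinguish "never returns" from "merely slow." A sketch using a daemon thread and a join timeout (the function name and return convention are illustrative):

```python
import threading

def run_with_timeout(fn, timeout_s: float):
    """Run fn(); return ('ok', result) if it finishes within timeout_s,
    else ('timeout', None). A timeout means the call did not return within
    the budget -- raise the budget to separate a deadlock from a slow step."""
    result = {}

    def target():
        result["value"] = fn()

    t = threading.Thread(target=target, daemon=True)
    t.start()
    t.join(timeout_s)
    if t.is_alive():
        return ("timeout", None)
    return ("ok", result.get("value"))
```

Wrapping a single trainer step this way (with a generous budget derived from the rollout-cost estimate) turns an open-ended stall into a yes/no answer.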
Inspect model path
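For the model path, a quick heuristic is to check whether the loaded checkpoint is a multimodal wrapper rather than a native text-only causal LM. The config key names below are illustrative; inspect your model's actual config object:

```python
def looks_like_vlm(model_config: dict) -> bool:
    """Heuristic: multimodal wrappers typically expose a vision sub-config
    or image-token settings, while text-only causal LMs do not. Key names
    here are illustrative examples, not a fixed schema."""
    vlm_markers = ("vision_config", "image_token_id", "mm_projector")
    return any(k in model_config for k in vlm_markers)
```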
Inspect performance path
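For the performance path, a sketch that flags generation settings which commonly force a slow decode; the thresholds and messages are illustrative, not authoritative:

```python
def flag_slow_generation_settings(gen_kwargs: dict) -> list[str]:
    """Return human-readable flags for settings that commonly slow decoding
    to a crawl. Illustrative checks only -- extend for your own stack."""
    flags = []
    if gen_kwargs.get("use_cache") is False:
        flags.append("use_cache=False forces full-sequence recompute per token")
    if gen_kwargs.get("max_new_tokens", 0) > 2048:
        flags.append("very large max_new_tokens inflates every rollout")
    if gen_kwargs.get("num_beams", 1) > 1:
        flags.append("beam search multiplies decode cost by num_beams")
    return flags
```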
Reduce to a smoke test
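A 1-step smoke test can be expressed as a small set of config overrides. The key names below mirror common trainer arguments but are illustrative; map them onto your trainer's actual config:

```python
# Illustrative key names -- translate to your trainer's actual arguments.
SMOKE_TEST_OVERRIDES = {
    "max_steps": 1,                    # stop right after the first optimizer step
    "num_generations": 2,              # smallest rollout group that still trains
    "max_new_tokens": 32,              # keep each rollout cheap
    "per_device_train_batch_size": 1,  # minimal batch
    "gradient_accumulation_steps": 1,  # no accumulation to wait through
    "logging_steps": 1,                # prove step 0 actually completed
}
```

If this completes, the hang is in scale (rollout cost); if it still stalls, the hang is structural (runtime, trainer internals, or model path).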
State exactly where the run stalls and why the issue is a trainer/runtime hang.
Show the last successful stage and the first stage that never returns.
List top likely causes in ranked order with evidence.
Give the smallest config/runtime change likely to make step 0 complete.
List exact files, config keys, env vars, and parameter edits.
Success means:
- the hang boundary is identified precisely, and
- step 0 completes end to end, ideally proven first by a 1-step smoke test.
When model.generate() passes but the trainer hangs, blame the trainer/runtime path before blaming the weights. Prefer proving the exact hang boundary over guessing. Prefer a tiny 1-step smoke test over another long run. Prefer native text-only causal-LM paths over VLM-extraction hacks for CUDA-code RL.