Diagnose CUDA RL and GRPO runs that successfully load the model and pass a raw generate canary, but hang at 0% on the first trainer step. Use for first-step stalls, rollout hangs, old-kernel runtime issues, trainer/generate deadlocks, multimodal-to-text training edge cases, and slow-path generation inside TRL.
Use this skill when:
- the model loads and a raw model.generate() canary succeeds, yet the first trainer step hangs at 0%
- a GRPO/RL rollout appears stuck, deadlocked, or pathologically slow inside TRL

Do not use this skill for:
- runs that have already completed at least one trainer step (reward-shaping and convergence questions are out of scope)
- model-stage failures: the model failing to load, or the raw generate canary itself failing
Treat this as a runtime-stage failure, not a model-stage failure.
The system has already proven:
- the model and tokenizer load without error
- a raw model.generate() call returns output (the canary passes)
So the failure is between:
- the last proven stage: a successful raw generate canary, and
- the first stage that never returns: the trainer's first (step-0) rollout/optimization pass.
Always classify the first-step stall into one primary bucket:
- host/runtime environment (driver, old-kernel, or library warnings)
- rollout cost (generation is progressing, but far too slowly to look alive)
- trainer internals (a deadlock or indefinite wait inside the training loop)
- model path (e.g. a multimodal wrapper used for text-only training)
- performance path (generation settings that force a slow decode)
Do not jump to reward-shaping conclusions unless at least one trainer step has completed.
Prove the boundary
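A minimal, stdlib-only sketch of proving the boundary: arm a watchdog that dumps every Python thread's traceback if the process is still stuck after a timeout. The helper names (arm_hang_watchdog, disarm_hang_watchdog) are hypothetical; the mechanism is Python's built-in faulthandler.

```python
import faulthandler
import sys

def arm_hang_watchdog(timeout_s: float = 300.0, stream=sys.stderr) -> None:
    """If the process is still running after `timeout_s` seconds, dump every
    Python thread's traceback to `stream`. The deepest frame in that dump is
    the call that never returned -- i.e. the hang boundary."""
    faulthandler.dump_traceback_later(timeout_s, repeat=False,
                                      file=stream, exit=False)

def disarm_hang_watchdog() -> None:
    """Cancel the pending dump once the step actually completes."""
    faulthandler.cancel_dump_traceback_later()
```

Arm it just before calling trainer.train() and disarm it after step 0 logs; if the dump fires, it shows exactly which call is stuck without killing the run.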
Inspect host/runtime warnings
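One way to inspect host/runtime warnings is to grep the captured launch log for patterns that tend to precede first-step hangs. The pattern list below is illustrative, not exhaustive; extend it with the warnings your own stack actually prints.

```python
import re

# Illustrative hang-correlated patterns only -- extend from your own logs.
SUSPECT_PATTERNS = [
    r"CUDA.*too old",
    r"kernel.*version",
    r"NCCL.*timeout",
    r"falling back to.*slow",
    r"Flash.?Attention.*not available",
]

def scan_log_for_runtime_warnings(log_text: str) -> list[str]:
    """Return log lines matching known hang-correlated warning patterns."""
    hits = []
    for line in log_text.splitlines():
        if any(re.search(p, line, re.IGNORECASE) for p in SUSPECT_PATTERNS):
            hits.append(line)
    return hits
```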
Inspect rollout cost
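Rollout cost can be sanity-checked with arithmetic before blaming a deadlock: a "0% hang" is often just a rollout phase that needs an hour of wall clock. A sketch, assuming sequential decoding at a tokens/sec rate you measured with the canary:

```python
def estimate_rollout_seconds(prompts_per_step: int,
                             generations_per_prompt: int,
                             max_new_tokens: int,
                             tokens_per_second: float) -> float:
    """Rough wall-clock lower bound for one GRPO rollout phase, assuming
    sequential decoding at a measured tokens/sec rate (no batching speedup)."""
    total_tokens = prompts_per_step * generations_per_prompt * max_new_tokens
    return total_tokens / tokens_per_second
```

If the estimate is tens of minutes, the run may not be hung at all; it is in the rollout-cost bucket, and the fix is shrinking the rollout, not debugging a deadlock.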
Inspect trainer internals
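To probe trainer internals, it helps to distinguish "never returns" from "merely slow." A sketch using a daemon thread and a join timeout (the function name and return convention are illustrative):

```python
import threading

def run_with_timeout(fn, timeout_s: float):
    """Run fn(); return ('ok', result) if it finishes within timeout_s,
    else ('timeout', None). A timeout means the call did not return within
    the budget -- raise the budget to separate a deadlock from a slow step."""
    result = {}

    def target():
        result["value"] = fn()

    t = threading.Thread(target=target, daemon=True)
    t.start()
    t.join(timeout_s)
    if t.is_alive():
        return ("timeout", None)
    return ("ok", result.get("value"))
```

Wrapping a single trainer step this way (with a generous budget derived from the rollout-cost estimate) turns an open-ended stall into a yes/no answer.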
Inspect model path
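For the model path, a quick heuristic is to check whether the loaded checkpoint is a multimodal wrapper rather than a native text-only causal LM. The config key names below are illustrative; inspect your model's actual config object:

```python
def looks_like_vlm(model_config: dict) -> bool:
    """Heuristic: multimodal wrappers typically expose a vision sub-config
    or image-token settings, while text-only causal LMs do not. Key names
    here are illustrative examples, not a fixed schema."""
    vlm_markers = ("vision_config", "image_token_id", "mm_projector")
    return any(k in model_config for k in vlm_markers)
```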
Inspect performance path
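For the performance path, a sketch that flags generation settings which commonly force a slow decode; the thresholds and messages are illustrative, not authoritative:

```python
def flag_slow_generation_settings(gen_kwargs: dict) -> list[str]:
    """Return human-readable flags for settings that commonly slow decoding
    to a crawl. Illustrative checks only -- extend for your own stack."""
    flags = []
    if gen_kwargs.get("use_cache") is False:
        flags.append("use_cache=False forces full-sequence recompute per token")
    if gen_kwargs.get("max_new_tokens", 0) > 2048:
        flags.append("very large max_new_tokens inflates every rollout")
    if gen_kwargs.get("num_beams", 1) > 1:
        flags.append("beam search multiplies decode cost by num_beams")
    return flags
```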
Reduce to a smoke test
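A 1-step smoke test can be expressed as a small set of config overrides. The key names below mirror common trainer arguments but are illustrative; map them onto your trainer's actual config:

```python
# Illustrative key names -- translate to your trainer's actual arguments.
SMOKE_TEST_OVERRIDES = {
    "max_steps": 1,                    # stop right after the first optimizer step
    "num_generations": 2,              # smallest rollout group that still trains
    "max_new_tokens": 32,              # keep each rollout cheap
    "per_device_train_batch_size": 1,  # minimal batch
    "gradient_accumulation_steps": 1,  # no accumulation to wait through
    "logging_steps": 1,                # prove step 0 actually completed
}
```

If this completes, the hang is in scale (rollout cost); if it still stalls, the hang is structural (runtime, trainer internals, or model path).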
State exactly where the run stalls and why the issue is a trainer/runtime hang.
Show the last successful stage and the first stage that never returns.
List top likely causes in ranked order with evidence.
Give the smallest config/runtime change likely to make step 0 complete.
List exact files, config keys, env vars, and parameter edits.
Success means:
- the hang boundary is identified precisely, and
- step 0 completes end to end, ideally proven first by a 1-step smoke test.
When model.generate() passes but the trainer hangs, blame the trainer/runtime path before blaming the weights. Prefer proving the exact hang boundary over guessing. Prefer a tiny 1-step smoke test over another long run. Prefer native text-only causal-LM paths over VLM-extraction hacks for CUDA-code RL.