Smoke-test a model on one sample from a reasoning-gym environment. Shows the exact prompt the model sees, runs inference, then scores the output with the environment's own format and correctness reward functions. Use when you want to quickly verify that a model or checkpoint works correctly on a specific task.
Smoke-test a model against one sample from a reasoning-gym environment.
$ARGUMENTS[0]: env_name (required). A reasoning-gym task name, e.g. countdown, maze, gsm8k.
If missing, ask the user before proceeding.
$ARGUMENTS[1]: model_name (optional). A HuggingFace ID or local checkpoint path.
Default: Qwen/Qwen2.5-0.5B-Instruct

Run the diagnostic script from the project root with a 10-minute timeout:
python .claude/skills/try-model-on-env/scripts/diagnostic.py $ARGUMENTS
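If the command is launched from Python rather than a shell, the 10-minute limit can be enforced with subprocess. This is a minimal sketch, not part of the skill itself: build_command and run_with_timeout are hypothetical helper names, and the sketch assumes the script takes env_name and model_name as positional arguments in that order, as the $ARGUMENTS expansion suggests.

```python
import subprocess
import sys

# Path taken from the command above; run from the project root.
SCRIPT = ".claude/skills/try-model-on-env/scripts/diagnostic.py"
TIMEOUT_S = 600  # 10 minutes


def build_command(env_name, model_name=None):
    """Build the argv list; model_name is optional (assumed positional)."""
    if not env_name:
        raise ValueError("env_name is required")
    cmd = [sys.executable, SCRIPT, env_name]
    if model_name:
        cmd.append(model_name)
    return cmd


def run_with_timeout(cmd, timeout_s=TIMEOUT_S):
    """Run cmd and return (exit_code, output); exit_code is None on timeout."""
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr so the full output is shown
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired as exc:
        partial = exc.stdout or ""
        if isinstance(partial, bytes):  # partial output may arrive as bytes
            partial = partial.decode(errors="replace")
        return None, partial
    return proc.returncode, proc.stdout
```

A call such as run_with_timeout(build_command("countdown")) would then return the script's exit code together with everything it printed, or None plus any partial output if the limit was hit.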
Show the full output, calling out each phase clearly:
Does the prompt include the <think> and <answer> tag instructions?
The exact prompt the model sees after apply_chat_template.

Give a brief plain-English interpretation:
Did the model follow the expected format (<think>...</think> and <answer>...</answer> present)?

The script relies on the environments package to reuse reward functions, answer extraction, and normalization logic.
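For intuition, the tag-presence part of the format check can be approximated in a few lines. This is only an illustrative stand-in: the real check lives in the environments package, and check_format below is a hypothetical name whose rules (first match wins, whitespace stripped) are assumptions rather than the package's behavior.

```python
import re

# Non-greedy matches so the first complete tag pair is picked up;
# DOTALL lets the reasoning or answer span multiple lines.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def check_format(completion):
    """Report whether both tag pairs are present and extract the answer text."""
    think = THINK_RE.search(completion)
    answer = ANSWER_RE.search(completion)
    return {
        "has_think": think is not None,
        "has_answer": answer is not None,
        "answer": answer.group(1).strip() if answer else None,
    }
```

A completion that has an answer but no reasoning block would report has_think as False, which is the kind of detail worth calling out in the plain-English interpretation.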