Iteratively optimise a system prompt so that a model generates the correct <think>...</think> and <answer>...</answer> format on a given reasoning-gym task. Loads the model once, evaluates each candidate prompt on N real samples (no API key needed), and reports the best one found. Use when try-model-on-env shows format_reward = 0.0.
Evaluate a pre-planned set of candidate system prompts so a model reliably outputs
<think>...</think><answer>...</answer> format on a reasoning task. No API key required.
Arguments:
- $ARGUMENTS[0] — env_name (required): reasoning-gym task name, e.g. countdown, maze, gsm8k. If missing, ask the user before proceeding.
- $ARGUMENTS[1] — model_name (optional): HuggingFace ID or local checkpoint path. Default: Qwen/Qwen2.5-0.5B-Instruct

Run the local optimizer script from the project root with a 15-minute timeout:
python .claude/skills/get-sys-prompt/scripts/optimize_prompt_local.py $ARGUMENTS
Show the full output, calling out each phase clearly: model loading, each candidate-prompt evaluation round (per-sample format checks reported as think=✓/✗ answer=✓/✗), and the final best-prompt report.

Then give a brief plain-English summary:
- If format_reward improved, suggest updating DEFAULT_SYSTEM_PROMPT in try-model-on-env/scripts/diagnostic.py to the winning prompt.
- If it stayed at 0.0, note that SFT warm-up is likely needed before prompt-tuning helps.

Flags:

| Flag | Default | Purpose |
|---|---|---|
| --samples N | 3 | Samples evaluated per round |
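The per-sample think=✓/✗ answer=✓/✗ check can be sketched as a regex test. This is a hypothetical helper for illustration, not necessarily how the script implements it:

```python
import re

def check_format(completion: str) -> dict:
    """Report whether a completion contains well-formed <think> and <answer> blocks."""
    return {
        "think": re.search(r"<think>.*?</think>", completion, re.DOTALL) is not None,
        "answer": re.search(r"<answer>.*?</answer>", completion, re.DOTALL) is not None,
    }
```

A completion like `<think>2+2=4</think><answer>4</answer>` passes both checks; a missing or unclosed tag fails that check.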
Example with flags:
python .claude/skills/get-sys-prompt/scripts/optimize_prompt_local.py countdown Qwen/Qwen2.5-0.5B-Instruct --samples 5
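Programmatically, the invocation above and its 15-minute budget can be wrapped like this. A sketch only: build_cmd and run_optimizer are hypothetical wrappers around the real script:

```python
import subprocess

def build_cmd(env_name, model_name="Qwen/Qwen2.5-0.5B-Instruct",
              samples=3, system_prompt=None):
    """Assemble the optimizer command line; flag names follow the table above."""
    cmd = ["python", ".claude/skills/get-sys-prompt/scripts/optimize_prompt_local.py",
           env_name, model_name, "--samples", str(samples)]
    if system_prompt is not None:
        cmd += ["--system-prompt", system_prompt]
    return cmd

def run_optimizer(env_name, **kwargs):
    # timeout=900 enforces the 15-minute budget; raises TimeoutExpired on overrun
    return subprocess.run(build_cmd(env_name, **kwargs),
                          capture_output=True, text=True, timeout=900)
```

Passing timeout to subprocess.run keeps a stuck model load or generation loop from hanging the session indefinitely.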
Notes:
- An optional mode uses a Claude model (claude-sonnet-4-6) to generate new candidate prompts dynamically; it requires ANTHROPIC_API_KEY. The default local mode does not.
- Helper functions (build_prompt_string, run_inference, eval_rewards, load_model_and_tokenizer) are imported directly rather than reimplemented.
- The script also accepts a --system-prompt flag for one-off manual tests.
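The two summary outcomes described earlier (improved vs. stuck at 0.0) reduce to a threshold check on the best format_reward found; a minimal sketch:

```python
def recommend(best_format_reward: float) -> str:
    """Turn the best format_reward found into a next-step suggestion."""
    if best_format_reward > 0.0:
        return ("Improved: consider updating DEFAULT_SYSTEM_PROMPT in "
                "try-model-on-env/scripts/diagnostic.py with the winning prompt.")
    return "Stuck at 0.0: SFT warm-up is likely needed before prompt-tuning helps."
```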