RL training reference for the ART framework. Use when the user asks to create, write, or help with an RL training script, reinforcement learning, GRPO, reward functions, RULER scoring, rollout functions, or anything related to RL fine-tuning.
Also use this skill when the user wants help adapting an existing ART agent for RL.
Keep the process simple:
This skill is an interactive wizard. Do not write the script immediately.
Rules:
You must resolve these before writing the final script:
- ServerlessBackend or LocalBackend?

You must collect answers for all of these, one at a time, before generating code:
If the repo already makes an answer likely, present that as a recommendation and ask the user to confirm or correct it. That still counts as a question and still requires a user response.
Inspect the repo before asking.
- Multi-turn RL when episodes can be recreated and rolled out repeatedly from the same initial state.
- Single-turn static training when the task depends on live humans, mutable production systems, or other unreproducible state.

If replayability is clear, say so and ask for confirmation. Example:
> "This looks replayable because each episode starts from fixed local state and the tools only read from it, so I recommend multi-turn RL. Please confirm there is no hidden live dependency."
If it is not clear, ask whether the task has a replayable environment or only logged/static scenarios.
Ask only for the details needed to implement the rollout, but do not skip the required task questions.
Always gather:
For multi-turn tasks, also gather:
When adapting an existing agent:
- Record `final_answer` directly on the trajectory when useful.

Start from RULER as the default.
Use this rule:
- Use a programmatic reward only when correctness is robustly checkable with code.
- Use RULER for open-text answers or tool-use behavior where exact matching is brittle.
- Use a custom reward only when the task genuinely mixes multiple reward sources.

Explain RULER briefly once:
RULER is an LLM judge that compares trajectories within a group and scores which ones are better.

If the user chooses programmatic reward:
- Set the score on `trajectory.reward`.
- Put supporting diagnostics in `trajectory.metrics`.

If the user chooses RULER:
- Call `ruler_score_group(...)` with the default rubric.
- Validate `OPENAI_API_KEY` at startup.
- Use `openai/gpt-5.4` as the default judge model.

For fixed datasets:
- 0 unless the user asks for it.

For validation:
- `await model.delete_checkpoints()`
- Validation must produce `val/reward`.

Ask explicitly and separately for:
Do not present a single "recommended starting point" model by default. Offer all allowed base models:
- OpenPipe/Qwen3-14B-Instruct
- Qwen/Qwen3-30B-A3B-Instruct-2507
- meta-llama/Llama-3.1-8B-Instruct

Environment requirements:
- ServerlessBackend: require `WANDB_API_KEY`
- RULER: require `OPENAI_API_KEY`

Ask whether to use these starting defaults or customize them:
- Learning rate: 1e-5
- Seed: 42

Iteration defaults:
- Use `iterate_dataset(..., initial_step=await model.get_step())`.

These are the main ART-specific rules that matter in practice:
- Use `Trajectory.messages_and_choices` directly for multi-turn tool use.
- Train with `backend.train(model, trajectory_groups, ...)` plus `await model.log(...)`.
- Call `await backend.close()` before exit.
- Pass `art.TrajectoryGroup(...)` awaitables directly into `art.gather_trajectory_groups(...)`. Do not await them early.
- Score with `after_each=lambda group: ruler_score_group(...)`.
- Check `group.exceptions` if you rebuild groups after rollout.
- Set `max_exceptions` to scale with the active batch size, typically `args.rollouts_per_group * len(batch.items)` for training and the analogous validation batch size. Do not hard-code a small fixed value unless the user explicitly wants that.
- Resume with `initial_step=await model.get_step()`.

Use this as the default pattern for fixed datasets with RULER:
```python
from art.rewards import ruler_score_group
from art.utils.iterate_dataset import iterate_dataset


async def rollout(model: TrainableModel, scenario: Scenario) -> art.Trajectory:
    ...


for batch in iterate_dataset(
    train_scenarios,
    groups_per_step=args.groups_per_step,
    num_epochs=args.num_epochs,
    initial_step=await model.get_step(),
):
    train_groups = await art.gather_trajectory_groups(
        [
            art.TrajectoryGroup(
                (rollout(model, scenario) for _ in range(args.rollouts_per_group)),
                metadata={"scenario_id": scenario.id},
            )
            for scenario in batch.items
        ],
        after_each=lambda group: ruler_score_group(
            group,
            judge_model=args.judge_model,
        ),
        max_exceptions=args.rollouts_per_group * len(batch.items),
    )
    train_result = await backend.train(
        model,
        train_groups,
        learning_rate=args.learning_rate,
    )
    await model.log(
        train_groups,
        metrics=train_result.metrics,
        step=train_result.step,
        split="train",
    )
    if should_validate(train_result.step):
        val_groups = await art.gather_trajectory_groups(
            [
                art.TrajectoryGroup(
                    (rollout(model, scenario) for _ in range(args.rollouts_per_group)),
                    metadata={"scenario_id": scenario.id},
                )
                for scenario in validation_scenarios
            ],
            after_each=lambda group: ruler_score_group(
                group,
                judge_model=args.judge_model,
            ),
            max_exceptions=args.rollouts_per_group * len(validation_scenarios),
        )
        await model.log(
            val_groups,
            metrics={"reward": ...},
            step=train_result.step,
            split="val",
        )
        await model.delete_checkpoints()
```
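The pattern above calls `should_validate` without defining it. It is not part of the ART API; a minimal sketch, assuming a fixed validation cadence (the `every` parameter and its default of 5 are illustrative choices, not framework defaults):

```python
def should_validate(step: int, every: int = 5) -> bool:
    """Return True when a validation pass should run at this training step.

    Skips step 0 (nothing trained yet), then validates every `every` steps.
    """
    return step > 0 and step % every == 0
```

Making the cadence a parameter keeps it easy to expose as a CLI flag alongside the other training arguments.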
Every generated script should:
If you cannot find enough information in the repo, say what is missing and ask the next single blocking question. Do not fabricate environment behavior, reward logic, or dataset structure.
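For the programmatic-reward path described earlier, the rollout itself sets `trajectory.reward` and records supporting values in `trajectory.metrics`. A hedged sketch using a stand-in dataclass rather than the real `art.Trajectory` (the scoring helper and its exact-match rule are illustrative, not part of ART):

```python
from dataclasses import dataclass, field


@dataclass
class Trajectory:
    # Stand-in for art.Trajectory; only the fields this sketch touches.
    reward: float = 0.0
    metrics: dict = field(default_factory=dict)


def score_exact_match(trajectory: Trajectory, expected: str, answer: str) -> Trajectory:
    """Programmatic reward: 1.0 on a case-insensitive exact match, else 0.0."""
    correct = answer.strip().lower() == expected.strip().lower()
    trajectory.reward = 1.0 if correct else 0.0
    # Supporting diagnostics go in metrics, where they surface in logging,
    # rather than being folded into the reward signal.
    trajectory.metrics["exact_match"] = float(correct)
    trajectory.metrics["answer_length"] = len(answer)
    return trajectory
```

This kind of scorer is only appropriate when correctness really is robustly checkable with code; for open-text answers, the RULER path above is the better default.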