Plan and run a series of training experiments, then compare results
Plan, execute, and analyze a series of training runs based on the user's experiment description in $ARGUMENTS.
Present the planned runs as a table:

| Run | Name | Key Changes | Command |
|-----|------|-------------|---------|
CRITICAL: Run training jobs SEQUENTIALLY, one at a time. NEVER run jobs in parallel — the machine is compute-limited and parallel training will degrade performance for all runs.
For each run:

- Follow the `/train` skill conventions: `RAY_ADDRESS= uv run python run_experiment.py train --env <ENV> ...`
- Pass `--logdir /tmp/experiments/<experiment_name>/<run_name>` for organized output.
- Run the job in the foreground (do NOT use `run_in_background`). Use a generous timeout (600000 ms / 10 min).

After each run completes, extract these metrics from the training stdout:
Per-iteration metrics (from the table printed each iteration):
- `Mean Eprew` — episode reward
- `Mean Eplen` — episode length
- `Actor loss`, `Critic loss`
- `Mean KL Div` — policy divergence
- `Mean Entropy` — exploration
- `Clip Fraction` — PPO clipping rate
- `Mean noise std` — action noise

Summary metrics (from eval and timing lines):

- `fps` — frames per second

Anomaly detection — flag these issues:

- `nan` or `inf` in any metric

For each completed run, report: final reward, peak eval reward and its iteration, fps, and any anomalies flagged.
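The extraction and anomaly checks above can be sketched as follows. The exact stdout layout of `run_experiment.py` is an assumption here; adjust the pattern to the real per-iteration table and eval/timing lines.

```python
import math
import re

# Metric names taken from this document; the stdout format they appear in
# (e.g. "Mean Eprew: 12.5") is an assumption.
METRICS = [
    "Mean Eprew", "Mean Eplen", "Actor loss", "Critic loss",
    "Mean KL Div", "Mean Entropy", "Clip Fraction", "Mean noise std",
    "fps",
]

def extract_metrics(stdout: str) -> dict[str, list[float]]:
    """Collect every occurrence of each metric, in iteration order."""
    series: dict[str, list[float]] = {}
    for name in METRICS:
        # Matches e.g. "Mean Eprew: 123.45", "Mean Eprew | 123.45",
        # or "Mean KL Div: nan".
        pattern = re.compile(
            rf"{re.escape(name)}\s*[:|]?\s*(-?\d[\d.eE+-]*|nan|inf)"
        )
        series[name] = [float(m.group(1)) for m in pattern.finditer(stdout)]
    return series

def find_anomalies(series: dict[str, list[float]]) -> list[str]:
    """Flag metrics containing nan/inf so the run is reported as unstable."""
    return [
        f"{name} contains nan/inf"
        for name, values in series.items()
        if any(math.isnan(v) or math.isinf(v) for v in values)
    ]
```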
After all runs complete, produce a comparison summary:
Comparison table:
| Run | Final Reward | Peak Eval Reward | Peak Iter | Stable? | Key Hyperparam Diffs |
|-----|-------------|-----------------|-----------|---------|---------------------|
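Assembling that table programmatically can be sketched as below; the per-run field names are assumptions for illustration, not values produced by `run_experiment.py` itself.

```python
def comparison_table(runs: list[dict]) -> str:
    """Render the comparison summary as a markdown table.

    Each run dict is assumed (hypothetically) to carry: name, final_reward,
    peak_eval_reward, peak_iter, stable (bool), and diffs (key hyperparameter
    changes as a string).
    """
    header = "| Run | Final Reward | Peak Eval Reward | Peak Iter | Stable? | Key Hyperparam Diffs |"
    sep = "|-----|--------------|------------------|-----------|---------|----------------------|"
    rows = [
        f"| {r['name']} | {r['final_reward']:.2f} | {r['peak_eval_reward']:.2f} "
        f"| {r['peak_iter']} | {'yes' if r['stable'] else 'no'} | {r['diffs']} |"
        for r in runs
    ]
    return "\n".join([header, sep, *rows])
```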
Analysis: summarize which run performed best and why, note any stability issues, and suggest follow-up experiments.

Tips:

- Use `--n-itr 100-500` with `--eval-freq 50`.
- `--no-mirror`
- Keep `--num-procs` consistent across runs in the same experiment for a fair FPS comparison.
- Use short run names (e.g. `gamma095`, `lr1e3`).
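The tips above can be folded into a small command builder. This is a hypothetical sketch: the flag values chosen here are illustrative, any flags inside `extra_flags` come from the user's experiment description, and the `RAY_ADDRESS=` prefix from the conventions would be set via the subprocess environment rather than the argument list.

```python
def build_commands(env: str, experiment: str,
                   variants: dict[str, list[str]],
                   num_procs: int = 8) -> list[list[str]]:
    """Build one training command per run, following the tips above:
    a modest iteration budget, a shared --num-procs for a fair FPS
    comparison, and short run names reused as logdir suffixes.
    """
    commands = []
    for run_name, extra_flags in variants.items():
        commands.append([
            "uv", "run", "python", "run_experiment.py", "train",
            "--env", env,
            "--n-itr", "200",
            "--eval-freq", "50",
            "--num-procs", str(num_procs),
            "--logdir", f"/tmp/experiments/{experiment}/{run_name}",
            *extra_flags,  # per-run hyperparameter overrides (user-supplied)
        ])
    return commands
```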