Launch a training run for a robot environment using PPO.
Parse the user's request from $ARGUMENTS and construct a training command.
```sh
RAY_ADDRESS= uv run python run_experiment.py train --env <ENV> --logdir <LOGDIR> [OPTIONS...]
```
| Name | Description |
|---|---|
| cartpole | Cartpole swing-up (simplest, good for testing) |
| h1 | Unitree H1 standing task |
| jvrc_walk | JVRC humanoid basic walking |
| jvrc_step | JVRC humanoid stepping with planned footsteps |
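For instance, a minimal run on the basic walking task (a sketch; only the required `--env` and `--logdir` arguments are filled in, all other flags fall back to their defaults) could look like:

```sh
# Train the JVRC walking environment with default hyperparameters.
# RAY_ADDRESS is cleared so Ray starts a fresh local cluster.
RAY_ADDRESS= uv run python run_experiment.py train \
  --env jvrc_walk \
  --logdir /tmp/training_runs
```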
| Flag | Default | Description |
|---|---|---|
| --n-itr | 20000 | Training iterations |
| --lr | 1e-4 | Learning rate |
| --gamma | 0.99 | Discount factor |
| --std-dev | 0.223 | Action noise standard deviation |
| --learn-std | off | Learn the action noise (flag) |
| --entropy-coeff | 0.0 | Entropy regularization coefficient |
| --clip | 0.2 | PPO clipping parameter |
| --minibatch-size | 64 | Minibatch size |
| --epochs | 3 | Optimization epochs per update |
| --num-procs | 12 | Parallel workers |
| --num-envs-per-worker | 1 | Vectorized envs per worker |
| --max-grad-norm | 0.05 | Gradient clipping threshold |
| --max-traj-len | 400 | Episode horizon |
| --eval-freq | 100 | Evaluate every N iterations |
| --seed | None | Random seed |
| --device | auto | Training device (auto/cpu/cuda) |
| --no-mirror | off | Disable symmetry wrapper (flag) |
| --recurrent | off | Use LSTM policy (flag) |
| --continued | None | Path to pretrained weights |
- Use `--logdir /tmp/training_runs` unless the user specifies a different path.
- Launch with `run_in_background: true` on the Bash tool and set a generous timeout (600000ms).
- Use `TaskOutput` with `block: false` to check the latest output.

For cartpole, these settings are known to work well (from prior experiments):
- `--gamma 0.99 --lr 3e-5 --minibatch-size 256 --max-grad-norm 0.02`
- `--std-dev 0.15 --learn-std --entropy-coeff 0.01`
- `--max-traj-len 500 --n-itr 1500 --num-procs 12`
- `--no-mirror` (cartpole has no body symmetry)

Suggest these defaults when the user trains cartpole, but let them override.
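Assembled into a single invocation, the suggested cartpole run might look like this (a sketch; the logdir is the default from above, and every value can be overridden by the user):

```sh
# Cartpole training with the settings known to work well from prior experiments.
RAY_ADDRESS= uv run python run_experiment.py train \
  --env cartpole \
  --logdir /tmp/training_runs \
  --gamma 0.99 --lr 3e-5 --minibatch-size 256 --max-grad-norm 0.02 \
  --std-dev 0.15 --learn-std --entropy-coeff 0.01 \
  --max-traj-len 500 --n-itr 1500 --num-procs 12 \
  --no-mirror
```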