Methodology for exploring, testing, and archiving reward/penalty functions for VBot quadruped navigation. A process-oriented guide for systematic reward discovery.
This skill teaches the methodology of reward/penalty exploration — how to discover, test, evaluate, and archive reward signals. It is a process guide, not a recipe book.
IMPORTANT:
- The reward function lives in starter_kit/navigation*/vbot/vbot_*_np.py → _compute_reward().
- Reward weights are in starter_kit/navigation*/vbot/cfg.py → RewardConfig.scales dict.
- Current reward component details, default values, and search ranges are documented in starter_kit_docs/{task-name}/Task_Reference.md.
- Anti-laziness mechanisms (conditional alive_bonus, time_decay, successful truncation) are active. Do NOT remove them.
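As a mental model, a scales dict of this kind usually acts as per-component weights in a weighted sum. A minimal sketch, assuming that structure (component names and numbers are illustrative, not the actual VBot code):

```python
# Minimal sketch of how RewardConfig.scales-style weights typically combine
# with raw component values inside a _compute_reward()-style function.
# Component names and numbers are illustrative, NOT the actual VBot code.

def compute_reward(components: dict, scales: dict) -> float:
    """Weighted sum: each raw component value times its configured scale."""
    return sum(scales.get(name, 0.0) * value for name, value in components.items())

components = {"forward_progress": 0.8, "fall": 1.0}
scales = {"forward_progress": 2.0, "fall": -5.0}
print(compute_reward(components, scales))  # 2.0*0.8 - 5.0*1.0, approximately -3.4
```

This is why tuning a scale and rewriting _compute_reward() are listed as separate change types below: one edits a number in cfg.py, the other changes which components exist.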
This skill does NOT contain reward component examples or scale tables. Those live in their respective locations:
| What | Where |
|---|---|
| Component reference & scale ranges | starter_kit_schedule/templates/reward_config_template.yaml |
| Archived reward/penalty instances | starter_kit_schedule/reward_library/ |
| Terrain strategies & reward code | quadruped-competition-tutor skill |
| Stage-specific reward overrides | curriculum-learning skill |
| Reward weight search spaces | hyperparameter-optimization skill |
| Visual reward debugging | subagent-copilot-cli skill |
| Situation | Use This |
|---|---|
| "I need a new reward idea" | ✅ Follow the Discovery Process |
| "This reward isn't working, what now?" | ✅ Follow Diagnostic Methodology |
| "I want to compare two reward designs" | ✅ Follow Experiment Protocol |
| "I found a good reward, where to save it?" | ✅ Follow Archiving Process |
| "What are the reward scale ranges?" | ❌ Read reward_config_template.yaml |
| "What reward code exists for stairs?" | ❌ Read quadruped-competition-tutor |
| "How do I tune reward weights automatically?" | ❌ Read hyperparameter-optimization |
Reward engineering is iterative. Every change follows this cycle:
┌──────────────┐
│ DIAGNOSE │ ← What behavior is wrong?
└──────┬───────┘
▼
┌──────────────┐
│ HYPOTHESIZE │ ← What reward signal could fix it?
└──────┬───────┘
▼
┌──────────────┐
│ IMPLEMENT │ ← Minimal change, one variable at a time
└──────┬───────┘
▼
┌──────────────┐
│ TEST │ ← Short run (1-2M steps), multiple seeds
└──────┬───────┘
▼
┌──────────────┐
│ EVALUATE │ ← Did the hypothesis hold?
└──────┬───────┘
▼
┌──────────────┐
│ ARCHIVE │ ← Record result in reward library
└──────┬───────┘
│
▼
Next cycle
Rule: Never change more than one reward dimension per cycle. If you change both the termination penalty AND add a new gait reward, you cannot attribute outcomes.
Before touching rewards, identify what behavior is wrong. Not "the reward is too low" but a concrete observable:
| Observable | Likely Reward Gap |
|---|---|
| Robot doesn't move | Missing or weak positive incentive |
| Robot moves but falls | Missing or weak stability penalty |
| Robot oscillates near goal | Reward gradient too steep near target |
| Robot takes bizarre paths | Reward hacking — high reward from unintended behavior |
| Robot crouches/crawls | Missing height maintenance signal |
| Robot ignores obstacles | Missing proximity/collision signal |
| Robot is fast but jerky | Missing smoothness penalty |
| Robot is stable but slow | Positive incentive too weak relative to penalties |
| Reward curve plateaus | Reward provides no gradient in current state region |
| Robot stands still near target | alive_bonus accumulation > goal reward — see Lazy Robot Case Study below |
| Distance increases during training | Reward hacking via per-step bonus. Check alive_bonus × avg_ep_len vs arrival_bonus |
| Episode length near max, reached% drops | Robot exploiting per-step rewards instead of completing task |
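The alive_bonus arithmetic in the last three rows is worth making explicit. A quick check, with illustrative numbers rather than VBot defaults:

```python
# "Lazy robot" check: does loitering for a whole episode pay more than
# actually reaching the goal? Numbers are illustrative, not VBot defaults.

def loitering_pays(alive_bonus: float, avg_ep_len: float, arrival_bonus: float) -> bool:
    """True if accumulated per-step bonus beats the one-time arrival bonus."""
    return alive_bonus * avg_ep_len > arrival_bonus

# 0.5/step over 1000 steps = 500 > 200 arrival bonus: the robot will loiter.
print(loitering_pays(0.5, 1000, 200.0))   # True
# 0.1/step over 1000 steps = 100 < 200: completing the task still wins.
print(loitering_pays(0.1, 1000, 200.0))   # False
```

If this check comes out True, either shrink the per-step bonus, gate it on progress (as the anti-laziness mechanisms do), or raise the arrival bonus.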
# 1. Watch the policy — ALWAYS start here before looking at numbers
uv run scripts/play.py --env <env-name>
# 2. Train with rendering to see behavior in real time
uv run scripts/train.py --env <env-name> --render
# 3. TensorBoard for reward curves
uv run tensorboard --logdir runs/<env-name>
Use subagent-copilot-cli to analyze simulation frames and training curves:
# Describe what you see, ask what reward signal is missing
copilot --model gpt-4.1 --allow-all -p "Watch this simulation frame. The robot is <describe behavior>. What reward signal might cause this?" -s
Key insight: A reward signal is "missing" if the agent has no gradient pointing toward the desired behavior in its current state. The fix may be a new reward, a penalty, or reshaping an existing one.
A testable reward hypothesis has three parts:
Template:
"If I add/modify <signal> with weight <w>, the robot should <desired behavior>, but might also <risk>."
When you don't know what to try, use these strategies to generate candidates:
Take the undesired behavior and directly penalize it.
- Robot bouncing → penalize vertical velocity
- Robot spinning → penalize angular velocity
- Robot retreating → penalize backward displacement
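Sketched as code, each symptom gets its own small penalty term. State field names and weights below are assumptions for illustration, not the VBot API:

```python
# One penalty per symptom. Field names and weights are illustrative
# assumptions, not the actual VBot state or reward code.

def symptom_penalties(lin_vel_z: float, yaw_rate: float, forward_disp: float) -> float:
    penalty = 0.0
    penalty -= 2.0 * lin_vel_z ** 2           # bouncing: punish vertical velocity
    penalty -= 0.5 * yaw_rate ** 2            # spinning: punish angular velocity
    penalty -= 1.0 * max(0.0, -forward_disp)  # retreating: punish backward motion only
    return penalty

print(symptom_penalties(0.0, 0.0, 0.1))   # 0.0: clean forward step, no penalty
print(symptom_penalties(1.0, 0.0, 0.1))   # -2.0: bouncing costs 2.0 * 1.0^2
```

Note the max(0, ...) clipping on the retreat term: penalize only the undesired direction, or you also punish legitimate forward motion.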
If the robot is stuck, the reward surface is flat in its current region. Add a signal that creates local gradient:
- Robot stuck far from goal → add distance-based shaping (sigmoid, exponential)
- Robot stuck near goal → add fine-grained proximity bonus
- Robot stuck on terrain edge → add progress checkpoints
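The shaping options above can be sketched as curves; the key property is a nonzero slope in the region where the robot is stuck. Functional forms and constants here are illustrative choices:

```python
import math

# Two shaping curves that restore gradient where a flat reward surface
# left the robot stuck. Forms and constants are illustrative choices.

def exp_shaping(dist: float, sigma: float = 2.0) -> float:
    """Steepest near the goal: good when the robot stalls close to it."""
    return math.exp(-dist / sigma)

def sigmoid_shaping(dist: float, mid: float = 5.0, k: float = 1.0) -> float:
    """Slope concentrated around `mid`: good when stuck far from the goal."""
    return 1.0 / (1.0 + math.exp(k * (dist - mid)))

# Stepping 1 m closer must increase the shaped reward, i.e. create gradient:
print(exp_shaping(3.0) > exp_shaping(4.0))          # True
print(sigmoid_shaping(5.0) > sigmoid_shaping(6.0))  # True
```

Pick the curve whose steep region covers where the robot actually gets stuck; a sigmoid centered at the terrain edge is effectively a soft progress checkpoint.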
Break the competition score into component sub-goals and create a signal for each:
Final score = traversal + bonus zones + time bonus → Create separate signals for: forward progress, zone proximity, speed
What would a real quadruped "want" in this situation?
- Stairs → lift knees higher
- Uneven ground → keep center of mass low
- Obstacles → slow down, increase awareness
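One such instinct, rewarded directly. The threshold, field names, and weight are invented for illustration:

```python
# Biomimicry sketch: on stairs, reward swing-foot clearance so the robot
# "lifts its knees". The 8 cm threshold and 0.5 weight are invented values.

def stair_clearance_reward(swing_foot_heights: list, clearance: float = 0.08,
                           w: float = 0.5) -> float:
    """Reward each swing foot that clears the assumed height threshold."""
    return w * sum(1.0 for h in swing_foot_heights if h > clearance)

print(stair_clearance_reward([0.10, 0.02]))  # 0.5: one of two feet clears
print(stair_clearance_reward([0.10, 0.12]))  # 1.0: both feet clear
```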
Temporarily remove one existing reward and see what degrades:
# Remove one component to see its effect
uv run scripts/train.py --env <env> --seed 42 --cfg-override "reward_config.scales.<component>=0.0"
If removing a component doesn't change behavior, it was irrelevant. If behavior collapses, it was critical.
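A batch version of this ablation can be sketched in Python around the same --cfg-override flag. The component names and env placeholder are illustrative:

```python
import subprocess

# Build one training command per ablated component, zeroing its scale via
# the --cfg-override flag shown above. Component names are placeholders.

COMPONENTS = ["alive_bonus", "forward_progress", "smoothness"]  # illustrative

def ablation_commands(env: str, seed: int = 42) -> list:
    return [
        ["uv", "run", "scripts/train.py", "--env", env, "--seed", str(seed),
         "--cfg-override", f"reward_config.scales.{comp}=0.0"]
        for comp in COMPONENTS
    ]

for cmd in ablation_commands("<env-name>"):
    print(" ".join(cmd))
    # subprocess.run(cmd, check=True)  # uncomment to actually launch each run
```

Fixing the seed across ablations keeps the comparison fair: the only variable is the zeroed component.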
Compare training reward to competition scoring rules. Gaps indicate missing signals:
Competition awards points for stopping in smiley zones → but training reward only rewards forward velocity → mismatch: need a "stop in zone" signal
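For the mismatch above, a "stop in zone" signal might look like the sketch below. Radius, weight, and state names are assumptions, not competition values:

```python
# Sketch of a "stop in zone" signal: pay nothing outside the scoring zone,
# and inside it pay more the slower the robot moves. All constants are
# illustrative assumptions, not competition values.

def stop_in_zone_reward(dist_to_zone: float, speed: float,
                        zone_radius: float = 0.5, w: float = 1.0) -> float:
    if dist_to_zone > zone_radius:
        return 0.0                       # outside the zone: signal is silent
    return w * max(0.0, 1.0 - speed)     # inside: reward standing still most

print(stop_in_zone_reward(0.3, 0.0))  # 1.0: stopped inside the zone
print(stop_in_zone_reward(0.3, 1.5))  # 0.0: sprinting through the zone
print(stop_in_zone_reward(2.0, 0.0))  # 0.0: stopped, but outside the zone
```

The gating on zone distance matters: an unconditional slowness reward would fight the forward-velocity term everywhere, not just where the score is earned.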
Refer to the quadruped-competition-tutor skill for competition scoring rules.
Check previously tried components in the reward library before inventing new ones:
# Browse archived reward components
Get-ChildItem starter_kit_schedule/reward_library/components/ | Select-Object Name
# Read a specific component's notes
Get-Content starter_kit_schedule/reward_library/components/<name>.yaml
Check reward_config_template.yaml for components that can be enabled/disabled before writing new code.

| Change Type | Location |
|---|---|
| Adjust existing weight | starter_kit/{task}/vbot/cfg.py → RewardConfig.scales dict |
| Add new reward term | starter_kit/{task}/vbot/vbot_*_np.py → _compute_reward() |
| Configure component | starter_kit_schedule/templates/reward_config_template.yaml |
When adjusting weights, use multiplicative steps, not additive: e.g., 0.5 → 1.0 → 2.0, not 0.5 → 0.6 → 0.7.
For new components, start with a weight that produces reward magnitude comparable to existing dominant terms (check reward_breakdown logs).
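Both rules can be sketched together. The magnitudes below are made-up numbers, not real reward_breakdown values:

```python
# Multiplicative weight stepping (halve/double) plus a starting-weight
# heuristic that matches a new term's magnitude to the dominant term's.
# All magnitudes below are made-up, not real reward_breakdown values.

def weight_ladder(w: float) -> list:
    """Candidate next weights: multiplicative steps, not +/- increments."""
    return [w * 0.5, w, w * 2.0]

def matched_start_weight(new_term_mag: float, dominant_mag: float,
                         dominant_weight: float) -> float:
    """Weight making the new term's contribution comparable to the dominant one."""
    return dominant_weight * dominant_mag / new_term_mag

print(weight_ladder(0.4))                     # [0.2, 0.4, 0.8]
print(matched_start_weight(0.25, 0.5, 1.0))   # 2.0: small raw term, larger weight
```

Multiplicative steps matter because reward weights act on log-scale sensitivity: going from 0.5 to 0.6 rarely changes behavior, while doubling usually does.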
NEVER iterate manually with train.py: changing one reward weight, running, reading TensorBoard, killing, repeating. This is manual one-at-a-time search, slow, error-prone, and wasteful. ALWAYS use automl.py for batch reward hypothesis testing.
The correct workflow:
1. Define candidate values in REWARD_SEARCH_SPACE (in automl.py)
2. Run automl.py with --hp-trials 8+ to test multiple configurations in one batch
3. Read results in starter_kit_log/automl_*/report.md

Example: Testing near_target_speed activation radius
# In automl.py REWARD_SEARCH_SPACE:
"near_target_speed": {"type": "uniform", "low": -2.0, "high": -0.1},
"near_target_activation": {"type": "choice", "values": [0.3, 0.5, 1.0, 2.0]},
Then run: uv run starter_kit_schedule/scripts/automl.py --mode stage --hp-trials 8
Why AutoML is better:
train.py is acceptable for testing ONLY when:
- Using --max-env-steps 200000 to verify new reward code compiles
- Using --render to watch behavior qualitatively
- Changing code structure in _compute_reward() (test compilation first, then use automl)

# Record this BEFORE running the experiment