Methodology for exploring, testing, and archiving reward/penalty functions for VBot quadruped navigation. A process-oriented guide for systematic reward discovery.
This skill teaches the methodology of reward/penalty exploration — how to discover, test, evaluate, and archive reward signals. It is a process guide, not a recipe book.
IMPORTANT:
- The reward function lives in starter_kit/navigation*/vbot/vbot_*_np.py → _compute_reward().
- Reward weights are in starter_kit/navigation*/vbot/cfg.py → RewardConfig.scales dict.
- Current reward component details, default values, and search ranges are documented in starter_kit_docs/{task-name}/Task_Reference.md.
- Anti-laziness mechanisms (conditional alive_bonus, time_decay, successful truncation) are active. Do NOT remove them.
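As a mental model, a scales dict of this kind usually acts as per-component weights in a weighted sum. A minimal sketch, assuming that structure (component names and numbers are illustrative, not the actual VBot code):

```python
# Minimal sketch of how RewardConfig.scales-style weights typically combine
# with raw component values inside a _compute_reward()-style function.
# Component names and numbers are illustrative, NOT the actual VBot code.

def compute_reward(components: dict, scales: dict) -> float:
    """Weighted sum: each raw component value times its configured scale."""
    return sum(scales.get(name, 0.0) * value for name, value in components.items())

components = {"forward_progress": 0.8, "fall": 1.0}
scales = {"forward_progress": 2.0, "fall": -5.0}
print(compute_reward(components, scales))  # 2.0*0.8 - 5.0*1.0, approximately -3.4
```

This is why tuning a scale and rewriting _compute_reward() are listed as separate change types below: one edits a number in cfg.py, the other changes which components exist.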
This skill does NOT contain reward component examples or scale tables. Those live in their respective locations:
| What | Where |
|---|---|
| Component reference & scale ranges | starter_kit_schedule/templates/reward_config_template.yaml |
| Archived reward/penalty instances | starter_kit_schedule/reward_library/ |
| Terrain strategies & reward code | quadruped-competition-tutor skill |
| Stage-specific reward overrides | curriculum-learning skill |
| Reward weight search spaces | hyperparameter-optimization skill |
| Visual reward debugging | subagent-copilot-cli skill |
| Situation | Use This |
|---|---|
| "I need a new reward idea" | ✅ Follow the Discovery Process |
| "This reward isn't working, what now?" | ✅ Follow Diagnostic Methodology |
| "I want to compare two reward designs" | ✅ Follow Experiment Protocol |
| "I found a good reward, where to save it?" | ✅ Follow Archiving Process |
| "What are the reward scale ranges?" | ❌ Read reward_config_template.yaml |
| "What reward code exists for stairs?" | ❌ Read quadruped-competition-tutor |
| "How do I tune reward weights automatically?" | ❌ Read hyperparameter-optimization |
Reward engineering is iterative. Every change follows this cycle:
┌──────────────┐
│ DIAGNOSE │ ← What behavior is wrong?
└──────┬───────┘
▼
┌──────────────┐
│ HYPOTHESIZE │ ← What reward signal could fix it?
└──────┬───────┘
▼
┌──────────────┐
│ IMPLEMENT │ ← Minimal change, one variable at a time
└──────┬───────┘
▼
┌──────────────┐
│ TEST │ ← Short run (1-2M steps), multiple seeds
└──────┬───────┘
▼
┌──────────────┐
│ EVALUATE │ ← Did the hypothesis hold?
└──────┬───────┘
▼
┌──────────────┐
│ ARCHIVE │ ← Record result in reward library
└──────┬───────┘
│
▼
Next cycle
Rule: Never change more than one reward dimension per cycle. If you change both the termination penalty AND add a new gait reward, you cannot attribute outcomes.
Before touching rewards, identify what behavior is wrong. Not "the reward is too low" but a concrete observable:
| Observable | Likely Reward Gap |
|---|---|
| Robot doesn't move | Missing or weak positive incentive |
| Robot moves but falls | Missing or weak stability penalty |
| Robot oscillates near goal | Reward gradient too steep near target |
| Robot takes bizarre paths | Reward hacking — high reward from unintended behavior |
| Robot crouches/crawls | Missing height maintenance signal |
| Robot ignores obstacles | Missing proximity/collision signal |
| Robot is fast but jerky | Missing smoothness penalty |
| Robot is stable but slow | Positive incentive too weak relative to penalties |
| Reward curve plateaus | Reward provides no gradient in current state region |
| Robot stands still near target | alive_bonus accumulation > goal reward — see Lazy Robot Case Study below |
| Distance increases during training | Reward hacking via per-step bonus. Check alive_bonus × avg_ep_len vs arrival_bonus |
| Episode length near max, reached% drops | Robot exploiting per-step rewards instead of completing task |
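The alive_bonus arithmetic in the last three rows is worth making explicit. A quick check, with illustrative numbers rather than VBot defaults:

```python
# "Lazy robot" check: does loitering for a whole episode pay more than
# actually reaching the goal? Numbers are illustrative, not VBot defaults.

def loitering_pays(alive_bonus: float, avg_ep_len: float, arrival_bonus: float) -> bool:
    """True if accumulated per-step bonus beats the one-time arrival bonus."""
    return alive_bonus * avg_ep_len > arrival_bonus

# 0.5/step over 1000 steps = 500 > 200 arrival bonus: the robot will loiter.
print(loitering_pays(0.5, 1000, 200.0))   # True
# 0.1/step over 1000 steps = 100 < 200: completing the task still wins.
print(loitering_pays(0.1, 1000, 200.0))   # False
```

If this check comes out True, either shrink the per-step bonus, gate it on progress (as the anti-laziness mechanisms do), or raise the arrival bonus.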
# 1. Watch the policy — ALWAYS start here before looking at numbers
uv run scripts/play.py --env <env-name>
# 2. Train with rendering to see behavior in real time
uv run scripts/train.py --env <env-name> --render
# 3. TensorBoard for reward curves
uv run tensorboard --logdir runs/<env-name>
Use subagent-copilot-cli to analyze simulation frames and training curves:
# Describe what you see, ask what reward signal is missing
copilot --model gpt-4.1 --allow-all -p "Watch this simulation frame. The robot is <describe behavior>. What reward signal might cause this?" -s
Key insight: A reward signal is "missing" if the agent has no gradient pointing toward the desired behavior in its current state. The fix may be a new reward, a penalty, or reshaping an existing one.
A testable reward hypothesis has three parts:
Template:
"If I add/modify <signal> with weight <w>, the robot should <desired behavior>, but might also <risk>."
When you don't know what to try, use these strategies to generate candidates:
Take the undesired behavior and directly penalize it.
- Robot bouncing → penalize vertical velocity
- Robot spinning → penalize angular velocity
- Robot retreating → penalize backward displacement
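Sketched as code, each symptom gets its own small penalty term. State field names and weights below are assumptions for illustration, not the VBot API:

```python
# One penalty per symptom. Field names and weights are illustrative
# assumptions, not the actual VBot state or reward code.

def symptom_penalties(lin_vel_z: float, yaw_rate: float, forward_disp: float) -> float:
    penalty = 0.0
    penalty -= 2.0 * lin_vel_z ** 2           # bouncing: punish vertical velocity
    penalty -= 0.5 * yaw_rate ** 2            # spinning: punish angular velocity
    penalty -= 1.0 * max(0.0, -forward_disp)  # retreating: punish backward motion only
    return penalty

print(symptom_penalties(0.0, 0.0, 0.1))   # 0.0: clean forward step, no penalty
print(symptom_penalties(1.0, 0.0, 0.1))   # -2.0: bouncing costs 2.0 * 1.0^2
```

Note the max(0, ...) clipping on the retreat term: penalize only the undesired direction, or you also punish legitimate forward motion.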
If the robot is stuck, the reward surface is flat in its current region. Add a signal that creates local gradient:
- Robot stuck far from goal → add distance-based shaping (sigmoid, exponential)
- Robot stuck near goal → add fine-grained proximity bonus
- Robot stuck on terrain edge → add progress checkpoints
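The shaping options above can be sketched as curves; the key property is a nonzero slope in the region where the robot is stuck. Functional forms and constants here are illustrative choices:

```python
import math

# Two shaping curves that restore gradient where a flat reward surface
# left the robot stuck. Forms and constants are illustrative choices.

def exp_shaping(dist: float, sigma: float = 2.0) -> float:
    """Steepest near the goal: good when the robot stalls close to it."""
    return math.exp(-dist / sigma)

def sigmoid_shaping(dist: float, mid: float = 5.0, k: float = 1.0) -> float:
    """Slope concentrated around `mid`: good when stuck far from the goal."""
    return 1.0 / (1.0 + math.exp(k * (dist - mid)))

# Stepping 1 m closer must increase the shaped reward, i.e. create gradient:
print(exp_shaping(3.0) > exp_shaping(4.0))          # True
print(sigmoid_shaping(5.0) > sigmoid_shaping(6.0))  # True
```

Pick the curve whose steep region covers where the robot actually gets stuck; a sigmoid centered at the terrain edge is effectively a soft progress checkpoint.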
Break the competition score into component sub-goals and create a signal for each:
Final score = traversal + bonus zones + time bonus → Create separate signals for: forward progress, zone proximity, speed
What would a real quadruped "want" in this situation?
- Stairs → lift knees higher
- Uneven ground → keep center of mass low
- Obstacles → slow down, increase awareness
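One such instinct, rewarded directly. The threshold, field names, and weight are invented for illustration:

```python
# Biomimicry sketch: on stairs, reward swing-foot clearance so the robot
# "lifts its knees". The 8 cm threshold and 0.5 weight are invented values.

def stair_clearance_reward(swing_foot_heights: list, clearance: float = 0.08,
                           w: float = 0.5) -> float:
    """Reward each swing foot that clears the assumed height threshold."""
    return w * sum(1.0 for h in swing_foot_heights if h > clearance)

print(stair_clearance_reward([0.10, 0.02]))  # 0.5: one of two feet clears
print(stair_clearance_reward([0.10, 0.12]))  # 1.0: both feet clear
```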
Temporarily remove one existing reward and see what degrades:
# Remove one component to see its effect
uv run scripts/train.py --env <env> --seed 42 --cfg-override "reward_config.scales.<component>=0.0"
If removing a component doesn't change behavior, it was irrelevant. If behavior collapses, it was critical.
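A batch version of this ablation can be sketched in Python around the same --cfg-override flag. The component names and env placeholder are illustrative:

```python
import subprocess

# Build one training command per ablated component, zeroing its scale via
# the --cfg-override flag shown above. Component names are placeholders.

COMPONENTS = ["alive_bonus", "forward_progress", "smoothness"]  # illustrative

def ablation_commands(env: str, seed: int = 42) -> list:
    return [
        ["uv", "run", "scripts/train.py", "--env", env, "--seed", str(seed),
         "--cfg-override", f"reward_config.scales.{comp}=0.0"]
        for comp in COMPONENTS
    ]

for cmd in ablation_commands("<env-name>"):
    print(" ".join(cmd))
    # subprocess.run(cmd, check=True)  # uncomment to actually launch each run
```

Fixing the seed across ablations keeps the comparison fair: the only variable is the zeroed component.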
Compare training reward to competition scoring rules. Gaps indicate missing signals:
Competition awards points for stopping in smiley zones → but training reward only rewards forward velocity → mismatch: need a "stop in zone" signal
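For the mismatch above, a "stop in zone" signal might look like the sketch below. Radius, weight, and state names are assumptions, not competition values:

```python
# Sketch of a "stop in zone" signal: pay nothing outside the scoring zone,
# and inside it pay more the slower the robot moves. All constants are
# illustrative assumptions, not competition values.

def stop_in_zone_reward(dist_to_zone: float, speed: float,
                        zone_radius: float = 0.5, w: float = 1.0) -> float:
    if dist_to_zone > zone_radius:
        return 0.0                       # outside the zone: signal is silent
    return w * max(0.0, 1.0 - speed)     # inside: reward standing still most

print(stop_in_zone_reward(0.3, 0.0))  # 1.0: stopped inside the zone
print(stop_in_zone_reward(0.3, 1.5))  # 0.0: sprinting through the zone
print(stop_in_zone_reward(2.0, 0.0))  # 0.0: stopped, but outside the zone
```

The gating on zone distance matters: an unconditional slowness reward would fight the forward-velocity term everywhere, not just where the score is earned.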
Refer to the quadruped-competition-tutor skill for competition scoring rules.
Check previously tried components in the reward library before inventing new ones:
# Browse archived reward components
Get-ChildItem starter_kit_schedule/reward_library/components/ | Select-Object Name
# Read a specific component's notes
Get-Content starter_kit_schedule/reward_library/components/<name>.yaml
Check reward_config_template.yaml for components that can be enabled/disabled before writing new code.

| Change Type | Location |
|---|---|
| Adjust existing weight | starter_kit/{task}/vbot/cfg.py → RewardConfig.scales dict |
| Add new reward term | starter_kit/{task}/vbot/vbot_*_np.py → _compute_reward() |
| Configure component | starter_kit_schedule/templates/reward_config_template.yaml |
When adjusting weights, use multiplicative steps, not additive: e.g., 0.5 → 1.0 → 2.0, not 0.5 → 0.6 → 0.7.
For new components, start with a weight that produces reward magnitude comparable to existing dominant terms (check reward_breakdown logs).
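Both rules can be sketched together. The magnitudes below are made-up numbers, not real reward_breakdown values:

```python
# Multiplicative weight stepping (halve/double) plus a starting-weight
# heuristic that matches a new term's magnitude to the dominant term's.
# All magnitudes below are made-up, not real reward_breakdown values.

def weight_ladder(w: float) -> list:
    """Candidate next weights: multiplicative steps, not +/- increments."""
    return [w * 0.5, w, w * 2.0]

def matched_start_weight(new_term_mag: float, dominant_mag: float,
                         dominant_weight: float) -> float:
    """Weight making the new term's contribution comparable to the dominant one."""
    return dominant_weight * dominant_mag / new_term_mag

print(weight_ladder(0.4))                     # [0.2, 0.4, 0.8]
print(matched_start_weight(0.25, 0.5, 1.0))   # 2.0: small raw term, larger weight
```

Multiplicative steps matter because reward weights act on log-scale sensitivity: going from 0.5 to 0.6 rarely changes behavior, while doubling usually does.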
NEVER iterate manually with train.py: changing one reward weight, running, reading TensorBoard, killing, repeating. This is manual one-at-a-time search, slow, error-prone, and wasteful. ALWAYS use automl.py for batch reward hypothesis testing.
The correct workflow:
1. Define candidate values in REWARD_SEARCH_SPACE (in automl.py)
2. Run automl.py with --hp-trials 8+ to test multiple configurations in one batch
3. Read results in starter_kit_log/automl_*/report.md

Example: Testing near_target_speed activation radius
# In automl.py REWARD_SEARCH_SPACE:
"near_target_speed": {"type": "uniform", "low": -2.0, "high": -0.1},
"near_target_activation": {"type": "choice", "values": [0.3, 0.5, 1.0, 2.0]},
Then run: uv run starter_kit_schedule/scripts/automl.py --mode stage --hp-trials 8
Why AutoML is better:
train.py is acceptable for testing ONLY when:
- Using --max-env-steps 200000 to verify new reward code compiles
- Using --render to watch behavior qualitatively
- Changing code structure in _compute_reward() (test compilation first, then use automl)

# Record this BEFORE running the experiment