Run tournament game theory experiments with PPO or gradient methods. Use when the user wants to run experiments, train agents, set up custom parameter sweeps, analyze convergence, or asks about run_two_players.py, experiment configuration, or training commands.
| Experiment | Script | Config | Description |
|---|---|---|---|
| Two-Player Symmetric | run/run_two_players.py | config/one_stage_two_players.py | Identical players (k1=k2, l1=l2) |
| Different Cost | run/run_different_cost.py | config/one_stage_different_cost.py | Asymmetric costs (k1 < k2, l1=l2) |
| Different Ability | run/run_different_ability.py | config/one_stage_different_ability.py | Asymmetric abilities (k1=k2, l1 > l2) |
| Three Players | run/run_three_players.py | config/one_stage_three_players.py | Three identical players |
```bash
# PPO Training (Recommended)
python run/run_two_players.py --method ppo --q 40 --episodes 2048000 --seed 42

# Gradient Baseline
python run/run_two_players.py --method gradient --q 40
```
```bash
# Gradient baseline (l1=10, l2=5 by default)
python run/run_different_ability.py --method gradient --q 40

# PPO training
python run/run_different_ability.py --method ppo --q 40 --episodes 2048000 --seed 42

# Custom ability parameters
python run/run_different_ability.py --method ppo --q 40 --l1 15 --l2 5
```
```bash
# Gradient baseline (k1=0.0004, k2=0.00055 by default)
python run/run_different_cost.py --method gradient --q 40

# PPO training
python run/run_different_cost.py --method ppo --q 40 --episodes 2048000 --seed 42
```
| Argument | Default | Description |
|---|---|---|
--method | ppo | Algorithm: ppo or gradient |
--q | (sweeps all) | Noise parameter; pass a single value, or omit to sweep every value in `q_list` |
--episodes | 2048000 | Total environment steps |
--seed | 42 | Random seed |
| Argument | Default | Description |
|---|---|---|
--theory-align-v2 | True | Mean+concentration policy head |
--enable-convergence-eval | True | Early stopping on convergence |
--cheap-gate-profile | relaxed | KL threshold profile |
| Argument | Default | Description |
|---|---|---|
--exploit-every-updates | 10 | Max interval between exploitability evaluations |
--disable-cheap-gate | False | Skip the gate entirely: exploitability eval becomes eligible at every update |
--disable-exploitability | False | Never evaluate exploitability; converge on effort gap only |
Cheap Gate: a gating mechanism that decides when to trigger an exploitability check, based on whether KL divergence and policy drift have stabilized.
Exploitability: measures how close the current policy is to an ε-Nash equilibrium. If a player can gain more than ε by unilaterally deviating, the policies have not yet converged.
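The definitions above can be made concrete. Below is a minimal, self-contained sketch, not the repo's implementation: it assumes uniform noise ε_i ~ U[-q, q] (consistent with the closed-form e* formulas later in this doc, so ε_1 - ε_2 is triangular on [-2q, 2q]) and approximates the best response by grid search.

```python
def win_prob(e1, e2, q):
    """P(player 1 wins) when each eps_i ~ Uniform[-q, q]; the noise
    difference eps_1 - eps_2 is then triangular on [-2q, 2q]."""
    x = e1 - e2
    if x <= -2 * q:
        return 0.0
    if x >= 2 * q:
        return 1.0
    if x <= 0:
        return (x + 2 * q) ** 2 / (8 * q ** 2)
    return 1.0 - (2 * q - x) ** 2 / (8 * q ** 2)

def payoff(e_self, e_other, q, w_h=6.5, w_l=3.0, k=0.0004):
    """Expected prize minus quadratic effort cost."""
    return w_l + (w_h - w_l) * win_prob(e_self, e_other, q) - k * e_self ** 2

def exploitability(e1, e2, q):
    """Largest unilateral deviation gain: the epsilon in 'epsilon-Nash'."""
    grid = [i * 0.1 for i in range(2001)]  # candidate deviations in [0, 200]
    gain1 = max(payoff(e, e2, q) for e in grid) - payoff(e1, e2, q)
    gain2 = max(payoff(e, e1, q) for e in grid) - payoff(e2, e1, q)
    return max(gain1, gain2)

q = 40.0
e_star = (6.5 - 3.0) / (4 * 0.0004 * q)  # closed-form symmetric equilibrium
print(exploitability(e_star, e_star, q))  # ~0: no profitable deviation
print(exploitability(30.0, 30.0, q))      # clearly positive: not converged
```

At the theoretical equilibrium the deviation gain vanishes, which is exactly the signal the convergence evaluator gates on.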
```bash
# Evaluate exploitability every 5 updates, with the cheap gate disabled
python run/run_two_players.py --method ppo --q 40 \
    --exploit-every-updates 5 --disable-cheap-gate

# Disable exploitability evaluation entirely (converge on effort gap only)
python run/run_two_players.py --method ppo --q 40 --disable-exploitability
```
```bash
# Disable theory alignment
python run/run_two_players.py --method ppo --no-theory-align-v2

# Disable convergence evaluation
python run/run_two_players.py --method ppo --no-convergence-eval
```
Configuration lives in config/one_stage_two_players.py. Key parameters:
```python
config = {
    # Game parameters
    "k": 0.0004,                    # Quadratic cost coefficient
    "w_h": 6.5,                     # High prize
    "w_l": 3.0,                     # Low prize
    "q_list": [25.0, 40.0, 55.0],   # Noise values to sweep

    # PPO hyperparameters
    "steps_per_update": 4096,
    "minibatch_size": 1024,
    "update_epochs": 6,
    "episodes": 2_048_000,

    # Learning rate schedule
    "lr_start": 3e-4,
    "lr_end": 2e-4,

    # Entropy schedule
    "entropy_coef_start": 0.03,
    "entropy_coef_end": 0.015,

    # Convergence settings
    "convergence": {
        "enabled": True,
        "cheap_gate_profile": "relaxed",
    },
}
```
e* = (w_h - w_l) / (4 * k * q)
Examples with default w_h=6.5, w_l=3.0, k=0.0004:
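Plugging the defaults into the formula for each swept q value (a standalone arithmetic check, using only constants from the config above):

```python
# Defaults from config; q values from the default q_list sweep.
w_h, w_l, k = 6.5, 3.0, 0.0004

for q in [25.0, 40.0, 55.0]:
    e_star = (w_h - w_l) / (4 * k * q)
    print(f"q={q}: e* = {e_star:.2f}")
# q=25.0: e* = 87.50
# q=40.0: e* = 54.69
# q=55.0: e* = 39.77
```

The q=40 value (54.69) is the `theoretical_effort` that appears in the convergence JSON example below.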
Model: y_i = e_i + l_i + ε_i where l1 > l2
e* = ((2q - (l1 - l2)) * (w_h - w_l)) / (8 * k * q²)
Both players exert the same effort in equilibrium; player 1 wins more often due to its ability advantage.
Examples with l1=10, l2=5, k=0.0004, w_h=6.5, w_l=3.0:
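The promised example values follow directly from the formula (a standalone check with the defaults; not repo code):

```python
# Defaults: abilities l1 > l2, quadratic cost k, prizes w_h > w_l.
w_h, w_l, k = 6.5, 3.0, 0.0004
l1, l2 = 10.0, 5.0

for q in [25.0, 40.0, 55.0]:
    e_star = ((2 * q - (l1 - l2)) * (w_h - w_l)) / (8 * k * q ** 2)
    print(f"q={q}: e* = {e_star:.2f}")
# q=25.0: e* = 78.75
# q=40.0: e* = 51.27
# q=55.0: e* = 37.96
```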
e1* = 2 k2 q (w_h - w_l) / (8 k1 k2 q² - (k1 - k2)(w_h - w_l))
e2* = 2 k1 q (w_h - w_l) / (8 k1 k2 q² - (k1 - k2)(w_h - w_l))
The player with the lower cost (k1) exerts more effort in equilibrium.
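With the default costs (k1=0.0004, k2=0.00055) at q=40, the formulas give the following (a standalone check, not repo code):

```python
# Defaults from the different-cost config.
w_h, w_l = 6.5, 3.0
k1, k2 = 0.0004, 0.00055
q = 40.0

denom = 8 * k1 * k2 * q ** 2 - (k1 - k2) * (w_h - w_l)
e1_star = 2 * k2 * q * (w_h - w_l) / denom
e2_star = 2 * k1 * q * (w_h - w_l) / denom
print(f"e1* = {e1_star:.2f}, e2* = {e2_star:.2f}")
# e1* = 46.09, e2* = 33.52  (lower-cost player 1 works harder)
```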
```bash
python run/run_two_players.py --method ppo --q 40 --seed 50 \
    --episodes 4096000
```
## Output Files
### Two-Player Symmetric
| Output | Location |
|--------|----------|
| Convergence JSON | `results/convergence_history/{method}_q{q}_seed{seed}_{ablation}_convergence.json` |
| Results CSV | `results/one_stage_two_players_v2.csv` |
| Training logs | `results/logs/one_stage_two_players_*.log` |
### Different Ability
| Output | Location |
|--------|----------|
| Convergence JSON | `results/convergence_history/different_ability_{method}_q{q}_convergence.json` |
| Results CSV | `results/different_ability_two_players.csv` |
| Training logs | `results/logs/different_ability_*.log` |
### Different Cost
| Output | Location |
|--------|----------|
| Convergence JSON | `results/convergence_history/different_cost_{method}_q{q}_convergence.json` |
| Results CSV | `results/different_cost_two_players.csv` |
| Training logs | `results/logs/different_cost_*.log` |
### Convergence JSON Structure
```json
{
"config": { "q": 40.0, "seed": 42, ... },
"history": {
"effort_agent1": [50.1, 51.2, ...],
"effort_agent2": [49.8, 51.0, ...],
"kl_divergence": [0.01, 0.008, ...],
"update_idx": [0, 1, 2, ...]
},
"final": {
"theoretical_effort": 54.69,
"final_effort": 54.2,
"gap": 0.49
}
}
```
```bash
# Multi-algorithm comparison
python tools/plot_convergence.py

# Detailed per-agent plots
python tools/plot_convergence_detailed.py --algorithm PPO --q 25.0
```
1. Verify setup: `python tools/verify_rollout_modes.py`
2. Run an experiment: `python run/run_two_players.py --method ppo --q 40 --seed 42`
3. Check convergence: `python tools/plot_convergence_detailed.py --q 40.0`
```python
run_ppo(..., ablation_name="my_ablation")
```

The ablation name appears in the convergence JSON filename under `results/convergence_history/`.

| Profile | Use Case |
|---|---|
relaxed | Default, tolerates higher KL variance |
default | Standard thresholds |
conservative | Stricter convergence criteria |
aggressive | Fast early stopping |
```bash
python run/run_two_players.py --method ppo --cheap-gate-profile conservative
```

Tuning knobs:

- `lr_start` / `lr_end` in config
- `--cheap-gate-profile conservative`
- `--episodes 4096000`
- `entropy_coef_*` in config
- `--seed` for reproducible results

For detailed implementation: