Run tournament game theory experiments with PPO or gradient methods. Use when the user wants to run experiments, train agents, set up custom parameter sweeps, analyze convergence, or asks about run_two_players.py, experiment configuration, or training commands.
| Experiment | Script | Config | Description |
|---|---|---|---|
| Two-Player Symmetric | run/run_two_players.py | config/one_stage_two_players.py | Identical players (k1=k2, l1=l2) |
| Different Cost | run/run_different_cost.py | config/one_stage_different_cost.py | Asymmetric costs (k1 < k2, l1=l2) |
| Different Ability | run/run_different_ability.py | config/one_stage_different_ability.py | Asymmetric abilities (k1=k2, l1 > l2) |
| Three Players | run/run_three_players.py | config/one_stage_three_players.py | Three identical players |
```bash
# PPO Training (Recommended)
python run/run_two_players.py --method ppo --q 40 --episodes 2048000 --seed 42

# Gradient Baseline
python run/run_two_players.py --method gradient --q 40
```
```bash
# Gradient baseline (l1=10, l2=5 by default)
python run/run_different_ability.py --method gradient --q 40

# PPO training
python run/run_different_ability.py --method ppo --q 40 --episodes 2048000 --seed 42

# Custom ability parameters
python run/run_different_ability.py --method ppo --q 40 --l1 15 --l2 5
```
```bash
# Gradient baseline (k1=0.0004, k2=0.00055 by default)
python run/run_different_cost.py --method gradient --q 40

# PPO training
python run/run_different_cost.py --method ppo --q 40 --episodes 2048000 --seed 42
```
| Argument | Default | Description |
|---|---|---|
--method | ppo | Algorithm: ppo or gradient |
--q | (sweeps all) | Noise parameter; pass a single value, or omit to sweep every value in `q_list` |
--episodes | 2048000 | Total environment steps |
--seed | 42 | Random seed |
| Argument | Default | Description |
|---|---|---|
--theory-align-v2 | True | Mean+concentration policy head |
--enable-convergence-eval | True | Early stopping on convergence |
--cheap-gate-profile | relaxed | KL threshold profile |
| Argument | Default | Description |
|---|---|---|
--exploit-every-updates | 10 | Max interval between exploitability evaluations |
--disable-cheap-gate | False | Skip the gate entirely: exploitability eval becomes eligible at every update |
--disable-exploitability | False | Never evaluate exploitability; converge on effort gap only |
Cheap Gate: a gating mechanism that decides when to trigger an exploitability check, based on whether KL divergence and policy drift have stabilized.
Exploitability: measures how close the current policy is to an ε-Nash equilibrium. If a player can gain more than ε by unilaterally deviating, the policies have not yet converged.
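The definitions above can be made concrete. Below is a minimal, self-contained sketch, not the repo's implementation: it assumes uniform noise ε_i ~ U[-q, q] (consistent with the closed-form e* formulas later in this doc, so ε_1 - ε_2 is triangular on [-2q, 2q]) and approximates the best response by grid search.

```python
def win_prob(e1, e2, q):
    """P(player 1 wins) when each eps_i ~ Uniform[-q, q]; the noise
    difference eps_1 - eps_2 is then triangular on [-2q, 2q]."""
    x = e1 - e2
    if x <= -2 * q:
        return 0.0
    if x >= 2 * q:
        return 1.0
    if x <= 0:
        return (x + 2 * q) ** 2 / (8 * q ** 2)
    return 1.0 - (2 * q - x) ** 2 / (8 * q ** 2)

def payoff(e_self, e_other, q, w_h=6.5, w_l=3.0, k=0.0004):
    """Expected prize minus quadratic effort cost."""
    return w_l + (w_h - w_l) * win_prob(e_self, e_other, q) - k * e_self ** 2

def exploitability(e1, e2, q):
    """Largest unilateral deviation gain: the epsilon in 'epsilon-Nash'."""
    grid = [i * 0.1 for i in range(2001)]  # candidate deviations in [0, 200]
    gain1 = max(payoff(e, e2, q) for e in grid) - payoff(e1, e2, q)
    gain2 = max(payoff(e, e1, q) for e in grid) - payoff(e2, e1, q)
    return max(gain1, gain2)

q = 40.0
e_star = (6.5 - 3.0) / (4 * 0.0004 * q)  # closed-form symmetric equilibrium
print(exploitability(e_star, e_star, q))  # ~0: no profitable deviation
print(exploitability(30.0, 30.0, q))      # clearly positive: not converged
```

At the theoretical equilibrium the deviation gain vanishes, which is exactly the signal the convergence evaluator gates on.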
```bash
# Evaluate exploitability every 5 updates, with the cheap gate disabled
python run/run_two_players.py --method ppo --q 40 \
    --exploit-every-updates 5 --disable-cheap-gate

# Disable exploitability evaluation entirely (converge on effort gap only)
python run/run_two_players.py --method ppo --q 40 --disable-exploitability
```
```bash
# Disable theory alignment
python run/run_two_players.py --method ppo --no-theory-align-v2

# Disable convergence evaluation
python run/run_two_players.py --method ppo --no-convergence-eval
```
Configuration lives in config/one_stage_two_players.py. Key parameters:
```python
config = {
    # Game parameters
    "k": 0.0004,                    # Quadratic cost coefficient
    "w_h": 6.5,                     # High prize
    "w_l": 3.0,                     # Low prize
    "q_list": [25.0, 40.0, 55.0],   # Noise values to sweep

    # PPO hyperparameters
    "steps_per_update": 4096,
    "minibatch_size": 1024,
    "update_epochs": 6,
    "episodes": 2_048_000,

    # Learning rate schedule
    "lr_start": 3e-4,
    "lr_end": 2e-4,

    # Entropy schedule
    "entropy_coef_start": 0.03,
    "entropy_coef_end": 0.015,

    # Convergence settings
    "convergence": {
        "enabled": True,
        "cheap_gate_profile": "relaxed",
    },
}
```
e* = (w_h - w_l) / (4 * k * q)
Examples with default w_h=6.5, w_l=3.0, k=0.0004:
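Plugging the defaults into the formula for each swept q value (a standalone arithmetic check, using only constants from the config above):

```python
# Defaults from config; q values from the default q_list sweep.
w_h, w_l, k = 6.5, 3.0, 0.0004

for q in [25.0, 40.0, 55.0]:
    e_star = (w_h - w_l) / (4 * k * q)
    print(f"q={q}: e* = {e_star:.2f}")
# q=25.0: e* = 87.50
# q=40.0: e* = 54.69
# q=55.0: e* = 39.77
```

The q=40 value (54.69) is the `theoretical_effort` that appears in the convergence JSON example below.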
Model: y_i = e_i + l_i + ε_i where l1 > l2
e* = ((2q - (l1 - l2)) * (w_h - w_l)) / (8 * k * q²)
Both players exert the same effort in equilibrium; player 1 wins more often due to its ability advantage.
Examples with l1=10, l2=5, k=0.0004, w_h=6.5, w_l=3.0:
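The promised example values follow directly from the formula (a standalone check with the defaults; not repo code):

```python
# Defaults: abilities l1 > l2, quadratic cost k, prizes w_h > w_l.
w_h, w_l, k = 6.5, 3.0, 0.0004
l1, l2 = 10.0, 5.0

for q in [25.0, 40.0, 55.0]:
    e_star = ((2 * q - (l1 - l2)) * (w_h - w_l)) / (8 * k * q ** 2)
    print(f"q={q}: e* = {e_star:.2f}")
# q=25.0: e* = 78.75
# q=40.0: e* = 51.27
# q=55.0: e* = 37.96
```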
e1* = 2 k2 q (w_h - w_l) / (8 k1 k2 q² - (k1 - k2)(w_h - w_l))
e2* = 2 k1 q (w_h - w_l) / (8 k1 k2 q² - (k1 - k2)(w_h - w_l))
The player with the lower cost (k1) exerts more effort in equilibrium.
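With the default costs (k1=0.0004, k2=0.00055) at q=40, the formulas give the following (a standalone check, not repo code):

```python
# Defaults from the different-cost config.
w_h, w_l = 6.5, 3.0
k1, k2 = 0.0004, 0.00055
q = 40.0

denom = 8 * k1 * k2 * q ** 2 - (k1 - k2) * (w_h - w_l)
e1_star = 2 * k2 * q * (w_h - w_l) / denom
e2_star = 2 * k1 * q * (w_h - w_l) / denom
print(f"e1* = {e1_star:.2f}, e2* = {e2_star:.2f}")
# e1* = 46.09, e2* = 33.52  (lower-cost player 1 works harder)
```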
```bash
python run/run_two_players.py --method ppo --q 40 --seed 50 \
    --episodes 4096000
```
## Output Files
### Two-Player Symmetric
| Output | Location |
|--------|----------|
| Convergence JSON | `results/convergence_history/{method}_q{q}_seed{seed}_{ablation}_convergence.json` |
| Results CSV | `results/one_stage_two_players_v2.csv` |
| Training logs | `results/logs/one_stage_two_players_*.log` |
### Different Ability
| Output | Location |
|--------|----------|
| Convergence JSON | `results/convergence_history/different_ability_{method}_q{q}_convergence.json` |
| Results CSV | `results/different_ability_two_players.csv` |
| Training logs | `results/logs/different_ability_*.log` |
### Different Cost
| Output | Location |
|--------|----------|
| Convergence JSON | `results/convergence_history/different_cost_{method}_q{q}_convergence.json` |
| Results CSV | `results/different_cost_two_players.csv` |
| Training logs | `results/logs/different_cost_*.log` |
### Convergence JSON Structure
```json
{
"config": { "q": 40.0, "seed": 42, ... },
"history": {
"effort_agent1": [50.1, 51.2, ...],
"effort_agent2": [49.8, 51.0, ...],
"kl_divergence": [0.01, 0.008, ...],
"update_idx": [0, 1, 2, ...]
},
"final": {
"theoretical_effort": 54.69,
"final_effort": 54.2,
"gap": 0.49
}
}
```
```bash
# Multi-algorithm comparison
python tools/plot_convergence.py

# Detailed per-agent plots
python tools/plot_convergence_detailed.py --algorithm PPO --q 25.0
```
1. Verify setup: `python tools/verify_rollout_modes.py`
2. Run an experiment: `python run/run_two_players.py --method ppo --q 40 --seed 42`
3. Check convergence: `python tools/plot_convergence_detailed.py --q 40.0`
```python
run_ppo(..., ablation_name="my_ablation")
```

The ablation name appears in the convergence JSON filename under `results/convergence_history/`.

| Profile | Use Case |
|---|---|
relaxed | Default, tolerates higher KL variance |
default | Standard thresholds |
conservative | Stricter convergence criteria |
aggressive | Fast early stopping |
```bash
python run/run_two_players.py --method ppo --cheap-gate-profile conservative
```

Tuning knobs:

- `lr_start` / `lr_end` in config
- `--cheap-gate-profile conservative`
- `--episodes 4096000`
- `entropy_coef_*` in config
- `--seed` for reproducible results

For detailed implementation: