Name: Experiment Eval
Author: autoresearch-trading

Experiment Eval Protocol

Define what success looks like BEFORE running an experiment. Grade the result AFTER.

When to Use

- Before any autoresearch experiment (integrated into the Design step)
- When the user asks "what would success look like?"
- After an experiment completes, to decide KEEP/DISCARD/INVESTIGATE

Pre-Experiment: Define Eval

Before running, write an eval block in the experiment plan (docs/experiments/):

## Eval Definition

**Hypothesis:** [what we expect to happen and why]

**Control:** [baseline values: PCA or random-encoder baseline]

**Success criteria (ALL must pass):**
- [ ] Accuracy at primary horizon >= [threshold] on 15+/25 symbols
- [ ] Beat baseline by >= [margin] pp (representation quality)
- [ ] CKA across seeds > 0.7 (representation stability)
- [ ] No embedding collapse (effective rank > 10)

**Failure indicators (ANY triggers DISCARD):**
- [ ] Mean accuracy across symbols < 50.5% (no signal)
- [ ] Fewer than 10/25 symbols above 51% (not universal)
- [ ] Symbol identity probe > 30% (memorizing symbols, not microstructure)
- [ ] Embedding collapse (effective rank < 5)

**Ambiguity zone (triggers INVESTIGATE):**
- [ ] Accuracy between [baseline, baseline+0.5pp] (noise vs real improvement)
- [ ] Symbol coverage changes by 1-2 symbols (sampling variance)
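
The three lists in the template combine into a single decision. The sketch below is one possible reading, not the protocol's reference implementation: the `EvalResult` fields and the `grade` function are hypothetical names, accuracies are assumed to be percentages, and the precedence (failure indicators first, then the ambiguity zone, then success criteria) is an assumption. The symbol-coverage ambiguity check is left out because it needs the previous run's coverage.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EvalResult:
    """Hypothetical container for the numbers the eval block refers to."""
    symbol_accuracy: List[float]   # accuracy per symbol at the primary horizon, in percent
    baseline_accuracy: float       # control (PCA or random encoder) mean accuracy, in percent
    cka_across_seeds: float        # CKA between representations from different seeds
    effective_rank: float          # effective rank of the embedding matrix
    symbol_probe_accuracy: float   # symbol identity probe accuracy, in percent


def grade(r: EvalResult, threshold: float, margin_pp: float) -> str:
    """Return KEEP, DISCARD, or INVESTIGATE (assumed precedence: fail > ambiguous > pass)."""
    mean_acc = sum(r.symbol_accuracy) / len(r.symbol_accuracy)
    n_at_threshold = sum(a >= threshold for a in r.symbol_accuracy)
    n_above_51 = sum(a > 51.0 for a in r.symbol_accuracy)

    # Failure indicators: ANY of these triggers DISCARD.
    failures = [
        mean_acc < 50.5,                  # no signal
        n_above_51 < 10,                  # not universal
        r.symbol_probe_accuracy > 30.0,   # memorizing symbols
        r.effective_rank < 5,             # embedding collapse
    ]
    if any(failures):
        return "DISCARD"

    # Ambiguity zone: improvement over baseline is within 0.5 pp.
    ambiguous = r.baseline_accuracy <= mean_acc <= r.baseline_accuracy + 0.5

    # Success criteria: ALL must pass.
    successes = [
        n_at_threshold >= 15,
        mean_acc - r.baseline_accuracy >= margin_pp,
        r.cka_across_seeds > 0.7,
        r.effective_rank > 10,
    ]
    if all(successes) and not ambiguous:
        return "KEEP"

    # Assumption: anything that neither fails nor clearly passes is investigated.
    return "INVESTIGATE"
```

Under this reading, a run whose effective rank falls between 5 and 10 neither fails nor passes, so it is graded INVESTIGATE.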

Evaluation Gates (pre-registered)

| Gate | Threshold | What It Tests |
|------|-----------|---------------|
| 0 | PCA + random encoder baselines | Reference (no pass/fail) |
| 1 | Linear probe > 51.4% on 15+/25 symbols | Frozen representation quality |
| 2 | Fine-tuned > logistic regression by >= 0.5pp | Value of pretraining |
| 3 | AVAX (held out) > 51.4% | Universality |
| 4 | Temporal stability < 3pp drop | Robustness |
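
Gates 1 through 4 can be checked mechanically once the accuracies are available. The following is a minimal sketch under stated assumptions: `check_gates` and its parameter names are hypothetical, all accuracies are percentages, and the Gate 4 "drop" is assumed to be the accuracy difference between an earlier and a later evaluation period. Gate 0 is a reference only, so it has no entry.

```python
from typing import Dict


def check_gates(
    probe_acc_by_symbol: Dict[str, float],  # linear-probe accuracy per symbol, in percent
    finetuned_acc: float,                   # fine-tuned model accuracy, in percent
    logreg_acc: float,                      # logistic regression accuracy, in percent
    heldout_acc: float,                     # accuracy on the held-out symbol (AVAX), in percent
    early_acc: float,                       # accuracy on the earlier period (assumed definition)
    late_acc: float,                        # accuracy on the later period (assumed definition)
) -> Dict[int, bool]:
    """Return pass/fail per gate number, using the thresholds from the table."""
    n_symbols_passing = sum(acc > 51.4 for acc in probe_acc_by_symbol.values())
    return {
        1: n_symbols_passing >= 15,           # frozen representation quality
        2: finetuned_acc - logreg_acc >= 0.5, # value of pretraining
        3: heldout_acc > 51.4,                # universality
        4: early_acc - late_acc < 3.0,        # robustness
    }


if __name__ == "__main__":
    # Illustrative numbers only, not experiment results.
    probes = {f"SYM{i}": 52.0 for i in range(16)}
    probes.update({f"SYM{i}": 50.0 for i in range(16, 25)})
    print(check_gates(probes, 53.0, 52.0, 52.0, 53.0, 51.0))
```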
