Reference for designing nerd experiments — competing theories, sweep harnesses, ground truth strategies, metric selection, and feasibility checks. Use when creating or reviewing experiment plans.
Ground truth — how "correct" is defined, with circularity caveats
Theory-linked acceptance criteria — each theory can be confirmed or rejected
Theory Types
Parameter is wrong: a different value would improve the metric
Model is wrong: a different model (not just parameters) would fit better
Feature is unnecessary: removing it entirely causes no degradation
Metric is wrong: we're optimizing the wrong thing
Data is the bottleneck: the parameter doesn't matter because the input data is the real problem
Architecture is the bottleneck: no parameter value can fix this
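The theory types above become falsifiable when each one is paired with an explicit accept/reject predicate over sweep results. A minimal sketch (the config names, thresholds, and result values are illustrative):

```python
# Each theory pairs a claim with a testable prediction over sweep results.
# `results` maps a config name to its metric value (hypothetical shape).
theories = {
    "parameter_wrong": lambda r: max(r.values()) > r["baseline"] * 1.05,
    "feature_unnecessary": lambda r: r["ablated"] >= r["baseline"] * 0.99,
}

results = {"baseline": 0.70, "tuned": 0.76, "ablated": 0.69}
verdicts = {name: check(results) for name, check in theories.items()}
# Here "parameter is wrong" is confirmed (a tuned config beats baseline by >5%),
# while "feature is unnecessary" is rejected (ablation degrades the metric).
```

Writing the predicate down before running the sweep is what makes the acceptance criteria theory-linked rather than post-hoc.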
Ground Truth Strategies
Auto-resolved data: measures sensitivity, not absolute quality (circular)
User feedback: clicks/engagement (sparse, position-biased)
Hand-labeled: gold standard (expensive, small samples)
Synthetic: tests metric computation, not real-world quality
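One way to sanity-check an auto-resolved ground truth before trusting it is to measure its agreement with a small hand-labeled sample. A sketch, with made-up labels:

```python
def agreement(auto_labels, gold_labels):
    """Fraction of items where the auto-resolved label matches the hand label."""
    matches = sum(a == g for a, g in zip(auto_labels, gold_labels))
    return matches / len(gold_labels)

# Low agreement means the auto-resolved metric mostly measures its own
# resolution rule (circularity), not real-world quality.
score = agreement([1, 0, 1, 1], [1, 0, 0, 1])  # 3 of 4 match -> 0.75
```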
Feasibility Check
```python
from math import prod

combos = prod(range_sizes)                 # range_sizes: length of each parameter's sweep range
est_seconds = combos * data_size * cost_per_eval
if est_seconds > 3600:                     # over an hour: reduce ranges or use random search
    ...
```
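The random-search fallback can be folded into the check itself: run the full grid when it fits the budget, otherwise sample as many configs as the budget allows. A runnable sketch (the ranges, budget, and costs are illustrative):

```python
import itertools
import random
from math import prod

def plan_sweep(ranges, data_size, cost_per_eval, budget_s=3600, seed=0):
    """Return the full grid if it fits the time budget, else a random sample."""
    grid = list(itertools.product(*ranges.values()))
    full_cost = prod(len(r) for r in ranges.values()) * data_size * cost_per_eval
    if full_cost <= budget_s:
        return grid
    # Budget only covers this many configs; sample them reproducibly.
    n = max(1, int(budget_s / (data_size * cost_per_eval)))
    return random.Random(seed).sample(grid, min(n, len(grid)))

ranges = {"k": [1, 5, 10, 20], "alpha": [0.1, 0.5, 1.0]}
configs = plan_sweep(ranges, data_size=100, cost_per_eval=0.5)  # full grid fits
```

The fixed seed keeps the sampled subset reproducible across runs, which matters when comparing sweeps.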
Analytical vs Experimental
Not all parameters can be swept. Parameters in non-executable files (markdown agent prompts, documentation, config comments) or those requiring human judgment are analytical — they can be reasoned about with competing theories but cannot be iterated in a nerd-loop.
For analytical parameters:
Still generate competing theories and testable predictions
Recommendations come from reasoning and analogies, not measured data
Do NOT propose sweep harnesses or eval commands
Mark as experiment_type: "analytical" in the plan
For experimentable parameters:
Full sweep harness with automated metric
The metric must be: automated (shell command → number), deterministic, sensitive to changes, and fast (<5 min)
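A minimal sketch of such a harness, where the metric is any shell command that prints a single number (the `echo` stand-in and parameter names are illustrative, not a real eval command):

```python
import itertools
import subprocess

def eval_metric(cmd):
    """Run a shell command that prints one number; return it as the metric."""
    out = subprocess.run(cmd, shell=True, capture_output=True, text=True, check=True)
    return float(out.stdout.strip())

def sweep(ranges, cmd_template):
    """Evaluate the metric command for every combination of parameter values."""
    results = {}
    for combo in itertools.product(*ranges.values()):
        params = dict(zip(ranges.keys(), combo))
        results[combo] = eval_metric(cmd_template.format(**params))
    return results

# Stand-in command so the sketch is self-contained; a real harness would
# substitute something like "./eval.sh --k {k}" that emits one number.
scores = sweep({"k": [1, 5]}, "echo {k}")
```

Keeping the metric behind a single shell command is what makes the harness automated: the loop never needs human judgment between evaluations.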
Anti-Patterns
Single-hypothesis experiments (only testing "is the parameter optimal?")
Sweeping dead code parameters (verify the hot path first)
Optimizing metrics that don't correlate with user satisfaction
No ablation (never tested if the feature matters at all)
Insufficient data (<30 points per config = noise)
No baseline (always include current values as comparison)
Proposing a nerd-loop on parameters that have no automated metric — use analytical batch analysis instead
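Several of these anti-patterns disappear if the sweep's arms are declared up front with the baseline and an ablation arm built in. A sketch with hypothetical parameter names:

```python
# Always include the current value (baseline) and a no-feature arm (ablation)
# alongside the candidate configs, and enforce a per-config sample floor.
CURRENT_K = 10
MIN_POINTS = 30  # fewer than ~30 points per config is noise

arms = {
    "baseline": {"k": CURRENT_K, "feature_on": True},
    "ablation": {"k": CURRENT_K, "feature_on": False},
    **{f"k={k}": {"k": k, "feature_on": True} for k in (1, 5, 20)},
}
```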