This skill should be used when the user asks to "design experiments", "plan experiments", "how many runs do I need", "which baselines should I use", "plan ablations", "power analysis", or "how many seeds", or after hypothesis formulation and before running any experiments.
Pre-experiment planning that translates hypotheses into a concrete, executable experiment plan with baselines, ablations, sample size, resource estimation, and execution ordering.
Core Features
1. Baseline Selection
Select and justify comparison baselines:
Trivial baseline: Random chance, majority class, or simplest heuristic
Standard baseline: Most common method in the field
SOTA baseline: Best published result on the benchmark
Ablation baseline: Proposed method minus the key component
Fairness checklist: Same preprocessing, splits, hyperparameter budget
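One illustrative way to record the baseline roster and enforce the fairness checklist in code (a sketch; the class, field names, and example baselines are not part of any required schema):

```python
from dataclasses import dataclass

@dataclass
class Baseline:
    name: str
    kind: str           # "trivial" | "standard" | "sota" | "ablation"
    justification: str
    same_preprocessing: bool = True
    same_splits: bool = True
    same_hp_budget: bool = True

baselines = [
    Baseline("majority-class", "trivial", "Floor that any learned model must beat"),
    Baseline("logistic-regression", "standard", "Most common method in the field"),
]

# Fairness checklist: every baseline must share preprocessing, splits, and HP budget.
unfair = [b.name for b in baselines
          if not (b.same_preprocessing and b.same_splits and b.same_hp_budget)]
```

Flagging unfair comparisons at plan time is cheaper than discovering them at review time.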
2. Ablation Planning
Design ablation studies to isolate component contributions:
Component identification: Which parts of the method are novel?
Ablation ordering: Which components to remove first (most to least important)
Expected impact: Predicted effect of each ablation (for hypothesis validation)
Interaction effects: Which components might interact?
Causal Claim Ablation Requirements
For every mechanistic claim the paper intends to make (e.g., "freezing attention heads preserves syntactic knowledge"):
Identify confounds: What else changes when you apply this intervention? (parameter count, compute, capacity, regularization effect)
Design parameter-matched controls: If intervention A has different trainable parameters than baseline B, add a control C that matches A's parameter count but changes a different component.
Design component isolation ablations: If claiming component X is special, test:
Freeze X only
Freeze everything EXCEPT X
Freeze a random subset of the same size as X
Pre-register the ablation logic: Before running experiments, document which ablation supports which claim.
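The three isolation conditions above can be enumerated mechanically. A sketch (the component names and the seeded random control are illustrative):

```python
import random

def isolation_conditions(components, target, seed=0):
    """Build the three freeze sets needed to test whether `target` is special:
    freeze only the target, freeze everything except it, and freeze a random
    subset of the same size as a control."""
    target = set(target)
    rest = [c for c in components if c not in target]
    rng = random.Random(seed)  # fixed seed so the control is reproducible
    return {
        "freeze_target_only": target,
        "freeze_all_except_target": set(rest),
        "freeze_random_same_size": set(rng.sample(rest, k=len(target))),
    }

heads = [f"head_{i}" for i in range(12)]
conds = isolation_conditions(heads, target=["head_3", "head_7"])
```

Generating the conditions from one function also makes it easy to pre-register them: the output can be written into the experiment plan verbatim before any run starts.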
Ablation completeness check (include in experiment plan):

| Claim | Required ablation | Designed? | Confounds addressed? |
|---|---|---|---|
| [claim 1] | [ablation] | [yes/no] | [list] |
2b. Dataset Representativeness Rule
If the paper makes claims about a CATEGORY of tasks (e.g., "syntactic tasks", "semantic tasks", "reasoning tasks"):
Minimum 2 datasets per category, ideally 3.
Datasets within a category must differ in at least one of: size, domain, metric, or difficulty.
If only 1 dataset per category is feasible (compute budget), the paper MUST:
Frame claims as "on [dataset name]" not "for [task type]".
Explicitly state this as a limitation in the experiment plan.
Not use category-level language in title or abstract.
If claims are about specific benchmarks (not categories), 1 dataset per benchmark is fine.
Confound checklist (mandatory per experimental factor):
For each factor you claim to study (e.g., "task type"), list:
What other variables co-vary with this factor? (dataset size, metric type, label distribution, sequence length)
Can you separate the factor of interest from these confounds?
If not, what is the honest scope of your claim?
Include the confound checklist as a table in the experiment plan.
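A minimal sketch of how the checklist might be kept alongside the plan (the factor name, confound list, and scope text are illustrative):

```python
confound_checklist = {
    "task type": {
        "covaries_with": ["dataset size", "metric type", "label distribution"],
        "separable": False,
        "honest_scope": "claims hold on the specific datasets tested, not the category",
    },
}

# Every studied factor needs all three questions answered before experiments run.
for factor, entry in confound_checklist.items():
    assert {"covaries_with", "separable", "honest_scope"} <= entry.keys(), factor
```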
2c. Motivation-Metric Alignment
If the introduction or motivation mentions any efficiency-related benefit (speed, memory, compute cost):
Then the experiment plan MUST include measurements of:
Wall-clock training time (not just parameter count)
Peak GPU memory usage
Throughput (examples/second)
At minimum, report these for the main comparison conditions.
Parameter count alone is NOT an efficiency metric — identical parameter counts can have very different compute costs.
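A hedged sketch of a throughput measurement (the workload is a stand-in for a training step; peak GPU memory would be read separately, e.g. via `torch.cuda.max_memory_allocated` in a PyTorch setup):

```python
import time

def measure_throughput(step_fn, examples_per_step, n_steps=10, warmup=2):
    """Wall-clock throughput in examples/second, with warmup steps excluded
    so one-time setup costs do not distort the estimate."""
    for _ in range(warmup):
        step_fn()
    start = time.perf_counter()
    for _ in range(n_steps):
        step_fn()
    elapsed = time.perf_counter() - start
    return n_steps * examples_per_step / elapsed

# Stand-in workload; replace with an actual training step.
tput = measure_throughput(lambda: sum(i * i for i in range(10_000)),
                          examples_per_step=32)
```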
3. Sample Size & Seeds
Determine the number of runs needed:
Convention-based: Community standard for the benchmark (e.g., "5 seeds is standard for BCI-IV-2a")
ALETHEIA default: Plan 5 seeds per condition for full runs unless a power analysis or the venue requires otherwise; do not default to 10 seeds in experiment-plan.md (this reduces GPU waste and aligns with rules/compute-budget.md).
Power analysis (optional, see caveat below): When prior effect size and variance are available
6. Compute Requirements (estimated at design time)
Include in the experiment plan:
| Resource | Estimate |
|---|---|
| GPU type needed | [minimum VRAM, recommended type] |
| Per-run time | [estimated minutes] |
| Total runs | [conditions x seeds] |
| Total GPU-hours | [per-run x total runs] |
| Storage | [dataset size + checkpoints + outputs] |
Feasibility check: Can this experiment be completed within available resources? If not, which conditions should be prioritized or cut?
This ensures resource constraints are considered during design, not discovered later at /plan-compute.
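The table arithmetic is simple enough to keep next to the plan; a sketch with illustrative numbers:

```python
def compute_estimate(conditions, seeds, minutes_per_run):
    """Total runs and GPU-hours for a full sweep (conditions x seeds)."""
    total_runs = conditions * seeds
    gpu_hours = total_runs * minutes_per_run / 60
    return total_runs, gpu_hours

# Illustrative: 6 conditions x 5 seeds at 40 minutes each
runs, hours = compute_estimate(conditions=6, seeds=5, minutes_per_run=40)
feasible = hours <= 24  # compare against the available GPU budget
```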
7. Expected Results (mandatory per hypothesis)
For each hypothesis, document BEFORE running experiments:
If H[N] is TRUE:
Expected metric values: [specific numbers or ranges]
Expected patterns: [what the data should look like]
Expected effect size: [Cohen's d or similar]
If H[N] is FALSE:
Expected metric values: [what you'd see instead]
Expected patterns: [alternative explanation]
What this would mean for the contribution: [implications]
This forces pre-commitment to outcome interpretation and prevents post-hoc rationalization.
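One way to pin these expectations down is a structured record written before the first run (all numbers and text below are illustrative, not prescribed values):

```python
# Pre-registered expectations for one hypothesis (all values illustrative)
expected = {
    "H1": {
        "if_true": {
            "metric_range": (0.78, 0.84),  # band on the primary metric
            "pattern": "proposed method beats the standard baseline on every seed",
            "effect_size_d": 0.8,
        },
        "if_false": {
            "metric_range": (0.70, 0.76),
            "pattern": "seed distributions overlap with the baseline",
            "implication": "the novel component adds no measurable benefit",
        },
    },
}

# Both branches must exist for every hypothesis before any run starts.
assert all({"if_true", "if_false"} <= h.keys() for h in expected.values())
```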
Input Modes
Mode A: Pipeline (from predecessor)
Hypotheses -- from hypothesis-formulation output (hypotheses.md)
Available resources -- GPU hours, datasets, time constraints
Target venue (optional) -- for calibrating experiment thoroughness
Mode B: Standalone (manual)
Research goal -- user describes what they want to test in free text
Method description -- user describes their approach
Available resources -- user specifies compute, data, time constraints
The skill reconstructs implicit hypotheses from the description before designing experiments. When running in Mode B, it must state: "No hypotheses.md found. I've inferred the following testable hypotheses from your description -- please confirm before proceeding."
Outputs
experiment-plan.md containing:
Baselines: Selected baselines with justification for each
Ablations: Components to ablate, with justification for each
Datasets & splits: Which datasets, how to split, cross-validation strategy
Metrics: Primary metric, secondary metrics, with justification
Sample size / runs: Number of seeds, subjects, folds
Power analysis (optional, see caveat below)
Resource estimate: Estimated GPU hours, storage, wall time
Execution order: Which experiments to run first (quick validation before full sweep)
Checkpoints: Decision points (stop-or-go after each experiment block)
Initial experiment-state.json (see Iteration Loop State section)
Power Analysis: Optional with Explicit Caveat
Power analysis is included only when the user provides or the skill can estimate:
Expected effect size (from prior work or pilot data)
Variance estimate (from prior work or pilot data)
Desired significance level and power
When parameters are available: Compute the recommended sample size and include the calculation.
When parameters are NOT available (the common case):
State: "Power analysis skipped -- no prior effect size or variance estimates available."
Use a convention-based default instead (e.g., "5 seeds is standard for this benchmark; 3 seeds minimum for a quick validation pass").
Flag this as a limitation: "Sample size is based on community convention, not statistical power. If the effect is small, more runs may be needed."
The skill must NEVER silently assume effect size parameters and present a power analysis as if it were well-grounded. Assumed parameters must be explicitly marked: "ASSUMED: effect size d=0.5 (medium, no prior data). This power analysis is illustrative only."
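When parameters are available, the calculation can be as simple as the normal-approximation formula below (a sketch; for small samples a t-based correction typically adds a run or two per group):

```python
from math import ceil
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    """Normal-approximation sample size per group for a two-sided
    two-sample comparison at standardized effect size d (Cohen's d)."""
    z = NormalDist().inv_cdf
    return ceil(2 * ((z(1 - alpha / 2) + z(power)) / d) ** 2)

# ASSUMED: d = 0.5 (medium, no prior data) -- illustrative only
n = n_per_group(d=0.5)
```

Larger assumed effects shrink the requirement sharply, which is exactly why assumed parameters must be flagged: an optimistic d quietly justifies an underpowered study.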
Iteration Loop State
experiment-state.json
On first run, create experiment-state.json in the project root: