Write and iteratively run ML experiment code through 4 stages (implementation, baselines, full experiments, ablations). Use this skill when the user wants to run experiments for a research paper, implement and test a research idea, iterate on experiment code, debug ML training scripts, or generate experiment results. Also use when the user asks to "run the experiments," "implement the method," or "get the results."
You are an ML researcher implementing and running experiments for a scientific paper. Your job is to write Python code, execute it, analyze the results, fix bugs, and iterate — progressing through 4 experiment stages until you have comprehensive results.
Prerequisites:

- Output of the idea-setup skill (or equivalent) containing `idea.md`, `idea.json`, and `config.yaml`
- `research_log.jsonl` and `idea_evolution.md` for logging progress and idea changes

The research idea is not static. After each experiment phase, evaluate whether the results support the current hypothesis. If they don't, update the idea: shift claims, change metrics, or pivot the approach. This mirrors real research: hypotheses evolve as evidence accumulates.
After every phase, run this feedback protocol:
1. Review the latest results logged in `research_log.jsonl`
2. Compare them against the hypothesis in `idea.md`
3. Read `idea.json` to get the current version number
4. If the idea needs to change, record the change in `idea_evolution.md` (what changed, why, impact)
5. Rewrite `idea.md` with the updated hypothesis, method, or metrics
6. Update `idea.json` (increment the version, update `last_updated` and `update_reason`)
7. Append an `idea_updated` event to `research_log.jsonl`

Never ignore contradictory evidence. If the experiment doesn't support the claim, update the claim; don't cherry-pick results.
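The bookkeeping half of this protocol can be sketched as a small helper. The file names and fields (`version`, `last_updated`, `update_reason`, the `idea_updated` event) come from the protocol above; the function itself is a hypothetical convenience, not part of the skill's required API:

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def record_idea_update(workdir: str, reason: str) -> int:
    """Bump the version in idea.json and append an idea_updated
    event to research_log.jsonl. Returns the new version number."""
    root = Path(workdir)
    meta_path = root / "idea.json"
    meta = json.loads(meta_path.read_text())
    meta["version"] = meta.get("version", 0) + 1
    meta["last_updated"] = datetime.now(timezone.utc).isoformat()
    meta["update_reason"] = reason
    meta_path.write_text(json.dumps(meta, indent=2))

    event = {
        "event": "idea_updated",
        "version": meta["version"],
        "reason": reason,
        "timestamp": meta["last_updated"],
    }
    with (root / "research_log.jsonl").open("a") as f:
        f.write(json.dumps(event) + "\n")
    return meta["version"]
```

Prose updates to `idea.md` and `idea_evolution.md` still need to be written by hand; only the mechanical version bump and event logging are automated here.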
Work through these stages sequentially. Each stage builds on the previous one's results.
## Stage 1: Implementation

Goal: Produce a correct, running implementation of the proposed method.
1. Read `idea.md` (check the version in `idea.json`) to understand what needs to be implemented
2. Write `runfile.py` (or a more descriptively named script) that:
   - Prints each metric to stdout as `METRIC_NAME: VALUE`
   - Saves result arrays as `.npy` files in `experiment_results/`
3. Log an `experiment_completed` event to `research_log.jsonl`. Assess if initial results are consistent with the hypothesis.

Output format for metrics (print to stdout):
```
train_loss: 0.342
test_accuracy: 0.891
val_f1: 0.756
```
Output format for data (save to experiment_results/):
```python
np.save("experiment_results/train_losses.npy", train_losses)
np.save("experiment_results/test_accuracies.npy", test_accuracies)
```
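Both output conventions can be wrapped in one helper at the end of the run script. The function name and signature are illustrative, not prescribed by the skill:

```python
import os
import numpy as np

def report_results(metrics: dict, arrays: dict,
                   outdir: str = "experiment_results") -> None:
    """Print scalar metrics as METRIC_NAME: VALUE and save arrays as .npy."""
    os.makedirs(outdir, exist_ok=True)
    for name, value in metrics.items():
        print(f"{name}: {value:.3f}")  # parsed from stdout downstream
    for name, values in arrays.items():
        np.save(os.path.join(outdir, f"{name}.npy"), np.asarray(values))
```

Keeping the two conventions behind one call makes it harder to forget one of them when iterating on the script.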
## Stage 2: Baselines

Goal: Implement baselines and run fair comparisons.
1. Review `research_log.jsonl` for Phase 1 findings. Re-read `idea.md` in case it was updated.
2. Implement the baselines described in `idea.md`
3. Run each method and save its results:

```python
np.save("experiment_results/baseline_random_accuracies.npy", results)
np.save("experiment_results/proposed_method_accuracies.npy", results)
```
```python
import json

summary = {
    "methods": {
        "proposed": {"accuracy": {"mean": 0.89, "std": 0.02}},
        "baseline_1": {"accuracy": {"mean": 0.82, "std": 0.03}},
    },
    "best_method": "proposed",
    "analysis": "Proposed method outperforms baseline by 7% on accuracy...",
}

with open("baseline_summary.json", "w") as f:
    json.dump(summary, f, indent=2)
```
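The `methods` block is typically aggregated from per-seed results. A minimal, dependency-free sketch (the helper name is hypothetical):

```python
from statistics import mean, stdev

def summarize_methods(per_seed_acc: dict) -> dict:
    """Aggregate per-seed accuracy lists into mean/std per method
    and pick the method with the best mean accuracy."""
    methods = {
        name: {"accuracy": {"mean": round(mean(vals), 4),
                            "std": round(stdev(vals), 4)}}
        for name, vals in per_seed_acc.items()
    }
    best = max(methods, key=lambda m: methods[m]["accuracy"]["mean"])
    return {"methods": methods, "best_method": best}
```

This keeps the comparison fair in at least one respect: every method is summarized by the same statistic over the same number of seeds.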
Update `idea_evolution.md` and `research_log.jsonl` per the feedback protocol.

## Stage 3: Full Experiments

Goal: Run full-scale experiments with thorough evaluation.
1. Re-read `idea.md`; it may have been updated after Phase 2. Check `idea_evolution.md` to understand what changed and why.
2. Run the full-scale experiments and save the results:

```python
np.save("experiment_results/full_learning_curves.npy", curves)
np.save("experiment_results/full_comparison_results.npy", results)
```
```python
import json

summary = {
    "key_findings": ["..."],
    "best_configuration": {...},
    "statistical_tests": {...},
    "analysis": "...",
}

with open("research_summary.json", "w") as f:
    json.dump(summary, f, indent=2)
```
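For the `statistical_tests` field, one dependency-free option is a two-sided permutation test on the difference of means between two methods' per-seed scores. This is a sketch of one reasonable choice, not a requirement of the skill:

```python
import random
from statistics import mean

def permutation_test(a, b, n_perm=2000, seed=0):
    """Two-sided permutation test for a difference in means.
    Returns a smoothed p-value in (0, 1]."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if abs(mean(pooled[:len(a)]) - mean(pooled[len(a):])) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

With the handful of seeds typical at this stage, a permutation test avoids the normality assumption a t-test would make.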
## Stage 4: Ablations

Goal: Systematically validate which components of the proposed method matter.
1. Re-read `idea.md`; ablate the current version of the method, which may have evolved since Stage 1.
2. Choose components and hyperparameters to ablate (based on `idea.md` and Stage 3 results)
3. Run each ablation and save its results:

```python
np.save("experiment_results/ablation_no_attention.npy", results)
np.save("experiment_results/ablation_lr_sweep.npy", results)
```
```python
import json

summary = {
    "ablations": [
        {
            "name": "without_attention",
            "change": "Removed self-attention layer",
            "impact": {"accuracy": -0.05, "loss": +0.12},
            "conclusion": "Attention is important for ...",
        }
    ],
    "analysis": "The most critical component is ...",
}

with open("ablation_summary.json", "w") as f:
    json.dump(summary, f, indent=2)
```
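The `impact` entries are signed deltas of the ablated run against the full method, so a negative accuracy delta means the ablation hurt. A tiny illustrative helper:

```python
def ablation_impact(full_metrics: dict, ablated_metrics: dict) -> dict:
    """Signed per-metric change when a component is removed:
    ablated minus full, rounded for reporting."""
    return {k: round(ablated_metrics[k] - full_metrics[k], 4)
            for k in full_metrics}
```

Computing deltas in one place keeps the sign convention consistent across every ablation row in the summary.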
Run the feedback protocol one last time and finalize `idea.md`.

For each experiment run:
1. Execute `python runfile.py` (or the relevant script)
2. Check stdout for the printed metrics and verify each expected `.npy` file
3. If the run fails, fix the bug and re-run

After all 4 stages, the workspace should contain:
```
experiments/{idea_name}/
├── idea.md                   # Current idea (final version, may differ from v1)
├── idea.json                 # Metadata with version number
├── idea_evolution.md         # Complete history of idea changes
├── research_log.jsonl        # Full event log (all experiments + idea updates)
├── config.yaml
├── runfile.py                # Main experiment script(s)
├── baseline_summary.json     # Stage 2 summary
├── research_summary.json     # Stage 3 summary
├── ablation_summary.json     # Stage 4 summary
├── experiment_results/
│   ├── train_losses.npy
│   ├── test_accuracies.npy
│   ├── baseline_*.npy
│   ├── ablation_*.npy
│   ├── full_*.npy
│   └── *.png                 # Intermediate diagnostic plots
├── figures/                  # (populated by aggregate-plots skill)
└── logs/
    └── experiment_log.txt
```
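The per-run loop described above (execute the script, then parse `METRIC_NAME: VALUE` lines from stdout) can be sketched as follows; `run_experiment` is a hypothetical helper, and it uses `sys.executable` rather than a bare `python` so it runs under the current interpreter:

```python
import re
import subprocess
import sys

def run_experiment(script: str) -> dict:
    """Run a script and parse METRIC_NAME: VALUE lines from its stdout.
    Raises on a non-zero exit code so failures surface immediately."""
    proc = subprocess.run([sys.executable, script],
                          capture_output=True, text=True)
    if proc.returncode != 0:
        raise RuntimeError(f"{script} failed:\n{proc.stderr}")
    pattern = re.compile(r"^(\w+):\s*(-?\d+(?:\.\d+)?)\s*$", re.MULTILINE)
    return {name: float(value) for name, value in pattern.findall(proc.stdout)}
```

Surfacing stderr in the exception message gives the debug loop its starting point when a run fails.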