Use this skill when the user wants to debug, diagnose, or systematically iterate on an experiment that already exists, or when they need a structured experiment log for tracking runs, hypotheses, failures, results, and next steps during active research. Apply it to underperforming methods, training that will not converge, regressions after a change, inconsistent results across datasets, aimless experimentation without progress, and questions like 'why doesn't this work?', 'no progress after many attempts', or 'how should I investigate this failure?'. Also use it for setting up practical experiment logging/record-keeping that supports debugging and iteration. Do not use it for designing a brand-new experiment pipeline or full experiment program (use experiment-pipeline), generating research ideas, fixing isolated coding/syntax errors, or writing retrospective summaries into research memory/notes/knowledge bases.
A systematic approach to running, debugging, and iterating on research experiments. The critical skill is not running more experiments — it's understanding WHY experiments fail.
This skill is typically loaded from within `experiment-pipeline` when a stage attempt fails. After debugging, return to the pipeline's stage-gate structure to continue. It can also be used standalone for any experiment debugging.
Finding WHY experiments fail is the most critical research skill. Failing to analyze results leads to two failure modes:
The goal is not to run more experiments. The goal is to run the RIGHT experiments — ones that isolate causes and test specific hypotheses.
When an experiment fails or produces unexpected results, follow these five steps:
Gather concrete examples of bad results. Look at the actual outputs, not just aggregate metrics. What specifically went wrong? Are the failures systematic or random?
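One way to check whether failures are systematic or random is to bucket the bad cases by a hypothesized failure category and look at the counts. A minimal sketch, where the `classify` function and the category labels are assumptions you supply for your own task, not part of this skill:

```python
from collections import Counter

def categorize_failures(examples, classify):
    """Count failure categories, most common first.

    A few dominant categories suggest a systematic failure;
    a flat distribution suggests noise or many small causes.
    """
    counts = Counter(classify(ex) for ex in examples)
    return counts.most_common()

# Illustrative: tag empty model outputs vs. everything else.
classify = lambda ex: "empty_output" if not ex else "other"
print(categorize_failures(["", "", "x"], classify))
# → [('empty_output', 2), ('other', 1)]
```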
You need a baseline that works. Two ways to find one:
If you can't find any working version, simplify further until something works. There is always a simple enough version that works.
Starting from the working version, incrementally add complexity until it breaks:
This step isolates the cause. Without it, you're guessing.
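Steps 2 and 3 can be sketched as a forward bisection: start from the working baseline and enable one change at a time until the metric drops. Everything below (`run_experiment`, the component names, the threshold) is a hypothetical stand-in for your own pipeline:

```python
def isolate_breaking_change(components, run_experiment, threshold):
    """Enable components one at a time on top of a working baseline.

    Returns the first component whose addition drops the metric below
    `threshold`, or None if the full configuration still works.
    `run_experiment` takes a {component: bool} config and returns a metric
    where higher is better.
    """
    config = {name: False for name in components}  # the working baseline
    for name in components:
        config[name] = True
        if run_experiment(config) < threshold:
            return name  # the isolated cause
    return None

# Illustrative fake runner: the new loss term degrades the metric.
fake_run = lambda cfg: 0.9 - (0.3 if cfg["new_loss"] else 0.0)
culprit = isolate_breaking_change(
    ["augmentation", "new_loss", "scheduler"], fake_run, threshold=0.8
)
# → culprit == "new_loss"
```

The order of `components` matters when changes interact; if enabling them one at a time never reproduces the failure, the cause is an interaction and you should enable pairs next.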
Based on the isolated cause from Step 3:
Based on the confirmed cause:
research-ideation skill)

See references/debugging-methodology.md for detailed branching logic and a cause taxonomy.
Prioritize these rules during experimental work:
Every experiment should be logged with five sections. Use the template at assets/experiment-log-template.md.
| Section | What to Record |
|---|---|
| Purpose | Why you're running this experiment; what you expect to learn |
| Setting | Data, algorithm changes, hyperparameters — everything needed to reproduce |
| Results | Quantitative metrics + qualitative observations + specific good/failure cases |
| Analysis | Do results match expectations? If not, hypothesized causes ranked by likelihood |
| Next Steps | What to do based on the analysis — YOU are the project leader |
The "Next Steps" section is the most important. Don't wait for someone to tell you what to do next. Analyze your results and propose the next experiment yourself. This is what distinguishes a researcher from a technician.
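If you want the five sections enforced mechanically, here is a minimal sketch of a log-entry renderer. The function name and markdown layout are assumptions for illustration; the canonical template is assets/experiment-log-template.md:

```python
from datetime import date

SECTIONS = ["Purpose", "Setting", "Results", "Analysis", "Next Steps"]

def render_log_entry(title, **sections):
    """Render a markdown experiment-log entry with all five sections.

    Sections are passed as keyword args (e.g. next_steps="...");
    missing ones are emitted as 'TODO' so gaps stay visible, not silent.
    """
    lines = [f"## {title} ({date.today().isoformat()})"]
    for name in SECTIONS:
        key = name.lower().replace(" ", "_")
        lines.append(f"### {name}")
        lines.append(sections.get(key, "TODO"))
    return "\n".join(lines)

entry = render_log_entry(
    "lr sweep",
    purpose="Find the largest stable learning rate",
    next_steps="Rerun best lr with 3 seeds",
)
```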
Cross-cycle learning: If using `experiment-pipeline`, your experiment logs feed into `evo-memory`'s ESE (Experiment Strategy Evolution) mechanism. Tag reusable strategies with `[Reusable]` so ESE can extract them for future cycles.
After completing the 5-step diagnostic flow, return to experiment-pipeline with:
When experiments succeed and you have a complete set of results, pass these artifacts to paper-writing:
| Artifact | Source | Used By |
|---|---|---|
| Final experiment results (tables and figures) | Experiment logs | Experiments section |
| Ablation study results | Diagnostic experiments | Ablation tables |
| Failure case analysis | Step 1 + Step 3 | Limitations discussion |
| Key implementation details and tricks | Steps 3-5 | Method section / Supplementary |
| Baseline comparison results | Step 2 | Comparison tables |
| Topic | Reference File | When to Use |
|---|---|---|
| Debugging methodology | debugging-methodology.md | Diagnosing why experiments fail |
| Experiment log template | experiment-log-template.md | Recording experiment details |