Use when research direction is validated (G1 passed) and the project needs method design, analysis framework, evaluation protocol, or story line — this is Phase 3 and must complete before any code or experiments
This is the most critical branching point in the research workflow. Type M and Type D projects follow completely different design paths. Executing the wrong path — or skipping design entirely — guarantees wasted effort.
<IRON-LAW> Whenever Phase 3 raises a question you cannot answer from the material at hand (a baseline's reported numbers, whether an idea has already been published, what a tool actually does): SEARCH IMMEDIATELY. Do not guess, do not rely on memory, do not defer to "we'll check later." Use web search, arXiv, Google Scholar, or any available tools to find the answer NOW.

If search fails (no results, paywall, or tools unavailable): do NOT fabricate an answer. Instead, report what was searched, state explicitly what remains unverified, and ask the user how to proceed.

Update the literature record: any new papers found during Phase 3 should be added to docs/02_literature/paper-list.md with a note: [Found during Phase 3 — method design].

This applies equally to Phases 4, 5, and 6. Literature retrieval is a continuous activity, not a one-time Phase 1 task. </IRON-LAW>

<IRON-LAW> For Type M, the method design determines whether the paper will be a genuine contribution or a trivial variation. A single agent designing a method has inherent blind spots — it may overestimate novelty, underestimate baselines, or miss simpler alternatives. Multi-agent deliberation catches these before months of implementation. </IRON-LAW>

<IRON-LAW> For Type D, the data IS the foundation. A poor data choice guarantees a weak paper regardless of how good the analysis is. Before designing the analysis framework, validate the data choice. </IRON-LAW>

<IRON-LAW> Even for Type D, the analysis framework must be vetted by multiple perspectives. A poorly designed analysis plan leads to shallow findings that reviewers will tear apart. The analysis must be SUFFICIENT in scope — it's better to plan slightly more analyses than needed and filter in Phase 5, than to discover gaps after execution. </IRON-LAW>
If the user returns to Phase 3 from Phase 4a exploration (because baseline results or data analysis revealed the method/analysis design needs rethinking), do NOT start from scratch:
Start from the Phase 4a exploration report (docs/05_execution/phase4a-exploration-report.md) and revise only the parts of the plan that the new findings call into question.

digraph branching {
rankdir=TB;
start [label="G1 passed\nresearch-anchor.yaml confirmed" shape=doublecircle];
check [label="research_type?" shape=diamond];
M [label="Type M Path\n(Method Development)" shape=box style=filled fillcolor="#d4edda"];
D [label="Type D Path\n(Discovery / Data Analysis)" shape=box style=filled fillcolor="#cce5ff"];
C [label="Type C Path\n(Tool / Software)" shape=box style=filled fillcolor="#e0d4ff"];
H [label="Type H Path\nExecute BOTH tracks\nMark primary track" shape=box style=filled fillcolor="#fff3cd"];
gate [label="G2 Gate\nPlan Freeze Checklist" shape=doubleoctagon style=filled fillcolor="#f8d7da"];
start -> check;
check -> M [label="M"];
check -> D [label="D"];
check -> C [label="C"];
check -> H [label="H"];
M -> gate;
D -> gate;
C -> gate;
H -> gate;
}
Read research_type from docs/01_intake/research-anchor.yaml. If absent, STOP and invoke domain-anchoring first.
REQUIRED SUB-SKILL: Invoke amplify:evaluation-protocol-design if available. If unavailable, follow the procedure below.
For each item, propose with justification, then get explicit user confirmation before proceeding:
- Primary (and any secondary) metrics
- Datasets and splits
- Random seeds (e.g., [42, 123, 456, 789, 1024])
- Execution regime (epochs, hyperparameter tuning budget)

Write to docs/03_plan/evaluation-protocol.yaml using templates/evaluation-protocol.yaml. Set locked: true only after user confirms every item.
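A minimal sketch of what the drafted protocol might contain; the field names below are assumptions for illustration, and the real schema in templates/evaluation-protocol.yaml takes precedence:

```yaml
# Illustrative sketch only; the real schema comes from templates/evaluation-protocol.yaml
primary_metric: accuracy            # assumption: replace with the confirmed metric
datasets:
  - name: example-benchmark         # placeholder dataset name
    split: standard_test            # placeholder split
seeds: [42, 123, 456, 789, 1024]
execution_regime:
  epochs: 100                       # placeholder
  hp_budget: "same search budget for all methods"
baselines: []                       # filled in during M-Step 3
locked: false                       # set to true only after the user confirms every item
```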
Build on literature insights and cross-domain inspiration from Phase 1–2.
Write innovation_point and value_proposition to research-anchor.yaml.
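For illustration only, the two fields might look like this in the anchor file (values are placeholders; the actual schema is whatever Phase 1 established):

```yaml
# docs/01_intake/research-anchor.yaml -- fields written in this step (placeholder values)
innovation_point: "One-sentence statement of what is genuinely new relative to prior work"
value_proposition: "Why the result matters to the target venue's audience"
```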
3a. Consolidate baseline list:
- user_specified baselines from Phase 1 (docs/02_literature/baseline-collection.md) → these are automatically must_include
- Additional baselines surfaced by the literature review → mark the strongest of these as must_include as well
- For each baseline, record the planned source of results: official_repo / our_implementation / paper_reported

3b. Check for pre-existing results:
Ask: "Do you already have results for any of these baselines? (e.g., from your own prior experiments, or from a standardized benchmark leaderboard)"
If user provides pre-computed results, for each one verify:
- The source of each result: user_own_run / paper_reported / leaderboard
- Whether it was produced under conditions comparable to our protocol (checked formally below)

Record in the evaluation-protocol.yaml baselines section with a pre_computed field.
Compatibility check — once evaluation-protocol.yaml is drafted (M-Step 1), compare pre-computed results against the locked protocol:
| Check | Compatible | Action if incompatible |
|---|---|---|
| Same metric | ✅ | Use directly |
| Different metric | ❌ | Must re-run with correct metric |
| Same dataset + split | ✅ | Use directly |
| Different split | ❌ | Must re-run with correct split |
| ≥ 3 seeds | ✅ | Use directly (even if different seed values) |
| < 3 seeds or unknown | ⚠️ | Accept as reference but re-run for official comparison |
| Same execution regime (e.g., same epochs, same HP budget) | ✅ | Use directly |
| Different regime | ⚠️ | Use as reference; consider re-running for fairness |
Mark each pre-computed result as:
- accepted — fully compatible, no re-run needed
- reference_only — useful context but must re-run for official comparison
- incompatible — must re-run

Write to the evaluation-protocol.yaml baselines section.
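A hedged sketch of how one baseline entry could be recorded; must_include, the implementation sources, pre_computed, and the three status values come from this skill, while the surrounding field names are assumptions:

```yaml
baselines:
  - name: strong-baseline-2023        # placeholder name
    must_include: true                # user-specified baselines are automatically must_include
    implementation: official_repo     # official_repo / our_implementation / paper_reported
    pre_computed:
      source: leaderboard             # user_own_run / paper_reported / leaderboard
      status: reference_only          # accepted / reference_only / incompatible
      note: "same dataset and metric, but only one seed; re-run for the official comparison"
```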
For each key component of the proposed method:
| Component | Ablation Method | Expected Impact |
|---|---|---|
| (fill) | Remove / replace / simplify | (predicted Δ on primary metric) |
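If it helps to keep the ablation plan machine-readable alongside the protocol, a sketch is below; the file location, component names, and field names are assumptions, not part of the prescribed templates:

```yaml
# e.g., docs/03_plan/ablation-plan.yaml (hypothetical location)
ablations:
  - component: attention-gating       # placeholder component name
    method: remove                    # remove / replace / simplify
    expected_impact: "-1.5 points on the primary metric"   # predicted delta, verified in Phase 4
  - component: auxiliary-loss
    method: replace
    expected_impact: "slower convergence, similar final score"
```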
If the proposed method makes claims that can be theoretically grounded, plan the theory work here. This is OPTIONAL — not all method papers need theory. But if the target venue values theoretical justification, this significantly strengthens the paper.
When to include theoretical analysis:
When NOT needed:
If included, plan:
Record in docs/03_plan/theoretical-analysis-plan.md.
<IRON-LAW> Multi-agent deliberation is ESPECIALLY critical for Type M. Method development research lives or dies by whether the idea is both novel AND effective. </IRON-LAW>
REQUIRED SUB-SKILL: Follow amplify:multi-round-deliberation protocol (max 5 rounds).
Agent composition for Type M:
| Agent | Role | What They Optimize For |
|---|---|---|
| Innovation Advisor | Creativity and novelty expert | Is this idea truly novel? Could it be more creative? What cross-domain inspiration is being missed? |
| Technical Architect | Implementation and scalability expert | Can this actually be built and trained? Are there engineering bottlenecks? Is the design clean? |
| Baseline Devil's Advocate | Competitor and ablation expert | Can a simpler method achieve the same result? Are baselines strong enough? What will reviewers compare against? |
Dispatch — Round 1 (parallel):
Prepare the context package for all three agents: proposed method design, innovation point, value proposition, key literature, the drafted evaluation protocol (M-Step 1), the baseline list and any pre-computed results, the ablation plan, and available resources (compute, data, timeline).
Agent 1 — Innovation Advisor:
Call Task tool with:
description: "Innovation Advisor — evaluate method novelty"
prompt: |
SHARED VALUES:
Target venue: [venue]. Research type: Type M.
Optimization target: "Is this method genuinely novel and impactful?"
Scoring: PASS / CONDITIONAL / FAIL
You are a senior researcher known for creative, cross-disciplinary
thinking. You have a track record of identifying breakthrough ideas
and a nose for "just engineering" vs. "true innovation."
PROPOSED METHOD: [paste method design]
INNOVATION POINT: [paste]
VALUE PROPOSITION: [paste]
LITERATURE CONTEXT: [paste key related work]
Evaluate:
1. NOVELTY: Is the innovation point genuinely new, or a
straightforward combination of existing ideas? Be honest.
2. CROSS-DOMAIN INSPIRATION: Are there ideas from other fields
(e.g., NLP for CV, physics for ML, biology for algorithms)
that could strengthen or replace this approach?
3. SIMPLICITY TEST: Is the method more complex than it needs to
be? Could a simpler design achieve the same goal?
4. IMPACT POTENTIAL: If this works perfectly, how much does it
move the field forward? Incremental improvement or paradigm shift?
5. SUGGESTED IMPROVEMENTS: Propose 2-3 ways to make the method
more novel, more elegant, or more impactful.
6. ADDITIONAL LITERATURE NEEDED: Are there papers or techniques
that should be reviewed before finalizing? List specific queries.
7. VERDICT: PASS / CONDITIONAL / FAIL
subagent_type: "generalPurpose"
Agent 2 — Technical Architect:
Call Task tool with:
description: "Technical Architect — evaluate method feasibility"
prompt: |
SHARED VALUES:
Target venue: [venue]. Research type: Type M.
Optimization target: "Can this be built, trained, and evaluated rigorously?"
Scoring: PASS / CONDITIONAL / FAIL
You are a senior ML/systems engineer who has implemented dozens
of research prototypes. You know the difference between a method
that looks good on paper and one that actually works in practice.
PROPOSED METHOD: [paste method design]
EVALUATION PROTOCOL: [paste]
AVAILABLE RESOURCES: [paste compute, data, timeline]
Evaluate:
1. IMPLEMENTABILITY: Can this be coded cleanly? Are there
ambiguous design choices that will cause problems?
2. TRAINING/EXECUTION: Are there likely stability issues,
convergence problems, or scalability bottlenecks?
3. EVALUATION RIGOR: Is the evaluation protocol sufficient?
Are there missing experiments that reviewers will demand?
4. RESOURCE FIT: Can this be completed with available resources?
5. FAILURE MODE ANALYSIS: What are the top 3 ways this method
could fail? For each, is there a mitigation plan?
6. SUGGESTED IMPROVEMENTS: How to make the design more robust,
more efficient, or easier to ablate?
7. VERDICT: PASS / CONDITIONAL / FAIL
subagent_type: "generalPurpose"
Agent 3 — Baseline Devil's Advocate:
Call Task tool with:
description: "Baseline Devil's Advocate — challenge method necessity"
prompt: |
SHARED VALUES:
Target venue: [venue]. Research type: Type M.
Optimization target: "Is this method actually necessary, or can existing methods do the job?"
Scoring: PASS / CONDITIONAL / FAIL
You are a skeptical reviewer who has seen too many papers claim
novelty while ignoring strong baselines. Your job is to ensure
the proposed method is genuinely needed and properly compared.
PROPOSED METHOD: [paste method design]
BASELINE LIST: [paste]
PRE-COMPUTED RESULTS: [paste if any]
ABLATION PLAN: [paste]
Evaluate:
1. BASELINE STRENGTH: Are the baselines strong enough? Are any
obvious competitors missing? Would a reviewer say "but you
didn't compare with X"?
2. SIMPLER ALTERNATIVE: Could a simpler method (e.g., a well-tuned
existing method, a straightforward ensemble, a basic modification)
achieve comparable results? Propose specific alternatives.
3. ABLATION COMPLETENESS: Does the ablation plan cover all key
design choices? Are there components that seem unnecessary?
4. EXPERIMENT SUFFICIENCY: Are there enough experiments to convince
a skeptical reviewer? Suggest additional experiments that would
strengthen the paper.
5. FAIRNESS: Are baselines given fair computational budgets,
hyperparameter tuning, and data access?
6. EXPERIMENT VOLUME: Is the experimental plan SUFFICIENT for
the target venue, or does it need more content? Suggest
experiments to add (it's better to plan slightly more and
filter later than to discover gaps in Phase 5).
7. VERDICT: PASS / CONDITIONAL / FAIL
subagent_type: "generalPurpose"
Multi-round deliberation:
After Round 1, follow amplify:multi-round-deliberation:
If Innovation Advisor requests additional literature: Execute a targeted search IMMEDIATELY (web search, arXiv) before the next round. Add findings to the literature record.
Present deliberation results to user:
Method Design Deliberation Results:
════════════════════════════════════
Rounds completed: [N] / max 5
Method design changes through deliberation:
Round 1 → 2: [what changed]
Round 2 → 3: [what changed]
...
Agent verdicts:
Innovation Advisor: [PASS/COND/FAIL] — [summary]
Technical Architect: [PASS/COND/FAIL] — [summary]
Baseline Devil's Advocate: [PASS/COND/FAIL] — [summary]
Additional experiments suggested:
REQUIRED: [list]
RECOMMENDED: [list]
Additional literature found during deliberation:
[list new papers if any]
Present the complete narrative arc (refined by deliberation) for user approval:
Motivation → Insight → Method → Expected Results → Analysis Plan
Do NOT proceed until user explicitly approves.
Type D does NOT need alternative method comparison. Type D is results-driven, not method-driven.
REQUIRED SUB-SKILL: Invoke amplify:analysis-storyboard-design before finalizing the Type D plan. If unavailable, follow the D-Step process below manually.
Check 1 — Has this dataset been exhaustively analyzed? Search for papers and tutorials that have analyzed this exact dataset. If the dataset is a standard tutorial/benchmark dataset (e.g., PBMC 3k, Iris, MNIST), it has almost certainly been analyzed with the exact tools you plan to use.
Check 2 — Can standard tools on this data yield NEW findings? If the answer is "probably not" (e.g., everyone already knows what cell types are in PBMC 3k), then either switch to a less-studied dataset or change the analytical angle (see the alternatives below).
Check 3 — Does the data have enough complexity for discovery? A single-sample, well-characterized dataset (e.g., one healthy donor's blood) may not contain enough biological variation to support novel findings. Consider: multiple conditions? Disease vs. healthy? Time series? Multiple tissues?
If the data fails any check, present alternatives to the user:
The dataset [name] may not support novel discovery because:
[reason from checks above]
Alternatives that could yield stronger findings:
1. [Suggest a less-studied dataset in the same domain]
2. [Suggest adding a comparative dimension]
3. [Suggest a different analytical angle]
Which direction would you prefer?
Plan analysis dimensions in order:
Define the story structure:
IRON LAW: ANALYSIS MUST BE COMPREHENSIVE. NO SHORTCUTS.
Sufficiency checklist — every box must be checked before G2:
For each expected finding, list plausible alternative explanations that analyses must rule out.
| Expected Finding | Alternative Explanation | How to Rule Out |
|---|---|---|
| (fill) | (fill) | (specific analysis or control) |
REQUIRED SUB-SKILL: Follow amplify:multi-round-deliberation protocol (max 5 rounds).
Agent composition for Type D:
| Agent | Role | What They Optimize For |
|---|---|---|
| Domain Scientist | Expert in the biological/scientific domain | Are we asking the right biological questions? Are the expected findings meaningful? |
| Methodology Consultant | Expert in analytical methods | Are the right tools/methods chosen? Are there better analytical approaches? |
| Statistical Rigor Advisor | Expert in experimental design and statistics | Are the analyses statistically sound? Are confounders handled? Is the evidence sufficient? |
Dispatch — Round 1 (parallel):
Context package: research question, data description, analysis framework (D-Steps 1-3), sufficiency criteria, alternative explanations table, literature review.
Agent 1 — Domain Scientist:
Call Task tool with:
description: "Domain Scientist — evaluate analysis plan"
prompt: |
SHARED VALUES:
Target venue: [venue]. Research type: Type D.
Optimization target: "Will this analysis produce biologically/scientifically meaningful findings?"
Scoring: PASS / CONDITIONAL / FAIL
You are a domain expert in [SPECIFIC FIELD] with deep knowledge
of the biological/scientific context. You know what findings would
be genuinely interesting vs. trivially expected.
RESEARCH QUESTION: "[question]"
DATA: [description]
ANALYSIS FRAMEWORK: [paste D-Steps 1-3]
EXPECTED FINDINGS: [paste]
Evaluate:
1. BIOLOGICAL RELEVANCE: Does the analysis plan target the RIGHT
biological questions? Are there more interesting angles being missed?
2. EXPECTED NOVELTY: If every analysis succeeds, will the findings
be genuinely new to the field? Or just confirming known biology?
3. MISSING ANALYSES: What analyses would a domain expert EXPECT
to see in this paper that are not in the plan? Be specific.
4. INTERPRETATION DEPTH: Will the planned analyses support deep
biological interpretation, or just surface-level description?
5. SUFFICIENCY: Are there enough content points to fill a
complete paper at [venue]? Suggest additions if not.
6. VERDICT: PASS / CONDITIONAL / FAIL
subagent_type: "generalPurpose"
Agent 2 — Methodology Consultant:
Call Task tool with:
description: "Methodology Consultant — evaluate analytical approaches"
prompt: |
SHARED VALUES:
Target venue: [venue]. Research type: Type D.
Optimization target: "Are we using the best available methods for this analysis?"
Scoring: PASS / CONDITIONAL / FAIL
You are a methods expert familiar with the state-of-the-art
analytical tools in [FIELD]. You know which methods are standard,
which are cutting-edge, and which are outdated.
ANALYSIS FRAMEWORK: [paste]
DATA CHARACTERISTICS: [paste]
TOOLS PLANNED: [paste]
Evaluate:
1. METHOD APPROPRIATENESS: Are the chosen methods appropriate
for the data type and research question? Are there better alternatives?
2. STATE-OF-THE-ART: Are we using current best practices, or
outdated approaches that reviewers will criticize?
3. METHODOLOGICAL GAPS: Are there analytical techniques from
recent literature that could strengthen the findings?
4. ROBUSTNESS: Are there sufficient cross-validation, sensitivity
analyses, and alternative approaches to rule out artifacts?
5. ADDITIONAL LITERATURE NEEDED: Suggest specific method papers or
technique descriptions to review before finalizing.
6. VERDICT: PASS / CONDITIONAL / FAIL
subagent_type: "generalPurpose"
Agent 3 — Statistical Rigor Advisor:
Call Task tool with:
description: "Statistical Advisor — evaluate analysis rigor"
prompt: |
SHARED VALUES:
Target venue: [venue]. Research type: Type D.
Optimization target: "Will the evidence be statistically convincing?"
Scoring: PASS / CONDITIONAL / FAIL
You are a statistician/quantitative methodologist who reviews
papers for analytical rigor. You catch p-hacking, multiple
comparisons problems, insufficient sample sizes, and unjustified
statistical claims.
ANALYSIS FRAMEWORK: [paste]
DATA: [sample sizes, groups, variables]
SUFFICIENCY CRITERIA: [paste]
ALTERNATIVE EXPLANATIONS: [paste table]
Evaluate:
1. STATISTICAL VALIDITY: Are the planned tests appropriate?
Are assumptions met? Are multiple comparisons corrected?
2. CONFOUNDERS: Are all known confounders addressed in the
analysis plan? What batch effects, technical artifacts, or
demographic variables could confound results?
3. SAMPLE SIZE: Is the data large enough to detect the
expected effects with adequate power?
4. ALTERNATIVE EXPLANATIONS: Does the plan adequately test
the alternative explanations listed? Are any alternatives
missing?
5. REPRODUCIBILITY: Will someone else be able to reproduce
these analyses? Are all choices documented?
6. EXPERIMENT VOLUME: Plan slightly more analyses than strictly
needed — better to have extra content to filter than to
discover gaps later.
7. VERDICT: PASS / CONDITIONAL / FAIL
subagent_type: "generalPurpose"
Multi-round deliberation: Same protocol as Type M — synthesize, refine, search additional literature if flagged, re-dispatch all agents, repeat until convergence or 5 rounds.
Present results to user (same format as Type M deliberation results).
Present the narrative arc (refined by deliberation) for user approval. Get explicit approval before G2.
Type C is utility-driven. The contribution is not algorithmic novelty (Type M) or scientific discovery (Type D), but a tool that solves a real problem better than existing alternatives — faster, more usable, more scalable, or covering an unmet need.
State precisely what the tool does better than alternatives:
Tool papers need a DIFFERENT evaluation from method papers:
| Evaluation Dimension | What to Measure | How |
|---|---|---|
| Correctness | Does it produce correct results? | Compare output vs reference / gold standard |
| Performance | How fast? Memory? | Benchmark suite (standardized if available) |
| Scalability | How does it scale with input size? | Scaling curves (1×, 10×, 100× input) |
| Comparison | How does it compare to existing tools? | Head-to-head on same benchmarks |
| Usability (optional) | How easy to use? | API design review, installation test, example workflows |
| Case studies | Does it work on real problems? | 2-3 real-world use cases from the domain |
Record in docs/03_plan/evaluation-protocol.yaml (adapted for tool evaluation).
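A hedged sketch of what the adapted protocol could contain; the dimensions mirror the table above, but every field name and value here is an assumption rather than a prescribed schema:

```yaml
# evaluation-protocol.yaml adapted for a tool paper (illustrative only)
correctness:
  reference: "gold-standard outputs from an existing validated tool"   # placeholder
performance:
  benchmarks: ["standard suite if one exists, else project-defined workloads"]
scalability:
  input_scales: [1, 10, 100]              # multiples of the base input size
comparison:
  tools: ["competitor-A", "competitor-B"] # placeholder names
case_studies: 2                           # real-world use cases from the domain
locked: false
```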
List ALL existing tools that address the same or similar problem:
| Tool | Strengths | Weaknesses | Our Advantage |
|---|---|---|---|
| ... | ... | ... | ... |
This becomes the comparison table in the paper.
REQUIRED SUB-SKILL: Follow amplify:multi-round-deliberation protocol (max 5 rounds).
Agent composition for Type C:
| Agent | Role | What They Optimize For |
|---|---|---|
| Target User | Domain practitioner who would use the tool | Does this solve my actual workflow? Is the API intuitive? What's missing? |
| Competing Tool Expert | Expert on existing tools in this space | How does this compare? What advantages do competitors have that we're ignoring? |
| Software Quality Advisor | Software engineering and architecture expert | Is the codebase maintainable? Documentation sufficient? Installation easy? Edge cases handled? |
All three agents get: tool description, API design, evaluation plan, competing tools list.
Each agent evaluates: utility, completeness, comparison fairness, documentation quality.
Present deliberation results to user, then proceed to G2.
Present the narrative arc for user approval:
Problem (user pain) → Why existing tools fail → Our tool's approach → Benchmark results → Real-world case studies
Execute both Type M and Type D tracks. Before starting, ask user: "Which track is primary — the method contribution or the discovery contribution?" Mark the answer in research-anchor.yaml under primary_track.
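For example (the value is a placeholder; only the primary_track field itself is prescribed above):

```yaml
# research-anchor.yaml
research_type: H
primary_track: M    # M if the method contribution leads, D if the discovery contribution leads
```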
For Type H, treat ALL of these as mandatory:
- amplify:evaluation-protocol-design
- amplify:analysis-storyboard-design

The two deliberations can run sequentially (primary track first) or, if independent, in parallel.
ALL applicable items must be satisfied. Present this checklist to the user for sign-off.
Type M items:
- evaluation-protocol.yaml locked
- innovation_point marked

Type D items:
Type C items:
Universal items:
- value_proposition written in research-anchor.yaml
- Venue alignment confirmed (venue-alignment check)

Only after ALL applicable items pass: set G2 status to passed and proceed to the execution phase.
| Excuse | Reality |
|---|---|
| "Let me just try something first" | Undirected experiments waste compute and produce uninterpretable results. Design first. |
| "The evaluation can be decided later" | Post-hoc evaluation selection is cherry-picking. Lock metrics NOW. |
| "This is too simple to need a full design" | Simple methods still need locked evaluation, baselines, and a story line. Simplicity ≠ no planning. |
| "I already know what will work" | Then it should be trivial to write down. If you can't write it, you don't know it. |
| "Baselines are obvious" | Obvious to you ≠ convincing to reviewers. Document and confirm. |
| "Analysis doesn't need a framework" | Unstructured analysis produces shallow, incomplete results every time. |
This skill branches on research_type from research-anchor.yaml and, once G2 is approved, hands off to experiment-execution.

<IRON-LAW>
## ⛔ MANDATORY STOP — After G2 Gate

After presenting the G2 gate checklist, END YOUR RESPONSE IMMEDIATELY.
Do NOT invoke experiment-execution in this same response.
Do NOT begin implementing code, downloading data, or setting up experiments.
Do NOT proceed to G3 check.
STOP. WAIT. The user must approve the frozen plan before any execution begins.
Your final output should be the G2 checklist followed by: "G2 gate checklist is above. Plan is frozen. Do you approve proceeding to Phase 4 (Experiment Execution)?"
Then STOP. </IRON-LAW>