Iterative deep refinement of a research idea via Problem Anchor + skeleton extraction + external LLM review. Turns a rough idea into a venue-ready proposal. Use when user says "refine idea", "打磨idea", "deepen this idea", "flesh out", "细化方案", or wants to turn a rough idea into a focused, concrete proposal.
Refine and concretize: $ARGUMENTS
Use this skill when a research idea exists but needs to be turned into a concrete, venue-ready proposal. The goal is not to produce a bloated proposal or a benchmark shopping list. The goal is to turn a rough idea into a problem -> focused method -> minimal validation document that is concrete enough to implement, elegant enough to feel paper-worthy, and current enough to resonate in the foundation-model era.
Four principles dominate this skill: problem fidelity (never drift from the anchored bottleneck), method specificity (concrete enough to implement), contribution focus (one dominant mechanism, no sprawl), and frontier leverage (use modern primitives only when they are the natural fit).
User input (idea + rough approach)
-> Phase 0 (Claude): Freeze Problem Anchor
-> Phase 0.5 (Claude): Skeleton Extraction
-> Phase 1 (Claude): Scan grounding papers -> identify technical gap -> choose the sharpest route -> write focused proposal
-> Phase 2 (Codex/GPT-5.4): Review for fidelity, specificity, contribution quality, and frontier leverage
-> Phase 3 (Claude): Parse + Top-2 diagnosis + Skeleton gap check -> revise method -> rewrite full proposal
-> Phase 4 (Codex, same thread): Re-evaluate revised proposal
-> Repeat Phase 3-4 until OVERALL SCORE >= 9 or MAX_ROUNDS reached
-> Phase 5: Save full history to refine-logs/
- gpt-5.4 — Reviewer model used via Codex MCP.
- refine-logs/ — Directory for round files and final report.

Override via argument if needed, e.g. /idea-refine "problem | approach" -- max rounds: 2, threshold: 9.
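If the override syntax above needs to be handled programmatically, it can be parsed with a small helper. This is a sketch only; the default values below (5 rounds, threshold 9) are assumptions for illustration, not part of the spec:

```python
import re

def parse_overrides(arguments: str, max_rounds: int = 5, threshold: int = 9):
    """Split '$ARGUMENTS' into the idea text and optional overrides
    of the form '-- max rounds: N, threshold: M'."""
    if "--" in arguments:
        idea, _, overrides = arguments.partition("--")
        m = re.search(r"max rounds:\s*(\d+)", overrides)
        if m:
            max_rounds = int(m.group(1))
        m = re.search(r"threshold:\s*(\d+)", overrides)
        if m:
            threshold = int(m.group(1))
    else:
        idea = arguments
    return idea.strip(), max_rounds, threshold
```

With no `--` present, the whole argument string is treated as the idea and the defaults apply.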
refine-logs/
├── skeleton.md
├── round-0-initial-proposal.md
├── round-1-review.md
├── round-1-refinement.md
├── round-2-review.md
├── round-2-refinement.md
├── ...
├── REVIEW_SUMMARY.md
├── FINAL_PROPOSAL.md
├── REFINEMENT_REPORT.md
└── score-history.md
outputs/
└── PIPELINE_LOG.md # Autonomous decision log (appended each run)
Every round-N-refinement.md must contain a full anchored proposal, not just incremental fixes.
Before proposing anything, extract the user's immutable bottom-line problem. This anchor must be copied verbatim into every proposal and every refinement round.
Write:
If later reviewer feedback would change the problem being solved, mark that as drift and push back or adapt carefully.
Before writing the proposal, extract the logical skeleton of the idea. This step forces clarity about what the proposal must communicate and serves as the structural yardstick for every subsequent phase.
Answer the root question:
"This idea's mission is to move the reviewer from State A to State B. What are A and B?"
This skeleton is the sole yardstick for the proposal structure. Every section must map to a skeleton step. If a section doesn't map, it's either unnecessary or the skeleton is incomplete.
State A: [What the reviewer currently believes]
State B: [What the reviewer must believe after reading]
Skeleton: Step 1 -> Step 2 -> ... -> Step N
For each skeleton step, write one sentence explaining why this step is non-skippable (what breaks in the reader's understanding if it is omitted).
Save to refine-logs/skeleton.md.
Check papers/ and literature/ first. Read only the relevant parts needed to answer:
If local material is insufficient, search recent top-venue/arXiv work online. Focus on method sections, training setup, and failure modes, not just abstracts.
Do not stop at generic research questions. Make the gap operational:
Before locking the method, compare two candidate routes if both are plausible:
Then decide:
If both routes are weak, rethink the framing instead of combining them into a larger system by default.
The proposal must answer "how would we actually build this?" Prefer method detail over broad experimentation and prefer reuse over invention.
Cover:
If the method is still only described as "add a module" or "use a planner," it is not concrete enough.
Instead of a full claim-driven validation section with detailed baselines and ablation designs, write a lightweight evaluation sketch. Detailed experiment planning belongs in a separate /experiment-plan step, not here.
## Evaluation Sketch
- How would this idea be validated? (1-3 sentences)
- What is the key metric?
- What would success look like?
- What would failure look like?
## Resource Estimate
- Scale: SMALL (1 person-week) / MEDIUM (2-4 person-weeks) / LARGE (1-2 person-months)
- Compute: LOW (single GPU) / MEDIUM (multi-GPU) / HIGH (cluster)
- Data: Available / Needs collection / Needs annotation
Save to refine-logs/round-0-initial-proposal.md.
Use this structure:
# Research Proposal: [Title]
## Problem Anchor
- Bottom-line problem:
- Must-solve bottleneck:
- Non-goals:
- Constraints:
- Success condition:
## Skeleton
- State A:
- State B:
- Skeleton Path: Step 1 -> Step 2 -> ... -> Step N
## Technical Gap
[Why current methods fail, why naive bigger systems are not enough, and what mechanism is missing]
## Method Thesis
- One-sentence thesis:
- Why this is the smallest adequate intervention:
- Why this route is timely in the foundation-model era:
## Contribution Focus
- Dominant contribution:
- Optional supporting contribution:
- Explicit non-contributions:
## Proposed Method
### Complexity Budget
- Frozen / reused backbone:
- New trainable components:
- Tempting additions intentionally not used:
### System Overview
[Step-by-step pipeline or ASCII graph]
### Core Mechanism
- Input / output:
- Architecture or policy:
- Training signal / loss:
- Why this is the main novelty:
### Optional Supporting Component
- Only include if truly necessary:
- Input / output:
- Training signal / loss:
- Why it does not create contribution sprawl:
### Modern Primitive Usage
- Which LLM / VLM / Diffusion / RL-era primitive is used:
- Exact role in the pipeline:
- Why it is more natural than an old-school alternative:
### Integration into Base Generator / Downstream Pipeline
[Where the new method attaches, what is frozen, what is trainable, inference order]
### Failure Modes and Diagnostics
- [Failure mode]:
- [How to detect]:
- [Fallback or mitigation]:
### Novelty and Elegance Argument
[Closest work, exact difference, why this is a focused mechanism-level contribution rather than a module pile-up]
## Evaluation Sketch
- How would this idea be validated?
- What is the key metric?
- What would success look like?
- What would failure look like?
## Resource Estimate
- Scale: SMALL / MEDIUM / LARGE
- Compute: LOW / MEDIUM / HIGH
- Data: Available / Needs collection / Needs annotation
Send the full proposal to GPT-5.4 for an elegance-first, frontier-aware, method-first review. The reviewer should spend most of the critique budget on the method itself, not on expanding the experiment menu.
mcp__codex__codex:
model: REVIEWER_MODEL
config: {"model_reasoning_effort": "xhigh"}
prompt: |
You are a senior ML reviewer for a top venue (NeurIPS/ICML/ICLR).
This is an early-stage, method-first research proposal.
Your job is NOT to reward extra modules, contribution sprawl, or a giant benchmark checklist.
Your job IS to stress-test whether the proposed method:
(1) still solves the original anchored problem,
(2) is concrete enough to implement,
(3) presents a focused, elegant contribution,
(4) uses foundation-model-era techniques appropriately when they are the natural fit.
Review principles (enforce these strictly):
1. Trace the reader's journey, not the author's intent
2. Every claim needs a reason the reader already has
3. Concepts before terms — don't introduce jargon before the framework
4. One concept, one name — terminology consistency
5. Don't defend, state — no defensive language
6. Precision scales with commitment — don't over-promise
7. Venue awareness — evaluate for the target venue
Additional review stance:
- Prefer the smallest adequate mechanism over a larger system.
- Penalize parallel contributions that make the paper feel unfocused.
- If a modern LLM / VLM / Diffusion / RL route would clearly produce a better paper, say so concretely.
- If the proposal is already modern enough, do NOT force trendy components.
- Do not ask for extra experiments unless they are needed to prove the core claims.
Read the Problem Anchor first. If your suggested fix would change the problem being solved,
call that out explicitly as drift instead of treating it as a normal revision request.
=== PROPOSAL ===
[Paste the FULL proposal from Phase 1]
=== END PROPOSAL ===
Score these 7 dimensions from 1-10:
1. **Problem Fidelity** (weight: 15%): Does the method still attack the original bottleneck, or has it drifted into solving something easier or different?
2. **Method Specificity** (weight: 25%): Are the interfaces, representations, losses, training stages, and inference path concrete enough that an engineer could start implementing?
3. **Contribution Quality** (weight: 25%): Is there one dominant mechanism-level contribution with real novelty, good parsimony, and no obvious contribution sprawl?
4. **Frontier Leverage** (weight: 15%): Does the proposal use current foundation-model-era primitives appropriately when they are the right tool, instead of defaulting to old-school module stacking?
5. **Feasibility** (weight: 10%): Can this method be trained and integrated with the stated resources and data assumptions?
6. **Validation Focus** (weight: 5%): Are the proposed experiments minimal but sufficient to validate the core claims? Is there unnecessary experimental bloat?
7. **Venue Readiness** (weight: 5%): If executed well, would the contribution feel sharp and timely enough for a top venue?
**OVERALL SCORE** (1-10): Weighted as specified above.
For each dimension scoring < 7, provide:
- The specific weakness
- A concrete fix at the method level (interface / loss / training recipe / integration point / deletion of unnecessary parts)
- Priority: CRITICAL / IMPORTANT / MINOR
Then add:
- **Simplification Opportunities**: 1-3 concrete ways to delete, merge, or reuse components while preserving the main claim. Write "NONE" if already tight.
- **Modernization Opportunities**: 1-3 concrete ways to replace old-school pieces with more natural foundation-model-era primitives if genuinely better. Write "NONE" if already modern enough.
- **Drift Warning**: "NONE" if the proposal still solves the anchored problem; otherwise explain the drift clearly.
- **Verdict**: READY / REVISE / RETHINK
Verdict rule:
- READY: overall score >= 9, no meaningful drift, one focused dominant contribution, and no obvious complexity bloat remains
- REVISE: the direction is promising but not yet at READY bar
- RETHINK: the core mechanism or framing is still fundamentally off
Codex MCP failure handling: If mcp__codex__codex is unavailable:
CRITICAL: Save the threadId from this call for all later rounds.
CRITICAL: Save the FULL raw response verbatim.
Save review to refine-logs/round-1-review.md with the raw response in a <details> block.
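The weighted overall can be recomputed locally to sanity-check the reviewer's arithmetic before recording it. A minimal sketch using the weights stated in the review prompt:

```python
# Weights as specified in the 7-dimension rubric (sum to 1.0).
WEIGHTS = {
    "Problem Fidelity": 0.15,
    "Method Specificity": 0.25,
    "Contribution Quality": 0.25,
    "Frontier Leverage": 0.15,
    "Feasibility": 0.10,
    "Validation Focus": 0.05,
    "Venue Readiness": 0.05,
}

def overall_score(scores: dict) -> float:
    """Weighted average of the 7 dimension scores (each 1-10)."""
    assert set(scores) == set(WEIGHTS), "all 7 dimensions must be scored"
    return round(sum(WEIGHTS[d] * scores[d] for d in WEIGHTS), 2)
```

If the recomputed value disagrees materially with the reviewer's stated overall, record the recomputed one and note the discrepancy.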
Extract:
Update refine-logs/score-history.md:
# Score Evolution
| Round | Problem Fidelity | Method Specificity | Contribution Quality | Frontier Leverage | Feasibility | Validation Focus | Venue Readiness | Overall | Verdict |
|-------|------------------|--------------------|----------------------|-------------------|-------------|------------------|-----------------|---------|---------|
| 1 | X | X | X | X | X | X | X | X | REVISE |
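Building a score-history row from the raw review can be sketched as below. The patterns assume the reviewer writes scores as `Dimension: N` and a final `OVERALL SCORE: X` / `Verdict: ...`; adjust them to the actual response format:

```python
import re

DIMENSIONS = [
    "Problem Fidelity", "Method Specificity", "Contribution Quality",
    "Frontier Leverage", "Feasibility", "Validation Focus", "Venue Readiness",
]

def score_row(review_text: str, round_num: int) -> str:
    """Build one markdown table row for score-history.md from a raw review."""
    cells = [str(round_num)]
    for dim in DIMENSIONS:
        m = re.search(rf"{re.escape(dim)}[^\d]*(\d+)", review_text)
        cells.append(m.group(1) if m else "?")
    m = re.search(r"OVERALL SCORE[^\d]*([\d.]+)", review_text)
    cells.append(m.group(1) if m else "?")
    m = re.search(r"Verdict[^A-Z]*(READY|REVISE|RETHINK)", review_text)
    cells.append(m.group(1) if m else "?")
    return "| " + " | ".join(cells) + " |"
```

Missing values are written as `?` rather than guessed, so gaps in the reviewer's output stay visible in the table.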
STOP CONDITION: If overall score >= SCORE_THRESHOLD, verdict is READY, and there is no unresolved drift warning, skip to Phase 5.
Automatic convergence: when MAX_ROUNDS is reached but the score is still below SCORE_THRESHOLD, stop iterating and carry the best-scoring version forward to Phase 5 rather than looping further.
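The review/revise control flow with this convergence rule can be sketched as follows; `review_fn` and `revise_fn` are placeholders standing in for the Codex review call and the Phase 3 revision step:

```python
def refine_loop(initial_proposal, review_fn, revise_fn,
                max_rounds=5, threshold=9.0):
    """Phase 3-4 loop: review, revise, repeat until READY or budget spent.

    review_fn(proposal) -> (overall_score, verdict, feedback)
    revise_fn(proposal, feedback) -> revised proposal
    """
    proposal = initial_proposal
    best = (float("-inf"), proposal)
    for round_num in range(1, max_rounds + 1):
        score, verdict, feedback = review_fn(proposal)
        # Track the best-scoring reviewed version for automatic convergence.
        best = max(best, (score, proposal), key=lambda t: t[0])
        if score >= threshold and verdict == "READY":
            return proposal, round_num, verdict
        proposal = revise_fn(proposal, feedback)
    # MAX_ROUNDS exhausted: converge on the best-scoring version seen.
    return best[1], max_rounds, "REVISE"
```

Note that the final revision is never re-reviewed once the budget is spent; only reviewed versions are eligible as the converged output.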
Instead of trying to fix everything the reviewer mentioned, identify exactly the top 2 largest remaining problems. Not 3, not 5 — exactly 2. This forces prioritization and prevents the proposal from oscillating between too many changes per round.
For each of the two issues, answer from first principles:
For each skeleton step (from Phase 0.5), verify:
If a skeleton step has no corresponding section, that is a LOGICAL GAP — a point where a reviewer will attack. For each gap:
If the skeleton itself needs revision based on what was learned during review, update refine-logs/skeleton.md and note the change.
Before changing anything:
Then process reviewer feedback with the Top-2 diagnosis as the primary guide:
Bias the revisions toward:
Do not add multiple parallel contributions just to chase score. If the reviewer requests another module, first ask whether the same gain can come from a better interface, distillation signal, reward model, or inference policy on top of an existing backbone.
Save to refine-logs/round-N-refinement.md:
# Round N Refinement
## Problem Anchor
[Copy verbatim from round 0]
## Anchor Check
- Original bottleneck:
- Why the revised method still addresses it:
- Reviewer suggestions rejected as drift:
## Simplicity Check
- Dominant contribution after revision:
- Components removed or merged:
- Reviewer suggestions rejected as unnecessary complexity:
- Why the remaining mechanism is still the smallest adequate route:
## Top-2 Issues Diagnosed
### Issue 1: [Title]
- Reader experience: [What confusion or skepticism arises]
- Hostile reviewer critique: [Specific attack]
- Structural impact: [Which skeleton step is broken]
### Issue 2: [Title]
- Reader experience: [What confusion or skepticism arises]
- Hostile reviewer critique: [Specific attack]
- Structural impact: [Which skeleton step is broken]
## Skeleton Gap Check
| Skeleton Step | Covered by Section? | Assessment | Action Needed? |
|---------------|---------------------|------------|----------------|
| Step 1 | [section name] | [adequate / superficial / missing] | [yes/no + what] |
| Step 2 | ... | ... | ... |
## Changes Made
### 1. [Method section changed]
- Reviewer said:
- Action:
- Reasoning:
- Impact on core method:
### 2. [Method section changed]
- Reviewer said:
- Action:
- Reasoning:
- Impact on core method:
## Revised Proposal
[Full updated proposal from Problem Anchor through Evaluation Sketch — complete, not incremental]
Send the revised proposal back to GPT-5.4 in the same thread.
Context management: Before sending the re-evaluation prompt, summarize previous rounds to control context size:
mcp__codex__codex-reply:
threadId: [saved from Phase 2]
model: REVIEWER_MODEL
config: {"model_reasoning_effort": "xhigh"}
prompt: |
[Round N re-evaluation]
I revised the proposal based on your feedback.
First, check whether the original Problem Anchor is still preserved.
Second, judge whether the method is now more concrete, more focused, and more current.
Key changes:
1. [Method change 1]
2. [Method change 2]
3. [Simplification / modernization / pushback if any]
=== REVISED PROPOSAL ===
[Paste the FULL revised proposal]
=== END REVISED PROPOSAL ===
Please:
- Re-score the same 7 dimensions and overall
- State whether the Problem Anchor is preserved or drifted
- State whether the dominant contribution is now sharper or still too broad
- State whether the method is simpler or still overbuilt
- State whether the frontier leverage is now appropriate or still old-school / forced
- Focus new critiques on missing mechanism, weak training signal, weak integration point, pseudo-novelty, or unnecessary complexity
- Use the same verdict rule: READY only if overall score >= 9 and no blocking issue remains
Same output format: 7 scores, overall score, verdict, drift warning, simplification opportunities, modernization opportunities, remaining action items.
Save review to refine-logs/round-N-review.md.
Then return to Phase 3 until the overall score reaches SCORE_THRESHOLD with a READY verdict, or MAX_ROUNDS is reached.
refine-logs/REVIEW_SUMMARY.md — This file is the high-level round-by-round review record. It should answer, for each round: what it was trying to solve, what changed, what got resolved, and what remained.
# Review Summary
**Problem**: [user's problem]
**Initial Approach**: [user's vague approach]
**Date**: [today]
**Rounds**: N / MAX_ROUNDS
**Final Score**: X / 10
**Final Verdict**: [READY / REVISE / RETHINK]
## Problem Anchor
[Verbatim anchor used across all rounds]
## Skeleton
[Final skeleton from Phase 0.5, noting any revisions made during refinement]
## Round-by-Round Resolution Log
| Round | Main Reviewer Concerns | What This Round Simplified / Modernized | Top-2 Issues Targeted | Solved? | Remaining Risk |
|-------|-------------------------|------------------------------------------|-----------------------|---------|----------------|
| 1 | [top issues from review] | [main method changes] | [the 2 issues] | [yes / partial / no] | [if any] |
| 2 | ... | ... | ... | ... | ... |
## Overall Evolution
- [How the method became more concrete]
- [How the dominant contribution became more focused]
- [How unnecessary complexity was removed]
- [How modern technical leverage improved or stayed intentionally minimal]
- [How drift was avoided or corrected]
- [How skeleton gaps were filled across rounds]
## Final Status
- Anchor status: [preserved / corrected / unresolved]
- Focus status: [tight / slightly broad / still diffuse]
- Modernity status: [appropriately frontier-aware / intentionally conservative / still old-school]
- Skeleton completeness: [all steps covered / gaps remain at steps X, Y]
- Strongest parts of final method:
- Remaining weaknesses:
refine-logs/FINAL_PROPOSAL.md — This file is the clean final version. It should contain only the final proposal itself, without review chatter, round history, or raw reviewer output.
# Research Proposal: [Title]
[Paste the final refined proposal only]
If the final verdict is not READY, still write the best current final version here.
refine-logs/REFINEMENT_REPORT.md
# Refinement Report
**Problem**: [user's problem]
**Initial Approach**: [user's vague approach]
**Date**: [today]
**Rounds**: N / MAX_ROUNDS
**Final Score**: X / 10
**Final Verdict**: [READY / REVISE / RETHINK]
## Problem Anchor
[Verbatim anchor used across all rounds]
## Skeleton
[Final skeleton]
## Output Files
- Review summary: `refine-logs/REVIEW_SUMMARY.md`
- Final proposal: `refine-logs/FINAL_PROPOSAL.md`
- Skeleton: `refine-logs/skeleton.md`
## Score Evolution
| Round | Problem Fidelity | Method Specificity | Contribution Quality | Frontier Leverage | Feasibility | Validation Focus | Venue Readiness | Overall | Verdict |
|-------|------------------|--------------------|----------------------|-------------------|-------------|------------------|-----------------|---------|---------|
| 1 | ... | ... | ... | ... | ... | ... | ... | ... | ... |
## Round-by-Round Review Record
| Round | Main Reviewer Concerns | Top-2 Issues Targeted | What Was Changed | Result |
|-------|-------------------------|-----------------------|------------------|--------|
| 1 | [top issues] | [the 2 issues] | [main fixes] | [resolved / partial / unresolved] |
| 2 | ... | ... | ... | ... |
## Final Proposal Snapshot
- Canonical clean version lives in `refine-logs/FINAL_PROPOSAL.md`
- Summarize the final thesis in 3-5 bullets here
## Method Evolution Highlights
1. [Most important simplification or focusing move]
2. [Most important mechanism upgrade]
3. [Most important modernization or justification for staying simple]
## Pushback / Drift Log
| Round | Reviewer Said | Author Response | Outcome |
|-------|---------------|-----------------|---------|
| 1 | [criticism] | [pushback + anchor / evidence] | [accepted / rejected] |
## Remaining Weaknesses
[Honest unresolved issues]
## Raw Reviewer Responses
<details>
<summary>Round 1 Review</summary>
[Full verbatim response from GPT-5.4]
</details>
...
## Next Steps
- If READY: the proposal is venue-ready; proceed to experiment planning if needed
- If REVISE: the best available version has been saved as FINAL_PROPOSAL.md. Remaining weaknesses are documented above. A follow-up `/idea-refine` run can target them.
- If RETHINK: the core mechanism may need fundamental revision. Consider running `/idea-screen` to re-evaluate the approach before refining again.
score-history.md — Ensure it contains the complete score evolution table using all 7 dimensions.
Append the following summary to outputs/PIPELINE_LOG.md (create the file if it does not exist). This log enables autonomous pipeline orchestration without requiring user interaction.
## idea-refine completed
- **Rounds**: N / MAX_ROUNDS
- **Final score**: X / 10
- **Final verdict**: READY / REVISE / RETHINK
- **Anchor status**: preserved / drift corrected / unresolved concern
- **Focus status**: tight / slightly broad / still diffuse
- **Modernity status**: appropriately frontier-aware / intentionally conservative / still old-school
- **Skeleton status**: all steps covered / gaps remain at steps X, Y
- **Key method upgrades**: [method change 1], [method change 2]
- **Remaining concerns**: [if any]
- **Output files**:
- Review summary: refine-logs/REVIEW_SUMMARY.md
- Full report: refine-logs/REFINEMENT_REPORT.md
- Final proposal: refine-logs/FINAL_PROPOSAL.md
Then also print a brief summary to stdout:
Refinement complete after N rounds.
Final score: X/10 (Verdict: READY / REVISE / RETHINK)
Key method upgrades:
- [method change 1]
- [method change 2]
Remaining concerns:
- [if any]
Review summary: refine-logs/REVIEW_SUMMARY.md
Full report: refine-logs/REFINEMENT_REPORT.md
Final proposal: refine-logs/FINAL_PROPOSAL.md
Pipeline log: outputs/PIPELINE_LOG.md
- The pipeline is: freeze anchor, extract skeleton, build proposal, iterate review/revise up to MAX_ROUNDS, output the best version.
- For large files, write in chunks (e.g. cat << 'EOF' > file). Do NOT ask the user for permission — just do it silently.
- Use config: {"model_reasoning_effort": "xhigh"} for all Codex review calls.
- Save the threadId from Phase 2 and use mcp__codex__codex-reply for later rounds.
- Append the run summary to outputs/PIPELINE_LOG.md.

This skill sits between idea screening and execution:
/idea-screen -> /idea-refine <- you are here
/idea-pipeline orchestrates the full flow
Typical flow:
- /idea-screen evaluates and ranks candidate ideas
- /idea-refine turns the top idea into an anchored, elegant, frontier-aware method proposal

This skill also works standalone if you already have an idea and just need the method to become concrete and venue-ready.