End-to-end pipeline for writing ML/AI research papers — from experiment design through analysis, drafting, revision, and submission. Covers NeurIPS, ICML, ICLR, ACL, AAAI, COLM. Integrates automated experiment monitoring, statistical analysis, iterative writing, and citation verification.
End-to-end pipeline for producing publication-ready ML/AI research papers targeting NeurIPS, ICML, ICLR, ACL, AAAI, and COLM. This skill covers the full research lifecycle: experiment design, execution, monitoring, analysis, paper writing, review, revision, and submission.
This is not a linear pipeline — it is an iterative loop. Results trigger new experiments. Reviews trigger new analysis. The agent must handle these feedback loops.
┌─────────────────────────────────────────────────────────────┐
│ RESEARCH PAPER PIPELINE │
│ │
│ Phase 0: Project Setup ──► Phase 1: Literature Review │
│ │ │ │
│ ▼ ▼ │
│ Phase 2: Experiment Phase 5: Paper Drafting ◄──┐ │
│ Design │ │ │
│ │ ▼ │ │
│ ▼ Phase 6: Self-Review │ │
│ Phase 3: Execution & & Revision ──────────┘ │
│ Monitoring │ │
│ │ ▼ │
│ ▼ Phase 7: Submission │
│ Phase 4: Analysis ─────► (feeds back to Phase 2 or 5) │
│ │
└─────────────────────────────────────────────────────────────┘
Use this skill when:
Default: Be proactive. Draft first, ask with the draft.
| Confidence Level | Action |
|---|---|
| High (clear repo, obvious contribution) | Write full draft, deliver, iterate on feedback |
| Medium (some ambiguity) | Write draft with flagged uncertainties, continue |
| Low (major unknowns) | Ask 1-2 targeted questions via clarify, then draft |
| Section | Draft Autonomously? | Flag With Draft |
|---|---|---|
| Abstract | Yes | "Framed contribution as X — adjust if needed" |
| Introduction | Yes | "Emphasized problem Y — correct if wrong" |
| Methods | Yes | "Included details A, B, C — add missing pieces" |
| Experiments | Yes | "Highlighted results 1, 2, 3 — reorder if needed" |
| Related Work | Yes | "Cited papers X, Y, Z — add any I missed" |
Block for input only when: target venue unclear, multiple contradictory framings, results seem incomplete, explicit request to review first.
Goal: Establish the workspace, understand existing work, identify the contribution.
# Understand project structure
ls -la
find . -name "*.py" | head -30
find . -name "*.md" -o -name "*.txt" | xargs grep -l -i "result\|conclusion\|finding"
Look for:
- README.md — project overview and claims
- results/, outputs/, experiments/ — existing findings
- configs/ — experimental settings
- .bib files — existing citations

Establish a consistent workspace structure:
workspace/
paper/ # LaTeX source, figures, compiled PDFs
experiments/ # Experiment runner scripts
code/ # Core method implementation
results/ # Raw experiment results (auto-generated)
tasks/ # Task/benchmark definitions
human_eval/ # Human evaluation materials (if needed)
git init # if not already
git remote add origin <repo-url>
git checkout -b paper-draft # or main
Git discipline: Every completed experiment batch gets committed with a descriptive message. Example:
Add Monte Carlo constrained results (5 runs, Sonnet 4.6, policy memo task)
Add Haiku baseline comparison: autoreason vs refinement baselines at cheap model tier
Before writing anything, articulate:
Propose to the scientist: "Based on my understanding, the main contribution is: [one sentence]. The key results show [Y]. Is this the framing you want?"
Use the todo tool to create a structured project plan:
Research Paper TODO:
- [ ] Define one-sentence contribution
- [ ] Literature review (related work + baselines)
- [ ] Design core experiments
- [ ] Run experiments
- [ ] Analyze results
- [ ] Write first draft
- [ ] Self-review (simulate reviewers)
- [ ] Revise based on review
- [ ] Submission prep
Update this throughout the project. It serves as the persistent state across sessions.
Goal: Find related work, identify baselines, gather citations.
Start from papers already referenced in the codebase:
# Via terminal:
grep -r "arxiv\|doi\|cite" --include="*.md" --include="*.bib" --include="*.py"
find . -name "*.bib"
Load the arxiv skill for structured paper discovery: skill_view("arxiv"). It provides arXiv REST API search, Semantic Scholar citation graphs, author profiles, and BibTeX generation.
Use web_search for broad discovery, web_extract for fetching specific papers:
# Via web_search:
web_search("[main technique] + [application domain] site:arxiv.org")
web_search("[baseline method] comparison ICML NeurIPS 2024")
# Via web_extract (for specific papers):
web_extract("https://arxiv.org/abs/2303.17651")
Additional search queries to try:
Search queries:
- "[main technique] + [application domain]"
- "[baseline method] comparison"
- "[problem name] state-of-the-art"
- Author names from existing citations
Recommended: Install Exa MCP for real-time academic search:
claude mcp add exa -- npx -y mcp-remote "https://mcp.exa.ai/mcp"
NEVER generate BibTeX from memory. ALWAYS fetch programmatically.
For each citation, follow the mandatory 5-step process:
Citation Verification (MANDATORY per citation):
1. SEARCH → Query Semantic Scholar or Exa MCP with specific keywords
2. VERIFY → Confirm paper exists in 2+ sources (Semantic Scholar + arXiv/CrossRef)
3. RETRIEVE → Get BibTeX via DOI content negotiation (programmatically, not from memory)
4. VALIDATE → Confirm the claim you're citing actually appears in the paper
5. ADD → Add verified BibTeX to bibliography
If ANY step fails → mark as [CITATION NEEDED], inform scientist
# Fetch BibTeX via DOI
import requests

def doi_to_bibtex(doi: str) -> str:
    response = requests.get(
        f"https://doi.org/{doi}",
        headers={"Accept": "application/x-bibtex"},
    )
    response.raise_for_status()
    return response.text
If you cannot verify a citation:
\cite{PLACEHOLDER_author2024_verify_this} % TODO: Verify this citation exists
Always tell the scientist: "I've marked [X] citations as placeholders that need verification."
See references/citation-workflow.md for complete API documentation and the full CitationManager class.
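For the SEARCH step, a stdlib-only sketch against the Semantic Scholar Graph API can look like the following (the query, field list, and helper names are illustrative; the endpoint is public but rate-limited):

```python
import json
import urllib.parse
import urllib.request

S2_SEARCH = "https://api.semanticscholar.org/graph/v1/paper/search"

def build_search_url(query: str, limit: int = 5) -> str:
    """Construct the Graph API keyword-search URL."""
    params = urllib.parse.urlencode({
        "query": query,
        "limit": limit,
        "fields": "title,year,externalIds,authors",
    })
    return f"{S2_SEARCH}?{params}"

def search_papers(query: str, limit: int = 5) -> list:
    """Return candidate papers; still verify hits in a second source before citing."""
    with urllib.request.urlopen(build_search_url(query, limit), timeout=30) as resp:
        return json.load(resp).get("data", [])
```

Each hit's `externalIds` typically carries the DOI needed for the RETRIEVE step above.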
Group papers by methodology, not paper-by-paper:
Good: "One line of work uses X's assumption [refs] whereas we use Y's assumption because..." Bad: "Smith et al. introduced X. Jones et al. introduced Y. We combine both."
Goal: Design experiments that directly support paper claims. Every experiment must answer a specific question.
Create an explicit mapping:
| Claim | Experiment | Expected Evidence |
|---|---|---|
| "Our method outperforms baselines" | Main comparison (Table 1) | Win rate, statistical significance |
| "Effect is larger for weaker models" | Model scaling study | Monotonic improvement curve |
| "Convergence requires scope constraints" | Constrained vs unconstrained | Convergence rate comparison |
Rule: If an experiment doesn't map to a claim, don't run it.
Strong baselines are what separates accepted papers from rejected ones. Reviewers will ask: "Did they compare against X?"
Standard baseline categories:
Before running anything, specify:
Follow these patterns from successful research pipelines:
Incremental saving — save results after each step for crash recovery:
# Save after each problem/task (this runs inside the experiment loop)
result_path = f"results/{task}/{strategy}/result.json"
if os.path.exists(result_path):
    continue  # Skip already-completed work

# ... run experiment ...

with open(result_path, 'w') as f:
    json.dump(result, f, indent=2)
Artifact preservation — save all intermediate outputs:
results/<experiment>/
<task>/
<strategy>/
final_output.md # Final result
history.json # Full trajectory
pass_01/ # Per-iteration artifacts
version_a.md
version_b.md
critic.md
Separation of concerns — keep generation, evaluation, and visualization separate:
run_experiment.py # Core experiment runner
run_baselines.py # Baseline comparison
run_comparison_judge.py # Blind evaluation
analyze_results.py # Statistical analysis
make_charts.py # Visualization
See references/experiment-patterns.md for complete design patterns, cron monitoring, and error recovery.
Goal: Run experiments reliably, monitor progress, recover from failures.
Use nohup for long-running experiments:
nohup python run_experiment.py --config config.yaml > logs/experiment_01.log 2>&1 &
echo $! # Record the PID
Parallel execution: Run independent experiments simultaneously, but be aware of API rate limits. 4+ concurrent experiments on the same API will slow each down.
For long-running experiments, set up periodic status checks. The cron prompt should follow this template:
Monitor Prompt Template:
1. Check if process is still running: ps aux | grep <pattern>
2. Read last 30 lines of log: tail -30 <logfile>
3. Check for completed results: ls <result_dir>
4. If results exist, read and report: cat <result_file>
5. If all done, commit: git add -A && git commit -m "<descriptive message>" && git push
6. Report in structured format (tables with key metrics)
7. Answer the key analytical question for this experiment
Silent mode: If nothing has changed since the last check, respond with [SILENT] to suppress notification to the user. Only report when there's news.
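The monitor template above can be sketched as a small shell script (PATTERN, LOGFILE, and RESULT_DIR are assumptions; adapt them to your experiment):

```shell
#!/bin/sh
# Periodic status check: is the run alive, and how many results are done?
PATTERN="run_experiment.py"       # process pattern (assumption)
LOGFILE="logs/experiment_01.log"  # log to tail (assumption)
RESULT_DIR="results"              # where result.json files land (assumption)

if pgrep -f "$PATTERN" >/dev/null 2>&1; then
    STATUS="running"
else
    STATUS="stopped"
fi

# Count completed results so re-runs can report incremental progress
DONE=$(find "$RESULT_DIR" -name 'result.json' 2>/dev/null | wc -l | tr -d ' ')
echo "process: $STATUS, completed results: $DONE"

if [ "$STATUS" = "running" ]; then
    tail -5 "$LOGFILE" 2>/dev/null || true
fi
```

Run it from cron; if the printed line matches the previous check, respond with [SILENT].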
Common failure modes and recovery:
| Failure | Detection | Recovery |
|---|---|---|
| API rate limit / credit exhaustion | 402/429 errors in logs | Wait, then re-run (scripts skip completed work) |
| Process crash | PID gone, incomplete results | Re-run from last checkpoint |
| Timeout on hard problems | Process stuck, no log progress | Kill and skip, note in results |
| Wrong model ID | Errors referencing model name | Fix ID and re-run |
Key: Scripts should always check for existing results and skip completed work. This makes re-runs safe and efficient.
After each experiment batch completes:
git add -A
git commit -m "Add <experiment name>: <key finding in 1 line>"
git push
Goal: Extract findings, compute statistics, identify the story.
Write analysis scripts that:
# Standard analysis pattern
import json
from pathlib import Path

import numpy as np

results = {}
for result_file in Path("results/").rglob("result.json"):
    data = json.loads(result_file.read_text())
    strategy = result_file.parent.name
    task = result_file.parent.parent.name
    results.setdefault(strategy, {})[task] = data

# Compute aggregate metrics
for strategy, tasks in results.items():
    scores = [t["score"] for t in tasks.values()]
    print(f"{strategy}: mean={np.mean(scores):.1f}, std={np.std(scores):.1f}")
Always compute:
See references/experiment-patterns.md for complete implementations of McNemar's test, bootstrapped CIs, and Cohen's h.
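As a quick stdlib sketch of two of those computations (the scores and proportions below are illustrative, not real results):

```python
import math
import random
import statistics

def bootstrap_ci(scores, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean."""
    rng = random.Random(seed)
    means = sorted(
        statistics.fmean(rng.choices(scores, k=len(scores)))
        for _ in range(n_boot)
    )
    return means[int(alpha / 2 * n_boot)], means[int((1 - alpha / 2) * n_boot) - 1]

def cohens_h(p1: float, p2: float) -> float:
    """Effect size for the difference between two proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

scores = [72.0, 75.5, 71.0, 78.2, 74.1]  # illustrative per-run scores
lo, hi = bootstrap_ci(scores)
print(f"mean={statistics.fmean(scores):.1f}, 95% CI=[{lo:.1f}, {hi:.1f}]")
print(f"Cohen's h (0.80 vs 0.65): {cohens_h(0.80, 0.65):.2f}")
```

Report the CI alongside every headline number; a mean without spread invites reviewer pushback.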
After analysis, explicitly answer:
Figures: save as vector PDF with plt.savefig('fig.pdf').

Tables: use the booktabs LaTeX package:

\usepackage{booktabs}
\begin{tabular}{lcc}
\toprule
Method & Accuracy $\uparrow$ & Latency $\downarrow$ \\
\midrule
Baseline & 85.2 & 45ms \\
\textbf{Ours} & \textbf{92.1} & 38ms \\
\bottomrule
\end{tabular}
| Situation | Action |
|---|---|
| Core claims supported, results significant | Move to Phase 5 (writing) |
| Results inconclusive, need more data | Back to Phase 2 (design) |
| Unexpected finding suggests new direction | Back to Phase 2 (design) |
| Missing one ablation reviewers will ask for | Run it, then Phase 5 |
| All experiments done but some failed | Note failures, move to Phase 5 |
Any output in this pipeline — paper drafts, experiment scripts, analysis — can be iteratively refined. The autoreason research provides empirical evidence for when each refinement strategy works and when it fails. Use this section to choose the right approach.
| Your Situation | Strategy | Why |
|---|---|---|
| Mid-tier model + constrained task | Autoreason | Sweet spot. Generation-evaluation gap is widest. Baselines actively destroy weak model outputs. |
| Mid-tier model + open task | Autoreason with scope constraints added | Add fixed facts, structure, or deliverable to bound the improvement space. |
| Frontier model + constrained task | Autoreason | Wins 2/3 constrained tasks even at frontier. |
| Frontier model + unconstrained task | Critique-and-revise or single pass | Autoreason comes last. Model self-evaluates well enough. |
| Concrete technical task (system design) | Critique-and-revise | Direct find-and-fix loop is more efficient. |
| Template-filling task (one correct structure) | Single pass or conservative | Minimal decision space. Iteration adds no value. |
| Code with test cases | Autoreason (code variant) | Structured analysis of why it failed before fixing. Recovery rate 62% vs 43%. |
| Very weak model (Llama 8B class) | Single pass | Model too weak for diverse candidates. Invest in generation quality. |
Core insight: Autoreason's value depends on the gap between a model's generation capability and its self-evaluation capability.
Model Tier │ Generation │ Self-Eval │ Gap │ Autoreason Value
──────────────────┼────────────┼───────────┼────────┼─────────────────
Weak (Llama 8B) │ Poor │ Poor │ Small │ None — can't generate diverse candidates
Mid (Haiku 3.5) │ Decent │ Poor │ LARGE │ MAXIMUM — 42/42 perfect Borda
Mid (Gemini Flash)│ Decent │ Moderate │ Large │ High — wins 2/3
Strong (Sonnet 4) │ Good │ Decent │ Medium │ Moderate — wins 3/5
Frontier (S4.6) │ Excellent │ Good │ Small │ Only with constraints
This gap is structural, not temporary. As costs drop, today's frontier becomes tomorrow's mid-tier. The sweet spot moves but never disappears.
Each pass produces three candidates from fresh, isolated agents:
Key parameters:
When refining the paper itself through autoreason:
| Failure | Detection | Fix |
|---|---|---|
| No convergence (A never wins) | A wins <15% over 20+ passes | Add scope constraints to the task |
| Synthesis drift | Word counts grow unboundedly | Constrain structure and deliverable |
| Degradation below single pass | Baselines score higher than iterated output | Switch to single pass; model may be too weak |
| Overfitting (code) | High public-test pass, low private-test pass | Use structured analysis, not just test feedback |
| Broken judges | Parsing failures reduce panel below 3 | Fix parser before continuing |
See references/autoreason-methodology.md for complete prompts, Borda scoring details, model selection guide, scope constraint design patterns, and compute budget reference.
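The Borda aggregation at the heart of the judge panel can be sketched as follows (candidate labels and the example rankings are illustrative; see the methodology reference for the full scoring details):

```python
def borda_winner(rankings, candidates=("A", "B", "AB")):
    """Aggregate blind judge rankings (best-first) with a Borda count."""
    points = {c: 0 for c in candidates}
    n = len(candidates)
    for ranking in rankings:
        for pos, cand in enumerate(ranking):
            points[cand] += n - 1 - pos  # first place earns n-1 points
    return max(points, key=points.get), points

# Three judges each rank {A, B, AB} best-first:
winner, points = borda_winner([
    ("A", "B", "AB"),
    ("A", "AB", "B"),
    ("B", "A", "AB"),
])
```

Here A earns 5 points, B earns 3, and AB earns 1, so A wins the pass and its streak counter increments.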
Goal: Write a complete, publication-ready paper.
The single most critical insight: Your paper is not a collection of experiments — it's a story with one clear contribution supported by evidence.
Every successful ML paper centers on what Neel Nanda calls "the narrative": a short, rigorous, evidence-based technical story with a takeaway readers care about.
Three Pillars (must be crystal clear by end of introduction):
| Pillar | Description | Test |
|---|---|---|
| The What | 1-3 specific novel claims | Can you state them in one sentence? |
| The Why | Rigorous empirical evidence | Do experiments distinguish your hypothesis from alternatives? |
| The So What | Why readers should care | Does this connect to a recognized community problem? |
If you cannot state your contribution in one sentence, you don't yet have a paper.
Spend approximately equal time on each of:
Why? Most reviewers form judgments before reaching your methods. Readers encounter your paper as: title → abstract → introduction → figures → maybe the rest.
Paper Writing Checklist:
- [ ] Step 1: Define the one-sentence contribution
- [ ] Step 2: Draft Figure 1 (core idea or most compelling result)
- [ ] Step 3: Draft abstract (5-sentence formula)
- [ ] Step 4: Draft introduction (1-1.5 pages max)
- [ ] Step 5: Draft methods
- [ ] Step 6: Draft experiments & results
- [ ] Step 7: Draft related work
- [ ] Step 8: Draft conclusion & discussion
- [ ] Step 9: Draft limitations (REQUIRED by all venues)
- [ ] Step 10: Plan appendix (proofs, extra experiments, details)
- [ ] Step 11: Complete paper checklist
- [ ] Step 12: Final review
The title is the single most-read element of the paper. It determines whether anyone clicks through to the abstract.
Good titles:
Bad titles:
Rules:
From Sebastian Farquhar (DeepMind):
1. What you achieved: "We introduce...", "We prove...", "We demonstrate..."
2. Why this is hard and important
3. How you do it (with specialist keywords for discoverability)
4. What evidence you have
5. Your most remarkable number/result
Delete generic openings like "Large language models have achieved remarkable success..."
Figure 1 is the second thing most readers look at (after abstract). Draft it before writing the introduction — it forces you to clarify the core idea.
| Figure 1 Type | When to Use | Example |
|---|---|---|
| Method diagram | New architecture or pipeline | TikZ flowchart showing your system |
| Results teaser | One compelling result tells the whole story | Bar chart: "Ours vs baselines" with clear gap |
| Problem illustration | The problem is unintuitive | Before/after showing failure mode you fix |
| Conceptual diagram | Abstract contribution needs visual grounding | 2x2 matrix of method properties |
Rules: Figure 1 must be understandable without reading any text. The caption alone should communicate the core idea. Use color purposefully — don't just decorate.
Must include:
Enable reimplementation:
For each experiment, explicitly state:
Requirements:
Organize methodologically, not paper-by-paper. Cite generously — reviewers likely authored relevant papers.
All major conferences require this. Honesty helps:
Conclusion (required, 0.5-1 page):
Discussion (optional, sometimes combined with conclusion):
Do NOT introduce new results or claims in the conclusion.
Appendices are unlimited at all major venues and are essential for reproducibility. Structure:
| Appendix Section | What Goes Here |
|---|---|
| Proofs & Derivations | Full proofs too long for main text. Main text can state theorems with "proof in Appendix A." |
| Additional Experiments | Ablations, scaling curves, per-dataset breakdowns, hyperparameter sensitivity |
| Implementation Details | Full hyperparameter tables, training details, hardware specs, random seeds |
| Dataset Documentation | Data collection process, annotation guidelines, licensing, preprocessing |
| Prompts & Templates | Exact prompts used (for LLM-based methods), evaluation templates |
| Human Evaluation | Annotation interface screenshots, instructions given to annotators, IRB details |
| Additional Figures | Per-task breakdowns, trajectory visualizations, failure case examples |
Rules:
- Use the \appendix command, then \section{A: Proofs} etc.

When over the page limit:
| Cut Strategy | Saves | Risk |
|---|---|---|
| Move proofs to appendix | 0.5-2 pages | Low — standard practice |
| Condense related work | 0.5-1 page | Medium — may miss key citations |
| Combine tables with subfigures | 0.25-0.5 page | Low — often improves readability |
| Use \vspace{-Xpt} sparingly | 0.1-0.3 page | Low if subtle, high if obvious |
| Remove qualitative examples | 0.5-1 page | Medium — reviewers like examples |
| Reduce figure sizes | 0.25-0.5 page | High — figures must remain readable |
Do NOT: reduce font size, change margins, remove required sections (limitations, broader impact), or use \small/\footnotesize for main text.
Sentence-level clarity (Gopen & Swan's 7 Principles):
| Principle | Rule |
|---|---|
| Subject-verb proximity | Keep subject and verb close |
| Stress position | Place emphasis at sentence ends |
| Topic position | Put context first, new info after |
| Old before new | Familiar info → unfamiliar info |
| One unit, one function | Each paragraph makes one point |
| Action in verb | Use verbs, not nominalizations |
| Context before new | Set stage before presenting |
Word choice (Lipton, Steinhardt):
Full writing guide with examples: See references/writing-guide.md
Always copy the entire template directory first, then write within it.
Template Setup Checklist:
- [ ] Step 1: Copy entire template directory to new project
- [ ] Step 2: Verify template compiles as-is (before any changes)
- [ ] Step 3: Read the template's example content to understand structure
- [ ] Step 4: Replace example content section by section
- [ ] Step 5: Use template macros (check preamble for \newcommand definitions)
- [ ] Step 6: Clean up template artifacts only at the end
Step 1: Copy the Full Template
cp -r templates/neurips2025/ ~/papers/my-paper/
cd ~/papers/my-paper/
ls -la # Should see: main.tex, neurips.sty, Makefile, etc.
Copy the ENTIRE directory, not just the .tex file. Templates include style files (.sty), bibliography styles (.bst), example content, and Makefiles.
Step 2: Verify Template Compiles First
Before making ANY changes:
latexmk -pdf main.tex
# Or manual: pdflatex main.tex && bibtex main && pdflatex main.tex && pdflatex main.tex
If the unmodified template doesn't compile, fix that first (usually missing TeX packages — install via tlmgr install <package>).
Step 3: Keep Template Content as Reference
Don't immediately delete example content. Comment it out and use as formatting reference:
% Template example (keep for reference):
% \begin{figure}[t]
% \centering
% \includegraphics[width=0.8\linewidth]{example-image}
% \caption{Template shows caption style}
% \end{figure}
% Your actual figure:
\begin{figure}[t]
\centering
\includegraphics[width=0.8\linewidth]{your-figure.pdf}
\caption{Your caption following the same style.}
\end{figure}
Step 4: Replace Content Section by Section
Work through systematically: title/authors → abstract → introduction → methods → experiments → related work → conclusion → references → appendix. Compile after each section.
Step 5: Use Template Macros
\newcommand{\method}{YourMethodName} % Consistent method naming
\newcommand{\eg}{e.g.,\xspace} % Proper abbreviations
\newcommand{\ie}{i.e.,\xspace}
| Pitfall | Problem | Solution |
|---|---|---|
| Copying only .tex file | Missing .sty, won't compile | Copy entire directory |
| Modifying .sty files | Breaks conference formatting | Never edit style files |
| Adding random packages | Conflicts, breaks template | Only add if necessary |
| Deleting template content early | Lose formatting reference | Keep as comments until done |
| Not compiling frequently | Errors accumulate | Compile after each section |
| Raster PNGs for figures | Blurry in paper | Always use vector PDF via savefig('fig.pdf') |
| Conference | Main File | Style File | Page Limit |
|---|---|---|---|
| NeurIPS 2025 | main.tex | neurips.sty | 9 pages |
| ICML 2026 | example_paper.tex | icml2026.sty | 8 pages |
| ICLR 2026 | iclr2026_conference.tex | iclr2026_conference.sty | 9 pages |
| ACL 2025 | acl_latex.tex | acl.sty | 8 pages (long) |
| AAAI 2026 | aaai2026-unified-template.tex | aaai2026.sty | 7 pages |
| COLM 2025 | colm2025_conference.tex | colm2025_conference.sty | 9 pages |
Universal: Double-blind, references don't count, appendices unlimited, LaTeX required.
Templates in templates/ directory. See templates/README.md for compilation setup (VS Code, CLI, Overleaf, other IDEs).
Tables — use booktabs for professional formatting:
\usepackage{booktabs}
\begin{tabular}{lcc}
\toprule
Method & Accuracy $\uparrow$ & Latency $\downarrow$ \\
\midrule
Baseline & 85.2 & 45ms \\
\textbf{Ours} & \textbf{92.1} & 38ms \\
\bottomrule
\end{tabular}
Rules:
Figures: save as vector PDF with plt.savefig('fig.pdf').

For converting between venues, see Phase 7 (Submission Preparation) — it covers the full conversion workflow, page-change table, and post-rejection guidance.
Add these packages to any paper for professional quality. They are compatible with all major conference style files:
% --- Professional Packages (add after conference style file) ---
% Typography
\usepackage{microtype} % Microtypographic improvements (protrusion, expansion)
% Makes text noticeably more polished — always include
% Tables
\usepackage{booktabs} % Professional table rules (\toprule, \midrule, \bottomrule)
\usepackage{siunitx} % Consistent number formatting, decimal alignment
% Usage: \num{12345} → 12,345; \SI{3.5}{GHz} → 3.5 GHz
% Table alignment: S column type for decimal-aligned numbers
% Figures
\usepackage{graphicx} % Include graphics (\includegraphics)
\usepackage{subcaption} % Subfigures with (a), (b), (c) labels
% Usage: \begin{subfigure}{0.48\textwidth} ... \end{subfigure}
% Diagrams and Algorithms
\usepackage{tikz} % Programmable vector diagrams
\usetikzlibrary{arrows.meta, positioning, shapes.geometric, calc, fit, backgrounds}
\usepackage[ruled,vlined]{algorithm2e} % Professional pseudocode
% Alternative: \usepackage{algorithmicx} if template bundles it
% Cross-references
\usepackage{cleveref} % Smart references: \cref{fig:x} → "Figure 1"
% MUST be loaded AFTER hyperref
% Handles: figures, tables, sections, equations, algorithms
% Math (usually included by conference .sty, but verify)
\usepackage{amsmath,amssymb} % AMS math environments and symbols
\usepackage{mathtools} % Extends amsmath (dcases, coloneqq, etc.)
% Colors (for figures and diagrams)
\usepackage{xcolor} % Color management
% Okabe-Ito colorblind-safe palette:
\definecolor{okblue}{HTML}{0072B2}
\definecolor{okorange}{HTML}{E69F00}
\definecolor{okgreen}{HTML}{009E73}
\definecolor{okred}{HTML}{D55E00}
\definecolor{okpurple}{HTML}{CC79A7}
\definecolor{okcyan}{HTML}{56B4E9}
\definecolor{okyellow}{HTML}{F0E442}
Notes:
- microtype is the single highest-impact package for visual quality. It adjusts character spacing at a sub-pixel level. Always include it.
- siunitx handles decimal alignment in tables via the S column type and eliminates manual spacing.
- cleveref must be loaded after hyperref. Most conference .sty files load hyperref, so put cleveref last.
- Check which packages the conference style file already loads (algorithm, amsmath, graphicx) and don't double-load them.

siunitx makes number-heavy tables significantly more readable:
\begin{tabular}{l S[table-format=2.1] S[table-format=2.1] S[table-format=2.1]}
\toprule
Method & {Accuracy $\uparrow$} & {F1 $\uparrow$} & {Latency (ms) $\downarrow$} \\
\midrule
Baseline & 85.2 & 83.7 & 45.3 \\
Ablation (no X) & 87.1 & 85.4 & 42.1 \\
\textbf{Ours} & \textbf{92.1} & \textbf{90.8} & \textbf{38.7} \\
\bottomrule
\end{tabular}
The S column type auto-aligns on the decimal point. Headers in {} escape the alignment.
Standard pattern for side-by-side figures:
\begin{figure}[t]
\centering
\begin{subfigure}[b]{0.48\textwidth}
\centering
\includegraphics[width=\textwidth]{fig_results_a.pdf}
\caption{Results on Dataset A.}
\label{fig:results-a}
\end{subfigure}
\hfill
\begin{subfigure}[b]{0.48\textwidth}
\centering
\includegraphics[width=\textwidth]{fig_results_b.pdf}
\caption{Results on Dataset B.}
\label{fig:results-b}
\end{subfigure}
\caption{Comparison of our method across two datasets. (a) shows the scaling
behavior and (b) shows the ablation results. Both use 5 random seeds.}
\label{fig:results}
\end{figure}
Use \cref{fig:results} → "Figure 1", \cref{fig:results-a} → "Figure 1a".
\begin{algorithm}[t]
\caption{Iterative Refinement with Judge Panel}
\label{alg:method}
\KwIn{Task $T$, model $M$, judges $J_1 \ldots J_n$, convergence threshold $k$}
\KwOut{Final output $A^*$}
$A \gets M(T)$ \tcp*{Initial generation}
$\text{streak} \gets 0$\;
\While{$\text{streak} < k$}{
$C \gets \text{Critic}(A, T)$ \tcp*{Identify weaknesses}
$B \gets M(T, C)$ \tcp*{Revised version addressing critique}
$AB \gets \text{Synthesize}(A, B)$ \tcp*{Merge best elements}
\ForEach{judge $J_i$}{
$\text{rank}_i \gets J_i(\text{shuffle}(A, B, AB))$ \tcp*{Blind ranking}
}
$\text{winner} \gets \text{BordaCount}(\text{ranks})$\;
\eIf{$\text{winner} = A$}{
$\text{streak} \gets \text{streak} + 1$\;
}{
$A \gets \text{winner}$; $\text{streak} \gets 0$\;
}
}
\Return{$A$}\;
\end{algorithm}
TikZ is the standard for method diagrams in ML papers. Common patterns:
Pipeline/Flow Diagram (most common in ML papers):
\begin{figure}[t]
\centering
\begin{tikzpicture}[
node distance=1.8cm,
box/.style={rectangle, draw, rounded corners, minimum height=1cm,
minimum width=2cm, align=center, font=\small},
arrow/.style={-{Stealth[length=3mm]}, thick},
]
\node[box, fill=okcyan!20] (input) {Input\\$x$};
\node[box, fill=okblue!20, right of=input] (encoder) {Encoder\\$f_\theta$};
\node[box, fill=okgreen!20, right of=encoder] (latent) {Latent\\$z$};
\node[box, fill=okorange!20, right of=latent] (decoder) {Decoder\\$g_\phi$};
\node[box, fill=okred!20, right of=decoder] (output) {Output\\$\hat{x}$};
\draw[arrow] (input) -- (encoder);
\draw[arrow] (encoder) -- (latent);
\draw[arrow] (latent) -- (decoder);
\draw[arrow] (decoder) -- (output);
\end{tikzpicture}
\caption{Architecture overview. The encoder maps input $x$ to latent
representation $z$, which the decoder reconstructs.}
\label{fig:architecture}
\end{figure}
Comparison/Matrix Diagram (for showing method variants):
\begin{tikzpicture}[
cell/.style={rectangle, draw, minimum width=2.5cm, minimum height=1cm,
align=center, font=\small},
header/.style={cell, fill=gray!20, font=\small\bfseries},
]
% Headers
\node[header] at (0, 0) {Method};
\node[header] at (3, 0) {Converges?};
\node[header] at (6, 0) {Quality?};
% Rows
\node[cell] at (0, -1) {Single Pass};
\node[cell, fill=okgreen!15] at (3, -1) {N/A};
\node[cell, fill=okorange!15] at (6, -1) {Baseline};
\node[cell] at (0, -2) {Critique+Revise};
\node[cell, fill=okred!15] at (3, -2) {No};
\node[cell, fill=okred!15] at (6, -2) {Degrades};
\node[cell] at (0, -3) {Ours};
\node[cell, fill=okgreen!15] at (3, -3) {Yes ($k$=2)};
\node[cell, fill=okgreen!15] at (6, -3) {Improves};
\end{tikzpicture}
Iterative Loop Diagram (for methods with feedback):
\begin{tikzpicture}[
node distance=2cm,
box/.style={rectangle, draw, rounded corners, minimum height=0.8cm,
minimum width=1.8cm, align=center, font=\small},
arrow/.style={-{Stealth[length=3mm]}, thick},
label/.style={font=\scriptsize, midway, above},
]
\node[box, fill=okblue!20] (gen) {Generator};
\node[box, fill=okred!20, right=2.5cm of gen] (critic) {Critic};
\node[box, fill=okgreen!20, below=1.5cm of $(gen)!0.5!(critic)$] (judge) {Judge Panel};
\draw[arrow] (gen) -- node[label] {output $A$} (critic);
\draw[arrow] (critic) -- node[label, right] {critique $C$} (judge);
\draw[arrow] (judge) -| node[label, left, pos=0.3] {winner} (gen);
\end{tikzpicture}
Essential for rebuttals — generates a marked-up PDF showing changes between versions:
# Install
# macOS: brew install latexdiff (or comes with TeX Live)
# Linux: sudo apt install latexdiff
# Generate diff
latexdiff paper_v1.tex paper_v2.tex > paper_diff.tex
pdflatex paper_diff.tex
# For multi-file projects (with \input{} or \include{})
latexdiff --flatten paper_v1.tex paper_v2.tex > paper_diff.tex
This produces a PDF with deletions in red strikethrough and additions in blue — standard format for rebuttal supplements.
Install and use for publication-quality plots:
pip install SciencePlots
import numpy as np
import matplotlib.pyplot as plt
import scienceplots  # registers styles

# Illustrative data (replace with your results)
x = np.arange(100)
y = 1 - np.exp(-x / 30)
y2 = 1 - np.exp(-x / 60)

# Use science style (IEEE-like, clean)
with plt.style.context(['science', 'no-latex']):
    fig, ax = plt.subplots(figsize=(3.5, 2.5))  # Single-column width
    ax.plot(x, y, label='Ours', color='#0072B2')
    ax.plot(x, y2, label='Baseline', color='#D55E00', linestyle='--')
    ax.set_xlabel('Training Steps')
    ax.set_ylabel('Accuracy')
    ax.legend()
    fig.savefig('paper/fig_results.pdf', bbox_inches='tight')

# Available styles: 'science', 'ieee', 'nature', 'science+ieee'
# Add 'no-latex' if LaTeX is not installed on the machine generating plots
Standard figure sizes (two-column format):
- figsize=(3.5, 2.5) — fits in one column
- figsize=(7.0, 3.0) — spans both columns
- figsize=(3.5, 3.5) — for heatmaps, confusion matrices

Goal: Simulate the review process before submission. Catch weaknesses early.
Generate reviews from multiple perspectives using strong models (Opus 4, Sonnet 4.6, Gemini 2.5 Pro). Use the reviewer guidelines from the target venue.
Review prompt template:
You are an expert reviewer for [VENUE]. Review this paper according to the
official reviewer guidelines. Evaluate:
1. Quality (technical soundness, baselines, claims supported by evidence)
2. Clarity (writing, notation consistency, reproducibility)
3. Significance (impact, importance of the problem)
4. Originality (novelty, new insights)
Provide:
- Summary (2-3 sentences)
- Strengths (bullet list)
- Weaknesses (bullet list, most critical first)
- Questions for authors
- Missing references
- Score (1-6 on NeurIPS scale)
- Confidence (1-5)
After collecting reviews, categorize:
| Priority | Action |
|---|---|
| Critical (technical flaw, missing baseline) | Must fix. May require new experiments → back to Phase 2 |
| High (clarity issue, missing ablation) | Should fix in this revision |
| Medium (minor writing issues, extra experiments) | Fix if time allows |
| Low (style preferences, tangential suggestions) | Note for future work |
Fix each critical/high issue before moving on to the next revision pass.
When responding to actual reviews (post-submission), rebuttals are a distinct skill from revision:
Format: Point-by-point. For each reviewer concern:
> R1-W1: "The paper lacks comparison with Method X."
We thank the reviewer for this suggestion. We have added a comparison with
Method X in Table 3 (revised). Our method outperforms X by 3.2pp on [metric]
(p<0.05). We note that X requires 2x our compute budget.
Rules:
- Use `latexdiff` to generate a marked-up PDF showing changes (see Professional LaTeX Tooling section)

What NOT to do: "We respectfully disagree" without evidence. "This is out of scope" without explanation. Ignoring a weakness by only responding to strengths.
Save snapshots at key milestones:
paper/
paper.tex # Current working version
paper_v1_first_draft.tex # First complete draft
paper_v2_post_review.tex # After simulated review
paper_v3_pre_submission.tex # Final before submission
paper_v4_camera_ready.tex # Post-acceptance final
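The file copies above work, but git tags mark the same milestones without duplicate files. A minimal sketch — the tag names are illustrative, and the scratch repo exists only so the demo runs anywhere; in your real paper repo only the tag commands apply:

```shell
# Throwaway demo repo; in your real paper repo only the tag commands apply
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email demo@example.com
git config user.name "Demo"
printf '%s\n' '\documentclass{article}' > paper.tex
git add -A
git commit -qm 'first complete draft'

# Tag the milestone; annotated tags record tagger and date
git tag -a v1-first-draft -m 'First complete draft'
git tag   # lists milestone tags
```

`git push --tags` publishes the tags, and `latexdiff` can then be run between any two checked-out tagged versions.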
Goal: Final checks, formatting, and submission.
Every venue has mandatory checklists. Complete them carefully — incomplete checklists can result in desk rejection.
See references/checklists.md for the NeurIPS, ICML, ICLR, and ACL requirements and the universal pre-submission checklist.
Double-blind review means reviewers cannot know who wrote the paper. Check ALL of these:
Anonymization Checklist:
- [ ] No author names or affiliations anywhere in the PDF
- [ ] No acknowledgments section (add after acceptance)
- [ ] Self-citations written in third person: "Smith et al. [1] showed..." not "We previously showed [1]..."
- [ ] No GitHub/GitLab URLs pointing to your personal repos
- [ ] Use Anonymous GitHub (https://anonymous.4open.science/) for code links
- [ ] No institutional logos or identifiers in figures
- [ ] No file metadata containing author names (check PDF properties)
- [ ] No "our previous work" or "in our earlier paper" phrasing
- [ ] Dataset names don't reveal institution (rename if needed)
- [ ] Supplementary materials don't contain identifying information
Common mistakes: Git commit messages visible in supplementary code, watermarked figures from institutional tools, acknowledgments left in from a previous draft, arXiv preprint posted during the anonymity period.
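Several of these checks can be automated with a mechanical grep over the sources. A sketch — the phrases and the scratch directory are illustrative; extend the pattern with your own name and institution, and point `SRC` at your real paper directory:

```shell
# Scratch sources so the demo runs anywhere; use your real paper directory in practice
SRC=$(mktemp -d)
printf '%s\n' 'In our previous work [3], we showed...' > "$SRC/intro.tex"

# Phrases and URLs that commonly break double-blind review
grep -rniE 'acknowledg|our (previous|earlier) (work|paper)|github\.com' \
  --include='*.tex' "$SRC"

# PDF metadata check (requires poppler's pdfinfo; skipped if absent)
{ command -v pdfinfo >/dev/null && pdfinfo main.pdf | grep -Ei 'author|creator'; } || true
```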
Pre-Submission Format Check:
- [ ] Page limit respected (excluding references and appendix)
- [ ] All figures are vector (PDF) or high-res raster (600 DPI PNG)
- [ ] All figures readable in grayscale
- [ ] All tables use booktabs
- [ ] References compile correctly (no "?" in citations)
- [ ] No overfull hboxes in critical areas
- [ ] Appendix clearly labeled and separated
- [ ] Required sections present (limitations, broader impact, etc.)
# Clean build
rm -f *.aux *.bbl *.blg *.log *.out *.pdf
latexmk -pdf main.tex
# Or manual
pdflatex main.tex
bibtex main
pdflatex main.tex
pdflatex main.tex
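The "no ? in citations" check can also be done mechanically by scanning the build log for undefined-citation warnings. A sketch — the demo fabricates log lines so it is safe to run anywhere; in practice point `log` at your real main.log after compiling:

```shell
# Demo log; replace with your actual main.log after compiling
log=$(mktemp)
printf '%s\n' 'LaTeX Warning: Citation `smith2024` on page 3 undefined on input line 41.' >> "$log"
printf '%s\n' 'LaTeX Warning: Reference `fig:results` on page 5 undefined on input line 90.' >> "$log"

# Count and list unresolved citations/references
undefined=$(grep -cE '(Citation|Reference) .* undefined' "$log")
if [ "$undefined" -gt 0 ]; then
  echo "FIX BEFORE SUBMITTING: $undefined unresolved citations/references"
  grep -E '(Citation|Reference) .* undefined' "$log"
fi
```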
| Venue | Special Requirements |
|---|---|
| NeurIPS | Paper checklist in appendix, lay summary if accepted |
| ICML | Broader Impact Statement (after conclusion, doesn't count toward limit) |
| ICLR | LLM disclosure required, reciprocal reviewing agreement |
| ACL | Mandatory Limitations section, Responsible NLP checklist |
| AAAI | Strict style file — no modifications whatsoever |
| COLM | Frame contribution for language model community |
When converting between venues, never copy LaTeX preambles between templates:
# 1. Start fresh with target template
cp -r templates/icml2026/ new_submission/
# 2. Copy ONLY content sections (not preamble)
# - Abstract text, section content, figures, tables, bib entries
# 3. Adjust for page limits
# 4. Add venue-specific required sections
# 5. Update references
| From → To | Page Change | Key Adjustments |
|---|---|---|
| NeurIPS → ICML | 9 → 8 | Cut 1 page, add Broader Impact |
| ICML → ICLR | 8 → 9 | Expand experiments, add LLM disclosure |
| NeurIPS → ACL | 9 → 8 | Restructure for NLP conventions, add Limitations |
| ICLR → AAAI | 9 → 7 | Significant cuts, strict style adherence |
| Any → COLM | varies → 9 | Reframe for language model focus |
When cutting pages: move proofs to appendix, condense related work, combine tables, use subfigures. When expanding: add ablations, expand limitations, include additional baselines, add qualitative examples.
After rejection: Address reviewer concerns in the new version, but don't include a "changes" section or reference the previous submission (blind review).
After acceptance, prepare the camera-ready version:
Camera-Ready Checklist:
- [ ] De-anonymize: add author names, affiliations, email addresses
- [ ] Add Acknowledgments section (funding, compute grants, helpful reviewers)
- [ ] Add public code/data URL (real GitHub, not anonymous)
- [ ] Address any mandatory revisions from meta-reviewer
- [ ] Switch template to camera-ready mode (if applicable — e.g., AAAI \anon → \camera)
- [ ] Add copyright notice if required by venue
- [ ] Update any "anonymous" placeholders in text
- [ ] Verify final PDF compiles cleanly
- [ ] Check page limit for camera-ready (sometimes differs from submission)
- [ ] Upload supplementary materials (code, data, appendix) to venue portal
This skill is designed for the Hermes agent. It uses Hermes tools, delegation, scheduling, and memory for the full research lifecycle.
Compose this skill with other Hermes skills for specific phases:
| Skill | When to Use | How to Load |
|---|---|---|
| arxiv | Phase 1 (Literature Review): searching arXiv, generating BibTeX, finding related papers via Semantic Scholar | skill_view("arxiv") |
| subagent-driven-development | Phase 5 (Drafting): parallel section writing with 2-stage review (spec compliance then quality) | skill_view("subagent-driven-development") |
| plan | Phase 0 (Setup): creating structured plans before execution. Writes to .hermes/plans/ | skill_view("plan") |
| qmd | Phase 1 (Literature): searching local knowledge bases (notes, transcripts, docs) via hybrid BM25+vector search | Install: skill_manage("install", "qmd") |
| diagramming | Phase 4-5: creating Excalidraw-based figures and architecture diagrams | skill_view("diagramming") |
| data-science | Phase 4 (Analysis): Jupyter live kernel for interactive analysis and visualization | skill_view("data-science") |
This skill supersedes ml-paper-writing — it contains all of ml-paper-writing's content plus the full experiment/analysis pipeline and autoreason methodology.
| Tool | Usage in This Pipeline |
|---|---|
| `terminal` | LaTeX compilation (`latexmk -pdf`), git operations, launching experiments (`nohup python run.py &`), process checks |
| `process` | Background experiment management: `process("start", ...)`, `process("poll", pid)`, `process("log", pid)`, `process("kill", pid)` |
| `execute_code` | Run Python for citation verification, statistical analysis, data aggregation. Has tool access via RPC. |
| `read_file` / `write_file` / `patch` | Paper editing, experiment scripts, result files. Use `patch` for targeted edits to large .tex files. |
| `web_search` | Literature discovery: `web_search("transformer attention mechanism 2024")` |
| `web_extract` | Fetch paper content, verify citations: `web_extract("https://arxiv.org/abs/2303.17651")` |
| `delegate_task` | Parallel section drafting — spawn isolated subagents for each section. Also for concurrent citation verification. |
| `todo` | Primary state tracker across sessions. Update after every phase transition. |
| `memory` | Persist key decisions across sessions: contribution framing, venue choice, reviewer feedback. |
| `cronjob` | Schedule experiment monitoring, deadline countdowns, automated arXiv checks. |
| `clarify` | Ask the user targeted questions when blocked (venue choice, contribution framing). |
| `send_message` | Notify user when experiments complete or drafts are ready, even if user isn't in chat. |
Experiment monitoring (most common):
terminal("ps aux | grep <pattern>")
→ terminal("tail -30 <logfile>")
→ terminal("ls results/")
→ execute_code("analyze results JSON, compute metrics")
→ terminal("git add -A && git commit -m '<descriptive message>' && git push")
→ send_message("Experiment complete: <summary>")
Parallel section drafting (using delegation):
delegate_task("Draft the Methods section based on these experiment scripts and configs.
Include: pseudocode, all hyperparameters, architectural details sufficient for
reproduction. Write in LaTeX using the neurips2025 template conventions.")
delegate_task("Draft the Related Work section. Use web_search and web_extract to
find papers. Verify every citation via Semantic Scholar. Group by methodology.")
delegate_task("Draft the Experiments section. Read all result files in results/.
State which claim each experiment supports. Include error bars and significance.")
Each delegate runs as a fresh subagent with no shared context — provide all necessary information in the prompt. Collect outputs and integrate.
Citation verification (using execute_code):
# In execute_code:
from semanticscholar import SemanticScholar
import requests

sch = SemanticScholar()
results = sch.search_paper("attention mechanism transformers", limit=5)
for paper in results:
    ids = paper.externalIds or {}  # externalIds can be None
    doi = ids.get('DOI')
    if doi:
        bibtex = requests.get(f"https://doi.org/{doi}",
                              headers={"Accept": "application/x-bibtex"}).text
        print(bibtex)
`memory` and `todo`

`memory` tool — persist key decisions (bounded: ~2200 chars for MEMORY.md):
memory("add", "Paper: autoreason. Venue: NeurIPS 2025 (9 pages).
Contribution: structured refinement works when generation-evaluation gap is wide.
Key results: Haiku 42/42, Sonnet 3/5, S4.6 constrained 2/3.
Status: Phase 5 — drafting Methods section.")
Update memory after major decisions or phase transitions. This persists across sessions.
todo tool — track granular progress:
todo("add", "Design constrained task experiments for Sonnet 4.6")
todo("add", "Run Haiku baseline comparison")
todo("add", "Draft Methods section")
todo("update", id=3, status="in_progress")
todo("update", id=1, status="completed")
Session startup protocol:
1. todo("list") # Check current task list
2. memory("read") # Recall key decisions
3. terminal("git log --oneline -10") # Check recent commits
4. terminal("ps aux | grep python") # Check running experiments
5. terminal("ls results/ | tail -20") # Check for new results
6. Report status to user, ask for direction
`cronjob`

Use the `cronjob` tool to schedule periodic experiment checks:
cronjob("create", {
"schedule": "*/30 * * * *", # Every 30 minutes
"prompt": "Check experiment status:
1. ps aux | grep run_experiment
2. tail -30 logs/experiment_haiku.log
3. ls results/haiku_baselines/
4. If complete: read results, compute Borda scores,
git add -A && git commit -m 'Add Haiku results' && git push
5. Report: table of results, key finding, next step
6. If nothing changed: respond with [SILENT]"
})
[SILENT] protocol: When nothing has changed since the last check, respond with exactly [SILENT]. This suppresses notification delivery to the user. Only report when there are genuine changes worth knowing about.
Deadline tracking:
cronjob("create", {
"schedule": "0 9 * * *", # Daily at 9am
"prompt": "NeurIPS 2025 deadline: May 22. Today is {date}.
Days remaining: {compute}.
Check todo list — are we on track?
If <7 days: warn user about remaining tasks."
})
When to notify the user (via send_message or direct response):
When NOT to notify: routine status checks where nothing has changed; respond with `[SILENT]` instead.

Report format — always include structured data:
## Experiment: <name>
Status: Complete / Running / Failed
| Task | Method A | Method B | Method C |
|------|---------|---------|---------|
| Task 1 | 85.2 | 82.1 | **89.4** |
Key finding: <one sentence>
Next step: <what happens next>
Use clarify for targeted questions when genuinely blocked:
| Decision | When to Ask |
|---|---|
| Target venue | Before starting paper (affects page limits, framing) |
| Contribution framing | When multiple valid framings exist |
| Experiment priority | When TODO list has more experiments than time allows |
| Submission readiness | Before final submission |
For anything else, do NOT ask: be proactive, make a choice, and flag it with the draft.
Understanding what reviewers look for helps focus effort:
| Criterion | What They Check |
|---|---|
| Quality | Technical soundness, well-supported claims, fair baselines |
| Clarity | Clear writing, reproducible by experts, consistent notation |
| Significance | Community impact, advances understanding |
| Originality | New insights (doesn't require new method) |
Scoring uses the NeurIPS 6-point scale. See references/reviewer-guidelines.md for detailed guidelines, scoring, common concerns, and rebuttal strategies.
| Issue | Solution |
|---|---|
| Abstract too generic | Delete first sentence if it could prepend any ML paper. Start with your specific contribution. |
| Introduction exceeds 1.5 pages | Split background into Related Work. Front-load contribution bullets. |
| Experiments lack explicit claims | Add: "This experiment tests whether [specific claim]..." before each one. |
| Reviewers find paper hard to follow | Add signposting, use consistent terminology, make figure captions self-contained. |
| Missing statistical significance | Add error bars, number of runs, statistical tests, confidence intervals. |
| Scope creep in experiments | Every experiment must map to a specific claim. Cut experiments that don't. |
| Paper rejected, need to resubmit | See Conference Resubmission in Phase 7. Address reviewer concerns without referencing reviews. |
| Document | Contents |
|---|---|
| references/writing-guide.md | Gopen & Swan 7 principles, Perez micro-tips, Lipton word choice, Steinhardt precision, figure design |
| references/citation-workflow.md | Citation APIs, Python code, CitationManager class, BibTeX management |
| references/checklists.md | NeurIPS 16-item, ICML, ICLR, ACL requirements, universal pre-submission checklist |
| references/reviewer-guidelines.md | Evaluation criteria, scoring, common concerns, rebuttal template |
| references/sources.md | Complete bibliography of all writing guides, conference guidelines, APIs |
| references/experiment-patterns.md | Experiment design patterns, evaluation protocols, monitoring, error recovery |
| references/autoreason-methodology.md | Autoreason loop, strategy selection, model guide, prompts, scope constraints, Borda scoring |
Templates in templates/ for: NeurIPS 2025, ICML 2026, ICLR 2026, ACL, AAAI 2026, COLM 2025.
See templates/README.md for compilation instructions.
Writing philosophy: see references/writing-guide.md for the core principles.
APIs: Semantic Scholar | CrossRef | arXiv