End-to-end pipeline for writing ML/AI research papers — from experiment design through analysis, drafting, revision, and submission. Covers NeurIPS, ICML, ICLR, ACL, AAAI, COLM. Integrates automated experiment monitoring, statistical analysis, iterative writing, and citation verification.
End-to-end pipeline for producing publication-ready ML/AI research papers targeting NeurIPS, ICML, ICLR, ACL, AAAI, and COLM. This skill covers the full research lifecycle: experiment design, execution, monitoring, analysis, paper writing, review, revision, and submission.
This is not a linear pipeline — it is an iterative loop. Results trigger new experiments. Reviews trigger new analysis. The agent must handle these feedback loops.
┌─────────────────────────────────────────────────────────────┐
│ RESEARCH PAPER PIPELINE │
│ │
│ Phase 0: Project Setup ──► Phase 1: Literature Review │
│ │ │ │
│ ▼ ▼ │
│ Phase 2: Experiment Phase 5: Paper Drafting ◄──┐ │
│ Design │ │ │
│ │ ▼ │ │
│ ▼ Phase 6: Self-Review │ │
│ Phase 3: Execution & & Revision ──────────┘ │
│ Monitoring │ │
│ │ ▼ │
│ ▼ Phase 7: Submission │
│ Phase 4: Analysis ─────► (feeds back to Phase 2 or 5) │
│ │
└─────────────────────────────────────────────────────────────┘
Use this skill when:
Default: Be proactive. Draft first, ask with the draft.
| Confidence Level | Action |
|---|---|
| High (clear repo, obvious contribution) | Write full draft, deliver, iterate on feedback |
| Medium (some ambiguity) | Write draft with flagged uncertainties, continue |
| Low (major unknowns) | Ask 1-2 targeted questions via clarify, then draft |
| Section | Draft Autonomously? | Flag With Draft |
|---|---|---|
| Abstract | Yes | "Framed contribution as X — adjust if needed" |
| Introduction | Yes | "Emphasized problem Y — correct if wrong" |
| Methods | Yes | "Included details A, B, C — add missing pieces" |
| Experiments | Yes | "Highlighted results 1, 2, 3 — reorder if needed" |
| Related Work | Yes | "Cited papers X, Y, Z — add any I missed" |
Block for input only when: target venue unclear, multiple contradictory framings, results seem incomplete, explicit request to review first.
Goal: Establish the workspace, understand existing work, identify the contribution.
# Understand project structure
ls -la
find . -name "*.py" | head -30
find . -name "*.md" -o -name "*.txt" | xargs grep -l -i "result\|conclusion\|finding"
Look for:
- README.md — project overview and claims
- results/, outputs/, experiments/ — existing findings
- configs/ — experimental settings
- .bib files — existing citations

Establish a consistent workspace structure:
workspace/
paper/ # LaTeX source, figures, compiled PDFs
experiments/ # Experiment runner scripts
code/ # Core method implementation
results/ # Raw experiment results (auto-generated)
tasks/ # Task/benchmark definitions
human_eval/ # Human evaluation materials (if needed)
git init # if not already
git remote add origin <repo-url>
git checkout -b paper-draft # or main
Git discipline: Every completed experiment batch gets committed with a descriptive message. Example:
Add Monte Carlo constrained results (5 runs, Sonnet 4.6, policy memo task)
Add Haiku baseline comparison: autoreason vs refinement baselines at cheap model tier
Before writing anything, articulate:
Propose to the scientist: "Based on my understanding, the main contribution is: [one sentence]. The key results show [Y]. Is this the framing you want?"
Use the todo tool to create a structured project plan:
Research Paper TODO:
- [ ] Define one-sentence contribution
- [ ] Literature review (related work + baselines)
- [ ] Design core experiments
- [ ] Run experiments
- [ ] Analyze results
- [ ] Write first draft
- [ ] Self-review (simulate reviewers)
- [ ] Revise based on review
- [ ] Submission prep
Update this throughout the project. It serves as the persistent state across sessions.
Goal: Find related work, identify baselines, gather citations.
Start from papers already referenced in the codebase:
# Via terminal:
grep -r "arxiv\|doi\|cite" --include="*.md" --include="*.bib" --include="*.py"
find . -name "*.bib"
Load the arxiv skill for structured paper discovery: skill_view("arxiv"). It provides arXiv REST API search, Semantic Scholar citation graphs, author profiles, and BibTeX generation.
Use web_search for broad discovery, web_extract for fetching specific papers:
# Via web_search:
web_search("[main technique] + [application domain] site:arxiv.org")
web_search("[baseline method] comparison ICML NeurIPS 2024")
# Via web_extract (for specific papers):
web_extract("https://arxiv.org/abs/2303.17651")
Additional search queries to try:
Search queries:
- "[main technique] + [application domain]"
- "[baseline method] comparison"
- "[problem name] state-of-the-art"
- Author names from existing citations
Recommended: Install Exa MCP for real-time academic search:
claude mcp add exa -- npx -y mcp-remote "https://mcp.exa.ai/mcp"
NEVER generate BibTeX from memory. ALWAYS fetch programmatically.
For each citation, follow the mandatory 5-step process:
Citation Verification (MANDATORY per citation):
1. SEARCH → Query Semantic Scholar or Exa MCP with specific keywords
2. VERIFY → Confirm paper exists in 2+ sources (Semantic Scholar + arXiv/CrossRef)
3. RETRIEVE → Get BibTeX via DOI content negotiation (programmatically, not from memory)
4. VALIDATE → Confirm the claim you're citing actually appears in the paper
5. ADD → Add verified BibTeX to bibliography
If ANY step fails → mark as [CITATION NEEDED], inform scientist
# Fetch BibTeX via DOI
import requests

def doi_to_bibtex(doi: str) -> str:
    response = requests.get(
        f"https://doi.org/{doi}",
        headers={"Accept": "application/x-bibtex"},
    )
    response.raise_for_status()
    return response.text
If you cannot verify a citation:
\cite{PLACEHOLDER_author2024_verify_this} % TODO: Verify this citation exists
Always tell the scientist: "I've marked [X] citations as placeholders that need verification."
See references/citation-workflow.md for complete API documentation and the full CitationManager class.
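For the SEARCH step, a stdlib-only sketch against the Semantic Scholar Graph API can look like the following (the query, field list, and helper names are illustrative; the endpoint is public but rate-limited):

```python
import json
import urllib.parse
import urllib.request

S2_SEARCH = "https://api.semanticscholar.org/graph/v1/paper/search"

def build_search_url(query: str, limit: int = 5) -> str:
    """Construct the Graph API keyword-search URL."""
    params = urllib.parse.urlencode({
        "query": query,
        "limit": limit,
        "fields": "title,year,externalIds,authors",
    })
    return f"{S2_SEARCH}?{params}"

def search_papers(query: str, limit: int = 5) -> list:
    """Return candidate papers; still verify hits in a second source before citing."""
    with urllib.request.urlopen(build_search_url(query, limit), timeout=30) as resp:
        return json.load(resp).get("data", [])
```

Each hit's `externalIds` typically carries the DOI needed for the RETRIEVE step above.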
Group papers by methodology, not paper-by-paper:
Good: "One line of work uses X's assumption [refs] whereas we use Y's assumption because..." Bad: "Smith et al. introduced X. Jones et al. introduced Y. We combine both."
Goal: Design experiments that directly support paper claims. Every experiment must answer a specific question.
Create an explicit mapping:
| Claim | Experiment | Expected Evidence |
|---|---|---|
| "Our method outperforms baselines" | Main comparison (Table 1) | Win rate, statistical significance |
| "Effect is larger for weaker models" | Model scaling study | Monotonic improvement curve |
| "Convergence requires scope constraints" | Constrained vs unconstrained | Convergence rate comparison |
Rule: If an experiment doesn't map to a claim, don't run it.
Strong baselines are what separates accepted papers from rejected ones. Reviewers will ask: "Did they compare against X?"
Standard baseline categories:
Before running anything, specify:
Follow these patterns from successful research pipelines:
Incremental saving — save results after each step for crash recovery:
# Save after each problem/task (this runs inside the experiment loop)
result_path = f"results/{task}/{strategy}/result.json"
if os.path.exists(result_path):
    continue  # Skip already-completed work

# ... run experiment ...

with open(result_path, 'w') as f:
    json.dump(result, f, indent=2)
Artifact preservation — save all intermediate outputs:
results/<experiment>/
<task>/
<strategy>/
final_output.md # Final result
history.json # Full trajectory
pass_01/ # Per-iteration artifacts
version_a.md
version_b.md
critic.md
Separation of concerns — keep generation, evaluation, and visualization separate:
run_experiment.py # Core experiment runner
run_baselines.py # Baseline comparison
run_comparison_judge.py # Blind evaluation
analyze_results.py # Statistical analysis
make_charts.py # Visualization
See references/experiment-patterns.md for complete design patterns, cron monitoring, and error recovery.
Goal: Run experiments reliably, monitor progress, recover from failures.
Use nohup for long-running experiments:
nohup python run_experiment.py --config config.yaml > logs/experiment_01.log 2>&1 &
echo $! # Record the PID
Parallel execution: Run independent experiments simultaneously, but be aware of API rate limits. 4+ concurrent experiments on the same API will slow each down.
For long-running experiments, set up periodic status checks. The cron prompt should follow this template:
Monitor Prompt Template:
1. Check if process is still running: ps aux | grep <pattern>
2. Read last 30 lines of log: tail -30 <logfile>
3. Check for completed results: ls <result_dir>
4. If results exist, read and report: cat <result_file>
5. If all done, commit: git add -A && git commit -m "<descriptive message>" && git push
6. Report in structured format (tables with key metrics)
7. Answer the key analytical question for this experiment
Silent mode: If nothing has changed since the last check, respond with [SILENT] to suppress notification to the user. Only report when there's news.
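The monitor template above can be sketched as a small shell script (PATTERN, LOGFILE, and RESULT_DIR are assumptions; adapt them to your experiment):

```shell
#!/bin/sh
# Periodic status check: is the run alive, and how many results are done?
PATTERN="run_experiment.py"       # process pattern (assumption)
LOGFILE="logs/experiment_01.log"  # log to tail (assumption)
RESULT_DIR="results"              # where result.json files land (assumption)

if pgrep -f "$PATTERN" >/dev/null 2>&1; then
    STATUS="running"
else
    STATUS="stopped"
fi

# Count completed results so re-runs can report incremental progress
DONE=$(find "$RESULT_DIR" -name 'result.json' 2>/dev/null | wc -l | tr -d ' ')
echo "process: $STATUS, completed results: $DONE"

if [ "$STATUS" = "running" ]; then
    tail -5 "$LOGFILE" 2>/dev/null || true
fi
```

Run it from cron; if the printed line matches the previous check, respond with [SILENT].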
Common failure modes and recovery:
| Failure | Detection | Recovery |
|---|---|---|
| API rate limit / credit exhaustion | 402/429 errors in logs | Wait, then re-run (scripts skip completed work) |
| Process crash | PID gone, incomplete results | Re-run from last checkpoint |
| Timeout on hard problems | Process stuck, no log progress | Kill and skip, note in results |
| Wrong model ID | Errors referencing model name | Fix ID and re-run |
Key: Scripts should always check for existing results and skip completed work. This makes re-runs safe and efficient.
After each experiment batch completes:
git add -A
git commit -m "Add <experiment name>: <key finding in 1 line>"
git push
Goal: Extract findings, compute statistics, identify the story.
Write analysis scripts that:
# Standard analysis pattern
import json
from pathlib import Path

import numpy as np

results = {}
for result_file in Path("results/").rglob("result.json"):
    data = json.loads(result_file.read_text())
    strategy = result_file.parent.name
    task = result_file.parent.parent.name
    results.setdefault(strategy, {})[task] = data

# Compute aggregate metrics
for strategy, tasks in results.items():
    scores = [t["score"] for t in tasks.values()]
    print(f"{strategy}: mean={np.mean(scores):.1f}, std={np.std(scores):.1f}")
Always compute:
See references/experiment-patterns.md for complete implementations of McNemar's test, bootstrapped CIs, and Cohen's h.
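As a quick stdlib sketch of two of those computations (the scores and proportions below are illustrative, not real results):

```python
import math
import random
import statistics

def bootstrap_ci(scores, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean."""
    rng = random.Random(seed)
    means = sorted(
        statistics.fmean(rng.choices(scores, k=len(scores)))
        for _ in range(n_boot)
    )
    return means[int(alpha / 2 * n_boot)], means[int((1 - alpha / 2) * n_boot) - 1]

def cohens_h(p1: float, p2: float) -> float:
    """Effect size for the difference between two proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

scores = [72.0, 75.5, 71.0, 78.2, 74.1]  # illustrative per-run scores
lo, hi = bootstrap_ci(scores)
print(f"mean={statistics.fmean(scores):.1f}, 95% CI=[{lo:.1f}, {hi:.1f}]")
print(f"Cohen's h (0.80 vs 0.65): {cohens_h(0.80, 0.65):.2f}")
```

Report the CI alongside every headline number; a mean without spread invites reviewer pushback.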
After analysis, explicitly answer:
Figures: save as vector PDF with plt.savefig('fig.pdf').

Tables: use the booktabs LaTeX package:

\usepackage{booktabs}
\begin{tabular}{lcc}
\toprule
Method & Accuracy $\uparrow$ & Latency $\downarrow$ \\
\midrule
Baseline & 85.2 & 45ms \\
\textbf{Ours} & \textbf{92.1} & 38ms \\
\bottomrule
\end{tabular}
| Situation | Action |
|---|---|
| Core claims supported, results significant | Move to Phase 5 (writing) |
| Results inconclusive, need more data | Back to Phase 2 (design) |
| Unexpected finding suggests new direction | Back to Phase 2 (design) |
| Missing one ablation reviewers will ask for | Run it, then Phase 5 |
| All experiments done but some failed | Note failures, move to Phase 5 |
Any output in this pipeline — paper drafts, experiment scripts, analysis — can be iteratively refined. The autoreason research provides empirical evidence for when each refinement strategy works and when it fails. Use this section to choose the right approach.
| Your Situation | Strategy | Why |
|---|---|---|
| Mid-tier model + constrained task | Autoreason | Sweet spot. Generation-evaluation gap is widest. Baselines actively destroy weak model outputs. |
| Mid-tier model + open task | Autoreason with scope constraints added | Add fixed facts, structure, or deliverable to bound the improvement space. |
| Frontier model + constrained task | Autoreason | Wins 2/3 constrained tasks even at frontier. |
| Frontier model + unconstrained task | Critique-and-revise or single pass | Autoreason comes last. Model self-evaluates well enough. |
| Concrete technical task (system design) | Critique-and-revise | Direct find-and-fix loop is more efficient. |
| Template-filling task (one correct structure) | Single pass or conservative | Minimal decision space. Iteration adds no value. |
| Code with test cases | Autoreason (code variant) | Structured analysis of why it failed before fixing. Recovery rate 62% vs 43%. |
| Very weak model (Llama 8B class) | Single pass | Model too weak for diverse candidates. Invest in generation quality. |
Core insight: Autoreason's value depends on the gap between a model's generation capability and its self-evaluation capability.
Model Tier │ Generation │ Self-Eval │ Gap │ Autoreason Value
──────────────────┼────────────┼───────────┼────────┼─────────────────
Weak (Llama 8B) │ Poor │ Poor │ Small │ None — can't generate diverse candidates
Mid (Haiku 3.5) │ Decent │ Poor │ LARGE │ MAXIMUM — 42/42 perfect Borda
Mid (Gemini Flash)│ Decent │ Moderate │ Large │ High — wins 2/3
Strong (Sonnet 4) │ Good │ Decent │ Medium │ Moderate — wins 3/5
Frontier (S4.6) │ Excellent │ Good │ Small │ Only with constraints
This gap is structural, not temporary. As costs drop, today's frontier becomes tomorrow's mid-tier. The sweet spot moves but never disappears.
Each pass produces three candidates from fresh, isolated agents:
Key parameters:
When refining the paper itself through autoreason:
| Failure | Detection | Fix |
|---|---|---|
| No convergence (A never wins) | A wins <15% over 20+ passes | Add scope constraints to the task |
| Synthesis drift | Word counts grow unboundedly | Constrain structure and deliverable |
| Degradation below single pass | Baselines score higher than iterated output | Switch to single pass; model may be too weak |
| Overfitting (code) | High public-test pass, low private-test pass | Use structured analysis, not just test feedback |
| Broken judges | Parsing failures reduce panel below 3 | Fix parser before continuing |
See references/autoreason-methodology.md for complete prompts, Borda scoring details, model selection guide, scope constraint design patterns, and compute budget reference.
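The Borda aggregation at the heart of the judge panel can be sketched as follows (candidate labels and the example rankings are illustrative; see the methodology reference for the full scoring details):

```python
def borda_winner(rankings, candidates=("A", "B", "AB")):
    """Aggregate blind judge rankings (best-first) with a Borda count."""
    points = {c: 0 for c in candidates}
    n = len(candidates)
    for ranking in rankings:
        for pos, cand in enumerate(ranking):
            points[cand] += n - 1 - pos  # first place earns n-1 points
    return max(points, key=points.get), points

# Three judges each rank {A, B, AB} best-first:
winner, points = borda_winner([
    ("A", "B", "AB"),
    ("A", "AB", "B"),
    ("B", "A", "AB"),
])
```

Here A earns 5 points, B earns 3, and AB earns 1, so A wins the pass and its streak counter increments.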
Goal: Write a complete, publication-ready paper.
The single most critical insight: Your paper is not a collection of experiments — it's a story with one clear contribution supported by evidence.
Every successful ML paper centers on what Neel Nanda calls "the narrative": a short, rigorous, evidence-based technical story with a takeaway readers care about.
Three Pillars (must be crystal clear by end of introduction):
| Pillar | Description | Test |
|---|---|---|
| The What | 1-3 specific novel claims | Can you state them in one sentence? |
| The Why | Rigorous empirical evidence | Do experiments distinguish your hypothesis from alternatives? |
| The So What | Why readers should care | Does this connect to a recognized community problem? |
If you cannot state your contribution in one sentence, you don't yet have a paper.
Spend approximately equal time on each of:
Why? Most reviewers form judgments before reaching your methods. Readers encounter your paper as: title → abstract → introduction → figures → maybe the rest.
Paper Writing Checklist:
- [ ] Step 1: Define the one-sentence contribution
- [ ] Step 2: Draft Figure 1 (core idea or most compelling result)
- [ ] Step 3: Draft abstract (5-sentence formula)
- [ ] Step 4: Draft introduction (1-1.5 pages max)
- [ ] Step 5: Draft methods
- [ ] Step 6: Draft experiments & results
- [ ] Step 7: Draft related work
- [ ] Step 8: Draft conclusion & discussion
- [ ] Step 9: Draft limitations (REQUIRED by all venues)
- [ ] Step 10: Plan appendix (proofs, extra experiments, details)
- [ ] Step 11: Complete paper checklist
- [ ] Step 12: Final review
The title is the single most-read element of the paper. It determines whether anyone clicks through to the abstract.
Good titles:
Bad titles:
Rules:
From Sebastian Farquhar (DeepMind):
1. What you achieved: "We introduce...", "We prove...", "We demonstrate..."
2. Why this is hard and important
3. How you do it (with specialist keywords for discoverability)
4. What evidence you have
5. Your most remarkable number/result
Delete generic openings like "Large language models have achieved remarkable success..."
Figure 1 is the second thing most readers look at (after abstract). Draft it before writing the introduction — it forces you to clarify the core idea.
| Figure 1 Type | When to Use | Example |
|---|---|---|
| Method diagram | New architecture or pipeline | TikZ flowchart showing your system |
| Results teaser | One compelling result tells the whole story | Bar chart: "Ours vs baselines" with clear gap |
| Problem illustration | The problem is unintuitive | Before/after showing failure mode you fix |
| Conceptual diagram | Abstract contribution needs visual grounding | 2x2 matrix of method properties |
Rules: Figure 1 must be understandable without reading any text. The caption alone should communicate the core idea. Use color purposefully — don't just decorate.
Must include:
Enable reimplementation:
For each experiment, explicitly state:
Requirements:
Organize methodologically, not paper-by-paper. Cite generously — reviewers likely authored relevant papers.
All major conferences require this. Honesty helps:
Conclusion (required, 0.5-1 page):
Discussion (optional, sometimes combined with conclusion):
Do NOT introduce new results or claims in the conclusion.
Appendices are unlimited at all major venues and are essential for reproducibility. Structure:
| Appendix Section | What Goes Here |
|---|---|
| Proofs & Derivations | Full proofs too long for main text. Main text can state theorems with "proof in Appendix A." |
| Additional Experiments | Ablations, scaling curves, per-dataset breakdowns, hyperparameter sensitivity |
| Implementation Details | Full hyperparameter tables, training details, hardware specs, random seeds |
| Dataset Documentation | Data collection process, annotation guidelines, licensing, preprocessing |
| Prompts & Templates | Exact prompts used (for LLM-based methods), evaluation templates |
| Human Evaluation | Annotation interface screenshots, instructions given to annotators, IRB details |
| Additional Figures | Per-task breakdowns, trajectory visualizations, failure case examples |
Rules:
- Use the \appendix command, then \section{A: Proofs} etc.

When over the page limit:
| Cut Strategy | Saves | Risk |
|---|---|---|
| Move proofs to appendix | 0.5-2 pages | Low — standard practice |
| Condense related work | 0.5-1 page | Medium — may miss key citations |
| Combine tables with subfigures | 0.25-0.5 page | Low — often improves readability |
| Use \vspace{-Xpt} sparingly | 0.1-0.3 page | Low if subtle, high if obvious |
| Remove qualitative examples | 0.5-1 page | Medium — reviewers like examples |
| Reduce figure sizes | 0.25-0.5 page | High — figures must remain readable |
Do NOT: reduce font size, change margins, remove required sections (limitations, broader impact), or use \small/\footnotesize for main text.
Sentence-level clarity (Gopen & Swan's 7 Principles):
| Principle | Rule |
|---|---|
| Subject-verb proximity | Keep subject and verb close |
| Stress position | Place emphasis at sentence ends |
| Topic position | Put context first, new info after |
| Old before new | Familiar info → unfamiliar info |
| One unit, one function | Each paragraph makes one point |
| Action in verb | Use verbs, not nominalizations |
| Context before new | Set stage before presenting |
Word choice (Lipton, Steinhardt):
Full writing guide with examples: See references/writing-guide.md
Always copy the entire template directory first, then write within it.
Template Setup Checklist:
- [ ] Step 1: Copy entire template directory to new project
- [ ] Step 2: Verify template compiles as-is (before any changes)
- [ ] Step 3: Read the template's example content to understand structure
- [ ] Step 4: Replace example content section by section
- [ ] Step 5: Use template macros (check preamble for \newcommand definitions)
- [ ] Step 6: Clean up template artifacts only at the end
Step 1: Copy the Full Template
cp -r templates/neurips2025/ ~/papers/my-paper/
cd ~/papers/my-paper/
ls -la # Should see: main.tex, neurips.sty, Makefile, etc.
Copy the ENTIRE directory, not just the .tex file. Templates include style files (.sty), bibliography styles (.bst), example content, and Makefiles.
Step 2: Verify Template Compiles First
Before making ANY changes:
latexmk -pdf main.tex
# Or manual: pdflatex main.tex && bibtex main && pdflatex main.tex && pdflatex main.tex
If the unmodified template doesn't compile, fix that first (usually missing TeX packages — install via tlmgr install <package>).
Step 3: Keep Template Content as Reference
Don't immediately delete example content. Comment it out and use as formatting reference:
% Template example (keep for reference):
% \begin{figure}[t]
% \centering
% \includegraphics[width=0.8\linewidth]{example-image}
% \caption{Template shows caption style}
% \end{figure}
% Your actual figure:
\begin{figure}[t]
\centering
\includegraphics[width=0.8\linewidth]{your-figure.pdf}
\caption{Your caption following the same style.}
\end{figure}
Step 4: Replace Content Section by Section
Work through systematically: title/authors → abstract → introduction → methods → experiments → related work → conclusion → references → appendix. Compile after each section.
Step 5: Use Template Macros
\newcommand{\method}{YourMethodName} % Consistent method naming
\newcommand{\eg}{e.g.,\xspace} % Proper abbreviations
\newcommand{\ie}{i.e.,\xspace}
| Pitfall | Problem | Solution |
|---|---|---|
| Copying only .tex file | Missing .sty, won't compile | Copy entire directory |
| Modifying .sty files | Breaks conference formatting | Never edit style files |
| Adding random packages | Conflicts, breaks template | Only add if necessary |
| Deleting template content early | Lose formatting reference | Keep as comments until done |
| Not compiling frequently | Errors accumulate | Compile after each section |
| Raster PNGs for figures | Blurry in paper | Always use vector PDF via savefig('fig.pdf') |
| Conference | Main File | Style File | Page Limit |
|---|---|---|---|
| NeurIPS 2025 | main.tex | neurips.sty | 9 pages |
| ICML 2026 | example_paper.tex | icml2026.sty | 8 pages |
| ICLR 2026 | iclr2026_conference.tex | iclr2026_conference.sty | 9 pages |
| ACL 2025 | acl_latex.tex | acl.sty | 8 pages (long) |
| AAAI 2026 | aaai2026-unified-template.tex | aaai2026.sty | 7 pages |
| COLM 2025 | colm2025_conference.tex | colm2025_conference.sty | 9 pages |
Universal: Double-blind, references don't count, appendices unlimited, LaTeX required.
Templates in templates/ directory. See templates/README.md for compilation setup (VS Code, CLI, Overleaf, other IDEs).
Tables — use booktabs for professional formatting:
\usepackage{booktabs}
\begin{tabular}{lcc}
\toprule
Method & Accuracy $\uparrow$ & Latency $\downarrow$ \\
\midrule
Baseline & 85.2 & 45ms \\
\textbf{Ours} & \textbf{92.1} & 38ms \\
\bottomrule
\end{tabular}
Rules:
Figures: save as vector PDF with plt.savefig('fig.pdf').

For converting between venues, see Phase 7 (Submission Preparation) — it covers the full conversion workflow, page-change table, and post-rejection guidance.
Add these packages to any paper for professional quality. They are compatible with all major conference style files:
% --- Professional Packages (add after conference style file) ---
% Typography
\usepackage{microtype} % Microtypographic improvements (protrusion, expansion)
% Makes text noticeably more polished — always include
% Tables
\usepackage{booktabs} % Professional table rules (\toprule, \midrule, \bottomrule)
\usepackage{siunitx} % Consistent number formatting, decimal alignment
% Usage: \num{12345} → 12,345; \SI{3.5}{GHz} → 3.5 GHz
% Table alignment: S column type for decimal-aligned numbers
% Figures
\usepackage{graphicx} % Include graphics (\includegraphics)
\usepackage{subcaption} % Subfigures with (a), (b), (c) labels
% Usage: \begin{subfigure}{0.48\textwidth} ... \end{subfigure}
% Diagrams and Algorithms
\usepackage{tikz} % Programmable vector diagrams
\usetikzlibrary{arrows.meta, positioning, shapes.geometric, calc, fit, backgrounds}
\usepackage[ruled,vlined]{algorithm2e} % Professional pseudocode
% Alternative: \usepackage{algorithmicx} if template bundles it
% Cross-references
\usepackage{cleveref} % Smart references: \cref{fig:x} → "Figure 1"
% MUST be loaded AFTER hyperref
% Handles: figures, tables, sections, equations, algorithms
% Math (usually included by conference .sty, but verify)
\usepackage{amsmath,amssymb} % AMS math environments and symbols
\usepackage{mathtools} % Extends amsmath (dcases, coloneqq, etc.)
% Colors (for figures and diagrams)
\usepackage{xcolor} % Color management
% Okabe-Ito colorblind-safe palette:
\definecolor{okblue}{HTML}{0072B2}
\definecolor{okorange}{HTML}{E69F00}
\definecolor{okgreen}{HTML}{009E73}
\definecolor{okred}{HTML}{D55E00}
\definecolor{okpurple}{HTML}{CC79A7}
\definecolor{okcyan}{HTML}{56B4E9}
\definecolor{okyellow}{HTML}{F0E442}
Notes:
- microtype is the single highest-impact package for visual quality. It adjusts character spacing at a sub-pixel level. Always include it.
- siunitx handles decimal alignment in tables via the S column type and eliminates manual spacing.
- cleveref must be loaded after hyperref. Most conference .sty files load hyperref, so put cleveref last.
- Check which packages the conference style file already loads (algorithm, amsmath, graphicx) and don't double-load them.

siunitx makes number-heavy tables significantly more readable:
\begin{tabular}{l S[table-format=2.1] S[table-format=2.1] S[table-format=2.1]}
\toprule
Method & {Accuracy $\uparrow$} & {F1 $\uparrow$} & {Latency (ms) $\downarrow$} \\
\midrule
Baseline & 85.2 & 83.7 & 45.3 \\
Ablation (no X) & 87.1 & 85.4 & 42.1 \\
\textbf{Ours} & \textbf{92.1} & \textbf{90.8} & \textbf{38.7} \\
\bottomrule
\end{tabular}
The S column type auto-aligns on the decimal point. Headers in {} escape the alignment.
Standard pattern for side-by-side figures:
\begin{figure}[t]
\centering
\begin{subfigure}[b]{0.48\textwidth}
\centering
\includegraphics[width=\textwidth]{fig_results_a.pdf}
\caption{Results on Dataset A.}
\label{fig:results-a}
\end{subfigure}
\hfill
\begin{subfigure}[b]{0.48\textwidth}
\centering
\includegraphics[width=\textwidth]{fig_results_b.pdf}
\caption{Results on Dataset B.}
\label{fig:results-b}
\end{subfigure}
\caption{Comparison of our method across two datasets. (a) shows the scaling
behavior and (b) shows the ablation results. Both use 5 random seeds.}
\label{fig:results}
\end{figure}
Use \cref{fig:results} → "Figure 1", \cref{fig:results-a} → "Figure 1a".
\begin{algorithm}[t]
\caption{Iterative Refinement with Judge Panel}
\label{alg:method}
\KwIn{Task $T$, model $M$, judges $J_1 \ldots J_n$, convergence threshold $k$}
\KwOut{Final output $A^*$}
$A \gets M(T)$ \tcp*{Initial generation}
$\text{streak} \gets 0$\;
\While{$\text{streak} < k$}{
$C \gets \text{Critic}(A, T)$ \tcp*{Identify weaknesses}
$B \gets M(T, C)$ \tcp*{Revised version addressing critique}
$AB \gets \text{Synthesize}(A, B)$ \tcp*{Merge best elements}
\ForEach{judge $J_i$}{
$\text{rank}_i \gets J_i(\text{shuffle}(A, B, AB))$ \tcp*{Blind ranking}
}
$\text{winner} \gets \text{BordaCount}(\text{ranks})$\;
\eIf{$\text{winner} = A$}{
$\text{streak} \gets \text{streak} + 1$\;
}{
$A \gets \text{winner}$; $\text{streak} \gets 0$\;
}
}
\Return{$A$}\;
\end{algorithm}
TikZ is the standard for method diagrams in ML papers. Common patterns:
Pipeline/Flow Diagram (most common in ML papers):
\begin{figure}[t]
\centering
\begin{tikzpicture}[
node distance=1.8cm,
box/.style={rectangle, draw, rounded corners, minimum height=1cm,
minimum width=2cm, align=center, font=\small},
arrow/.style={-{Stealth[length=3mm]}, thick},
]
\node[box, fill=okcyan!20] (input) {Input\\$x$};
\node[box, fill=okblue!20, right of=input] (encoder) {Encoder\\$f_\theta$};
\node[box, fill=okgreen!20, right of=encoder] (latent) {Latent\\$z$};
\node[box, fill=okorange!20, right of=latent] (decoder) {Decoder\\$g_\phi$};
\node[box, fill=okred!20, right of=decoder] (output) {Output\\$\hat{x}$};
\draw[arrow] (input) -- (encoder);
\draw[arrow] (encoder) -- (latent);
\draw[arrow] (latent) -- (decoder);
\draw[arrow] (decoder) -- (output);
\end{tikzpicture}
\caption{Architecture overview. The encoder maps input $x$ to latent
representation $z$, which the decoder reconstructs.}
\label{fig:architecture}
\end{figure}
Comparison/Matrix Diagram (for showing method variants):
\begin{tikzpicture}[
cell/.style={rectangle, draw, minimum width=2.5cm, minimum height=1cm,
align=center, font=\small},
header/.style={cell, fill=gray!20, font=\small\bfseries},
]
% Headers
\node[header] at (0, 0) {Method};
\node[header] at (3, 0) {Converges?};
\node[header] at (6, 0) {Quality?};
% Rows
\node[cell] at (0, -1) {Single Pass};
\node[cell, fill=okgreen!15] at (3, -1) {N/A};
\node[cell, fill=okorange!15] at (6, -1) {Baseline};
\node[cell] at (0, -2) {Critique+Revise};
\node[cell, fill=okred!15] at (3, -2) {No};
\node[cell, fill=okred!15] at (6, -2) {Degrades};
\node[cell] at (0, -3) {Ours};
\node[cell, fill=okgreen!15] at (3, -3) {Yes ($k$=2)};
\node[cell, fill=okgreen!15] at (6, -3) {Improves};
\end{tikzpicture}
Iterative Loop Diagram (for methods with feedback):
\begin{tikzpicture}[
node distance=2cm,
box/.style={rectangle, draw, rounded corners, minimum height=0.8cm,
minimum width=1.8cm, align=center, font=\small},
arrow/.style={-{Stealth[length=3mm]}, thick},
label/.style={font=\scriptsize, midway, above},
]
\node[box, fill=okblue!20] (gen) {Generator};
\node[box, fill=okred!20, right=2.5cm of gen] (critic) {Critic};
\node[box, fill=okgreen!20, below=1.5cm of $(gen)!0.5!(critic)$] (judge) {Judge Panel};
\draw[arrow] (gen) -- node[label] {output $A$} (critic);
\draw[arrow] (critic) -- node[label, right] {critique $C$} (judge);
\draw[arrow] (judge) -| node[label, left, pos=0.3] {winner} (gen);
\end{tikzpicture}
Essential for rebuttals — generates a marked-up PDF showing changes between versions:
# Install
# macOS: brew install latexdiff (or comes with TeX Live)
# Linux: sudo apt install latexdiff
# Generate diff
latexdiff paper_v1.tex paper_v2.tex > paper_diff.tex
pdflatex paper_diff.tex
# For multi-file projects (with \input{} or \include{})
latexdiff --flatten paper_v1.tex paper_v2.tex > paper_diff.tex
This produces a PDF with deletions in red strikethrough and additions in blue — standard format for rebuttal supplements.
Install and use for publication-quality plots:
pip install SciencePlots
import numpy as np
import matplotlib.pyplot as plt
import scienceplots  # registers styles

# Illustrative data (replace with your results)
x = np.arange(100)
y = 1 - np.exp(-x / 30)
y2 = 1 - np.exp(-x / 60)

# Use science style (IEEE-like, clean)
with plt.style.context(['science', 'no-latex']):
    fig, ax = plt.subplots(figsize=(3.5, 2.5))  # Single-column width
    ax.plot(x, y, label='Ours', color='#0072B2')
    ax.plot(x, y2, label='Baseline', color='#D55E00', linestyle='--')
    ax.set_xlabel('Training Steps')
    ax.set_ylabel('Accuracy')
    ax.legend()
    fig.savefig('paper/fig_results.pdf', bbox_inches='tight')

# Available styles: 'science', 'ieee', 'nature', 'science+ieee'
# Add 'no-latex' if LaTeX is not installed on the machine generating plots
Standard figure sizes (two-column format):
- figsize=(3.5, 2.5) — fits in one column
- figsize=(7.0, 3.0) — spans both columns
- figsize=(3.5, 3.5) — for heatmaps, confusion matrices

Goal: Simulate the review process before submission. Catch weaknesses early.
Generate reviews from multiple perspectives using strong models (Opus 4, Sonnet 4.6, Gemini 2.5 Pro). Use the reviewer guidelines from the target venue.
Review prompt template:
You are an expert reviewer for [VENUE]. Review this paper according to the
official reviewer guidelines. Evaluate:
1. Quality (technical soundness, baselines, claims supported by evidence)
2. Clarity (writing, notation consistency, reproducibility)
3. Significance (impact, importance of the problem)
4. Originality (novelty, new insights)
Provide:
- Summary (2-3 sentences)
- Strengths (bullet list)
- Weaknesses (bullet list, most critical first)
- Questions for authors
- Missing references
- Score (1-6 on NeurIPS scale)
- Confidence (1-5)
After collecting reviews, categorize:
| Priority | Action |
|---|---|
| Critical (technical flaw, missing baseline) | Must fix. May require new experiments → back to Phase 2 |
| High (clarity issue, missing ablation) | Should fix in this revision |
| Medium (minor writing issues, extra experiments) | Fix if time allows |
| Low (style preferences, tangential suggestions) | Note for future work |
Fix each critical/high issue before moving on to the next revision pass.
When responding to actual reviews (post-submission), rebuttals are a distinct skill from revision:
Format: Point-by-point. For each reviewer concern:
> R1-W1: "The paper lacks comparison with Method X."
We thank the reviewer for this suggestion. We have added a comparison with
Method X in Table 3 (revised). Our method outperforms X by 3.2pp on [metric]
(p<0.05). We note that X requires 2x our compute budget.
Rules:
- Use `latexdiff` to generate a marked-up PDF showing changes (see Professional LaTeX Tooling section)

What NOT to do: "We respectfully disagree" without evidence. "This is out of scope" without explanation. Ignoring a weakness by only responding to strengths.
Save snapshots at key milestones:
paper/
paper.tex # Current working version
paper_v1_first_draft.tex # First complete draft
paper_v2_post_review.tex # After simulated review
paper_v3_pre_submission.tex # Final before submission
paper_v4_camera_ready.tex # Post-acceptance final
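The file copies above work, but git tags mark the same milestones without duplicate files. A minimal sketch — the tag names are illustrative, and the scratch repo exists only so the demo runs anywhere; in your real paper repo only the tag commands apply:

```shell
# Throwaway demo repo; in your real paper repo only the tag commands apply
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email demo@example.com
git config user.name "Demo"
printf '%s\n' '\documentclass{article}' > paper.tex
git add -A
git commit -qm 'first complete draft'

# Tag the milestone; annotated tags record tagger and date
git tag -a v1-first-draft -m 'First complete draft'
git tag   # lists milestone tags
```

`git push --tags` publishes the tags, and `latexdiff` can then be run between any two checked-out tagged versions.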
Goal: Final checks, formatting, and submission.
Every venue has mandatory checklists. Complete them carefully — incomplete checklists can result in desk rejection.
See references/checklists.md for the NeurIPS, ICML, ICLR, and ACL requirements and the universal pre-submission checklist.
Double-blind review means reviewers cannot know who wrote the paper. Check ALL of these:
Anonymization Checklist:
- [ ] No author names or affiliations anywhere in the PDF
- [ ] No acknowledgments section (add after acceptance)
- [ ] Self-citations written in third person: "Smith et al. [1] showed..." not "We previously showed [1]..."
- [ ] No GitHub/GitLab URLs pointing to your personal repos
- [ ] Use Anonymous GitHub (https://anonymous.4open.science/) for code links
- [ ] No institutional logos or identifiers in figures
- [ ] No file metadata containing author names (check PDF properties)
- [ ] No "our previous work" or "in our earlier paper" phrasing
- [ ] Dataset names don't reveal institution (rename if needed)
- [ ] Supplementary materials don't contain identifying information
Common mistakes: Git commit messages visible in supplementary code, watermarked figures from institutional tools, acknowledgments left in from a previous draft, arXiv preprint posted during the anonymity period.
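Several of these checks can be automated with a mechanical grep over the sources. A sketch — the phrases and the scratch directory are illustrative; extend the pattern with your own name and institution, and point `SRC` at your real paper directory:

```shell
# Scratch sources so the demo runs anywhere; use your real paper directory in practice
SRC=$(mktemp -d)
printf '%s\n' 'In our previous work [3], we showed...' > "$SRC/intro.tex"

# Phrases and URLs that commonly break double-blind review
grep -rniE 'acknowledg|our (previous|earlier) (work|paper)|github\.com' \
  --include='*.tex' "$SRC"

# PDF metadata check (requires poppler's pdfinfo; skipped if absent)
{ command -v pdfinfo >/dev/null && pdfinfo main.pdf | grep -Ei 'author|creator'; } || true
```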
Pre-Submission Format Check:
- [ ] Page limit respected (excluding references and appendix)
- [ ] All figures are vector (PDF) or high-res raster (600 DPI PNG)
- [ ] All figures readable in grayscale
- [ ] All tables use booktabs
- [ ] References compile correctly (no "?" in citations)
- [ ] No overfull hboxes in critical areas
- [ ] Appendix clearly labeled and separated
- [ ] Required sections present (limitations, broader impact, etc.)
# Clean build
rm -f *.aux *.bbl *.blg *.log *.out *.pdf
latexmk -pdf main.tex
# Or manual
pdflatex main.tex
bibtex main
pdflatex main.tex
pdflatex main.tex
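The "no ? in citations" check can also be done mechanically by scanning the build log for undefined-citation warnings. A sketch — the demo fabricates log lines so it is safe to run anywhere; in practice point `log` at your real main.log after compiling:

```shell
# Demo log; replace with your actual main.log after compiling
log=$(mktemp)
printf '%s\n' 'LaTeX Warning: Citation `smith2024` on page 3 undefined on input line 41.' >> "$log"
printf '%s\n' 'LaTeX Warning: Reference `fig:results` on page 5 undefined on input line 90.' >> "$log"

# Count and list unresolved citations/references
undefined=$(grep -cE '(Citation|Reference) .* undefined' "$log")
if [ "$undefined" -gt 0 ]; then
  echo "FIX BEFORE SUBMITTING: $undefined unresolved citations/references"
  grep -E '(Citation|Reference) .* undefined' "$log"
fi
```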
| Venue | Special Requirements |
|---|---|
| NeurIPS | Paper checklist in appendix, lay summary if accepted |
| ICML | Broader Impact Statement (after conclusion, doesn't count toward limit) |
| ICLR | LLM disclosure required, reciprocal reviewing agreement |
| ACL | Mandatory Limitations section, Responsible NLP checklist |
| AAAI | Strict style file — no modifications whatsoever |
| COLM | Frame contribution for language model community |
When converting between venues, never copy LaTeX preambles between templates:
# 1. Start fresh with target template
cp -r templates/icml2026/ new_submission/
# 2. Copy ONLY content sections (not preamble)
# - Abstract text, section content, figures, tables, bib entries
# 3. Adjust for page limits
# 4. Add venue-specific required sections
# 5. Update references
| From → To | Page Change | Key Adjustments |
|---|---|---|
| NeurIPS → ICML | 9 → 8 | Cut 1 page, add Broader Impact |
| ICML → ICLR | 8 → 9 | Expand experiments, add LLM disclosure |
| NeurIPS → ACL | 9 → 8 | Restructure for NLP conventions, add Limitations |
| ICLR → AAAI | 9 → 7 | Significant cuts, strict style adherence |
| Any → COLM | varies → 9 | Reframe for language model focus |
When cutting pages: move proofs to appendix, condense related work, combine tables, use subfigures. When expanding: add ablations, expand limitations, include additional baselines, add qualitative examples.
After rejection: Address reviewer concerns in the new version, but don't include a "changes" section or reference the previous submission (blind review).
After acceptance, prepare the camera-ready version:
Camera-Ready Checklist:
- [ ] De-anonymize: add author names, affiliations, email addresses
- [ ] Add Acknowledgments section (funding, compute grants, helpful reviewers)
- [ ] Add public code/data URL (real GitHub, not anonymous)
- [ ] Address any mandatory revisions from meta-reviewer
- [ ] Switch template to camera-ready mode (if applicable — e.g., AAAI \anon → \camera)
- [ ] Add copyright notice if required by venue
- [ ] Update any "anonymous" placeholders in text
- [ ] Verify final PDF compiles cleanly
- [ ] Check page limit for camera-ready (sometimes differs from submission)
- [ ] Upload supplementary materials (code, data, appendix) to venue portal
This skill is designed for the Hermes agent. It uses Hermes tools, delegation, scheduling, and memory for the full research lifecycle.
Compose this skill with other Hermes skills for specific phases:
| Skill | When to Use | How to Load |
|---|---|---|
| arxiv | Phase 1 (Literature Review): searching arXiv, generating BibTeX, finding related papers via Semantic Scholar | skill_view("arxiv") |
| subagent-driven-development | Phase 5 (Drafting): parallel section writing with 2-stage review (spec compliance then quality) | skill_view("subagent-driven-development") |
| plan | Phase 0 (Setup): creating structured plans before execution. Writes to .hermes/plans/ | skill_view("plan") |
| qmd | Phase 1 (Literature): searching local knowledge bases (notes, transcripts, docs) via hybrid BM25+vector search | Install: skill_manage("install", "qmd") |
| diagramming | Phase 4-5: creating Excalidraw-based figures and architecture diagrams | skill_view("diagramming") |
| data-science | Phase 4 (Analysis): Jupyter live kernel for interactive analysis and visualization | skill_view("data-science") |
This skill supersedes ml-paper-writing — it contains all of ml-paper-writing's content plus the full experiment/analysis pipeline and autoreason methodology.
| Tool | Usage in This Pipeline |
|---|---|
| `terminal` | LaTeX compilation (`latexmk -pdf`), git operations, launching experiments (`nohup python run.py &`), process checks |
| `process` | Background experiment management: `process("start", ...)`, `process("poll", pid)`, `process("log", pid)`, `process("kill", pid)` |
| `execute_code` | Run Python for citation verification, statistical analysis, data aggregation. Has tool access via RPC. |
| `read_file` / `write_file` / `patch` | Paper editing, experiment scripts, result files. Use `patch` for targeted edits to large .tex files. |
| `web_search` | Literature discovery: `web_search("transformer attention mechanism 2024")` |
| `web_extract` | Fetch paper content, verify citations: `web_extract("https://arxiv.org/abs/2303.17651")` |
| `delegate_task` | Parallel section drafting — spawn isolated subagents for each section. Also for concurrent citation verification. |
| `todo` | Primary state tracker across sessions. Update after every phase transition. |
| `memory` | Persist key decisions across sessions: contribution framing, venue choice, reviewer feedback. |
| `cronjob` | Schedule experiment monitoring, deadline countdowns, automated arXiv checks. |
| `clarify` | Ask the user targeted questions when blocked (venue choice, contribution framing). |
| `send_message` | Notify user when experiments complete or drafts are ready, even if user isn't in chat. |
Experiment monitoring (most common):
terminal("ps aux | grep <pattern>")
→ terminal("tail -30 <logfile>")
→ terminal("ls results/")
→ execute_code("analyze results JSON, compute metrics")
→ terminal("git add -A && git commit -m '<descriptive message>' && git push")
→ send_message("Experiment complete: <summary>")
Parallel section drafting (using delegation):
delegate_task("Draft the Methods section based on these experiment scripts and configs.
Include: pseudocode, all hyperparameters, architectural details sufficient for
reproduction. Write in LaTeX using the neurips2025 template conventions.")
delegate_task("Draft the Related Work section. Use web_search and web_extract to
find papers. Verify every citation via Semantic Scholar. Group by methodology.")
delegate_task("Draft the Experiments section. Read all result files in results/.
State which claim each experiment supports. Include error bars and significance.")
Each delegate runs as a fresh subagent with no shared context — provide all necessary information in the prompt. Collect outputs and integrate.
Citation verification (using execute_code):
# In execute_code:
from semanticscholar import SemanticScholar
import requests

sch = SemanticScholar()
results = sch.search_paper("attention mechanism transformers", limit=5)
for paper in results:
    ids = paper.externalIds or {}  # externalIds can be None
    doi = ids.get('DOI')
    if doi:
        bibtex = requests.get(f"https://doi.org/{doi}",
                              headers={"Accept": "application/x-bibtex"}).text
        print(bibtex)
`memory` and `todo`

`memory` tool — persist key decisions (bounded: ~2200 chars for MEMORY.md):
memory("add", "Paper: autoreason. Venue: NeurIPS 2025 (9 pages).
Contribution: structured refinement works when generation-evaluation gap is wide.
Key results: Haiku 42/42, Sonnet 3/5, S4.6 constrained 2/3.
Status: Phase 5 — drafting Methods section.")
Update memory after major decisions or phase transitions. This persists across sessions.
todo tool — track granular progress:
todo("add", "Design constrained task experiments for Sonnet 4.6")
todo("add", "Run Haiku baseline comparison")
todo("add", "Draft Methods section")
todo("update", id=3, status="in_progress")
todo("update", id=1, status="completed")
Session startup protocol:
1. todo("list") # Check current task list
2. memory("read") # Recall key decisions
3. terminal("git log --oneline -10") # Check recent commits
4. terminal("ps aux | grep python") # Check running experiments
5. terminal("ls results/ | tail -20") # Check for new results
6. Report status to user, ask for direction
`cronjob`

Use the `cronjob` tool to schedule periodic experiment checks:
cronjob("create", {
"schedule": "*/30 * * * *", # Every 30 minutes
"prompt": "Check experiment status:
1. ps aux | grep run_experiment
2. tail -30 logs/experiment_haiku.log
3. ls results/haiku_baselines/
4. If complete: read results, compute Borda scores,
git add -A && git commit -m 'Add Haiku results' && git push
5. Report: table of results, key finding, next step
6. If nothing changed: respond with [SILENT]"
})
[SILENT] protocol: When nothing has changed since the last check, respond with exactly [SILENT]. This suppresses notification delivery to the user. Only report when there are genuine changes worth knowing about.
Deadline tracking:
cronjob("create", {
"schedule": "0 9 * * *", # Daily at 9am
"prompt": "NeurIPS 2025 deadline: May 22. Today is {date}.
Days remaining: {compute}.
Check todo list — are we on track?
If <7 days: warn user about remaining tasks."
})
When to notify the user (via send_message or direct response):
When NOT to notify: routine status checks where nothing has changed; respond with `[SILENT]` instead.

Report format — always include structured data:
## Experiment: <name>
Status: Complete / Running / Failed
| Task | Method A | Method B | Method C |
|------|---------|---------|---------|
| Task 1 | 85.2 | 82.1 | **89.4** |
Key finding: <one sentence>
Next step: <what happens next>
Use clarify for targeted questions when genuinely blocked:
| Decision | When to Ask |
|---|---|
| Target venue | Before starting paper (affects page limits, framing) |
| Contribution framing | When multiple valid framings exist |
| Experiment priority | When TODO list has more experiments than time allows |
| Submission readiness | Before final submission |
For anything else, do NOT ask: be proactive, make a choice, and flag it with the draft.
Understanding what reviewers look for helps focus effort:
| Criterion | What They Check |
|---|---|
| Quality | Technical soundness, well-supported claims, fair baselines |
| Clarity | Clear writing, reproducible by experts, consistent notation |
| Significance | Community impact, advances understanding |
| Originality | New insights (doesn't require new method) |
Scoring uses the NeurIPS 6-point scale. See references/reviewer-guidelines.md for detailed guidelines, scoring, common concerns, and rebuttal strategies.
| Issue | Solution |
|---|---|
| Abstract too generic | Delete first sentence if it could prepend any ML paper. Start with your specific contribution. |
| Introduction exceeds 1.5 pages | Split background into Related Work. Front-load contribution bullets. |
| Experiments lack explicit claims | Add: "This experiment tests whether [specific claim]..." before each one. |
| Reviewers find paper hard to follow | Add signposting, use consistent terminology, make figure captions self-contained. |
| Missing statistical significance | Add error bars, number of runs, statistical tests, confidence intervals. |
| Scope creep in experiments | Every experiment must map to a specific claim. Cut experiments that don't. |
| Paper rejected, need to resubmit | See Conference Resubmission in Phase 7. Address reviewer concerns without referencing reviews. |
| Document | Contents |
|---|---|
| references/writing-guide.md | Gopen & Swan 7 principles, Perez micro-tips, Lipton word choice, Steinhardt precision, figure design |
| references/citation-workflow.md | Citation APIs, Python code, CitationManager class, BibTeX management |
| references/checklists.md | NeurIPS 16-item, ICML, ICLR, ACL requirements, universal pre-submission checklist |
| references/reviewer-guidelines.md | Evaluation criteria, scoring, common concerns, rebuttal template |
| references/sources.md | Complete bibliography of all writing guides, conference guidelines, APIs |
| references/experiment-patterns.md | Experiment design patterns, evaluation protocols, monitoring, error recovery |
| references/autoreason-methodology.md | Autoreason loop, strategy selection, model guide, prompts, scope constraints, Borda scoring |
Templates in templates/ for: NeurIPS 2025, ICML 2026, ICLR 2026, ACL, AAAI 2026, COLM 2025.
See templates/README.md for compilation instructions.
Writing philosophy: see references/writing-guide.md for the core principles.
APIs: Semantic Scholar | CrossRef | arXiv