Autonomously improve a generated paper via GPT-5.4 xhigh review → implement fixes → recompile, for 2 rounds. Use when user says "改论文", "improve paper", "论文润色循环", "auto improve", or wants to iteratively polish a generated paper.
Autonomously improve the paper at: $ARGUMENTS
This skill is designed to run after Workflow 3 (/paper-plan → /paper-figure → /paper-write → /paper-compile). It takes a compiled paper and iteratively improves it through external LLM review.
Unlike /auto-review-loop (which iterates on research — running experiments, collecting data, rewriting narrative), this skill iterates on paper writing quality — fixing theoretical inconsistencies, softening overclaims, adding missing content, and improving presentation.
- gpt-5.4: model used via Codex MCP for paper review.
- REVIEWER_BIAS_GUARD = true (default): when true, every review round uses a fresh mcp__codex__codex thread with no prior review context. Never use mcp__codex__codex-reply for review rounds. Set to false only for deliberate debugging of the legacy behavior. Empirical evidence (April 2026): running the same paper with codex-reply + "since last round we did X" prompts inflated scores from a real 3/10 to a fake 8/10 across 5 rounds; switching to fresh threads recovered the true 3/10 assessment.
- PAPER_IMPROVEMENT_LOG.md: cumulative log of all rounds, stored in the paper directory.
- HUMAN_CHECKPOINT = false (default): when true, pause after each round's review and present score + weaknesses to the user. The user can approve fixes, provide custom modification instructions, skip specific fixes, or stop early. When false, the loop runs fully autonomously.
💡 Override:
/auto-paper-improvement-loop "paper/" — human checkpoint: true
- paper/main.pdf + LaTeX source files
- Section .tex files: concatenated for the review prompt
If the context window fills up mid-loop, Claude Code auto-compacts. To recover, this skill writes PAPER_IMPROVEMENT_STATE.json after each round:
{
  "current_round": 1,
  "threadId": "019ce736-...",
  "last_score": 6,
  "status": "in_progress",
  "timestamp": "2026-03-13T21:00:00"
}
On startup: if PAPER_IMPROVEMENT_STATE.json exists with "status": "in_progress" AND timestamp is within 24 hours, read it + PAPER_IMPROVEMENT_LOG.md to recover context, then resume from the next round. Otherwise (file absent, "status": "completed", or older than 24 hours), start fresh.
After each round: overwrite the state file. On completion: set "status": "completed".
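The startup rule above can be sketched as a small predicate. This is a minimal sketch, not the skill's actual implementation: the function name `should_resume` and the `max_age` parameter are hypothetical, and the timestamp is assumed to be a naive ISO-8601 string like the one in the example state file.

```python
from datetime import datetime, timedelta


def should_resume(state, now, max_age=timedelta(hours=24)):
    """Return True if a saved state is in progress and recent enough to resume.

    `state` is the parsed PAPER_IMPROVEMENT_STATE.json (or None if the file
    is absent). `now` is a naive datetime in the same clock as the timestamp.
    """
    if not state or state.get("status") != "in_progress":
        return False
    saved_at = datetime.fromisoformat(state["timestamp"])
    return now - saved_at <= max_age


state = {
    "current_round": 1,
    "last_score": 6,
    "status": "in_progress",
    "timestamp": "2026-03-13T21:00:00",
}
# Twelve hours after the saved timestamp: still inside the 24-hour window.
resume = should_resume(state, datetime(2026, 3, 14, 9, 0, 0))
```

When the predicate is false, the loop starts fresh from Round 1 rather than trusting stale state.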
The reviewer must be context-naive on every round. Prior-round summaries, fix lists, and executor explanations are not evidence; they are a source of confirmation bias. If the reviewer is told what changed, scores tend to drift upward even when the manuscript itself has not materially improved.
Rules:
- Every review round uses a fresh mcp__codex__codex thread, not mcp__codex__codex-reply.
- The reviewer sees only the current .tex source and compiled PDF.
Set REVIEWER_BIAS_GUARD = false only if you explicitly want the legacy, context-carrying behavior for debugging.
cp paper/main.pdf paper/main_round0_original.pdf
Concatenate all section files into a single text block for the review prompt:
# Collect all sections in order
for f in paper/sections/*.tex; do
  echo "% === $(basename "$f") ==="
  cat "$f"
done > /tmp/paper_full_text.txt
Send the full paper text AND compiled PDF to GPT-5.4 xhigh:
mcp__codex__codex:
  model: gpt-5.4
  config: {"model_reasoning_effort": "xhigh"}
  prompt: |
    You are reviewing a [VENUE] paper. Please provide a detailed, structured review.

    ## Paper Files:
    - LaTeX source: [list all section .tex files]
    - Compiled PDF: paper/main.pdf
    - Figures: [list figure files]

    Read BOTH the LaTeX source (for content/logic) AND the compiled PDF (for visual presentation).

    ## Review Instructions
    Please act as a senior ML reviewer ([VENUE] level). Provide:
    1. **Overall Score** (1-10, where 6 = weak accept, 7 = accept)
    2. **Summary** (2-3 sentences)
    3. **Strengths** (bullet list, ranked)
    4. **Weaknesses** (bullet list, ranked: CRITICAL > MAJOR > MINOR)
    5. **For each CRITICAL/MAJOR weakness**: A specific, actionable fix
    6. **Missing References** (if any)
    7. **Visual Review** (from the PDF):
       - Figure quality: readable? labels legible? colors distinguishable in grayscale?
       - Figure-caption alignment: does each caption match its figure?
       - Layout: orphaned headers, awkward page breaks, figures far from references?
       - Table formatting: aligned columns, consistent decimals, bold for best results?
       - Visual consistency: same color scheme across all figures?
    8. **Verdict**: Ready for submission? Yes / Almost / No

    Focus on: theoretical rigor, claims vs evidence alignment, writing clarity,
    self-containedness, notation consistency, AND visual presentation quality.
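Save the returned threadId for recovery bookkeeping only. With REVIEWER_BIAS_GUARD = true (default), Round 2 starts a fresh thread and does not reuse it.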
Skip if HUMAN_CHECKPOINT = false.
Present the review results and wait for user input:
📋 Round 1 review complete.
Score: X/10 — [verdict]
Key weaknesses (by severity):
1. [CRITICAL] ...
2. [MAJOR] ...
3. [MINOR] ...
Reply "go" to implement all fixes, give custom instructions, "skip 2" to skip specific fixes, or "stop" to end.
Parse the user response the same way as /auto-review-loop: approve / custom instructions / skip / stop.
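One way to read the four reply types is sketched below. This is an illustrative assumption, not the parser used by /auto-review-loop: the function name `parse_checkpoint_reply` and the action labels it returns are hypothetical, and anything that is not "go", "stop", or "skip N" is treated as custom instructions.

```python
import re


def parse_checkpoint_reply(reply):
    """Map a checkpoint reply to an (action, payload) pair.

    Actions (hypothetical labels): "approve", "stop", "skip", "custom".
    """
    text = reply.strip()
    lowered = text.lower()
    if lowered == "go":
        return ("approve", None)
    if lowered == "stop":
        return ("stop", None)
    # "skip 2" or "skip 2, 3": collect the weakness numbers to skip.
    match = re.fullmatch(r"skip((?:[\s,]+\d+)+)", lowered)
    if match:
        return ("skip", [int(n) for n in re.findall(r"\d+", match.group(1))])
    # Anything else is passed through as custom modification instructions.
    return ("custom", text)
```

Falling back to "custom" keeps free-form instructions intact instead of rejecting replies the parser does not recognize.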
Parse the review and implement fixes by severity.
Priority order: CRITICAL first, then MAJOR, then MINOR.
Common fix patterns:
| Issue | Fix Pattern |
|---|---|
| Assumption-model mismatch | Rewrite assumption to match the model, add formal proposition bridging the gap |
| Overclaims | Soften language: "validate" → "demonstrate practical relevance", "comparable" → "qualitatively competitive" |
| Missing metrics | Add quantitative table with honest parameter counts and caveats |
| Theorem not self-contained | Add "Interpretation" paragraph listing all dependencies |
| Notation confusion | Rename conflicting symbols globally, add Notation paragraph |
| Missing references | Add to references.bib, cite in appropriate locations |
| Theory-practice gap | Explicitly frame theory as idealized; add synthetic validation subsection |
| Proof gap (theory papers) | Run /proof-checker if PROOF_AUDIT.md doesn't exist yet; fix FATAL/CRITICAL issues |
| Writing clutter / passive voice | Apply sciwrite 5-pass audit: clutter extraction → active voice → sentence architecture → keyword consistency → numerical integrity. See paper-write Step 5 |
| Number mismatch (paper vs results) | Run /paper-claim-audit if PAPER_CLAIM_AUDIT.md doesn't exist; fix any number_mismatch or aggregation_mismatch claims |
| Keyword inconsistency | The "Banana Rule": if Methods says "obese group", Results must not say "heavier group". Extract key terms, verify consistency across all sections |
cd paper && latexmk -C && latexmk -pdf -interaction=nonstopmode -halt-on-error main.tex
cp main.pdf main_round1.pdf
Verify: 0 undefined references, 0 undefined citations.
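The verification can be automated by scanning the compile log. The sketch below assumes the standard LaTeX and natbib warning wording ("Reference `key' on page N undefined", "Citation `key' on page N undefined"); the function name `find_undefined` is hypothetical, and other bibliography packages may word their warnings differently. Long keys can also be split by log line wrapping, so treat a clean result as necessary rather than sufficient.

```python
import re

# Matches both "LaTeX Warning: Reference ..." and "Package natbib Warning: Citation ...".
UNDEFINED = re.compile(r"Warning: (Reference|Citation) `([^']+)' on page \d+ undefined")


def find_undefined(log_text):
    """Return the undefined reference and citation keys found in a LaTeX log."""
    found = {"Reference": set(), "Citation": set()}
    for kind, key in UNDEFINED.findall(log_text):
        found[kind].add(key)
    return found


sample_log = (
    "LaTeX Warning: Reference `fig:x' on page 3 undefined on input line 45.\n"
    "Package natbib Warning: Citation `smith2020' on page 1 undefined on input line 10.\n"
)
undefined = find_undefined(sample_log)
```

Both sets must be empty before the round is considered compiled.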
After every recompilation, rerun a theorem-statement consistency check so fix rounds cannot reintroduce appendix drift. Run this after Step 4 and again after Step 7 before the final format check.
Scope
- Determine main body vs appendix from the main.tex input order: files before \appendix are main body; files after \appendix are appendix.
Normalized comparison logic
- Ignore \label{...}, \ref{...}, \eqref{...}, \cite...{...}, and whitespace-only differences.
- Reduce \emph{}, \textbf{}, \textit{}, \mathrm{}, \mathbf{}, \mathcal{}, and \operatorname{} to their contents.
- Treat any remaining difference in wording (e.g., stationary vs terminal) as regression drift.
python3 - <<'PY'
import re

def normalize(s):
    # Strip LaTeX comments, but keep escaped percent signs (\%).
    s = re.sub(r'(?<!\\)%.*', '', s)
    # Drop labels and cross-reference / citation commands.
    s = re.sub(r'\\label\{[^}]*\}', '', s)
    s = re.sub(r'\\(?:ref|eqref|cref|Cref|cite[a-zA-Z]*)\{[^}]*\}', '', s)
    # Reduce formatting wrappers to their contents (innermost braces only).
    s = re.sub(r'\\(?:emph|textbf|textit|mathrm|mathbf|mathsf|mathcal|operatorname)\{([^{}]*)\}', r'\1', s)
    # Drop environment delimiters and collapse whitespace.
    s = re.sub(r'\\begin\{[^}]+\}|\\end\{[^}]+\}', '', s)
    s = re.sub(r'\s+', ' ', s)
    return s.strip().lower()

# Compare normalized theorem blocks from the current main-body files
# against their appendix restatements. Any mismatch blocks completion.
PY
Empirical motivation: in our April 2026 NeurIPS run, thm:dsm-oracle had a 3-case split (w=0/1/>1) in main but no case split in appendix; nu_T was named "stationary" in main and "terminal" in appendix. These drifted multiple times across fix rounds because no automated check caught regression.
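The comparison step itself might look like the sketch below. It repeats the normalization so the block runs on its own, and it assumes the caller has already paired each main-body statement with its appendix restatement; the `pairs` mapping, the `find_drift` name, and the example theorem text are hypothetical, since the skill does not specify how blocks are matched.

```python
import re


def normalize(s):
    s = re.sub(r'(?<!\\)%.*', '', s)                      # comments, keeping \%
    s = re.sub(r'\\label\{[^}]*\}', '', s)                # labels
    s = re.sub(r'\\(?:ref|eqref|cref|Cref|cite[a-zA-Z]*)\{[^}]*\}', '', s)
    s = re.sub(r'\\(?:emph|textbf|textit|mathrm|mathbf|mathsf|mathcal|operatorname)'
               r'\{([^{}]*)\}', r'\1', s)                 # formatting wrappers
    s = re.sub(r'\\begin\{[^}]+\}|\\end\{[^}]+\}', '', s)  # environment delimiters
    s = re.sub(r'\s+', ' ', s)
    return s.strip().lower()


def find_drift(pairs):
    """Return the names whose main-body and appendix statements differ.

    `pairs` maps a theorem name to (main_body_statement, appendix_restatement).
    """
    return sorted(
        name for name, (main, appendix) in pairs.items()
        if normalize(main) != normalize(appendix)
    )


pairs = {
    "thm:a": (
        r"\begin{theorem}\label{thm:a} Let $\nu_T$ be the \emph{stationary} law.\end{theorem}",
        r"\begin{theorem} Let $\nu_T$ be the terminal law. \end{theorem}",
    ),
    "lem:b": (
        r"\begin{lemma}\label{lem:b}  For all $x$, $f(x) \le 1$. \end{lemma}",
        r"\begin{lemma}\label{lem:b-app} For all $x$,   $f(x) \le 1$.\end{lemma}",
    ),
}
drifted = find_drift(pairs)  # only thm:a differs after normalization
```

A non-empty result should block completion until the two statements are reconciled.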
If REVIEWER_BIAS_GUARD = true (default), use a fresh mcp__codex__codex thread for Round 2. Do not reuse the Round 1 threadId for prompting. Save the returned threadId only for recovery bookkeeping.
mcp__codex__codex:
  model: gpt-5.4
  config: {"model_reasoning_effort": "xhigh"}
  prompt: |
    You are reviewing a [VENUE] paper. This is a fresh, zero-context review.
    Ignore any prior review rounds, prior fix lists, or executor explanations.
    Judge the paper only from the current LaTeX source and compiled PDF.

    ## Paper Files:
    - LaTeX source: [list all section .tex files]
    - Compiled PDF: paper/main.pdf
    - Figures: [list figure files]

    Read BOTH the LaTeX source (for content/logic) AND the compiled PDF (for visual presentation).

    ## Review Instructions
    Please act as a senior ML reviewer ([VENUE] level). Provide:
    1. **Overall Score** (1-10, where 6 = weak accept, 7 = accept)
    2. **Summary** (2-3 sentences)
    3. **Strengths** (bullet list, ranked)
    4. **Weaknesses** (bullet list, ranked: CRITICAL > MAJOR > MINOR)
    5. **For each CRITICAL/MAJOR weakness**: A specific, actionable fix
    6. **Missing References** (if any)
    7. **Visual Review** (from the PDF):
       - Figure quality: readable? labels legible? colors distinguishable in grayscale?
       - Figure-caption alignment: does each caption match its figure?
       - Layout: orphaned headers, awkward page breaks, figures far from references?
       - Table formatting: aligned columns, consistent decimals, bold for best results?
       - Visual consistency: same color scheme across all figures?
    8. **Verdict**: Ready for submission? Yes / Almost / No

    Focus on: theoretical rigor, claims vs evidence alignment, writing clarity,
    self-containedness, notation consistency, and visual presentation quality.
If REVIEWER_BIAS_GUARD = false (legacy debugging only), use mcp__codex__codex-reply with the saved threadId; this is not the recommended path.
Run this only if the paper is theory-heavy (≥5 \begin{theorem}|\begin{lemma}|\begin{proposition}|\begin{corollary} environments in the source) and only on the final scheduled round (current_round == MAX_ROUNDS).
This is a late-stage adversarial check. It must always use fresh mcp__codex__codex threads, never codex-reply, and it must not reuse any prior review context.
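The gating condition can be expressed as a single check. This is a sketch under assumptions: `should_run_kill_argument` and `min_envs` are hypothetical names, `source` is the concatenated LaTeX text, and only the four unstarred environments named above are counted.

```python
import re

THEOREM_ENV = re.compile(r"\\begin\{(?:theorem|lemma|proposition|corollary)\}")


def should_run_kill_argument(source, current_round, max_rounds, min_envs=5):
    """Run the adversarial check only on the final round of a theory-heavy paper."""
    is_final_round = current_round == max_rounds
    is_theory_heavy = len(THEOREM_ENV.findall(source)) >= min_envs
    return is_final_round and is_theory_heavy


source = r"\begin{theorem}x\end{theorem}" * 3 + r"\begin{lemma}y\end{lemma}" * 2
run_check = should_run_kill_argument(source, current_round=2, max_rounds=2)
```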
Thread 1: Attack
Thread 2: Defense
Merge rule
- Record the merged attack and defense findings in PAPER_IMPROVEMENT_LOG.md.
- If HUMAN_CHECKPOINT = true, include the merged findings in the checkpoint summary before asking the user to proceed.
This phase feeds directly into Step 6. The attack/defense findings must be merged before the final recompile.
Empirical motivation: in our April 2026 NeurIPS run, after 5 rounds of standard improvement (score 7-8/10), the kill-argument exercise surfaced framing weaknesses that no prior review caught (e.g., "width-w is mostly conditional", "CRF irrelevant to real D-LLMs"). Author rebuttal forced explicit scope qualifications in abstract and discussion.
Skip if HUMAN_CHECKPOINT = false. Same as Step 2b — present Round 2 review, wait for user input.
Same process as Step 3. Typical Round 2 fixes:
cd paper && latexmk -C && latexmk -pdf -interaction=nonstopmode -halt-on-error main.tex
cp main.pdf main_round2.pdf
After the final recompilation, run a location-aware format compliance check.
# If the log lacks file/line data, rerun the final compile once with -file-line-error.
# Run in a subshell so the paper/ paths below still resolve from the project root.
(cd paper && latexmk -pdf -file-line-error -interaction=nonstopmode -halt-on-error main.tex)
# 1. Page count vs venue limit
PAGES=$(pdfinfo paper/main.pdf | grep Pages | awk '{print $2}')
echo "Pages: $PAGES (limit: 9 main body for ICLR/NeurIPS)"
# 2. Duplicate labels: HARD BLOCK
DUP_LABELS=$(grep -Rho "\\\\label{[^}]*}" paper/main.tex paper/sections 2>/dev/null | sort | uniq -d || true)
if [ -n "$DUP_LABELS" ]; then
  echo "Duplicate labels found (BLOCKING):"
  echo "$DUP_LABELS"
fi
# 3. Overfull warnings with location classification
OVERFULLS=$(grep -n "Overfull \\\\hbox" paper/main.log 2>/dev/null || true)
# Main body = source files before \appendix in main.tex.
# Appendix = source files after \appendix, or files whose path contains "appendix".
# Bibliography = the generated .bbl file (main.bbl when compiling main.tex), references.bib, or bibliography-generated output.
MAIN_BODY_OVERFULL=$(echo "$OVERFULLS" | grep -v -E 'appendix|\.bbl|references\.bib' || true)
APPENDIX_OVERFULL=$(echo "$OVERFULLS" | grep -E 'appendix' || true)
BIB_OVERFULL=$(echo "$OVERFULLS" | grep -E '\.bbl|references\.bib' || true)
echo "Main-body overfulls (any size BLOCKS):"
echo "$MAIN_BODY_OVERFULL"
echo "Appendix overfulls (>10pt blocks):"
echo "$APPENDIX_OVERFULL"
echo "Bibliography overfulls (>20pt blocks):"
echo "$BIB_OVERFULL"
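Stop criteria:
- Any duplicate label: block.
- Any main-body overfull, regardless of size: block.
- Appendix overfull > 10pt: block.
- Bibliography overfull > 20pt: block.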
Auto-fix patterns (location-aware):
| Issue | Fix |
|---|---|
| Main-body overfull in equation | Split with aligned / split / multline, or shorten notation |
| Main-body overfull in table | Reduce font, resize table, or break table across rows |
| Main-body overfull in text | Rephrase; do not hide it with global \sloppy |
| Appendix overfull ≤ 10pt | Warn only unless visibly clipping |
| Appendix overfull > 10pt | Apply the same fix if the spill is visible |
| Bibliography overfull ≤ 20pt | Warn only unless caused by malformed entry or clipping |
| Bibliography overfull > 20pt | Fix malformed entry, URL, or DOI formatting |
| Over page limit | Move content to appendix, compress tables, reduce figure sizes |
Location-aware interpretation:
- Classify each overfull warning by source location using the -file-line-error log.
Empirical motivation: in our April 2026 NeurIPS run, 28+ overfull hbox warnings (largest 160pt in the appendix bridge proof) survived 5 improvement rounds because the previous blanket "overfull > 10pt blocks" rule was too lax and treated all locations equally.
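The location-aware thresholds can be sketched as a lookup plus a size parser. This assumes the usual log wording "Overfull \hbox (N pt too wide)" and that the location label ("main", "appendix", "bibliography") has been determined separately, since the warning line itself may not name the source file. The names `overfull_size`, `is_blocking`, and `THRESHOLDS_PT` are hypothetical.

```python
import re

OVERFULL = re.compile(r"Overfull \\hbox \(([\d.]+)pt too wide\)")

# Sizes strictly above the threshold block; main body blocks at any size.
THRESHOLDS_PT = {"main": 0.0, "appendix": 10.0, "bibliography": 20.0}


def overfull_size(log_line):
    """Return the overfull width in points, or None if the line is not an overfull hbox."""
    match = OVERFULL.search(log_line)
    return float(match.group(1)) if match else None


def is_blocking(location, size_pt):
    """Apply the per-location threshold to one overfull warning."""
    return size_pt > THRESHOLDS_PT[location]


size = overfull_size(r"Overfull \hbox (160.0pt too wide) in paragraph at lines 12--14")
blocks = is_blocking("appendix", size)
```

Under these thresholds the 160pt appendix spill from the motivating run would block, where a 10pt appendix spill would only warn.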
Create PAPER_IMPROVEMENT_LOG.md in the paper directory:
# Paper Improvement Log
## Score Progression
| Round | Score | Verdict | Key Changes |
|-------|-------|---------|-------------|
| Round 0 (original) | X/10 | No/Almost/Yes | Baseline |
| Round 1 | Y/10 | No/Almost/Yes | [summary of fixes] |
| Round 2 | Z/10 | No/Almost/Yes | [summary of fixes] |
## Round 1 Review & Fixes
<details>
<summary>GPT-5.4 xhigh Review (Round 1)</summary>
[Full raw review text, verbatim]
</details>
### Fixes Implemented
1. [Fix description]
2. [Fix description]
...
## Round 2 Review & Fixes
<details>
<summary>GPT-5.4 xhigh Review (Round 2)</summary>
[Full raw review text, verbatim]
</details>
### Fixes Implemented
1. [Fix description]
2. [Fix description]
...
## PDFs
- `main_round0_original.pdf` — Original generated paper
- `main_round1.pdf` — After Round 1 fixes
- `main_round2.pdf` — Final version after Round 2 fixes
Report to user:
After each round's review AND at final completion, check ~/.claude/feishu.json:
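- After each round's review: send review_scored: "Round N: X/10 — [key changes]"
- At final completion: send pipeline_done: score progression table + final page count
- If the config is absent or the mode is "off": skip entirely (no-op)
paper/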
├── main_round0_original.pdf # Original
├── main_round1.pdf # After Round 1
├── main_round2.pdf # After Round 2 (final)
├── main.pdf # = main_round2.pdf
└── PAPER_IMPROVEMENT_LOG.md # Full review log with scores
Large file handling: If the Write tool fails due to file size, immediately retry using Bash (cat << 'EOF' > file) to write in chunks. Do NOT ask the user for permission — just do it silently.
Preserve all PDF versions — user needs to compare progression
Save FULL raw review text — do not summarize or truncate GPT-5.4 responses
Reviewer independence (Round 2+): when REVIEWER_BIAS_GUARD = true (default), use a fresh mcp__codex__codex thread for every review round; never use mcp__codex__codex-reply and never include "since last round" / fix summaries in the prompt. See the Reviewer Independence Protocol section above.
Always recompile after fixes — verify 0 errors before proceeding
Do not fabricate experimental results — synthetic validation must describe methodology, not invent numbers
Respect the paper's claims — soften overclaims rather than adding unsupported new claims
Global consistency — when renaming notation or softening claims, check ALL files (abstract, intro, method, experiments, theory sections, conclusion, tables, figure captions)
Based on end-to-end testing on a 9-page ICLR 2026 theory paper:
| Round | Score | Key Improvements |
|---|---|---|
| Round 0 | 4/10 (content) | Baseline: assumption-model mismatch, overclaims, notation issues |
| Round 1 | 6/10 (content) | Fixed assumptions, softened claims, added interpretation, renamed notation |
| Round 2 | 7/10 (content) | Added synthetic validation, formal truncation proposition, stronger limitations |
| Round 3 | 5→8.5/10 (format) | Removed hero fig, appendix, compressed conclusion, fixed overfull hbox |
+4.5 points across 3 rounds (2 content + 1 format) is typical for a well-structured but rough first draft. Final: 8 pages main body, 0 overfull hbox, ICLR-compliant.
After each mcp__codex__codex or mcp__codex__codex-reply reviewer call, save the trace following shared-references/review-tracing.md. Use tools/save_trace.sh or write files directly to .aris/traces/<skill>/<date>_run<NN>/. Respect the --- trace: parameter (default: full).