Run the four paper-quality autoraters from PaperOrchestra (arXiv:2604.05018, App. F.3) — Citation F1 (P0/P1 partition + Precision/Recall/F1), Literature Review Quality (6-axis 0-100 with anti-inflation rules), SxS Overall Paper Quality (side-by-side), and SxS Literature Review Quality (side-by-side). TRIGGER when the user asks to "score this paper draft", "evaluate against the benchmark", "compare two papers", or "run the autoraters".
Faithful implementation of the four LLM-as-judge autoraters used in PaperOrchestra (Song et al., 2026, arXiv:2604.05018, §5 and App. F.3).
These are the metrics the paper uses to show that PaperOrchestra outperforms its single-agent and AI-Scientist-v2 baselines:
| Autorater | What it does | Inputs | Output |
|---|---|---|---|
| Citation F1 — P0/P1 partition | Partitions the reference list into P0 (must-cite) and P1 (good-to-cite) given the paper text | one paper text + its references list | JSON {ref_num: "P0" \| "P1"} |
| Literature Review Quality | 6-axis 0-100 score for Intro+Related Work, with anti-inflation hard caps | one paper PDF/text + reference avg citation count | JSON with axis_scores, penalties, summary, overall_score |
| SxS Overall Paper Quality | Holistic side-by-side preference judgment | two papers (PDF or text) | JSON with winner ∈ {paper_1, paper_2, tie} |
| SxS Literature Review Quality | Side-by-side preference, Intro+Related Work only | two papers | JSON with winner ∈ {paper_1, paper_2, tie} |
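As a concrete sketch, the Citation F1 step can be scored once the judge has returned its P0/P1 partition. The exact matching rule lives in the prompt file; the formulation below is an assumption for illustration, not the paper's code: recall is coverage of the must-cite (P0) references, and precision is the fraction of the generated paper's citations that land anywhere in the partition (P0 or P1).

```python
def citation_f1(partition: dict[str, str], cited: set[str]) -> dict[str, float]:
    """Score a generated paper's citations against a judge's partition.

    partition maps reference id -> "P0" (must-cite) or "P1" (good-to-cite);
    cited is the set of reference ids the generated paper actually cites.
    """
    p0 = {ref for ref, tag in partition.items() if tag == "P0"}
    p0_or_p1 = set(partition)
    # Precision: cited references that the judge considers at least good-to-cite.
    precision = len(cited & p0_or_p1) / len(cited) if cited else 0.0
    # Recall: how many must-cite references were actually cited.
    recall = len(cited & p0) / len(p0) if p0 else 0.0
    denom = precision + recall
    f1 = (2 * precision * recall / denom) if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

With this reading, a paper that cites every P0 reference and nothing outside P0 ∪ P1 scores a perfect 1.0 on both axes.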
The paper uses Gemini-3.1-Pro and GPT-5 as judges, at temperature 0.0 for Gemini and the default 1.0 for GPT-5 (which doesn't allow temperature adjustment). Use whatever your host LLM is.
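A minimal sketch of carrying those sampling settings into a judge call. The `judge_request` helper and the request shape are assumptions for illustration; only the temperature values come from the paper.

```python
# Per-judge sampling settings from the paper; GPT-5 only runs at its default.
JUDGE_TEMPERATURE = {
    "gemini-3.1-pro": 0.0,
    "gpt-5": 1.0,  # fixed default; the model doesn't allow adjustment
}

def judge_request(model: str, prompt: str) -> dict:
    """Build a provider-agnostic request dict for one autorater call."""
    req = {"model": model, "prompt": prompt}
    temp = JUDGE_TEMPERATURE.get(model, 1.0)
    # Only set temperature when the judge accepts a non-default value.
    if temp != 1.0:
        req["temperature"] = temp
    return req
```

Swap in your host LLM's actual client call where the request dict is consumed.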
This is a two-step procedure:

1. For both the ground-truth paper AND the generated paper, run the LLM with `references/citation-f1-prompt.md`: