Academic research lifecycle for AI/ML papers — discover, deep-read, discuss/brainstorm, cite (BibTeX via DBLP→CrossRef→S2), write paper sections. Integrates Semantic Scholar / arXiv / AlphaXiv / OpenReview / DBLP, runs venue-quality gates (CCF, impact factor), and auto-injects expert knowledge from 85 bundled AI/ML domain skills across 21 categories (architectures, fine-tuning, RAG, inference serving, alignment, interpretability, evaluation, MoE, long-context, multimodal, etc.) — users do NOT invoke those directly; this skill routes to them. Trigger on "find papers on…", "what's new in…", "let's brainstorm research…", "is this paper any good?", "give me the bibtex", "help write related work", "reviews for <paper>", "citation graph of…". Not for product/framework how-tos, rewriting a doc into paper format, cleaning messy BibTeX, or vendor blogs / whitepapers / tutorials.
Full academic research lifecycle: discover, discuss, read, cite, write.
On every invocation, before doing anything else:
1. Read phases/skill-router.md for domain skill mapping.
2. Parse any --domain or --domain-only flags from user input.
3. Parse user intent from /research <args> and route to the appropriate phase module.
| Input pattern | Phase | Module |
|---|---|---|
| /research discover "topic" | discover (consolidated) | phases/discover.md |
| /research discuss | discuss (current session) | phases/discuss.md |
| /research discuss <paper> | discuss (from specific paper) | phases/discuss.md |
| /research read <paper> | read (standalone) | phases/read.md |
| /research cite <paper> | cite | phases/cite.md |
| /research write <section> | write | phases/write.md |
| Ambiguous input | Ask user to clarify | — |
<paper> accepts: arXiv ID, DOI, or paper title (with clarify flow if ambiguous). See Unified Input Parsing section below.
All commands support optional --domain <categories> or --domain-only <categories> flags. See Unified Input Parsing section for details.
On first invocation, create .research-workspace/ in the current working directory if it doesn't exist:
mkdir -p .research-workspace/sessions
echo '{"sessions": [], "current_session": null}' > .research-workspace/state.json
Each discover invocation creates a session: .research-workspace/sessions/{topic-slug}-{date}/
Contents:
- discover.json — search results with verdicts + landscape summary
- discuss/brief.json — research brief from discuss phase
- read/{paper_id}.json — structured paper analyses
- cite/{paper_id}.bib — verified BibTeX entries
- cite/cite-log.json — citation metadata and sources
- write/{section}.md — generated section text with metadata (output format, citations used, review gates applied)
- cache/{paper_id}/ — raw paper content cache (see Paper Cache)
- checkpoints/ — phase completion checkpoints (see State Persistence)

Fetched paper content is expensive (network latency, rate limits, AlphaXiv/arXiv availability). After context compaction, all prior reads are lost. The cache stores raw content locally so it never needs to be re-fetched.
.research-workspace/sessions/{slug}/cache/{paper_id}/
├── overview.md # AlphaXiv structured overview (if available)
├── fulltext.md # AlphaXiv full text (if available)
├── paper.pdf # arXiv or publisher PDF (if downloadable)
├── supplementary.pdf # Supplementary/appendix PDF (if found)
├── openreview/
│ ├── reviews.json # OpenReview official reviews (if venue uses OpenReview)
│ ├── rebuttal.json # Author rebuttals (if available)
│ ├── meta_review.json # AC/SAC meta-review (if available)
│ ├── decision.json # Accept/reject decision (if available)
│ └── discussion.json # Threaded replies: author comments, reviewer follow-ups, ethics reviews
└── cache_meta.json # What was cached, when, from where
{
"paper_id": "s2_id or arxiv_id",
"arxiv_id": "2401.12345",
"doi": "10.xxxx/...",
"cached_at": "ISO 8601",
"contents": {
"overview": { "source": "alphaxiv", "status": "cached|404|not_attempted" },
"fulltext": { "source": "alphaxiv", "status": "cached|404|not_attempted" },
"pdf": { "source": "arxiv|publisher|s2_open_access", "status": "cached|404|not_attempted" },
"supplementary": { "source": "publisher|arxiv", "status": "cached|404|not_attempted" },
"openreview": { "source": "openreview_api", "status": "cached|not_found|private|auth_failed|no_credentials|error|not_attempted", "venue": "ICLR 2024" }
}
}
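The cache-consultation rules in the next section can be sketched as a small helper. This is illustrative only (the function name and signature are assumptions, not part of the skill's scripts); it follows the cache_meta.json schema above, including the 7-day retry window for 404s:

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def needs_fetch(cache_dir, item):
    """Decide whether a content item (e.g. "overview", "pdf") must be
    fetched from the network, per the cache_meta.json status field."""
    meta_path = Path(cache_dir) / "cache_meta.json"
    if not meta_path.exists():
        return True                      # no cache yet -> fetch
    meta = json.loads(meta_path.read_text())
    status = meta.get("contents", {}).get(item, {}).get("status", "not_attempted")
    if status == "cached":
        return False                     # read the local file, never re-fetch
    if status == "404":
        # content genuinely missing; retry only after 7 days
        cached_at = datetime.fromisoformat(meta["cached_at"])
        age = datetime.now(timezone.utc) - cached_at
        return age.days > 7              # AlphaXiv may add new papers later
    return True                          # not_attempted / unknown -> fetch
```

The 404 branch assumes cached_at is stored with an explicit UTC offset so the age arithmetic stays timezone-safe.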
On any paper content fetch (read, discover quick-read, discuss knowledge gap):
1. Check .research-workspace/sessions/{slug}/cache/{paper_id}/. If cache_meta.json exists, read it to determine what's available.
2. If status: "cached", read from the local file. Do not re-fetch.
3. If status is "not_attempted" or the cache directory doesn't exist, fetch from the network and save to cache.
4. If a prior fetch returned a permanent miss (status: "404"), do not retry — the content genuinely doesn't exist. Exception: retry after 7 days (AlphaXiv may add new papers).

For papers published at venues that use OpenReview, attempt to fetch review records. This is a best-effort enhancement — all other features work without it.
Authentication required: The OpenReview API v2 (https://api2.openreview.net) requires a Bearer token. Obtain via POST /login with username/password. If no credentials are configured, skip entirely (log warning, set status: "no_credentials"). If /login returns mfaPending, log warning and skip (MFA accounts are not supported — they need an interactive challenge).
Credentials config: Set OPENREVIEW_USER and OPENREVIEW_PASS in .claude/settings.json under "env".
Fetch flow (when credentials are available — run every step with a short Python one-liner via python3 -c "...", using urllib.request and json.loads; no external deps):
1. POST https://api2.openreview.net/login with Content-Type: application/json and body {"id": "<OPENREVIEW_USER>", "password": "<OPENREVIEW_PASS>"}. Parse response JSON — if mfaPending is truthy, set status: "auth_failed" and stop (MFA not supported). Otherwise extract token (valid 24h). Keep the token in memory for subsequent requests; if it must be persisted, use a tempfile.NamedTemporaryFile(mode='w', delete=False) with os.chmod(path, 0o600) and clean up on exit.
2. GET https://api2.openreview.net/notes/search?term=<url_encoded_title>&source=forum&limit=3 with Authorization: Bearer <token>. Match by normalized title equality (lowercase, strip whitespace). If the venue ID is known, add &content.venueid=<venue_id> to narrow results.
3. GET https://api2.openreview.net/notes?forum=<forum_id>&limit=1000 with the same Bearer header. Classify notes using a dual-signal approach — check invitations[] first (strongest signal when present), then fall back to signatures + content keys (needed because invitation is often null for published papers):
- Review: invitation ending in Official_Review OR (signatures contains Reviewer_* AND content has review-like keys: summary, strengths, weaknesses, rating, confidence, soundness — field names vary by venue, e.g., review, main_review, recommendation)
- Meta-review: invitation ending in Meta_Review OR (signatures contains Area_Chair_*/Senior_Area_Chair_* AND content has metareview key)
- Decision: invitation ending in Decision OR (signatures contains Program_Chairs AND content has decision key)
- Rebuttal: signatures contains Authors AND replyto points to a note already classified as a review
- Threading: notes with replyto equal to the forum ID are direct replies. Notes with replyto pointing to another note are threaded discussion.

Content values live at note.content.{field}.value (not note.content.{field} directly). For example, rating at note.content.rating.value (e.g., "8: accept, good paper"), decision at note.content.decision.value (e.g., "Accept (poster)"). Review text may span multiple fields depending on venue — common keys include summary, strengths, weaknesses, review, main_review, comment, recommendation. Extract all content keys, don't hardcode a single field. Some fields may also have note.content.{field}.readers controlling visibility. Caution: API responses may contain raw control characters (tabs, newlines in review text). Parse with json.loads(text, strict=False); never strip control chars from stored artifacts (only do so for disposable human-readable inspection).

Save reviews.json, rebuttal.json, meta_review.json, decision.json, discussion.json to cache/{paper_id}/openreview/.

Status handling:
- Login failure: status: "auth_failed", log warning
- Reviews exist but are not readable: status: "private". Do NOT freeze as permanent — reviews may become public after camera-ready
- No matching forum: status: "not_found"
- Any other failure: status: "error"

| Phase | Cache interaction |
|---|---|
| read Step 2 | Check cache before AlphaXiv/arXiv/publisher fetch. Save all fetched content to cache. Also attempt OpenReview fetch for the paper. |
| discover Step 6 (quick-read) | Check cache for overview.md. Save if fetched. Skip full OpenReview fetch (too heavy for batch quick-read). |
| discuss Phase 3 (knowledge gap) | Check cache before quick-read fetch. Save if fetched. |
| discuss Phase 5 (reviewer simulation) | If openreview/reviews.json exists in cache for any analyzed paper, incorporate real reviewer concerns into the simulation — real objections take priority over simulated ones. |
| write (related-work, intro) | Read cached content for positioning accuracy instead of relying on summaries alone. |
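The login step and the dual-signal note classification above can be sketched in Python with only the standard library, matching the skill's "urllib.request and json.loads, no external deps" rule. The endpoints come from the fetch flow above; the helper names and the exact key set are illustrative, and rebuttal/threading classification (which needs replyto resolution across notes) is deliberately left out:

```python
import json
import urllib.request
from typing import Optional

API = "https://api2.openreview.net"

def openreview_login(user: str, password: str) -> Optional[str]:
    """POST /login; return the Bearer token, or None if mfaPending
    (MFA accounts are unsupported -> status: "auth_failed")."""
    req = urllib.request.Request(
        f"{API}/login",
        data=json.dumps({"id": user, "password": password}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        body = json.loads(resp.read().decode(), strict=False)
    if body.get("mfaPending"):
        return None
    return body["token"]  # valid 24h; keep in memory only

def classify_note(note: dict) -> str:
    """Dual-signal classification: invitations[] first, then fall back
    to signatures + content keys (invitation is often null)."""
    invs = note.get("invitations") or []
    sigs = " ".join(note.get("signatures") or [])
    keys = set(note.get("content") or {})
    review_keys = {"summary", "strengths", "weaknesses", "rating",
                   "confidence", "soundness", "review", "main_review"}
    if any(i.endswith("Official_Review") for i in invs) or (
            "Reviewer_" in sigs and keys & review_keys):
        return "review"
    if any(i.endswith("Meta_Review") for i in invs) or (
            "Area_Chair_" in sigs and "metareview" in keys):
        return "meta_review"
    if any(i.endswith("Decision") for i in invs) or (
            "Program_Chairs" in sigs and "decision" in keys):
        return "decision"
    return "other"  # rebuttals/threads need replyto resolution
```

Remember to read field values via note["content"][field]["value"] when extracting text, per the content-value rule above.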
Long research sessions (especially discover → discuss → write chains) risk losing progress to context compaction. Each phase writes a checkpoint on completion so work can be resumed.
Save to .research-workspace/sessions/{slug}/checkpoints/{phase}_{timestamp}.json:
{
"phase": "discover|discuss|read|cite|write",
"status": "completed|in_progress|failed",
"timestamp": "ISO 8601",
"completed_steps": ["step1", "step2"],
"pending_steps": ["step3"],
"key_artifacts": {
"discover_json": "relative path if exists",
"brief_json": "relative path if exists",
"read_analyses": ["paper_id1", "paper_id2"],
"cite_log": "relative path if exists"
},
"context_summary": "1-2 sentence summary of what was accomplished and what remains",
"skills_loaded": ["clip"],
"user_decisions": ["chose direction A over B", "skipped experiment design"]
}
Each phase writes its checkpoint after saving its primary artifact (discover.json, brief.json, etc.). The checkpoint is a lightweight pointer — the real data lives in the phase artifacts.
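A checkpoint writer following the schema above might look like this (the function name and signature are assumptions for illustration; the real skill writes these inline after saving each phase artifact):

```python
import json
import time
from pathlib import Path

def write_checkpoint(session_dir, phase, completed_steps, pending_steps,
                     key_artifacts=None, context_summary=""):
    """Write a lightweight checkpoint pointer after the phase's primary
    artifact is saved; the real data lives in the phase artifacts."""
    ckpt_dir = Path(session_dir) / "checkpoints"
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%dT%H%M%SZ", time.gmtime())
    checkpoint = {
        "phase": phase,
        "status": "completed" if not pending_steps else "in_progress",
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "completed_steps": list(completed_steps),
        "pending_steps": list(pending_steps),
        "key_artifacts": key_artifacts or {},
        "context_summary": context_summary,
    }
    path = ckpt_dir / f"{phase}_{stamp}.json"
    path.write_text(json.dumps(checkpoint, indent=2))
    return path
```

Deriving status from whether pending_steps is empty keeps the checkpoint self-describing for the resume flow below.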
If a session resumes after context compaction (detected by: user continues a /research command but prior conversation context is unavailable):
1. Read .research-workspace/state.json → identify current session
2. Read checkpoints/ → reconstruct phase state
3. Re-load skills listed in skills_loaded
4. Resume from pending_steps — do not re-run completed steps

The discuss phase is uniquely long-running (multi-turn). It writes incremental checkpoints at two points:
These capture the evolving research brief so that even if compaction occurs mid-discussion, the accumulated findings, open problems, and proposed directions are preserved.
Phases that accept a paper identifier (discuss, read, cite) share this logic. Discover takes a topic description, not a paper identifier.
- arXiv ID (e.g., 2401.12345): Direct lookup via s2_match.py or S2 API
- DOI (e.g., 10.1109/...): Direct CrossRef/S2 lookup
- Title: s2_match.py "<text>" for exact title match
- Free text: s2_search.py "<text>" 5 + dblp_search.py "<text>" 5

When free-text search returns multiple candidates, present each with:

- Venue quality signals (venue_info.py, citation count from S2 metadata)

User selects one → proceed to the requested phase.
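The identifier routing described above can be sketched with two regexes. This is a hedged illustration — the function name is hypothetical, and the patterns cover common arXiv/DOI shapes, not every historical variant:

```python
import re

def classify_paper_input(text: str) -> str:
    """Route a <paper> argument to the right lookup path:
    arXiv ID -> direct S2 lookup, DOI -> CrossRef/S2,
    anything else -> title match then free-text search."""
    text = text.strip()
    if re.fullmatch(r"\d{4}\.\d{4,5}(v\d+)?", text):
        return "arxiv_id"   # s2_match.py or S2 API direct lookup
    if re.match(r"10\.\d{4,9}/\S+", text):
        return "doi"        # CrossRef/S2 direct lookup
    return "title"          # s2_match.py, then s2_search.py + dblp_search.py
```

Pre-2007 arXiv IDs (e.g. cs/0301012) would need an extra pattern if they ever appear as input.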
All commands support:
- --domain <cat1,cat2>: Additive — merge with auto-detected categories
- --domain-only <cat1,cat2>: Exclusive — use only these categories

Category names match the skill-router mapping table (semantic match OK).
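The flag semantics above can be sketched as follows (a minimal sketch, assuming the arguments arrive as a single string; the helper name and return shape are illustrative, not part of the skill's scripts):

```python
import re

def parse_domain_flags(args: str):
    """Return (categories, exclusive). --domain-only wins and is
    exclusive; --domain is additive (merged with auto-detection)."""
    m = re.search(r"--domain-only\s+(\S+)", args)
    if m:
        return m.group(1).split(","), True   # exclusive: use only these
    m = re.search(r"--domain\s+(\S+)", args)
    if m:
        return m.group(1).split(","), False  # additive: merge later
    return [], False
```

Checking --domain-only before --domain matters, since a naive --domain pattern could otherwise shadow the longer flag.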
Each parallel search agent has a 60-second timeout. If an agent times out or errors, proceed with results from the remaining agents. Log the failure but do not block.
- superpowers:verification-before-completion before presenting final output in cite, write, and discover phases (see each phase for details)
- superpowers:systematic-debugging when repeated operational failures need diagnosis

When a retrieval task has no directly relevant results after the applicable primary searches, classify the failure before escalating:
| Failure type | Meaning | Action |
|---|---|---|
| Zero results | Query returned nothing from primary sources | Escalate through strategy ladder |
| Exact-match miss | Title/DOI/arXiv ID lookup failed | Check alternate titles (preprint vs. published), arXiv ID variants, DOI redirects |
| Metadata mismatch | Found paper but required fields missing | Try alternate source in priority order: DBLP → CrossRef → S2 (per Iron Rule #2) |
| Indexing lag | Paper too new for database | Check arXiv directly, AlphaXiv MCP retrieval, or accept "not yet indexed" |
| Query drift | Results returned but none directly relevant | Tighten query specificity, add field-specific keywords, filter by year |
| Version drift | Preprint title/content differs from published version | Search both titles, check DOI and arXiv ID separately |
| Timeout or rate limit | Operational failure | Retry once after delay, then proceed with remaining sources |
| Source outage | API down | Skip source, log warning, proceed with remaining sources |
Work through these strategies in order. Stop when directly relevant results are found and verified enough for the current phase. Earlier strategies are cheaper; later strategies cast a wider net.
1. Exact match: s2_match.py for precise titles; DOI/arXiv ID direct lookup
2. Switch mode: s2_search.py → s2_bulk_search.py → DBLP → CrossRef title search
3. Graph search: s2_citations.py, s2_references.py, s2_recommend.py
4. Body search: s2_snippet.py for method names or claims not in titles/abstracts

| Trigger condition | Typical phase |
|---|---|
| All applicable primary searches return 0 results for a non-trivial query (S2 + AlphaXiv in discover; DBLP + CrossRef + S2 in cite) | discover Step 3-4, cite Step 2 |
| 2+ consecutive API timeouts or HTTP errors across any scripts | any phase |
| Knowledge-gap search in discuss cannot find the referenced method/baseline | discuss Phase 3 |
| >30% of \cite{} references fail verification during write | write Step 4 |
After exhausting the ladder, or when a strategy succeeds, produce an attempt ledger:
{
"query_original": "...",
"failure_type": "zero_results|exact_match_miss|metadata_mismatch|indexing_lag|query_drift|version_drift|timeout_or_rate_limit|source_outage",
"attempts": [
{
"strategy": "normalize|exact_match|adjust_year|decompose|switch_mode|graph_search|body_search|adjacent_fields",
"query": "expanded query text",
"source": "s2_search|s2_bulk|s2_match|s2_citations|s2_references|s2_recommend|s2_snippet|dblp_search|dblp_bibtex|arxiv_bibtex|crossref_search|doi2bibtex|alphaxiv_agentic|alphaxiv_full_text|alphaxiv_embedding",
"result_count": 0,
"status": "matched|zero_results|irrelevant|timeout|rate_limited|source_down|error",
"notes": "optional: error details, HTTP status, why results were irrelevant"
}
],
"final_outcome": "found|not_found_after_exhaustive_search|genuine_null|blocked_by_operational_failure",
"resolved_by": "normalize|exact_match|adjust_year|decompose|switch_mode|graph_search|body_search|adjacent_fields|null"
}
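A helper for building the ledger above might look like this (a sketch under the assumption that the ledger is a plain dict matching the schema; the function name is illustrative):

```python
def record_attempt(ledger, strategy, query, source, result_count, status, notes=""):
    """Append one strategy attempt to the ledger and mark resolution
    when a strategy succeeds ("matched")."""
    ledger.setdefault("attempts", []).append({
        "strategy": strategy,
        "query": query,
        "source": source,
        "result_count": result_count,
        "status": status,
        "notes": notes,
    })
    if status == "matched":
        ledger["final_outcome"] = "found"
        ledger["resolved_by"] = strategy
    return ledger
```

If the ladder is exhausted without a match, the caller sets final_outcome to one of the null-result values by hand, which keeps the honest "not found" path explicit.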
"Not found after exhaustive search" is a valid, honest outcome. Some papers are not indexed, some methods have no prior work, and some queries target genuinely unexplored territory. Never fabricate papers or cite from model memory to avoid a null result.
Use a second LLM (via Codex MCP: mcp__codex__codex) throughout the lifecycle. The purpose is to surface blind spots that a single model misses and to create a three-way discussion (user + Claude + Codex) for richer exploration.
Codex participates in three modes:
| # | Mode | Phase | What to send | What to ask |
|---|---|---|---|---|
| 1 | Co-thinker | discuss Phase 2 (Assumption Surfacing) | Papers analyzed + field context from discover | "What assumptions does this field take for granted? What would break if each assumption were violated?" |
| 2 | Co-thinker | discuss Phase 3 (Discussion Loop) | Current discussion state (latest findings, open questions, proposed angles) | Phase 3-specific: varies per turn. See discuss.md for integration details. |
| 3 | Adversarial | discuss Phase 4 (Adversarial Novelty Check) | Proposed direction + closest existing work | "As a skeptical reviewer at {target venue}: (1) Is the claimed novelty real or superficial? (2) What existing work was missed? (3) What's the strongest argument against this direction?" |
| 4 | Adversarial | discuss Phase 5 (Reviewer Simulation) | Proposed direction + research brief so far | "Generate 3-4 specific reviewer objections. For each: the weakest claim, the missing baseline, the essential ablation, and severity (High/Medium/Low)." |
| 5 | Adversarial | discuss Phase 6 (Significance Test) | Proposed direction + significance analysis from Claude | "Evaluate this direction on three tiers: (1) real-world impact with concrete failure modes, (2) would the community think differently if this succeeds, (3) expected improvement magnitude vs. SOTA. Flag any tier that is weak." |
| 6 | Cold reader | discuss Phase 7 (Simplicity Test) | User's 2-sentence explanation ONLY — no research brief, no context | "Based only on these 2 sentences, explain back what the research idea is and what makes it novel. What is unclear or ambiguous?" |
| 7 | Adversarial | discuss Phase 8 (Experiment Design) | Completed experiment plan draft + research brief | "What baselines are missing? What essential ablation is not listed? Are there better-suited datasets? Is the expected results table realistic?" |
| 8 | Adversarial | discuss Phase 9 (Convergence Decision) | Complete research brief | "Given everything in this brief, would you recommend this direction for {target venue}? What is the single biggest risk? What would make you abandon this direction?" |
| 9 | Adversarial | write Step 5.5 (abstract + intro only) | Draft text + research brief | "As an AC at {target venue}: (1) Does the motivation hold up? (2) Is the contribution clearly distinguished from prior work? (3) What would make you desk-reject this?" |
| 10 | Adversarial | write related-work | Draft related-work section + discover results + read analyses | "As a reviewer: (1) What important related work is missing? (2) Is any prior work mischaracterized or unfairly compared? (3) Is the positioning of our contribution honest and precise?" |
- Invoke mcp__codex__codex with the prompt.
- Present Codex's response labeled [Codex]. The user synthesizes both.
- In write phases, label cross-model feedback [Cross-model review].

In the Discussion Loop, Codex participates as an ongoing third voice. The interaction model:
Not every turn requires Codex. Use judgment.
Every delegated task — spawned agent, Codex review call, or subsearch dispatch — must follow a 6-element brief. This prevents vague tasking, bloated context, inconsistent outputs, and scope drift.
GOAL — What decision this task informs; why it matters to the current phase
DELIVERABLE — Exact artifact to produce (format, fields, structure)
EVIDENCE — Allowed evidence sources (which papers, which APIs, what context to send)
CONSTRAINTS — Result caps, token budget, timeout
DONE WHEN — Acceptance criteria (how to verify the output is correct and complete)
EXCLUSIONS — Forbidden behavior (what to exclude, what NOT to do)
Spawned search subagents (discover Step 3, dispatched via superpowers:dispatching-parallel-agents):
GOAL: Find papers relevant to "{topic}" for landscape analysis
DELIVERABLE: JSON objects, each with: paper_id, title, year, venue, citations, doi, arxiv_id, authors, source
EVIDENCE: Agent 1 (S2 subagent) — S2 API only (s2_search.py, s2_bulk_search.py)
Agent 2 (AlphaXiv subagent) — three claude.ai-bound AlphaXiv MCP tools (agentic_paper_retrieval, full_text_papers_search, embedding_similarity_search), issued as three parallel tool calls in a single subagent message. Subagents inherit the top-level session's MCP bindings (empirically verified).
CONSTRAINTS: Top 20 results for S2; ~10 per AlphaXiv tool; 60-second timeout per subagent
DONE WHEN: ≥1 result returned with all required fields populated; exit 0
EXCLUSIONS: Don't analyze, rank, or summarize papers — just retrieve and return raw results. Don't cross sources.
Codex review calls (all 10 integration points):
GOAL: {varies — e.g., "Surface assumptions this field takes for granted" or "Find weaknesses in this research direction for {venue}"}
DELIVERABLE: Structured response: numbered findings, each with a concrete claim and evidence
EVIDENCE: Only the artifacts listed in the "What to send" column of the invocation table — never the full conversation history
CONSTRAINTS: 3-5 actionable findings; no filler
DONE WHEN: Each finding is specific enough to act on (names a paper, identifies a gap, flags a weakness)
EXCLUSIONS: Don't restate what was sent; don't make stylistic suggestions; don't hedge with "this could be strengthened" — say what's wrong and why
Knowledge-gap subsearches (discuss Phase 3):
GOAL: Fill knowledge gap identified during discussion: "{specific method/baseline/claim}"
DELIVERABLE: 1-3 relevant papers with title, year, venue, and one-sentence summary of relevance
EVIDENCE: S2 search + DBLP search; use the exact method/baseline name as query
CONSTRAINTS: Top 3 results; 30-second timeout per source
DONE WHEN: At least one paper found that addresses the gap, or explicit "not found after exhaustive search"
EXCLUSIONS: Don't fabricate papers; don't use model memory; don't return tangentially related work
Write-phase review gates (Triple Review Gate + Codex):
GOAL: {varies — e.g., "As an AC at {venue}, evaluate whether this abstract/intro would survive desk review"}
DELIVERABLE: 2-3 specific revision suggestions, each with: the problematic text, what's wrong, and a concrete fix direction
EVIDENCE: Only the draft section text + research brief — not the full paper or conversation
CONSTRAINTS: Focus on substance (motivation, contribution clarity, positioning) not prose style
DONE WHEN: Each suggestion identifies a specific passage and explains why it's a problem
EXCLUSIONS: Don't praise what works; don't suggest word-level edits; don't repeat the Iron Rules back
Not every delegation needs full formality. Skip the brief for:
- Single-script metadata lookups (venue_info.py, ccf_lookup.py) — these have fixed I/O

All output in English. For uncommon vocabulary (GRE-level), add Chinese translation in parentheses.
All scripts are in skills/research/scripts/. Key scripts:
| Script | Purpose |
|---|---|
| s2_search.py | S2 relevance-ranked semantic search |
| s2_bulk_search.py | S2 boolean bulk search with year filtering |
| s2_batch.py | S2 batch metadata by paper IDs (NOT a search) |
| s2_citations.py | Papers that cited a given paper |
| s2_references.py | Papers cited by a given paper |
| s2_recommend.py | Paper recommendations from positive/negative examples |
| s2_snippet.py | Search within paper bodies for specific passages |
| s2_match.py | Exact title match (single result) |
| dblp_search.py | DBLP publication search |
| dblp_bibtex.py | Fetch condensed BibTeX via DBLP search API (title + author + year) |
| arxiv_bibtex.py | Fetch @misc BibTeX from arxiv.org (arXiv ID) |
| crossref_search.py | CrossRef search (fallback) |
| doi2bibtex.py | DOI → BibTeX via content negotiation |
| Script | Purpose |
|---|---|
| venue_info.py | Venue quality summary (CCF + IF + quartile) |
| ccf_lookup.py | CCF ranking lookup |
| if_lookup.py | Impact factor lookup |
| author_info.py | Author h-index and stats |
| Script | Purpose |
|---|---|
| init.py | Rate limit helpers, DBLP host fallback |
- Orchestra-Research/AI-Research-SKILLs — bundled in vendor/ai-research-skills/ and not registered as standalone skills (only /research is exposed). The skill router loads them via Read on demand. Includes ml-paper-writing, brainstorming-research-ideas, creative-thinking-for-research, and all 21 domain categories. See phases/skill-router.md § "Loading Vendor Skills" for the load mechanism.
- humanizer skill — style review for write phase
- superpowers:dispatching-parallel-agents — parallel search in discover phase
- superpowers:verification-before-completion — output verification in cite/write/discover phases
- Codex MCP (mcp__codex__codex) — cross-model collaboration throughout the lifecycle (discuss Phases 2-9, write Step 5.5 + 5.6). See Cross-Model Collaboration section for all 10 invocation points. If unavailable, all phases proceed with Claude-only analysis (log warning). Recommended but not blocking.
- AlphaXiv MCP (mcp__claude_ai_alphaXiv__*, via the claude.ai connection) — powers the second parallel search source in discover Step 3. The three tools (agentic_paper_retrieval, full_text_papers_search, embedding_similarity_search) are issued as three parallel tool calls inside the AlphaXiv subagent (dispatched alongside the S2 subagent via superpowers:dispatching-parallel-agents). If the MCP server is unavailable, discover falls back to S2 results only (log warning). Recommended for broader coverage but not blocking.
- S2_API_KEY in .claude/settings.json under "env". Claude Code automatically exports it as $S2_API_KEY. Get from: https://www.semanticscholar.org/product/api/api-key
- OPENREVIEW_USER and OPENREVIEW_PASS in .claude/settings.json under "env". Register at: https://openreview.net/profile. Without these, OpenReview review/rebuttal data will not be fetched (all other features work normally).

If dependencies are missing on first use:
Before using /research, please ensure:
1. Install required skills/plugins:
- Orchestra-Research AI-Research-SKILLs (provides ml-paper-writing, brainstorming-research-ideas, creative-thinking-for-research, and 21 domain skill categories)
- humanizer skill
2. Set up API keys in your .claude/settings.json:
{ "env": { "S2_API_KEY": "your-key-here", "OPENREVIEW_USER": "optional", "OPENREVIEW_PASS": "optional" } }
Claude Code automatically exports env entries as environment variables.
- Semantic Scholar: https://www.semanticscholar.org/product/api/api-key
- OpenReview (optional): https://openreview.net/profile
| Service | Limit | Strategy |
|---|---|---|
| S2 | 1 req/sec (with key) | Sequential within agent, use batch/bulk |
| DBLP | ~1 req/sec | Sequential, 1s delay |
| CrossRef | No strict limit | Polite usage |
| AlphaXiv MCP | No strict limit | Three retrieval tools run in parallel inside Agent 2 |
| AlphaXiv fetch (raw markdown) | No strict limit | Respect 404s, no retry loop |
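The sequential 1 req/sec strategy for S2 and DBLP can be enforced with a minimal throttle like the sketch below. This is illustrative only — the bundled init.py rate-limit helpers are the real implementation, and the class name here is an assumption:

```python
import time

class SequentialThrottle:
    """Block before each request so calls within one agent are spaced
    at least min_interval seconds apart (default ~1 req/sec)."""

    def __init__(self, min_interval: float = 1.0):
        self.min_interval = min_interval
        self._last = 0.0  # monotonic timestamp of the previous request

    def wait(self):
        """Sleep just long enough to honor the interval, then record now."""
        elapsed = time.monotonic() - self._last
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last = time.monotonic()
```

One throttle instance per service keeps S2 and DBLP pacing independent; batch/bulk endpoints remain the cheaper option when many IDs are needed at once.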