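Complete Claude + Codex MCP collaboration framework. Covers invocation methods, role assignment, efficiency rules, cost routing, and task templates. Load this skill when Claude needs to delegate a task to Codex MCP or collaborate with Codex.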
| Tool | Purpose | Key Return |
|---|---|---|
| `mcp__codex__codex` | Start a new Codex session | `{ threadId, content }` |
| `mcp__codex__codex-reply` | Continue an existing session | `{ threadId, content }` |

`mcp__codex__codex` parameters:

| Parameter | Required | Description |
|---|---|---|
| `prompt` | Yes | The task description. Be specific, include file paths. |
| `sandbox` | No | `"read-only"` (analysis/inspection) or `"workspace-write"` (file changes). Default: `"workspace-write"`. Prefer `"read-only"` when possible. |
| `cwd` | No | Working directory. Default: project root. |
| `approval-policy` | No | `"never"` (recommended for trusted projects: fully autonomous, no prompts), `"on-failure"` (auto-execute, prompt only on failure), `"on-request"` (auto-execute, prompt only when Codex asks), or `"untrusted"` (prompt for every command). Always pass this parameter explicitly; omitting it may default to `"untrusted"`, which causes frequent approval popups. |

`mcp__codex__codex-reply` parameters:

| Parameter | Required | Description |
|---|---|---|
| `threadId` | Yes | The `threadId` returned by the initial `mcp__codex__codex` call. |
| `prompt` | Yes | The follow-up question or instruction. |

mcp__codex__codex(
prompt="Read src/engine_health/diagnosis/track_b_sequence_detection.py lines 28-50 and explain the TrackBSequenceDetectionConfig fields. Answer in under 200 words.",
sandbox="read-only",
approval-policy="never",
cwd="/Users/charles/Desktop/BYSJ/projectv2"
)
mcp__codex__codex-reply(
threadId="019d70b2-...",
prompt="Now check what default threshold_mode is used and whether it matches the formal package."
)
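
The parameter rules above can be captured in a small argument builder. This is only a sketch: `build_codex_args` and `build_reply_args` are hypothetical helper names, not part of the Codex MCP server, and the helper deliberately defaults `sandbox` to `"read-only"` (the preferred value) rather than the tool's own default.

```python
from typing import Optional

VALID_SANDBOXES = {"read-only", "workspace-write"}
VALID_POLICIES = {"never", "on-failure", "on-request", "untrusted"}


def build_codex_args(
    prompt: str,
    sandbox: str = "read-only",
    approval_policy: str = "never",
    cwd: Optional[str] = None,
) -> dict:
    """Assemble arguments for a new-session call (hypothetical helper)."""
    if sandbox not in VALID_SANDBOXES:
        raise ValueError(f"unknown sandbox: {sandbox!r}")
    if approval_policy not in VALID_POLICIES:
        raise ValueError(f"unknown approval-policy: {approval_policy!r}")
    args = {
        "prompt": prompt,
        "sandbox": sandbox,
        # Always passed explicitly, as the parameter table recommends.
        "approval-policy": approval_policy,
    }
    if cwd is not None:
        args["cwd"] = cwd
    return args


def build_reply_args(thread_id: str, prompt: str) -> dict:
    """Assemble arguments for a follow-up call on an existing session."""
    if not thread_id:
        raise ValueError("threadId is required for a follow-up call")
    return {"threadId": thread_id, "prompt": prompt}
```

Building the payload in one place makes it hard to forget `approval-policy`, which is the omission the table warns about.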
Default role mapping by task type:
| Task Type | Executor | Reviewer | Rationale |
|---|---|---|---|
| Implementation | Codex | Claude | Codex is cheaper for bounded execution; Claude has deeper context for correctness review. |
| Analysis / research | Codex | Claude | Claude has stronger critical judgment for synthesis and review. |
| Critical decisions | Both independently | Claude synthesizes | Avoids single-model blind spots. |
| Mechanical / low-priority | Codex alone | None | Not worth reviewer overhead. |

For the Two-File Handoff, Claude writes two files before dispatching Codex:

- `/tmp/codex_user_context_<ID>.md`: the user's recent messages (verbatim, last 1-3 turns as relevant). This is the raw requirement source.
- `/tmp/codex_task_<ID>.md`: Claude's analysis, scope, constraints, expected output, and non-goals. This is Claude's instruction layer.

<ID> is a short unique suffix (e.g., first 8 chars of a UUID or timestamp) to avoid collisions between concurrent sessions.
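
A minimal sketch of writing the two handoff files with a collision-resistant suffix. The function name `write_handoff` and the `base_dir` parameter are assumptions for illustration; only the file-name pattern and the short prompt wording come from this document.

```python
import uuid
from pathlib import Path


def write_handoff(user_messages: list[str], task_notes: str, base_dir: str = "/tmp") -> dict:
    """Write the raw-context file and the task file, then build the short prompt."""
    suffix = uuid.uuid4().hex[:8]  # first 8 hex chars of a UUID
    base = Path(base_dir)
    context_path = base / f"codex_user_context_{suffix}.md"
    task_path = base / f"codex_task_{suffix}.md"

    # Keep at most the last three user turns, verbatim.
    context_path.write_text("\n\n".join(user_messages[-3:]), encoding="utf-8")
    task_path.write_text(task_notes, encoding="utf-8")

    prompt = (
        f"Read {context_path} for the raw user request and {task_path} "
        "for your task instructions. Execute accordingly."
    )
    return {"id": suffix, "context": context_path, "task": task_path, "prompt": prompt}
```

The returned `prompt` is the only text that needs to travel through the MCP call; everything else stays on disk.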
Then pass Codex a short prompt: "Read /tmp/codex_user_context_<ID>.md for the raw user request and /tmp/codex_task_<ID>.md for your task instructions. Execute accordingly." This keeps Claude's conversation context small and preserves a clean separation between raw user intent and Claude's framing.

- Always pass `approval-policy`: every `mcp__codex__codex` call must include `approval-policy="never"` to prevent interactive approval popups. Omitting this parameter may cause Codex to prompt for every shell command, blocking autonomous execution.
- Reuse sessions: once a `mcp__codex__codex` call returns a `threadId`, use `mcp__codex__codex-reply` with that `threadId` for follow-up questions in the same domain. This avoids cold-start overhead.
- Prefer the `read-only` sandbox: for pure analysis/inspection tasks, pass `sandbox: "read-only"`. This reduces sandbox overhead.
- Parallelize: when tasks are independent, issue parallel `mcp__codex__codex` calls rather than sequential ones.

Claude tokens are expensive, Codex tokens are cheap. Route accordingly:
| Task Profile | Route To | Reason |
|---|---|---|
| High-stakes (architecture, tradeoff, review, final synthesis) | Claude directly | Quality matters more than cost. |
| Low-priority mechanical (bulk inspection, formatting, docs, simple summaries) | Codex | Save Claude token budget. |
| Bounded execution (implementation, testing, debugging) | Codex | Well-scoped, Codex is sufficient. |
| Open-ended planning, synthesis, multi-source integration | Claude | Requires deep reasoning. |
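
The routing table can be read as a plain lookup. The sketch below assumes short profile keys (`"high-stakes"`, `"bounded execution"`, and so on) that are invented here for illustration; the targets and reasons are copied from the table.

```python
# Profile key -> (route target, reason). Keys are illustrative labels.
ROUTES = {
    "high-stakes": ("Claude", "Quality matters more than cost."),
    "low-priority mechanical": ("Codex", "Save Claude token budget."),
    "bounded execution": ("Codex", "Well-scoped, Codex is sufficient."),
    "open-ended": ("Claude", "Requires deep reasoning."),
}


def route(profile: str) -> str:
    """Return the model that should handle a task with the given profile."""
    key = profile.strip().lower()
    if key not in ROUTES:
        # Unknown profiles are rejected rather than guessed.
        raise ValueError(f"unknown task profile: {profile!r}")
    return ROUTES[key][0]
```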
When delegating to Codex MCP, frame each task using this template:
Scope: [files, directories, or modules]
Goal: [what to accomplish]
Constraints: [what not to change, time/size limits]
Expected output: [format and content of the result]
Non-goals: [what this task explicitly does NOT cover]
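
One way to keep every delegated task in this shape is to render the template from a small data class. `TaskFrame` is a hypothetical name; the five labels are the ones in the template above, and rejecting empty fields is an added assumption, not a rule stated in this document.

```python
from dataclasses import dataclass


@dataclass
class TaskFrame:
    scope: str
    goal: str
    constraints: str
    expected_output: str
    non_goals: str

    def render(self) -> str:
        """Render the five template fields, refusing to emit an empty one."""
        fields = [
            ("Scope", self.scope),
            ("Goal", self.goal),
            ("Constraints", self.constraints),
            ("Expected output", self.expected_output),
            ("Non-goals", self.non_goals),
        ]
        missing = [label for label, value in fields if not value.strip()]
        if missing:
            raise ValueError(f"empty template fields: {', '.join(missing)}")
        return "\n".join(f"{label}: {value}" for label, value in fields)
```

The rendered text can be written straight into `/tmp/codex_task_<ID>.md`.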

See `dual-agent-original-request-review/SKILL.md`. Codex tasks that touch the codebase or real artifacts inherit the rules defined there.

Use iterative retrieval when Claude cannot confidently determine all relevant file paths before dispatching Codex. Common triggers:
Do NOT use when file paths are already known — go directly to the standard Two-File Handoff. The decision rule: if Claude can fill the Scope field of the Task Framing Template (§5) with specific paths, skip this protocol.
Round 1 — Broad discovery (Codex searches, Claude evaluates)
Claude → Codex: "Search <directory/module> for <keywords/patterns>.
List files with: path, one-line summary, relevance tier, mtime (if artifact).
Cap at 15 files."
Codex → Claude: file list + relevance notes
Claude evaluates: assign relevance tiers, pick top 3-5 files for Round 2.
Claude may add new keywords discovered from Codex's file list.
Round 2 — Narrowed inspection (standard Two-File Handoff resumes)
Claude writes discovered paths into /tmp/codex_task_<ID>.md as the Scope field.
Claude → Codex: "Read <specific files/line ranges>.
Answer <targeted analytical question>."
Codex → Claude: detailed findings
Round 3 — Targeted verification (only if Round 2 reveals a gap)
Allowed only for: verifying a specific claim, chasing one newly discovered dependency.
NOT allowed for: introducing a new hypothesis or widening scope.
Claude → Codex (via codex-reply on same threadId):
"Read <newly discovered file> lines X-Y.
Verify whether <specific claim from Round 2>."
Codex → Claude: verification result
Hard cap: 3 rounds. If relevant files still cannot be located after 3 rounds, Claude reports the gap to the user rather than expanding further.
Claude assigns a tier to each file Codex returns in Round 1:
| Tier | Meaning | Action |
|---|---|---|
| High | Directly implements or contains the target logic/data | Read in Round 2 |
| Medium | Tangentially related (imports, configs, test files) | Read only if High files are insufficient |
| Low | Unlikely relevant (naming coincidence, unrelated module) | Drop unless no better candidates exist |
Threshold: at least 2 High-tier files needed before proceeding to Round 2. If Round 1 yields 0 High files, refine keywords and retry Round 1 once (counts toward the 3-round cap).
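
The threshold rule can be sketched as a small decision function. The names and return labels are hypothetical. The text only spells out two cases (at least 2 High files, and 0 High files on the first attempt); returning `"escalate"` for everything else, including exactly one High file, is an assumption of this sketch.

```python
MIN_HIGH_FILES = 2


def round1_decision(tiers: list[str], already_retried: bool) -> str:
    """Decide what follows a Round 1 result.

    tiers: the tier Claude assigned to each returned file ("High", "Medium", "Low").
    already_retried: True if Round 1 has already been retried once.
    """
    high = sum(1 for tier in tiers if tier == "High")
    if high >= MIN_HIGH_FILES:
        return "proceed"  # enough High-tier files for Round 2
    if high == 0 and not already_retried:
        return "retry"  # refine keywords; the retry counts toward the 3-round cap
    # Not defined in the text: left to Claude's judgment in this sketch.
    return "escalate"
```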
Two-File Handoff: Round 1 uses a lightweight inline prompt (no Two-File needed — it's a search task). Round 2+ transitions to standard Two-File Handoff once scope is known.
Session reuse: use codex-reply with the same threadId across all rounds — Codex retains its discovery context, reducing cold-start overhead.
Artifact-Grounded Review: when iterative retrieval locates result artifacts, the staleness check (artifact-grounded-review/SKILL.md § Artifact Staleness Check) applies to every discovered artifact. Codex must report artifact mtimes in Round 1 so Claude can filter stale ones before Round 2.
Verdict-Affecting Claims: files discovered via iterative retrieval that become the basis of a verdict-affecting claim must be tagged in the executor's claim list (per dual-agent-original-request-review/SKILL.md § Standard Process step 5).
Scope: <top-level directory or module — intentionally broad>
Goal: Find files related to <topic/keywords/patterns>.
For each file report: path | one-line summary | relevance (High/Med/Low) | mtime (artifacts only).
Constraints:
- Cap at 15 files
- Do NOT read file contents — list only
- Include mtime for result artifacts (JSON/CSV/NPZ) so staleness can be assessed
Expected output: Markdown table
Non-goals: Deep analysis — that comes in Round 2.
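
Because Round 1 asks for a Markdown table, Claude's side can parse the reply mechanically before assigning tiers. A minimal sketch, assuming the columns arrive in the order path, summary, relevance, mtime and that no cell contains a literal `|`; `parse_round1_table` is a hypothetical helper.

```python
def parse_round1_table(markdown: str, cap: int = 15) -> list[dict]:
    """Parse a Round 1 Markdown table into records, keeping at most `cap` rows."""
    rows = []
    for line in markdown.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if len(cells) < 3:
            continue
        # Skip the separator row (dashes/colons only) and the header row.
        if all(set(cell) <= set("-: ") for cell in cells):
            continue
        if cells[0].lower() == "path":
            continue
        path, summary, relevance = cells[:3]
        mtime = cells[3] if len(cells) > 3 and cells[3] else None
        rows.append(
            {"path": path, "summary": summary, "relevance": relevance, "mtime": mtime}
        )
    return rows[:cap]
```

Rows without an mtime (non-artifacts) come back with `mtime=None`, so the staleness check can be limited to rows where it is set.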
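
- Avoid searching from the root (`/`) instead of a targeted directory.
- Avoid the `workspace-write` sandbox for read-only tasks.

Applies to any new literature-survey task: related-work writing, baseline exploration, divergent exploration of methods, and filling gaps in a survey.
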
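Level 1: Triage (decide whether the paper is relevant)
Feed only the abstract + introduction of each paper.
Source priority: arXiv TeX source > journal HTML/Markdown > first ~2 pages of the local PDF.
Codex output (one entry per paper):
- go / kill
- one-sentence reason
- relevance tag (which research line of this project the paper relates to)
Level 2: Method Reading (papers that passed triage)
Feed only the relevant sections (method / experiments); do not load the full text.
Source priority is the same as Level 1.
Level 3: Full Paper (needed for reproduction or in-depth citation)
Load the full paper, but update the registry in
`docs/10_diagnosis/literature_*/literature_manifest.json` in the same turn.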
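
The source priority and per-level input can be sketched as a lookup. The source labels (`"arxiv_tex"` and so on) and the function name `plan_feed` are illustrative. Applying the same priority order at Level 3 is an assumption, since the text states it only for Levels 1 and 2.

```python
# Highest priority first, following the order given for Level 1.
SOURCE_PRIORITY = ("arxiv_tex", "journal_html_or_markdown", "local_pdf")

# What each level is allowed to feed to Codex.
LEVEL_INPUT = {
    1: "abstract + introduction",
    2: "relevant sections (method / experiments)",
    3: "full paper",
}


def plan_feed(level: int, available: set) -> tuple:
    """Return (source to fetch, portion to feed) for a paper at the given level."""
    if level not in LEVEL_INPUT:
        raise ValueError(f"unknown level: {level}")
    for source in SOURCE_PRIORITY:
        if source in available:
            return source, LEVEL_INPUT[level]
    raise ValueError("no usable source available for this paper")
```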
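
arXiv TeX source: https://arxiv.org/e-print/<id> (arxiv.org is already on the WebFetch allowlist in `settings.local.json`).

Level 3 papers are registered in `literature_manifest.json`. A new Level 3 entry must contain at least:

- `paper_id`
- `title`
- `year`
- `download_url`
- `project_use` (how the paper is used in this project, specific to a claim / baseline / method)
- `not_for` (scenarios to which the paper explicitly must not be extrapolated, to avoid future mis-citation)

Locations:

- `docs/10_diagnosis/literature_track_b/literature_manifest.json`
- `docs/10_diagnosis/literature_fault_injection/literature_manifest.json`

Consistent with `artifact-grounded-review`: any judgment based on the literature must cite a specific paper:section. Promoting a one-line Codex summary directly to a claim is prohibited.

Before reading any PDF or dispatching a PDF to Codex, Claude must first answer three questions:
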
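| Level | TeX / HTML / Markdown | PDF |
|---|---|---|
| 1. Triage | ✅ Abstract + intro only | ❌ Whole PDF prohibited. If only a PDF exists, first extract the abstract + intro paragraphs by hand or by script, then feed them as plain text |
| 2. Method reading | ✅ Relevant sections only | ⚠️ Only excerpts extracted by page range (e.g., pp. 3–7); whole PDF prohibited |
| 3. Full paper | ✅ | ✅ But the `literature_manifest.json` registry must be updated in the same turn |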
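
Consequences of violating the gate: the context gets diluted and the quality of later decisions drops (observed by the author in practice). Recovery after a violation: immediately `/compact` the whole-PDF content already loaded, then re-feed an excerpted version at the correct Level. The gate does not depend on Codex; it is the responsibility of Claude's main thread.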
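
The gate table can be expressed as a pre-flight check that Claude's main thread runs before feeding a paper. This is a sketch under stated assumptions: `check_gate` and its arguments are hypothetical, and requiring a complete manifest entry for every Level 3 load (not only PDFs) follows the Level 3 rule rather than the table cell.

```python
from typing import Optional

TEXT_FORMATS = {"tex", "html", "markdown"}
REQUIRED_MANIFEST_FIELDS = (
    "paper_id", "title", "year", "download_url", "project_use", "not_for",
)


def check_gate(level: int, fmt: str, excerpted: bool,
               manifest_entry: Optional[dict] = None) -> tuple:
    """Return (allowed, reason) for feeding a paper at `level` in format `fmt`."""
    if fmt not in TEXT_FORMATS and fmt != "pdf":
        raise ValueError(f"unknown format: {fmt!r}")
    if level == 1:
        if fmt == "pdf":
            # PDFs are never fed at triage; extract plain text first.
            return False, "extract abstract + intro to plain text first"
        return (True, "ok") if excerpted else (False, "feed abstract + intro only")
    if level == 2:
        if excerpted:
            return True, "ok"
        return False, "feed an excerpt only (relevant sections or a page range)"
    if level == 3:
        if manifest_entry is None:
            return False, "register the paper in literature_manifest.json first"
        missing = [f for f in REQUIRED_MANIFEST_FIELDS
                   if manifest_entry.get(f) in (None, "")]
        if missing:
            return False, "manifest entry missing: " + ", ".join(missing)
        return True, "ok"
    raise ValueError(f"unknown level: {level}")
```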
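
Use for open-ended technical decisions: the candidates are unknown, several approaches are viable, and the decision needs broad divergent coverage plus a strictly constrained review. Typical triggers:

Do not use for: executing tasks that are already decided, bug fixes, or implementing known interfaces. Those go through the standard Two-File Handoff (§3).
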
Each Codex stage is completed in a new `mcp__codex__codex` session.

Stage 1 — Divergent Generator (Codex session A)
Input: raw user request + baseline context + constraints only.
NO "prior attempts", NO "previously rejected ideas", NO lean of Claude's own.
Ask: "Generate N=10 candidate approaches, each with: name, one-paragraph
rationale, key assumption, failure mode. Do not rank. Do not self-filter."
Output: 10 candidates, flat list.
Stage 2 — Strict Scorer (Codex session B, isolated from A)
Input: raw user request + baseline context + ONE candidate at a time.
Session B sees NO other candidates, NO generator's self-assessment.
Ask: "Score this candidate on: novelty, feasibility, baseline-compatibility,
implementation complexity, expected gain. Give go / revise / kill plus
rationale. Cite specific code or artifact paths from baseline when
claiming compatibility."
Output: 10 independent score cards (parallel dispatch, different threadIds).
Stage 3 — Decisive Synthesis (Claude, main thread)
Read all 10 score cards. Rank by score + fit with project constraints
(journal bar / Track B scope / comparability_impact risk).
Produce a single ranked short-list (top 2-3) with rationale.
If top candidate has UNVERIFIED compatibility claim, route to Codex
for verification before committing.
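
Stage 3 is Claude's judgment call, but the mechanical part (dropping killed candidates, pre-sorting, flagging missing evidence) can be sketched in code. The numeric scale is an assumption: this document does not define one, so the sketch takes integer scores where higher is better on every criterion (including complexity, scored so that simpler is higher) and uses an unweighted sum.

```python
from dataclasses import dataclass

CRITERIA = ("novelty", "feasibility", "compat", "complexity", "gain")


@dataclass
class ScoreCard:
    candidate: str
    scores: dict        # criterion -> int; scale is an assumption of this sketch
    verdict: str        # "go", "revise", or "kill"
    evidence: str = ""  # cited file:line / artifact:key, empty if none


def shortlist(cards: list, top_n: int = 3) -> list:
    """Pre-sort Stage 2 score cards for Claude's Stage 3 synthesis."""
    alive = [card for card in cards if card.verdict != "kill"]

    def total(card: ScoreCard) -> int:
        return sum(card.scores[name] for name in CRITERIA)

    ranked = sorted(alive, key=total, reverse=True)
    return [
        (card.candidate, total(card),
         "cited" if card.evidence.strip() else "UNVERIFIED")
        for card in ranked[:top_n]
    ]
```

Entries flagged `UNVERIFIED` are the ones to route back to Codex for verification before committing; fit with project constraints still has to be weighed by Claude on top of this ordering.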
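
- Isolated sessions are started with `mcp__codex__codex`, not `codex-reply`.
- Stages 1 and 2 both receive the raw user request (the raw-request preservation rule of `dual-agent-original-request-review`).
- No conflict with `artifact-grounded-review`: the Stage 2 strict review must cite a specific file:line / artifact:key, and any score without evidence is marked UNVERIFIED.
- No replacement for the `dual-agent-original-request-review` Verification Discipline: if the Stage 3 output changes the project's direction (trainer / baseline / scoring criteria), it must still go back to dual-agent review for formal acceptance.

## Divergent-Strict-Decisive Decision Log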
Stage 1 candidates (Codex session A, threadId=<...>):
1. <name> — <one-sentence>
2. ...
10. ...
Stage 2 score cards (parallel, 10 independent sessions):
| Candidate | novelty | feasibility | compat | complexity | gain | verdict | key evidence |
|-----------|---------|-------------|--------|------------|------|---------|--------------|
Stage 3 ranked short-list (Claude):
1. <top pick> — rationale + UNVERIFIED claims to resolve
2. <runner-up>
If the Stage 2 evidence column or the Stage 3 short-list rationale is missing, the process counts as incomplete and goes back to Claude to be filled in.
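
The main thread builds the `TaskCreate` chain before starting work. It is triggered when any one of the following conditions holds:

- … `ssh_tmux` session)
- Set `TaskUpdate status=in_progress` when a task starts and `status=completed` as soon as it finishes; do not batch the updates until the end.
- … aligned with `success_signal` / `artifact_output_paths`.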