Create and maintain scientific/empirical experiment or investigation reports in Notion via MCP from freeform artifacts. Default to direct Notion editing with a Claude-assisted wording/structure pass when available; support local-first draft QA before publishing when explicitly requested. Use the report shape that best communicates the evidence instead of forcing one narrative format.
Encourage use of multiple agents/subagents when it is likely to improve speed, quality, or confidence.
Split work into clear packets with owners, inputs, acceptance checks, and a synthesis step when parallelizing.
Use single-agent execution when scope is small or coordination overhead outweighs gains.
Proactive autonomy and knowledge compounding
Be proactive: immediately take the next highest-value in-scope action when it is clear.
Default to autonomous execution: do not pause for confirmation between normal in-scope steps.
Request user input only when absolutely necessary: ambiguous requirements, material risk tradeoffs, missing required data/access, or destructive/irreversible actions outside policy.
If blocked by command/tool/env failures, attempt high-confidence fallbacks autonomously before escalating (for example rg -> find/grep, python -> python3, alternate repo-native scripts).
When the workflow uses plan/, ensure required plan directories exist before reading/writing them (create when edits are allowed; otherwise use an in-memory fallback and call it out).
Related Skills
plan/
Treat transient external failures (network/SSH/remote APIs/timeouts) as retryable by default: run bounded retries with backoff and capture failure evidence before concluding blocked.
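The bounded-retry rule above can be sketched as a small helper. The attempt count, backoff base, and retryable exception types are illustrative defaults, not a fixed policy:

```python
import time

def with_bounded_retries(fn, attempts=3, base_delay=1.0,
                         retryable=(TimeoutError, ConnectionError)):
    """Run fn with bounded retries and exponential backoff.

    Collects each failure message so evidence is available if every
    attempt fails and the work must be reported as blocked.
    """
    failures = []
    for attempt in range(attempts):
        try:
            return fn()
        except retryable as exc:
            failures.append(f"attempt {attempt + 1}: {exc}")
            if attempt < attempts - 1:
                # Exponential backoff: base, 2x base, 4x base, ...
                time.sleep(base_delay * (2 ** attempt))
    # All attempts failed: surface the captured evidence, then conclude blocked.
    raise RuntimeError("blocked after retries; evidence: " + "; ".join(failures))
```

Capturing the per-attempt messages is what lets the final "blocked" report carry concrete failure evidence rather than a bare "it failed".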
On repeated invocations for the same objective, resume from prior findings/artifacts and prioritize net-new progress over rerunning identical work unless verification requires reruns.
Drive work to complete outcomes with verification, not partial handoffs.
Treat iterative execution as the default for non-trivial work; run adaptive loop passes. Example loops (adapt as needed, not rigid): issue-resolution investigate -> plan -> fix -> verify -> battletest -> organise-docs -> git-commit -> re-review; cleanup scan -> prioritize -> clean -> verify -> re-scan; docs audit -> update -> verify -> re-audit.
Keep looping until actual completion criteria are met: no actionable in-scope items remain, verification is green, and confidence is high.
Run organise-docs frequently during execution to capture durable decisions and learnings, not only at the end.
Create small checkpoint commits frequently with git-commit when changes are commit-eligible, checks are green, and repo policy permits commits.
Never squash commits; always use merge commits when integrating branches.
Prefer simplification over added complexity: aggressively remove bloat, redundancy, and over-engineering while preserving correctness.
When you touch code, leave the touched area in a better state than you found it: clearer, simpler, tidier, and at least as performant unless the task requires an explicit trade-off.
Use simple, plain English in user messages, docs, notes, reports, code comments, and other explanatory writing. Avoid jargon, fancy wording, and complex phrasing. When a technical term is needed for correctness, explain it in simple words the first time.
Default to short user-facing responses. Think about what the user most wants to know, and lead with that. Do not dump every detail by default.
Always include important changes, blockers, verification gaps, and any important assumptions, nuances, principles, or decisions that shaped the work. Add more detail only when the user asks for it or when uncertainty or risk makes it necessary.
Compound knowledge continuously: keep docs/ accurate and up to date, and promote durable learnings and decisions from work into docs.
Long-task checkpoint cadence
For any non-trivial task (including long efforts), run recurring checkpoint cycles instead of waiting for a single end-of-task wrap-up.
At each meaningful milestone with commit-eligible changes, and at least once per major phase, invoke git-commit to create a small logical checkpoint commit once relevant checks are green and repo policy permits commits.
At the same cadence, invoke organise-docs whenever durable learnings/decisions appear, and prune stale plan/ scratch artifacts.
If either checkpoint is blocked (for example failing checks or low-confidence documentation), resolve or record the blocker immediately and retry before expanding scope.
Interruption handoff (required for multi-pass report work)
Before stopping, switching sessions, or yielding after tool/session interruption, write a concise handoff in plan/handoffs/ or plan/current/notes.md.
Include the report/page identity, latest evidence bundle reviewed, mutations already published, remaining open edits, and the exact next step.
On resume, load the handoff first so the report is continued in place instead of re-audited from scratch.
Create and refine a single canonical report page in Notion from evidence artifacts.
This skill is cross-project by default. It is designed for scientific/empirical reports in any domain. Domain overlays (for example trading/market experiments) are optional and only apply when relevant.
Terminal state contract (must follow)
The skill is complete only when all of the following are true:
Objective completion: the user-requested outcome is achieved, or explicitly marked blocked with concrete blocker evidence.
Workflow completion: every required workflow step is resolved as done, blocked, or not-applicable, with brief evidence or rationale.
Step-level terminal completion: each numbered subtask must have explicit completion evidence (artifact, command output, or written rationale) before advancing.
Verification completion: required checks/validations for this skill are executed, or any unavailable checks are explicitly called out with impact.
Findings completion (where applicable): report only evidence-backed findings; if no high-confidence critical findings are present, explicitly state that.
Loop completion: no actionable in-scope next step remains under the current objective.
Stop only after this terminal contract is satisfied; otherwise continue iterating.
Terminal state examples (adapt to skill)
done: canonical report page is created/updated with required section order, explicit opener answer, evidence-backed claims, and post-update verification checks.
blocked: Notion parent resolution, page ownership constraints, or upload/write capability prevents compliant publishing after bounded retries; blocker evidence and exact unblock steps are reported.
not-applicable: optional visuals/upload steps are explicitly skipped with rationale when the report has no visual requirements.
Core principles (must follow)
Notion ownership gate is mandatory for existing-object mutations:
when available, verify ownership with stable created_by user/bot IDs (not display names alone)
creating new pages/entries in approved shared destinations (for example Quant Blog) is allowed when explicit in-task user intent is present
before mutating any existing object (update, move, delete, comment, append), verify ownership from created_by and resolved user identity
after creation, enforce the same ownership checks on subsequent mutations of that created entry/page
if ownership is ambiguous or cannot be verified, fail closed and stop with a blocker
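A minimal sketch of the fail-closed ownership gate, assuming metadata shaped like a Notion API page object (a created_by object carrying a stable id). The allowlist entries are placeholders for the real principals:

```python
# Hypothetical allowlist of stable principal IDs (bot/user), not display names.
ALLOWED_PRINCIPAL_IDS = {"bot-id-placeholder", "user-id-placeholder"}

def ownership_gate(page_metadata: dict) -> bool:
    """Fail-closed check before mutating an existing Notion object.

    Verifies the stable created_by id against the allowlist; display
    names are treated as labels only. Any ambiguity returns False,
    which means: stop and report a blocker instead of mutating.
    """
    created_by = page_metadata.get("created_by") or {}
    principal_id = created_by.get("id")
    if not principal_id:
        return False  # ownership cannot be verified -> fail closed
    return principal_id in ALLOWED_PRINCIPAL_IDS
```

Note the gate never consults the name field, so a renamed or impersonated display name cannot pass the check.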
no fixed input schema required
accept freeform evidence inputs: notes, logs, tables, plots, artifacts, and run outputs
descriptive-first reporting by default: factual, evidence-backed, and non-prescriptive unless user explicitly requests guidance
choose the report shape that best communicates the evidence: question/answer, objective/outcome, comparison-first, narrative walkthrough, or another structure when that is the clearest fit
use explicit Question -> Answer flow when the real reader need is a concrete question and answer, but do not force Q/A scaffolding when another structure is clearer
if a report is not naturally question-driven (for example operational status, incident summary, design comparison, or evidence ledger), state the objective and outcome plainly instead of forcing Q/A labels
avoid narration that adds mood but not evidence (for example "this gave a clear picture" or similar framing)
remove time-window framing such as overnight unless that timing matters to interpretation
unless failure behavior is the explicit experiment question, do not make failure-rate metrics the primary headline; treat them as reliability context supporting core outcomes
one canonical page per report identity; refine in place (no v2/final forks)
default publish destination for generated reports is Quant Blog (database row/page entries), unless the user explicitly asks for a different destination
consolidation-first: keep the fewest report pages that still map cleanly to distinct report identities/questions
plot-first for quantitative reporting; tables are supporting precision
in Quant Blog, keep date in the Date property and keep Title as <clear human title> (no leading date token in title text)
use rerun in titles only when the report explicitly links to the prior report/experiment being rerun
avoid opaque shorthand/acronym-heavy naming in titles/headings/captions (for example coded labels like RB1200 or B500); prefer plain-language wording
keep titles human-facing: do not include hash-like search/run IDs in titles (for example short hex tokens)
when same-topic reports represent distinct runs, keep separate pages and differentiate titles with plain-language descriptors (scope/design/coverage), not internal IDs
for experiment/search reports, record machine-facing IDs (search ID, run ID, cluster job IDs, artifact pointers) in a final Provenance Appendix, not the title
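The naming rules above can be approximated with a heuristic title linter. The regexes are illustrative and will need per-project tuning; they flag hex-like run/search IDs, leading date tokens, and coded shorthand such as RB1200 or B500:

```python
import re

def title_hygiene_issues(title: str) -> list[str]:
    """Heuristic checks for human-facing report titles."""
    issues = []
    # Hash-like IDs: 7+ hex characters containing at least one digit.
    if re.search(r"\b(?=[0-9a-f]*\d)[0-9a-f]{7,}\b", title, re.IGNORECASE):
        issues.append("hash-like ID in title")
    # Leading ISO date token: belongs in the Date property, not the title text.
    if re.match(r"^\d{4}-\d{2}-\d{2}", title):
        issues.append("leading date token")
    # Coded shorthand like RB1200 or B500: prefer plain-language descriptors.
    if re.search(r"\b[A-Z]{1,3}\d{2,}\b", title):
        issues.append("opaque coded label")
    return issues
```

Flagged titles should be renamed with plain-language scope/design descriptors, and the machine-facing IDs moved to the Provenance Appendix.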
explicit opener emphasis hierarchy for scanability:
bold label-first opener
concise high-signal callouts
bold key metrics/status tokens only (no blanket bolding)
opener clarity is mandatory:
first Top Takeaways line must contain the clearest high-level framing for the report: either the explicit report question and direct answer, or the explicit objective and outcome
opener wording must be self-contained plain language; a new reader should not need prior thread context
when the opener is question-driven, use direct answer tokens (yes, no, inconclusive, or equivalent), not vague status-only wording
if outcome is mixed, map it to explicit direct-answer or objective/outcome wording in the opener (for example runtime yes, performance no or objective met for runtime, not for performance)
avoid opener text that names status without concrete content
when editing existing reports, normalize legacy opener labels to the clearest opener framing for the actual report instead of forcing one fixed label pattern
if more than one key question is covered, use numbered Question n / Answer n pairs with strict adjacency (each answer immediately follows its question)
for multi-question or mixed-certainty investigations, add a dedicated Question Decomposition section immediately after Top Takeaways using strict Question n -> Answer n -> Status n triples, with one concrete empirical evidence line in each answer
keep the opener focused on the true experiment objective; do not let reliability/failure-rate stats become the lead claim unless failure analysis is the stated question
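A rough validator for the opener rules, assuming the first Top Takeaways line is available as plain text. The token lists are illustrative and deliberately incomplete:

```python
import re

# Status words that are non-compliant when they appear alone.
VAGUE_STATUS_ONLY = {"mixed", "answered", "improved", "done", "complete"}

def opener_is_explicit(first_line: str) -> bool:
    """Heuristic: the first Top Takeaways line should carry either a
    direct answer token (yes/no/inconclusive) or explicit
    objective/outcome wording, never status-only wording."""
    line = first_line.strip().lower()
    if line.rstrip(".") in VAGUE_STATUS_ONLY:
        return False  # status named without concrete content
    has_answer_token = re.search(r"\b(yes|no|inconclusive)\b", line) is not None
    has_objective_framing = (
        ("objective" in line and "outcome" in line) or "objective met" in line
    )
    return has_answer_token or has_objective_framing
```

A mixed outcome still passes, because the rule maps it to explicit tokens ("yes for runtime, no for performance") rather than the bare word "mixed".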
if user scope is a report collection/hub, audit every in-scope report page one-by-one (no sampling)
legacy-summary fallback rule: if the first summary section is not Top Takeaways (for example Key Takeaways or Executive Summary), enforce the same explicit opener rule in that section's first line
for collection-wide cosmetic requests, restrict edits to presentation and readability (formatting/emphasis/label normalization) and preserve reported numbers, conclusions, and claim text unless the user explicitly asks for content changes
during readability/cosmetic passes, remove duplicate headings and repeated near-identical intro paragraphs when they are clearly redundant and the change does not alter meaning
during refinement passes, aggressively remove duplicate sections, repeated claims, and repeated captions unless each repetition adds new evidence
for comparative experiment reports, include a dedicated Eval Setup section directly after Experiment Definition
Eval Setup must describe eval-time behavior in plain English (what checkpoint is evaluated, what is fixed vs randomized, and why the comparison is or is not apples-to-apples)
never use the heading Eval Setup Clarified; use Eval Setup
reports and report-generation helpers are ephemeral artifacts:
keep under plan/
never commit under tracked code paths (tools/, src/, experiments/, docs/)
traceability must be shared-reader friendly:
when relevant, include repository, branch, PR, and commit references so another reader can retrace the work
never include personal machine paths or other user-specific local filesystem references in the report body
when file/path references are useful, write them relative to the repository root (for example src/train.py, not repo-root/src/train.py)
when a GitHub remote exists, prefer clickable GitHub links for repo, PR, commit, and file references
if no GitHub link can be derived safely, fall back to repo name + branch + commit hash + repo-root-relative paths
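The link-derivation fallback above can be sketched as follows; only the standard GitHub SSH and HTTPS remote shapes are assumed:

```python
import re

def github_base_url(remote_url: str):
    """Derive a clickable GitHub base URL from a git remote, or None when
    the remote is not GitHub (then fall back to repo + branch + commit text)."""
    ssh = re.match(r"git@github\.com:(.+?)(?:\.git)?$", remote_url)
    https = re.match(r"https://github\.com/(.+?)(?:\.git)?/?$", remote_url)
    match = ssh or https
    return f"https://github.com/{match.group(1)}" if match else None

def commit_link(base: str, commit: str) -> str:
    return f"{base}/commit/{commit}"

def file_link(base: str, commit: str, repo_relative_path: str) -> str:
    # Pin file links to a commit so the reference stays stable for readers.
    return f"{base}/blob/{commit}/{repo_relative_path}"
```

Pinning file links to a commit (blob/<hash>/<path>) keeps them valid even after the branch moves.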
keep report presentation human-facing:
reports should read as human-authored and human-consumable
do not add Codex/Claude labels in report titles or main body sections
keep exactly one small attribution footer note at the end containing Prepared with support from Codex and Claude. codex-managed: true
if any extra Codex/Claude references are found during refinement, remove them before publish
Notion-native formatting is mandatory:
do not leave literal markdown control tokens in published Notion text blocks (for example `code` or **bold**)
convert emphasis/code intent to Notion rich-text annotations before finalizing
after publish, run a token-hygiene verification pass and normalize any residual markdown markers in place
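The token-hygiene pass can be sketched as a scan over fetched blocks, assuming the Notion API block shape ({"id": ..., "type": ..., "<type>": {"rich_text": [...]}}). The patterns are heuristics and may need tightening:

```python
import re

MD_TOKEN_PATTERNS = (
    re.compile(r"\*\*[^*]+\*\*"),         # literal **bold** markers
    re.compile(r"`[^`]+`"),               # literal `code` markers
    re.compile(r"(?<!\w)_[^_]+_(?!\w)"),  # literal _italic_ (ignores snake_case)
)

def blocks_with_literal_markdown(blocks: list) -> list:
    """Return IDs of blocks whose plain text still contains literal
    markdown control tokens that should have become rich-text annotations."""
    flagged = []
    for block in blocks:
        payload = block.get(block.get("type", ""), {})
        text = "".join(rt.get("plain_text", "")
                       for rt in payload.get("rich_text", []))
        if any(p.search(text) for p in MD_TOKEN_PATTERNS):
            flagged.append(block["id"])
    return flagged
```

Flagged blocks are then normalized in place by re-writing them with proper bold/code annotations instead of the literal markers.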
keep primary evidence expanded by default:
do not hide core results, primary plots, or key findings inside collapsible/toggle blocks
use toggle/collapsible blocks only when the user explicitly requests that layout
if original numeric artifacts still exist, regenerate publication-grade plots from source data; do not use lightweight Notion-native chart specs (for example Mermaid xychart-beta) for quantitative findings
do not upload report artifacts to third-party public/temporary file hosts (for example tmpfiles.org, catbox, imgur, file.io) unless the user explicitly approves
default to Notion-managed file/image uploads for report visuals (prefer notion-upload-local when available); if upload is unavailable in the current toolchain, add an Artifacts to attach section with placeholders rather than using unapproved external hosts
hide local paths, hostnames, tokens, and secrets in report body text
define acronyms on first use
write for first-time readers with no project background; do not assume internal context
minimize project-specific jargon in headings, captions, and body text; prefer plain everyday wording
when technical terms are required for scientific precision, define them in simple language at first use
keep section headings and plot labels self-explanatory so an outside reader can infer the purpose immediately
define short encoded labels at first use and near the plot that uses them (for example n4, r8, or similar run-scale labels); do not assume the reader knows local shorthand
define environment family names in plain English near first use; if the report uses internal environment labels, add a short reader-facing explanation beside them or in an adjacent definitions block
Local-first mode (only when user explicitly requests)
build a deterministic local bundle before Notion writes:
report body (.html or .md)
summary payload (.json)
high-resolution plots (>=220 dpi)
use full artifact rows for core aggregate/distribution visuals unless explicitly labeled sampled
do not downsample plotted data by default; this is especially required for time-series plots
preserve native time cadence for time-series plots by default (no implicit temporal aggregation/resampling, e.g. day->month)
prefer high-fidelity rendered images (PNG/SVG) generated from source artifacts over inline chart DSL blocks
use Mermaid/Notion-native charts only as an explicit fallback when source data is irretrievably unavailable
if downsampling or temporal aggregation is unavoidable for tooling/render limits, disclose it explicitly in both caption and methodology (method + factor + why)
validate local quality before publishing:
explicit axis labels + units
readable legends
no overlap/clipping
decision-useful without zooming
store local bundles in plan/artifacts/
keep temporary helper scripts in plan/experiments/ (or similar ephemeral plan/ path)
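One way to lay out the deterministic bundle, assuming the plan/ root is writable. The file names (report.md, summary.json) are illustrative conventions; plot files are written separately into the same directory by the plotting helper at >=220 dpi:

```python
import json
from pathlib import Path

def build_local_bundle(plan_root: str, report_id: str,
                       body_md: str, summary: dict) -> Path:
    """Write the local-first bundle under plan/artifacts/<report_id>/
    before any Notion write."""
    bundle = Path(plan_root) / "artifacts" / report_id
    bundle.mkdir(parents=True, exist_ok=True)
    (bundle / "report.md").write_text(body_md, encoding="utf-8")
    # sort_keys keeps the summary payload byte-stable across reruns,
    # which is what makes the bundle deterministic and diffable.
    (bundle / "summary.json").write_text(
        json.dumps(summary, indent=2, sort_keys=True), encoding="utf-8")
    return bundle
```

Because the payload is byte-stable, a rerun that produces an identical bundle is cheap to detect and skip.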
Allowed-owner existing pages only (must follow)
Codex may create new pages/database entries in approved destinations when explicit user intent exists. Codex may mutate existing pages/database entries only when they are owned by the allowlisted principals:
1) Top Takeaways
disallowed opener pattern: status-only wording with no concrete question/objective text
use question/answer lines as the main driver only when that is the clearest fit for the report
if another structure explains the evidence better, use it instead and keep the opener explicit
legacy heading fallback: if first summary heading is Key Takeaways/Executive Summary, apply the same first-line opener rule there
include the most important outcome immediately
if there is a clear before/after baseline relation, include explicit delta line (before -> after, absolute + relative where available)
apply emphasis hierarchy:
bold label-first opener
concise callouts for major win/loss tradeoff when both exist
selective bolding of key metrics/status only
if multiple questions are answered, list numbered Question n then Answer n lines directly under the opener, keeping each answer immediately after its paired question
when the report includes several sub-questions or mixed answer certainty, include a dedicated Question Decomposition section immediately after Top Takeaways with Question n -> Answer n -> Status n ordering and one empirical evidence line per answer
answer wording that directly maps to the question terms (no implicit inference required)
explicit answer status for this batch (yes/no/inconclusive or equivalent) with brief evidence-backed why
3) Eval Setup (for experiment/search reports)
place immediately after Experiment Definition and before any results/plots section
explicit eval-time setup details in plain English:
which checkpoint/policy snapshot is evaluated
exact eval composition and overrides (for example opponent setup or learned-agent count)
what is fixed at eval vs re-sampled/randomized
domain-randomization behavior at eval for key variables (for example fees and starting capital): fixed value(s) vs sampled range(s), and sampling cadence
explicit apples-to-apples statement across compared runs/batches
if any detail is unknown, mark it unknown and include it in Limitations / Reliability
4) Early Results Section
place early, after Eval Setup
include compact high-signal visuals first
prioritize visuals that answer the report question directly
use a plain-language heading that fits the report, such as Quick Aggregate Snapshot or another direct results label; avoid corporate/executive wording unless the page already uses it consistently
order visuals from simple to more complex, or from broad context to the more detailed evidence the reader needs next
if broad aggregate plots are too coarse or hide too much, demote them to a quick scan only and make run-level or matched-comparison plots the main evidence section
when aggregate plots remain, state clearly that they are a first scan and not the main evidence
5) Definitions and Methodology
define domain terms used in the report context
define formulas for derived metrics and signs/directions
include unit unavailable placeholders for unknown units; do not guess
6) Limitations / Reliability
use this heading text exactly: Limitations / Reliability
include missing data, incomplete runs, comparability gaps, and caveats
keep reliability emphasis proportional to impact
do not make completion/failure the hero narrative unless it materially changes conclusions
when failure rates are not the explicit research target, report them as reliability support (not the primary report focal point)
7) Run Spotlight (when run-level time-series artifacts exist)
include 1-3 explicitly selected runs
when run-level traces are the most decision-useful evidence, make this section the main evidence section rather than a side note
when a project already has a standard single-run plot set, include that standard set unless there is a strong reason not to
if both standard single-run plots and fuller all-points plots exist, keep both when they answer different reader needs and explain the role of each section
when a project has true one-episode or one-run evaluation charts, include them explicitly; do not treat training-path or checkpoint-path plots as a substitute
if both actual episode charts and training/checkpoint plots exist, label the sections so the reader can tell the difference immediately
show normalized trajectory views when possible (for example growth from 1.00x)
for return trajectories, prefer relative capital units and state the scale explicitly in captions (for example 1.0 = +100% net return)
for turnover/exposure trajectories, prefer relative units (x capital/day, turns/day, x equity) and define directional interpretation for signed exposure
when trace artifacts exist, do not stop at a single compressed overview image
include at least one dedicated performance-path figure for each spotlighted run with:
capital multiple or equivalent PnL trajectory
underwater or max-drawdown path
daily turnover multiple or equivalent daily activity path
include at least one dedicated inventory/book-shape figure for each spotlighted run with:
per-symbol position or target-position time series
gross and net exposure time series
if both target and realized positions are available, prefer showing both somewhere across the spotlight package; if only one can be shown clearly, prefer the more decision-relevant one and state that choice
if unavailable, explicitly call out artifact gap in limitations
prefer plain-English titles and labels over team shorthand; define any unavoidable technical term in simple words nearby
keep captions direct and objective; do not use story-like scene-setting or hype language
order plot groups so the reader can follow them easily from basic to harder, and from broad scan to detailed proof
if broad aggregate visuals hide too much variation, do not let them dominate the page; prefer per-run, matched, or progression plots instead
when a timing plot shows raw duration rather than normalized speed, say that plainly in the caption or immediately next to the plot
when a timing plot is not apples-to-apples on its own, point the reader to the normalized speed plot or the fair comparison plot
if a run-level plot uses training-log rows that mix outer iterations and inner training steps, say that plainly so repeated or rising timing lines are not misread
For tables:
clear title
explicit column headers + units
clear derived-field names
short caption with Takeaway: lead-in
General:
plot-first for quantitative interpretation
prefer fewer high-signal visuals over many redundant ones
prefer relative/normalized units in primary visuals when both raw and relative forms exist (keep raw units as secondary context)
when full-resolution full-detail run plots exist, prefer them as the core evidence section instead of relying mainly on broad summaries
no downsampling by default, especially for time-series visuals
no implicit temporal aggregation for time-series visuals; keep native cadence unless explicitly justified
avoid Mermaid xychart-beta (or equivalent low-fidelity DSL charts) for empirical metric plots when source artifacts exist
if downsampling or temporal aggregation is unavoidable, disclose exact method/factor and expected visual impact
Cross-project refinement loop (must follow)
keep the skill project-agnostic: domain specifics are overlays, not defaults
benchmark draft quality against the strongest prior report style available for the current project/domain
keep what improved readability/interpretability and remove low-signal sections/visuals
use selective emphasis only (bold key conclusions/metrics/status), not blanket formatting
ensure the final report remains descriptive-first unless recommendations were explicitly requested
Optional domain overlays
Use only when relevant to the current project/domain.
Trading/market experiments (optional overlay)
include trader/quant lens and ML/DL lens in Top Takeaways
common metrics may include PnL, Sharpe, drawdown, turnover, execution intensity, and robustness
prefer interpretable intensity metrics (for example turns/day) over raw fill counts unless fill mechanics are the focus
Cluster-backed artifacts (optional overlay)
verify run-level artifact availability directly (read-only) before declaring missing
when available, use retained forensics/root-cause fields for failure attribution rather than wrapper labels only
Claude-assisted drafting
Default for report wording/structure refinement when Claude is available.
command: claude -p --model opus --effort high
Codex remains source-of-truth for data prep, validation, and Notion writes
when available, verify ownership using stable created_by user/bot IDs and treat display names as labels only
creating new entries/pages in approved shared destinations is allowed when explicit user intent is present
for existing target objects, read metadata and verify created_by principal is allowlisted before mutating
for create actions in shared databases/workspaces, require explicit user intent for that destination; then enforce ownership checks on the created/updated entry itself
if ownership cannot be verified with high confidence, fail closed and report blocked
If visuals are required, verify notion-upload-local upload capability before publishing image blocks.
Run at least one Claude wording/structure pass when available, including prompt context for:
reader-first plain-English writing with minimal jargon
explicit eval-time setup clarity and apples-to-apples statement for comparative reports
Resolve parent:
normalize/verify user-supplied URL/ID with mcp__notion-local__API-retrieve-a-page
otherwise enumerate/search candidate parents and validate
Create/update canonical page:
derive report identity from evidence (experiment/search/run IDs, batch label/date, variant)
when publishing to Quant Blog, search existing rows by Project + Title (+ Date when available)
create only if no matching canonical row exists
update in place otherwise
avoid creating duplicates for title wording changes
ensure Quant Blog properties are correct (Project, Date, Title without leading date prefix)
include one small footer attribution note at end with Prepared with support from Codex and Claude. codex-managed: true
before finalizing, confirm no other Codex/Claude references remain in title/body
when relevant, add a compact traceability section or Provenance Appendix entries that include:
repository identity
branch name
PR link/number when applicable
commit hash references for the analyzed or reported state
repo-root-relative file paths for important touched files/artifacts
clickable GitHub links when a GitHub remote exists
for experiment/comparison reports, verify eval-time configuration details against available experiment definitions/search spaces/results artifacts before claiming comparability
when performing cosmetic-only passes, avoid rewriting analytical content; normalize rich-text formatting and opener labels in place
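The canonical-row matching step above (search Quant Blog by Project + Title, plus Date when available; create only when no match exists) can be sketched as a Notion database query filter. The property names and types (Project as select, Title as the title property, Date as date) are assumptions about the Quant Blog schema and should be checked against the real database:

```python
def quant_blog_match_filter(project: str, title: str, date_iso: str = None) -> dict:
    """Build a Notion database query filter for locating an existing
    canonical row before creating a new one."""
    clauses = [
        {"property": "Project", "select": {"equals": project}},
        {"property": "Title", "title": {"equals": title}},
    ]
    if date_iso:
        clauses.append({"property": "Date", "date": {"equals": date_iso}})
    return {"filter": {"and": clauses}}
```

If the query returns a row, update it in place; only an empty result justifies creating a new page, which prevents duplicates from mere title-wording changes.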
Upload visuals:
prefer notion-upload-local tool path for direct Notion-managed uploads
if upload tooling is unavailable, add an Artifacts to attach section with placeholders rather than external-host workarounds
place visuals in the main narrative section (Executive Visual Snapshot) rather than appending them after footers/appendices
after upload, verify visuals are anchored at intended locations and remove misplaced duplicate image blocks
after upload, replace default raw file-name captions with plain-English Takeaway: captions or a short human-readable description
when updating an existing visual, archive/delete the superseded block so only one current plot version is visible per plot identity
verify replacement integrity by re-fetching both superseded and new block IDs (old = archived, new = active)
Implementation note for Notion MCP block updates:
API-update-a-block payload must use the concrete block-type key directly (for example paragraph, heading_2, bulleted_list_item) rather than a nested type wrapper
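For example, a compliant update payload puts the concrete block-type key at the top level, with Notion rich-text objects inside it:

```python
def block_update_payload(block_type: str, rich_text: list) -> dict:
    """Shape the API-update-a-block payload: the block-type key
    (paragraph, heading_2, bulleted_list_item, ...) is the top-level
    key, not a value inside a nested {"type": ...} wrapper."""
    return {block_type: {"rich_text": rich_text}}

# correct: {"paragraph": {"rich_text": [...]}}
# wrong:   wrapping the same content under a nested "type" key
```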
Deduplication behavior:
keep exactly one canonical page per report identity in target reports location
prefer consolidation into fewer pages when multiple pages represent the same identity/question
when duplicates exist for same data identity, retain one canonical page; move/archive others only with user approval
when pages are same topic but distinct run identities, retain both and rename for clear human differentiation
Output contract:
return created/updated Notion page URL (or ID)
Image embedding constraints (Notion reality)
preferred rendering path is Notion-managed file/image uploads
do not use third-party public file hosting by default
external hosting is allowed only with explicit user approval and should prefer first-party/org-controlled infrastructure
data: URIs and local-only schemes are not durable in Notion
after write, re-fetch and verify image/file blocks still have non-empty URLs and non-zero payloads at verification time
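The URL half of that verification can be sketched as a pure check over re-fetched blocks, assuming the Notion API image/file block shape (Notion-managed files carry {"file": {"url": ...}}, external ones {"external": {"url": ...}}); confirming non-zero payloads additionally requires fetching each URL, which is outside this sketch:

```python
def broken_image_blocks(blocks: list) -> list:
    """Return IDs of image/file blocks whose URL is empty or missing
    after publish, so they can be re-uploaded or flagged."""
    broken = []
    for block in blocks:
        if block.get("type") not in ("image", "file"):
            continue
        payload = block.get(block["type"], {})
        source = payload.get("file") or payload.get("external") or {}
        if not source.get("url"):
            broken.append(block["id"])
    return broken
```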
Embedding sequence:
notion-upload-local direct file/image upload (Notion-managed)
browser upload fallback (if available) on Codex-managed page (Notion-managed)
reuse existing Notion-managed file/image blocks when valid
if all fail, add Artifacts to attach section with neutral labels + meaning (do not use unapproved third-party hosts)
Iteration loop and finish criteria
run multi-pass reader-centric refinement:
pass 1: structure and narrative arc
pass 2: skeptical-reader questions and caveats
pass 3: clarity and emphasis hierarchy
in every pass, check for and remove duplicate information, stale wording that no longer matches the page structure, and headings that are less clear than a simpler plain-English alternative
after each pass: update same canonical page and re-fetch to verify landing
done when:
required sections present
opener clearly states the main question and answer, or the main objective and outcome, whichever better matches the report
opener framing is unambiguous to a cold reader
top-of-page readback is clean: no duplicate summary headings, no repeated intro blocks, and the opening screen reads naturally in plain English
when multiple questions are present, Q/A ordering is explicit and adjacent (Question n then Answer n)
when Question Decomposition is present, each Question n has adjacent Answer n and Status n lines, and each Answer n includes one concrete empirical evidence line
experiment/comparison reports include a dedicated Eval Setup section immediately after Experiment Definition
Eval Setup includes explicit fixed-vs-randomized behavior and an apples-to-apples statement (or explicit mismatch caveat)
empirical/comparison report section order is explicit and preserved:
Top Takeaways -> Experiment Definition -> Eval Setup -> Early Results Section -> Definitions and Methodology -> Limitations / Reliability (+ Run Spotlight when applicable)
when both broad aggregate plots and full-detail run plots are present, the page should make clear which section is the quick scan and which section is the main evidence
short run-scale labels, environment names, timing-metric meanings, and the compute shape behind the compared points are explained at first use and near the relevant plots
when standard single-run plots exist for the project, the report includes them or explicitly says why they were skipped
experiment/search reports include a final Provenance Appendix with machine-facing identifiers
where repo context matters, traceability references are present and shareable: repo, branch, PR, commit, and repo-root-relative path references with GitHub links when available
claims are evidence-grounded
formatting hygiene checks pass (no residual literal markdown markers like `...` or **...** in published block text)
visuals are in their intended section context (usually Executive Visual Snapshot), not detached at page end
visual/table labeling standards met
no privacy leakage
no actionable quality gaps remain
for collection-scope tasks, every in-scope report page was individually checked and opener clarity is verified
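The formatting-hygiene check above is mechanical enough to script. A minimal sketch that flags residual literal markdown markers in published block text (collecting the block texts from the page is assumed to be handled elsewhere):

```python
import re

# Literal markdown markers that should never survive in published
# Notion block text: **bold** and `code` spans left as raw characters.
MARKER_PATTERNS = [
    re.compile(r"\*\*[^*]+\*\*"),  # literal **bold** markers
    re.compile(r"`[^`]+`"),        # literal `code` markers
]

def hygiene_violations(block_texts):
    """Return (block_index, matched_text) pairs for blocks carrying markers."""
    violations = []
    for i, text in enumerate(block_texts):
        for pattern in MARKER_PATTERNS:
            match = pattern.search(text)
            if match:
                violations.append((i, match.group(0)))
    return violations
```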
Quality checklist (core)
Top Takeaways is first
opener uses the clearest framing for the report: explicit Question + direct answer when question-driven, otherwise explicit objective/outcome framing
opener includes explicit question text when question-driven, or explicit objective text when objective-driven
when Question Decomposition exists, it uses strict Question n -> Answer n -> Status n ordering with one empirical evidence line per answer
vague status-only opener answers (for example mixed, answered, improved) are non-compliant unless mapped to explicit direct-answer tokens
opener framing is self-contained plain language (no shorthand requiring external context)
report naming is human-readable (<clear human title>) and avoids opaque shorthand/acronym-heavy labels
report date is stored in the Date property when using Quant Blog
report titles do not include hash-like run/search IDs
machine-facing traceability IDs are captured in a final Provenance Appendix (for experiment/search reports)
when repo context matters, reports include shareable repository references: repo, branch, PR, commit, and repo-root-relative paths; clickable GitHub links are preferred when available
"rerun" appears in a title only when the report links to the prior report/experiment it reruns
ownership is verified for any existing mutation target ([email protected] or Ollie (MCP))
failure-rate metrics are present for reliability context but are not headline focal points unless failure behavior is the explicit research question
no vague status-only opener wording (for example answered) without explicit question text
legacy first-summary sections (Key Takeaways/Executive Summary) apply the same explicit opener rule when Top Takeaways is absent
opener emphasis hierarchy is correct
no duplicate summary headings or repeated intro paragraphs remain after refinement
Experiment Definition includes what varied/fixed/how/why/answer
Eval Setup appears immediately after Experiment Definition and before results/plots sections
the main results section uses plain-language headings and orders plots from simple to more complex so they are easy to follow
broad aggregate plots do not dominate the report when they are too coarse to support the main conclusions
duplicate sections, repeated intros, and stale wording are removed
short plot labels like n, r, or local environment names are explained clearly next to or on the relevant plots
timing plots clearly distinguish raw seconds from normalized rates
timing plots state the compute shape behind each point when that shape matters for interpretation (for example CPU-only vs GPU, CPUs, memory, wall-clock limit, or same-job-shape note)
standard single-run plots are present when available and useful
true individual episode plots are present when available and are clearly separated from training-path or checkpoint-path plots
experiment/comparison reports include explicit eval-time setup details:
checkpoint selection at eval
eval composition/overrides (for example learned-agent count/opponent setup)
domain-randomization behavior for key variables (for example fees/starting capital) at eval
explicit apples-to-apples comparability statement
Executive Visual Snapshot is present for quantitative reports and appears after Eval Setup
captions use Takeaway: lead-ins
captions are human-readable and do not default to raw uploaded file names
axis labels/units/legends are explicit and readable
plotted series are not downsampled (or any unavoidable downsampling is explicitly disclosed with method/factor/why)
time-series plots keep native artifact cadence (or any aggregation is explicitly disclosed with method/factor/why)
empirical charts are regenerated from original artifacts (no low-fidelity Notion-native chart DSL when source data exists)
limitations/reliability caveats are present and proportional
report is descriptive-only unless user explicitly requested recommendations
no table-of-contents block
no local path/hostname leakage
no personal machine paths in title/body/appendix; use repo-root-relative paths instead
exactly one footer attribution marker is present (Prepared with support from Codex and Claude. codex-managed: true) and there are no other Codex/Claude references in title/body
no literal markdown emphasis/code marker text remains in published Notion blocks (**...**, `...`)
visuals are positioned in the main report flow (typically under Executive Visual Snapshot) and not appended after appendix/footer
primary evidence sections are expanded in-place and not hidden behind toggles/collapsible sections unless explicitly requested
no third-party public file host usage unless explicitly approved by the user
image/file blocks are Notion-managed by default and resolve to non-empty payloads at verification time
if visuals are expected, upload capability is verified before publish (or placeholders are used when unavailable)
when run-level traces exist, Run Spotlight includes separate performance-path and inventory/book-shape visuals rather than only a single overview image
for collection-scope tasks, all in-scope report pages were audited one-by-one (no sampling)
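The non-empty-payload check for Notion-managed image/file blocks can be automated. A minimal sketch, assuming the file URLs have already been collected from the page's blocks (gathering them via the Notion API is out of scope here):

```python
import urllib.request

def verify_image_payloads(file_urls, timeout=10):
    """Return the subset of file_urls that do not resolve to a non-empty payload.

    A URL is treated as broken when the request errors out or the body is
    empty -- the signature of an expired or failed Notion-managed upload.
    """
    broken = []
    for url in file_urls:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                if not resp.read(1):  # empty body => broken/expired upload
                    broken.append(url)
        except OSError:
            broken.append(url)
    return broken
```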