Use when a quest needs one or more follow-up runs such as ablations, robustness checks, error analysis, or failure analysis after a main experiment.
Use this skill when follow-up evidence is needed after a durable result. The goal is to answer a bounded evidence question, not to keep opening more slices because they are imaginable. All supplementary experiments after a durable result use this shared protocol: ordinary analysis, review-driven evidence gaps, rebuttal-driven extra runs, and write-gap follow-up experiments.
Follow the shared interaction contract injected by the system prompt. Keep campaign updates brief unless evidence boundary, blocker state, cost, or next route changed materially. For ordinary active work, prefer a concise progress update once work has crossed roughly 6 tool calls with a human-meaningful delta, and do not drift beyond roughly 12 tool calls or about 8 minutes without a user-visible update. For meaningful long-running slices, include the estimated next reply time or next check-in window whenever defensible.
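The update cadence above can be sketched as a small check. This is a minimal illustration, not part of the interaction contract: the `WorkState` type, its field names, and `update_due` are hypothetical, and the thresholds are the rough figures from the text rather than hard limits.

```python
from dataclasses import dataclass

# Rough thresholds from the text; treat them as guides, not exact limits.
SOFT_TOOL_CALLS = 6    # prefer an update around here if something changed
HARD_TOOL_CALLS = 12   # do not drift beyond this without an update
HARD_MINUTES = 8.0     # or beyond this much wall-clock time


@dataclass
class WorkState:
    """Hypothetical bookkeeping since the last user-visible update."""
    tool_calls_since_update: int
    minutes_since_update: float
    has_meaningful_delta: bool  # assumption: a human-meaningful change exists


def update_due(state: WorkState) -> bool:
    """Return True when a user-visible progress update should be sent."""
    # Hard ceiling: never stay silent past ~12 tool calls or ~8 minutes.
    if state.tool_calls_since_update >= HARD_TOOL_CALLS:
        return True
    if state.minutes_since_update >= HARD_MINUTES:
        return True
    # Soft trigger: ~6 tool calls, but only with something worth reporting.
    return (state.tool_calls_since_update >= SOFT_TOOL_CALLS
            and state.has_meaningful_delta)
```

The soft trigger is gated on a meaningful delta so that routine work does not produce noise, while the hard ceiling fires regardless.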
- Do not use shell_command / command_execution in this skill.
- Route shell work through bash_exec(...).
- Reuse bash_exec sessions for long-running slices instead of relaunching blindly.

The agent owns the analysis path: slice ordering, workspace layout, filenames, environment route, monitoring strategy, and whether to use smoke, direct verification, or the real run first.
Do not treat PLAN.md, CHECKLIST.md, paper-matrix files, smoke tests, detached runs, tqdm, or a fixed phase order as required paths.
They are tactics.
The hard requirement is traceable slice-level evidence that changes, confirms, or blocks the evidence boundary of the parent claim and leaves an explicit next route.
Artifact state is not optional for supplementary experiments that launch slices. If a slice is launched as part of analysis-campaign, use the artifact flow below.
Analysis-campaign has a hard artifact boundary.
- Use artifact.create_analysis_campaign(...) with the currently justified slice list when the work is an analysis campaign, affects durable lineage, needs Canvas/branch visibility, supports paper/rebuttal/review claims, or has more than one slice.
- If artifact.create_analysis_campaign(...) returns slice worktrees, run each returned slice in its returned workspace unless a recorded reason makes another location more faithful.
- Record each launched slice through artifact.record_analysis_slice(...) with the honest outcome.
- Do not replace artifact.record_analysis_slice(...) with chat, memory, a local note, or a campaign summary for any launched slice.
- Use artifact.resolve_runtime_refs(...), artifact.get_analysis_campaign(...), artifact.get_quest_state(...), or artifact.list_paper_outlines(...) instead of guessing.

For writing-facing campaigns, include available paper-mapping fields such as selected_outline_ref, research_questions, experimental_designs, and todo_items when they exist and matter.
Treat campaign_id as system-owned, and treat slice_id / todo_id as agent-authored semantic ids.
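As a rough sketch of the id ownership rule and the optional paper-mapping fields, the helper below assembles a slice list before it is handed to artifact.create_analysis_campaign(...). The real argument schema of that call is not specified in this skill, so the dict layout and the `build_campaign_request` helper are assumptions for illustration. Only the field names selected_outline_ref, research_questions, experimental_designs, todo_items, slice_id, and campaign_id come from the text.

```python
from typing import Any, Dict, List, Optional

# Optional paper-mapping fields named in the text for writing-facing campaigns.
PAPER_MAPPING_FIELDS = (
    "selected_outline_ref",
    "research_questions",
    "experimental_designs",
    "todo_items",
)


def build_campaign_request(
    slices: List[Dict[str, Any]],
    paper_mapping: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Assemble a request dict (hypothetical layout) for a campaign."""
    if not slices:
        raise ValueError("a campaign needs at least one justified slice")
    seen = set()
    for s in slices:
        # campaign_id is system-owned, so the agent must not author it.
        if "campaign_id" in s:
            raise ValueError("campaign_id is system-owned; do not set it")
        # slice_id is an agent-authored semantic id and must be present.
        slice_id = s.get("slice_id")
        if not slice_id:
            raise ValueError("every slice needs an agent-authored slice_id")
        if slice_id in seen:
            raise ValueError(f"duplicate slice_id: {slice_id}")
        seen.add(slice_id)
    request: Dict[str, Any] = {"slices": list(slices)}
    for field in PAPER_MAPPING_FIELDS:
        value = (paper_mapping or {}).get(field)
        if value:  # include mapping fields only when they exist and matter
            request[field] = value
    return request
```

Empty mapping fields are dropped rather than sent as placeholders, which matches the "when they exist and matter" wording.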
An analysis campaign succeeds when it changes or confirms the evidence boundary of a parent claim with traceable slice-level evidence, preserves comparability or records why comparability broke, and leaves a durable next-route decision.
Before treating analysis as successful, all applicable gates must be true:
- the next route is explicit: experiment, return to idea, move to write, route through decision, stop, reset, or record a blocker

Do not aggregate campaign conclusions without per-run evidence. Do not bury null or contradictory findings.
Use the lightest route that preserves trust and downstream utility, including efficiency or cost questions when they affect the claim.
- analysis-lite: one clear follow-up question and one slice or very small slice set, including small highlight-validation or efficiency / cost checks when they affect the claim
- artifact-backed campaign: any launched supplementary slice that needs durable lineage, branch/worktree isolation, Canvas visibility, or later replay
- writing-facing campaign: evidence supports a selected outline, paper experiment matrix, evidence ledger, section, claim, or table
- review/rebuttal campaign: evidence answers reviewer pressure or audit findings
- failure-analysis route: evidence explains why a result failed, diverged, or became non-comparable

Useful slice classes:
- auxiliary: helps understand settings, thresholds, or mechanisms but does not carry the main claim by itself
- claim-carrying: directly affects whether the main narrative or route decision is justified
- supporting: broadens confidence or interpretability after the main claim is credible

Start the smallest route that can answer the current question.
Run claim-critical slices first and stop widening once the next route is clear.
For campaign prioritization and writing-facing slice design, read references/campaign-design.md; for mapping examples, read references/writing-facing-slice-examples.md.
For each meaningful slice, define and record:
Code-based, fully automatable analysis is preferred when it is the most faithful and repeatable path. Failure-bucket inspection, qualitative artifact review, extracted-text audits, reviewer-linked example checks, and table/figure consistency checks can still be valid when evidence is concrete, scoped, and reproducible enough for the claim. Do not present subjective judgment as objective measurement; record rubric, sample, prompt or inspection basis, caveats, and why it is sufficient.
evaluation_summary is the preferred stable routing summary for UI, Canvas, review, and rebuttal.
When useful, include takeaway, claim_update, baseline_relation, comparability, failure_mode, and next_action.
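One way to picture an evaluation_summary is as a small record with the optional fields listed above. This is only a sketch: the `EvaluationSummary` class, the free-text string types, and the `to_dict` helper are assumptions, since the skill names the fields but not their schema or allowed values.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, Optional


@dataclass
class EvaluationSummary:
    """Hypothetical container; field names are the ones listed in the text."""
    takeaway: Optional[str] = None
    claim_update: Optional[str] = None
    baseline_relation: Optional[str] = None
    comparability: Optional[str] = None
    failure_mode: Optional[str] = None
    next_action: Optional[str] = None

    def to_dict(self) -> Dict[str, Any]:
        # Keep only the fields that were actually filled in ("when useful").
        return {k: v for k, v in asdict(self).items() if v is not None}


# Usage: placeholder values only, not real findings.
summary = EvaluationSummary(
    takeaway="<one-sentence finding for this slice>",
    claim_update="narrowed",
    next_action="write",
)
payload = summary.to_dict()
```

Leaving unused fields out keeps the summary short enough to serve as a routing summary rather than a full report.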
Comparability is a hard boundary.
- If active_baseline_metric_contract_json exists, read it before defining slice success criteria or comparison tables when baseline comparison matters.
- If active_baseline_metric_contract_json exists, keep slice comparisons aligned with it unless the slice explicitly records why it differs.

A new dataset can be valid as a generalization, external-validity, stress-test, or limitation-boundary slice, but it must be labeled that way and must not replace the accepted baseline or main comparison contract.
If a slice needs an extra comparator baseline, place it under normal baseline roots, do not overwrite the canonical quest baseline gate, and record it through record_analysis_slice(..., comparison_baselines=[...]).
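The comparability rule can be sketched as a check of a slice's setup against the baseline metric contract. The contract schema is not given in this skill, so the flat key/value layout, the example keys (`metric`, `split`), the returned labels, and both helper names are hypothetical.

```python
from typing import Any, Dict, List, Optional


def contract_deviations(contract: Dict[str, Any],
                        slice_setup: Dict[str, Any]) -> List[str]:
    """Return the contract keys whose values differ in the slice setup."""
    return sorted(k for k, v in contract.items() if slice_setup.get(k) != v)


def check_comparability(contract: Dict[str, Any],
                        slice_setup: Dict[str, Any],
                        recorded_reason: Optional[str] = None) -> str:
    """Classify a slice as aligned, or deviating with a recorded reason."""
    deviations = contract_deviations(contract, slice_setup)
    if not deviations:
        return "aligned"
    if recorded_reason:
        # A deviation is acceptable only when the slice records why it differs.
        return "deviates_with_reason"
    raise ValueError(f"slice deviates from contract without a reason: {deviations}")


# Usage with placeholder contract values (assumed layout).
contract = {"metric": "accuracy", "split": "test"}
status = check_comparability(
    contract,
    {"metric": "accuracy", "split": "val"},
    recorded_reason="stress-test slice on a different split",
)
```

Extra keys in the slice setup are ignored here; only keys the contract pins down are compared.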
Paper-facing evidence must be write-backable. Paper-ready slices must map cleanly back to a selected outline, paper experiment matrix, evidence ledger, section, claim, table, reviewer item, or rebuttal item.
- main_required / main_text
- appendix
- reference_only
- If paper/paper_experiment_matrix.md exists and the campaign supports the paper, read it before launching or reordering slices
- exp_id, todo_id, or slice_id
- paper_role, section_id, item_id, and claim_links

If no selected outline exists yet but the evidence question decides whether writing is worthwhile, run it as pre-outline analysis and route to write or decision afterward.
Durable records are required in substance, not in fixed filenames. For multi-slice, writing-facing, route-changing, expensive, unstable, or long-running analysis, leave a route record with:
PLAN.md, CHECKLIST.md, paper/paper_experiment_matrix.md, and local matrix/checklist files are control surfaces, not mandatory success paths.
Use references/campaign-plan-template.md and references/campaign-checklist-template.md only when they reduce ambiguity.
- 0-2 default budget; do not repeat unchanged checks.
- bash_exec(mode='detach', ...) plus managed monitoring.
- bash_exec(mode='read', id=...) returns full logs when 2000 lines or fewer; longer logs return the first 500 lines plus the last 1500 lines.
- Use bash_exec(mode='read', id=..., start=..., tail=...) for omitted middle sections.
- Use tail_limit=..., order='desc', then after_seq=last_seen_seq for incremental reads.
- bash_exec(mode='history').
- Use silent_seconds, progress_age_seconds, signal_age_seconds, and watchdog_overdue as stall checks when available.
- bash_exec(mode='kill', id=..., wait=true, timeout_seconds=...).
- bash_exec(mode='await', id=..., timeout_seconds=...); otherwise use bash_exec(command='sleep N', mode='await', timeout_seconds=N+buffer, ...) and do not set timeout_seconds exactly equal to N.
- Prefer bash_exec(mode='await', id=..., timeout_seconds=...) instead of starting a new sleep command when you are already waiting on a launched slice.
- Use a tqdm progress reporter and concise structured progress markers when feasible.

Do not treat analysis as successful when:
If the same failure class appears again without a real route or evidence change, stop widening and route through decision, write, experiment, or an explicit blocker.
If two slices in a row fail to change claim boundary, matrix frontier, or next route, stop widening and route through decision, write, experiment, or an explicit blocker.
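The two stop-widening rules above can be sketched as a single check over the slice history. The `SliceOutcome` type, its fields, and `should_stop_widening` are hypothetical names for illustration; how a slice is judged to have "changed" anything remains the agent's call.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SliceOutcome:
    """Hypothetical per-slice record used only for this sketch."""
    # Did the slice change claim boundary, matrix frontier, or next route?
    changed_boundary: bool
    # Failure class label if the slice failed, else None.
    failure_class: Optional[str] = None
    # Was there a real route or evidence change before this attempt?
    route_or_evidence_changed: bool = False


def should_stop_widening(history: List[SliceOutcome]) -> bool:
    """Return True when the campaign should stop widening and pick a route."""
    if len(history) < 2:
        return False
    last = history[-1]
    earlier = history[:-1]
    # Rule 1: the same failure class appears again without a real change.
    repeated_failure = (
        last.failure_class is not None
        and any(o.failure_class == last.failure_class for o in earlier)
        and not last.route_or_evidence_changed
    )
    if repeated_failure:
        return True
    # Rule 2: two slices in a row failed to change anything.
    return not history[-2].changed_boundary and not last.changed_boundary
```

When the check returns True, the next step is a route through decision, write, experiment, or an explicit blocker, not another slice.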
A blocked campaign must state the failure class, what was tried, evidence paths, and next best action.
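For the monitoring rules stated earlier for bash_exec(mode='read', id=...) and sleep-based waiting, the arithmetic can be sketched as below. This models only the documented numbers (2000, 500, 1500, and timeout_seconds=N+buffer). The function names, the 0-based range convention, and the default 30-second buffer are assumptions, and none of this calls bash_exec itself.

```python
from typing import List, Optional, Tuple

FULL_LOG_LIMIT = 2000  # logs at or under this many lines come back whole
HEAD_LINES = 500       # first lines kept for longer logs
TAIL_LINES = 1500      # last lines kept for longer logs


def default_read_window(lines: List[str]) -> List[str]:
    """Model what a default read returns for a log of the given lines."""
    if len(lines) <= FULL_LOG_LIMIT:
        return list(lines)
    return lines[:HEAD_LINES] + lines[-TAIL_LINES:]


def omitted_range(total_lines: int) -> Optional[Tuple[int, int]]:
    """Return the omitted middle as a 0-based half-open range, or None."""
    if total_lines <= FULL_LOG_LIMIT:
        return None
    return (HEAD_LINES, total_lines - TAIL_LINES)


def sleep_timeout(sleep_seconds: int, buffer_seconds: int = 30) -> int:
    """Timeout for a 'sleep N' wait: strictly larger than N itself."""
    if buffer_seconds <= 0:
        raise ValueError("buffer must be positive so timeout != sleep length")
    return sleep_seconds + buffer_seconds
```

`omitted_range` shows which middle section a follow-up read with start/tail would need to cover. How those parameters index lines is not stated in the skill, so check before relying on the 0-based convention used here.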
Campaign reporting should classify stable support, partial support, contradiction, and unresolved ambiguity.
It should state whether the main claim is strengthened, weakened, narrowed, abandoned, or still ambiguous.
Summarize the top 3-5 findings first when there are many slices.
Use memory only to avoid repeated failures or preserve reusable campaign lessons.
At stage end, write memory.write(...) only for durable cross-slice lessons, failure patterns, or comparability caveats.
Exit once one of these is durably true:
- experiment, idea, baseline recovery, or decision

A good campaign closes when the claim got stronger, weaker, narrower, abandoned, or clearly stuck, not when more slice ideas remain possible.