Use when a quest is ready for a concrete implementation pass or a main experiment run tied to a selected idea and an accepted baseline.
Use this skill for the main evidence-producing runs of the quest. The goal is to turn one selected route into one trustworthy measured result with the smallest valid amount of execution.
Follow the shared interaction contract injected by the system prompt. Keep run updates brief unless the measured result, blocker state, or next route changed materially. For ordinary active work, prefer a concise progress update once work has crossed roughly 6 tool calls with a human-meaningful delta, and do not drift beyond roughly 12 tool calls or about 8 minutes without a user-visible update.
Use quest/workspace planning files only when they help control a non-trivial run; otherwise keep the run contract small and move to the first decisive execution step.
Tool usage in this skill:
- Use bash_exec(...) instead of raw shell_command / command_execution.
- Prefer artifact.git(...) before raw shell git commands.
- Route execution through bash_exec(...), not native shell tools.

The experiment stage should turn a selected idea into auditable evidence. It should preserve the strongest established experiment-planning and execution discipline:
The experiment stage is not just "run code".
It is the stage that converts an idea contract into evidence that other stages can trust.
It is also the stage that should decide the next route once the measured result exists.
Within the user's explicit constraints, maximize valid evidence per unit time and compute.
Prefer equivalence-preserving efficiency upgrades first: larger safe batch size, mixed precision, gradient accumulation, dataloader workers, cache reuse, checkpoint resume, precomputed features, and smaller pilots.
If a proposed efficiency change alters optimization dynamics, effective budget, or baseline comparability, treat it as a real experiment change and record it as such.
For a comparison_ready baseline, verify-local-existing, attach, or import should usually beat full reproduction.
Use references/evidence-ladder.md when deciding whether the current package is merely executable, solid enough to carry the main claim, or already in the stage where broader polish is justified.
Completing one main run is not quest completion. After reporting the run, keep moving to iterate, analyze, write, or finalize unless a genuine blocking decision remains.
When the quest is algorithm-first, treat experiment as the execution surface of optimize, not as the terminal goal of the workflow.
After a measured result, the default next move is frontier review and optimize-side route selection rather than paper packaging.
At stage start:
- Summarize the selected idea in 1-2 sentences and confirm the baseline comparison contract.
- Create PLAN.md and CHECKLIST.md: PLAN.md to lock the concrete run path, CHECKLIST.md as the living execution surface.
- End each pass with a 1-2 sentence outcome summary.

Keep the run loop simple: define the smallest valid run contract, use at most a bounded smoke or pilot when needed, run the real experiment, record the result, and route explicitly from the measured evidence.
Execute the main run on a dedicated run/* branch. After artifact.record_main_experiment(...), route from the measured result:
prefer optimize or decision for frontier review before launching another large run.

Before a main run starts, confirm:
- a run/* target branch or isolated worktree for this exact main experiment

If any of these are materially unknown, stop and resolve them through decision.
Two execution tiers are acceptable:
- lightweight run
- durable main run
PLAN.md / CHECKLIST.md contract and dedicated run/* surface:

Before substantial implementation work or a real main run, create a quest-visible PLAN.md and CHECKLIST.md.
- Use references/main-experiment-plan-template.md as the canonical structure for PLAN.md.
- Use references/main-experiment-checklist-template.md as the canonical structure for CHECKLIST.md.
- Treat PLAN.md and CHECKLIST.md as the canonical planning-and-control surface before and during execution.
- PLAN.md should lead with the selected idea summarized in 1-2 sentences and include the baseline and comparability rules, safe efficiency levers, minimal code-change map, smoke or pilot path, full-run path, fallback options, monitoring and sleep rules, expected outputs, and a revision log.
- When the plan changes materially, update PLAN.md before spending more code or compute.
- Only modify the active quest workspace for this experiment line.
Respect explicit resource limits and record real environment or dependency constraints, but do not stop the run early just to over-document them.
Use:
- bash_exec session ids, progress markers, and exported logs from the actual run

Do not claim run success without durable outputs.
A meaningful experiment pass should leave behind:
- artifacts/experiment/<run_id>/ or the quest-equivalent canonical location
- artifact_manifest.json, run_manifest.json, metrics.json, and summary.md
- metrics.md and runlog.summary.md for durable main runs
- the exported bash.log

Recommended additional files:
- claim_validation.md

run_manifest.json should capture at least:
- run_id

If a command needed for environment capture is unavailable, record that gap in the manifest and summary.
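A minimal sketch of writing such a manifest. Only run_id is mandated above; any additional fields (seed, commit, and so on) are caller-supplied illustrations, not required names:

```python
import json
from pathlib import Path

def write_run_manifest(run_dir: str, run_id: str, **extra_fields) -> dict:
    """Write run_manifest.json with the required run_id plus any
    optional caller-supplied fields (e.g. seed, git commit)."""
    manifest = {"run_id": run_id, **extra_fields}
    path = Path(run_dir)
    path.mkdir(parents=True, exist_ok=True)
    (path / "run_manifest.json").write_text(json.dumps(manifest, indent=2))
    return manifest
```

If an environment-capture field cannot be filled, pass an explicit gap marker rather than omitting it silently, in line with the rule above.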
Before implementation or execution, state:
- run_id
- whether the run is auxiliary/dev or main/test

Prefer to write this contract first in PLAN.md using references/main-experiment-plan-template.md, then keep the current execution state visible in CHECKLIST.md using references/main-experiment-checklist-template.md.
For substantial runs, also record the following seven experiment fields early and keep them updated during execution:
If the run contract changes materially later, record the change durably.
Treat the run contract as a research question contract, not only an execution checklist. Before coding, be able to explain:
If multiple candidate experiment packages exist, prefer the one with the best balance of technical feasibility, research importance, and methodological rigor. Do not choose a package only because it sounds ambitious.
For paper-facing lines, default to this evidence ladder:
- auxiliary/dev
- main/test
- minimum -> solid -> maximum
Before editing or executing:
- If active_baseline_metric_contract_json is present, read that JSON file before planning commands or comparisons.
- Treat active_baseline_metric_contract_json as the default authoritative baseline comparison contract unless you record a concrete reason to override it.

If a repeated failure pattern already exists, apply the mitigation first and record that choice.
Also confirm before comparison work:
- Read active_baseline_metric_contract_json when that file is available.
- Report every metric required by active_baseline_metric_contract_json; extra metrics are allowed, but missing required metrics are not.
- If the run is main/test and superiority is likely to be claimed, define the significance-testing plan before execution rather than after seeing the numbers.
- If Result/metric.md was used during the run, treat it as optional scratch memory only and reconcile it against the final submitted metrics before artifact.record_main_experiment(...).

Before you begin a substantial run, send a concise threaded artifact.interact(kind='progress', ...) update naming:
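The required-metrics rule can be sketched as a pre-submission check. The required_metrics key is a hypothetical field name for illustration; the document does not specify the contract's JSON schema:

```python
import json

def missing_required_metrics(contract_path: str, submitted: dict) -> list:
    """Return required metric names absent from the submitted metrics.
    Extra submitted metrics are fine; missing required ones are not."""
    with open(contract_path) as f:
        contract = json.load(f)
    required = contract.get("required_metrics", [])
    return [name for name in required if name not in submitted]
```

Run this before artifact.record_main_experiment(...); a non-empty result means the run should be recorded as partial or blocked rather than complete.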
Switch from ordinary execution mode into diagnosis mode when any of the following becomes true:
In diagnosis mode:
The normal experiment workspace is the current active idea worktree returned by artifact.submit_idea(...).
- When the direction changes, call artifact.submit_idea(mode='create', lineage_intent='continue_line'|'branch_alternative', ...) instead of silently mutating the old node.
- Record the run through artifact.record_main_experiment(...) before moving to analysis or writing.
- Choose explicitly between a continue_line child branch and a branch_alternative sibling-like branch.
- After artifact.record_main_experiment(...), if QQ milestone media is enabled and the metrics are stable enough to summarize honestly, prefer one concise summary PNG over multiple attachments.

Implementation rules:
- Follow PLAN.md instead of repeatedly improvising a new method after each small observation.

Prefer to complete one experiment cleanly before expanding to the next, unless parallel execution is explicitly justified and isolated. For substantial experiment packages, the default is one experiment at a time, with each one reaching a recoverable recorded state before the next begins.
Retry-delta discipline:
- Do not retry without a concrete delta; if no new delta exists, route to decision.
- If the same failure repeats after mitigation, escalate to decision.

Run with auditable commands and durable outputs.
Execution rules:
- Use managed bash_exec instead of ephemeral shell invocations.
- Do not ignore active_baseline_metric_contract_json silently when that file exists.

You may do a quick sanity run first, but if the stage goal is a real experiment you must continue to the real evaluation unless the run is blocked and recorded.
Pilot-before-scale rule:
Incremental-recording rule:
- Update CHECKLIST.md alongside those durable notes so the current execution frontier is obvious without replaying the whole log.
- Record the outcome as success, partial, or failure.
- Record the idea_id, branch, and run_id.

Last-known-good rule:
For commands that may run longer than a few minutes:
- Keep smoke or pilot runs to 0-2 for the current experiment pass, and treat them as part of that 0-2 budget rather than as a mandatory separate phase.
- Launch the long run with bash_exec(mode='detach', ...) and normally leave timeout_seconds unset for that long run.
- bash_exec(mode='read', id=...) returns the full rendered log when it is 2000 lines or fewer; for longer logs it returns the first 500 lines plus the last 1500 lines and a hint to inspect omitted sections with start and tail.
- Inspect specific sections with bash_exec(mode='read', id=..., start=..., tail=...).
- Use bash_exec(mode='list') and bash_exec(mode='read', id=..., tail_limit=..., order='desc') to monitor or revisit managed commands while focusing on the newest evidence first.
- Use bash_exec(mode='read', id=..., after_seq=last_seen_seq, tail_limit=..., order='asc') so later checks only fetch new evidence.
- Review earlier commands with bash_exec(mode='history').
- Attach a comment such as {stage, goal, action, expected_signal, next_check} to each managed command.
- Treat silent_seconds, progress_age_seconds, signal_age_seconds, and watchdog_overdue from bash_exec(mode='list'|'read', ...) as your default watchdog signals.
- Back off between checks: wait 60s, then inspect logs; 120s, then inspect logs; 300s, then inspect logs; 600s, then inspect logs; 1800s, then inspect logs; cap at 1800s while the run is still active.
- Wait with bash_exec(command='sleep 60', mode='await', timeout_seconds=70) or bash_exec(mode='await', id=..., timeout_seconds=...) between checks.
- When sleeping for N seconds, call bash_exec(command='sleep N', mode='await', timeout_seconds=N+buffer, ...); never set timeout_seconds exactly equal to N.
- To keep waiting on an already-running command, use bash_exec(mode='await', id=..., timeout_seconds=...) instead of starting a new sleep command.
- Send artifact.interact(kind='progress', ...) when the user-visible state, frontier, blocker status, or ETA materially changed.
- To stop a run, use bash_exec(mode='kill', id=..., wait=true, timeout_seconds=...); if it must die immediately, add force=true, record the reason, fix the issue, and relaunch cleanly.
- Prefer a tqdm progress reporter and concise structured progress markers when feasible.

Always preserve the managed bash_exec log and export it into the experiment artifact directory when the run artifact is written.
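The check-interval backoff can be sketched as a small generator. The 60/120/300/600/1800 steps and the 1800s cap come from the schedule stated above; the generator shape itself is just one illustrative way to drive the wait loop:

```python
def watchdog_delays(cap: int = 1800):
    """Yield sleep intervals between log inspections: back off through
    the 60/120/300/600/1800 schedule, then hold at the cap while the
    run is still active."""
    for delay in (60, 120, 300, 600, 1800):
        yield min(delay, cap)
        if delay >= cap:
            break
    while True:
        yield cap
```

Each yielded value would become the N in a sleep-and-await step, with the await timeout set to N plus a small buffer, never exactly N.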
If the run emits progress markers, keep them concise and machine-readable instead of narrating every low-level update in chat.
When a real checkpoint is reached, include the estimated next reply time as next_reply_at when that is honestly knowable.
After the run, verify:
Create a durable claim-validation record that maps:
- supported
- refuted
- inconclusive

Also verify baseline comparability before claiming deltas:
Every meaningful main run must be recorded through artifact.record_main_experiment(...).
That call is responsible for writing:
- experiments/main/<run_id>/RUN.md
- experiments/main/<run_id>/RESULT.json
- the run artifact payload

artifact.record_main_experiment(...) should include at least:
- run_id
- metrics_summary
- metric_rows when available
- evaluation_summary with exactly these six fields:
- takeaway
- claim_update
- baseline_relation
- comparability
- failure_mode
- next_action

Use evaluation_summary as the short structured judgment layer on top of the longer narrative fields:
- takeaway: one sentence the next reader can reuse directly
- claim_update: strengthens, weakens, narrows, or neutral
- baseline_relation: better, worse, mixed, or not_comparable
- comparability: high, medium, or low
- failure_mode: none, implementation, evaluation, environment, or direction
- next_action: the immediate route such as continue, revise_idea, analysis_campaign, write, or stop

After artifact.record_main_experiment(...) succeeds, do not assume the same branch should absorb the next round by default.
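The six-field contract above can be checked mechanically before submission. The field names and allowed values are taken from this document; the validator itself is an illustrative sketch, not part of the artifact API:

```python
# Allowed enum values for the constrained evaluation_summary fields;
# takeaway and next_action are free-form and only checked for presence.
ALLOWED = {
    "claim_update": {"strengthens", "weakens", "narrows", "neutral"},
    "baseline_relation": {"better", "worse", "mixed", "not_comparable"},
    "comparability": {"high", "medium", "low"},
    "failure_mode": {"none", "implementation", "evaluation", "environment", "direction"},
}
REQUIRED_FIELDS = ("takeaway", "claim_update", "baseline_relation",
                   "comparability", "failure_mode", "next_action")

def validate_evaluation_summary(summary: dict) -> list:
    """Return a list of problems; an empty list means well-formed."""
    problems = [f"missing field: {f}" for f in REQUIRED_FIELDS if f not in summary]
    for field, allowed in ALLOWED.items():
        if field in summary and summary[field] not in allowed:
            problems.append(f"bad value for {field}: {summary[field]!r}")
    return problems
```

Running this just before artifact.record_main_experiment(...) keeps the structured judgment layer honest without relying on downstream stages to catch malformed values.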
Interpret the measured result first, then either:
Use artifact.create_analysis_campaign(...) only when the extra slices have clear academic or claim-level value relative to their resource cost.
If the main need is simply to continue optimization from a measured result, prefer a new durable child idea branch instead of an expensive analysis package by reflex.
If the extra work should happen on an older durable branch rather than the current head, first switch the runtime back there with artifact.activate_branch(...), then launch the analysis campaign from that activated workspace.
When artifact.record_main_experiment(...) succeeds, send a richer threaded artifact.interact(kind='milestone', ...) update rather than a generic one-line progress ping.
Lead that milestone with a concise 1-2 sentence outcome summary before expanding into more detail.
That milestone should state:
Do not treat a main run as durably complete until artifact.record_main_experiment(...) succeeds.
Recommended per-run documentation fields:
For durable main runs, these seven fields should be progressively filled as the run advances, not only at final packaging time. For lightweight runs, a shorter summary is acceptable if the route remains obvious and the result is still durably recorded.
RUN.md should make it easy for later stages to answer:
Recording rules:
The experiment stage should normally end with one of:
Do not let the stage end without an explicit next direction. If analysis is selected, record why the expected information gain is strong enough to justify the added compute, time, or annotation budget.
A credible main run should satisfy:
If the result is confounded, say so directly.
Before marking the run complete, verify all of the following:
- the submitted metrics satisfy active_baseline_metric_contract_json when that file exists

If these checks fail, record the run as partial or blocked rather than pretending it is complete.
Stage-start requirement:
- Call memory.list_recent(scope='quest', limit=5).
- Use memory.search(...) before reopening a previously tested command path or retrying an old run.

Stage-end requirement:
- Call memory.write(...) before leaving the stage.

Use memory only to avoid repeating known failures or to preserve reusable experiment lessons; the canonical run record belongs in artifact.
Interaction kinds to use:
- progress for long-running execution updates
- artifact.record_main_experiment(...) for each meaningful completed main experiment
- report for suspicious-result investigations or analysis-rich summaries when they materially help the next route
- decision for continue / branch / analysis / write / reset / stop
- approval when an explicit user approval is captured for an expensive or risky run change
- artifact.checkpoint(...) when code evolution is meaningful and should be preserved in Git
- artifact.interact(kind='progress' | 'milestone', ...) so the user sees the concrete result and next step

A failed main run is still useful if it is explained well.
Record what was attempted, where the failure occurred, whether it was methodological or infrastructural, what retry/branch/reset is justified, and the single best next action.
Prefer a primary failure type such as data_contract_mismatch, resource_exhausted, numeric_instability, implementation_bug, evaluation_pipeline_failure, external_dependency_blocked, or direction_underperforming.
Also classify the broader failure layer when possible: implementation, evaluation, environment, or direction.
Blocked experiment states commonly include missing baseline reference, unknown metric contract, environment failure, run failure before metrics, or metrics that are not comparable.
When results are suspicious, fix the subset and seeds, isolate preprocessing/model/training/evaluation one by one, compare intermediate outputs on the same inputs, and run the cheapest discriminative check before another full retry.
Exit the experiment stage once one of the following is durably true:
- the next route is explicit: analysis-campaign, write, another experiment, or reset

A good experiment pass leaves one interpretable result or one explicit blocker, not another vague promise to rerun later.