Workflow 1.5: Bridge between idea discovery and auto review. Reads EXPERIMENT_PLAN.md, implements experiment code, deploys to GPU, collects initial results. Use when user says "实现实验", "implement experiments", "bridge", "从计划到跑实验", "deploy the plan", or has an experiment plan ready to execute.
Implement and deploy experiments from plan: $ARGUMENTS
This skill bridges Workflow 1 (idea discovery + method refinement) and Workflow 2 (auto review loop). It takes the experiment plan and turns it into running experiments with initial results.
```
Workflow 1 output:                   This skill:                          Workflow 2 input:
refine-logs/EXPERIMENT_PLAN.md    →  implement code                   →   initial results ready
refine-logs/EXPERIMENT_TRACKER.md    → GPT-5.4 review (cross-model)       for /auto-review-loop
refine-logs/FINAL_PROPOSAL.md        → deploy (/run-experiment) → collect
```
Arguments:
- CODE_REVIEW (default: true) — set false to skip the cross-model code review.
- AUTO_DEPLOY (default: false) — when false, pause at the pre-deploy checkpoint.
- BASE_REPO (default: none) — when unset, write code from scratch or reuse existing project files; when set, clone and extend that repo.
- COMPACT (default: false) — when true, (1) read idea-stage/IDEA_CANDIDATES.md instead of the full idea-stage/IDEA_REPORT.md if available, (2) append experiment results to EXPERIMENT_LOG.md after collection.

Override via arguments, e.g.:
/experiment-bridge "EXPERIMENT_PLAN.md" — compact: true, base repo: https://github.com/org/project
This skill expects one or more of:
- refine-logs/EXPERIMENT_PLAN.md (best) — claim-driven experiment roadmap from /experiment-plan
- refine-logs/EXPERIMENT_TRACKER.md — run-by-run execution table
- refine-logs/FINAL_PROPOSAL.md — method description for implementation context
- idea-stage/IDEA_CANDIDATES.md — compact idea summary (preferred when COMPACT: true); fall back to ./IDEA_CANDIDATES.md if not found
- idea-stage/IDEA_REPORT.md — full brainstorm output; fall back to ./IDEA_REPORT.md if not found

If none exist, ask the user what experiments to implement.
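As a sketch, the fallback resolution above could be implemented like this (the `resolve_inputs` helper and its return shape are illustrative, not part of any skill API):

```python
from pathlib import Path

# Preferred path → fallback path, in the priority order listed above.
INPUTS = [
    ("refine-logs/EXPERIMENT_PLAN.md", None),
    ("refine-logs/EXPERIMENT_TRACKER.md", None),
    ("refine-logs/FINAL_PROPOSAL.md", None),
    ("idea-stage/IDEA_CANDIDATES.md", "./IDEA_CANDIDATES.md"),
    ("idea-stage/IDEA_REPORT.md", "./IDEA_REPORT.md"),
]

def resolve_inputs(root: str = ".") -> dict:
    """Return the input files that actually exist, honoring fallbacks."""
    found = {}
    base = Path(root)
    for primary, fallback in INPUTS:
        for candidate in filter(None, (primary, fallback)):
            p = base / candidate
            if p.is_file():
                found[Path(primary).name] = p
                break  # primary found; don't check the fallback
    return found  # empty dict → ask the user what to implement
```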
Read EXPERIMENT_PLAN.md and extract the milestone structure, the must-run vs. nice-to-have experiments, and the GPU-hour estimates. Also read FINAL_PROPOSAL.md — what exactly to implement.

Present a brief summary:
```
📋 Experiment plan loaded:
- Milestones: [N] (sanity → baseline → main → ablation)
- Must-run experiments: [N]
- Nice-to-have: [N]
- Estimated GPU-hours: [X]

Proceeding to implementation.
```
If BASE_REPO is set — clone the repo first:
```bash
git clone <BASE_REPO> base_repo/
# Read the repo's README, understand its structure, find entry points
# Implement experiments by modifying/extending this codebase
```
For each milestone (in order), write the experiment scripts:
Check existing code — scan the project (or cloned base_repo/) for existing experiment scripts, model code, data loaders. Reuse as much as possible.
Implement missing pieces:
Follow the plan's run order — implement sanity-stage experiments first, then baselines, then main method, then ablations.
Self-review before deploying:
Skip this step if CODE_REVIEW is false.
Before deploying, send the experiment code to GPT-5.4 xhigh for review:
```yaml
mcp__codex__codex:
  config: {"model_reasoning_effort": "xhigh"}
  prompt: |
    Review the following experiment implementation for correctness.

    ## Experiment Plan:
    [paste key sections from EXPERIMENT_PLAN.md]

    ## Method Description:
    [paste from FINAL_PROPOSAL.md]

    ## Implementation:
    [paste the experiment scripts]

    Check for:
    1. Does the code correctly implement the method described in the proposal?
    2. Are all hyperparameters from the plan reflected in the code?
    3. Are there any logic bugs (wrong loss function, incorrect data split, missing eval)?
    4. Is the evaluation metric computed correctly?
    5. **CRITICAL: Does evaluation use the dataset's actual ground truth labels — NOT another model's output as ground truth?** This is a common and severe bug.
    6. Any potential issues (OOM risk, numerical instability, missing seeds)?

    For each issue found, specify: CRITICAL / MAJOR / MINOR and the exact fix.
```
On review results: fix all CRITICAL and MAJOR issues before deploying; MINOR issues may be deferred.
Before deploying the full experiment suite, run the sanity-stage experiment:
/run-experiment [sanity experiment command]
Wait for completion. Verify the sanity-stage success criteria from EXPERIMENT_PLAN.md.
If sanity fails → auto-debug before giving up (max 3 attempts):
1. Run /codex:rescue to get a second opinion on the root cause. Codex independently reads the code and error logs — it may spot issues Claude missed (wrong tensor shapes, subtle import shadowing, config mismatches, etc.). Apply its suggested fix, then re-run.
2. If /codex:rescue is not available (plugin not installed), continue with Claude's own diagnosis.

Never give up on the first failure. Most experiment crashes are fixable without human intervention.
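The auto-debug policy above, sketched in Python (all three callables stand in for the real /run-experiment and /codex:rescue invocations; names are illustrative):

```python
MAX_ATTEMPTS = 3

def run_sanity_with_autodebug(run, diagnose_with_codex, diagnose_with_claude):
    """Run the sanity experiment, auto-debugging up to MAX_ATTEMPTS times.

    `run` returns (ok, error_log); each diagnose callback returns a callable
    fix to apply, or None. All three are placeholders for the real skill calls.
    """
    for attempt in range(1, MAX_ATTEMPTS + 1):
        ok, error_log = run()
        if ok:
            return True
        # Prefer Codex's second opinion; fall back to Claude's own diagnosis.
        fix = diagnose_with_codex(error_log) or diagnose_with_claude(error_log)
        if fix is None:
            break  # no actionable fix found
        fix()  # apply the suggested fix, then re-run on the next iteration
    return False  # escalate to the human only after exhausting attempts
```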
Deploy experiments following the plan's milestone order. Route by job count:
Small batch (≤5 jobs per milestone) → use /run-experiment directly:
/run-experiment [experiment commands]
Large batch (≥10 jobs, multi-seed sweeps, or phase dependencies) → use /experiment-queue for proper orchestration:
/experiment-queue [grid spec or manifest]
Auto-routing rule: if any milestone in EXPERIMENT_PLAN.md declares ≥10 jobs (e.g., seeds: [42, 200, 201, ...] × N: [64, 128, 256] × n: [50K, 150K, 500K, 652K] = 36 jobs) or declares teacher→student phase dependencies, route that milestone to /experiment-queue. Otherwise use /run-experiment.
/experiment-queue adds: OOM-aware retry with backoff, stale-screen cleanup, wave-transition race prevention, phase dependency enforcement, crash-safe state persistence in queue_state.json. See skills/experiment-queue/SKILL.md for the manifest YAML format.
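The ≥10-job threshold is simple cross-product arithmetic; a sketch (the `job_count` and `route` helpers are illustrative, and the grid assumes exactly three seeds, consistent with the 36-job example above):

```python
from math import prod

def job_count(grid: dict) -> int:
    """Number of jobs in a cross-product grid spec."""
    return prod(len(values) for values in grid.values())

def route(grid: dict, has_phase_deps: bool = False) -> str:
    """Route a milestone per the auto-routing rule: ≥10 jobs or
    phase dependencies → /experiment-queue, otherwise /run-experiment."""
    if has_phase_deps or job_count(grid) >= 10:
        return "/experiment-queue"
    return "/run-experiment"

grid = {  # the 3 × 3 × 4 = 36-job example from the plan
    "seed": [42, 200, 201],
    "N": [64, 128, 256],
    "n": ["50K", "150K", "500K", "652K"],
}
```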
For each milestone:
- Launch at the appropriate parallelism (GPU count for /run-experiment, or max_parallel from the manifest for /experiment-queue).
- Use /monitor-experiment to track progress (it reads from queue_state.json if /experiment-queue is active).

🚦 Checkpoint (if AUTO_DEPLOY = false):
```
🔧 Code implementation complete. Ready to deploy:
Milestone 0 (sanity): [status — passed/pending]
Milestone 1 (baseline): [N experiments, ~X GPU-hours]
Milestone 2 (main method): [N experiments, ~X GPU-hours]
Milestone 3 (ablations): [N experiments, ~X GPU-hours]
Total estimated: ~X GPU-hours on [N] GPUs

Deploy now? Or review the code first?
```
As experiments complete:
- If W&B logging is configured (wandb: true and wandb_project), invoke /training-check to detect NaN, loss divergence, plateaus, or overfitting. If W&B is not configured, skip silently.
- Update refine-logs/EXPERIMENT_TRACKER.md — fill in the Status and Notes columns.
- Write refine-logs/EXPERIMENT_RESULTS.md:

```markdown
# Initial Experiment Results
**Date**: [today]
**Plan**: refine-logs/EXPERIMENT_PLAN.md

## Results by Milestone

### M0: Sanity — PASSED
- [result]

### M1: Baselines
| Run  | System     | Key Metric | Status |
|------|------------|------------|--------|
| R001 | baseline_1 | X.XX       | DONE   |

### M2: Main Method
| Run  | System     | Key Metric | Status |
|------|------------|------------|--------|
| R003 | our_method | X.XX       | DONE   |

### M3: Ablations
...

## Summary
- [X/Y] must-run experiments completed
- Main result: [positive/negative/inconclusive]
- Ready for /auto-review-loop: [YES/NO]

## Next Step
→ /auto-review-loop "[topic]"
```
Skip entirely if COMPACT is false.
Append each completed experiment to EXPERIMENT_LOG.md:

```markdown
## [Run ID] — [timestamp]
- **System**: [method name]
- **Config**: [key hyperparameters]
- **Result**: [primary metric = X.XX]
- **Verdict**: [positive / negative / inconclusive]
- **Reproduce**: `python train.py --config configs/run_id.yaml --seed 42`
```
This structured log survives session recovery — downstream skills read it instead of parsing screen output.
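A minimal sketch of that append step (field names follow the template above; `append_log_entry` is an illustrative helper, and the timestamp format is an assumption):

```python
from datetime import datetime, timezone

LOG_PATH = "EXPERIMENT_LOG.md"

def append_log_entry(run_id, system, config, result, verdict, reproduce_cmd,
                     path=LOG_PATH):
    """Append one completed run to EXPERIMENT_LOG.md in the template format."""
    ts = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    entry = (
        f"\n## {run_id} — {ts}\n"          # "—" separator mirrors the template
        f"- **System**: {system}\n"
        f"- **Config**: {config}\n"
        f"- **Result**: {result}\n"
        f"- **Verdict**: {verdict}\n"
        f"- **Reproduce**: `{reproduce_cmd}`\n"
    )
    # Append-only writes survive session recovery: nothing is overwritten.
    with open(path, "a", encoding="utf-8") as f:
        f.write(entry)
```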
After main experiments (M2) complete with positive results, invoke /ablation-planner to design ablation studies. It reads and updates refine-logs/EXPERIMENT_PLAN.md and refine-logs/EXPERIMENT_TRACKER.md. If /ablation-planner is not available, skip silently — the existing EXPERIMENT_PLAN.md ablation blocks (if any) remain unchanged.
Present final status:
```
🔬 Experiment bridge complete:
- Implemented: [N] experiment scripts
- Deployed: [N] experiments on [M] GPUs
- Completed: [X/Y] must-run, [A/B] nice-to-have
- Main result: [one sentence]

Results: refine-logs/EXPERIMENT_RESULTS.md
Tracker: refine-logs/EXPERIMENT_TRACKER.md

Ready for Workflow 2:
→ /auto-review-loop "[topic]"
```
Follow these shared protocols for all output files:
- Output Versioning Protocol — write timestamped file first, then copy to fixed name
- Output Manifest Protocol — log every output to MANIFEST.md
- Output Language Protocol — respect the project's language setting
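The versioning protocol's write-then-copy step might look like this (function and file names are illustrative):

```python
import shutil
from datetime import datetime
from pathlib import Path

def write_versioned(fixed_path: str, content: str) -> Path:
    """Write a timestamped file first, then copy it to the fixed name."""
    fixed = Path(fixed_path)
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    versioned = fixed.with_name(f"{fixed.stem}_{stamp}{fixed.suffix}")
    versioned.write_text(content, encoding="utf-8")
    shutil.copyfile(versioned, fixed)  # fixed name always holds the latest copy
    return versioned
```

This ordering keeps a full history of timestamped snapshots while downstream skills only ever read the fixed name.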
- EXPERIMENT_TRACKER.md should reflect real status after each run completes.
- On rented GPUs, run /vast-gpu destroy or /vast-gpu destroy-all when done.
- With gpu: modal, no cleanup is needed — Modal auto-scales to zero after each run. But always show cost estimates before running and verify the spending limit is set at https://modal.com/settings (NEVER through CLI).

/idea-discovery "direction" ← Workflow 1: find + refine + plan
/experiment-bridge ← you are here (Workflow 1.5: implement + deploy)
/auto-review-loop "topic" ← Workflow 2: review + iterate
/paper-writing "NARRATIVE_REPORT.md" ← Workflow 3: write the paper
Or use /research-pipeline for the full end-to-end flow (includes this bridge).