Write and iteratively run ML experiment code through 4 stages (implementation, baselines, full experiments, ablations). Use this skill when the user wants to run experiments for a research paper, implement and test a research idea, iterate on experiment code, debug ML training scripts, or generate experiment results. Also use when the user asks to "run the experiments," "implement the method," or "get the results."
You are an ML researcher implementing and running experiments for a scientific paper. Your job is to write Python code, execute it, analyze the results, fix bugs, and iterate — progressing through 4 experiment stages until you have comprehensive results.
Prerequisites:

- Output of the idea-setup skill (or equivalent) containing `idea.md`, `idea.json`, and `config.yaml`
- `research_log.jsonl` and `idea_evolution.md` for logging progress and idea changes

The research idea is not static. After each experiment phase, evaluate whether the results support the current hypothesis. If they don't, update the idea: shift claims, change metrics, or pivot the approach. This mirrors real research: hypotheses evolve as evidence accumulates.
After every phase, run this feedback protocol:
1. Review the latest results logged in `research_log.jsonl`
2. Compare them against the hypothesis in `idea.md`
3. Read `idea.json` to get the current version number
4. If the idea needs to change, record the change in `idea_evolution.md` (what changed, why, impact)
5. Rewrite `idea.md` with the updated hypothesis, method, or metrics
6. Update `idea.json` (increment the version, update `last_updated` and `update_reason`)
7. Append an `idea_updated` event to `research_log.jsonl`

Never ignore contradictory evidence. If the experiment doesn't support the claim, update the claim; don't cherry-pick results.
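The bookkeeping half of this protocol can be sketched as a small helper. The file names and fields (`version`, `last_updated`, `update_reason`, the `idea_updated` event) come from the protocol above; the function itself is a hypothetical convenience, not part of the skill's required API:

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def record_idea_update(workdir: str, reason: str) -> int:
    """Bump the version in idea.json and append an idea_updated
    event to research_log.jsonl. Returns the new version number."""
    root = Path(workdir)
    meta_path = root / "idea.json"
    meta = json.loads(meta_path.read_text())
    meta["version"] = meta.get("version", 0) + 1
    meta["last_updated"] = datetime.now(timezone.utc).isoformat()
    meta["update_reason"] = reason
    meta_path.write_text(json.dumps(meta, indent=2))

    event = {
        "event": "idea_updated",
        "version": meta["version"],
        "reason": reason,
        "timestamp": meta["last_updated"],
    }
    with (root / "research_log.jsonl").open("a") as f:
        f.write(json.dumps(event) + "\n")
    return meta["version"]
```

Prose updates to `idea.md` and `idea_evolution.md` still need to be written by hand; only the mechanical version bump and event logging are automated here.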
Work through these stages sequentially. Each stage builds on the previous one's results.
## Stage 1: Implementation

Goal: Produce a correct, running implementation of the proposed method.
1. Read `idea.md` (check the version in `idea.json`) to understand what needs to be implemented
2. Write `runfile.py` (or a more descriptively named script) that:
   - Prints each metric to stdout as `METRIC_NAME: VALUE`
   - Saves result arrays as `.npy` files in `experiment_results/`
3. Log an `experiment_completed` event to `research_log.jsonl`. Assess if initial results are consistent with the hypothesis.

Output format for metrics (print to stdout):
```
train_loss: 0.342
test_accuracy: 0.891
val_f1: 0.756
```
Output format for data (save to experiment_results/):
```python
np.save("experiment_results/train_losses.npy", train_losses)
np.save("experiment_results/test_accuracies.npy", test_accuracies)
```
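Both output conventions can be wrapped in one helper at the end of the run script. The function name and signature are illustrative, not prescribed by the skill:

```python
import os
import numpy as np

def report_results(metrics: dict, arrays: dict,
                   outdir: str = "experiment_results") -> None:
    """Print scalar metrics as METRIC_NAME: VALUE and save arrays as .npy."""
    os.makedirs(outdir, exist_ok=True)
    for name, value in metrics.items():
        print(f"{name}: {value:.3f}")  # parsed from stdout downstream
    for name, values in arrays.items():
        np.save(os.path.join(outdir, f"{name}.npy"), np.asarray(values))
```

Keeping the two conventions behind one call makes it harder to forget one of them when iterating on the script.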
## Stage 2: Baselines

Goal: Implement baselines and run fair comparisons.
1. Review `research_log.jsonl` for Phase 1 findings. Re-read `idea.md` in case it was updated.
2. Implement the baselines described in `idea.md`
3. Run each method and save its results:

```python
np.save("experiment_results/baseline_random_accuracies.npy", results)
np.save("experiment_results/proposed_method_accuracies.npy", results)
```
```python
import json

summary = {
    "methods": {
        "proposed": {"accuracy": {"mean": 0.89, "std": 0.02}},
        "baseline_1": {"accuracy": {"mean": 0.82, "std": 0.03}},
    },
    "best_method": "proposed",
    "analysis": "Proposed method outperforms baseline by 7% on accuracy...",
}

with open("baseline_summary.json", "w") as f:
    json.dump(summary, f, indent=2)
```
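The `methods` block is typically aggregated from per-seed results. A minimal, dependency-free sketch (the helper name is hypothetical):

```python
from statistics import mean, stdev

def summarize_methods(per_seed_acc: dict) -> dict:
    """Aggregate per-seed accuracy lists into mean/std per method
    and pick the method with the best mean accuracy."""
    methods = {
        name: {"accuracy": {"mean": round(mean(vals), 4),
                            "std": round(stdev(vals), 4)}}
        for name, vals in per_seed_acc.items()
    }
    best = max(methods, key=lambda m: methods[m]["accuracy"]["mean"])
    return {"methods": methods, "best_method": best}
```

This keeps the comparison fair in at least one respect: every method is summarized by the same statistic over the same number of seeds.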
Update `idea_evolution.md` and `research_log.jsonl` per the feedback protocol.

## Stage 3: Full Experiments

Goal: Run full-scale experiments with thorough evaluation.
1. Re-read `idea.md`; it may have been updated after Phase 2. Check `idea_evolution.md` to understand what changed and why.
2. Run the full-scale experiments and save the results:

```python
np.save("experiment_results/full_learning_curves.npy", curves)
np.save("experiment_results/full_comparison_results.npy", results)
```
```python
import json

summary = {
    "key_findings": ["..."],
    "best_configuration": {...},
    "statistical_tests": {...},
    "analysis": "...",
}

with open("research_summary.json", "w") as f:
    json.dump(summary, f, indent=2)
```
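For the `statistical_tests` field, one dependency-free option is a two-sided permutation test on the difference of means between two methods' per-seed scores. This is a sketch of one reasonable choice, not a requirement of the skill:

```python
import random
from statistics import mean

def permutation_test(a, b, n_perm=2000, seed=0):
    """Two-sided permutation test for a difference in means.
    Returns a smoothed p-value in (0, 1]."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if abs(mean(pooled[:len(a)]) - mean(pooled[len(a):])) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

With the handful of seeds typical at this stage, a permutation test avoids the normality assumption a t-test would make.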
## Stage 4: Ablations

Goal: Systematically validate which components of the proposed method matter.
1. Re-read `idea.md`; ablate the current version of the method, which may have evolved since Stage 1.
2. Choose components and hyperparameters to ablate (based on `idea.md` and Stage 3 results)
3. Run each ablation and save its results:

```python
np.save("experiment_results/ablation_no_attention.npy", results)
np.save("experiment_results/ablation_lr_sweep.npy", results)
```
```python
import json

summary = {
    "ablations": [
        {
            "name": "without_attention",
            "change": "Removed self-attention layer",
            "impact": {"accuracy": -0.05, "loss": +0.12},
            "conclusion": "Attention is important for ...",
        }
    ],
    "analysis": "The most critical component is ...",
}

with open("ablation_summary.json", "w") as f:
    json.dump(summary, f, indent=2)
```
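The `impact` entries are signed deltas of the ablated run against the full method, so a negative accuracy delta means the ablation hurt. A tiny illustrative helper:

```python
def ablation_impact(full_metrics: dict, ablated_metrics: dict) -> dict:
    """Signed per-metric change when a component is removed:
    ablated minus full, rounded for reporting."""
    return {k: round(ablated_metrics[k] - full_metrics[k], 4)
            for k in full_metrics}
```

Computing deltas in one place keeps the sign convention consistent across every ablation row in the summary.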
Run the feedback protocol one last time and finalize `idea.md`.

For each experiment run:
1. Execute `python runfile.py` (or the relevant script)
2. Check stdout for the printed metrics and verify each expected `.npy` file
3. If the run fails, fix the bug and re-run

After all 4 stages, the workspace should contain:
```
experiments/{idea_name}/
├── idea.md                   # Current idea (final version, may differ from v1)
├── idea.json                 # Metadata with version number
├── idea_evolution.md         # Complete history of idea changes
├── research_log.jsonl        # Full event log (all experiments + idea updates)
├── config.yaml
├── runfile.py                # Main experiment script(s)
├── baseline_summary.json     # Stage 2 summary
├── research_summary.json     # Stage 3 summary
├── ablation_summary.json     # Stage 4 summary
├── experiment_results/
│   ├── train_losses.npy
│   ├── test_accuracies.npy
│   ├── baseline_*.npy
│   ├── ablation_*.npy
│   ├── full_*.npy
│   └── *.png                 # Intermediate diagnostic plots
├── figures/                  # (populated by aggregate-plots skill)
└── logs/
    └── experiment_log.txt
```
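The per-run loop described above (execute the script, then parse `METRIC_NAME: VALUE` lines from stdout) can be sketched as follows; `run_experiment` is a hypothetical helper, and it uses `sys.executable` rather than a bare `python` so it runs under the current interpreter:

```python
import re
import subprocess
import sys

def run_experiment(script: str) -> dict:
    """Run a script and parse METRIC_NAME: VALUE lines from its stdout.
    Raises on a non-zero exit code so failures surface immediately."""
    proc = subprocess.run([sys.executable, script],
                          capture_output=True, text=True)
    if proc.returncode != 0:
        raise RuntimeError(f"{script} failed:\n{proc.stderr}")
    pattern = re.compile(r"^(\w+):\s*(-?\d+(?:\.\d+)?)\s*$", re.MULTILINE)
    return {name: float(value) for name, value in pattern.findall(proc.stdout)}
```

Surfacing stderr in the exception message gives the debug loop its starting point when a run fails.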