This skill should be used when the user asks to "create a new experiment", "start an experiment", "新しい実験", "実験を作って", "run training", "train a model", "record results", "結果を記録", "plan next experiment", "次の実験を考えて", "review experiment history", "実験の履歴を見て", or wants to follow the experiment lifecycle (plan, create, implement, train, record).
Guide the full experiment lifecycle: plan, create, implement, train, and record results.
| Phase | Action |
|---|---|
| Understand | Review competition docs in backlog (backlog doc list) |
| Plan | Review backlog and past experiments, create experiment task |
| Create | task new-exp EXP=expXXX to create experiment directory and backlog task |
| Implement | Write train.py, settings.py, run code quality checks |
| Train | task train-local or task train-vertex |
| Record | Update backlog task with results |
Before planning any experiment, review the competition documentation stored in backlog:
```
backlog doc list        # List all documents
backlog doc view DOC-N  # Read a specific document
```
Competition documents (overview, data description, evaluation metric, etc.) are managed as backlog documents. Check what's available and review relevant materials before designing experiments.
Before starting a new experiment, review the backlog and past experiments:
```
backlog search --type task exp --plain  # Search experiment tasks
backlog overview                        # Project-level summary
```
Every experiment MUST have a corresponding backlog task. Create it before starting implementation:
```
backlog task create "expXXX: Short description of experiment" \
  -d "Hypothesis: ... / Changes from base: ... / Expected outcome: ..." \
  -l exp -l expXXX \
  --ac "Training completes without errors" \
  --ac "CV score recorded" \
  --priority medium
```
Required conventions:
- exp: All experiment tasks MUST have the exp label for filtering
- expXXX: All experiment tasks MUST have the experiment name (e.g., exp001) as a label
- Title format: "expXXX: Short description of experiment" (e.g., exp002: ...)
- When an experiment builds on a previous one, link it with --dep TASK-N (the parent experiment's task)

```
task new-exp EXP=exp002                # From template
task new-exp EXP=exp002 SOURCE=exp001  # Copy from existing experiment
```
This creates models/exp002/ with train.py, settings.py, and inference.py. A backlog task is automatically created with labels exp and exp002.
If models/exp002/submission/ exists after creation, this is a Kaggle code competition — the model must be submitted as a Kaggle kernel, not a CSV file. The submission/ directory is auto-included when competition_platform: kaggle and is_code_competition: true in project.yml. Override with KAGGLE_CODE_SUB=true or KAGGLE_CODE_SUB=false.
After creation, update the backlog task with experiment details:
```
backlog task edit TASK-N -d "Hypothesis: ... / Changes from base: ... / Expected outcome: ..."
backlog task edit TASK-N --plan "Implementation approach: ..."
```
Before implementing train.py, decide the cross-validation strategy. The strategy must match how the test set was constructed.
For the decision flow and code examples, read .claude/skills/experiment-workflow/references/validation-strategy.md.
For code competitions where test data is hidden, check public notebooks/discussions or ask the user to confirm assumptions about the test split.
Quick reference:
| Condition | Strategy |
|---|---|
| Time-series problem | TimeSeriesSplit |
| Train/test split by distinct groups | StratifiedGroupKFold |
| Categorical target or imbalanced classes | StratifiedKFold |
| Multi-label classification | MultilabelStratifiedKFold |
| None of the above | KFold |
Record the chosen validation strategy in the backlog task description or plan.
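To make the quick-reference table concrete, here is a minimal OOF-split sketch using StratifiedKFold; swap in whichever splitter the table selects for your data. The toy arrays and the constant "prediction" are illustrative stand-ins:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy stand-ins for the real feature matrix and target
X = np.random.rand(100, 4)
y = np.array([0, 1] * 50)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
oof = np.full(len(y), np.nan)  # one out-of-fold prediction per training row

for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    # Fit on train_idx, predict on val_idx; a constant stands in for a model here
    oof[val_idx] = y[train_idx].mean()

assert not np.isnan(oof).any()  # every row received exactly one OOF prediction
```

The same loop shape works for TimeSeriesSplit or StratifiedGroupKFold (the latter additionally takes a `groups` argument in `split`).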
train.py must use tyro.cli with a main() function to support CLI arguments (e.g., --debug):
```python
from settings import Config, DirectorySettings


def predict(model, df, ...):
    """Inference logic. Also called from inference.py."""
    ...


def main(debug: bool = False) -> None:
    settings = DirectorySettings(exp_name="expXXX")
    config = Config()
    if debug:
        settings.artifact_dir = settings.artifact_dir / "debug"
        settings.output_dir = settings.artifact_dir
        config.epochs = 1

    # ... data loading, training, model save ...

    # Validation inference
    val_predictions = predict(model, val_df)
    # Compute evaluation metrics and save OOF predictions to CSV


if __name__ == "__main__":
    import tyro

    tyro.cli(main)
```
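train.py should also persist run metrics to artifact_dir; a minimal sketch of writing metrics.json (the function name and payload keys are illustrative):

```python
import json
from pathlib import Path


def save_metrics(artifact_dir: Path, cv_score: float, fold_scores: list[float], config: dict) -> None:
    """Write metrics.json (CV score, per-fold scores, config) into artifact_dir."""
    artifact_dir.mkdir(parents=True, exist_ok=True)
    payload = {"cv_score": cv_score, "fold_scores": fold_scores, "config": config}
    (artifact_dir / "metrics.json").write_text(json.dumps(payload, indent=2))
```

Call it at the end of main() with the resolved artifact_dir so every run leaves a self-describing record alongside MLflow.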
Key conventions:
- Tracking setup lives in train.py only. Do not put tracking setup in settings.py or inference.py.
- The entry point is main(), invoked via tyro.cli(main)
- The if __name__ == "__main__" guard is required (enables safe import by inference.py)
- predict() is defined in train.py; inference.py imports it via from train import predict
- debug: bool = False is the standard flag; add other CLI args as needed
- Run validation inference through predict() (same pipeline as submission) and compute evaluation metrics
- Save OOF predictions to artifact_dir
- Write metrics.json to artifact_dir: include CV score, per-fold scores, and config. This remains useful alongside MLflow tracking.
- Save diagnostic plots to artifact_dir: visualize OOF predictions vs ground truth. Choose plots appropriate for the task (e.g., scatter + residuals for regression, confusion matrix + calibration for classification).
- Keep tunable parameters in Config (in settings.py). Do not use module-level constants for tunable values. This centralizes experiment configuration and makes it easy to compare settings across experiments.

inference.py is the submission pipeline that runs in an internet-off environment. It must be self-contained and produce the final submission.
Requirements:
- Import predict() from train.py: from train import predict — inference logic is defined in train.py and shared
- main() + if __name__ == "__main__" guard: wrap execution in main() with a guard to allow safe imports
- Read inputs from input_dir and trained model artifacts from artifacts_dir
- Produce the final output (write submission.csv to output_dir, or use the evaluation API if required)
- Submission format: follow the competition's specification exactly. Check the competition overview and data description documents in backlog for the expected output format, column names, and file naming conventions.
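Before submitting, a cheap sanity check against the sample submission catches format drift early (the helper name is hypothetical, not part of the template):

```python
import csv
from pathlib import Path


def check_submission(submission_path: Path, sample_path: Path) -> None:
    """Fail fast if submission.csv deviates from the sample's header or row count."""
    sample_rows = list(csv.reader(sample_path.open()))
    sub_rows = list(csv.reader(submission_path.open()))
    assert sub_rows[0] == sample_rows[0], f"header mismatch: {sub_rows[0]}"
    assert len(sub_rows) == len(sample_rows), "row count mismatch"
```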
After writing code, always run:
- /simplify to review and simplify
- task fmt (ruff check --fix + ruff format)
- task ty (ty check)

After implementation and code quality checks pass, commit the changes using the /commit-commands:commit skill.
Always run training commands in the background using run_in_background: true on the Bash tool. Training can take minutes to hours, and blocking the conversation prevents the user from doing other work. After launching, inform the user that training is running and they can check progress with TaskOutput.
Before starting training, use AskUserQuestion to ask the user how they want to run training. Present the available options:
- Vertex AI with L4 GPU: task train-vertex EXP=expXXX ACCELERATOR_TYPE=NVIDIA_L4
- Vertex AI with V100 GPU: task train-vertex EXP=expXXX ACCELERATOR_TYPE=NVIDIA_TESLA_V100
- Vertex AI, CPU only: task train-vertex EXP=expXXX
- Local: task train-local EXP=expXXX
- Debug run: task train-local EXP=expXXX EXTRA_ARGS="--debug" or task train-vertex EXP=expXXX EXTRA_ARGS="--debug"

GPU-requiring tasks should default to NVIDIA_L4. Machine type is auto-resolved from accelerator type by GpuConfig in src/kaggle_ops/vertex.py.
| Accelerator | Default Machine Type | Command |
|---|---|---|
| NVIDIA_L4 (default) | g2-standard-8 | task train-vertex EXP=expXXX ACCELERATOR_TYPE=NVIDIA_L4 |
| NVIDIA_TESLA_V100 | n1-highmem-8 | task train-vertex EXP=expXXX ACCELERATOR_TYPE=NVIDIA_TESLA_V100 |
| NVIDIA_TESLA_A100 | a2-highgpu-1g | task train-vertex EXP=expXXX ACCELERATOR_TYPE=NVIDIA_TESLA_A100 |
| CPU only | n1-highmem-8 | task train-vertex EXP=expXXX |
```
task train-local EXP=exp002                                  # Run training locally
task train-vertex EXP=exp002 ACCELERATOR_TYPE=NVIDIA_L4      # Vertex AI with L4 (auto machine type)
task train-local EXP=exp002 EXTRA_ARGS="--debug"             # Debug run locally (epochs=1, data limited)
task train-vertex EXP=exp002 EXTRA_ARGS="--debug"            # Debug run on Vertex AI
task run-local SCRIPT=models/exp002/inference.py             # Run inference locally
```
Use EXTRA_ARGS="--debug" to run in debug mode. This is useful for verifying end-to-end pipeline correctness before launching a full training run.
Debug mode applies these overrides in train.py:
- Artifacts are written to artifacts/debug/ to avoid mixing with production artifacts
- Training is shortened (e.g., config.epochs = 1) when debug=true

EXTRA_ARGS is a generic parameter that passes arbitrary arguments to train.py. On Vertex AI, arguments are forwarded via vertex.py's extra_args through the container entrypoint.
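The --debug branch in train.py amounts to overrides like these (field names mirror the skeleton above; the default epoch count is illustrative):

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Config:
    epochs: int = 20  # illustrative production default


def apply_debug_overrides(config: Config, artifact_dir: Path) -> tuple[Config, Path]:
    """Mirror the if debug: branch — 1 epoch, artifacts under a debug/ subdirectory."""
    config.epochs = 1
    return config, artifact_dir / "debug"
```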
After training completes, immediately record the CV score in the backlog task. Do not wait for user instruction — this is an automatic step after every successful training run.
```
# Record CV score immediately after training completes
backlog task edit TASK-N --append-notes "CV score: 0.8765 (config summary)"
backlog task edit TASK-N --check-ac 1 --check-ac 2
```
LB score is recorded later when the user provides feedback after submission (e.g., "Public LB: 0.8750"). At that point, update the task with the full summary:
```
backlog task edit TASK-N --append-notes "Public LB: 0.8750"
backlog task edit TASK-N --final-summary "CV=0.8765, LB=0.8750. Next: try feature X (see TASK-M)"
backlog task edit TASK-N -s "Done"
```
- For DirectorySettings and path resolution across environments, read .claude/skills/experiment-workflow/references/directory-settings.md.
- For the engineer_features pattern (stateless/stateful separation, f_ prefix, encoder block, polars), read .claude/skills/experiment-workflow/references/tabular-feature-engineering.md.
- For validation strategy selection, read .claude/skills/experiment-workflow/references/validation-strategy.md.
- For backlog task operations, use the backlog skill.
- For experiment tracking, use the mlflow-primary skill.