Monitor running ML training experiments for divergence. Polls log files, detects NaN/explosion/plateau, and kills diverging processes. Use when: experiments are running and need to be watched for training issues.
Watches running training experiments for signs of divergence and takes corrective action.
Path convention: All paths written as `<exp_root>/...` refer to the `exp_root` parameter from your dispatch. The plugin does not hardcode the output directory name.
From the orchestrator:
- `log_files`: List of log file paths to monitor (one per running experiment)
- `exp_ids`: Corresponding experiment IDs
- `project_root`: Project root directory
- `poll_interval`: How often to check (default: 30 seconds)
- `metric_to_watch`: Which metric to monitor for divergence (default: `"loss"`). The orchestrator passes the user's chosen divergence metric from Phase 0. Common values: `"loss"`, `"train_loss"`, `"val_loss"`, `"objective"`, `"nll_loss"`, `"total_loss"`.
- `lower_is_better`: Whether lower values are better for the watched metric (default: `true`). The `primary_metric` (accuracy, PSNR, etc.) is used by analyze/hp-tune, not by monitor.
- `explosion_threshold`: Threshold multiplier for explosion detection (default: 5.0). Override based on model type.
- `plateau_patience`: Steps without improvement before plateau alarm (default: 20). Override based on model type.
- `model_category` (optional): From user_choices — `"rl"`, `"generative"`, or null. Controls RL-specific monitoring adjustments (see RL Model Monitoring section).
- `overfitting_check` (optional): When provided, also check for overfitting by comparing train vs val metrics. Dict with `train_metric` and `val_metric` names. Example: `{"train_metric": "train_loss", "val_metric": "val_loss"}`. Requires both metrics to be available in the training logs.

```bash
# Check if training process is still running
ps aux | grep "<exp_id>" | grep -v grep
```

Or check for PID files in `<exp_root>/logs/<round_dir>/<exp_id>/pid`.

For each monitoring cycle:
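The PID-file liveness check can be sketched in Python. This is a minimal sketch; `process_alive` is a hypothetical helper name, and `os.kill(pid, 0)` is used as a pure existence probe (signal 0 sends nothing):

```python
import os

def process_alive(pid_file):
    """Return True if the PID stored in pid_file refers to a live process."""
    try:
        with open(pid_file) as f:
            pid = int(f.read().strip())
    except (FileNotFoundError, ValueError):
        return False  # no PID file, or unparseable contents
    try:
        os.kill(pid, 0)  # signal 0: existence check only, nothing is sent
        return True
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but belongs to another user
```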
For each log file:
```bash
# Read the last N lines of the log file
tail -100 <exp_root>/logs/<round_dir>/<exp_id>/train.log
```
Parse the log content for the watched metric:
```bash
python3 ${CLAUDE_PLUGIN_ROOT}/scripts/parse_logs.py <exp_root>/logs/<round_dir>/<exp_id>/train.log
```
Extract the metric trajectory (all values of the watched metric over time).
If the watched metric is not found in the parsed records, attempt auto-detection:

1. Case-insensitive match: compare `metric_to_watch.lower()` against all keys lowercased.
2. Prefix/suffix variants: try `train_<metric>`, `val_<metric>`, `<metric>_train` (e.g., loss → train_loss, val_loss).
   - Disambiguation: If multiple prefix variants match (e.g., both `train_loss` and `val_loss`), prefer `val_<metric>` — validation loss is a better divergence signal than training loss. Log which variant was selected to dev_notes.
3. Substring match: treat `metric_to_watch` as a substring (e.g., "loss" matches "total_loss").
4. If still not found, report status `unmonitored` with the list of available metrics. The orchestrator handles this status by continuing without divergence checks but with a hard timeout fallback.

Run divergence detection on the extracted trajectory, passing `lower_is_better` and model-category-aware thresholds:
```bash
python3 -c "
import json, sys
sys.path.insert(0, '${CLAUDE_PLUGIN_ROOT}/scripts')  # make plugin scripts importable
from detect_divergence import check_divergence, get_thresholds_for_category
from parse_logs import parse_log, extract_metric_trajectory
records = parse_log('<exp_root>/logs/<round_dir>/<exp_id>/train.log')
values = extract_metric_trajectory(records, '<metric>')
kwargs = get_thresholds_for_category('<model_category or None>')
kwargs['lower_is_better'] = <lower_is_better>
result = check_divergence(values, **kwargs)
print(json.dumps(result))
"
```
Alternatively, use the CLI with the --model-category flag:
```bash
python3 ${CLAUDE_PLUGIN_ROOT}/scripts/detect_divergence.py '<json_values>' --model-category <model_category>
```
This applies category-specific thresholds automatically (e.g., RL uses explosion_threshold=20.0 to avoid false positives on reward spikes).
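The real checks live in the plugin's `detect_divergence.py`; as an illustration only, the three standard signals (NaN, explosion, plateau) might look like the sketch below. The function name, thresholds, and exact semantics here are assumptions, simplified to positive loss-like values:

```python
import math

def check_divergence_sketch(values, lower_is_better=True,
                            explosion_threshold=5.0, plateau_patience=20):
    """Illustrative NaN / explosion / plateau checks (not the plugin's code)."""
    if not values:
        return {"diverged": False, "reason": None, "step": -1}
    # NaN/Inf anywhere in the trajectory is an immediate divergence
    for i, v in enumerate(values):
        if math.isnan(v) or math.isinf(v):
            return {"diverged": True, "reason": "nan", "step": i}
    # Normalize so that smaller is always better
    signed = values if lower_is_better else [-v for v in values]
    best = min(signed)
    # Explosion: latest value is explosion_threshold x worse than the best seen
    # (simplification: assumes positive values)
    if best > 0 and signed[-1] > explosion_threshold * best:
        return {"diverged": True, "reason": "explosion", "step": len(values) - 1}
    # Plateau: no new best within the last plateau_patience steps
    best_step = signed.index(best)
    if len(signed) - 1 - best_step >= plateau_patience:
        return {"diverged": True, "reason": "plateau", "step": len(values) - 1}
    return {"diverged": False, "reason": None, "step": -1}
```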
Overfitting check (only when `overfitting_check` is provided in the inputs):
Extract both metric trajectories from the log:
- `extract_metric_trajectory(records, overfitting_check["train_metric"])`
- `extract_metric_trajectory(records, overfitting_check["val_metric"])`

Run overfitting detection:

```bash
python3 ${CLAUDE_PLUGIN_ROOT}/scripts/detect_divergence.py --check-overfitting '<train_json>' '<val_json>' [--model-category <category>]
```
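As an illustration of what a train/val gap heuristic behind such a check could look like, here is a sketch. The plugin's actual `--check-overfitting` criteria may differ; `overfitting_gap`, the window size, and the 0.2 gap ratio are all assumptions:

```python
def overfitting_gap(train_vals, val_vals, window=10, gap_ratio=0.2):
    """Illustrative overfitting signal: train loss keeps improving while
    val loss worsens, and the relative val-train gap exceeds gap_ratio."""
    n = min(len(train_vals), len(val_vals))
    if n < 2 * window:
        return False  # not enough history to compare two windows
    # Compare the mean of the previous window against the most recent one
    t_old = sum(train_vals[n - 2 * window:n - window]) / window
    t_new = sum(train_vals[n - window:n]) / window
    v_old = sum(val_vals[n - 2 * window:n - window]) / window
    v_new = sum(val_vals[n - window:n]) / window
    train_improving = t_new < t_old
    val_worsening = v_new > v_old
    gap = (v_new - t_new) / max(abs(t_new), 1e-8)
    return train_improving and val_worsening and gap > gap_ratio
```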
If overfitting detected:
"overfitting_warning" (do NOT kill the process — overfitting is a warning, not a hard failure)category: "overfitting", severity: "warning"${CLAUDE_PLUGIN_ROOT}/scripts/goal_memory.py <exp_root> log-behavior training_insight '{"insight":"Overfitting detected: <severity> at step <step>","source":"monitor"}'If divergence is detected:
Kill the training process:
```bash
# Option 1: PID file (preferred — most reliable)
kill $(cat <exp_root>/logs/<round_dir>/<exp_id>/pid) 2>/dev/null

# Option 2: Safe pattern match — verify process is a training process before killing
# First find candidates, then verify cmdline contains python/train before killing
for pid in $(pgrep -f "<exp_id>"); do
  cmdline=$(cat /proc/$pid/cmdline 2>/dev/null | tr '\0' ' ')
  if echo "$cmdline" | grep -qE 'python|train'; then
    kill "$pid"
  fi
done
```
Warning: Never use bare `pkill -f "<exp_id>"` — it could match unrelated processes.
Record the divergence:
status: "completed" or status: "failed", do NOT overwrite — the experiment finished first. Log to dev_notes: "Monitor detected divergence for <exp_id> but experiment already completed with status '<status>' — skipping overwrite." Skip to step 3.status: "running": update status to "diverged" and add divergence details to notes:
```json
{
  "status": "diverged",
  "notes": "Divergence detected: <reason> at step <step>"
}
```
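The guarded read-check-write can be sketched as follows. `record_divergence` is a hypothetical helper, and the result-file fields shown are only the ones this step touches:

```python
import json

def record_divergence(result_path, reason, step):
    """Update the result file to 'diverged' only if it is still 'running'.
    Returns a short description of the action taken, for logging."""
    with open(result_path) as f:
        result = json.load(f)
    if result.get("status") in ("completed", "failed"):
        # Experiment finished first; never overwrite a terminal status.
        return f"skipped: already {result['status']}"
    result["status"] = "diverged"
    result["notes"] = f"Divergence detected: {reason} at step {step}"
    with open(result_path, "w") as f:
        json.dump(result, f, indent=2)
    return "updated"
```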
```bash
python3 schema_validator.py <exp_root>/results/<exp_id>.json result
```

If validation fails, fix the JSON before continuing.

Log the event:
Append to `<exp_root>/dev_notes.md`:
```markdown
## Divergence Detected
- Experiment: <exp_id>
- Reason: <reason>
- Step: <step>
- Action: Training process killed
```
Log to error tracker:
```bash
python3 ${CLAUDE_PLUGIN_ROOT}/scripts/error_tracker.py <exp_root> log '{"category":"divergence","severity":"warning","source":"monitor","message":"<divergence reason>","exp_id":"<exp_id>","context":{"divergence_type":"<nan|explosion|plateau|drift>","step":<step>,"metric_to_watch":"<metric>"}}'
```
Log divergence pattern to behavioral memory:
When an experiment is killed due to divergence, also log the pattern to behavioral memory:
```bash
python3 ${CLAUDE_PLUGIN_ROOT}/scripts/goal_memory.py <exp_root> log-behavior divergence_pattern '{"description":"<metric> diverged at step <N> with config <config>","affected_branches":["<branch>"],"threshold":{"parameter":"<hp>","value":<val>},"source":"monitor"}'
```
This helps future hp-tune iterations avoid configs that trigger divergence.
For healthy experiments, report status:
```
Monitoring status:
- exp-001: healthy (loss=0.45 at step 500, trending down)
- exp-002: DIVERGED (NaN at step 350) - process killed
- exp-003: healthy (loss=0.52 at step 480, trending down)
```
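A sketch of how the "trending down" annotation in the report above might be computed. This is an assumed heuristic (last 10 points, 2% tolerance), not the plugin's actual formatter:

```python
def trend_label(values, window=10, tol=0.02):
    """Classify the recent trend of a lower-is-better metric
    for the per-experiment summary line."""
    if len(values) < 2:
        return "insufficient data"
    recent = values[-window:]
    first, last = recent[0], recent[-1]
    if last < first * (1 - tol):
        return "trending down"
    if last > first * (1 + tol):
        return "trending up"
    return "flat"
```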
The monitor exits when:
See also hp-tune/references/tuning-strategy.md for per-model-type HP guidance that informs threshold selection.
| Model Type | Explosion Threshold | Plateau Patience | Notes |
|---|---|---|---|
| CNN (classification) | 5.0 | 20 | Standard defaults |
| Transformer | 10.0 | 30 | Loss can be spikier |
| GAN | 20.0 | 50 | Inherently noisy training |
| Diffusion model | 10.0 | 40 | Slow convergence is normal |
| Fine-tuning | 3.0 | 15 | Should converge faster |
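The table above can be encoded as a simple lookup, sketched here. The dict layout and `thresholds_for` helper are hypothetical; the plugin's `get_thresholds_for_category` may organize this differently:

```python
# Per-model-type defaults from the table above (illustrative encoding).
THRESHOLDS = {
    "cnn":         {"explosion_threshold": 5.0,  "plateau_patience": 20},
    "transformer": {"explosion_threshold": 10.0, "plateau_patience": 30},
    "gan":         {"explosion_threshold": 20.0, "plateau_patience": 50},
    "diffusion":   {"explosion_threshold": 10.0, "plateau_patience": 40},
    "fine_tuning": {"explosion_threshold": 3.0,  "plateau_patience": 15},
}

def thresholds_for(model_type):
    # Unknown types fall back to the standard CNN defaults
    return THRESHOLDS.get(model_type, THRESHOLDS["cnn"])
```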
When model_category = "rl", apply these adjustments:
| Metric | Polarity | Divergence Signal |
|---|---|---|
| policy_loss / actor_loss | lower is better | Standard: NaN/explosion/plateau |
| value_loss / critic_loss | lower is better | Standard: NaN/explosion/plateau |
| reward / episode_return | higher is better | Collapse: drops >50% from rolling max over 100 episodes |
| entropy | context-dependent | Entropy collapse: drops below 0.01 |
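The reward-collapse rule from the table (latest return drops more than 50% below the rolling max over 100 episodes) can be sketched as follows. `reward_collapsed` is an illustrative helper, not the plugin's code:

```python
def reward_collapsed(returns, window=100, drop=0.5):
    """Collapse signal for higher-is-better metrics: the latest episode
    return fell more than `drop` below the rolling max over `window` episodes."""
    if len(returns) < window:
        return False  # not enough episodes for a rolling max yet
    rolling_max = max(returns[-window:])
    if rolling_max <= 0:
        return False  # relative drop is ill-defined for non-positive maxima
    return returns[-1] < (1.0 - drop) * rolling_max
```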
When divergence_metric is a reward metric (lower_is_better = False):
If a log file does not yet exist, wait before reporting an error (wait longer for `code_branch` experiments — worktree setup, dataset downloads, and dependency resolution add significant startup time). If the log file still doesn't exist after the wait period, check whether the experiment process is still alive (via PID file at `<exp_root>/logs/<round_dir>/<exp_id>/pid`). If the process is alive, extend the wait by another 60 seconds. If the process is dead or no PID file exists, report error immediately.

If the log file exists but stays empty, report status `"no_output"` with reason "Log file empty after 5 minutes — training may have stalled". Log to error tracker with `category: "training_failure"`, `severity: "warning"`, `source: "monitor"`.

Return to the orchestrator a dict per experiment:
- `exp_id`: Experiment identifier
- `status`: One of `healthy`, `diverged`, `completed`, `failed`, `no_output`
- `reason`: Divergence reason (if diverged) or null
- `step`: Step at which divergence was detected (if diverged) or -1
- `latest_metrics`: Dict of most recent metric values
- `metric_trajectory`: List of watched metric values over time

When the watched metric is not found after fallbacks (Step 2b.1):
- `metric_trajectory`: `[]` (empty — watched metric was never parsed)
- `latest_metrics`: All other available metrics from the final log line (the watched metric will be absent from this dict)
- `reason`: "Watched metric '<name>' not found; available: [<list>]"

Important: These status values (`healthy`, `no_output`) are internal monitor output for the orchestrator only. They must NOT be written to experiment result JSON files. Result files use: `completed`, `failed`, `diverged`, `timeout`.
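The internal/result-file split can be enforced with a small guard before any write, sketched here as a hypothetical helper:

```python
# Monitor-internal statuses vs. statuses allowed in result JSON files.
INTERNAL_ONLY = {"healthy", "no_output"}
RESULT_FILE_STATUSES = {"completed", "failed", "diverged", "timeout"}

def assert_writable_status(status):
    """Refuse to write monitor-internal statuses into a result file."""
    if status in INTERNAL_ONLY or status not in RESULT_FILE_STATUSES:
        raise ValueError(f"status '{status}' must not be written to result JSON")
    return status
```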