Monitor running ML training experiments for divergence. Polls log files, detects NaN/explosion/plateau, and kills diverging processes. Use when: experiments are running and need to be watched for training issues.
Watches running training experiments for signs of divergence and takes corrective action.
Path convention: All paths written as `<exp_root>/...` refer to the `exp_root` parameter from your dispatch. The plugin does not hardcode the output directory name.
From the orchestrator:
- `log_files`: List of log file paths to monitor (one per running experiment)
- `exp_ids`: Corresponding experiment IDs
- `project_root`: Project root directory
- `poll_interval`: How often to check (default: 30 seconds)
- `metric_to_watch`: Which metric to monitor for divergence (default: `"loss"`). The orchestrator passes the user's chosen divergence metric from Phase 0. Common values: `"loss"`, `"train_loss"`, `"val_loss"`, `"objective"`, `"nll_loss"`, `"total_loss"`.
- `lower_is_better`: Whether lower values are better for the watched metric (default: `true`). The `primary_metric` (accuracy, PSNR, etc.) is used by analyze/hp-tune, not by monitor.
- `explosion_threshold`: Threshold multiplier for explosion detection (default: 5.0). Override based on model type.
- `plateau_patience`: Steps without improvement before plateau alarm (default: 20). Override based on model type.
- `model_category` (optional): From user_choices — `"rl"`, `"generative"`, or null. Controls RL-specific monitoring adjustments (see RL Model Monitoring section).
- `overfitting_check` (optional): When provided, also check for overfitting by comparing train vs val metrics. Dict with `train_metric` and `val_metric` names. Example: `{"train_metric": "train_loss", "val_metric": "val_loss"}`. Requires both metrics to be available in the training logs.

```bash
# Check if training process is still running
ps aux | grep "<exp_id>" | grep -v grep
```

Or check for PID files in `<exp_root>/logs/<round_dir>/<exp_id>/pid`.

For each monitoring cycle:
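The PID-file liveness check can be sketched in Python. This is a minimal sketch; `process_alive` is a hypothetical helper name, and `os.kill(pid, 0)` is used as a pure existence probe (signal 0 sends nothing):

```python
import os

def process_alive(pid_file):
    """Return True if the PID stored in pid_file refers to a live process."""
    try:
        with open(pid_file) as f:
            pid = int(f.read().strip())
    except (FileNotFoundError, ValueError):
        return False  # no PID file, or unparseable contents
    try:
        os.kill(pid, 0)  # signal 0: existence check only, nothing is sent
        return True
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but belongs to another user
```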
For each log file:
```bash
# Read the last N lines of the log file
tail -100 <exp_root>/logs/<round_dir>/<exp_id>/train.log
```
Parse the log content for the watched metric:
```bash
python3 ${CLAUDE_PLUGIN_ROOT}/scripts/parse_logs.py <exp_root>/logs/<round_dir>/<exp_id>/train.log
```
Extract the metric trajectory (all values of the watched metric over time).
If the watched metric is not found in the parsed records, attempt auto-detection:

1. Case-insensitive match: compare `metric_to_watch.lower()` against all keys lowercased.
2. Prefix/suffix variants: try `train_<metric>`, `val_<metric>`, `<metric>_train` (e.g., loss → train_loss, val_loss).
   - Disambiguation: If multiple prefix variants match (e.g., both `train_loss` and `val_loss`), prefer `val_<metric>` — validation loss is a better divergence signal than training loss. Log which variant was selected to dev_notes.
3. Substring match: treat `metric_to_watch` as a substring (e.g., "loss" matches "total_loss").
4. If still not found, report status `unmonitored` with the list of available metrics. The orchestrator handles this status by continuing without divergence checks but with a hard timeout fallback.

Run divergence detection on the extracted trajectory, passing `lower_is_better` and model-category-aware thresholds:
```bash
python3 -c "
import json, sys
sys.path.insert(0, '${CLAUDE_PLUGIN_ROOT}/scripts')  # make plugin scripts importable
from detect_divergence import check_divergence, get_thresholds_for_category
from parse_logs import parse_log, extract_metric_trajectory
records = parse_log('<exp_root>/logs/<round_dir>/<exp_id>/train.log')
values = extract_metric_trajectory(records, '<metric>')
kwargs = get_thresholds_for_category('<model_category or None>')
kwargs['lower_is_better'] = <lower_is_better>
result = check_divergence(values, **kwargs)
print(json.dumps(result))
"
```
Alternatively, use the CLI with the --model-category flag:
```bash
python3 ${CLAUDE_PLUGIN_ROOT}/scripts/detect_divergence.py '<json_values>' --model-category <model_category>
```
This applies category-specific thresholds automatically (e.g., RL uses explosion_threshold=20.0 to avoid false positives on reward spikes).
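The real checks live in the plugin's `detect_divergence.py`; as an illustration only, the three standard signals (NaN, explosion, plateau) might look like the sketch below. The function name, thresholds, and exact semantics here are assumptions, simplified to positive loss-like values:

```python
import math

def check_divergence_sketch(values, lower_is_better=True,
                            explosion_threshold=5.0, plateau_patience=20):
    """Illustrative NaN / explosion / plateau checks (not the plugin's code)."""
    if not values:
        return {"diverged": False, "reason": None, "step": -1}
    # NaN/Inf anywhere in the trajectory is an immediate divergence
    for i, v in enumerate(values):
        if math.isnan(v) or math.isinf(v):
            return {"diverged": True, "reason": "nan", "step": i}
    # Normalize so that smaller is always better
    signed = values if lower_is_better else [-v for v in values]
    best = min(signed)
    # Explosion: latest value is explosion_threshold x worse than the best seen
    # (simplification: assumes positive values)
    if best > 0 and signed[-1] > explosion_threshold * best:
        return {"diverged": True, "reason": "explosion", "step": len(values) - 1}
    # Plateau: no new best within the last plateau_patience steps
    best_step = signed.index(best)
    if len(signed) - 1 - best_step >= plateau_patience:
        return {"diverged": True, "reason": "plateau", "step": len(values) - 1}
    return {"diverged": False, "reason": None, "step": -1}
```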
Overfitting check (only when `overfitting_check` is provided in the inputs):
Extract both metric trajectories from the log:
- `extract_metric_trajectory(records, overfitting_check["train_metric"])`
- `extract_metric_trajectory(records, overfitting_check["val_metric"])`

Run overfitting detection:

```bash
python3 ${CLAUDE_PLUGIN_ROOT}/scripts/detect_divergence.py --check-overfitting '<train_json>' '<val_json>' [--model-category <category>]
```
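As an illustration of what a train/val gap heuristic behind such a check could look like, here is a sketch. The plugin's actual `--check-overfitting` criteria may differ; `overfitting_gap`, the window size, and the 0.2 gap ratio are all assumptions:

```python
def overfitting_gap(train_vals, val_vals, window=10, gap_ratio=0.2):
    """Illustrative overfitting signal: train loss keeps improving while
    val loss worsens, and the relative val-train gap exceeds gap_ratio."""
    n = min(len(train_vals), len(val_vals))
    if n < 2 * window:
        return False  # not enough history to compare two windows
    # Compare the mean of the previous window against the most recent one
    t_old = sum(train_vals[n - 2 * window:n - window]) / window
    t_new = sum(train_vals[n - window:n]) / window
    v_old = sum(val_vals[n - 2 * window:n - window]) / window
    v_new = sum(val_vals[n - window:n]) / window
    train_improving = t_new < t_old
    val_worsening = v_new > v_old
    gap = (v_new - t_new) / max(abs(t_new), 1e-8)
    return train_improving and val_worsening and gap > gap_ratio
```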
If overfitting detected:
"overfitting_warning" (do NOT kill the process — overfitting is a warning, not a hard failure)category: "overfitting", severity: "warning"${CLAUDE_PLUGIN_ROOT}/scripts/goal_memory.py <exp_root> log-behavior training_insight '{"insight":"Overfitting detected: <severity> at step <step>","source":"monitor"}'If divergence is detected:
Kill the training process:
```bash
# Option 1: PID file (preferred — most reliable)
kill $(cat <exp_root>/logs/<round_dir>/<exp_id>/pid) 2>/dev/null

# Option 2: Safe pattern match — verify process is a training process before killing
# First find candidates, then verify cmdline contains python/train before killing
for pid in $(pgrep -f "<exp_id>"); do
  cmdline=$(cat /proc/$pid/cmdline 2>/dev/null | tr '\0' ' ')
  if echo "$cmdline" | grep -qE 'python|train'; then
    kill "$pid"
  fi
done
```
Warning: Never use bare `pkill -f "<exp_id>"` — it could match unrelated processes.
Record the divergence:
status: "completed" or status: "failed", do NOT overwrite — the experiment finished first. Log to dev_notes: "Monitor detected divergence for <exp_id> but experiment already completed with status '<status>' — skipping overwrite." Skip to step 3.status: "running": update status to "diverged" and add divergence details to notes:
```json
{
  "status": "diverged",
  "notes": "Divergence detected: <reason> at step <step>"
}
```
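The guarded read-check-write can be sketched as follows. `record_divergence` is a hypothetical helper, and the result-file fields shown are only the ones this step touches:

```python
import json

def record_divergence(result_path, reason, step):
    """Update the result file to 'diverged' only if it is still 'running'.
    Returns a short description of the action taken, for logging."""
    with open(result_path) as f:
        result = json.load(f)
    if result.get("status") in ("completed", "failed"):
        # Experiment finished first; never overwrite a terminal status.
        return f"skipped: already {result['status']}"
    result["status"] = "diverged"
    result["notes"] = f"Divergence detected: {reason} at step {step}"
    with open(result_path, "w") as f:
        json.dump(result, f, indent=2)
    return "updated"
```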
```bash
python3 schema_validator.py <exp_root>/results/<exp_id>.json result
```

If validation fails, fix the JSON before continuing.

Log the event:
Append to `<exp_root>/dev_notes.md`:
```markdown
## Divergence Detected
- Experiment: <exp_id>
- Reason: <reason>
- Step: <step>
- Action: Training process killed
```
Log to error tracker:
```bash
python3 ${CLAUDE_PLUGIN_ROOT}/scripts/error_tracker.py <exp_root> log '{"category":"divergence","severity":"warning","source":"monitor","message":"<divergence reason>","exp_id":"<exp_id>","context":{"divergence_type":"<nan|explosion|plateau|drift>","step":<step>,"metric_to_watch":"<metric>"}}'
```
Log divergence pattern to behavioral memory:
When an experiment is killed due to divergence, also log the pattern to behavioral memory:
```bash
python3 ${CLAUDE_PLUGIN_ROOT}/scripts/goal_memory.py <exp_root> log-behavior divergence_pattern '{"description":"<metric> diverged at step <N> with config <config>","affected_branches":["<branch>"],"threshold":{"parameter":"<hp>","value":<val>},"source":"monitor"}'
```
This helps future hp-tune iterations avoid configs that trigger divergence.
For healthy experiments, report status:
```
Monitoring status:
- exp-001: healthy (loss=0.45 at step 500, trending down)
- exp-002: DIVERGED (NaN at step 350) - process killed
- exp-003: healthy (loss=0.52 at step 480, trending down)
```
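A sketch of how the "trending down" annotation in the report above might be computed. This is an assumed heuristic (last 10 points, 2% tolerance), not the plugin's actual formatter:

```python
def trend_label(values, window=10, tol=0.02):
    """Classify the recent trend of a lower-is-better metric
    for the per-experiment summary line."""
    if len(values) < 2:
        return "insufficient data"
    recent = values[-window:]
    first, last = recent[0], recent[-1]
    if last < first * (1 - tol):
        return "trending down"
    if last > first * (1 + tol):
        return "trending up"
    return "flat"
```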
The monitor exits when:
See also hp-tune/references/tuning-strategy.md for per-model-type HP guidance that informs threshold selection.
| Model Type | Explosion Threshold | Plateau Patience | Notes |
|---|---|---|---|
| CNN (classification) | 5.0 | 20 | Standard defaults |
| Transformer | 10.0 | 30 | Loss can be spikier |
| GAN | 20.0 | 50 | Inherently noisy training |
| Diffusion model | 10.0 | 40 | Slow convergence is normal |
| Fine-tuning | 3.0 | 15 | Should converge faster |
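The table above can be encoded as a simple lookup, sketched here. The dict layout and `thresholds_for` helper are hypothetical; the plugin's `get_thresholds_for_category` may organize this differently:

```python
# Per-model-type defaults from the table above (illustrative encoding).
THRESHOLDS = {
    "cnn":         {"explosion_threshold": 5.0,  "plateau_patience": 20},
    "transformer": {"explosion_threshold": 10.0, "plateau_patience": 30},
    "gan":         {"explosion_threshold": 20.0, "plateau_patience": 50},
    "diffusion":   {"explosion_threshold": 10.0, "plateau_patience": 40},
    "fine_tuning": {"explosion_threshold": 3.0,  "plateau_patience": 15},
}

def thresholds_for(model_type):
    # Unknown types fall back to the standard CNN defaults
    return THRESHOLDS.get(model_type, THRESHOLDS["cnn"])
```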
When model_category = "rl", apply these adjustments:
| Metric | Polarity | Divergence Signal |
|---|---|---|
| policy_loss / actor_loss | lower is better | Standard: NaN/explosion/plateau |
| value_loss / critic_loss | lower is better | Standard: NaN/explosion/plateau |
| reward / episode_return | higher is better | Collapse: drops >50% from rolling max over 100 episodes |
| entropy | context-dependent | Entropy collapse: drops below 0.01 |
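The reward-collapse rule from the table (latest return drops more than 50% below the rolling max over 100 episodes) can be sketched as follows. `reward_collapsed` is an illustrative helper, not the plugin's code:

```python
def reward_collapsed(returns, window=100, drop=0.5):
    """Collapse signal for higher-is-better metrics: the latest episode
    return fell more than `drop` below the rolling max over `window` episodes."""
    if len(returns) < window:
        return False  # not enough episodes for a rolling max yet
    rolling_max = max(returns[-window:])
    if rolling_max <= 0:
        return False  # relative drop is ill-defined for non-positive maxima
    return returns[-1] < (1.0 - drop) * rolling_max
```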
When divergence_metric is a reward metric (lower_is_better = False):
If a log file does not yet exist, wait before reporting an error (wait longer for `code_branch` experiments — worktree setup, dataset downloads, and dependency resolution add significant startup time). If the log file still doesn't exist after the wait period, check whether the experiment process is still alive (via PID file at `<exp_root>/logs/<round_dir>/<exp_id>/pid`). If the process is alive, extend the wait by another 60 seconds. If the process is dead or no PID file exists, report error immediately.

If the log file exists but stays empty, report status `"no_output"` with reason "Log file empty after 5 minutes — training may have stalled". Log to error tracker with `category: "training_failure"`, `severity: "warning"`, `source: "monitor"`.

Return to the orchestrator a dict per experiment:
- `exp_id`: Experiment identifier
- `status`: One of `healthy`, `diverged`, `completed`, `failed`, `no_output`
- `reason`: Divergence reason (if diverged) or null
- `step`: Step at which divergence was detected (if diverged) or -1
- `latest_metrics`: Dict of most recent metric values
- `metric_trajectory`: List of watched metric values over time

When the watched metric is not found after fallbacks (Step 2b.1):
- `metric_trajectory`: `[]` (empty — watched metric was never parsed)
- `latest_metrics`: All other available metrics from the final log line (the watched metric will be absent from this dict)
- `reason`: "Watched metric '<name>' not found; available: [<list>]"

Important: These status values (`healthy`, `no_output`) are internal monitor output for the orchestrator only. They must NOT be written to experiment result JSON files. Result files use: `completed`, `failed`, `diverged`, `timeout`.
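The internal/result-file split can be enforced with a small guard before any write, sketched here as a hypothetical helper:

```python
# Monitor-internal statuses vs. statuses allowed in result JSON files.
INTERNAL_ONLY = {"healthy", "no_output"}
RESULT_FILE_STATUSES = {"completed", "failed", "diverged", "timeout"}

def assert_writable_status(status):
    """Refuse to write monitor-internal statuses into a result file."""
    if status in INTERNAL_ONLY or status not in RESULT_FILE_STATUSES:
        raise ValueError(f"status '{status}' must not be written to result JSON")
    return status
```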