Proactive fleet health monitoring — channel checks, threshold alerts, automatic repair triggers
Watchdog doesn't just monitor — it heals. Continuous background monitor that checks fleet health, triggers repairs before humans notice problems, and redirects idle sessions to channel repair work. Runs as part of the Fleet Nerve daemon.
| Check | Interval | Threshold | Action on failure |
|---|---|---|---|
| Peer heartbeat | 10s | Missing >60s | Mark offline, trigger wake |
| NATS connectivity | 30s | 2 consecutive failures | Trigger L1 repair |
| API key sources | 300s | Key using fallback (aws/none) | Notify fleet, recommend keychain setup |
| HTTP reachability | 30s | 3 consecutive failures | Trigger L2 repair |
| Daemon process | 60s | Not running | Restart via LaunchAgent |
| Tunnel health | 60s | autossh dead or latency >5s | Restart tunnel |
| Disk space | 300s | /tmp/atlas-agent-results/ >1GB | Prune old results |
| Task timeout | 30s | Running >600s | Kill + report failure |
| Self idle | 30s | No NATS activity >5min | Check inbox/seed, broadcast idle, git pull |
| Peer idle | 30s | Heartbeat >10min or 0 sessions | Send idle-wake (max 1 per peer per 30min) |
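The table above maps naturally onto a scheduler loop: each check has its own interval and a consecutive-failure threshold that gates its action. A minimal sketch of that pattern, with all names illustrative rather than taken from the actual Fleet Nerve implementation:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Check:
    name: str
    interval: int                   # seconds between runs
    fail_threshold: int             # consecutive failures before acting
    probe: Callable[[], bool]       # returns True when healthy
    on_failure: Callable[[], None]  # the "Action on failure" column
    failures: int = 0
    last_run: float = float("-inf")

def run_due_checks(checks: list[Check], now: float) -> list[str]:
    """Run every check whose interval has elapsed; fire actions on threshold."""
    triggered = []
    for c in checks:
        if now - c.last_run < c.interval:
            continue
        c.last_run = now
        if c.probe():
            c.failures = 0  # healthy probe resets the failure streak
        else:
            c.failures += 1
            if c.failures >= c.fail_threshold:
                c.on_failure()
                triggered.append(c.name)
                c.failures = 0  # reset after acting, like a repair attempt
    return triggered
```

With this shape, a NATS check (interval 30s, threshold 2) triggers its repair action on the second consecutive failed probe.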
```bash
# Current watchdog state
curl -s http://127.0.0.1:8855/watchdog/status | python3 -m json.tool
```
Response:

```json
{
  "running": true,
  "uptimeSeconds": 86400,
  "checks": {
    "heartbeat": {"status": "ok", "lastRun": "2s ago", "failures": 0},
    "nats": {"status": "ok", "lastRun": "15s ago", "failures": 0},
    "http": {"status": "degraded", "lastRun": "8s ago", "failures": 2, "target": "node2"},
    "tunnel": {"status": "n/a", "reason": "tunnel not enabled"},
    "disk": {"status": "ok", "lastRun": "120s ago", "usageMb": 45}
  },
  "repairsTriggered": 3,
  "lastRepair": "2h ago — node2 http L1"
}
```
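A small helper can surface anything not healthy in that response. This sketch assumes only the field names visible in the example above (`checks`, `status`, `failures`, `target`), not a published schema:

```python
import json

def unhealthy_checks(status: dict) -> list[str]:
    """Return a readable line for every check not in 'ok' (or 'n/a') state."""
    lines = []
    for name, check in status.get("checks", {}).items():
        state = check.get("status", "unknown")
        if state in ("ok", "n/a"):
            continue
        target = check.get("target", "-")
        lines.append(f"{name}: {state} (failures={check.get('failures', 0)}, target={target})")
    return lines

# Example using the response shape shown above:
raw = '''{"checks": {"heartbeat": {"status": "ok", "failures": 0},
           "http": {"status": "degraded", "failures": 2, "target": "node2"},
           "tunnel": {"status": "n/a", "reason": "tunnel not enabled"}}}'''
print(unhealthy_checks(json.loads(raw)))
# → ['http: degraded (failures=2, target=node2)']
```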
```bash
# Recent watchdog events (last 50)
curl -s http://127.0.0.1:8855/watchdog/log | python3 -m json.tool

# Filter by severity
curl -s "http://127.0.0.1:8855/watchdog/log?severity=warn" | python3 -m json.tool
```
Log levels: `info` (routine checks), `warn` (threshold approached), `error` (check failed), `repair` (auto-repair triggered).
In `.multifleet/config.json`:

```json
{
  "watchdog": {
    "enabled": true,
    "heartbeatStaleSeconds": 60,
    "natsFailThreshold": 2,
    "httpFailThreshold": 3,
    "taskTimeoutSeconds": 600,
    "diskLimitMb": 1024,
    "autoRepairLevel": "assist",
    "logRetentionHours": 48
  }
}
```
| Key | Default | Description |
|---|---|---|
| `enabled` | `true` | Enable watchdog monitoring |
| `heartbeatStaleSeconds` | `60` | Mark peer offline after this many seconds |
| `natsFailThreshold` | `2` | Consecutive NATS failures before repair |
| `httpFailThreshold` | `3` | Consecutive HTTP failures before repair |
| `taskTimeoutSeconds` | `600` | Kill tasks running longer than this |
| `diskLimitMb` | `1024` | Prune agent results above this size |
| `autoRepairLevel` | `"assist"` | Maximum auto-repair level (notify/guide/assist) |
| `logRetentionHours` | `48` | How long to keep watchdog logs |
| `idleThresholdSeconds` | `300` | Mark self idle after this many seconds without NATS activity |
| `peerIdleThresholdSeconds` | `600` | Consider a peer idle after this many seconds |
| `idleWakeCooldownSeconds` | `1800` | Minimum interval between idle-wake messages to the same peer |
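A config loader would typically overlay the user's `watchdog` section on these defaults so missing keys fall back safely. A sketch under that assumption (the defaults dict mirrors the table above; the loader itself is illustrative):

```python
# Defaults mirroring the key table above (hypothetical loader, not the
# actual Fleet Nerve code).
WATCHDOG_DEFAULTS = {
    "enabled": True,
    "heartbeatStaleSeconds": 60,
    "natsFailThreshold": 2,
    "httpFailThreshold": 3,
    "taskTimeoutSeconds": 600,
    "diskLimitMb": 1024,
    "autoRepairLevel": "assist",
    "logRetentionHours": 48,
    "idleThresholdSeconds": 300,
    "peerIdleThresholdSeconds": 600,
    "idleWakeCooldownSeconds": 1800,
}

def load_watchdog_config(user_cfg: dict) -> dict:
    """Overlay the user's 'watchdog' section on the defaults, then validate."""
    merged = dict(WATCHDOG_DEFAULTS)
    merged.update(user_cfg.get("watchdog", {}))
    if merged["autoRepairLevel"] not in ("notify", "guide", "assist"):
        raise ValueError(f"invalid autoRepairLevel: {merged['autoRepairLevel']}")
    return merged
```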
| Variable | Description |
|---|---|
| `FLEET_WATCHDOG_DISABLED` | Set to `1` to disable the watchdog entirely |
| `FLEET_WATCHDOG_INTERVAL` | Override the base check interval (seconds) |
Idle sessions are redirected to channel repair first. When the watchdog detects a session idle >5 minutes, it checks for degraded channels (via the `/channels` endpoint) and assigns repair work. A 5-minute cooldown applies between repair attempts per peer. This prevents repair storms when a peer is genuinely unreachable. The cooldown resets when a channel state changes.
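The per-peer cooldown with state-change reset can be sketched as a small tracker. A minimal illustration, assuming the 5-minute window described above (class and method names are hypothetical):

```python
class RepairCooldown:
    """Per-peer cooldown: at most one repair attempt per window,
    reset whenever the peer's channel state changes."""

    def __init__(self, window_seconds: float = 300.0):
        self.window = window_seconds
        self.last_attempt: dict[str, float] = {}
        self.last_state: dict[str, str] = {}

    def allow(self, peer: str, channel_state: str, now: float) -> bool:
        if self.last_state.get(peer) != channel_state:
            # State changed: forget the previous attempt for this peer.
            self.last_state[peer] = channel_state
            self.last_attempt.pop(peer, None)
        prev = self.last_attempt.get(peer)
        if prev is not None and now - prev < self.window:
            return False  # still cooling down, skip this repair
        self.last_attempt[peer] = now
        return True
```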
# Check idle session redirection state
curl -s http://127.0.0.1:8855/watchdog/idle | python3 -m json.tool
# Shows: which sessions are idle, what repair work they've been assigned, cooldown timers
Watchdog is the trigger source for the repair escalation system. When a check crosses its failure threshold:

- A repair is triggered, capped at the configured `autoRepairLevel`
- Repairs are executed via the fleet-repair skill

The `autoRepairLevel` cap prevents the watchdog from executing remote commands without human oversight. Set it to `"notify"` for monitoring-only mode.
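One way to read the cap, assuming the levels escalate notify < guide < assist (this ordering is an inference from the config docs, not a confirmed detail of the escalation system):

```python
# Assumed escalation order: notify is least autonomous, assist is most.
LEVELS = ["notify", "guide", "assist"]

def effective_repair_level(requested: str, cap: str) -> str:
    """Clamp a requested repair level to the configured autoRepairLevel cap."""
    if requested not in LEVELS or cap not in LEVELS:
        raise ValueError("unknown repair level")
    return LEVELS[min(LEVELS.index(requested), LEVELS.index(cap))]
```

With the cap at `"notify"`, even a repair that wants `"assist"` autonomy is reduced to a notification.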
Watchdog enforces the fleet-wide invariant: the fleet is not healthy until ALL nodes have P1+P2 operational.
When a node has no NATS messages sent or received for >5 minutes, the watchdog activates idle proactivity.
- Checks `.fleet-messages/<node>/` for unprocessed `.md` files
- Checks `/tmp/fleet-seed-<node>.md` for pending seed messages
- Broadcasts idle status (a `fleet.all.message` with type `idle`)
- Logs at `info` level for observability
- Peers reporting `sessions: 0` in heartbeat data, or whose last heartbeat is >10 minutes old, are considered idle
- Idle peers receive an idle-wake message: "You've been idle for Xmin. Any pending fleet work? Check plans."
- Uses `self.node_id` and heartbeat data only, with no hardcoded node names
- `_touch_activity()` is called on every message send and receive

The watchdog detects idle peers and converts downtime into productive work. It is built on top of the heartbeat check already running every 10s.
How it works:

- Scans `docs/plans/` (incomplete markers)

Implementation reference: extends the existing heartbeat check in the watchdog loop. The idle state is derived from heartbeat metadata fields (`lastTaskCompleted`, `activeTask`). Nudges are sent via the same NATS/HTTP channels used for fleet messaging.
```bash
# Check idle status of all peers
curl -s http://127.0.0.1:8855/watchdog/status | python3 -c "
import json, sys; d=json.load(sys.stdin)
for p,s in d.get('peers',{}).items():
    print(f'{p}: idle={s.get(\"idleSeconds\",\"?\")}s lastTask={s.get(\"lastTask\",\"none\")}')
"
```
When idle, the watchdog shifts from monitoring to productive work generation.
Task discovery sources (checked in order):
| Source | What to look for | Priority |
|---|---|---|
| `docs/plans/` | Incomplete items, TODOs, unstarted phases | High |
| `.fleet-messages/` | Unanswered proposals, pending reviews | High |
| `git log --since=24h` | Recent changes lacking tests or docs | Medium |
| `docs/inbox/` | Unprocessed raw ideas needing triage | Low |
| Plugin/skill health | Broken or degraded capabilities | Medium |
What the watchdog does with discovered tasks:
- Records the selected task as the current focus (in `.claude/focus.json`)

```bash
# View generated task suggestions
curl -s http://127.0.0.1:8855/watchdog/suggestions | python3 -m json.tool

# View background investigation results
ls /tmp/atlas-agent-results/watchdog-*.json
```
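The highest-priority discovery source is `docs/plans/`, scanned for incomplete markers. A sketch of what such a scanner might look for (the marker patterns here are assumptions, chosen to match common plan-file conventions like unchecked boxes and TODO notes):

```python
import re

# Match unchecked markdown task boxes ("- [ ] ...") and "TODO:" notes.
INCOMPLETE = re.compile(r"^\s*- \[ \] (.+)$|\bTODO[:\s](.+)$", re.MULTILINE)

def find_incomplete_items(plan_text: str) -> list[str]:
    """Return the text of every incomplete marker found in a plan file."""
    items = []
    for m in INCOMPLETE.finditer(plan_text):
        items.append((m.group(1) or m.group(2)).strip())
    return items
```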
Idle time is study time. The watchdog uses downtime to build and maintain system-wide understanding.
What the watchdog studies:
- Channel health (3s probe), reporting degradation
- `git diff --stat` against test directories to find untested changes
- `.fleet-messages/` for fleet review

Awareness outputs are written to:

- `/tmp/atlas-agent-results/watchdog-awareness-<timestamp>.json` (detailed findings)

Idle detection and proactive work are rate-limited to prevent noise and interruption of active work.
| Action | Limit | Rationale |
|---|---|---|
| Productive nudge per peer | 1 per 5 minutes | Avoid nagging — one nudge is enough |
| Self-assigned background task | 1 per 10 minutes | Prevent resource contention on local machine |
| Fleet-wide idle proposal | 1 per 15 minutes | Coordinated work needs breathing room |
| Background agent spawn | Max 3 concurrent | Stay within crash-prevention limits (CLAUDE.md: max 10 agents) |
| Big picture scan | 1 per 30 minutes | Study is low-priority, yields to any real work |
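The limits in the table combine per-action minimum intervals with a hard concurrency cap on background agents. A sketch of that combination (action names and the class are illustrative; the numbers mirror the table above):

```python
class ProactivityLimiter:
    """Rate limits from the table: interval-gated actions keyed per peer,
    plus a hard cap on concurrent background agents."""

    INTERVALS = {
        "nudge": 300,           # 1 per 5 min per peer
        "self_task": 600,       # 1 per 10 min
        "fleet_proposal": 900,  # 1 per 15 min
        "big_picture": 1800,    # 1 per 30 min
    }
    MAX_AGENTS = 3

    def __init__(self):
        self.last: dict[tuple[str, str], float] = {}
        self.active_agents = 0

    def allow(self, action: str, key: str, now: float) -> bool:
        if action == "spawn_agent":
            return self.active_agents < self.MAX_AGENTS
        prev = self.last.get((action, key))
        if prev is not None and now - prev < self.INTERVALS[action]:
            return False
        self.last[(action, key)] = now
        return True
```

Keying nudges by peer means a quiet `node3` can still be nudged even when `node2` was nudged seconds earlier.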
Hard rules:
- If `activeTask` is set, suppress all idle suggestions for that peer

Configuration (in `.multifleet/config.json`):
```json
{
  "watchdog": {
    "idleDetection": {
      "enabled": true,
      "idleThresholdSeconds": 300,
      "nudgeIntervalSeconds": 300,
      "selfTaskIntervalSeconds": 600,
      "maxBackgroundAgents": 3,
      "bigPictureScanIntervalSeconds": 1800
    }
  }
}
```