Proactive fleet health monitoring — channel checks, threshold alerts, automatic repair triggers
Watchdog doesn't just monitor — it heals. Continuous background monitor that checks fleet health, triggers repairs before humans notice problems, and redirects idle sessions to channel repair work. Runs as part of the Fleet Nerve daemon.
| Check | Interval | Threshold | Action on failure |
|---|---|---|---|
| Peer heartbeat | 10s | Missing >60s | Mark offline, trigger wake |
| NATS connectivity | 30s | 2 consecutive failures | Trigger L1 repair |
| API key sources | 300s | Key using fallback (aws/none) | Notify fleet, recommend keychain setup |
| HTTP reachability | 30s | 3 consecutive failures | Trigger L2 repair |
| Daemon process | 60s | Not running | Restart via LaunchAgent |
| Tunnel health | 60s | autossh dead or latency >5s | Restart tunnel |
| Disk space | 300s | /tmp/atlas-agent-results/ >1GB | Prune old results |
| Task timeout | 30s | Running >600s | Kill + report failure |
| Self idle | 30s | No NATS activity >5min | Check inbox/seed, broadcast idle, git pull |
| Peer idle | 30s | Heartbeat >10min or 0 sessions | Send idle-wake (max 1 per peer per 30min) |
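The table above maps naturally onto a scheduler loop: each check has its own interval and a consecutive-failure threshold that gates its action. A minimal sketch of that pattern, with all names illustrative rather than taken from the actual Fleet Nerve implementation:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Check:
    name: str
    interval: int                   # seconds between runs
    fail_threshold: int             # consecutive failures before acting
    probe: Callable[[], bool]       # returns True when healthy
    on_failure: Callable[[], None]  # the "Action on failure" column
    failures: int = 0
    last_run: float = float("-inf")

def run_due_checks(checks: list[Check], now: float) -> list[str]:
    """Run every check whose interval has elapsed; fire actions on threshold."""
    triggered = []
    for c in checks:
        if now - c.last_run < c.interval:
            continue
        c.last_run = now
        if c.probe():
            c.failures = 0  # healthy probe resets the failure streak
        else:
            c.failures += 1
            if c.failures >= c.fail_threshold:
                c.on_failure()
                triggered.append(c.name)
                c.failures = 0  # reset after acting, like a repair attempt
    return triggered
```

With this shape, a NATS check (interval 30s, threshold 2) triggers its repair action on the second consecutive failed probe.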
```bash
# Current watchdog state
curl -s http://127.0.0.1:8855/watchdog/status | python3 -m json.tool
```
Response:

```json
{
  "running": true,
  "uptimeSeconds": 86400,
  "checks": {
    "heartbeat": {"status": "ok", "lastRun": "2s ago", "failures": 0},
    "nats": {"status": "ok", "lastRun": "15s ago", "failures": 0},
    "http": {"status": "degraded", "lastRun": "8s ago", "failures": 2, "target": "node2"},
    "tunnel": {"status": "n/a", "reason": "tunnel not enabled"},
    "disk": {"status": "ok", "lastRun": "120s ago", "usageMb": 45}
  },
  "repairsTriggered": 3,
  "lastRepair": "2h ago — node2 http L1"
}
```
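A small helper can surface anything not healthy in that response. This sketch assumes only the field names visible in the example above (`checks`, `status`, `failures`, `target`), not a published schema:

```python
import json

def unhealthy_checks(status: dict) -> list[str]:
    """Return a readable line for every check not in 'ok' (or 'n/a') state."""
    lines = []
    for name, check in status.get("checks", {}).items():
        state = check.get("status", "unknown")
        if state in ("ok", "n/a"):
            continue
        target = check.get("target", "-")
        lines.append(f"{name}: {state} (failures={check.get('failures', 0)}, target={target})")
    return lines

# Example using the response shape shown above:
raw = '''{"checks": {"heartbeat": {"status": "ok", "failures": 0},
           "http": {"status": "degraded", "failures": 2, "target": "node2"},
           "tunnel": {"status": "n/a", "reason": "tunnel not enabled"}}}'''
print(unhealthy_checks(json.loads(raw)))
# → ['http: degraded (failures=2, target=node2)']
```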
```bash
# Recent watchdog events (last 50)
curl -s http://127.0.0.1:8855/watchdog/log | python3 -m json.tool

# Filter by severity
curl -s "http://127.0.0.1:8855/watchdog/log?severity=warn" | python3 -m json.tool
```
Log levels: `info` (routine checks), `warn` (threshold approached), `error` (check failed), `repair` (auto-repair triggered).
In `.multifleet/config.json`:

```json
{
  "watchdog": {
    "enabled": true,
    "heartbeatStaleSeconds": 60,
    "natsFailThreshold": 2,
    "httpFailThreshold": 3,
    "taskTimeoutSeconds": 600,
    "diskLimitMb": 1024,
    "autoRepairLevel": "assist",
    "logRetentionHours": 48
  }
}
```
| Key | Default | Description |
|---|---|---|
| `enabled` | `true` | Enable watchdog monitoring |
| `heartbeatStaleSeconds` | `60` | Mark peer offline after this many seconds |
| `natsFailThreshold` | `2` | Consecutive NATS failures before repair |
| `httpFailThreshold` | `3` | Consecutive HTTP failures before repair |
| `taskTimeoutSeconds` | `600` | Kill tasks running longer than this |
| `diskLimitMb` | `1024` | Prune agent results above this size |
| `autoRepairLevel` | `"assist"` | Maximum auto-repair level (notify/guide/assist) |
| `logRetentionHours` | `48` | How long to keep watchdog logs |
| `idleThresholdSeconds` | `300` | Mark self idle after this many seconds without NATS activity |
| `peerIdleThresholdSeconds` | `600` | Consider a peer idle after this many seconds |
| `idleWakeCooldownSeconds` | `1800` | Minimum interval between idle-wake messages to the same peer |
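A config loader would typically overlay the user's `watchdog` section on these defaults so missing keys fall back safely. A sketch under that assumption (the defaults dict mirrors the table above; the loader itself is illustrative):

```python
# Defaults mirroring the key table above (hypothetical loader, not the
# actual Fleet Nerve code).
WATCHDOG_DEFAULTS = {
    "enabled": True,
    "heartbeatStaleSeconds": 60,
    "natsFailThreshold": 2,
    "httpFailThreshold": 3,
    "taskTimeoutSeconds": 600,
    "diskLimitMb": 1024,
    "autoRepairLevel": "assist",
    "logRetentionHours": 48,
    "idleThresholdSeconds": 300,
    "peerIdleThresholdSeconds": 600,
    "idleWakeCooldownSeconds": 1800,
}

def load_watchdog_config(user_cfg: dict) -> dict:
    """Overlay the user's 'watchdog' section on the defaults, then validate."""
    merged = dict(WATCHDOG_DEFAULTS)
    merged.update(user_cfg.get("watchdog", {}))
    if merged["autoRepairLevel"] not in ("notify", "guide", "assist"):
        raise ValueError(f"invalid autoRepairLevel: {merged['autoRepairLevel']}")
    return merged
```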
| Variable | Description |
|---|---|
| `FLEET_WATCHDOG_DISABLED` | Set to `1` to disable the watchdog entirely |
| `FLEET_WATCHDOG_INTERVAL` | Override the base check interval (seconds) |
Idle sessions are redirected to channel repair first. When the watchdog detects a session idle >5 minutes, it checks for degraded channels (via the `/channels` endpoint) and assigns repair work. A 5-minute cooldown applies between repair attempts per peer. This prevents repair storms when a peer is genuinely unreachable. The cooldown resets when a channel state changes.
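The per-peer cooldown with state-change reset can be sketched as a small tracker. A minimal illustration, assuming the 5-minute window described above (class and method names are hypothetical):

```python
class RepairCooldown:
    """Per-peer cooldown: at most one repair attempt per window,
    reset whenever the peer's channel state changes."""

    def __init__(self, window_seconds: float = 300.0):
        self.window = window_seconds
        self.last_attempt: dict[str, float] = {}
        self.last_state: dict[str, str] = {}

    def allow(self, peer: str, channel_state: str, now: float) -> bool:
        if self.last_state.get(peer) != channel_state:
            # State changed: forget the previous attempt for this peer.
            self.last_state[peer] = channel_state
            self.last_attempt.pop(peer, None)
        prev = self.last_attempt.get(peer)
        if prev is not None and now - prev < self.window:
            return False  # still cooling down, skip this repair
        self.last_attempt[peer] = now
        return True
```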
# Check idle session redirection state
curl -s http://127.0.0.1:8855/watchdog/idle | python3 -m json.tool
# Shows: which sessions are idle, what repair work they've been assigned, cooldown timers
Watchdog is the trigger source for the repair escalation system. When a check crosses its failure threshold:

- A repair is triggered, capped at the configured `autoRepairLevel`
- Repairs are executed via the fleet-repair skill

The `autoRepairLevel` cap prevents the watchdog from executing remote commands without human oversight. Set it to `"notify"` for monitoring-only mode.
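One way to read the cap, assuming the levels escalate notify < guide < assist (this ordering is an inference from the config docs, not a confirmed detail of the escalation system):

```python
# Assumed escalation order: notify is least autonomous, assist is most.
LEVELS = ["notify", "guide", "assist"]

def effective_repair_level(requested: str, cap: str) -> str:
    """Clamp a requested repair level to the configured autoRepairLevel cap."""
    if requested not in LEVELS or cap not in LEVELS:
        raise ValueError("unknown repair level")
    return LEVELS[min(LEVELS.index(requested), LEVELS.index(cap))]
```

With the cap at `"notify"`, even a repair that wants `"assist"` autonomy is reduced to a notification.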
Watchdog enforces the fleet-wide invariant: the fleet is not healthy until ALL nodes have P1+P2 operational.
When a node has no NATS messages sent or received for >5 minutes, the watchdog activates idle proactivity.
- Checks `.fleet-messages/<node>/` for unprocessed `.md` files
- Checks `/tmp/fleet-seed-<node>.md` for pending seed messages
- Broadcasts idle status (a `fleet.all.message` with type `idle`)
- Logs at `info` level for observability
- Peers reporting `sessions: 0` in heartbeat data, or whose last heartbeat is >10 minutes old, are considered idle
- Idle peers receive an idle-wake message: "You've been idle for Xmin. Any pending fleet work? Check plans."
- Uses `self.node_id` and heartbeat data only, with no hardcoded node names
- `_touch_activity()` is called on every message send and receive

The watchdog detects idle peers and converts downtime into productive work. It is built on top of the heartbeat check already running every 10s.
How it works:

- Scans `docs/plans/` (incomplete markers)

Implementation reference: extends the existing heartbeat check in the watchdog loop. The idle state is derived from heartbeat metadata fields (`lastTaskCompleted`, `activeTask`). Nudges are sent via the same NATS/HTTP channels used for fleet messaging.
```bash
# Check idle status of all peers
curl -s http://127.0.0.1:8855/watchdog/status | python3 -c "
import json, sys; d=json.load(sys.stdin)
for p,s in d.get('peers',{}).items():
    print(f'{p}: idle={s.get(\"idleSeconds\",\"?\")}s lastTask={s.get(\"lastTask\",\"none\")}')
"
```
When idle, the watchdog shifts from monitoring to productive work generation.
Task discovery sources (checked in order):
| Source | What to look for | Priority |
|---|---|---|
| `docs/plans/` | Incomplete items, TODOs, unstarted phases | High |
| `.fleet-messages/` | Unanswered proposals, pending reviews | High |
| `git log --since=24h` | Recent changes lacking tests or docs | Medium |
| `docs/inbox/` | Unprocessed raw ideas needing triage | Low |
| Plugin/skill health | Broken or degraded capabilities | Medium |
What the watchdog does with discovered tasks:
- Records the selected task as the current focus (in `.claude/focus.json`)

```bash
# View generated task suggestions
curl -s http://127.0.0.1:8855/watchdog/suggestions | python3 -m json.tool

# View background investigation results
ls /tmp/atlas-agent-results/watchdog-*.json
```
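The highest-priority discovery source is `docs/plans/`, scanned for incomplete markers. A sketch of what such a scanner might look for (the marker patterns here are assumptions, chosen to match common plan-file conventions like unchecked boxes and TODO notes):

```python
import re

# Match unchecked markdown task boxes ("- [ ] ...") and "TODO:" notes.
INCOMPLETE = re.compile(r"^\s*- \[ \] (.+)$|\bTODO[:\s](.+)$", re.MULTILINE)

def find_incomplete_items(plan_text: str) -> list[str]:
    """Return the text of every incomplete marker found in a plan file."""
    items = []
    for m in INCOMPLETE.finditer(plan_text):
        items.append((m.group(1) or m.group(2)).strip())
    return items
```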
Idle time is study time. The watchdog uses downtime to build and maintain system-wide understanding.
What the watchdog studies:
- Channel health (3s probe), reporting degradation
- `git diff --stat` against test directories to find untested changes
- `.fleet-messages/` for fleet review

Awareness outputs are written to:

- `/tmp/atlas-agent-results/watchdog-awareness-<timestamp>.json` (detailed findings)

Idle detection and proactive work are rate-limited to prevent noise and interruption of active work.
| Action | Limit | Rationale |
|---|---|---|
| Productive nudge per peer | 1 per 5 minutes | Avoid nagging — one nudge is enough |
| Self-assigned background task | 1 per 10 minutes | Prevent resource contention on local machine |
| Fleet-wide idle proposal | 1 per 15 minutes | Coordinated work needs breathing room |
| Background agent spawn | Max 3 concurrent | Stay within crash-prevention limits (CLAUDE.md: max 10 agents) |
| Big picture scan | 1 per 30 minutes | Study is low-priority, yields to any real work |
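The limits in the table combine per-action minimum intervals with a hard concurrency cap on background agents. A sketch of that combination (action names and the class are illustrative; the numbers mirror the table above):

```python
class ProactivityLimiter:
    """Rate limits from the table: interval-gated actions keyed per peer,
    plus a hard cap on concurrent background agents."""

    INTERVALS = {
        "nudge": 300,           # 1 per 5 min per peer
        "self_task": 600,       # 1 per 10 min
        "fleet_proposal": 900,  # 1 per 15 min
        "big_picture": 1800,    # 1 per 30 min
    }
    MAX_AGENTS = 3

    def __init__(self):
        self.last: dict[tuple[str, str], float] = {}
        self.active_agents = 0

    def allow(self, action: str, key: str, now: float) -> bool:
        if action == "spawn_agent":
            return self.active_agents < self.MAX_AGENTS
        prev = self.last.get((action, key))
        if prev is not None and now - prev < self.INTERVALS[action]:
            return False
        self.last[(action, key)] = now
        return True
```

Keying nudges by peer means a quiet `node3` can still be nudged even when `node2` was nudged seconds earlier.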
Hard rules:
- If `activeTask` is set, suppress all idle suggestions for that peer

Configuration (in `.multifleet/config.json`):
```json
{
  "watchdog": {
    "idleDetection": {
      "enabled": true,
      "idleThresholdSeconds": 300,
      "nudgeIntervalSeconds": 300,
      "selfTaskIntervalSeconds": 600,
      "maxBackgroundAgents": 3,
      "bigPictureScanIntervalSeconds": 1800
    }
  }
}
```