# 4-level repair escalation for broken fleet channels — notify, guide, assist, remote
Repair is not optional — it is triggered automatically when P3+ channels are used for delivery. When a message falls back to P3+ channels, it means P1-P2 are broken. The repair system diagnoses and fixes connectivity issues through a 4-level escalation chain.
Goal: get every peer to P1+P2 operational. The fleet is not healthy until all nodes can communicate via P1 (NATS) and P2 (HTTP direct). Every other channel is a fallback, and relying on fallbacks means degraded performance and reliability.
| Level | Action | Automation | Human needed? |
|---|---|---|---|
| L1 Notify | Log the failure, alert the fleet dashboard | Fully automatic | No |
| L2 Guide | Send repair instructions to the target node's session | Fully automatic | Human reads instructions |
| L3 Assist | Spawn a repair agent on the target via SSH | Fully automatic | No |
| L4 Remote | Execute repair commands directly via SSH | Requires approval | Yes (approve) |
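The escalation table can be modeled as a small lookup. The `RepairLevel` structure and function below are assumptions for illustration, not the actual implementation:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RepairLevel:
    name: str        # level name from the table above
    action: str      # what the level does
    automatic: bool  # True if it runs without human approval

# Hypothetical encoding of the L1-L4 escalation table
LEVELS = [
    RepairLevel("notify", "log failure, alert fleet dashboard", True),
    RepairLevel("guide",  "send repair instructions to target session", True),
    RepairLevel("assist", "spawn repair agent on target via SSH", True),
    RepairLevel("remote", "execute repair commands via SSH", False),
]

def needs_approval(level_name: str) -> bool:
    """Only L4 (remote) requires a human to approve."""
    return not next(l for l in LEVELS if l.name == level_name).automatic
```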
```
Message delivery falls back to P3+
  → L1: Log to fleet dashboard + local repair log
  → L2: Send context message with repair steps to target node
  → Wait 120s — did P1/P2 recover?
  → L3: SSH into target, spawn repair agent (claude -p)
  → Wait 300s — did P1/P2 recover?
  → L4: Present remote fix commands to human for approval
```
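The flow above can be sketched as a loop. This is an illustrative sketch only: `trigger` and `probe` are hypothetical callables standing in for the real repair dispatch and the P1/P2 health check.

```python
import time

def escalate(node, trigger, probe, wait_guide=120, wait_assist=300):
    """Walk the escalation chain, stopping as soon as P1/P2 recover.

    trigger(node, level) fires one repair level; probe(node) returns True
    once P1 (NATS) and P2 (HTTP) are both healthy again.
    """
    trigger(node, "notify")        # L1: log + dashboard alert
    trigger(node, "guide")         # L2: send repair steps to the target
    time.sleep(wait_guide)         # wait 120s by default
    if probe(node):
        return "recovered after L2"
    trigger(node, "assist")        # L3: SSH in, spawn repair agent
    time.sleep(wait_assist)        # wait 300s by default
    if probe(node):
        return "recovered after L3"
    return "needs L4 (human approval)"
```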
Repairs trigger automatically when the fallback delivery system detects degraded channels. No manual action needed for L1-L3.
```bash
# Send a repair request to a specific node
curl -sf -X POST http://127.0.0.1:8855/repair -H "Content-Type: application/json" \
  -d '{"target": "<node-id>", "channel": "nats", "level": "assist"}'

# Check repair status
curl -s http://127.0.0.1:8855/repair/status | python3 -m json.tool
```
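A caller can poll `/repair/status` until work finishes. The payload shape assumed below (a `repairs` list with a `state` field per entry) is a guess at the API, so treat this as a sketch:

```python
import json
import time
import urllib.request

def repairs_pending(status: dict) -> bool:
    """True while any repair in the status payload is still running.
    The 'repairs' list and 'state' field names are assumptions."""
    return any(r.get("state") == "running" for r in status.get("repairs", []))

def wait_for_repairs(base="http://127.0.0.1:8855", timeout=600, poll=10):
    """Poll the repair status endpoint until nothing is pending."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        with urllib.request.urlopen(f"{base}/repair/status") as resp:
            status = json.load(resp)
        if not repairs_pending(status):
            return status
        time.sleep(poll)
    raise TimeoutError("repairs still pending after timeout")
```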
```bash
# Manual repair via the fleet CLI
python3 tools/fleet_nerve_nats.py repair <node-id>
python3 tools/fleet_nerve_nats.py repair <node-id> --level guide
python3 tools/fleet_nerve_nats.py repair <node-id> --channel http --level remote
```
L2 Guide sends a context message to the target with diagnostic steps:

- Check the nerve process is running (`pgrep -f fleet_nerve`)
- Check the HTTP port is bound (`lsof -i :8855`)
- Test NATS publishing (`nats pub test "ping"`)

L3 Assist spawns a `claude -p` repair agent with a diagnostic prompt on the target; agent output is written to `/tmp/atlas-agent-results/fleet-repair-<id>.log`.

L4 Remote executes fix commands directly via SSH (after approval):

```bash
ssh <user>@<host> "launchctl kickstart -k gui/$(id -u)/com.multifleet.nerve"
ssh <user>@<host> "brew services restart nats-server"
ssh <user>@<host> "bash ~/fleet/scripts/fleet-tunnel.sh restart"
```

All repair actions are logged to the fleet dashboard and locally:
```bash
# View recent repairs
curl -s http://127.0.0.1:8855/repair/log | python3 -m json.tool

# Filter by node
curl -s "http://127.0.0.1:8855/repair/log?node=<node-id>" | python3 -m json.tool
```
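For a quick overview, log entries can be tallied per node. The `node` and `result` field names are assumptions about the log entry shape:

```python
from collections import Counter

def summarize_repairs(entries):
    """Count (node, result) pairs across repair log entries."""
    tally = Counter()
    for e in entries:
        # 'result' may be absent for in-flight repairs; bucket as 'unknown'
        tally[(e["node"], e.get("result", "unknown"))] += 1
    return tally
```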
In `.multifleet/config.json`:

```json
{
  "repair": {
    "autoEscalate": true,
    "l2WaitSeconds": 120,
    "l3WaitSeconds": 300,
    "l4RequiresApproval": true,
    "maxRetriesPerHour": 3
  }
}
```
| Key | Default | Description |
|---|---|---|
| `autoEscalate` | `true` | Automatically escalate through levels |
| `l2WaitSeconds` | `120` | Wait time before escalating L2 to L3 |
| `l3WaitSeconds` | `300` | Wait time before escalating L3 to L4 |
| `l4RequiresApproval` | `true` | Require human approval for remote execution |
| `maxRetriesPerHour` | `3` | Rate limit to prevent repair storms |
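The `maxRetriesPerHour` guard can be implemented as a per-node sliding-window counter. The class below is a minimal sketch of that idea, not the shipped code:

```python
import time
from collections import deque

class RepairRateLimit:
    """Per-node sliding-window limiter mirroring maxRetriesPerHour."""

    def __init__(self, max_per_hour=3, window_seconds=3600):
        self.max = max_per_hour
        self.window = window_seconds
        self.attempts = {}  # node -> deque of attempt timestamps

    def allow(self, node, now=None):
        """Record an attempt; return False once the hourly cap is hit."""
        now = time.time() if now is None else now
        q = self.attempts.setdefault(node, deque())
        while q and now - q[0] >= self.window:  # drop attempts older than the window
            q.popleft()
        if len(q) >= self.max:
            return False  # would be a repair storm: refuse
        q.append(now)
        return True
```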
For fleet-wide repair, spawn background agents to fix broken channels across all peers:
```bash
# Check which peers have broken channels
curl -s http://127.0.0.1:8855/channels | python3 -c "
import sys, json
data = json.load(sys.stdin)
for peer, ch in data.get('peers', {}).items():
    broken = [p for p, v in ch.items() if v.get('status') == 'fail' and p in ('p1_nats','p2_http')]
    if broken:
        print(f'{peer}: {broken} — needs repair')
"
```
```bash
# Trigger repair for all broken peers
for node in $(curl -s http://127.0.0.1:8855/channels | python3 -c "
import sys, json
data = json.load(sys.stdin)
for peer, ch in data.get('peers', {}).items():
    if any(v.get('status') == 'fail' for p, v in ch.items() if p in ('p1_nats','p2_http')):
        print(peer)
"); do
  curl -sf -X POST http://127.0.0.1:8855/repair -H "Content-Type: application/json" \
    -d "{\"target\": \"$node\", \"level\": \"assist\"}"
done
```
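The two-step shell pipeline above can also run as a single Python pass. Endpoints and payload mirror the curl calls on this page; the function names are my own:

```python
import json
import urllib.request

BASE = "http://127.0.0.1:8855"

def broken_peers(channels: dict):
    """Yield peers whose P1 (NATS) or P2 (HTTP) probe reports 'fail'."""
    for peer, ch in channels.get("peers", {}).items():
        if any(v.get("status") == "fail"
               for p, v in ch.items() if p in ("p1_nats", "p2_http")):
            yield peer

def repair_all():
    """Fetch channel health, then request an assist-level repair per broken peer."""
    with urllib.request.urlopen(f"{BASE}/channels") as resp:
        channels = json.load(resp)
    for node in broken_peers(channels):
        body = json.dumps({"target": node, "level": "assist"}).encode()
        req = urllib.request.Request(
            f"{BASE}/repair", data=body,
            headers={"Content-Type": "application/json"})
        urllib.request.urlopen(req).close()
```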
While repairs are in progress, the affected channels remain marked degraded in the fleet dashboard.