# Delivery confirmation protocol — ACK tracking, retry on timeout, failure alerting
Every message sent via the fleet is tracked for delivery confirmation. If an ACK is not received within the timeout, the message is retried automatically. After max retries, a failure alert is injected into the session.
Message lifecycle:

- On send, the message is persisted (via FleetNerveStore) with `ack_status=awaiting`.
- The receiver publishes an ACK on `fleet.<sender>.ack`; the sender marks the message `delivered`.
- On timeout, the message is retried via `send_with_fallback`.
- After max retries, the message is marked `failed` and an alert is injected into the session.

ACK subject: `fleet.<node>.ack`. Each node subscribes to its own ACK subject.

ACK payload:

```json
{"ref": "<msg_id>", "status": "delivered", "node": "<sender>", "channel": "P1_nats"}
```

To inspect a message's delivery record:

```shell
curl -sf http://127.0.0.1:8855/ack/abc123 | python3 -m json.tool
```

Returns:

```json
{"id": "abc123", "type": "task", "from_node": "mac1", "to_node": "mac2", "ack_status": "delivered", "retry_count": 0}
```

Status values: `none`, `awaiting`, `delivered`, `processed`, `failed`.
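As a sketch, the status endpoint above can be polled from Python. `get_ack_status` and `is_terminal` are illustrative helper names, not part of the fleet's API:

```python
import json
import urllib.request

def get_ack_status(msg_id: str, base_url: str = "http://127.0.0.1:8855") -> dict:
    """Fetch a message's delivery record from the local ACK endpoint (sketch)."""
    with urllib.request.urlopen(f"{base_url}/ack/{msg_id}", timeout=5) as resp:
        return json.load(resp)

def is_terminal(record: dict) -> bool:
    """True once the sender will no longer retry (delivered, processed, or failed)."""
    return record.get("ack_status") in {"delivered", "processed", "failed"}
```

A caller would typically poll `get_ack_status` until `is_terminal` returns True or its own deadline expires.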
Two retry systems work in parallel:

1. In-memory (the `_pending_acks` dict): the watchdog runs `_check_pending_acks()` every 30s and retries messages with `age > ack_timeout_s` and `retries < max_retries`.
2. DB-backed (`_ack_retry_loop`): scans the store for messages with `ack_status=awaiting`; after max retries it sets `ack_status=failed` and injects an alert into the session.

| Env Variable | Default | Description |
|---|---|---|
| `FLEET_ACK_TIMEOUT_S` | 60 | Seconds to wait for ACK before retry |
| `FLEET_ACK_MAX_RETRIES` | 3 | Maximum retry attempts |
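A minimal sketch of reading this configuration, assuming plain environment-variable lookup with the documented defaults (`load_ack_config` is an illustrative name):

```python
import os

def load_ack_config(env=os.environ) -> tuple[int, int]:
    """Read ACK tuning from the environment, falling back to the defaults above."""
    timeout_s = int(env.get("FLEET_ACK_TIMEOUT_S", "60"))
    max_retries = int(env.get("FLEET_ACK_MAX_RETRIES", "3"))
    return timeout_s, max_retries
```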
Happy path:

```
Sender                           Receiver
  |-- message (P1 NATS) -------->|
  |                              |-- process message
  |<---- ACK (fleet.<sender>.ack)|
  |-- update status: delivered   |
```
On timeout (no ACK received):

```
Sender
  |-- 60s elapsed, no ACK
  |-- retry via send_with_fallback (tries next channel)
  |-- retry 2... retry 3...
  |-- max retries: mark failed, inject alert
```
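The timeout handling above reduces to a small decision function. This is a sketch under the documented defaults, not the fleet's code; `retry_decision` is an illustrative name:

```python
def retry_decision(sent_at: float, retries: int, now: float,
                   ack_timeout_s: int = 60, max_retries: int = 3) -> str:
    """Decide what the watchdog should do with an awaiting message.

    Returns "wait" (timeout not yet reached), "retry" (resend via the
    fallback chain), or "fail" (mark failed and inject the alert).
    """
    if now - sent_at <= ack_timeout_s:
        return "wait"
    if retries < max_retries:
        return "retry"
    return "fail"
```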
Messages are persisted in SQLite via FleetNerveStore:

- `ack_status` column: `none|awaiting|delivered|processed|failed`
- `retry_count` column: incremented on each retry
- `last_retry_at` column: timestamp of the last retry attempt

When a message fails after max retries, the ACK retry loop injects a context message:
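A minimal sketch of the persistence described above, using an in-memory SQLite table; the real FleetNerveStore schema likely carries more columns:

```python
import sqlite3

# Illustrative subset of the message table, matching the columns listed above.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE messages (
        id            TEXT PRIMARY KEY,
        ack_status    TEXT NOT NULL DEFAULT 'none',
        retry_count   INTEGER NOT NULL DEFAULT 0,
        last_retry_at REAL
    )
""")

def record_retry(conn: sqlite3.Connection, msg_id: str, now: float) -> None:
    """Bump retry_count and stamp last_retry_at for one awaiting message."""
    conn.execute(
        "UPDATE messages SET retry_count = retry_count + 1, last_retry_at = ? "
        "WHERE id = ? AND ack_status = 'awaiting'",
        (now, msg_id),
    )
```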
```
Fleet: message <id> to <target> failed after 3 retries.
Check target connectivity. Last channel attempted: <channel>
```
This surfaces delivery failures rather than letting them persist silently (Zero Silent Failures invariant).
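The alert text can be rendered by a small formatter, sketched here with an illustrative function name:

```python
def format_failure_alert(msg_id: str, target: str, retries: int, channel: str) -> str:
    """Render the context message injected after max retries (sketch)."""
    return (
        f"Fleet: message {msg_id} to {target} failed after {retries} retries.\n"
        f"Check target connectivity. Last channel attempted: {channel}"
    )
```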