Use when recovering from a network partition. This runbook covers diagnosing and recovering from partition events across distributed systems: partition detection, impact assessment, split-brain resolution, data reconciliation, connectivity restoration, and post-recovery validation, with the goal of restoring full cluster consistency.
Recover from {{ partition_type }} affecting {{ affected_systems }}.
PARTITION ASSESSMENT
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
[ ] Partition detected — timestamp: ___
[ ] Affected systems: {{ affected_systems }}
[ ] Partition type: {{ partition_type }}
[ ] Scope:
- Nodes/services on side A: ___
- Nodes/services on side B: ___
- Fully isolated nodes: ___
[ ] Impact assessment:
- Services degraded: ___
- Services fully unavailable: ___
- Users affected (estimated): ___
SPLIT-BRAIN CHECK
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
[ ] Determine if split-brain has occurred:
- Multiple leaders elected: [ ] YES [ ] NO
- Divergent writes detected: [ ] YES [ ] NO
- Quorum status:
Side A: ___ nodes (quorum: [ ] YES [ ] NO)
Side B: ___ nodes (quorum: [ ] YES [ ] NO)
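The quorum check for each side can be sketched as a strict-majority test. This is a minimal illustration that assumes a simple majority quorum over the full voting membership; systems with weighted votes, witnesses, or arbiters use different rules, and the function name `has_quorum` is hypothetical.

```python
def has_quorum(reachable_voters: int, total_voters: int) -> bool:
    """Return True if this side holds a strict majority of ALL voters.

    Assumption: simple majority quorum. total_voters is the configured
    cluster size before the partition, not the number currently visible.
    """
    if total_voters <= 0 or reachable_voters < 0:
        raise ValueError("voter counts must be positive")
    if reachable_voters > total_voters:
        raise ValueError("reachable voters cannot exceed total voters")
    return reachable_voters > total_voters // 2


# Example: a 5-node cluster split 3 / 2
side_a = has_quorum(3, 5)  # True: 3 of 5 is a majority
side_b = has_quorum(2, 5)  # False: 2 of 5 is not
```

Note that an even split (for example 2 / 2 in a 4-node cluster) leaves neither side with quorum under this rule.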
DECISION MATRIX
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Scenario                           | Action
No split-brain                     | Restore connectivity, verify
Split-brain, one side has quorum   | Fence minority side, restore
Split-brain, no quorum either side | Manual intervention, pick canonical side
Divergent writes                   | Data reconciliation required
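The matrix above can be expressed as a small lookup function, which may help when wiring the checklist answers into tooling. This is a sketch only: the function name and parameters are hypothetical, and the case where both sides report quorum is not covered by the matrix, so the sketch raises an error rather than guessing an action.

```python
def recovery_actions(split_brain: bool, side_a_quorum: bool,
                     side_b_quorum: bool, divergent_writes: bool) -> list:
    """Map SPLIT-BRAIN CHECK answers to the actions in the decision matrix."""
    actions = []
    if not split_brain:
        actions.append("Restore connectivity, verify")
    elif side_a_quorum and side_b_quorum:
        # Not a row in the matrix; under majority quorum this should be
        # impossible, so treat it as a sign the counts need rechecking.
        raise ValueError("both sides report quorum; recheck membership counts")
    elif side_a_quorum or side_b_quorum:
        minority = "B" if side_a_quorum else "A"
        actions.append(f"Fence minority side ({minority}), restore")
    else:
        actions.append("Manual intervention, pick canonical side")
    # Divergent writes add a reconciliation step on top of any other action.
    if divergent_writes:
        actions.append("Data reconciliation required")
    return actions


print(recovery_actions(True, True, False, True))
```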
NETWORK RECOVERY
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
[ ] Root cause identified:
[ ] Security group / firewall change
[ ] Route table misconfiguration
[ ] VPN/peering connection failure
[ ] Physical network issue
[ ] Service mesh / overlay network failure
[ ] Fix applied — timestamp: ___
[ ] Connectivity verified (ping, traceroute, TCP checks):
- Side A -> Side B: [ ] OK
- Side B -> Side A: [ ] OK
- Latency restored to baseline: [ ] YES
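For the TCP checks above, a minimal probe using only the Python standard library might look like the following. The host and port in the usage comment are placeholders, not values from this runbook. A probe only proves reachability in one direction, so it would need to be run from a node on each side to tick both boxes.

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        # create_connection resolves the host and completes the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False


# Hypothetical usage, run from a Side A node against a Side B service:
# tcp_reachable("node-b1.example.internal", 5432)
```

A successful handshake confirms the path and port are open but says nothing about latency, so the baseline latency check still needs a separate measurement.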
DATA RECONCILIATION
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
[ ] Identify divergent data (if split-brain occurred):
- Conflicting records: ___
- Conflict resolution strategy:
[ ] Timestamp-based (last write wins)
[ ] Application-specific merge
[ ] Manual review required
[ ] Reconciliation executed — timestamp: ___
[ ] Data consistency verified across all nodes
[ ] Replication caught up and healthy
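The timestamp-based (last write wins) strategy listed above can be sketched as follows. This is an illustration under stated assumptions, not a reconciliation tool: the `Version` record, the in-memory dictionaries, and the tie-break in favor of side A are all hypothetical. Last write wins relies on the writers' clocks being reasonably synchronized; with clock skew it can silently discard the newer write, which is one reason to review the reported conflict list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    value: str
    written_at: float  # epoch seconds from the writer's clock (assumed synced)


def last_write_wins(side_a: dict, side_b: dict):
    """Merge two key -> Version maps; return (merged, conflicting_keys)."""
    merged = {}
    conflicts = []
    for key in side_a.keys() | side_b.keys():
        a, b = side_a.get(key), side_b.get(key)
        if a is None or b is None:
            # Key written on one side only: nothing to resolve.
            merged[key] = a if a is not None else b
        elif a.value == b.value:
            merged[key] = a
        else:
            # Divergent write: keep the newer one, record the conflict.
            conflicts.append(key)
            merged[key] = a if a.written_at >= b.written_at else b  # tie -> A
    return merged, sorted(conflicts)


a = {"user:1": Version("alice@old", 100.0), "user:2": Version("bob", 90.0)}
b = {"user:1": Version("alice@new", 120.0), "user:3": Version("carol", 95.0)}
merged, conflicts = last_write_wins(a, b)
print(conflicts, merged["user:1"].value)
```

The length of the returned conflict list is the figure to record under "Conflicting records".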
VALIDATION
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
[ ] All services healthy and serving traffic
[ ] Cluster membership correct (all nodes visible)
[ ] No remaining network errors in logs
[ ] Metrics returned to baseline:
- Error rate: ___% (baseline: ___%)
- Latency: ___ms (baseline: ___ms)
[ ] Monitoring alerts cleared
[ ] Incident timeline documented
[ ] Preventive measures identified:
- ___
- ___
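The "metrics returned to baseline" check above can be made explicit with a tolerance. The sketch below assumes lower is better for both error rate and latency, and the 10% default tolerance is an arbitrary illustrative value to be replaced with whatever the team agrees on.

```python
def within_baseline(current: float, baseline: float,
                    tolerance: float = 0.10) -> bool:
    """Return True if current is no more than `tolerance` above baseline.

    Assumption: the metric is lower-is-better (error rate, latency).
    """
    if baseline < 0 or tolerance < 0:
        raise ValueError("baseline and tolerance must be non-negative")
    return current <= baseline * (1 + tolerance)


# Example: latency of 105 ms against a 100 ms baseline passes at 10%;
# 120 ms does not.
latency_ok = within_baseline(105.0, 100.0)
latency_bad = within_baseline(120.0, 100.0)
```

With a baseline of zero, any non-zero current value fails the check, which is the conservative behavior for an error rate.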
| Shortcut | Counter | Why |
|---|---|---|
| "We can skip some steps for this case" | Adapt the workflow steps, don't skip them | Skipped steps are where incidents and oversights originate |
| "The user seems to already know what to do" | Complete all workflow phases with the user | The workflow catches blind spots that experience alone misses |
| "This is a minor case, full process is overkill" | Scale the process down, don't turn it off | Minor cases become major when handled without structure; the process scales down, it does not disappear |
| "I'll fill in the details later" | Complete each section before moving on | Deferred details are forgotten; real-time capture is more accurate |
| "The template output isn't necessary" | Always produce the structured output format | Structured output enables comparison, audit trails, and handoff to other teams |
Produce a partition recovery report with: