Write disaster recovery plans with RPO/RTO targets, failover procedures, communication protocols, and testing schedules that keep the business running when systems fail.
Gather the following from the user before writing:
If the user says "write a DR plan for our app," push back: "Which failure scenario? A database corruption recovery is a different plan from a full region failover. Each scenario gets its own procedure with its own RPO/RTO targets."
State what this plan covers and what it does not. Define the specific systems, environments, and failure scenarios in scope. List any systems explicitly excluded and reference their separate DR plans if they exist.
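Define recovery objectives for each system: the recovery point objective (RPO, the maximum tolerable data loss measured in time) and the recovery time objective (RTO, the maximum tolerable downtime):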
| System | RPO | RTO | Tier | Justification |
|---|---|---|---|---|
| Payment processing | 0 (zero data loss) | 15 minutes | Tier 1 | Revenue-critical, regulatory requirement |
| User database | 5 minutes | 30 minutes | Tier 1 | All services depend on auth |
| Analytics pipeline | 24 hours | 4 hours | Tier 2 | No revenue impact, can reprocess |
| Internal wiki | 24 hours | 48 hours | Tier 3 | Low urgency, daily backups sufficient |
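The targets in the table above are easier to enforce when they are also kept in a machine-checkable form. The following is a minimal sketch, not a prescribed implementation: the names `RecoveryObjective`, `OBJECTIVES`, and `check_drill` are hypothetical, and the values are copied from the example rows.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryObjective:
    system: str
    rpo: timedelta  # maximum tolerable data loss, measured in time
    rto: timedelta  # maximum tolerable downtime
    tier: int


# Values copied from the example table; replace with the real targets.
OBJECTIVES = [
    RecoveryObjective("Payment processing", timedelta(0), timedelta(minutes=15), 1),
    RecoveryObjective("User database", timedelta(minutes=5), timedelta(minutes=30), 1),
    RecoveryObjective("Analytics pipeline", timedelta(hours=24), timedelta(hours=4), 2),
    RecoveryObjective("Internal wiki", timedelta(hours=24), timedelta(hours=48), 3),
]


def check_drill(objective, data_loss, downtime):
    """Return a list of violations; an empty list means both targets were met."""
    violations = []
    if data_loss > objective.rpo:
        violations.append(
            f"{objective.system}: data loss {data_loss} exceeds RPO {objective.rpo}"
        )
    if downtime > objective.rto:
        violations.append(
            f"{objective.system}: downtime {downtime} exceeds RTO {objective.rto}"
        )
    return violations
```

Calling `check_drill(OBJECTIVES[1], timedelta(minutes=3), timedelta(minutes=45))` would report a single RTO violation for the user database, since 45 minutes of downtime exceeds the 30-minute target while 3 minutes of data loss is within the 5-minute RPO.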
Tier definitions:
For each system, document:
User database:

- Method: Continuous WAL replication to standby + daily full snapshot
- Frequency: Real-time replication; snapshots at 02:00 UTC daily
- Retention: 30 daily snapshots, 12 weekly snapshots
- Storage: AWS S3 us-west-2 (primary in us-east-1), cross-region
- Encryption: AES-256 at rest, TLS 1.3 in transit
- Verification: Weekly automated restore test to staging; quarterly manual validation
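A retention rule such as "30 daily snapshots, 12 weekly snapshots" is ambiguous until someone decides which snapshot counts as the weekly one. The sketch below shows one possible interpretation, under the assumption that the weekly snapshot is the newest snapshot in each ISO week; the function name `snapshots_to_keep` is hypothetical and the logic is not tied to any particular backup tool.

```python
from datetime import date, timedelta


def snapshots_to_keep(snapshot_dates, daily=30, weekly=12):
    """Return the set of snapshot dates to retain.

    Keeps the `daily` most recent snapshots, plus the newest snapshot
    from each of the `weekly` most recent ISO weeks that have one.
    """
    ordered = sorted(set(snapshot_dates), reverse=True)  # newest first
    keep = set(ordered[:daily])

    weeks_seen = []
    for snapshot in ordered:
        week = tuple(snapshot.isocalendar()[:2])  # (ISO year, ISO week number)
        if week in weeks_seen:
            continue  # a newer snapshot already represents this week
        if len(weeks_seen) == weekly:
            break  # enough weekly snapshots collected
        weeks_seen.append(week)
        keep.add(snapshot)
    return keep


# Example input: 120 consecutive daily snapshots ending on 2024-06-30.
EXAMPLE_SNAPSHOTS = [date(2024, 6, 30) - timedelta(days=n) for n in range(120)]
```

Everything not returned by `snapshots_to_keep(EXAMPLE_SNAPSHOTS)` would be eligible for deletion. Whatever rule is chosen, state it explicitly in the plan so that the retention line can be audited.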
Write step-by-step procedures for each disaster scenario. Each procedure must include:
Use the same step format as a runbook — copy-pasteable commands, expected output, and if/then branches at every decision point. Reference runbooks for detailed per-service procedures.
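The step format described above (command, expected output, and a branch at each decision point) can be modeled as a small graph, which makes it possible to check that every branch leads somewhere. This is an illustrative sketch only: `Step`, `walk`, `PROCEDURE`, and the script names are hypothetical placeholders, not real tooling.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Step:
    command: str                # copy-pasteable command (placeholder names here)
    expected: str               # text the output should contain
    if_match: Optional[str]     # next step id when the output matches; None ends the procedure
    if_mismatch: Optional[str]  # next step id when it does not; None ends the procedure


# Hypothetical failover procedure; the scripts are stand-ins for real commands.
PROCEDURE = {
    "check_primary": Step("./check_primary.sh", "unreachable", "promote_standby", None),
    "promote_standby": Step("./promote_standby.sh", "promoted", "verify_writes", "escalate"),
    "verify_writes": Step("./verify_writes.sh", "ok", None, "escalate"),
    "escalate": Step("./page_oncall.sh", "acknowledged", None, None),
}


def walk(steps, start, observed):
    """Return the step ids visited, given the observed output for each step id."""
    path = []
    current = start
    while current is not None:
        if current in path:
            raise ValueError(f"procedure loops back to {current}")
        path.append(current)
        step = steps[current]
        matched = step.expected in observed.get(current, "")
        current = step.if_match if matched else step.if_mismatch
    return path
```

For example, `walk(PROCEDURE, "check_primary", {"check_primary": "primary unreachable", "promote_standby": "timeout"})` follows the mismatch branch after the failed promotion and ends at `escalate`. Tracing each branch this way during review helps catch decision points that have no documented next step.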
Define who is notified, when, and how:
| Audience | Channel | Timing | Message owner |
|---|---|---|---|
| Incident commander | PagerDuty | Immediate (automated) | Monitoring system |
| Engineering leadership | Slack #incidents | Within 5 minutes | Incident commander |
| Customer support | Email + Slack | Within 15 minutes | Comms lead |
| Affected customers | Status page + email | Within 30 minutes | Comms lead |
| Executive team | Email summary | Within 1 hour | Program owner |
Include message templates for customer-facing communications at each stage: initial acknowledgment, progress update, and resolution confirmation.
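The timings in the notification table can be turned into concrete deadlines once an incident is declared. The sketch below assumes each "Within N" timing is measured from the declaration time; `NOTIFICATIONS` and `notification_schedule` are hypothetical names, and the rows are copied from the example table.

```python
from datetime import datetime, timedelta, timezone

# (audience, channel, deadline after declaration, message owner), from the example table.
NOTIFICATIONS = [
    ("Incident commander", "PagerDuty", timedelta(0), "Monitoring system"),
    ("Engineering leadership", "Slack #incidents", timedelta(minutes=5), "Incident commander"),
    ("Customer support", "Email + Slack", timedelta(minutes=15), "Comms lead"),
    ("Affected customers", "Status page + email", timedelta(minutes=30), "Comms lead"),
    ("Executive team", "Email summary", timedelta(hours=1), "Program owner"),
]


def notification_schedule(declared_at):
    """Return (deadline, audience, channel, owner) tuples, earliest deadline first."""
    schedule = [
        (declared_at + offset, audience, channel, owner)
        for audience, channel, offset, owner in NOTIFICATIONS
    ]
    return sorted(schedule, key=lambda row: row[0])


if __name__ == "__main__":
    declared = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    for deadline, audience, channel, owner in notification_schedule(declared):
        print(f"{deadline:%H:%M} UTC  {audience:<24} via {channel:<20} owner: {owner}")
```

Pinning the schedule to the incident channel gives each message owner an explicit deadline instead of a relative one.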
A plan that has never been tested is a hypothesis, not a plan. Define:
Each test must produce a written report documenting: what was tested, pass/fail per step, time to complete each phase, and issues discovered with remediation owners.
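The report fields listed above map naturally onto a small record type, which keeps reports comparable from one test to the next. This is a sketch under assumptions: `StepResult` and `DrillReport` are hypothetical names, and treating "total time within RTO" as part of the pass criterion is a design choice rather than a requirement stated here.

```python
from dataclasses import dataclass, field
from datetime import timedelta
from typing import List


@dataclass
class StepResult:
    phase: str                   # e.g. "detection", "failover", "validation"
    description: str             # what was tested in this step
    passed: bool
    duration: timedelta
    issue: str = ""              # empty when nothing was discovered
    remediation_owner: str = ""  # who fixes the issue, if any


@dataclass
class DrillReport:
    scenario: str
    rto: timedelta
    steps: List[StepResult] = field(default_factory=list)

    def phase_durations(self):
        """Total time spent in each phase, keyed by phase name."""
        totals = {}
        for step in self.steps:
            totals[step.phase] = totals.get(step.phase, timedelta(0)) + step.duration
        return totals

    def total_duration(self):
        return sum((step.duration for step in self.steps), timedelta(0))

    def open_issues(self):
        """Issues discovered during the test, each paired with its remediation owner."""
        return [(s.description, s.issue, s.remediation_owner) for s in self.steps if s.issue]

    def passed(self):
        """True only if every step passed and the total time is within the RTO."""
        return all(s.passed for s in self.steps) and self.total_duration() <= self.rto
```

A report built this way fails automatically when a test finishes every step but takes longer than the RTO, which is the case that per-step pass/fail checks alone tend to miss.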
Before delivering the plan, verify: