Design and implement disaster recovery strategies with RTO/RPO planning, database backups, Kubernetes DR, cross-region replication, and chaos engineering testing. Use when implementing backup systems, configuring point-in-time recovery, setting up multi-region failover, or validating DR procedures.
Provide comprehensive guidance for designing disaster recovery (DR) strategies, implementing backup systems, and validating recovery procedures across databases, Kubernetes clusters, and cloud infrastructure. Enable teams to define RTO/RPO objectives, select appropriate backup tools, configure automated failover, and test DR capabilities through chaos engineering.
Invoke this skill when:
Recovery Time Objective (RTO): Maximum acceptable downtime after a disaster before business impact becomes unacceptable.
Recovery Point Objective (RPO): Maximum acceptable data loss, measured as the time between the last recoverable point and the disaster. Determines how often backups or replication must capture changes.
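As a rough illustration of how RPO constrains backup frequency, the sketch below estimates worst-case data loss for a periodic backup schedule. The function names and the `shipping_lag` parameter (time for a finished backup to reach durable storage) are assumptions made for this example, not part of any tool.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta,
                         shipping_lag: timedelta = timedelta(0)) -> timedelta:
    """Worst case: the failure hits just before the next backup becomes durable."""
    return backup_interval + shipping_lag


def meets_rpo(backup_interval: timedelta,
              rpo: timedelta,
              shipping_lag: timedelta = timedelta(0)) -> bool:
    """True if the schedule can never lose more data than the RPO allows."""
    return worst_case_data_loss(backup_interval, shipping_lag) <= rpo
```

For example, hourly backups cannot satisfy a 15-minute RPO, while WAL segments shipped every minute with a 30-second upload lag fit inside a 5-minute RPO.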
Criticality Tiers:
Maintain 3 copies of data on 2 different media types with 1 copy offsite.
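The rule can be checked mechanically against a backup inventory. This is a minimal sketch; the `BackupCopy` record and its fields are hypothetical and would come from your own inventory data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    location: str   # e.g. "primary-dc" or "us-west-2" (illustrative values)
    media: str      # e.g. "disk", "object-storage", "tape"
    offsite: bool   # stored away from the primary site


def satisfies_3_2_1(copies: list[BackupCopy]) -> bool:
    """At least 3 copies, on at least 2 media types, with at least 1 offsite."""
    return (
        len(copies) >= 3
        and len({c.media for c in copies}) >= 2
        and any(c.offsite for c in copies)
    )
```

The production dataset counts as one of the three copies, so an inventory of production data, a local disk backup, and an offsite object-storage copy passes.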
Example implementation:
Full Backup: Complete copy of all data. Slowest to create, fastest to restore.
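Incremental Backup: Only changes since the last backup of any type. Fastest to create; restore requires the full backup plus every incremental taken since.
Differential Backup: Changes since the last full backup. Restore needs only the full backup plus the latest differential, trading extra storage for faster restores.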
Continuous Backup: Real-time or near-real-time backup via WAL/binlog archiving. Lowest RPO.
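To make the restore dependencies between these backup types concrete, here is a sketch that computes which backups must be applied, in order, to reach a target time. It assumes an incremental captures changes since the most recent backup of any type; the `Backup` record is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backup:
    taken_at: int  # ordering key, e.g. a Unix timestamp
    kind: str      # "full", "differential", or "incremental"


def restore_chain(backups: list[Backup], target_time: int) -> list[Backup]:
    """Return the backups to apply, oldest first, to restore as of target_time."""
    usable = sorted((b for b in backups if b.taken_at <= target_time),
                    key=lambda b: b.taken_at)
    full_idx = max((i for i, b in enumerate(usable) if b.kind == "full"),
                   default=None)
    if full_idx is None:
        raise ValueError("no full backup at or before target_time")

    chain = [usable[full_idx]]
    later = usable[full_idx + 1:]

    # The latest differential replaces everything between it and the full.
    diff_idx = max((i for i, b in enumerate(later) if b.kind == "differential"),
                   default=None)
    if diff_idx is not None:
        chain.append(later[diff_idx])
        later = later[diff_idx + 1:]

    # Remaining incrementals must all be applied, in order.
    chain.extend(b for b in later if b.kind == "incremental")
    return chain
```

A long chain of incrementals lengthens restore time, which is why schedules often insert a periodic differential or full backup.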
RTO < 1 hour, RPO < 5 min
→ Active-Active replication, continuous archiving, automated failover
→ Tools: Aurora Global DB, GCS Multi-Region, pgBackRest PITR
→ Cost: Highest
RTO 1-4 hours, RPO 15-60 min
→ Warm standby, incremental backups, automated failover
→ Tools: pgBackRest, WAL-G, RDS Multi-AZ
→ Cost: High
RTO 4-24 hours, RPO 1-6 hours
→ Daily full + incremental, cross-region backup
→ Tools: pgBackRest, Velero, Restic
→ Cost: Medium
RTO > 24 hours, RPO > 6 hours
→ Weekly full + daily incremental, single region
→ Tools: pg_dump, mysqldump, S3 versioning
→ Cost: Low
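The decision tree above can be sketched as a lookup that picks the strictest tier demanded by either objective. The tiers leave gaps (for example, an RPO between 5 and 15 minutes); this sketch assumes such values fall into the next tier down, which is a choice made for the example rather than something the tree specifies.

```python
from datetime import timedelta

# Strategies copied from the decision tree, strictest first.
STRATEGIES = [
    "Active-Active replication, continuous archiving, automated failover",
    "Warm standby, incremental backups, automated failover",
    "Daily full + incremental, cross-region backup",
    "Weekly full + daily incremental, single region",
]


def _rto_tier(rto: timedelta) -> int:
    if rto < timedelta(hours=1):
        return 0
    if rto <= timedelta(hours=4):
        return 1
    if rto <= timedelta(hours=24):
        return 2
    return 3


def _rpo_tier(rpo: timedelta) -> int:
    if rpo < timedelta(minutes=5):
        return 0
    if rpo <= timedelta(minutes=60):
        return 1
    if rpo <= timedelta(hours=6):
        return 2
    return 3


def select_strategy(rto: timedelta, rpo: timedelta) -> str:
    """Pick the strategy for the stricter of the two objectives."""
    return STRATEGIES[min(_rto_tier(rto), _rpo_tier(rpo))]
```

Taking the stricter tier means a system with a relaxed RTO but a one-minute RPO still lands in the continuous-archiving tier.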
| Use Case | Primary Tool | Alternative | Key Feature |
|---|---|---|---|
| PostgreSQL production | pgBackRest | WAL-G | PITR, compression, multi-repo |
| MySQL production | Percona XtraBackup | WAL-G | Hot backups, incremental |
| MongoDB | Atlas Backup | mongodump | Continuous backup, PITR |
| Kubernetes cluster | Velero | ArgoCD + Git | PV snapshots, scheduling |
| File/object backup | Restic | Duplicity | Encryption, deduplication |
| Cross-region replication | Aurora Global DB | RDS Read Replica | Low-lag replication, managed failover (single-writer primary) |
Use Case: Production PostgreSQL with < 5 minute RPO
Quick Start: See examples/postgresql/pgbackrest-config/
Configure continuous WAL archiving with full/differential/incremental backups to S3/GCS/Azure. Schedule a weekly full backup and daily differential backups. Restore in place with pgbackrest --stanza=main --delta restore; for PITR, add --type=time and --target=<timestamp>.
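A point-in-time restore needs a recovery target as well as the stanza. The sketch below only assembles the command line, using pgBackRest's `--type=time` and `--target` restore options; the helper name is hypothetical, and running the command is left to the operator.

```python
from datetime import datetime, timezone


def pitr_restore_command(stanza: str, target: datetime) -> list[str]:
    """Build a pgBackRest argv for a delta restore to a point in time."""
    if target.tzinfo is None:
        raise ValueError("target must be timezone-aware to avoid ambiguity")
    # Normalize to UTC and format the way PostgreSQL parses recovery targets.
    stamp = target.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S+00")
    return [
        "pgbackrest",
        f"--stanza={stanza}",
        "--delta",
        "--type=time",
        f"--target={stamp}",
        "restore",
    ]
```

Passing the result to `subprocess.run(cmd, check=True)` on the database host, with PostgreSQL stopped, would perform the restore. Requiring a timezone-aware target guards against restoring to the wrong moment when operator and server clocks are in different zones.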
Detailed Guide: references/database-backups.md#postgresql
Use Case: MySQL production requiring hot backups
Quick Start: See examples/mysql/xtrabackup/
Perform full (xtrabackup --backup --parallel=4) and incremental backups with binary log archiving for PITR. Restore requires decompress, prepare, apply incrementals, and copy-back steps.
Detailed Guide: references/database-backups.md#mysql
Quick Start: Use mongodump --gzip --numParallelCollections=4 for logical backups or MongoDB Atlas for continuous backup with PITR.
Detailed Guide: references/database-backups.md#mongodb
Quick Start: velero install --provider aws --bucket my-backups (a working install typically also needs the provider plugin via --plugins and credentials via --secret-file)
Configure scheduled backups (daily full, hourly production namespace) with PV snapshots. Restore with velero restore create --from-backup <name>. Velero also supports selective restore (namespace mappings, storage class remapping).
Examples: examples/kubernetes/velero/
Detailed Guide: references/kubernetes-dr.md
Quick Start: ETCDCTL_API=3 etcdctl snapshot save /backups/etcd/snapshot.db
Create periodic etcd snapshots for control plane recovery. Restore requires cluster recreation with snapshot data.
Examples: examples/kubernetes/etcd/
Key Services:
Examples: examples/cloud/aws/
Detailed Guide: references/cloud-dr-patterns.md#aws
Key Services:
Detailed Guide: references/cloud-dr-patterns.md#gcp
Key Services:
Detailed Guide: references/cloud-dr-patterns.md#azure
| Pattern | RTO | RPO | Cost | Use Case |
|---|---|---|---|---|
| Active-Active | < 1 min | < 1 min | High | Both regions serve traffic |
| Active-Passive | 15-60 min | 5-15 min | Medium | Standby for failover |
| Pilot Light | 10-30 min | 5-15 min | Low | Minimal secondary infra |
| Warm Standby | 5-15 min | 5-15 min | Med-High | Scaled-down secondary |
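The table can be read as a constraint problem: pick the cheapest pattern whose worst-case RTO and RPO fit the targets. This sketch encodes the upper bounds from the table; the numeric cost ranks (1 = Low to 4 = High) are an assumption made for ordering.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Pattern:
    name: str
    max_rto_min: float  # upper bound of the RTO range in the table
    max_rpo_min: float  # upper bound of the RPO range in the table
    cost_rank: int      # assumed ordering: 1 = Low ... 4 = High


PATTERNS = [
    Pattern("Active-Active", 1, 1, 4),
    Pattern("Active-Passive", 60, 15, 2),
    Pattern("Pilot Light", 30, 15, 1),
    Pattern("Warm Standby", 15, 15, 3),
]


def cheapest_pattern(rto_min: float, rpo_min: float) -> Optional[str]:
    """Return the lowest-cost pattern meeting both targets, or None."""
    fits = [p for p in PATTERNS
            if p.max_rto_min <= rto_min and p.max_rpo_min <= rpo_min]
    return min(fits, key=lambda p: p.cost_rank).name if fits else None
```

With the table's figures, Pilot Light is both cheaper and faster than Active-Passive, so this selector never returns Active-Passive. Replace the bounds with RTO/RPO values measured in your own DR drills before relying on the result.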
Implementation Examples:
Detailed Guide: references/cross-region-replication.md
Purpose: Validate DR procedures through controlled failure injection.
Test Scenarios:
Tools: Chaos Mesh, Gremlin, Litmus, Toxiproxy
Examples: examples/chaos/db-failover-test.sh, examples/chaos/region-failure-test.sh
Detailed Guide: references/chaos-engineering.md
Run Monthly Tests:
./scripts/dr-drill.sh --environment staging --test-type full
./scripts/test-restore.sh --backup latest --target staging-db
| Regulation | Retention | Requirements |
|---|---|---|
| GDPR | No fixed period; set by policy (often 1-7 years) | Storage limitation, cross-border transfer rules, right to erasure |
| SOC 2 | 1 year+ | Secure deletion, access controls |
| HIPAA | 6 years | Encryption, PHI protection |
| PCI DSS | 3mo-1yr | Secure deletion, quarterly reviews |
Implement with S3/GCS lifecycle policies, for example on S3: 30d→Standard-IA, 90d→Glacier, 365d→Deep Archive (the GCS counterparts are Nearline, Coldline, and Archive).
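For S3, the tiering policy above maps onto a lifecycle configuration document. The sketch below builds that document as a plain dictionary; the rule ID and the `backups/` prefix are hypothetical, and nothing here calls AWS.

```python
# Age thresholds (days) and S3 storage classes, taken from the policy above.
TRANSITIONS = [(30, "STANDARD_IA"), (90, "GLACIER"), (365, "DEEP_ARCHIVE")]


def lifecycle_configuration(prefix: str = "backups/") -> dict:
    """Build an S3 lifecycle configuration for the tiering policy."""
    return {
        "Rules": [{
            "ID": "backup-tiering",        # hypothetical rule name
            "Status": "Enabled",
            "Filter": {"Prefix": prefix},  # hypothetical key prefix
            "Transitions": [
                {"Days": days, "StorageClass": storage_class}
                for days, storage_class in TRANSITIONS
            ],
        }]
    }


def storage_class_at(age_days: int) -> str:
    """Storage class an object is expected to be in at a given age."""
    current = "STANDARD"
    for days, storage_class in TRANSITIONS:
        if age_days >= days:
            current = storage_class
    return current
```

The dictionary has the shape accepted by boto3's `put_bucket_lifecycle_configuration` as its `LifecycleConfiguration` argument. `storage_class_at` is useful for estimating restore latency and retrieval cost for a backup of a given age.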
Immutable backups: Use S3 Object Lock or Azure Immutable Blob Storage for ransomware protection.
Detailed Guide: references/compliance-retention.md
Key Metrics: Backup success rate, duration, time since last backup, RPO breach, storage utilization
Prometheus Alerts: VeleroBackupFailed, VeleroBackupTooOld, BackupSizeTrend
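Two of the metrics above, backup success rate and RPO breach, reduce to simple calculations over job history. This is a minimal sketch; the function names and input shapes are assumptions, and in practice the inputs would come from your backup tool's job records.

```python
from datetime import datetime, timedelta, timezone


def backup_success_rate(results: list[bool]) -> float:
    """Fraction of backup jobs that succeeded (0.0 when there is no history)."""
    return sum(results) / len(results) if results else 0.0


def rpo_breached(last_success: datetime, rpo: timedelta, now: datetime) -> bool:
    """True when the newest good backup is older than the RPO allows."""
    return now - last_success > rpo
```

An alert such as VeleroBackupTooOld expresses the same comparison as `rpo_breached`, evaluated continuously against the time since the last successful backup.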
Validation Scripts:
./scripts/validate-backup.sh --backup latest --verify-integrity
./scripts/check-retention.sh --report-violations
./scripts/generate-dr-report.sh --format pdf
Automate Backup Schedules: Cron for pgBackRest (weekly full, daily differential), Velero schedules (K8s)
DR Runbook Steps: Detect failure → Verify secondary → Promote → Update DNS → Notify → Document
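The runbook order matters: a secondary must be verified before it is promoted, and DNS must not move until promotion succeeds. The sketch below runs the steps in sequence and stops at the first failure. The `actions` callables are hypothetical placeholders for real checks and API calls.

```python
from typing import Callable, Optional

RUNBOOK_STEPS = [
    "Detect failure",
    "Verify secondary",
    "Promote",
    "Update DNS",
    "Notify",
    "Document",
]


def run_failover(
    actions: dict[str, Callable[[], bool]],
) -> tuple[list[str], Optional[str]]:
    """Run steps in order; return (completed steps, first failed step or None)."""
    completed: list[str] = []
    for step in RUNBOOK_STEPS:
        if not actions[step]():
            return completed, step  # halt so later steps never run
        completed.append(step)
    return completed, None
```

Halting on the first failure keeps a half-finished failover from redirecting traffic to a secondary that was never verified or promoted.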
Detailed Guide: references/runbook-automation.md
Prerequisites:
- infrastructure-as-code: Provision backup infrastructure, DR regions
- kubernetes-operations: K8s cluster setup for Velero
- secret-management: Backup encryption keys, credentials

Parallel Skills:
- databases-postgresql: PostgreSQL configuration and operations
- databases-mysql: MySQL configuration and operations
- observability: Backup monitoring, alerting
- security-hardening: Secure backup storage, access control

Consumer Skills:
- incident-management: Invoke DR procedures during incidents
- compliance-frameworks: Meet regulatory requirements

infrastructure-as-code → secret-management → disaster-recovery → observability
         ↓                     ↓                    ↓                  ↓
 Create S3 buckets      Store encryption     Configure backups    Monitor jobs
 Provision databases    keys in Vault        Set up replication   Alert failures
 Set up VPCs            Manage credentials   Test DR drills       Track metrics
✓ Test restores regularly (monthly for critical systems)
✓ Automate backup monitoring and alerting
✓ Encrypt backups at rest and in transit
✓ Implement 3-2-1 backup rule
✓ Define and measure RTO/RPO
✓ Run chaos experiments to validate DR
✓ Document recovery procedures
✓ Store backups in different regions
✓ Use immutable backups for ransomware protection
✓ Automate DR testing in CI/CD
✗ Assume backups work without testing
✗ Store all backups in single region
✗ Skip retention policy definition
✗ Forget to encrypt sensitive data
✗ Rely solely on cloud provider backups
✗ Ignore backup monitoring
✗ Perform backups only from primary database under high load
✗ Store encryption keys with backups
- references/rto-rpo-planning.md
- references/database-backups.md
- references/kubernetes-dr.md
- references/cloud-dr-patterns.md
- references/cross-region-replication.md
- references/chaos-engineering.md
- references/compliance-retention.md
- references/runbook-automation.md

- examples/runbooks/database-failover.md, examples/runbooks/region-failover.md
- examples/postgresql/pgbackrest-config/, examples/postgresql/walg-config/
- examples/mysql/xtrabackup/, examples/mysql/walg/
- examples/kubernetes/velero/, examples/kubernetes/etcd/
- examples/cloud/aws/, examples/cloud/gcp/, examples/cloud/azure/
- examples/chaos/db-failover-test.sh, examples/chaos/region-failure-test.sh

- scripts/validate-backup.sh: Verify backup integrity
- scripts/test-restore.sh: Automated restore testing
- scripts/dr-drill.sh: Run full DR drill
- scripts/check-retention.sh: Verify retention policies
- scripts/generate-dr-report.sh: Compliance reporting