Tames system entropy by designing resilient architectures that thrive under unpredictable conditions
Chaos Curator systematically validates and improves system resilience through controlled chaos experiments in distributed infrastructure.
Chaos Curator provides the following commands:
chaos-curator experiment generate <template> --target <resource> --duration <time> --interval <check-frequency> --metrics-threshold <thresholds> --blast-radius <percentage>
chaos-curator experiment inject <experiment-id> --namespace <ns> --wait-for-completion --notify-slack --rollback-on-failure
chaos-curator experiment status <experiment-id> --watch --output json
chaos-curator experiment timeline <experiment-id> --format timeline
chaos-curator experiment abort <experiment-id> --grace-period <seconds> --force
chaos-curator validate infrastructure <cluster-name> --check-pod-disruption-budgets --check-hpa --check-network-policies
chaos-curator validate metrics <service-name> --query "rate(http_requests_total[5m])" --alert-rules --slo-baseline <value>
chaos-curator resilience score <namespace> --services <svc1,svc2> --include-chaos-metrics --report-format html
chaos-curator blast-radius calculate <experiment-yaml> --dependencies-from-istio --max-impact-percentage <value>
chaos-curator rollback create <experiment-id> --type helm --release-name <release> --namespace <ns> --timeout <seconds>
chaos-curator rollback execute <rollback-id> --verify-post-conditions --dry-run
chaos-curator dashboard open --experiment <id> --grafana-dashboard-id <id> --telemetry-channel <slack-channel>
chaos-curator hypothesis verify <hypothesis-id> --compare-baseline --statistical-significance <p-value>
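The `hypothesis verify --statistical-significance <p-value>` step compares experiment metrics against a baseline. A minimal sketch of the underlying idea, a two-proportion z-test on error counts done with awk; the counts below are made-up illustrative values, not chaos-curator output:

```shell
# Two-proportion z-test: did the error rate change significantly during chaos?
# x1/n1 = baseline errors/requests, x2/n2 = during-experiment errors/requests.
# |z| > 1.96 corresponds to p < 0.05 (two-sided).
verdict=$(awk -v x1=5 -v n1=10000 -v x2=15 -v n2=10000 'BEGIN {
  p1 = x1 / n1; p2 = x2 / n2
  p  = (x1 + x2) / (n1 + n2)                 # pooled proportion
  se = sqrt(p * (1 - p) * (1/n1 + 1/n2))     # pooled standard error
  z  = (p2 - p1) / se
  printf "%s", ((z > 1.96 || z < -1.96) ? "significant" : "not-significant")
}')
echo "$verdict"
```

With these example counts (0.05% baseline vs 0.15% during the experiment, z ≈ 2.24) the change crosses the p < 0.05 threshold, so the hypothesis "error rate is unchanged under chaos" would be rejected.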
Supported experiment types (real chaos actions):
pod-kill: kubectl delete pod --force --grace-period=0 with selector matching
network-delay: tc qdisc add with netem causing delay 100ms 20ms distribution normal
network-loss: tc qdisc add with loss 10%
network-corruption: tc qdisc add with corrupt 1%
network-bandwidth: tc qdisc add with tbf rate 1mbit burst 32kbit latency 400ms
io-stress: stress-ng --io 2 --io-method randwrite --timeout 30s
cpu-stress: stress-ng --cpu 4 --cpu-method matrix --timeout 30s
memory-stress: stress-ng --vm 2 --vm-bytes 1G --timeout 30s
dns-chaos: dnsmasq returning NXDOMAIN or delayed responses for specific domains
time-skew: date -s "+100 seconds" inside container with SYS_TIME capability
http-fault: Envoy/Istio fault injection with abort or delay percentages
gcp-zone-failure: gcloud compute instances delete with --zone targeting
aws-az-failure: aws ec2 describe-instances + terminate-instances for specific AZ
azure-fault: az vm deallocate for VMSS instances in specific fault domain
etcd-follower-down: systemctl stop etcd on follower nodes with leader protection
redis-master-failover: redis-cli -h <master-ip> debug segfault with sentinel promotion
postgres-replica-promotion: pg_ctl promote -D /var/lib/postgresql/data
kafka-broker-down: stop Kafka broker process with ISR management verification
disk-fill: dd if=/dev/zero of=/tmp/fill bs=1M count=5000 until 95% disk usage
Real workflow for testing payment service resilience (E2E example):
# STEP 1: Pre-experiment safety checks
chaos-curator validate infrastructure payment-cluster \
--check-pod-disruption-budgets \
--check-hpa \
--check-network-policies
# Expected output:
# ✓ PDB for payment-service: minAvailable 70% (3/10 pods)
# ✓ HPA payment-service: target CPU 70%, current 45%, minReplicas 5, maxReplicas 20
# ✓ NetworkPolicies restrict ingress to trusted sources only
# ✓ All critical services have anti-affinity rules
# ✓ Database connection pool size: 50 (with 30% headroom)
# ⚠️ Alert for payment-service-error-rate > 1% configured (currently 0.05%)
# ✓ Backup completed 2 hours ago (verified restore test weekly)
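The PDB line in the output above ("minAvailable 70% (3/10 pods)") is simple arithmetic: with 10 replicas and minAvailable 70%, at most 3 pods may be disrupted at once. A sketch of that check, using the values from the example output:

```shell
# How many pods can an experiment disrupt without violating the PDB?
replicas=10
min_available_pct=70
# ceil(replicas * pct / 100) pods must stay available; the rest are disruptable.
min_available=$(( (replicas * min_available_pct + 99) / 100 ))
disruptable=$(( replicas - min_available ))
echo "PDB allows disrupting up to $disruptable of $replicas pods"
```

This is the number a pod-kill experiment's blast radius must stay under for the eviction-style safety check to pass.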
# STEP 2: Calculate blast radius before injecting
chaos-curator blast-radius calculate <<EOF
apiVersion: chaos-mesh.org/v1alpha1