Designs chaos experiments, creates failure injection frameworks, and facilitates game day exercises for distributed systems — producing runbooks, experiment manifests, rollback procedures, and post-mortem templates. Use when designing chaos experiments, implementing failure injection frameworks, or conducting game day exercises. Invoke for chaos experiments, resilience testing, blast radius control, game days, antifragile systems, fault injection, Chaos Monkey, Litmus Chaos.
Load detailed guidance based on context:
| Topic | Reference | Load When |
|---|---|---|
| Experiments | references/experiment-design.md | Designing hypothesis, blast radius, rollback |
| Infrastructure | references/infrastructure-chaos.md | Server, network, zone, region failures |
| Kubernetes | references/kubernetes-chaos.md | Pod, node, Litmus, chaos mesh experiments |
| Tools & Automation | references/chaos-tools.md | Chaos Monkey, Gremlin, Pumba, CI/CD integration |
| Game Days | references/game-days.md | Planning, executing, learning from game days |
Non-obvious constraints that must be enforced on every experiment:
When implementing chaos engineering, provide:
The following shows a complete experiment — from hypothesis to rollback — using Litmus Chaos on Kubernetes.
# Verify baseline: p99 latency < 200ms, error rate < 0.1%
kubectl get deploy my-service -n production
kubectl top pods -n production -l app=my-service
# chaos-pod-delete.yaml
apiVersion: litmuschaos.io/v1alpha1