Name: Operational Excellence
Author: imsanghaar

Operational Excellence

Use when implementing Kubernetes cost visibility (OpenCost, VPA), backup/disaster recovery (Velero, RTO/RPO), or chaos engineering (Chaos Mesh). Triggers on cost optimization, right-sizing, FinOps, backup schedules, restore procedures, resilience testing, game days. NOT for basic resource requests/limits (Ch50) or HPA/KEDA autoscaling (Ch56).

imsanghaar · 0 stars · 20.02.2026

Categories: Container

Persona

You are an SRE/FinOps expert who understands that operational excellence means balancing cost efficiency with disaster preparedness and system resilience. You've managed production Kubernetes clusters and know that cost savings mean nothing if systems can't recover from failure.

Decision Tree

What operational task?
├── Cost Visibility/Optimization
│   ├── Need to see where money goes? → OpenCost (L03)
│   ├── Pods over/under-provisioned? → VPA recommendations (L02)
│   ├── Need budget alerts? → FinOps practices (L04)
│   └── Team-level billing? → Cost allocation labels (L04)
│
├── Backup & Disaster Recovery
│   ├── Need namespace backups? → Velero Schedule (L06)
│   ├── Defining recovery requirements? → RTO vs RPO analysis (L05)
│   ├── Database-aware backups? → Velero hooks (L06)
│   └── Following 3-2-1 rule? → Multi-location storage (L05)
│
├── Resilience Testing
│   ├── Test pod failure recovery? → PodChaos (L07)
│   ├── Test network partitions? → NetworkChaos (L07)
│   ├── Planned resilience validation? → Game Day (L07)
│   └── Recurring chaos tests? → Chaos Mesh Schedule (L07)
│
└── Compliance
    └── Data residency requirements? → Data sovereignty (L08)
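As an illustration of the "Need namespace backups? → Velero Schedule" branch combined with the RTO/RPO analysis, here is a minimal Python sketch that builds a Velero `Schedule` manifest as a plain dictionary and checks the backup interval against an RPO target. The helper names (`velero_schedule`, `meets_rpo`), the `orders` namespace, and the 30-day TTL are hypothetical and chosen only for this example. The sketch assumes Velero is installed in the `velero` namespace and treats worst-case data loss as roughly one backup interval, ignoring backup duration.

```python
import json


def velero_schedule(name, namespace, every_hours, ttl_hours=720):
    """Build a Velero Schedule manifest that backs up a single namespace.

    Hypothetical helper: every_hours must divide 24 so the cron
    expression yields evenly spaced backups.
    """
    if every_hours < 1 or 24 % every_hours != 0:
        raise ValueError("every_hours must be a positive divisor of 24")
    # Daily backups run at midnight; otherwise run every N hours on the hour.
    cron = "0 0 * * *" if every_hours == 24 else f"0 */{every_hours} * * *"
    return {
        "apiVersion": "velero.io/v1",
        "kind": "Schedule",
        # Assumption: Velero runs in the "velero" namespace.
        "metadata": {"name": name, "namespace": "velero"},
        "spec": {
            "schedule": cron,
            "template": {
                "includedNamespaces": [namespace],
                "ttl": f"{ttl_hours}h0m0s",  # how long each backup is retained
            },
        },
    }


def meets_rpo(every_hours, rpo_hours):
    """Simplified check: worst-case data loss is about one backup interval."""
    return every_hours <= rpo_hours


if __name__ == "__main__":
    manifest = velero_schedule("orders-every-6h", "orders", every_hours=6)
    print(json.dumps(manifest, indent=2))
    print("meets 8h RPO:", meets_rpo(6, 8))
```

The printed JSON can be saved to a file and applied with `kubectl apply -f`, since Kubernetes accepts JSON manifests as well as YAML. A stricter RPO check would also account for how long each backup takes to complete.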

VPA update modes

Mode	Behavior	When to Use
`Off`	Generate recommendations only	Production first steps; validate before acting
`Initial`	Apply to new pods only	Conservative; existing pods unchanged
`Recreate`	Evict and recreate pods	After validation; when restarts acceptable
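Assuming the modes above are values of the Vertical Pod Autoscaler `updateMode` field, the following Python sketch shows one way to generate a VPA manifest that defaults to the recommendation-only `Off` mode. The helper name `vpa_manifest` and the `checkout` Deployment are hypothetical. The sketch validates only the three modes listed in the table, and it assumes the VPA custom resource definitions are already installed in the cluster.

```python
import json

# Only the modes listed in the table; anything else is rejected.
VPA_MODES = ("Off", "Initial", "Recreate")


def vpa_manifest(deployment, namespace="default", mode="Off"):
    """Build a VerticalPodAutoscaler manifest targeting a Deployment.

    Hypothetical helper: defaults to "Off" so the VPA only produces
    recommendations and never changes running pods.
    """
    if mode not in VPA_MODES:
        raise ValueError(f"unsupported update mode: {mode!r}")
    return {
        "apiVersion": "autoscaling.k8s.io/v1",
        "kind": "VerticalPodAutoscaler",
        "metadata": {"name": f"{deployment}-vpa", "namespace": namespace},
        "spec": {
            "targetRef": {
                "apiVersion": "apps/v1",
                "kind": "Deployment",
                "name": deployment,
            },
            "updatePolicy": {"updateMode": mode},
        },
    }


if __name__ == "__main__":
    # Start in "Off" mode, review the recommendations, then switch modes.
    print(json.dumps(vpa_manifest("checkout"), indent=2))
```

Starting in `Off` mode matches the table's guidance to validate before acting: once the object is applied, the recommendations appear in its `status.recommendation` field and can be reviewed before moving to `Initial` or `Recreate`.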

Core Technologies

VPA (Vertical Pod Autoscaler)
