Use this skill for cloud-hosted or continuously running agent systems that need operational controls beyond single CLI sessions.
Operational Domains
- runtime lifecycle (start, pause, stop, restart)
- observability (logs, metrics, traces)
- safety controls (scopes, permissions, kill switches)
- change management (rollout, rollback, audit)
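The runtime lifecycle domain can be pictured as a small state machine. The sketch below is illustrative only: the `State` values, the `TRANSITIONS` table, and the `AgentRuntime` class are hypothetical names, and the specific transition rules (for example, that `start` also resumes a paused agent) are assumptions rather than requirements of this skill.

```python
from enum import Enum


class State(Enum):
    STOPPED = "stopped"
    RUNNING = "running"
    PAUSED = "paused"


# Assumed transition table: action -> {current state: next state}.
TRANSITIONS = {
    "start": {State.STOPPED: State.RUNNING, State.PAUSED: State.RUNNING},
    "pause": {State.RUNNING: State.PAUSED},
    "stop": {State.RUNNING: State.STOPPED, State.PAUSED: State.STOPPED},
    "restart": {State.RUNNING: State.RUNNING, State.PAUSED: State.RUNNING},
}


class AgentRuntime:
    """Hypothetical lifecycle controller that records every transition."""

    def __init__(self):
        self.state = State.STOPPED
        self.audit = []  # list of (action, from_state, to_state) tuples

    def apply(self, action):
        allowed = TRANSITIONS.get(action, {})
        if self.state not in allowed:
            # Reject actions that are invalid in the current state.
            raise ValueError(f"cannot {action} while {self.state.value}")
        previous = self.state
        self.state = allowed[previous]
        self.audit.append((action, previous.value, self.state.value))
        return self.state
```

Rejecting invalid transitions explicitly, instead of silently ignoring them, keeps the recorded history usable as an audit trail for change management.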
Baseline Controls
- immutable deployment artifacts
- least-privilege credentials
- environment-level secret injection
- hard timeouts and retry budgets
- audit log for high-risk actions
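As one way to read the timeout and retry budget item, the following minimal sketch caps both the number of attempts and the total elapsed time for a task. The names `run_with_budget` and `BudgetExceeded`, and the default limits, are hypothetical. Note the assumption that the deadline is checked only between attempts: this does not interrupt a task that is already running, so a truly hard timeout would also need process-level or platform-level enforcement.

```python
import time


class BudgetExceeded(Exception):
    """Raised when the attempt budget or the time budget is used up."""


def run_with_budget(task, max_attempts=3, deadline_seconds=30.0,
                    clock=time.monotonic):
    """Call task() until it succeeds or a budget runs out.

    Returns (result, attempts_used). The clock is injectable for testing.
    """
    start = clock()
    last_error = None
    for attempt in range(1, max_attempts + 1):
        # Time budget: checked before each attempt, not during it.
        if clock() - start >= deadline_seconds:
            raise BudgetExceeded(
                f"deadline exceeded before attempt {attempt}"
            ) from last_error
        try:
            return task(), attempt
        except Exception as exc:  # broad on purpose for this sketch
            last_error = exc
    # Retry budget: every allowed attempt failed.
    raise BudgetExceeded(f"gave up after {max_attempts} attempts") from last_error
```

A caller would wrap each agent action in `run_with_budget` and treat `BudgetExceeded` as a terminal failure to be logged, not retried again at a higher level.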
Metrics to Track
- success rate
- mean retries per task
- time to recovery
- cost per successful task
- failure class distribution
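Most of these metrics can be derived from per-task records. The sketch below assumes a hypothetical `TaskRecord` shape and a hypothetical `summarize` helper. It also assumes that cost per successful task means total spend (including failed tasks) divided by the number of successes. Time to recovery is left out because it depends on incident timestamps rather than task records.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskRecord:
    """Hypothetical per-task record; field names are assumptions."""
    succeeded: bool
    retries: int
    cost_usd: float
    failure_class: Optional[str] = None


def summarize(records):
    """Aggregate task records into the metrics listed above."""
    total = len(records)
    if total == 0:
        return {
            "success_rate": None,
            "mean_retries_per_task": None,
            "cost_per_successful_task": None,
            "failure_class_distribution": {},
        }
    successes = sum(1 for r in records if r.succeeded)
    total_cost = sum(r.cost_usd for r in records)
    failures = Counter(
        r.failure_class or "unknown" for r in records if not r.succeeded
    )
    return {
        "success_rate": successes / total,
        "mean_retries_per_task": sum(r.retries for r in records) / total,
        # Failed tasks still cost money, so they count toward the numerator.
        "cost_per_successful_task": total_cost / successes if successes else None,
        "failure_class_distribution": dict(failures),
    }
```

Charging failed attempts to the successful ones makes wasted spend visible: a rising cost per successful task with a flat success rate usually points at retries or long-running failures.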
Incident Pattern
When failures spike: