Fetch top unaddressed CloudZero Optimize recommendations, dispatch parallel research agents per item, apply SRE critique, and surface actionable findings with confidence verdicts and per-resource report files. Read-only research only.
Run a parallel, multi-agent cost optimization research sweep. Fetch the top unaddressed high-value recommendations from CloudZero Optimize, dispatch concurrent research agents to enrich each one, have each agent critique its own findings with a DevOps/SRE lens, and surface only the recommendations that look genuinely actionable.
This skill is research-only. No changes are made to any resource, configuration, code, or cloud account at any point. Read access only.
If you hit a dead end, missing credentials, or an ambiguous signal — stop and ask. Do not guess. Do not proceed past a blocker silently.
This skill runs read-only cloud CLI commands (e.g. aws ec2 describe-instances,
gcloud compute instances describe, kubectl get). For defense-in-depth, configure a
read-only IAM role or profile for the cloud accounts being investigated. Avoid
running this skill with credentials that have write access to production resources.
Call get_optimize_recs directly (do not delegate) with status = unaddressed, sorted by cost_impact_last_30_days descending. Fetch at least 25 results.
Step 1a — Threshold. If provided in invocation args (e.g. "threshold=200"), use it silently. Otherwise default to $500. Do not prompt yet.
Step 1b — Individuals. A recommendation qualifies if cost_impact_last_30_days >= [threshold]. Take up to the top 3 by savings descending.
Step 1c — Clusters. Group remaining recommendations by recommendation_type_name +
account (add region if still under threshold). Sum savings within each group. Keep
groups where combined savings >= [threshold] and no member qualified individually.
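Steps 1b and 1c can be sketched as follows. This is a minimal illustration, assuming each recommendation is a dict carrying the fields named above (cost_impact_last_30_days, recommendation_type_name, account); the function name and dict shape are hypothetical, not part of the CloudZero API.

```python
from collections import defaultdict

def triage(recs, threshold=500.0, max_items=3):
    """Step 1b: individual qualifiers; Step 1c: clusters of the remainder."""
    ranked = sorted(recs, key=lambda r: r["cost_impact_last_30_days"], reverse=True)
    individuals = [r for r in ranked
                   if r["cost_impact_last_30_days"] >= threshold][:max_items]
    taken = {id(r) for r in individuals}

    # Group the rest by type + account; keep groups whose combined
    # savings clears the same threshold.
    groups = defaultdict(list)
    for r in ranked:
        if id(r) not in taken:
            groups[(r["recommendation_type_name"], r["account"])].append(r)
    clusters = {
        key: members for key, members in groups.items()
        if sum(m["cost_impact_last_30_days"] for m in members) >= threshold
    }
    return individuals, clusters
```

The region sub-grouping mentioned in Step 1c is omitted here; in practice you would add region to the group key only when a type+account group still falls short.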
Step 1d — Present and offer. Show qualifying individuals in a table:
| # | Recommendation | Resource (Account) | Impact/mo | Effort |
|---|---|---|---|---|
| 1 | ... | resource (alias) | $X,XXX | Low |
Effort: Low = config/policy change; Medium = code or IaC change; High = migration or
multi-team coordination.
If sub-threshold clusters exist, describe them in prose below the table (combined savings + why collectively meaningful), then offer:
"Want me to: 1. Research just the [N] item(s) over $[threshold], 2. Lower the threshold and include the clusters too, or 3. Something else?"
Skip the offer if no clusters exist. Stop if nothing qualifies at all.
Wait for the user's response. "2" or equivalent = include clusters (cap still 3 total). "Keep agents unblocked" = run all phases in parallel without further confirmation.
Dispatch all agents concurrently as background agents. Each is independent.
Name each agent with a hyphenated [resource-type]-researcher label (e.g. s3-api-researcher,
rds-snapshot-researcher, gcp-storage-researcher). Present a dispatch table first:
| Agent | Target | Est. Savings |
|---|---|---|
| [resource-type]-researcher | [Recommendation type] ([account-alias]) | $X,XXX/mo |
| ... | ... | ... |
Pass each agent: recommendation type, cloud provider, account (with alias), region(s),
resource list (names/ARNs/IDs), resource type, combined monthly savings, oldest
created timestamp, and group size.
For groups, investigate a representative sample (up to 5). Label each sampled resource as sampled in the findings.
Extrapolate from the sample to adjust the savings estimate for the full group.
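One way to sketch the extrapolation, assuming each sampled resource's research reduces to a boolean "recommendation held up" signal (the function name and inputs are illustrative, not prescribed by this skill):

```python
def extrapolate_group_savings(sample_findings, claimed_total):
    """Scale a group's claimed savings by the fraction of sampled
    resources whose recommendation held up under research.

    sample_findings: list of bools, one per sampled resource.
    claimed_total: CloudZero's combined monthly savings for the group.
    """
    if not sample_findings:
        return claimed_total  # nothing sampled; keep the original estimate
    valid_rate = sum(sample_findings) / len(sample_findings)
    return round(claimed_total * valid_rate, 2)
```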
Load the CloudZero docs page for this recommendation type — fetch https://docs.cloudzero.com/docs/optimize, find the matching slug, fetch it. Understand what the rec detects, the safe-to-act threshold, and "do not act" signals. If unavailable, proceed with general judgment.
Run all applicable research checks below in parallel. Read-only only.
No access to a check → note "not checked — no access" and continue.
Check errors or loops → stop that check, note the failure, continue.
Search local IaC files for the resource identifier. Always single-quote interpolated resource identifiers in shell commands to prevent metacharacter interpretation:
grep -r '[resource-id]' . \
--include="*.tf" --include="*.tfvars" \
--include="*.yaml" --include="*.yml" --include="*.json" \
--include="*.ts" --include="*.py" 2>/dev/null
If found: resource is managed — changes must go through IaC, not applied directly. Note the file(s) and relevant stanza. If a GitOps pipeline (Spacelift, Atlantis, ArgoCD) is active, direct changes will be reverted.
Search git history for the resource identifier and recent IaC changes:
git log --all --oneline -S '[resource-id]'
git log --all --oneline --since="90 days ago" -- "*.tf" "*.yaml" "*.yml"
For remote repos: identify the owning repo from the service name or team tags, find the
default branch (gh repo view [org]/[repo] --json defaultBranchRef), then search
commits in a ±2-week window around the resource's created timestamp. For any suspicious
commit, fetch the PR body to understand intent and affected files.
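The ±2-week window can be computed like this; a sketch that builds the git invocation from an ISO 8601 created timestamp (the function and argument names are illustrative):

```python
from datetime import datetime, timedelta

def commit_window_cmd(created_iso, repo_path="."):
    """Build a git log command covering +/- 2 weeks around a
    resource's created timestamp (ISO 8601 date or datetime)."""
    created = datetime.fromisoformat(created_iso)
    since = (created - timedelta(days=14)).date().isoformat()
    until = (created + timedelta(days=14)).date().isoformat()
    return (f"git -C {repo_path} log --all --oneline "
            f"--since={since} --until={until}")
```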
For service-driven recommendations (not orphaned resources), trace the trigger chain: what calls this resource, how often, and from where? Focus on CronJobs, Airflow DAGs, EventBridge rules, Step Functions, and Pub/Sub consumers. Note commit authors — they are the right people to involve in remediation.
Jira: Search for the resource identifier, resource name, and owning service/team. An open "in use / do not touch" or active sprint ticket → SKIP signal. A closed cleanup ticket → corroborates the recommendation. No tickets → consistent with abandonment.
Slack (if MCP available): Search for the resource identifier and name. Recent mentions suggest active awareness or ongoing use.
CloudZero cost trend: Pull 60-day spend. Growing = more urgent. Declining = may already be addressed upstream. Note which team, product, or cost center owns the spend.
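A minimal way to classify the trend, assuming the 60-day spend is reduced to two monthly totals; the ±10% band is an arbitrary cutoff for this sketch, not a CloudZero convention:

```python
def classify_trend(monthly_spend):
    """Classify a spend series (oldest first) as growing, declining,
    or flat, using a +/-10% change band between first and last month."""
    earlier, later = monthly_spend[0], monthly_spend[-1]
    if earlier == 0:
        return "growing" if later > 0 else "flat"
    change = (later - earlier) / earlier
    if change > 0.10:
        return "growing"
    if change < -0.10:
        return "declining"
    return "flat"
```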
Query the cloud provider (read-only) to confirm current state and any existing optimizations. Tailor to resource type:
- S3 buckets: aws s3api get-bucket-lifecycle-configuration + aws s3api list-bucket-intelligent-tiering-configurations
- EC2 instances: aws ec2 describe-instances + 30-day CloudWatch CPUUtilization
- EBS volumes: aws ec2 describe-volumes
- Elastic IPs: aws ec2 describe-addresses
- RDS: aws rds describe-db-instances
- Azure: az vm show / az disk show / az storage account show
- GCP: gcloud compute instances describe / gcloud compute disks describe / gcloud storage buckets describe

Note what is already optimized vs. what is missing — partial optimization is still a valid finding.
Deletion / release (idle VMs, disks, IPs, DBs): Check for dependencies (DNS, security groups, load balancer targets, mount targets, snapshots). Verify genuinely idle — not a quiet period. No owner tags = consistent with abandonment, but not proof.
Rightsizing: Confirm 30+ days of consistently low utilization — not a recent dip. Check for ASG / HPA / managed node group (resize the template, not the instance). Check recent deploy history.
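The "sustained low, not a recent dip" test can be sketched as a simple check over daily CPU averages (as pulled from CloudWatch or equivalent); the 10% ceiling is an illustrative assumption, not a fixed rule:

```python
def sustained_low_cpu(daily_avg_cpu, ceiling=10.0, min_days=30):
    """True only if average CPU stayed below the ceiling for the most
    recent min_days consecutive days. A short recent dip, or fewer than
    min_days of data, does not qualify."""
    if len(daily_avg_cpu) < min_days:
        return False
    return all(v < ceiling for v in daily_avg_cpu[-min_days:])
```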
Reservations / commitments: Confirm utilization rate justifies commitment length. Check existing coverage. Flag upcoming migrations that could strand the commitment.
Kubernetes: kubectl top pods, describe pod, get pvc --all-namespaces,
get hpa --all-namespaces.
API / egress costs: These are often symptoms. Trace what is generating the calls (CronJobs, DAGs, polling loops). A storage class change won't fix a hot polling loop.
Storage tiering: Distinguish transient vs. long-lived prefixes. Identify external tables, ETL jobs, or warehouse queries that read from the bucket and would be affected by retrieval latency from colder tiers.
Upgrade / migration: Confirm IaC won't revert the change. Note pre-upgrade steps, compatibility validation, and change management requirements.
Do not narrate or label this phase in any output. The only output is the confidence verdict.
Internally challenge findings on:
- Recency bias: 7 days ≠ idle; need 30+.
- Deploy cycle: recent commits explaining low usage?
- Ownership ambiguity: no tags ≠ abandoned; the team may just not tag.
- Dependency shadows: quiet failover targets, quarterly-write buckets, VPN-anchoring EIPs.
- IaC revert risk: a direct change will be undone on next apply.
- Symptom vs. root cause: a polling loop driving LIST costs won't be fixed by a tier change.
- Savings confidence: backward-looking estimate vs. a changing cost trend.
Assign a verdict: HIGH (evidence supports acting), MEDIUM (plausible but needs owner confirmation), or SKIP (contradicted, risky, or actively in use). For groups, apply the verdict to the valid subset only:
Mixed groups: Split before assigning — surface VALID as a sub-group with adjusted savings, count INVALID as SKIP, treat UNCERTAIN as MEDIUM. If the valid sub-group falls below [threshold], suppress it too.
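The split rule above can be sketched as follows. This is an interpretation, not a prescribed algorithm: member labels are assumed to be VALID / INVALID / UNCERTAIN, the VALID sub-group is assumed to surface as HIGH, and the UNCERTAIN members as a MEDIUM sub-group.

```python
def split_group(members, threshold=500.0):
    """Split a mixed group's member findings into surfaced sub-groups
    and a skip count.

    members: list of (label, monthly_savings) tuples, where label is
    one of "VALID", "INVALID", "UNCERTAIN".
    Returns ([(verdict, combined_savings), ...], skipped_count);
    sub-groups below the threshold are suppressed.
    """
    valid = [s for label, s in members if label == "VALID"]
    uncertain = [s for label, s in members if label == "UNCERTAIN"]
    skipped = sum(1 for label, _ in members if label == "INVALID")
    surfaced = []
    if valid and sum(valid) >= threshold:
        surfaced.append(("HIGH", sum(valid)))
    if uncertain and sum(uncertain) >= threshold:
        surfaced.append(("MEDIUM", sum(uncertain)))
    return surfaced, skipped
```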
Surface each result as soon as its agent completes. Every result message ends with:
"Waiting on N more agents: [agent-name-1], [agent-name-2]..."
For each result show: agent name, recommendation type, 2–3 sentence summary (finding, realistic savings, strongest supporting or contradicting signal), and confidence verdict.
When the last agent completes, present:
| # | Recommendation | Resource (Account) | Claimed | Realistic | Verdict | Effort | Contact | Next Step |
|---|---|---|---|---|---|---|---|---|
Only include HIGH and MEDIUM rows. One line below for suppressed items: "N suppressed: [brief reason]." Then write report files (Phase 6).
"Want me to dig deeper on any of these — check git logs for a specific item, look up team contacts, or anything else?"
If the user asks deeper questions after the rollup, dispatch a new round of focused
parallel agents (e.g. git-log-[resource], [team]-contact-finder). Surface results
progressively with the same "Waiting on N more agents..." pattern. If a finding
materially changes a confidence verdict, note it and update the rollup row.
Write ./optimize-triage-reports/[resource-name]-investigation-[date].md for each HIGH
and MEDIUM result before presenting the rollup. Create the directory if it does not
exist. These reports may contain sensitive organizational data (account IDs, resource
identifiers, cost figures, team contacts, IaC configurations) — treat them as
confidential and do not commit them to version control.
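Resource identifiers (ARNs, paths) often contain characters that are unsafe in filenames, so the report path needs sanitizing before writing; a sketch, with the function name and replacement rule as illustrative assumptions:

```python
import re
from datetime import date
from pathlib import Path

def report_path(resource_name, when=None):
    """Build ./optimize-triage-reports/[resource-name]-investigation-[date].md,
    replacing filename-unsafe runs (/, :, spaces) with hyphens."""
    safe = re.sub(r"[^A-Za-z0-9._-]+", "-", resource_name).strip("-")
    when = when or date.today().isoformat()
    return Path("optimize-triage-reports") / f"{safe}-investigation-{when}.md"
```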
Each report contains:
Executive Summary — 2–4 sentences: what the resource is, what it costs, what the opportunity is. If realistic savings differs from CloudZero's estimate, explain why.
Current State — Resource details table (identifier, account, region, type, monthly cost, owner tag, IaC location), cost trend table (one row per month), plain-language trend interpretation, and existing config (IaC stanza or lifecycle rule, if found).
What This Resource Is — Purpose, data flow / trigger chain, downstream consumers, active/legacy/uncertain status. Based on evidence found — if ownership is unclear, say so.
Recommendation — One paragraph describing the action and why. Include copy-pasteable implementation details (code, commands, config) where gathered. Do not fabricate specifics.
Estimated Savings — Table: CloudZero estimate vs. conservative vs. optimistic, with a brief explanation of any difference.
Risks and Mitigations — One subsection per evidence-backed risk. Name the actual dependency, job, table, or compliance concern. No generic boilerplate.
Suggested Rollout Plan — Phased table: Phase 0 investigation (open questions to answer first) → Phase 1 first safe change → Phase 2 monitor → Phase 3 follow-on if applicable.
File References — IaC or source file paths, if found.
Related Tickets — Only if tickets were found.
Questions for the Owner — 3–6 specific, answerable questions based on actual open questions from research. Not generic boilerplate.
Do not make any changes to any resource, configuration, code, or cloud account. All actions in this skill are read-only.