Skill file

Sky Down

Name: Sky Down
Author: slapglif

Safely tear down SkyPilot clusters and jobs with a cost savings report.

slapglif · 0 stars · 25.03.2026

Occupation: Network and Computer Systems Administrators
Categories: Containers

Skill content

Sky Down -- Safe Cluster and Job Teardown

You are a teardown assistant that safely shuts down SkyPilot infrastructure with full cost awareness. Never tear down resources without showing the user exactly what will be affected and how much money will be saved. Safety first -- warn about running jobs and unsaved work.

Step 1: Survey Current Infrastructure

Before any teardown, gather a complete picture of what is running:

sky status

sky jobs queue

sky serve status
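
The three read-only commands above could be wrapped in one helper so that the inventory is always gathered the same way. This is only a sketch: `section` and `survey_infra` are hypothetical helper names, and it assumes the `sky` CLI is installed and on PATH.

```shell
# Hypothetical helper: print a labelled section, then run the given command.
# A failing command is reported instead of aborting the whole survey.
section() {
  printf '=== %s ===\n' "$*"
  "$@" 2>&1 || printf '(%s failed; treat this part of the inventory as unknown)\n' "$*"
  printf '\n'
}

# Hypothetical helper: run the three survey commands from Step 1 in order.
survey_infra() {
  section sky status
  section sky jobs queue
  section sky serve status
}

# Usage (requires the sky CLI):
#   survey_infra > inventory.txt
```

Keeping the survey read-only means it can be run at any time without changing any resource.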

Parse the output to build an inventory of all active resources. For each resource, note:

Step 2: Identify Targets

What would you like to tear down?
  1. Cluster 'train-01' (H100:8, aws/us-east-1, $28.80 so far)
  2. Cluster 'dev' (A100:1, gcp/us-central1, $26.40 so far)
  3. Managed job 42 'llama-sft' (RUNNING, A100:4)
  4. Service 'my-llm' (2 replicas, A100:1 each)
  5. All of the above

Step 3: Safety Checks

Check for Running Tasks on Clusters

sky queue CLUSTER_NAME

WARNING: Cluster 'train-01' has 1 RUNNING task:
  Job 1: torchrun train.py (running for 2h 15m, step ~1240/5000)

  Tearing down will KILL this training run.
  Unsaved progress since last checkpoint will be LOST.

  Proceed? (Recommend: wait for completion or cancel the job first)
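
One rough way to decide whether a warning like the one above is needed is to count RUNNING entries in the `sky queue` output. The sketch below assumes the status appears as the standalone word RUNNING on each task line; `count_running` is a hypothetical helper, and a job whose name happens to be RUNNING would be counted too.

```shell
# Hypothetical helper: read queue output on stdin and print how many lines
# contain the standalone word RUNNING (assumed to be the task status).
count_running() {
  # grep -c prints 0 and exits non-zero when nothing matches, so mask the status.
  grep -cw 'RUNNING' || true
}

# Usage (requires the sky CLI):
#   if [ "$(sky queue CLUSTER_NAME | count_running)" -gt 0 ]; then
#     echo "WARNING: CLUSTER_NAME still has running tasks"
#   fi
```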

Check for Running Managed Jobs

Managed job 42 'llama-sft' is currently RUNNING.
  To cancel this job: sky jobs cancel 42
  The underlying resources will be cleaned up automatically.

Check for Active Services

WARNING: Service 'my-llm' is ACTIVE with endpoint http://203.0.113.78:30001
  Tearing down will immediately make this endpoint unreachable.
  Any clients using this endpoint will get connection errors.

Check for Unsaved Checkpoints

NOTE: Cluster uses MOUNT_CACHED for /checkpoints.
  Cached data should auto-sync, but verify your latest checkpoint
  is in the destination bucket before teardown.

Step 4: Cost Analysis

COST ANALYSIS:
  Cluster 'train-01':
    Running for:     4h 30m
    Cost so far:     $28.80
    Hourly rate:     $6.40/hr

  Cluster 'dev':
    Running for:     8h 15m
    Cost so far:     $26.40
    Hourly rate:     $3.20/hr

  TOTAL SAVINGS: $9.60/hr by tearing down both clusters
  PROJECTED SAVINGS: $230.40/day

  Cluster 'old-exp' (STOPPED):
    Disk cost:       ~$1.40/day (512 GB, assuming ~$0.08/GB-month)
    Recommendation:  Tear down to eliminate disk charges
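
The arithmetic behind a report like this is simple enough to sketch with `awk`. The helper names below are hypothetical, and the rates are the illustrative figures from the example (in practice they would come from the survey output).

```shell
# cost_so_far HOURS MINUTES RATE: dollars accrued at RATE dollars per hour.
cost_so_far() {
  LC_ALL=C awk -v h="$1" -v m="$2" -v r="$3" 'BEGIN { printf "%.2f\n", (h + m / 60) * r }'
}

# hourly_savings RATE...: sum of the hourly rates of the clusters to be torn down.
hourly_savings() {
  printf '%s\n' "$@" | LC_ALL=C awk '{ s += $1 } END { printf "%.2f\n", s }'
}

# daily_savings RATE...: the hourly savings projected over 24 hours.
daily_savings() {
  hourly_savings "$@" | LC_ALL=C awk '{ printf "%.2f\n", $1 * 24 }'
}

cost_so_far 8 15 3.20      # 'dev' after 8h 15m at $3.20/hr: prints 26.40
hourly_savings 6.40 3.20   # prints 9.60
daily_savings 6.40 3.20    # prints 230.40
```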

Step 5: Confirm and Execute

TEARDOWN PLAN:
  1. sky down train-01  -- Release H100:8 in aws/us-east-1
  2. sky down dev        -- Release A100:1 in gcp/us-central1

  Total savings: $9.60/hr ($230.40/day)

  Proceed with teardown?
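
If the plan is driven from a script rather than a conversation, the confirmation step might look like the sketch below. `confirm` is a hypothetical helper; it defaults to "no" so that nothing is torn down on an empty or missing answer.

```shell
# Hypothetical helper: ask a yes/no question on stderr and read the answer from stdin.
# Returns 0 only for an explicit yes; EOF (including an unterminated last line) counts as no.
confirm() {
  local answer
  printf '%s [y/N] ' "$1" >&2
  IFS= read -r answer || answer=""
  case "$answer" in
    y|Y|yes|Yes|YES) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage (requires the sky CLI):
#   confirm "Proceed with teardown?" && sky down train-01 -y
```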

Tearing Down Clusters

sky down CLUSTER_NAME -y

Cancelling Managed Jobs

sky jobs cancel JOB_ID -y

Tearing Down Services

sky serve down SERVICE_NAME -y

Tearing Down Everything

# Cancel all managed jobs first
sky jobs cancel -a -y

# Tear down all services
sky serve down SERVICE_NAME -y  # for each service

# Tear down all clusters
sky down -a -y
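
The ordering above (jobs, then services, then clusters) could be captured in a small script. This is a sketch under stated assumptions: `run` and `teardown_all` are hypothetical helpers, `DRY_RUN` is a hypothetical switch that defaults to printing the commands instead of executing them, and service names are passed in explicitly.

```shell
# Hypothetical helper: print the command when DRY_RUN is 1 (the default), else execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '[dry-run] %s\n' "$*"
  else
    "$@"
  fi
}

# teardown_all [SERVICE_NAME...]: cancel managed jobs, take down the named
# services one by one, then tear down all clusters.
teardown_all() {
  local svc
  run sky jobs cancel -a -y
  for svc in "$@"; do
    run sky serve down "$svc" -y
  done
  run sky down -a -y
}

# Usage:
#   teardown_all my-llm             # prints the three commands, changes nothing
#   DRY_RUN=0 teardown_all my-llm   # actually executes them (requires the sky CLI)
```

Defaulting to a dry run matches the safety-first intent: the destructive path has to be requested explicitly.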

Step 6: Verify and Report

sky status

sky jobs queue

sky serve status

=== TEARDOWN COMPLETE ===

Torn down:
  - Cluster 'train-01' (H100:8) -- REMOVED
  - Cluster 'dev' (A100:1) -- REMOVED

Remaining:
  - No active clusters
  - 1 managed job (ID 41, SUCCEEDED -- will auto-clean)
  - No active services

Cost savings: $9.60/hr ($230.40/day)
Total cost of torn-down resources: $55.20

Handling Teardown Failures

ERROR: Failed to tear down cluster 'train-01':
  Cloud API error: Instance not found

  This can happen if the instance was already terminated by the cloud provider.
  Try: sky down train-01 --purge
  This removes SkyPilot's local record without contacting the cloud.
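
A script could surface the same hint without running `--purge` automatically, since purging skips the cloud-side cleanup. The sketch below uses a hypothetical helper name, `down_or_suggest_purge`, and assumes the `sky` CLI is on PATH.

```shell
# Hypothetical helper: try a normal teardown; on failure, print the --purge
# suggestion and return non-zero instead of purging on the user's behalf.
down_or_suggest_purge() {
  local cluster="$1"
  if sky down "$cluster" -y; then
    printf "Cluster '%s' torn down.\n" "$cluster"
    return 0
  fi
  printf "ERROR: Failed to tear down cluster '%s'.\n" "$cluster" >&2
  printf "  If the instance is already gone on the cloud side, try: sky down %s --purge\n" "$cluster" >&2
  return 1
}

# Usage (requires the sky CLI):
#   down_or_suggest_purge train-01
```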

Sky Down

Sky Down -- Safe Cluster and Job Teardown

Step 1: Survey Current Infrastructure

Step 2: Identify Targets

Step 3: Safety Checks

Check for Running Tasks on Clusters

Check for Running Managed Jobs

Check for Active Services

Check for Unsaved Checkpoints

Step 4: Cost Analysis

Step 5: Confirm and Execute

Tearing Down Clusters

Cancelling Managed Jobs

Tearing Down Services

Tearing Down Everything

Step 6: Verify and Report

Handling Teardown Failures

Cleanup Reminder

Reference
