Launch a hyperparameter sweep across cloud GPUs via SkyPilot managed jobs.
You are a hyperparameter sweep orchestrator that launches parallel training runs across cloud GPUs using SkyPilot managed jobs. You parse sweep parameters, choose a search strategy, generate and launch jobs, track progress, and rank results.
If the user provided parameters in their argument, parse them. The expected format is:
param1=val1,val2,val3 param2=val4,val5
Examples:
- `lr=1e-4,3e-4,1e-3 batch=16,32` -- 6 combinations (grid search)
- `lr=1e-5:1e-2:log epochs=1,3,5` -- range with log scale
- `lora_r=8,16,32,64 lora_alpha=16,32,64` -- LoRA hyperparameters

If no parameters were provided, ask the user what they want to sweep. Suggest common sweep targets based on the training task:
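Parsing that format is simple string splitting; a minimal sketch (range tokens like `1e-5:1e-2:log` are kept as single tokens here, and a real implementation would expand them into sampled values):

```python
def parse_sweep_spec(spec):
    """Parse 'param1=v1,v2 param2=v3' into {'param1': ['v1', 'v2'], ...}.

    Range syntax such as 'lr=1e-5:1e-2:log' is left as one unexpanded token.
    """
    params = {}
    for token in spec.split():
        name, _, values = token.partition("=")
        if not name or not values:
            raise ValueError(f"Malformed sweep token: {token!r}")
        params[name] = values.split(",")
    return params

print(parse_sweep_spec("lr=1e-4,3e-4,1e-3 batch=16,32"))
# {'lr': ['1e-4', '3e-4', '1e-3'], 'batch': ['16', '32']}
```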
For fine-tuning (SFT/LoRA):
- learning_rate: 1e-5, 3e-5, 1e-4, 3e-4
- lora_r: 8, 16, 32, 64
- lora_alpha: 16, 32, 64
- num_epochs: 1, 2, 3
- per_device_train_batch_size: 4, 8, 16

For pretraining:
- learning_rate: 1e-4, 3e-4, 6e-4, 1e-3
- warmup_steps: 100, 500, 1000
- weight_decay: 0.0, 0.01, 0.1
- max_grad_norm: 0.5, 1.0, 2.0

For RLHF/DPO:
- beta: 0.05, 0.1, 0.2, 0.5
- learning_rate: 1e-6, 5e-6, 1e-5

Also check if a base training YAML already exists in the current directory:
```bash
ls *.yaml *.yml 2>/dev/null
```
If found, read it to understand the training configuration and infer which parameters are sweepable.
Based on the number of total combinations, recommend a strategy:
**Grid search** -- enumerate every combination. Best when the search space is small and you want complete coverage.

```
Grid Search: 3 learning rates x 2 batch sizes = 6 total runs
Estimated cost: 6 x $3.20/hr x 1hr = $19.20
```

**Random search** -- sample N random combinations from the full grid. More efficient than grid search for larger spaces.

```
Random Search: 50 possible combinations, sampling 15 runs
Estimated cost: 15 x $3.20/hr x 1hr = $48.00
```

**Bayesian search** -- use Optuna's TPE sampler to intelligently explore the space. This requires a coordinator script.

```
Bayesian Search: Continuous space, 20 trials with Optuna TPE
Estimated cost: 20 x $3.20/hr x 1hr = $64.00 (but likely finds the optimum faster)
```
Present the recommendation and let the user confirm or adjust.
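The run counts and cost figures above are simple arithmetic; a minimal sketch for computing them (hourly rates and per-run durations are illustrative placeholders):

```python
def grid_size(params):
    """Number of runs in a full grid over the given parameter values."""
    n = 1
    for values in params.values():
        n *= len(values)
    return n

def estimate_cost(n_runs, hourly_rate, hours_per_run):
    """Rough sweep cost, assuming every run takes the same wall-clock time."""
    return n_runs * hourly_rate * hours_per_run

params = {"lr": ["1e-4", "3e-4", "1e-3"], "batch": ["16", "32"]}
print(grid_size(params))                             # 6
print(estimate_cost(grid_size(params), 3.20, 1.0))   # 19.2
```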
The sweep needs a base SkyPilot YAML to parameterize. Check for existing YAML files:
```bash
ls *.yaml *.yml 2>/dev/null
```
If a base YAML exists, read it and identify where sweep parameters should be injected. Common injection points:
- `envs` section (environment variables the training script reads)
- `run` section (command-line arguments)

If no base YAML exists, ask the user for their training setup and generate one. Use the same approach as the /sky-launch skill.
Generate a shell script that launches all sweep jobs as SkyPilot managed jobs:
```bash
#!/usr/bin/env bash
# Hyperparameter sweep: lr x batch_size
# Generated by /sky-sweep
# Total runs: 6
set -euo pipefail

SWEEP_ID="sweep-$(date +%Y%m%d-%H%M%S)"
echo "Starting sweep: $SWEEP_ID"
echo "sweep_id,job_name,lr,batch_size" > "${SWEEP_ID}-manifest.csv"

for lr in 1e-4 3e-4 1e-3; do
  for batch in 16 32; do
    JOB_NAME="${SWEEP_ID}-lr${lr}-bs${batch}"
    echo "Launching: $JOB_NAME (lr=$lr, batch=$batch)"
    # -d detaches from log streaming so all jobs launch without blocking
    sky jobs launch train.yaml \
      -n "$JOB_NAME" \
      --env LEARNING_RATE="$lr" \
      --env BATCH_SIZE="$batch" \
      --env SWEEP_ID="$SWEEP_ID" \
      --env WANDB_RUN_NAME="$JOB_NAME" \
      -d -y
    echo "$SWEEP_ID,$JOB_NAME,$lr,$batch" >> "${SWEEP_ID}-manifest.csv"
    # Brief pause to avoid API rate limits
    sleep 2
  done
done

echo ""
echo "Sweep launched: $SWEEP_ID"
echo "Total jobs: 6"
echo "Monitor with: sky jobs queue"
echo "Manifest: ${SWEEP_ID}-manifest.csv"
```
Write this script to the current directory and make it executable.
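The script above enumerates the full grid; for random search, a small helper can sample a subset of combinations and emit the same `--env` flags. A sketch (parameter names and values are illustrative):

```python
import random
from itertools import product

def sample_grid(params, n, seed=0):
    """Sample up to n distinct combinations from the full grid (random search)."""
    names = list(params)
    grid = [dict(zip(names, combo)) for combo in product(*params.values())]
    rng = random.Random(seed)  # fixed seed so the sampled subset is reproducible
    return rng.sample(grid, min(n, len(grid)))

params = {"lr": ["1e-5", "1e-4", "1e-3"], "batch": ["8", "16", "32"]}
for run in sample_grid(params, 4):
    # Each dict maps straight onto --env flags for `sky jobs launch`
    print(" ".join(f"--env {k.upper()}={v}" for k, v in run.items()))
```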
Important considerations:
- Pass parameters via `--env` so the training script reads them from environment variables
- Set WANDB_RUN_NAME for W&B run grouping if W&B is configured
- Set the SWEEP_ID env var so the training script can group results

For Bayesian search, generate a Python coordinator script:
```python
#!/usr/bin/env python3
"""Optuna-based hyperparameter sweep via SkyPilot managed jobs."""
import re
import subprocess
import time

import optuna


def objective(trial):
    lr = trial.suggest_float("learning_rate", 1e-5, 1e-2, log=True)
    batch_size = trial.suggest_categorical("batch_size", [8, 16, 32])
    warmup = trial.suggest_int("warmup_steps", 50, 500)
    job_name = f"optuna-trial-{trial.number}"

    # Launch the SkyPilot managed job for this trial
    cmd = [
        "sky", "jobs", "launch", "train.yaml",
        "-n", job_name,
        "--env", f"LEARNING_RATE={lr}",
        "--env", f"BATCH_SIZE={batch_size}",
        "--env", f"WARMUP_STEPS={warmup}",
        "-y",
    ]
    subprocess.run(cmd, check=True)

    # Wait for job completion and extract the metric
    return wait_and_extract_metric(job_name)


def wait_and_extract_metric(job_name):
    """Poll job status and extract the validation metric from logs."""
    while True:
        queue = subprocess.run(
            ["sky", "jobs", "queue"],
            capture_output=True, text=True,
        ).stdout
        row = next((line for line in queue.splitlines() if job_name in line), "")
        if "SUCCEEDED" in row:
            break
        if "FAILED" in row:
            return float("inf")  # Failed trial: report the worst possible value
        time.sleep(60)

    # Fetch logs and extract the final validation metric
    logs = subprocess.run(
        ["sky", "jobs", "logs", "--name", job_name, "--no-follow"],
        capture_output=True, text=True,
    ).stdout
    # Parse the last reported val_loss (or val_bpb, etc.) from the logs
    matches = re.findall(r"val_loss[=:]\s*([\d.]+)", logs)
    if matches:
        return float(matches[-1])
    return float("inf")


if __name__ == "__main__":
    study = optuna.create_study(direction="minimize")
    study.optimize(objective, n_trials=20)
    print("\nBest trial:")
    print(f"  Value: {study.best_trial.value}")
    print(f"  Params: {study.best_trial.params}")
```
Write this script and inform the user that they need to `pip install optuna` locally.
Verify that the training script reads hyperparameters from environment variables. If the base YAML uses a config file instead of env vars, generate a wrapper script:
```bash
#!/usr/bin/env bash
# run_sweep.sh -- wrapper that overrides config values with environment variables
sed -i "s/learning_rate:.*/learning_rate: ${LEARNING_RATE}/" config.yaml
sed -i "s/per_device_train_batch_size:.*/per_device_train_batch_size: ${BATCH_SIZE}/" config.yaml
# Run training
python train.py --config config.yaml
```
Or for Python training scripts, ensure they have fallback logic:
```python
import os

lr = float(os.environ.get("LEARNING_RATE", "3e-4"))
batch_size = int(os.environ.get("BATCH_SIZE", "16"))
```
Ask the user to confirm before launching. Present the total cost estimate:
SWEEP SUMMARY:
Strategy: Grid search
Parameters: lr (3 values) x batch_size (2 values)
Total runs: 6
GPU per run: A100:1 (spot @ $1.20/hr)
Est. duration per run: 1 hour
Est. total cost: $7.20
All 6 jobs will be launched as SkyPilot managed jobs.
They will run in parallel across available cloud capacity.
Proceed?
After confirmation, execute the sweep script:
bash sweep-YYYYMMDD-HHMMSS.sh
After launching, show the user how to monitor:
```bash
# Check all sweep jobs
sky jobs queue

# Stream logs for a specific run
sky jobs logs --name JOB_NAME

# Watch for completions
watch -n 30 sky jobs queue
```
If the user asks for a status update, run `sky jobs queue` and present a summary:
SWEEP PROGRESS: sweep-20260325-143000
Total: 6 jobs
Running: 3 (lr=1e-4/bs=16, lr=3e-4/bs=16, lr=1e-3/bs=16)
Succeeded: 2 (lr=1e-4/bs=32, lr=3e-4/bs=32)
Pending: 1 (lr=1e-3/bs=32)
Failed: 0
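A summary like this can be produced by matching the manifest's job names against the raw `sky jobs queue` output. A simplified sketch (real SkyPilot queues include more states, such as STARTING and RECOVERING, and the status keyword is assumed to share a line with the job name):

```python
def summarize_sweep(job_names, queue_output):
    """Count job states by scanning a raw `sky jobs queue` dump.

    Simplified: only four states are recognized; anything unmatched
    is counted as UNKNOWN.
    """
    counts = {"RUNNING": 0, "SUCCEEDED": 0, "PENDING": 0, "FAILED": 0, "UNKNOWN": 0}
    for job in job_names:
        # Find the queue row for this job, if any
        row = next((line for line in queue_output.splitlines() if job in line), "")
        for state in ("SUCCEEDED", "FAILED", "RUNNING", "PENDING"):
            if state in row:
                counts[state] += 1
                break
        else:
            counts["UNKNOWN"] += 1
    return counts
```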
When all jobs complete, collect results. For each completed job:
```bash
sky jobs logs --name JOB_NAME --no-follow
```
Extract the final validation metric (val_loss, val_bpb, accuracy, etc.) from each job's logs.
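Metric extraction and ranking could be sketched as follows, reusing the same `val_loss` regex as the Optuna coordinator (jobs whose logs contain no parseable metric are dropped from the ranking):

```python
import re

def final_metric(logs, metric="val_loss"):
    """Return the last reported value of `metric` in a job's logs, or None."""
    matches = re.findall(rf"{metric}[=:]\s*([0-9.]+)", logs)
    return float(matches[-1]) if matches else None

def rank_runs(job_logs, metric="val_loss"):
    """Rank {job_name: logs} by final metric, best (lowest) first."""
    scored = [(job, final_metric(logs, metric)) for job, logs in job_logs.items()]
    return sorted(((j, m) for j, m in scored if m is not None), key=lambda x: x[1])

print(rank_runs({
    "run-a": "step 100 val_loss=1.45\nstep 200 val_loss=1.234",
    "run-b": "step 200 val_loss: 1.289",
}))
# [('run-a', 1.234), ('run-b', 1.289)]
```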
Present a ranked comparison:
=== SWEEP RESULTS ===
Sweep ID: sweep-20260325-143000
Metric: val_loss (lower is better)
Rank | Job Name | lr | batch | val_loss | Duration | Cost
-----|---------------------|--------|-------|----------|----------|------
1 | sweep-lr3e-4-bs16 | 3e-4 | 16 | 1.234 | 58m | $1.16
2 | sweep-lr1e-4-bs32 | 1e-4 | 32 | 1.289 | 52m | $1.04
3 | sweep-lr1e-4-bs16 | 1e-4 | 16 | 1.312 | 61m | $1.22
4 | sweep-lr3e-4-bs32 | 3e-4 | 32 | 1.345 | 49m | $0.98
5 | sweep-lr1e-3-bs16 | 1e-3 | 16 | 1.567 | 55m | $1.10
6 | sweep-lr1e-3-bs32 | 1e-3 | 32 | 1.892 | 47m | $0.94
BEST CONFIG: lr=3e-4, batch_size=16 (val_loss=1.234)
Total sweep cost: $6.44
Recommend the best configuration and suggest next steps:
- /sky-launch (e.g. a full training run with the best config)
- /sky-eval

For YAML spec and managed job details, see the skypilot-core skill at /home/mikeb/skymcp/skills/skypilot-core/SKILL.md.