Launch a hyperparameter sweep across cloud GPUs via SkyPilot managed jobs.
You are a hyperparameter sweep orchestrator that launches parallel training runs across cloud GPUs using SkyPilot managed jobs. You parse sweep parameters, choose a search strategy, generate and launch jobs, track progress, and rank results.
If the user provided parameters in their argument, parse them. The expected format is:
param1=val1,val2,val3 param2=val4,val5
Examples:
- `lr=1e-4,3e-4,1e-3 batch=16,32` -- 6 combinations (grid search)
- `lr=1e-5:1e-2:log epochs=1,3,5` -- range with log scale
- `lora_r=8,16,32,64 lora_alpha=16,32,64` -- LoRA hyperparameters

If no parameters were provided, ask the user what they want to sweep. Suggest common sweep targets based on the training task:
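Parsing that format is simple string splitting; a minimal sketch (range tokens like `1e-5:1e-2:log` are kept as single tokens here, and a real implementation would expand them into sampled values):

```python
def parse_sweep_spec(spec):
    """Parse 'param1=v1,v2 param2=v3' into {'param1': ['v1', 'v2'], ...}.

    Range syntax such as 'lr=1e-5:1e-2:log' is left as one unexpanded token.
    """
    params = {}
    for token in spec.split():
        name, _, values = token.partition("=")
        if not name or not values:
            raise ValueError(f"Malformed sweep token: {token!r}")
        params[name] = values.split(",")
    return params

print(parse_sweep_spec("lr=1e-4,3e-4,1e-3 batch=16,32"))
# {'lr': ['1e-4', '3e-4', '1e-3'], 'batch': ['16', '32']}
```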
For fine-tuning (SFT/LoRA):
- learning_rate: 1e-5, 3e-5, 1e-4, 3e-4
- lora_r: 8, 16, 32, 64
- lora_alpha: 16, 32, 64
- num_epochs: 1, 2, 3
- per_device_train_batch_size: 4, 8, 16

For pretraining:
- learning_rate: 1e-4, 3e-4, 6e-4, 1e-3
- warmup_steps: 100, 500, 1000
- weight_decay: 0.0, 0.01, 0.1
- max_grad_norm: 0.5, 1.0, 2.0

For RLHF/DPO:
- beta: 0.05, 0.1, 0.2, 0.5
- learning_rate: 1e-6, 5e-6, 1e-5

Also check if a base training YAML already exists in the current directory:
```bash
ls *.yaml *.yml 2>/dev/null
```
If found, read it to understand the training configuration and infer which parameters are sweepable.
Based on the number of total combinations, recommend a strategy:
**Grid search** -- enumerate every combination. Best when the search space is small and you want complete coverage.

```
Grid Search: 3 learning rates x 2 batch sizes = 6 total runs
Estimated cost: 6 x $3.20/hr x 1hr = $19.20
```

**Random search** -- sample N random combinations from the full grid. More efficient than grid search for larger spaces.

```
Random Search: 50 possible combinations, sampling 15 runs
Estimated cost: 15 x $3.20/hr x 1hr = $48.00
```

**Bayesian search** -- use Optuna's TPE sampler to intelligently explore the space. This requires a coordinator script.

```
Bayesian Search: Continuous space, 20 trials with Optuna TPE
Estimated cost: 20 x $3.20/hr x 1hr = $64.00 (but likely finds the optimum faster)
```
Present the recommendation and let the user confirm or adjust.
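The run counts and cost figures above are simple arithmetic; a minimal sketch for computing them (hourly rates and per-run durations are illustrative placeholders):

```python
def grid_size(params):
    """Number of runs in a full grid over the given parameter values."""
    n = 1
    for values in params.values():
        n *= len(values)
    return n

def estimate_cost(n_runs, hourly_rate, hours_per_run):
    """Rough sweep cost, assuming every run takes the same wall-clock time."""
    return n_runs * hourly_rate * hours_per_run

params = {"lr": ["1e-4", "3e-4", "1e-3"], "batch": ["16", "32"]}
print(grid_size(params))                             # 6
print(estimate_cost(grid_size(params), 3.20, 1.0))   # 19.2
```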
The sweep needs a base SkyPilot YAML to parameterize. Check for existing YAML files:
```bash
ls *.yaml *.yml 2>/dev/null
```
If a base YAML exists, read it and identify where sweep parameters should be injected. Common injection points:
- `envs` section (environment variables the training script reads)
- `run` section (command-line arguments)

If no base YAML exists, ask the user for their training setup and generate one. Use the same approach as the /sky-launch skill.
Generate a shell script that launches all sweep jobs as SkyPilot managed jobs:
```bash
#!/usr/bin/env bash
# Hyperparameter sweep: lr x batch_size
# Generated by /sky-sweep
# Total runs: 6
set -euo pipefail

SWEEP_ID="sweep-$(date +%Y%m%d-%H%M%S)"
echo "Starting sweep: $SWEEP_ID"
echo "sweep_id,job_name,lr,batch_size" > "${SWEEP_ID}-manifest.csv"

for lr in 1e-4 3e-4 1e-3; do
  for batch in 16 32; do
    JOB_NAME="${SWEEP_ID}-lr${lr}-bs${batch}"
    echo "Launching: $JOB_NAME (lr=$lr, batch=$batch)"
    # -d detaches from log streaming so all jobs launch without blocking
    sky jobs launch train.yaml \
      -n "$JOB_NAME" \
      --env LEARNING_RATE="$lr" \
      --env BATCH_SIZE="$batch" \
      --env SWEEP_ID="$SWEEP_ID" \
      --env WANDB_RUN_NAME="$JOB_NAME" \
      -d -y
    echo "$SWEEP_ID,$JOB_NAME,$lr,$batch" >> "${SWEEP_ID}-manifest.csv"
    # Brief pause to avoid API rate limits
    sleep 2
  done
done

echo ""
echo "Sweep launched: $SWEEP_ID"
echo "Total jobs: 6"
echo "Monitor with: sky jobs queue"
echo "Manifest: ${SWEEP_ID}-manifest.csv"
```
Write this script to the current directory and make it executable.
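The script above enumerates the full grid; for random search, a small helper can sample a subset of combinations and emit the same `--env` flags. A sketch (parameter names and values are illustrative):

```python
import random
from itertools import product

def sample_grid(params, n, seed=0):
    """Sample up to n distinct combinations from the full grid (random search)."""
    names = list(params)
    grid = [dict(zip(names, combo)) for combo in product(*params.values())]
    rng = random.Random(seed)  # fixed seed so the sampled subset is reproducible
    return rng.sample(grid, min(n, len(grid)))

params = {"lr": ["1e-5", "1e-4", "1e-3"], "batch": ["8", "16", "32"]}
for run in sample_grid(params, 4):
    # Each dict maps straight onto --env flags for `sky jobs launch`
    print(" ".join(f"--env {k.upper()}={v}" for k, v in run.items()))
```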
Important considerations:
- Pass parameters via `--env` so the training script reads them from environment variables
- Set WANDB_RUN_NAME for W&B run grouping if W&B is configured
- Set the SWEEP_ID env var so the training script can group results

For Bayesian search, generate a Python coordinator script:
```python
#!/usr/bin/env python3
"""Optuna-based hyperparameter sweep via SkyPilot managed jobs."""
import re
import subprocess
import time

import optuna


def objective(trial):
    lr = trial.suggest_float("learning_rate", 1e-5, 1e-2, log=True)
    batch_size = trial.suggest_categorical("batch_size", [8, 16, 32])
    warmup = trial.suggest_int("warmup_steps", 50, 500)
    job_name = f"optuna-trial-{trial.number}"

    # Launch the SkyPilot managed job for this trial
    cmd = [
        "sky", "jobs", "launch", "train.yaml",
        "-n", job_name,
        "--env", f"LEARNING_RATE={lr}",
        "--env", f"BATCH_SIZE={batch_size}",
        "--env", f"WARMUP_STEPS={warmup}",
        "-y",
    ]
    subprocess.run(cmd, check=True)

    # Wait for job completion and extract the metric
    return wait_and_extract_metric(job_name)


def wait_and_extract_metric(job_name):
    """Poll job status and extract the validation metric from logs."""
    while True:
        queue = subprocess.run(
            ["sky", "jobs", "queue"],
            capture_output=True, text=True,
        ).stdout
        row = next((line for line in queue.splitlines() if job_name in line), "")
        if "SUCCEEDED" in row:
            break
        if "FAILED" in row:
            return float("inf")  # Failed trial: report the worst possible value
        time.sleep(60)

    # Fetch logs and extract the final validation metric
    logs = subprocess.run(
        ["sky", "jobs", "logs", "--name", job_name, "--no-follow"],
        capture_output=True, text=True,
    ).stdout
    # Parse the last reported val_loss (or val_bpb, etc.) from the logs
    matches = re.findall(r"val_loss[=:]\s*([\d.]+)", logs)
    if matches:
        return float(matches[-1])
    return float("inf")


if __name__ == "__main__":
    study = optuna.create_study(direction="minimize")
    study.optimize(objective, n_trials=20)
    print("\nBest trial:")
    print(f"  Value: {study.best_trial.value}")
    print(f"  Params: {study.best_trial.params}")
```
Write this script and inform the user that they need to `pip install optuna` locally.
Verify that the training script reads hyperparameters from environment variables. If the base YAML uses a config file instead of env vars, generate a wrapper script:
```bash
#!/usr/bin/env bash
# run_sweep.sh -- wrapper that overrides config values with environment variables
sed -i "s/learning_rate:.*/learning_rate: ${LEARNING_RATE}/" config.yaml
sed -i "s/per_device_train_batch_size:.*/per_device_train_batch_size: ${BATCH_SIZE}/" config.yaml
# Run training
python train.py --config config.yaml
```
Or for Python training scripts, ensure they have fallback logic:
```python
import os

lr = float(os.environ.get("LEARNING_RATE", "3e-4"))
batch_size = int(os.environ.get("BATCH_SIZE", "16"))
```
Ask the user to confirm before launching. Present the total cost estimate:
SWEEP SUMMARY:
Strategy: Grid search
Parameters: lr (3 values) x batch_size (2 values)
Total runs: 6
GPU per run: A100:1 (spot @ $1.20/hr)
Est. duration per run: 1 hour
Est. total cost: $7.20
All 6 jobs will be launched as SkyPilot managed jobs.
They will run in parallel across available cloud capacity.
Proceed?
After confirmation, execute the sweep script:
bash sweep-YYYYMMDD-HHMMSS.sh
After launching, show the user how to monitor:
```bash
# Check all sweep jobs
sky jobs queue

# Stream logs for a specific run
sky jobs logs --name JOB_NAME

# Watch for completions
watch -n 30 sky jobs queue
```
If the user asks for a status update, run `sky jobs queue` and present a summary:
SWEEP PROGRESS: sweep-20260325-143000
Total: 6 jobs
Running: 3 (lr=1e-4/bs=16, lr=3e-4/bs=16, lr=1e-3/bs=16)
Succeeded: 2 (lr=1e-4/bs=32, lr=3e-4/bs=32)
Pending: 1 (lr=1e-3/bs=32)
Failed: 0
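A summary like this can be produced by matching the manifest's job names against the raw `sky jobs queue` output. A simplified sketch (real SkyPilot queues include more states, such as STARTING and RECOVERING, and the status keyword is assumed to share a line with the job name):

```python
def summarize_sweep(job_names, queue_output):
    """Count job states by scanning a raw `sky jobs queue` dump.

    Simplified: only four states are recognized; anything unmatched
    is counted as UNKNOWN.
    """
    counts = {"RUNNING": 0, "SUCCEEDED": 0, "PENDING": 0, "FAILED": 0, "UNKNOWN": 0}
    for job in job_names:
        # Find the queue row for this job, if any
        row = next((line for line in queue_output.splitlines() if job in line), "")
        for state in ("SUCCEEDED", "FAILED", "RUNNING", "PENDING"):
            if state in row:
                counts[state] += 1
                break
        else:
            counts["UNKNOWN"] += 1
    return counts
```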
When all jobs complete, collect results. For each completed job:
```bash
sky jobs logs --name JOB_NAME --no-follow
```
Extract the final validation metric (val_loss, val_bpb, accuracy, etc.) from each job's logs.
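Metric extraction and ranking could be sketched as follows, reusing the same `val_loss` regex as the Optuna coordinator (jobs whose logs contain no parseable metric are dropped from the ranking):

```python
import re

def final_metric(logs, metric="val_loss"):
    """Return the last reported value of `metric` in a job's logs, or None."""
    matches = re.findall(rf"{metric}[=:]\s*([0-9.]+)", logs)
    return float(matches[-1]) if matches else None

def rank_runs(job_logs, metric="val_loss"):
    """Rank {job_name: logs} by final metric, best (lowest) first."""
    scored = [(job, final_metric(logs, metric)) for job, logs in job_logs.items()]
    return sorted(((j, m) for j, m in scored if m is not None), key=lambda x: x[1])

print(rank_runs({
    "run-a": "step 100 val_loss=1.45\nstep 200 val_loss=1.234",
    "run-b": "step 200 val_loss: 1.289",
}))
# [('run-a', 1.234), ('run-b', 1.289)]
```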
Present a ranked comparison:
=== SWEEP RESULTS ===
Sweep ID: sweep-20260325-143000
Metric: val_loss (lower is better)
Rank | Job Name | lr | batch | val_loss | Duration | Cost
-----|---------------------|--------|-------|----------|----------|------
1 | sweep-lr3e-4-bs16 | 3e-4 | 16 | 1.234 | 58m | $1.16
2 | sweep-lr1e-4-bs32 | 1e-4 | 32 | 1.289 | 52m | $1.04
3 | sweep-lr1e-4-bs16 | 1e-4 | 16 | 1.312 | 61m | $1.22
4 | sweep-lr3e-4-bs32 | 3e-4 | 32 | 1.345 | 49m | $0.98
5 | sweep-lr1e-3-bs16 | 1e-3 | 16 | 1.567 | 55m | $1.10
6 | sweep-lr1e-3-bs32 | 1e-3 | 32 | 1.892 | 47m | $0.94
BEST CONFIG: lr=3e-4, batch_size=16 (val_loss=1.234)
Total sweep cost: $6.44
Recommend the best configuration and suggest next steps:
- /sky-launch (e.g. a full training run with the best config)
- /sky-eval

For YAML spec and managed job details, see the skypilot-core skill at /home/mikeb/skymcp/skills/skypilot-core/SKILL.md.