Executes Python scripts, tests, or benchmarks on a provisioned remote cluster (GPU or TPU) using SkyPilot. Use this skill when the user asks to run code on GPU, TPU, or any "remote" cluster.
The following defaults apply unless the user explicitly overrides them:
| Parameter | Default |
|---|---|
| PROJECT_ID | tpu-service-473302 |
| CLUSTER_NAME | sglang-jax-agent-tests |
| ZONE | asia-northeast1-b |
| NUM_SLICES | 1 |
Use these values directly — do NOT ask the user to confirm or re-enter them unless they specify otherwise.
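For reference, the defaults above can be pinned as environment variables at the start of a session (the variable names are illustrative, not required by SkyPilot or gcloud):

```shell
# Illustrative defaults from the table above; variable names are assumptions.
export PROJECT_ID=tpu-service-473302
export CLUSTER_NAME=sglang-jax-agent-tests
export ZONE=asia-northeast1-b
export NUM_SLICES=1
```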
Identify the target device from the user's request:
| Target | Cluster name file | Env prefix |
|---|---|---|
| GPU | .cluster_name_gpu | export CUDA_VISIBLE_DEVICES=0; |
| TPU | .cluster_name_tpu | (none) |
If the user does not specify a device, ask them which one to use.
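The device-to-file mapping above can be sketched as a small lookup (the helper name `cluster_file_for` is hypothetical; the file names come from the table):

```shell
# Hypothetical helper: resolve the cluster-name file for a target device.
cluster_file_for() {
  case "$1" in
    gpu|GPU) echo ".cluster_name_gpu" ;;
    tpu|TPU) echo ".cluster_name_tpu" ;;
    *) echo "unknown device: $1" >&2; return 1 ;;
  esac
}
```

If the device argument is missing or unrecognized, fall back to asking the user, per the rule above.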
Before executing, verify that the corresponding cluster name file (.cluster_name_gpu or .cluster_name_tpu) exists and is non-empty in the project root.

GPU clusters are provisioned using the standalone launch_gpu.sh script. Locate it in the scripts/ directory alongside this skill definition.
# Common accelerator types: H100:1, A100:1, L4:1
bash <absolute_path_to_launch_gpu.sh> <accelerator_type> <experiment_name>
The launch script automatically updates .cluster_name_gpu.
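The preflight check on the cluster name file can be sketched as a function that prints the cluster name on success and fails otherwise (the helper name is an assumption):

```shell
# Hypothetical preflight: print the cluster name if the file exists and is
# non-empty; otherwise complain and fail.
cluster_from_file() {
  if [ -s "$1" ]; then
    cat "$1"
  else
    echo "missing or empty $1; run the launch script first" >&2
    return 1
  fi
}
```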
There are two provisioning paths for TPU:
Path A: GKE-based (via the deploy-cluster skill) — Recommended. This path provisions TPU on GKE using the full pipeline: apply-resource -> deploy-cluster -> exec-remote.
Each TPU type gets its own SkyPilot cluster named <cluster>-<username>-<tpu_type>, allowing multiple topologies to run in parallel.
Invoke the deploy-cluster skill, which will:
- create the GKE cluster if it does not already exist (via apply-resource)
- deploy SkyPilot and record the cluster name in .cluster_name_tpu

See /deploy-cluster for details.
Supported TPU types: v6e-1, v6e-4, v6e-8, v6e-16, v6e-32, v6e-64, v6e-128, v6e-256
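The supported-type list above can be enforced with a small guard before launching (the function name is an assumption; the type list is copied from the line above):

```shell
# Hypothetical guard: succeed only for TPU types this skill supports.
valid_tpu_type() {
  case "$1" in
    v6e-1|v6e-4|v6e-8|v6e-16|v6e-32|v6e-64|v6e-128|v6e-256) return 0 ;;
    *) return 1 ;;
  esac
}
```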
Path B: Standalone. For quick, single-node TPU usage without GKE, use the standalone launch_tpu.sh script:
# Common accelerator types: tpu-v4-8, tpu-v4-16, tpu-v6e-1, tpu-v6e-4
bash <absolute_path_to_launch_tpu.sh> <accelerator_type> <experiment_name>
The launch script automatically updates .cluster_name_tpu.
To tear down clusters when finished:

# GPU
sky down $(cat .cluster_name_gpu) -y
# TPU (tear down all per-TPU-type clusters)
sky down <CLUSTER_NAME>-<USERNAME>-v6e-1 -y
sky down <CLUSTER_NAME>-<USERNAME>-v6e-4 -y
For GKE-based TPU, also remove the GKE cluster via /apply-resource delete if no longer needed.
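One way to sketch the multi-type teardown is a generator that prints one sky down command per TPU type, so the per-TPU-type naming convention can be checked without touching real clusters (the helper name and arguments are placeholders):

```shell
# Hypothetical generator: print a "sky down" command for each TPU type,
# following the <cluster>-<username>-<tpu_type> naming convention.
teardown_cmds() {
  cluster=$1
  user=$2
  shift 2
  for t in "$@"; do
    echo "sky down $cluster-$user-$t -y"
  done
}
```

Pipe the output to sh (or run each line) to perform the actual teardown.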
sky exec $(cat .cluster_name_gpu) --workdir . "export CUDA_VISIBLE_DEVICES=0; uv run --extra gpu python <PATH_TO_SCRIPT> [ARGS]"
- export CUDA_VISIBLE_DEVICES=0; ensures deterministic single-GPU execution. Adjust for multi-GPU jobs.
- --extra gpu activates GPU optional dependencies (e.g. jax[cuda]).

For TPU:

sky exec <CLUSTER_NAME>-<USERNAME>-<TPU_TYPE> --workdir . "uv run --extra tpu python <PATH_TO_SCRIPT> [ARGS]"
- --extra tpu activates TPU optional dependencies (e.g. jax[tpu]).
- The cluster name follows the per-TPU-type convention (e.g. sglang-jax-agent-tests-hongmao-v6e-1).
- --workdir . syncs the current local directory to the remote instance before running.
- Use python -m pytest <test_path> instead of calling pytest directly.

Run a benchmark on GPU:
sky exec $(cat .cluster_name_gpu) --workdir . "export CUDA_VISIBLE_DEVICES=0; uv run --extra gpu python src/lynx/perf/benchmark_train.py"
Run tests on TPU (single type):
sky exec sglang-jax-agent-tests-hongmao-v6e-4 --workdir . "uv run --extra tpu python -m pytest src/lynx/test/"
Run CI tests on multiple TPU types in parallel:
# Deploy both types (sequential — config.yaml is global)
python <deploy-cluster>/scripts/deploy.py sglang-jax-agent-tests v6e-1 asia-northeast1-b
python <deploy-cluster>/scripts/deploy.py sglang-jax-agent-tests v6e-4 asia-northeast1-b
# Execute in parallel
sky exec sglang-jax-agent-tests-hongmao-v6e-1 --workdir . "python test/srt/run_suite.py --suite unit-test-tpu-v6e-1" &
sky exec sglang-jax-agent-tests-hongmao-v6e-4 --workdir . "python test/srt/run_suite.py --suite e2e-test-tpu-v6e-4" &
wait
- sky exec streams stdout and stderr directly to the terminal.
- Ctrl+C may not kill the remote process; check SkyPilot docs for cleanup if needed.

When the user requests to run code on TPU and no .cluster_name_tpu exists (or the user explicitly wants a new cluster), follow this procedure to orchestrate the full pipeline: apply-resource -> deploy-cluster -> exec-remote.
All parameters use defaults unless the user explicitly overrides them — do NOT ask for confirmation.
Only ask the user for parameters they haven't specified. Use defaults for everything else:
| Parameter | Default | Notes |
|---|---|---|
| PROJECT_ID | tpu-service-473302 | GCP project ID |
| CLUSTER_NAME | sglang-jax-agent-tests | GKE cluster name |
| TPU_TYPE | (must specify) | e.g. v6e-4, v6e-1 |
| NUM_SLICES | 1 | Default to 1 |
| ZONE | asia-northeast1-b | Must support the chosen TPU type |
Check prerequisites, then create the GKE cluster:
which xpk && which gcloud && which kubectl
xpk cluster create-pathways \
--cluster $CLUSTER_NAME \
--num-slices=$NUM_SLICES \
--tpu-type=$TPU_TYPE \
--zone=$ZONE \
--spot \
--project=$PROJECT_ID
Poll until the cluster status becomes RUNNING. Do NOT proceed to deploy SkyPilot while status is PROVISIONING or RECONCILING — it will fail with SSL errors.
gcloud container clusters list --project=$PROJECT_ID \
--filter="name=$CLUSTER_NAME" --format="table(name,location,status)"
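A sketch of the polling logic: only RUNNING is safe to proceed on, since deploying SkyPilot while the cluster is PROVISIONING or RECONCILING fails with SSL errors. The helper names are assumptions; wait_for_cluster wraps the gcloud query above and is defined but not invoked here (the 30-second interval is arbitrary):

```shell
# Hypothetical readiness predicate: only RUNNING counts as ready.
is_ready() { [ "$1" = "RUNNING" ]; }

# Hypothetical poller around the gcloud query above; requires gcloud and the
# PROJECT_ID/CLUSTER_NAME variables to be set, so it is not invoked here.
wait_for_cluster() {
  until is_ready "$(gcloud container clusters list --project="$PROJECT_ID" \
      --filter="name=$CLUSTER_NAME" --format="value(status)")"; do
    echo "cluster not ready; retrying in 30s"
    sleep 30
  done
}
```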
Run the deploy script for each required TPU type. Each call creates a separate SkyPilot cluster.
# Deploy each TPU type (must be sequential — config.yaml is global)
# Only tpu_type is required; cluster_name and zone use defaults
python <path-to-deploy-cluster>/scripts/deploy.py v6e-1
python <path-to-deploy-cluster>/scripts/deploy.py v6e-4
This creates:
- $CLUSTER_NAME-$USERNAME-v6e-1 — SkyPilot cluster for v6e-1 tests
- $CLUSTER_NAME-$USERNAME-v6e-4 — SkyPilot cluster for v6e-4 tests

After completion, verify:
sky status # Both clusters should show as UP
Determine num_nodes from the TPU type (v6e-N where total_chips = N, num_nodes = N / 4, minimum 1):
| TPU type | num_nodes |
|---|---|
| v6e-1 | 1 |
| v6e-4 | 1 |
| v6e-8 | 2 |
| v6e-16 | 4 |
| v6e-32 | 8 |
| v6e-64 | 16 |
| v6e-128 | 32 |
| v6e-256 | 64 |
For single-node types (v6e-1, v6e-4), omit --num-nodes. For multi-node types, add --num-nodes <N>.
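The num_nodes rule above (N / 4, minimum 1) can be sketched as a small function (the helper name is an assumption):

```shell
# Hypothetical helper: derive num_nodes from a v6e-N type string
# (num_nodes = N / 4, minimum 1, matching the table above).
num_nodes() {
  chips=${1#v6e-}      # strip the "v6e-" prefix to get the chip count
  n=$((chips / 4))
  [ "$n" -lt 1 ] && n=1
  echo "$n"
}
```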
# Single-node (v6e-1, v6e-4) — use per-TPU-type cluster name
sky exec $CLUSTER_NAME-$USERNAME-v6e-1 --workdir . \
"uv run --extra tpu python <PATH_TO_SCRIPT> [ARGS]"
# Multi-node (v6e-8+)
sky exec $CLUSTER_NAME-$USERNAME-v6e-8 --num-nodes 2 --workdir . \
"uv run --extra tpu python <PATH_TO_SCRIPT> [ARGS]"
# Parallel execution across multiple TPU types
sky exec $CLUSTER_NAME-$USERNAME-v6e-1 --workdir . "..." &
sky exec $CLUSTER_NAME-$USERNAME-v6e-4 --workdir . "..." &
wait
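The single- vs multi-node branching above can be sketched as a command builder; build_exec is a hypothetical helper that prints the sky exec invocation (rather than running it), adding --num-nodes only when the type needs more than one node:

```shell
# Hypothetical builder: print the sky exec command for a cluster, adding
# --num-nodes only for multi-node jobs.
build_exec() {
  cluster=$1
  nodes=$2
  cmd=$3
  if [ "$nodes" -gt 1 ]; then
    echo "sky exec $cluster --num-nodes $nodes --workdir . \"$cmd\""
  else
    echo "sky exec $cluster --workdir . \"$cmd\""
  fi
}
```

Pipe the output to sh (or copy it) to run the job.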
When the user requests teardown, remove both layers:
# 1. Remove SkyPilot clusters (one per TPU type)
sky down $CLUSTER_NAME-$USERNAME-v6e-1 -y
sky down $CLUSTER_NAME-$USERNAME-v6e-4 -y
# 2. Remove GKE cluster (only for Path A / GKE-based)
xpk cluster delete \
--cluster $CLUSTER_NAME \
--zone=$ZONE \
--project=$PROJECT_ID