Build the Slurm compute node container image via CodeBuild or local Docker, generate SSH keys, and render Helm values using setup.sh
Use this skill to build the Slurmd DLC (Deep Learning Container) image, generate
SSH keys for login node access, and render the slurm-values.yaml Helm values
file from the template. This is Phase 2 of the slinky-slurm deployment
workflow.
The setup.sh script handles three image build paths:
| Mode | Flag | When to Use |
|---|---|---|
| CodeBuild (default) | (none) | Production builds in AWS (creates a CodeBuild project to build the image) |
| Local Docker | --local-build | Development/testing on a local machine with Docker |
| Skip Build | --skip-build | Image already exists in ECR from a prior build |
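
The mode table above can be read as a simple flag-to-mode mapping. The following is a minimal sketch of that mapping, not the parsing code in setup.sh; the function name `select_build_mode` and the mode strings are hypothetical. How setup.sh behaves when both flags are passed is not stated here, so the sketch simply lets the last flag win.

```shell
# Hypothetical sketch: map setup.sh-style flags to a build mode.
# Not the actual argument parsing in setup.sh.
select_build_mode() {
  mode="codebuild"              # default when neither flag is given
  for arg in "$@"; do
    case "$arg" in
      --local-build) mode="local" ;;
      --skip-build)  mode="skip" ;;
    esac
  done
  printf '%s\n' "$mode"
}

select_build_mode --instance-type ml.g5.8xlarge --infra cfn --local-build
```

Unrelated arguments such as `--instance-type` fall through the `case` untouched, so the same argument list can be passed straight in.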
After the image build, setup.sh always:

- Generates an SSH key pair (~/.ssh/id_ed25519_slurm) if not present
- Renders slurm-values.yaml from slurm-values.yaml.template with profile-specific values (GPU count, EFA count, GRES, replicas, etc.)

Prerequisites:

- --instance-type and --infra flags decided
- env_vars.sh available (from deploy.sh) for AWS_ACCOUNT_ID, AWS_REGION
- zip command available
- jq (for CFN) or terraform (for TF)
- Access to the DLC ECR registry (763104351884.dkr.ecr.us-east-1.amazonaws.com)

See the deployment-preflight skill for full prerequisite validation.
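
The tool prerequisites above (zip, and jq or terraform) can be checked before running setup.sh. This is a small sketch, assuming a hypothetical helper named `missing_commands`; it is not part of setup.sh or the deployment-preflight skill, and the inclusion of `aws` in the example is an assumption based on the AWS CLI commands used elsewhere in this document.

```shell
# Hypothetical helper: print each named command that is not on PATH.
missing_commands() {
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
  done
}

# Example: tools for a CodeBuild run with --infra cfn (assumed list).
missing=$(missing_commands aws zip jq)
if [ -n "$missing" ]; then
  echo "missing tools: $missing" >&2
fi
```

For `--infra tf`, the same call would name `terraform` instead of `jq`.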
CodeBuild (default) -- recommended for production:
# Build via CodeBuild for g5 instances using CloudFormation
bash setup.sh --instance-type ml.g5.8xlarge --infra cfn
# Build via CodeBuild for p5 instances using Terraform
bash setup.sh --instance-type ml.p5.48xlarge --instance-count 2 --infra tf
Local Docker -- for development/testing:
# Build locally for g5 instances
bash setup.sh --instance-type ml.g5.8xlarge --infra cfn --local-build
# Build locally for p5 instances
bash setup.sh --instance-type ml.p5.48xlarge --instance-count 2 --infra tf --local-build
Skip Build -- image already in ECR:
# Use existing ECR image
bash setup.sh --instance-type ml.g5.8xlarge --infra cfn --skip-build
# With custom repo name and tag
bash setup.sh --instance-type ml.g5.8xlarge --infra cfn --skip-build \
--repo-name my-slurmd --tag v1.0
After setup.sh completes:
# Verify slurm-values.yaml was generated
head -20 slurm-values.yaml
# Check for unresolved template variables (should find none)
grep '${' slurm-values.yaml
# Expected: no output (all variables substituted)
# Verify SSH key exists
ls -la ~/.ssh/id_ed25519_slurm*
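
Initialization:

- Sources lib/deploy_helpers.sh and calls resolve_helm_profile()
- Queries the caller identity with aws sts get-caller-identity
- Sources env_vars.sh if available (for AWS_ACCOUNT_ID, AWS_REGION)

CodeBuild (default):

- Creates the S3 bucket dlc-slurmd-codebuild-<account_id>-<region>
- Packages dlc-slurmd.Dockerfile + buildspec.yml into a zip file
- Uploads the zip to s3://<bucket>/codebuild/slurmd-build-context.zip
- CFN: aws cloudformation create-stack with codebuild-stack.yaml
- TF: terraform apply with codebuild.tf
- Starts the build with aws codebuild start-build
- Pushes the image to <account>.dkr.ecr.<region>.amazonaws.com/dlc-slurmd:<tag>

Local Docker (--local-build):

Logs in to the DLC ECR registry in us-east-1: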
aws ecr get-login-password --region us-east-1 | \
docker login --username AWS \
--password-stdin 763104351884.dkr.ecr.us-east-1.amazonaws.com
The image is built with one of:

- docker buildx build --platform linux/amd64
- docker build

Skip Build (--skip-build):

Checks that the image already exists in ECR:
aws ecr describe-images \
--repository-name dlc-slurmd \
--image-ids imageTag=25.11.1-ubuntu24.04
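
The existence check above can be wrapped so that a script branches on its exit status. The sketch below assumes a hypothetical function name, `image_in_ecr`; it uses only the `aws ecr describe-images` options already shown in this document plus `--region`, and relies on the AWS CLI returning a non-zero status when the image or repository is not found.

```shell
# Hypothetical wrapper around the describe-images check; not part of setup.sh.
image_in_ecr() {
  # $1 = repository name, $2 = image tag, $3 = region
  aws ecr describe-images \
    --repository-name "$1" \
    --image-ids "imageTag=$2" \
    --region "$3" >/dev/null 2>&1
}

if image_in_ecr dlc-slurmd 25.11.1-ubuntu24.04 "${AWS_REGION:-us-west-2}"; then
  echo "image found"
else
  echo "image not found: check repo name, tag, and region" >&2
fi
```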
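
SSH key generation:

- Checks for ~/.ssh/id_ed25519_slurm
- Generates the key pair with ssh-keygen if it is missing
- Adds the public key to slurm-values.yaml for login node access

Helm values rendering:

Calls resolve_helm_profile(), which sets these variables based on --instance-type: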
| Variable | g5 Value | p5 Value |
|---|---|---|
| HELM_ACCEL_INSTANCE_TYPE | ml.g5.8xlarge | ml.p5.48xlarge |
| GPU_COUNT | 1 | 8 |
| EFA_COUNT | 1 | 32 |
| GPU_GRES | gpu:a10g:1 | gpu:h100:8 |
| REPLICAS | 4 | 2 |
| MGMT_INSTANCE_TYPE | ml.m5.4xlarge | ml.m5.4xlarge |
| PVC_NAME | fsx-claim | fsx-claim |
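
The table above amounts to a lookup keyed on the instance type. The following is a sketch of such a lookup using only the two profiles in the table; the function name `sketch_resolve_profile` is hypothetical, and the real resolve_helm_profile() in lib/deploy_helpers.sh may support more instance types and behave differently.

```shell
# Hypothetical sketch of an instance-type-to-profile lookup.
# Values are copied from the g5 and p5 columns of the table.
sketch_resolve_profile() {
  case "$1" in
    ml.g5.8xlarge)
      GPU_COUNT=1; EFA_COUNT=1;  GPU_GRES="gpu:a10g:1"; REPLICAS=4 ;;
    ml.p5.48xlarge)
      GPU_COUNT=8; EFA_COUNT=32; GPU_GRES="gpu:h100:8"; REPLICAS=2 ;;
    *)
      echo "unsupported instance type: $1" >&2
      return 1 ;;
  esac
  # These values are the same for both profiles in the table.
  HELM_ACCEL_INSTANCE_TYPE="$1"
  MGMT_INSTANCE_TYPE="ml.m5.4xlarge"
  PVC_NAME="fsx-claim"
}

sketch_resolve_profile ml.p5.48xlarge && echo "$GPU_GRES x $REPLICAS replicas"
```

Returning non-zero for an unknown type is a design choice of this sketch: a caller can stop before rendering instead of leaving placeholders unresolved.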
setup.sh then uses sed to substitute 10 template variables in slurm-values.yaml.template:

- ${image_repository}, ${image_tag}, ${ssh_key}
- ${mgmt_instance_type}, ${accel_instance_type}
- ${gpu_count}, ${efa_count}, ${gpu_gres}
- ${replicas}, ${pvc_name}

Usage: setup.sh --instance-type <ml.X.Y> --infra <cfn|tf> [OPTIONS]
Required:
--instance-type <type> SageMaker instance type for GPU/EFA/GRES resolution
--infra <cfn|tf> Infrastructure method for CodeBuild stack
Optional:
--instance-count <N> Number of compute node replicas (default: varies by instance type)
--repo-name <name> ECR repository name (default: dlc-slurmd)
--tag <tag> Image tag (default: 25.11.1-ubuntu24.04)
--region <region> AWS region (default: AWS CLI configured or us-west-2)
--local-build Build image locally instead of CodeBuild
--skip-build Skip image build (use existing image in ECR)
--help Show help
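
The sed-based substitution of `${...}` placeholders described earlier can be illustrated on a toy template. This is a sketch only: the function name `render_values` and the demo file names are hypothetical, just four of the ten variables are shown, and the actual sed invocation in setup.sh may be written differently.

```shell
# Hypothetical sketch of sed-based placeholder rendering; not the setup.sh code.
render_values() {
  # $1 = template file, $2 = output file.
  # "|" is the sed delimiter so that values containing "/" need no escaping.
  # \$ keeps the dollar sign literal, so sed matches the text "${gpu_count}".
  sed \
    -e "s|\${gpu_count}|${GPU_COUNT}|g" \
    -e "s|\${efa_count}|${EFA_COUNT}|g" \
    -e "s|\${gpu_gres}|${GPU_GRES}|g" \
    -e "s|\${replicas}|${REPLICAS}|g" \
    "$1" > "$2"
}

# Demo with the g5 profile values.
GPU_COUNT=1 EFA_COUNT=1 GPU_GRES="gpu:a10g:1" REPLICAS=4
workdir=$(mktemp -d)
printf '%s\n' 'replicas: ${replicas}' 'gres: ${gpu_gres}' \
  'gpus: ${gpu_count}' 'efa: ${efa_count}' > "$workdir/demo.template"
render_values "$workdir/demo.template" "$workdir/demo.yaml"
cat "$workdir/demo.yaml"
```

A value that itself contains `|` or `&` would need escaping before being passed to sed; the profile values in this document contain neither.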
Build is successful when:

- setup.sh exits with code 0
- slurm-values.yaml exists in the project directory
- No unresolved ${...} variables remain in slurm-values.yaml
- The SSH key exists at ~/.ssh/id_ed25519_slurm
- The image exists in ECR (verify with aws ecr describe-images)

# Quick verification
ls -la slurm-values.yaml
grep -c '${' slurm-values.yaml # Prints 0 and exits with status 1 when nothing is unresolved
ls ~/.ssh/id_ed25519_slurm
AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
aws ecr describe-images \
--repository-name dlc-slurmd \
--image-ids imageTag=25.11.1-ubuntu24.04 \
--region "${AWS_REGION:-us-west-2}"
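
The local success criteria (values file present, no leftover placeholders, key present) can also be folded into one pass/fail check. The helper below is a sketch with a hypothetical name, `verify_outputs`; it does not cover the setup.sh exit code or the ECR check, which need the commands shown above.

```shell
# Hypothetical helper that applies the local success criteria to given paths.
verify_outputs() {
  # $1 = rendered values file, $2 = private key path
  [ -f "$1" ] || { echo "missing: $1" >&2; return 1; }
  # grep -F treats '${' as a fixed string; a match means a placeholder survived.
  if grep -qF '${' "$1"; then
    echo "unresolved template variables in $1" >&2
    return 1
  fi
  [ -f "$2" ] || { echo "missing: $2" >&2; return 1; }
}

verify_outputs slurm-values.yaml "$HOME/.ssh/id_ed25519_slurm" \
  && echo "local checks passed"
```

Unlike `grep -c`, which exits with status 1 when the count is 0, this helper returns 0 on success, so it composes directly with `&&` and `if`.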
| Symptom | Cause | Fix |
|---|---|---|
| CodeBuild FAILED status | Dockerfile build error or DLC base image pull failure | Check CodeBuild logs: aws codebuild batch-get-builds --ids <build-id> |
| Docker login fails for DLC ECR | Region mismatch | DLC registry is always in us-east-1, not the deployment region |
| docker buildx fails on macOS | Docker Desktop not running or buildx not enabled | Start Docker Desktop; ensure buildx is available: docker buildx version |
| ECR image not found (--skip-build) | Wrong repo name, tag, or region | Verify with aws ecr describe-images --repository-name <name> |
| Template variables not substituted | resolve_helm_profile failed | Check that --instance-type is a valid instance type |
| S3 bucket creation fails | Bucket name already taken or insufficient IAM permissions | The bucket name includes the account ID and region, so a name collision is unlikely; check IAM permissions first |
| CodeBuild stack already exists | Prior run created it | Script handles this gracefully (skips creation) |

- setup.sh -- Main image build and values generation script
- lib/deploy_helpers.sh -- resolve_helm_profile() function (lines 43-70)
- dlc-slurmd.Dockerfile -- Multi-stage Dockerfile for Slurm compute node
- slurm-values.yaml.template -- Helm values template with 10 variables
- buildspec.yml -- CodeBuild build specification
- codebuild-stack.yaml -- CloudFormation template for CodeBuild project
- codebuild.tf -- Terraform config for CodeBuild project