Compile TensorRT-LLM on a SLURM cluster. Covers submitting a batch job with a container image, monitoring the job, and verifying the build. Use when the user wants to compile TRT-LLM remotely via SLURM rather than on a local compute node.
Submit, monitor, and verify a TensorRT-LLM compilation job on a SLURM cluster using enroot containers.
| Scenario | Use This Skill? |
|---|---|
| User wants to compile TRT-LLM on a SLURM cluster | Yes |
| User is already on a compute node and wants to compile | No — use exec-local-compile skill instead |
The official Docker image tag for a given TensorRT-LLM version is recorded in the repo itself:
<repo_dir>/jenkins/current_image_tags.properties
Read this file to find the current image URL (e.g., urm.nvidia.com/sw-tensorrt-docker/tensorrt-llm:pytorch-25.12-py3-aarch64-ubuntu24.04-trt10.14.1.48-skip-tritondevel-202602011118-10901).
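Reading the URL out of the properties file can be scripted. The following is a minimal sketch that assumes the file uses plain `KEY=value` lines; the helper name `read_image_url` and the key `LLM_DOCKER_IMAGE` are illustrative assumptions, so check the file for the actual key names before relying on it.

```bash
# Hypothetical helper: print the value of one key from a KEY=value properties file.
# Assumes one "KEY=value" pair per line with no spaces around "=".
read_image_url() {
    props_file=$1
    key=$2
    # Keep the first line that starts with "<key>=", then drop everything up to the first "=".
    grep -m1 "^${key}=" "$props_file" | cut -d= -f2-
}

# Example call (the key name is an assumption):
# image_url=$(read_image_url "<repo_dir>/jenkins/current_image_tags.properties" LLM_DOCKER_IMAGE)
```

Anchoring the pattern at the start of the line keeps a key such as `LLM_DOCKER_IMAGE` from also matching a longer key that merely ends with the same text.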
SLURM clusters using enroot/pyxis require a .sqsh container image. To avoid download overhead at compile time, pre-dump the image in advance using the enroot-import companion script:
# Basic usage — submits a SLURM job on a CPU partition to import the image
enroot-import --partition cpu_datamover --debug <docker_image_url>
The script submits an sbatch job that runs enroot import docker://<image_url> and produces a .sqsh file in the current directory. The output on stdout is the SLURM job ID.
| Flag | Description |
|---|---|
| `-p`, `--partition` | SLURM partition for the import job (use a CPU partition like `cpu_datamover`) |
| `-d`, `--debug` | Enable debug output and preserve the SLURM log (recommended) |
| `-o`, `--output` | Custom output path for the `.sqsh` file |
| `-A`, `--account` | SLURM account (defaults to the user's first account) |
| `-t`, `--time` | Time limit for the import job (default: 1 hour) |
| `-n`, `--just-print` | Print the `sbatch` command without executing it |
| `-J`, `--job-name` | Custom job name |
1. Read the image URL from jenkins/current_image_tags.properties in the TRT-LLM repo.
2. Run enroot-import to submit the import job:
cd <directory_where_sqsh_should_be_stored>
<path_to>/enroot-import --partition cpu_datamover --debug <image_url>
IMPORTANT: Convert urm.nvidia.com/sw-tensorrt-docker/tensorrt-llm:xxx to urm.nvidia.com#sw-tensorrt-docker/tensorrt-llm:xxx (replace the first / after the registry host with #) to avoid credential issues.
3. Monitor the job until it completes (squeue -j <job_id>).
4. The resulting .sqsh file is the container_image used in the compile step.

The user must provide (or you must ask for) these values:
| Parameter | Description | Example |
|---|---|---|
| `container_image` | Path to `.sqsh` container image (see enroot import above) | `/path/to/pytorch.sqsh` |
| `repo_dir` | Path to the TensorRT-LLM repository | `/path/to/TensorRT-LLM` |
| `mount_dir` | Top-level directory to bind-mount into the container | `/shared/users` |
| `partition` | SLURM partition | `batch` |
| `account` | SLURM account | `my_account` |
Optional parameters:
| Parameter | Description | Default |
|---|---|---|
| `jobname` | SLURM job name | `trtllm-compile.<username>` |
| `gpu_count` | Number of GPUs to request | `4` |
| `time_limit` | Job time limit | `02:00:00` |
| `arch` | GPU architecture(s) for the `-a` flag | `100-real` |
| `extra_build_args` | Extra flags for `build_wheel.py` | (none) |
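One way to apply these defaults in a wrapper script is shell parameter expansion. This is a sketch only: it assumes the parameters are held in shell variables with the same names as the table rows, which the companion scripts may or may not do.

```bash
# Fill in each optional parameter with its documented default when unset or empty.
# Variable names mirror the table; whether the real scripts use them is an assumption.
jobname="${jobname:-trtllm-compile.$(id -un)}"
gpu_count="${gpu_count:-4}"
time_limit="${time_limit:-02:00:00}"
arch="${arch:-100-real}"
extra_build_args="${extra_build_args:-}"
```

Values the user has already exported are left untouched, so the same snippet works for both first-time and repeat runs.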
This skill includes four companion scripts in scripts/:
| Script | Purpose |
|---|---|
| `enroot-import` | Pre-dump a Docker image to `.sqsh` via a SLURM batch job |
| `submit_compile.sh` | Template for submitting the SLURM job; copy and customize |
| `compile.slurm` | SLURM batch script; launches the container and calls `compile.sh` |
| `compile.sh` | Runs inside the container; executes `build_wheel.py` |
Scripts directory: skills/exec-slurm-compile/scripts/
Follow these steps in order:
If the user does not already have a .sqsh container image:
1. Read the image URL from <repo_dir>/jenkins/current_image_tags.properties.
2. Run enroot-import to pre-dump it:
cd <directory_for_sqsh_files>
<scripts_dir>/enroot-import --partition cpu_datamover --debug <image_url>
3. Wait for the import job to finish: squeue -j <job_id>.
4. The resulting .sqsh file path becomes the container_image parameter.

If the user already has a .sqsh file, skip this step.
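The URL conversion required for enroot-import (the first `/` after the registry host becomes `#`) is easy to get wrong by hand. A minimal sketch of a helper that does it follows; the function name `to_enroot_url` is hypothetical, and stripping a leading `https://` is an added convenience based on the troubleshooting note about that prefix.

```bash
# Hypothetical helper: rewrite "registry/path:tag" as "registry#path:tag".
to_enroot_url() {
    # Drop a leading "https://" if present, then replace only the first "/" with "#".
    printf '%s\n' "$1" | sed -e 's|^https://||' -e 's|/|#|'
}

# Example call:
# to_enroot_url "urm.nvidia.com/sw-tensorrt-docker/tensorrt-llm:xxx"
```

Because the `sed` substitution has no `g` flag, later slashes in the repository path are left alone.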
Ask the user for any missing prerequisite values listed above. At minimum you need:
- container_image (or the Docker image URL, in which case run Step 0 first)
- repo_dir
- mount_dir
- partition and account

If the user has used this workflow before, check whether previous values are stored in memory files.
The compile scripts must be accessible from inside the container (i.e., under mount_dir). Either:
Option A — Copy companion scripts to a location under mount_dir:
scripts_dir=<mount_dir>/<username>/workspace/tensorrt_llm_scripts
mkdir -p "${scripts_dir}/log"
cp skills/exec-slurm-compile/scripts/compile.sh "${scripts_dir}/"
cp skills/exec-slurm-compile/scripts/compile.slurm "${scripts_dir}/"
chmod +x "${scripts_dir}/compile.sh" "${scripts_dir}/compile.slurm"
Option B — If the user already has scripts at a known location, use those directly.
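Whichever option is used, it can help to confirm that the scripts directory really sits under the mounted directory before submitting. The check below is a sketch with a hypothetical function name, `is_under`; it compares paths as text only and does not resolve symlinks or `..` components.

```bash
# Hypothetical helper: succeed if path $1 is equal to, or nested under, directory $2.
# Purely textual comparison; symlinks and relative components are not resolved.
is_under() {
    child=${1%/}
    parent=${2%/}
    case "$child/" in
        "$parent"/*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example call:
# is_under "$scripts_dir" "$mount_dir" || echo "scripts_dir is not visible inside the container"
```

Appending a trailing slash before matching prevents `/shared/users2` from being treated as a child of `/shared/users`.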
Run sbatch from the login node (or a node with SLURM client access):
sbatch \
--nodes=1 --ntasks=1 --ntasks-per-node=1 \
--gres=gpu:<gpu_count> \
--partition=<partition> \
--account=<account> \
--job-name=<jobname> \
--time=<time_limit> \
<scripts_dir>/compile.slurm \
<container_image> <mount_dir> <scripts_dir> <repo_dir>
Capture and report the job ID from the sbatch output.
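By default `sbatch` prints a line of the form `Submitted batch job <id>`. A small sketch of extracting the numeric ID from that line is shown here; the function name `parse_job_id` is hypothetical.

```bash
# Hypothetical helper: extract the job ID from sbatch's default output line.
# Prints nothing if the text does not look like "Submitted batch job <id>".
parse_job_id() {
    printf '%s\n' "$1" | sed -n 's/^Submitted batch job \([0-9][0-9]*\).*$/\1/p'
}

# Example call:
# job_id=$(parse_job_id "$(sbatch ... <scripts_dir>/compile.slurm ...)")
```

As an alternative, `sbatch --parsable` prints the job ID without the surrounding text (followed by `;<cluster>` on multi-cluster setups), which avoids the parsing step.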
You MUST actively poll the job until it completes. Do not submit and walk away.
# Check job status (repeat every 30-60 seconds)
squeue -j <job_id> -o "%.18i %.9P %.30j %.8u %.2t %.10M %.6D %R"
# Once running, periodically tail the log (do NOT use tail -f, use tail -30 instead)
tail -30 <scripts_dir>/log/compile_<job_id>.srun.log
Monitoring loop:
1. Run squeue -j <job_id> to check state
2. If PD (pending): report the reason, keep polling every 30-60s
3. If R (running): tail the build log every 30-60s; look for [XX%] Building, errors, or completion
4. If the job disappears from squeue, it has finished; proceed to Step 5
5. If F (failed): immediately read the full log and report the error

Progress indicators to look for in the log:
- [XX%] Building CXX object...: compilation progress
- Linking CXX...: link phase
- FAILED:, error:, fatal error:: build failure
- Successfully built: success

Once the job completes, check for success:
# Check SLURM exit code
sacct -j <job_id> --format=JobID,State,ExitCode,Elapsed
# Check the build log for errors
tail -50 <scripts_dir>/log/compile_<job_id>.srun.log
A successful build ends with a message like Successfully built tensorrt_llm or completes without error.
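The log indicators named in the monitoring step can also be checked mechanically. The sketch below uses a hypothetical function, `classify_log`, that greps for those strings; it is a heuristic, since a line containing `error:` inside a warning or a test name would also be counted as a failure.

```bash
# Hypothetical helper: classify a build log as failure, success, or in-progress.
# Failure markers win over the success marker so a late error is not masked.
classify_log() {
    log_file=$1
    if grep -q -e 'FAILED:' -e 'error:' -e 'fatal error:' "$log_file"; then
        echo failure
    elif grep -q 'Successfully built' "$log_file"; then
        echo success
    else
        echo in-progress
    fi
}

# Example call:
# classify_log "<scripts_dir>/log/compile_<job_id>.srun.log"
```

Treat the result as a hint and still confirm with the `sacct` state and exit code.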
| Flag | Description |
|---|---|
| `--trt_root /usr/local/tensorrt` | TensorRT installation path (standard in NVIDIA containers) |
| `--benchmarks` | Build the C++ benchmarks |
| `-a "100-real"` | Target architecture: 100 for Blackwell, 90 for Hopper, etc. |
| `--nvtx` | Enable NVTX markers for profiling |
| `--no-venv` | Skip virtual environment creation |
| `-ccache` | Use ccache to speed up recompilation |
| `--skip_building_wheel` | Build in-place without creating a wheel file |
| `-f` | Fast build: skip some kernels for faster dev compilation |
| `-c` | Clean build: wipe the build directory before building |
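To illustrate how these flags combine, the sketch below assembles a `build_wheel.py` command line from an architecture value and any extra flags. The function name `build_wheel_cmd` is hypothetical, the `./scripts/build_wheel.py` path assumes the command runs from the repository root, and the command that `compile.sh` actually runs may differ.

```bash
# Hypothetical helper: print (not run) a build_wheel.py command line.
# $1 is the architecture value; any remaining arguments are passed through as extra flags.
build_wheel_cmd() {
    arch=$1
    shift
    printf 'python3 ./scripts/build_wheel.py --trt_root /usr/local/tensorrt -a "%s"' "$arch"
    if [ "$#" -gt 0 ]; then
        printf ' %s' "$@"
    fi
    printf '\n'
}

# Example call:
# build_wheel_cmd 100-real --nvtx -f
```

Printing the command first, rather than executing it, makes it easy to review the flag set before spending cluster time on a build.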
Common architecture values:
- "100-real": Blackwell (B200, GB200)
- "90-real": Hopper (H100, H200)
- "89-real": Ada Lovelace (L40S)
- "80-real": Ampere (A100)
- "90;100-real": Multiple architectures

| Issue | Solution |
|---|---|
| `sbatch: error: invalid partition` | Verify the partition name with `sinfo -s` |
| `sbatch: error: invalid account` | Check available accounts with `sacctmgr show assoc user=$USER` |
| Container image not found | Verify the `.sqsh` path exists and is readable |
| Build fails with missing TensorRT | Ensure `--trt_root` points to the correct path inside the container |
| Build OOM (out of memory) | Reduce parallelism with the `-j <N>` flag to `build_wheel.py` |
| `srun: error: Unable to create step` | The node may lack enroot/pyxis; check with the cluster admin |
| Job stuck in `PD` state | Check `squeue -j <id> -o %R` for the reason (e.g., resource limits, priority) |
| `enroot import` fails with auth error | Check that `~/.config/enroot/.credentials` has the correct registry credentials |
| `enroot import` produces empty/corrupt `.sqsh` | Re-run with `--debug` and check the SLURM log; verify the image URL has no `https://` prefix |
| Weird compile issues | Retry with a clean build (`-c` flag) |
| `QOSGrpNodeLimit` shown in NODELIST(REASON) | Not a blocker; wait for the job to get scheduled |
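Two of the issues above (image not found, empty `.sqsh`) can be caught before submitting anything. The following sketch uses a hypothetical function, `check_sqsh`, that only tests readability and non-zero size; it does not validate the squashfs contents.

```bash
# Hypothetical pre-flight check for the container image path.
# Verifies the file is readable and non-empty; it does not inspect the squashfs itself.
check_sqsh() {
    image=$1
    if [ ! -r "$image" ]; then
        echo "missing or unreadable: $image"
        return 1
    fi
    if [ ! -s "$image" ]; then
        echo "empty file: $image"
        return 1
    fi
    echo "ok: $image"
}

# Example call:
# check_sqsh "<container_image>" || exit 1
```

Running this on the login node before `sbatch` avoids waiting in the queue only to fail at container start.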
User: "Compile TRT-LLM on the OCI cluster"
Agent actions:
1. Submit the compile job with sbatch
2. Poll squeue until complete