Name: Exec Slurm Compile
Author: NVIDIA


# Basic usage — submits a SLURM job on a CPU partition to import the image
enroot-import --partition cpu_datamover --debug <docker_image_url>

| Flag | Description |
| --- | --- |
| `-p, --partition` | SLURM partition for the import job (use a CPU partition such as `cpu_datamover`) |
| `-d, --debug` | Enable debug output and preserve the SLURM log (recommended) |
| `-o, --output` | Custom output path for the `.sqsh` file |
| `-A, --account` | SLURM account (defaults to the user's first account) |
| `-t, --time` | Time limit for the import job (default: 1 hour) |
| `-n, --just-print` | Print the `sbatch` command without executing it |
| `-J, --job-name` | Custom job name |

1. Read the image tag from `jenkins/current_image_tags.properties` in the TRT-LLM repo.
2. Run `enroot-import` to submit the import job:
   ```
   cd <directory_where_sqsh_should_be_stored>
   <path_to>/enroot-import --partition cpu_datamover --debug <image_url>
   ```
   IMPORTANT: Replace the first `/` after the registry host with `#`, so that `urm.nvidia.com/sw-tensorrt-docker/tensorrt-llm:xxx` becomes `urm.nvidia.com#sw-tensorrt-docker/tensorrt-llm:xxx`. This avoids credential issues.
3. Wait for the import job to complete (`squeue -j <job_id>`).
4. The resulting `.sqsh` file is the `container_image` used in the compile step.
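The tag lookup and the `/` to `#` rewrite above can be sketched as two small shell helpers. This is a minimal sketch, not part of the skill's scripts: the function names `read_image_url` and `to_enroot_url` are hypothetical, and the key name `LLM_DOCKER_IMAGE` used in the usage note is an assumption about the properties file, so check the file for the key that matches your platform.

```shell
#!/usr/bin/env bash

# Print the value of KEY from a "KEY=value" properties file.
# (hypothetical helper, not shipped with the skill)
read_image_url() {
    local props_file="$1" key="$2"
    # sed prints only lines that start with "KEY=", with that prefix removed.
    sed -n "s/^${key}=//p" "$props_file" | head -n 1
}

# Rewrite registry/path:tag as registry#path:tag.
# (hypothetical helper, not shipped with the skill)
to_enroot_url() {
    local image="$1"
    case "$image" in
        *"#"*) printf '%s\n' "$image" ;;                                # already in enroot form
        */*)   printf '%s#%s\n' "${image%%/*}" "${image#*/}" ;;         # split at the first "/"
        *)     printf '%s\n' "$image" ;;                                # no registry prefix to rewrite
    esac
}

to_enroot_url "urm.nvidia.com/sw-tensorrt-docker/tensorrt-llm:xxx"
# prints: urm.nvidia.com#sw-tensorrt-docker/tensorrt-llm:xxx
```

The two helpers compose as `to_enroot_url "$(read_image_url <repo_dir>/jenkins/current_image_tags.properties LLM_DOCKER_IMAGE)"`, and the result is what gets passed to `enroot-import` as `<image_url>`.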

| Parameter | Description | Example |
| --- | --- | --- |
| `container_image` | Path to the `.sqsh` container image (see enroot import above) | `/path/to/pytorch.sqsh` |
| `repo_dir` | Path to the TensorRT-LLM repository | `/path/to/TensorRT-LLM` |
| `mount_dir` | Top-level directory to bind-mount into the container | `/shared/users` |
| `partition` | SLURM partition | `batch` |
| `account` | SLURM account | `my_account` |

| Parameter | Description | Default |
| --- | --- | --- |
| `jobname` | SLURM job name | `trtllm-compile.<username>` |
| `gpu_count` | Number of GPUs to request | `4` |
| `time_limit` | Job time limit | `02:00:00` |
| `arch` | GPU architecture(s) passed to the `-a` flag | `100-real` |
| `extra_build_args` | Extra flags for `build_wheel.py` | (none) |
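One way to apply these defaults is shell parameter expansion, which keeps any value the user has already supplied and fills in the rest. This is a sketch under the assumption that the parameters are held in shell variables of the same names; the skill itself does not prescribe how they are stored.

```shell
#!/usr/bin/env bash

# Fill in the documented default for every optional parameter left unset.
# ${var:-default} keeps an existing non-empty value and falls back otherwise.
jobname="${jobname:-trtllm-compile.${USER:-$(id -un)}}"
gpu_count="${gpu_count:-4}"
time_limit="${time_limit:-02:00:00}"
arch="${arch:-100-real}"
extra_build_args="${extra_build_args:-}"

printf 'jobname=%s gpu_count=%s time_limit=%s arch=%s\n' \
    "$jobname" "$gpu_count" "$time_limit" "$arch"
```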

| Script | Purpose |
| --- | --- |
| `enroot-import` | Pre-dump a Docker image to `.sqsh` via a SLURM batch job |
| `submit_compile.sh` | Template for submitting the SLURM job; copy and customize |
| `compile.slurm` | SLURM batch script that launches the container and calls `compile.sh` |
| `compile.sh` | Runs inside the container and executes `build_wheel.py` |

Read the Docker image tag from <repo_dir>/jenkins/current_image_tags.properties.

Use enroot-import to pre-dump it:

cd <directory_for_sqsh_files>
<scripts_dir>/enroot-import --partition cpu_datamover --debug <image_url>

Monitor the import job with squeue -j <job_id>.
Once complete, the .sqsh file path becomes the container_image parameter.

scripts_dir="<mount_dir>/<username>/workspace/tensorrt_llm_scripts"
mkdir -p "${scripts_dir}/log"
cp skills/exec-slurm-compile/scripts/compile.sh "${scripts_dir}/"
cp skills/exec-slurm-compile/scripts/compile.slurm "${scripts_dir}/"
chmod +x "${scripts_dir}/compile.sh" "${scripts_dir}/compile.slurm"

sbatch \
    --nodes=1 --ntasks=1 --ntasks-per-node=1 \
    --gres=gpu:<gpu_count> \
    --partition=<partition> \
    --account=<account> \
    --job-name=<jobname> \
    --time=<time_limit> \
    <scripts_dir>/compile.slurm \
    <container_image> <mount_dir> <scripts_dir> <repo_dir>
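The monitoring steps need the job ID that `sbatch` reports. The sketch below extracts it from either the default `Submitted batch job <id>` message or the output of `sbatch --parsable`, and derives the log path used later in this document. Both helper names (`parse_job_id`, `log_path_for`) are hypothetical and not part of the companion scripts.

```shell
#!/usr/bin/env bash

# Extract the job ID from sbatch output (hypothetical helper).
# Handles "Submitted batch job 12345" as well as the --parsable
# forms "12345" and "12345;cluster".
parse_job_id() {
    local output="$1"
    output="${output##* }"            # drop everything up to the last space
    printf '%s\n' "${output%%;*}"     # drop an optional ";cluster" suffix
}

# Build the srun log path for a job, following the naming used in this document.
log_path_for() {
    local scripts_dir="$1" job_id="$2"
    printf '%s/log/compile_%s.srun.log\n' "$scripts_dir" "$job_id"
}

parse_job_id "Submitted batch job 12345"
# prints: 12345
```

In a real submission the first argument would be the captured output of the `sbatch` command above, for example `job_id="$(parse_job_id "$(sbatch ... )")"`.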

# Check job status (repeat every 30-60 seconds)
squeue -j <job_id> -o "%.18i %.9P %.30j %.8u %.2t %.10M %.6D %R"

# Once running, periodically tail the log (do NOT use tail -f, use tail -30 instead)
tail -30 <scripts_dir>/log/compile_<job_id>.srun.log
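The polling routine described by these two commands can be sketched as a loop that reads the compact job state from `squeue` and decides what to do next. This is an illustrative sketch only: the helper names, the action labels, and the 45-second interval (picked from the 30-60 second range above) are assumptions, and `monitor_job` is defined but not called here because it needs a live SLURM cluster.

```shell
#!/usr/bin/env bash

# Map a compact squeue state code to the next monitoring action (hypothetical helper).
next_action() {
    case "$1" in
        PD)   echo "wait" ;;          # pending: nothing to read yet
        R|CG) echo "tail-log" ;;      # running or completing: show recent log lines
        *)    echo "check-sacct" ;;   # empty or other: job left the queue, verify with sacct
    esac
}

# Poll a job until it leaves the queue (requires squeue; not invoked in this sketch).
monitor_job() {
    local job_id="$1" log_file="$2" state
    while true; do
        # -h suppresses the header, %t prints the compact state code.
        state="$(squeue -h -j "$job_id" -o '%t' 2>/dev/null)"
        case "$(next_action "$state")" in
            wait)        ;;
            tail-log)    [ -f "$log_file" ] && tail -30 "$log_file" ;;
            check-sacct) break ;;
        esac
        sleep 45
    done
}

next_action PD
# prints: wait
```

Using `tail -30` inside the loop rather than `tail -f` keeps each check bounded, which matches the guidance above.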

# Check SLURM exit code
sacct -j <job_id> --format=JobID,State,ExitCode,Elapsed

# Check the build log for errors
tail -50 <scripts_dir>/log/compile_<job_id>.srun.log
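To turn the `sacct` check into a pass/fail decision, the state and exit code can be requested in a pipe-delimited form and compared against the values SLURM reports for a clean run (`COMPLETED` and `0:0`). The helper name `job_succeeded` is hypothetical; the sketch assumes the line comes from `sacct -j <job_id> -X -n -P --format=State,ExitCode`.

```shell
#!/usr/bin/env bash

# Return success only for a "State|ExitCode" line that reports a clean run.
# (hypothetical helper, not part of the companion scripts)
job_succeeded() {
    local line="$1"
    local state="${line%%|*}"       # text before the first "|"
    local exit_code="${line#*|}"    # text after the first "|"
    [ "$state" = "COMPLETED" ] && [ "$exit_code" = "0:0" ]
}

if job_succeeded "COMPLETED|0:0"; then
    echo "build job succeeded"
fi
# prints: build job succeeded
```

A job that passes this check can still contain build warnings, so the log tail above remains worth reading.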

| Flag | Description |
| --- | --- |
| `--trt_root /usr/local/tensorrt` | TensorRT installation path (standard in NVIDIA containers) |
| `--benchmarks` | Build the C++ benchmarks |
| `-a "100-real"` | Target architecture: `100` for Blackwell, `90` for Hopper, etc. |
| `--nvtx` | Enable NVTX markers for profiling |
| `--no-venv` | Skip virtual environment creation |
| `-ccache` | Use ccache to speed up recompilation |
| `--skip_building_wheel` | Build in-place without creating a wheel file |
| `-f` | Fast build: skip some kernels for faster dev compilation |
| `-c` | Clean build: wipe the build directory before building |
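As an illustration of how these flags combine, the sketch below assembles a `build_wheel.py` command line from an architecture and any extra flags, then prints it instead of running it. It is not the contents of `compile.sh`: the helper name `build_wheel_cmd`, the relative script path `scripts/build_wheel.py`, and the choice of `--trt_root` and `-a` as the always-present flags are assumptions made for the example.

```shell
#!/usr/bin/env bash

# Print a build_wheel.py command line for the given architecture and extra flags.
# (hypothetical helper; the real compile.sh may pass a different set of flags)
build_wheel_cmd() {
    local arch="${1:-100-real}"
    if [ "$#" -gt 0 ]; then shift; fi
    # Rebuild the positional parameters as the full command, then join with spaces.
    set -- python3 scripts/build_wheel.py --trt_root /usr/local/tensorrt -a "$arch" "$@"
    printf '%s\n' "$*"
}

build_wheel_cmd 90-real --nvtx -c
# prints: python3 scripts/build_wheel.py --trt_root /usr/local/tensorrt -a 90-real --nvtx -c
```

Printing the command first makes it easy to review the flag set before committing GPU time to a build.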

| Issue | Solution |
| --- | --- |
| `sbatch: error: invalid partition` | Verify the partition name with `sinfo -s` |
| `sbatch: error: invalid account` | Check available accounts with `sacctmgr show assoc user=$USER` |
| Container image not found | Verify the `.sqsh` path exists and is readable |
| Build fails with missing TensorRT | Ensure `--trt_root` points to the correct path inside the container |
| Build OOM (out of memory) | Reduce parallelism by passing `-j <N>` to `build_wheel.py` |
| `srun: error: Unable to create step` | The node may lack enroot/pyxis; check with the cluster admin |
| Job stuck in `PD` state | Check `squeue -j <id> -o %R` for the reason (e.g., resource limits, priority) |
| `enroot import` fails with auth error | Check that `~/.config/enroot/.credentials` has the correct registry credentials |
| `enroot import` produces empty/corrupt `.sqsh` | Re-run with `--debug` and check the SLURM log; verify the image URL has no `https://` prefix |
| Weird compile issues | Retry with a clean build (`-c` flag) |
| `QOSGrpNodeLimit` shown in `NODELIST(REASON)` | Not a blocker; wait for the job to be scheduled |
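Two of the failure modes above (a missing or empty `.sqsh` file, and an image URL with an `https://` prefix) can be caught before any job is submitted. The checks below are a minimal sketch with hypothetical helper names; they only cover what can be verified locally and do not detect a corrupt but non-empty image.

```shell
#!/usr/bin/env bash

# Reject image URLs that carry a scheme prefix (hypothetical helper).
check_image_url() {
    case "$1" in
        http://*|https://*)
            echo "image URL must not start with a scheme: $1" >&2
            return 1 ;;
    esac
    return 0
}

# Require the .sqsh file to exist, be readable, and be non-empty (hypothetical helper).
check_sqsh() {
    local path="$1"
    [ -r "$path" ] || { echo "not found or unreadable: $path" >&2; return 1; }
    [ -s "$path" ] || { echo "file is empty: $path" >&2; return 1; }
}

check_image_url "urm.nvidia.com#sw-tensorrt-docker/tensorrt-llm:xxx" && echo "image URL looks fine"
# prints: image URL looks fine
```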

| Scenario | Use This Skill? |
| --- | --- |
| User wants to compile TRT-LLM on a SLURM cluster | Yes |
| User is already on a compute node and wants to compile | No; use the `exec-local-compile` skill instead |

Exec Slurm Compile

Compile TensorRT-LLM on SLURM Cluster

When to Use

Finding the Docker Image

Pre-dumping the Container Image (enroot import)

enroot-import flags

enroot-import workflow

Prerequisites

Companion Scripts

Instructions

Step 0: Resolve the Container Image (if needed)

Step 1: Gather Information

Step 2: Prepare the Scripts Directory

Step 3: Submit the Job

Step 4: Monitor the Job (Proactive — Do NOT Wait for User)

Step 5: Verify the Build

Common Build Flags Reference

Troubleshooting

Example Interaction
