Running jobs on the Janelia cluster
From outside the cluster (ssh wrapper)
ssh login1 'bash -l -c "bjobs ..."'
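The wrapper above can be made reusable with a small shell function. This is a sketch, not a supported tool: the `lsf` name and the `DRY_RUN` switch are inventions for illustration, and the naive `$*` quoting breaks on arguments containing spaces.

```shell
# Sketch of a wrapper for running LSF commands from outside the cluster.
# DRY_RUN=1 prints the remote invocation instead of executing it; the real
# call requires ssh access to login1. Naive quoting: args with spaces break.
lsf() {
    local remote="bash -l -c '$*'"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "ssh login1 $remote"
    else
        ssh login1 "$remote"
    fi
}

# Example (requires cluster access): lsf bjobs -u all
```

The inner `bash -l -c` matters: a login shell loads the profile that puts the LSF binaries on PATH.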
Running a job on a GPU node:
bsub -n $SLOTS -gpu "num=1" -q $QUEUE -W $MINUTES -R "affinity[core(1)]" -J $JOBNAME -o $LOGFILE $COMMAND
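With the placeholders filled in, a submission might look like the sketch below. All values (queue, job name, log path, `python train.py`) are illustrative; the command string is composed first so it can be inspected before running it on a login node.

```shell
# Illustrative values; adjust for your job. Paths and names are placeholders.
SLOTS=12 QUEUE=gpu_a100 MINUTES=240
JOBNAME=train_model LOGFILE=/groups/mylab/logs/train.log

# Compose the submission so it can be inspected before submitting.
CMD="bsub -n $SLOTS -gpu \"num=1\" -q $QUEUE -W $MINUTES -R \"affinity[core(1)]\" -J $JOBNAME -o $LOGFILE python train.py"
echo "$CMD"
```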
Job status
bjobs
bjobs -l $JOBID
bjobs -o "job_name stat exec_host" -noheader
Kill jobs
bkill $JOBID — kill one
bkill 0 — kill all yours
bkill -J "jobname_*" 0 — kill by name pattern
Launch interactive job:
ssh -Y login1 'bsub -XF -Is -n 8 -gpu "num=1" -q gpu_a100 -W 48:00 /bin/bash'
Template placeholders: $SLOTS = number of slots, $QUEUE = queue name, $MINUTES = runtime limit, $JOBNAME = job name.
| Queue | GPU | VRAM | Price/GPU/hr | Slots/GPU | RAM/slot |
|---|---|---|---|---|---|
| gpu_a100 | A100 | 80GB | $0.20 | 12 | 40GB |
| gpu_l4 | L4 | 24GB | $0.10 | 8 | 15GB |
| gpu_l4_16 | L4 | 24GB | $0.10 | 16 | 15GB |
| gpu_l4_large | L4 | 24GB | $0.10 | 64 | 15GB |
| gpu_h100 | H100 | 80GB | $0.50 | 12 | 40GB |
| gpu_h200 | H200 | 141GB | $0.80 | 12 | 40GB |
| gpu_t4 | T4 | 16GB | $0.10 | 48 | 15GB |
| gpu_short | All | - | $0.10 | 8 | 15GB |
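The Price/GPU/hr column makes cost estimation a one-liner: cost = price per GPU-hour × GPUs × hours. A sketch using the A100 rate from the table:

```shell
# Rough cost estimate from the table: price_per_gpu_hour * gpus * hours.
# Example: 4 A100s at $0.20/GPU/hr for 10 hours.
awk 'BEGIN { printf "$%.2f\n", 0.20 * 4 * 10 }'
# prints $8.00
```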
| Queue | Runtime Limit | Description |
|---|---|---|
| interactive | Default 8h, max 48h | GUI/interactive apps. Limit: 128 slots or 4 jobs per user |
| local | 14 days | Default for jobs without runtime. CPU-optimized nodes. Limit: 5999 slots per user |
| short | 1 hour | Jobs < 1 hour. No slot limit per user. Gets priority scheduling |
-W sets the hard runtime limit (minutes or HH:MM format); -We sets a runtime estimate.

| Option | Description |
|---|---|
| -J <name> | Job name (avoid: usernames, spaces, "spark", "janelia", "master", "int") |
| -n <slots> | Number of slots (1-128). Env var: LSB_DJOB_NUMPROC |
| -o <file> | Stdout file (suppresses email notification) |
| -e <file> | Stderr file |
| -W <min> | Hard runtime limit (minutes or HH:MM) |
| -We <min> | Runtime estimate (helps scheduler, won't kill job) |
| Setting | Description | Janelia Notes |
|---|---|---|
| num=num_gpus | Number of GPUs | Max = GPUs per host |
| mode=shared\|exclusive_process | GPU sharing mode | Default: exclusive_process |
| mps=yes\|no | Multi-Process Service | Default: no (bugs in the past) |
| j_exclusive=yes\|no | Exclusive GPU access | Do not change; always exclusive |
| gmodel=full_model_name | Request specific GPU model | Only needed for gpu_short; use full model name |
| gmem=mem_value | Minimum GPU memory | Use with gpu_short only; e.g. gmem=16G |
| nvlink=yes | Require NVLink | Not needed; A100/H100/H200 always have NVLink |
Default -gpu settings: "num=1:mode=exclusive_process:mps=no:j_exclusive=yes"
| Type | Description |
|---|---|
| Batch | Single segment, executed once |
| Array | Parallel independent tasks with same workload |
| Parallel | Cooperating tasks (MPI), must run simultaneously |
| Interactive | User login to compute node |
# Single-threaded
bsub -n 1 -J <name> -o /dev/null 'command > output'
# Multi-threaded
bsub -n <1-128> -J <name> -o /dev/null 'command > output'
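For multi-threaded jobs the thread count should match -n; inside the job, LSB_DJOB_NUMPROC holds that value. A sketch of a job script's first lines (the --threads flag name is hypothetical; use whatever your program accepts):

```shell
# Inside a job script: size the thread pool from the slots LSF granted.
# LSB_DJOB_NUMPROC is set by LSF; default to 1 when testing locally.
NTHREADS="${LSB_DJOB_NUMPROC:-1}"
echo "running with $NTHREADS threads"
# typical use (flag name is hypothetical): ./mycommand --threads "$NTHREADS"
```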
bsub -n <slots> -J "jobname[1-n]" -o /dev/null 'command file.$LSB_JOBINDEX > output.$LSB_JOBINDEX'
Limit concurrent members with %val:
bsub -J "myArray[1-1000]%15" /path/to/mybinary input.$LSB_JOBINDEX
Max array size: 1 million elements.
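Each array member sees its own $LSB_JOBINDEX, so one command template fans out over numbered inputs. Simulated locally below; in a real job LSF sets the variable itself:

```shell
# Simulate what one array member would execute.
# LSF sets LSB_JOBINDEX in real jobs; we set it here only for illustration.
LSB_JOBINDEX=3
echo "would run: ./mybinary input.$LSB_JOBINDEX > output.$LSB_JOBINDEX"
# prints: would run: ./mybinary input.3 > output.3
```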
By default the submitting environment is passed to the job.
| Variable | Description |
|---|---|
| $LSB_JOBID | Job ID number |
| $LSB_JOBINDEX | Array task index |
| $LSB_JOBINDEX_STEP | Array step value |
| $LSB_BATCH_JID | Combined job ID and array index |
| $LSB_DJOB_NUMPROC | Value of -n (slots) |
| $LSB_JOBNAME | Value of -J (job name) |
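Logging these at job startup makes it easier to match log files back to jobs. A sketch; the assignments below only simulate what LSF sets in a real job:

```shell
# Log job identity at startup for easier debugging.
# These assignments simulate the LSF-provided environment; a real job inherits them.
LSB_JOBID=12345 LSB_JOBNAME=myjob LSB_DJOB_NUMPROC=4
echo "job $LSB_JOBID ($LSB_JOBNAME) using $LSB_DJOB_NUMPROC slots"
# prints: job 12345 (myjob) using 4 slots
```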
If a job fails with errors about /run/user/<userid>, run unset XDG_RUNTIME_DIR before submitting or inside the job.
Use bjobs (or bjobs -u all to see everyone's jobs). Common states: RUN, PEND, UNKNOWN.

| Task | Command |
|---|---|
| Delete all your jobs | bkill 0 |
| Delete individual job | bkill <job id> |
| Delete array job | bkill <job id> |
| Delete single array task | bkill "<job id>[<task#>]" |
| Delete range of tasks | bkill "12354[1-15, 321, 500-600]" |
| Delete by job name | bkill -J <jobname> 0 |
| Delete by queue | bkill -q <queue> 0 |
lsload -gpuload <hostname>
gpu_ut = processing utilization
CUDA_VISIBLE_DEVICES_ORIG gives the GPU ID inside the job
bjobs -l <jobid> shows GPU assignment under EXTERNAL MESSAGES
Request slots matching the ratio in the GPU queue table; over-requesting strands GPUs.
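The slot count to request follows directly from the Slots/GPU column of the GPU queue table: slots = GPUs × slots-per-GPU. For example, 2 A100s on gpu_a100 (12 slots per GPU):

```shell
# slots to request = gpus * slots_per_gpu (Slots/GPU column in the queue table)
GPUS=2 SLOTS_PER_GPU=12
echo "request: -n $(( GPUS * SLOTS_PER_GPU )) -gpu \"num=$GPUS\""
# prints: request: -n 24 -gpu "num=2"
```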
| Path | Backed up | Notes |
|---|---|---|
| /groups/ | Yes (nightly, 30-day offsite) | Primary storage for scientific data |
| /nrs/ | No | Cheaper tier for computationally reproducible data |
| /scratch/$USER/ | No | Node-local SSD, ~25GB/slot, clean up after job |
| /tmp/ | No | Do not use; use /scratch/ instead |
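Since /scratch/ is node-local and not cleaned automatically, a job script can guarantee cleanup with a trap. A sketch; mktemp stands in for the scratch directory a real job would use (e.g. something under /scratch/$USER/):

```shell
# Guarantee scratch cleanup when the job script exits normally.
# A real job would use a directory under /scratch/$USER/; mktemp stands in here.
WORKDIR=$(mktemp -d)
trap 'rm -rf "$WORKDIR"' EXIT   # cleanup runs when the script exits
echo "intermediate data" > "$WORKDIR/part1"
# ... compute, then copy results to /groups or /nrs before exiting ...
```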
Use dtn.int.janelia.org for copying data to/from nearline storage.
Singularity needs the --nv flag for GPU access: singularity exec --nv -B /groups -B /nrs -B /scratch image.sif command
CUDA is installed in /usr/local/ on all compute nodes (default at /usr/local/cuda); versioned installs live at /usr/local/cuda-11 and /usr/local/cuda-12, or use module load cuda-<version>.