Name: Debug Distributed Hang
Author: sgl-project

Scheduler watchdog timeout (self.watchdog_timeout=300, self.soft=False)

Thread (active): "MainThread"
    cuStreamSynchronize (libcuda.so)
    ...
    forward_extend (model_runner.py)

py-spy dump --pid <scheduler_pid>

export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=COLL

export CUDA_ENABLE_USER_TRIGGERED_COREDUMP=1
export CUDA_COREDUMP_PIPE="/tmp/cuda_pipe_%h_%p"
export CUDA_COREDUMP_FILE="/tmp/cuda_coredump_%h_%p"
export CUDA_COREDUMP_SHOW_PROGRESS=1
export CUDA_COREDUMP_GENERATION_FLAGS='skip_nonrelocated_elf_images,skip_global_memory,skip_shared_memory,skip_local_memory,skip_constbank_memory'

ls /proc/<pid>/fd/ -la 2>/dev/null | grep cuda_pipe
dd if=/dev/zero bs=1M count=1 > /tmp/cuda_pipe_<hostname>_<pid>

Opening GPU coredump: <coredump_file>
[Current focus set to CUDA kernel 0, grid 622721, cluster (4,0,0), block (16,0,0), thread (64,0,0), device 0, sm 0, warp 0, lane 0]
#0  0x00007f8029b2b040 in ncclDevKernel_AllGather_RING_LL(ncclDevKernelArgsStorage<4096ul>)<<<(24,1,1),(512,1,1)>>> ()

import os

_debug_files = {}

def get_debug_file(rank):
    key = f"rank{rank}"
    if key not in _debug_files:
        _debug_files[key] = open(f"/tmp/debug_rank{rank}.log", "w")
    return _debug_files[key]

if os.environ.get("SGLANG_DEBUG_HANG"):
    f = get_debug_file(rank)
    f.write(f"EVENT_NAME key1={val1} key2={val2}\n")
    f.flush()

f.write(f"SCHED_BATCH step={step} num_reqs={n} extend_lens={lens}\n")
f.write(f"VERIFY predict_hash={h} accept_len={alen}\n")  # h: short tensor hash; avoids shadowing the builtin hash()
f.write(f"CACHE_INSERT rid={rid} num_tokens={n}\n")

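import hashlib
# .cpu() copies to host and blocks until the GPU work producing `tensor` has finished.
# NumPy has no bfloat16 dtype, so bf16 tensors need a cast (e.g. .float()) before .numpy().
h = hashlib.md5(tensor.cpu().numpy().tobytes()).hexdigest()[:8]
f.write(f"LOGITS logits_hash={h}\n")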

h = hashlib.md5(str(tensor.tolist()).encode()).hexdigest()[:8]

# Extract specific event type
grep "^VERIFY" /tmp/debug_rank0.log > /tmp/v_r0.txt
grep "^VERIFY" /tmp/debug_rank1.log > /tmp/v_r1.txt
diff /tmp/v_r0.txt /tmp/v_r1.txt | head -20

grep -c "^VERIFY" /tmp/debug_rank*.log
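
When the `diff` output is long, the first divergent event can also be located with a short script. The following is a minimal sketch, not part of SGLang: `first_divergence` is a hypothetical helper, and the sample lines only imitate the `VERIFY` format shown above.

```python
from itertools import zip_longest


def first_divergence(lines_a, lines_b, prefix="VERIFY"):
    """Return (index, event_a, event_b) for the first differing event, or None.

    `index` counts only the lines that start with `prefix`.
    """
    events_a = [line.rstrip("\n") for line in lines_a if line.startswith(prefix)]
    events_b = [line.rstrip("\n") for line in lines_b if line.startswith(prefix)]
    # zip_longest pads the shorter log with None, so a rank that stopped
    # logging early is also reported as a divergence.
    for i, (a, b) in enumerate(zip_longest(events_a, events_b)):
        if a != b:
            return i, a, b
    return None


# Illustrative data only; real input would come from the per-rank log files.
rank0 = [
    "SCHED_BATCH step=0 num_reqs=2\n",
    "VERIFY predict_hash=aa11bb22 accept_len=3\n",
    "VERIFY predict_hash=cc33dd44 accept_len=2\n",
]
rank1 = [
    "SCHED_BATCH step=0 num_reqs=2\n",
    "VERIFY predict_hash=aa11bb22 accept_len=3\n",
    "VERIFY predict_hash=cc33dd44 accept_len=1\n",
]
result = first_divergence(rank0, rank1)
print(result)
```

With real logs, open file objects can be passed directly, for example `first_divergence(open("/tmp/debug_rank0.log"), open("/tmp/debug_rank1.log"))`, since the helper only iterates over lines.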

f.write(
    f"OP_INPUTS input_a_hash={h_a} input_b_hash={h_b} "
    f"input_c_hash={h_c} input_d_hash={h_d}\n"
)
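
Once each rank logs an `OP_INPUTS` line like the one above, the differing input can be picked out by comparing the `key=value` fields. This is a rough sketch under two assumptions: values contain no spaces (true for short hex hashes), and both lines describe the same step. `parse_fields` and `differing_keys` are hypothetical names, and the hash values are made up.

```python
def parse_fields(line):
    """Split 'EVENT key=value ...' into (event_name, {key: value})."""
    name, *pairs = line.split()
    return name, dict(pair.split("=", 1) for pair in pairs)


def differing_keys(line_a, line_b):
    """Return the sorted keys whose values differ between two log lines."""
    name_a, fields_a = parse_fields(line_a)
    name_b, fields_b = parse_fields(line_b)
    if name_a != name_b:
        raise ValueError(f"event mismatch: {name_a} vs {name_b}")
    keys = sorted(set(fields_a) | set(fields_b))
    # A key missing on one side compares as None and is reported too.
    return [k for k in keys if fields_a.get(k) != fields_b.get(k)]


# Illustrative lines only; the hashes are invented.
r0 = ("OP_INPUTS input_a_hash=1a2b3c4d input_b_hash=99887766 "
      "input_c_hash=deadbeef input_d_hash=0badf00d")
r1 = ("OP_INPUTS input_a_hash=1a2b3c4d input_b_hash=99887766 "
      "input_c_hash=feedface input_d_hash=0badf00d")
changed = differing_keys(r0, r1)
print(changed)
```

The input that differs is the next candidate to trace back: log the hashes of its own inputs and repeat the comparison.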

| Technique | When to Use |
| --- | --- |
| `py-spy dump` | First step: see where each rank is stuck |
| `NCCL_DEBUG=INFO` | Identify which collective and sizes |
| CUDA coredump + `cuda-gdb` | See which GPU kernel is blocked |
| Per-rank log files | Compare rank states over time |
| Hash of tensors | Efficiently compare large tensors across ranks |
| `diff` on extracted events | Find the exact step of divergence |
| `broadcast(result, src=0)` | Fix floating-point or sampling non-determinism |

Debug Distributed Hang

Debugging Distributed Hangs in SGLang

Overview

Prerequisites

Step 1: Confirm and Locate the Hang

1a. Watchdog / py-spy

1b. NCCL Debug Logging

1c. CUDA Coredump

1d. Identify the Collective

Step 2: Per-Rank Logging

Setup Pattern

What to Log

Hash Large Tensors

Avoid Implicit Synchronization

Step 3: Diff to Find the Diverge Point

Basic Diff

Count Events

Find First Diverge

Step 4: Binary-Search the Root Cause

4a. Identify Inputs

4b. Diff Inputs Across Ranks

4c. Recurse

Step 5: Common Root Causes and Fixes

Floating-Point Non-Determinism

Random Number Divergence

Conditional Code Paths

Pipeline Parallel (PP) Send/Recv Mismatch

Step 6: Verify the Fix

Quick Reference
