Name: Coredump Debug
Author: AMD-AGI

Debug segfaults and crashes in JAX/XLA/ROCm training workloads using coredump analysis. Use when the user has a coredump file, SIGSEGV, segfault, crash dump, or core file to analyze. Covers GDB backtrace extraction, identifying the crash cause from registers and disassembly, finding and cloning the correct source code versions, and reading the relevant code to determine the root cause.

AMD-AGI, 27 stars, 05.03.2026

Profession
Categories: Debugging

Coredump Debugging for JAX/XLA/ROCm

Systematic workflow for analyzing coredumps from GPU training workloads. Produces: crash call chain, root cause hypothesis, and the exact source lines responsible.

Important: A coredump captures the crash symptom, not necessarily the root cause. A common pattern in GPU workloads is that a data race silently corrupts memory during normal operation and the corruption surfaces as a crash only much later (e.g., during exit cleanup). If the crash is non-deterministic (e.g., ~2% repro rate), suspect a data race or thread-safety bug: the coredump shows where the corrupted data was read, but the write that corrupted it happened earlier. Finding the actual root cause may require ASAN or TSAN.

Step 0: Determine if the Crash is Deterministic

Before diving into GDB, establish the repro rate:

100% repro → likely a logic bug, NULL deref, or missing check. Coredump analysis alone is usually sufficient.
Low repro (1-10%) → likely a data race, use-after-free, or shutdown ordering issue. The coredump identifies the symptom; sanitizers (ASAN/TSAN) identify the cause.
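The repro rate can be estimated by running the workload repeatedly and counting abnormal exits. The following is a minimal sketch, not part of the original workflow: `measure_repro_rate` is a hypothetical helper, and the command passed to it stands in for the real training launch command.

```python
import subprocess
import sys


def measure_repro_rate(cmd, runs=50):
    """Run cmd `runs` times and return (crashes, runs, crash_rate)."""
    crashes = 0
    for _ in range(runs):
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        # subprocess reports death-by-signal as a negative return code;
        # a wrapping shell reports it as 128 + signal number instead.
        # Heuristic: any code above 128 is counted as a crash.
        if result.returncode < 0 or result.returncode > 128:
            crashes += 1
    return crashes, runs, crashes / runs


if __name__ == "__main__":
    # Placeholder command; replace with the real workload.
    crashes, runs, rate = measure_repro_rate([sys.executable, "-c", "pass"], runs=5)
    print(f"{crashes}/{runs} runs crashed ({rate:.0%})")
```

For a crash in the 1-10% range, a few dozen runs are usually too few to distinguish "fixed" from "unlucky", so the run count should be chosen with the expected rate in mind.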

2.1 Build the crash chain table

Frame	Function	Source File	Component
#0	`syscall()`	libc	Signal delivery
#1	`SignalHandler(Sig=11)`	LLVM	Signal re-raise
#2	`<signal handler called>`	-	-
#3	`??()` from `libsomething.so`	unknown	Crash origin
...	...	...	...
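A crash chain table like the one above can be built from saved GDB backtrace text. The sketch below assumes the common single-line `#N  0xADDR in func (args) from|at location` frame layout; `parse_backtrace` and `crash_origin` are hypothetical helper names, the sample text is invented for illustration, and frames that GDB wraps across several lines are not handled.

```python
import re

SIGNAL_FRAME = "<signal handler called>"


def parse_backtrace(text):
    """Turn GDB `bt` output into a list of {frame, function, source} dicts."""
    frames = []
    for line in text.splitlines():
        m = re.match(r"#(\d+)\s+(.*)", line.strip())
        if not m:
            continue
        num, rest = int(m.group(1)), m.group(2)
        if rest.startswith(SIGNAL_FRAME):
            frames.append({"frame": num, "function": SIGNAL_FRAME, "source": "-"})
            continue
        # Drop the leading program counter, if GDB printed one.
        rest = re.sub(r"^0x[0-9a-fA-F]+ in ", "", rest)
        function = rest.split(" (", 1)[0]
        loc = re.search(r"\s(?:from|at)\s(\S+)$", rest)
        frames.append(
            {"frame": num, "function": function,
             "source": loc.group(1) if loc else "unknown"}
        )
    return frames


def crash_origin(frames):
    """Return the first frame below the signal handler, else frame #0."""
    for i, frame in enumerate(frames):
        if frame["function"] == SIGNAL_FRAME and i + 1 < len(frames):
            return frames[i + 1]
    return frames[0] if frames else None


# Invented sample mirroring the table layout.
SAMPLE = """\
#0  0x00007f1c2a3b4c5d in syscall () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007f1c1b000000 in SignalHandler (Sig=11) at /src/Signals.inc:367
#2  <signal handler called>
#3  0x00007f1c0a000000 in ?? () from /opt/rocm/lib/libsomething.so
"""

if __name__ == "__main__":
    print(crash_origin(parse_backtrace(SAMPLE)))
```

Frames above `<signal handler called>` belong to signal delivery and re-raise, so the first frame below it is the one to start reading from.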

3.2 Identify which repos are needed

Library in backtrace	Source Repo	Typical Path (example)
`librocprofiler-sdk.so`	`ROCm/rocm-systems`	`/workspace/rocm-systems/projects/rocprofiler-sdk`
`libamdhip64.so` (CLR)	`ROCm/rocm-systems`	`/workspace/rocm-systems/projects/clr`
`libhsa-runtime64.so`	`ROCm/rocm-systems`	`/workspace/rocm-systems/projects/rocr-runtime`
`librccl.so`	`ROCm/rccl`	`/workspace/rccl`
`libhipblaslt.so`	`ROCm/rocm-libraries`	`/workspace/rocm-libraries/projects/hipblaslt`
`xla_rocm_plugin.so`	`ROCm/xla`	`/opt/xla`
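The mapping above can be kept as a small lookup keyed by library name. This is an illustrative sketch: `repo_for` is a hypothetical helper, the paths are the example paths from the table rather than guaranteed locations, and the version-suffix stripping assumes the usual `lib.so.MAJOR.MINOR` naming.

```python
import os
import re

# Library name -> (source repo, example checkout path), copied from the table.
REPO_FOR_LIBRARY = {
    "librocprofiler-sdk.so": ("ROCm/rocm-systems", "/workspace/rocm-systems/projects/rocprofiler-sdk"),
    "libamdhip64.so": ("ROCm/rocm-systems", "/workspace/rocm-systems/projects/clr"),
    "libhsa-runtime64.so": ("ROCm/rocm-systems", "/workspace/rocm-systems/projects/rocr-runtime"),
    "librccl.so": ("ROCm/rccl", "/workspace/rccl"),
    "libhipblaslt.so": ("ROCm/rocm-libraries", "/workspace/rocm-libraries/projects/hipblaslt"),
    "xla_rocm_plugin.so": ("ROCm/xla", "/opt/xla"),
}


def repo_for(library_path):
    """Map a shared-object path from a backtrace to (repo, example path), or None."""
    name = os.path.basename(library_path)
    # Strip a trailing version suffix such as ".so.6" or ".so.6.2.1".
    base = re.sub(r"\.so(\.\d+)*$", ".so", name)
    return REPO_FOR_LIBRARY.get(base)


if __name__ == "__main__":
    print(repo_for("/opt/rocm/lib/libamdhip64.so.6"))
```

A `None` result means the library is not in the table and its source repo has to be identified another way.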

4.2 Check for common crash patterns

Pattern	What to look for in coredump	Typical repro rate
Data race (most common for non-deterministic)	Misaligned/garbage pointers, corrupted container internals, `rlock` guarding write operations	1-10%
Use-after-free	Pointer to freed region, `fd` bytes in ASAN	1-50%
Shutdown ordering	Crash in destructor during `__run_exit_handlers`	1-20%
NULL deref	`rdi=0x0`, `rax=0x0` before a `mov` through pointer	~100%
Stack overflow	Thousands of recursive frames, `rsp` near stack limit	~100%
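Register and pointer values copied from `info registers` can be given a quick sanity check before reading disassembly. The sketch below is a heuristic only: `classify_pointer` is a hypothetical helper, it assumes x86-64 with 48-bit virtual addresses, and a misaligned value is not proof of corruption (byte pointers are legitimately unaligned).

```python
def classify_pointer(value, alignment=8):
    """Roughly classify an unsigned 64-bit pointer value from a coredump."""
    if value == 0:
        return "null"
    # With 48-bit virtual addresses, bits 63..47 must be all zeros or all ones.
    top_bits = value >> 47
    if top_bits not in (0, 0x1FFFF):
        return "non-canonical"
    # Small values usually mean a field access through a NULL base pointer.
    if value < 0x1000:
        return "near-null"
    if value % alignment:
        return "misaligned"
    return "plausible"


if __name__ == "__main__":
    for reg, val in [("rdi", 0x0), ("rax", 0x18), ("rbx", 0x4141414141414141)]:
        print(reg, hex(val), classify_pointer(val))
```

A `null` or `near-null` result points toward the NULL deref row, while `non-canonical` or `misaligned` values in what should be an object pointer fit the data race and use-after-free rows.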

Quick Reference: GDB Commands for Coredumps

Command	Purpose
`bt full`	Full backtrace with local variables
`frame N`	Switch to frame N
`info locals`	Show local variables in current frame
`info registers`	Show CPU registers
`info threads`	List all threads
`thread N`	Switch to thread N
`thread apply all bt`	Backtrace for every thread
`x/10i $rip`	Disassemble 10 instructions at crash point
`x/s ADDR`	Print string at address
`p expr`	Evaluate expression
`info proc mappings`	Show library load addresses
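Several of these commands can be run non-interactively with GDB's `-batch` and `-ex` options, which is convenient for saving output to a file. A minimal sketch follows; `gdb_batch_command` is a hypothetical helper, and the binary and core paths are placeholders.

```python
import shlex

DEFAULT_COMMANDS = (
    "info threads",
    "thread apply all bt",
    "info registers",
    "x/10i $rip",
)


def gdb_batch_command(binary, core, commands=DEFAULT_COMMANDS):
    """Build an argv list that runs GDB commands against a core file and exits."""
    argv = ["gdb", "-batch"]
    for command in commands:
        argv += ["-ex", command]
    # GDB takes the executable first, then the core file.
    argv += [binary, core]
    return argv


if __name__ == "__main__":
    # Placeholder paths; substitute the crashed binary and its core file.
    argv = gdb_batch_command("/usr/bin/python3", "core.12345")
    print(shlex.join(argv))
```

The resulting list can be passed to `subprocess.run(argv, capture_output=True, text=True)` to capture the backtrace as text.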

Quick Reference: Crash Exit Codes

Exit Code	Signal	Meaning
139	SIGSEGV (11)	Segmentation fault (invalid memory access)
134	SIGABRT (6)	Abort (assertion failure, double free)
136	SIGFPE (8)	Floating point exception (arithmetic error, e.g., integer division by zero)
137	SIGKILL (9)	Killed (OOM killer, timeout)
135	SIGBUS (7)	Bus error (misaligned access, bad mmap)
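Shell exit codes above 128 encode the terminating signal as 128 plus the signal number, so they can be decoded programmatically. The following is a small sketch using Python's standard `signal` module; `describe_exit_code` is a hypothetical helper, and signal numbers are those of the platform the script runs on.

```python
import signal


def describe_exit_code(code):
    """Translate a process exit status into a signal name where possible."""
    if code < 0:
        # Python's subprocess reports death-by-signal as -signum.
        signum = -code
    elif code > 128:
        # Shells report death-by-signal as 128 + signum.
        signum = code - 128
    else:
        return f"exit {code} (not a signal)"
    try:
        return signal.Signals(signum).name
    except ValueError:
        return f"unknown signal {signum}"


if __name__ == "__main__":
    for code in (139, 137, -11, 1):
        print(code, describe_exit_code(code))
```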

Coredump Debug

Coredump Debugging for JAX/XLA/ROCm

Step 0: Determine if the Crash is Deterministic

Prerequisites

Phase 1: Extract the Crash Backtrace

1.1 Full backtrace with locals

1.2 Identify the crash signal and thread

1.3 Get the crashing instruction

1.4 Get detailed frame info for key frames

Phase 2: Read the Crash Call Chain

2.1 Build the crash chain table

2.2 Identify the crash context

2.3 Check for thread-safety clues

Phase 3: Find and Clone the Correct Source Code

3.1 Inventory the container environment

3.2 Identify which repos are needed

3.3 Find the matching version for system packages

3.4 Clone missing repos

3.5 Handle long build-path prefixes

3.6 Verify upstream fixes

Phase 4: Read Source Code and Determine Root Cause

4.1 Read the crashing function

4.2 Check for common crash patterns

4.3 Trace the data lifecycle

4.4 Check for existing fixes upstream

Output Format

Quick Reference: GDB Commands for Coredumps

Quick Reference: Crash Exit Codes
