Analyze a jemalloc (or other) allocation profile in collapsed stack format. Use when the user wants to analyze memory allocations, find top allocators, or understand memory usage patterns from a .collapsed profile file.
Analyze an allocation profile file in collapsed stack format (as produced by jemalloc, async-profiler, or perf). Each line has the form:
frame1;frame2;...;frameN VALUE
where VALUE is the number of bytes (or samples, depending on the profiler) attributed to that stack trace.
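As a minimal sketch of the line format described above (not one of the skill's scripts), the snippet below splits a line on its last space, since frame names such as `operator new(unsigned long)` can themselves contain spaces. The helper name `parse_collapsed_line` and the sample line are hypothetical and chosen only for illustration.

```python
def parse_collapsed_line(line):
    """Split 'frame1;frame2;...;frameN VALUE' into (frames, value), or return None."""
    # Split on the LAST space only: frames may contain spaces, VALUE never does.
    stack, sep, value = line.strip().rpartition(" ")
    if not sep or not value.isdigit():
        return None
    return stack.split(";"), int(value)

# Hypothetical sample line; frames are listed outermost-first.
sample = "main;DB::Server::run;operator new(unsigned long) 4096"
frames, value = parse_collapsed_line(sample)
print(frames[-1], value)  # prints: operator new(unsigned long) 4096
```

Lines that do not end in an integer are rejected rather than guessed at, which mirrors how the analysis scripts skip malformed lines.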
$0 (optional): Path to the .collapsed file. If not provided, search for .collapsed files in the current directory and ask the user to choose.
Use the Task tool with subagent_type=Bash to locate the file:
If $ARGUMENTS is provided, use it directly. Otherwise, run:
find . -maxdepth 3 -name "*.collapsed" -o -name "*.folded" | sort -t_ -k1,1
Report the candidates to the user and ask which file to analyze with AskUserQuestion.
Once the file path is known, pass it to all subsequent steps.
Launch the following three Task agents IN PARALLEL (single message, three tool calls) all with run_in_background: true.
Then call TaskOutput for ALL three agents (also in parallel, single message) before proceeding to Step 3.
Do NOT start Step 3 until every agent has finished.
Fallback: If any agent fails (e.g., reports lacking Bash permission), re-run its Python script directly using the Bash tool in the main context.
Agent A (subagent_type=Bash): Run this Python script to compute summary statistics:
python3 - <<'EOF'
import sys, re

filepath = "PATH_TO_FILE"  # substituted by skill

with open(filepath) as fh:
    lines = fh.read().splitlines()

# Parse "frame1;frame2;...;frameN VALUE" lines into (value, stack) pairs
traces = []
for line in lines:
    line = line.strip()
    if not line:
        continue
    parts = line.rsplit(' ', 1)
    if len(parts) != 2:
        continue
    try:
        traces.append((int(parts[1]), parts[0]))
    except ValueError:
        continue

total = sum(v for v, _ in traces)
if total == 0:
    # Avoid division by zero in the percentage columns below
    sys.exit("No stack traces with a non-zero value found in " + filepath)
traces.sort(reverse=True)

# Noise filters (keep in sync with Agent C)
JEMALLOC_PREFIXES = (
    "prof_backtrace", "prof_alloc_prep", "prof_tctx", "prof_",
    "imalloc", "ialloc", "irallocx", "imallocx",
    "arena_malloc", "arena_palloc", "arena_ralloc", "arena_",
    "tcache_alloc", "tcache_",
    "large_malloc", "large_palloc",
    "chunk_alloc", "huge_malloc", "huge_palloc",
    "je_malloc", "je_calloc", "je_realloc", "je_rallocx", "je_mallocx",
    "je_posix_memalign", "je_aligned_alloc",
    "malloc_default", "calloc",
)
ALLOC_SUBSTRINGS = (
    "operator new", "operator new[]",
    "__libcpp_operator_new",
    "__libc_malloc", "__libc_calloc", "_int_malloc",
    "posix_memalign", "aligned_alloc",
    "do_rallocx", "do_mallocx",
    "mi_malloc", "mi_calloc",
    "__cxx_global_var_init", "__cxa_thread_atexit_impl",
    "DB::Memory<", "Memory::newImpl", "Allocator<false", "Allocator<true",
    "allocNoTrack",
    "PODArrayBase::realloc", "PODArrayBase::alloc",
    "CRYPTO_malloc",
    "std::__detail::_Hash_node", "std::_Rb_tree",
    "std::vector<", "std::string::",
    # STL and PODArray wrappers: noise for leaf analysis
    "std::__1::",
    "DB::PODArrayBase",
)

def is_noise(frame):
    return (any(frame.startswith(p) for p in JEMALLOC_PREFIXES) or
            any(s in frame for s in ALLOC_SUBSTRINGS))

def shorten(frame):
    # Collapse long template argument lists
    return re.sub(r'<[^>]{40,}>', '<...>', frame)

print("=== SUMMARY ===")
print(f"File: {filepath}")
print(f"Total allocated: {total:,} bytes ({total/1024/1024:.2f} MB) ({total/1024/1024/1024:.3f} GB)")
print(f"Unique stack traces: {len(traces)}")
print()

print("=== TOP 25 STACK TRACES ===")
for i, (v, stack) in enumerate(traces[:25], 1):
    frames = [f for f in stack.split(';') if f]
    meaningful = [f for f in frames if not is_noise(f)]
    # Show the four innermost meaningful frames, innermost first
    tail_frames = meaningful[-4:] if meaningful else frames[-4:]
    tail = ' <- '.join(shorten(f) for f in reversed(tail_frames))
    print(f"{i:>3}. {v/1024/1024:>8.2f} MB ({100*v/total:>5.1f}%) {tail[:120]}")
print()

print("=== FULL STACKS FOR TOP 10 ===")
for i, (v, stack) in enumerate(traces[:10], 1):
    frames = [f for f in stack.split(';') if f]
    print(f"\n--- #{i}: {v/1024/1024:.2f} MB ({100*v/total:.1f}%) ---")
    for depth, frame in enumerate(reversed(frames), 1):
        noise_mark = " [noise]" if is_noise(frame) else ""
        print(f" [{depth:>2}] {shorten(frame)}{noise_mark}")
EOF
Agent B (subagent_type=Bash): Run this Python script to aggregate by the outermost (shallowest) meaningful frame, i.e. the operation that initiated the allocation. This answers "why did this allocation happen?" (e.g., loading data parts, executing a query, loading a dictionary), complementing Agent C, which answers "what code allocated?":
python3 - <<'EOF'
import sys, re
from collections import defaultdict

filepath = "PATH_TO_FILE"  # substituted by skill

with open(filepath) as fh:
    lines = fh.read().splitlines()

# Parse "frame1;frame2;...;frameN VALUE" lines into (value, stack) pairs
traces = []
for line in lines:
    line = line.strip()
    if not line:
        continue
    parts = line.rsplit(' ', 1)
    if len(parts) != 2:
        continue
    try:
        traces.append((int(parts[1]), parts[0]))
    except ValueError:
        continue

total = sum(v for v, _ in traces)
if total == 0:
    # Avoid division by zero in the percentage column below
    sys.exit("No stack traces with a non-zero value found in " + filepath)

# Frames to skip when looking for the outermost meaningful frame:
# thread pool scaffolding, libc entry points, raw addresses, lambda wrappers
SKIP_OUTER = (
    "0000", "_start", "__libc_start", "__GI___clone",
    "start_thread", "clone3",
    "ThreadPoolImpl", "ThreadFromGlobalPool",
    "std::__1::__function", "std::__1::__invoke",
    "decltype", "void std::__1::__function",
    "std::__1::__packaged_task_function",
    "DB::ThreadPool", "DB::GlobalThreadPool",
    "DB::threadFunction",
    "BaseDaemon", "SignalListener",
    "Poco::ThreadImpl::runnableEntry",
    "Poco::PooledThread::run",
    "main",
    "DB::Server::run",
    "Poco::Util::Application::run",
)

def is_skip_outer(frame):
    return any(frame.startswith(p) for p in SKIP_OUTER) or frame.startswith("(")

def shorten(frame):
    # Collapse long templates, preserve (anonymous namespace), strip args
    s = re.sub(r'<[^>]{40,}>', '<...>', frame)
    s = s.replace('(anonymous namespace)', '{anon}')
    s = re.sub(r'\(.*', '', s)
    s = s.replace('{anon}', '(anonymous namespace)')
    return s[:120]

by_outer = defaultdict(int)
for v, stack in traces:
    frames = [f for f in stack.split(';') if f]
    # Frames are outermost-first, so the first non-skipped frame is the initiator
    outer = None
    for f in frames:
        if not f or is_skip_outer(f):
            continue
        outer = f
        break
    if outer is None:
        outer = frames[0] if frames else "(unknown)"
    by_outer[shorten(outer)] += v

print("=== TOP 25 OUTERMOST MEANINGFUL FRAMES (operation that initiated allocation) ===")
for fn, v in sorted(by_outer.items(), key=lambda x: -x[1])[:25]:
    mb = v / 1024 / 1024
    pct = 100 * v / total
    bar = "\u2588" * int(pct / 2)
    print(f" {mb:>10.2f} MB {pct:>5.1f}% {bar:<20} {fn}")
EOF
Agent C (subagent_type=Bash): Run this Python script to aggregate by the deepest (innermost) meaningful frame, i.e. the function that made the allocation call:
python3 - <<'EOF'
import sys, re
from collections import defaultdict

filepath = "PATH_TO_FILE"  # substituted by skill

with open(filepath) as fh:
    lines = fh.read().splitlines()

# Parse "frame1;frame2;...;frameN VALUE" lines into (value, stack) pairs
traces = []
for line in lines:
    line = line.strip()
    if not line:
        continue
    parts = line.rsplit(' ', 1)
    if len(parts) != 2:
        continue
    try:
        traces.append((int(parts[1]), parts[0]))
    except ValueError:
        continue

total = sum(v for v, _ in traces)
if total == 0:
    # Avoid division by zero in the percentage column below
    sys.exit("No stack traces with a non-zero value found in " + filepath)

# Aggregate by last meaningful frame (the allocating function)
by_leaf = defaultdict(int)
by_caller = defaultdict(int)  # caller of the leaf

# jemalloc profiling infrastructure: always at the bottom of every stack
JEMALLOC_PREFIXES = (
    "prof_backtrace", "prof_alloc_prep", "prof_tctx", "prof_",
    "imalloc", "ialloc", "irallocx", "imallocx",
    "arena_malloc", "arena_palloc", "arena_ralloc", "arena_",
    "tcache_alloc", "tcache_",
    "large_malloc", "large_palloc",
    "chunk_alloc", "huge_malloc", "huge_palloc",
    "je_malloc", "je_calloc", "je_realloc", "je_rallocx", "je_mallocx",
    "je_posix_memalign", "je_aligned_alloc",
    "malloc_default", "calloc",
)

# libc / C++ allocator wrappers that add no information
ALLOC_SUBSTRINGS = (
    "operator new", "operator new[]",
    "__libcpp_operator_new",
    "__libc_malloc", "__libc_calloc", "_int_malloc",
    "posix_memalign", "aligned_alloc",
    "do_rallocx", "do_mallocx",
    "mi_malloc", "mi_calloc",
    # C++ static/thread-local initialization wrappers
    "__cxx_global_var_init", "__cxa_thread_atexit_impl",
    # ClickHouse allocator wrappers: informative only as callers, not as leaf
    "DB::Memory<", "Memory::newImpl", "Allocator<false", "Allocator<true",
    "allocNoTrack",
    "PODArrayBase::realloc", "PODArrayBase::alloc",
    # Third-party allocators
    "CRYPTO_malloc",
    # STL internals
    "std::__detail::_Hash_node", "std::_Rb_tree",
    "std::vector<", "std::string::",
    # STL and PODArray wrappers: noise for leaf analysis
    "std::__1::",
    "DB::PODArrayBase",
)

def is_noise(frame):
    return (any(frame.startswith(p) for p in JEMALLOC_PREFIXES) or
            any(s in frame for s in ALLOC_SUBSTRINGS))

def meaningful_leaf(frames):
    # Walk from innermost (last) frame upward, skipping allocator/profiling noise.
    # In jemalloc collapsed format frames are outermost-first, so the bottom of
    # the stack (profiling infra + raw allocators) is at the end of the list.
    for f in reversed(frames):
        if f and not is_noise(f):
            return f
    return frames[-1] if frames else "(unknown)"

def meaningful_caller(frames):
    """Second non-noise frame from the bottom."""
    found_leaf = False
    for f in reversed(frames):
        if f and not is_noise(f):
            if found_leaf:
                return f
            found_leaf = True
    return None

def shorten(frame):
    # Collapse long templates, preserve (anonymous namespace), strip args
    s = re.sub(r'<[^>]{40,}>', '<...>', frame)
    s = s.replace('(anonymous namespace)', '{anon}')
    s = re.sub(r'\(.*', '', s)
    s = s.replace('{anon}', '(anonymous namespace)')
    return s[:120]

for v, stack in traces:
    frames = [f for f in stack.split(';') if f]
    leaf = meaningful_leaf(frames)
    by_leaf[shorten(leaf)] += v
    caller = meaningful_caller(frames)
    if caller:
        by_caller[shorten(caller)] += v

print("=== TOP 25 ALLOCATING FUNCTIONS (first non-trivial frame from bottom) ===")
for label, bucket in [("Leaf (allocator call site)", by_leaf),
                      ("Caller of leaf", by_caller)]:
    print(f"\n--- {label} ---")
    for fn, v in sorted(bucket.items(), key=lambda x: -x[1])[:25]:
        mb = v / 1024 / 1024
        pct = 100 * v / total
        print(f" {mb:>8.2f} MB {pct:>5.1f}% {fn}")
EOF
MANDATORY: All three agents from Step 2 must have completed (TaskOutput returned) before this step.
Using the outputs from Agent A (top stacks with full traces), Agent B (outermost frame — the initiating operation), and Agent C (leaf function — the allocating code), you (the main LLM) produce a structured report. Agent B gives you the "why" (what operation triggered allocations) and Agent C gives you the "how" (what code did the allocating). Combined with Agent A's full stacks, you can semantically group allocations into subsystems — e.g., AggregatedDataVariants::init called from HashedDictionary::loadData is "Dictionary Loading", not "Aggregation"; Arena::addMemoryChunk inside a merge pipeline is "Merges", not "Arena".
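The semantic grouping described above is meant to be done by judgment, but a rough rule-based version can make the idea concrete. The sketch below is only an illustration under stated assumptions: the rule table `SUBSYSTEM_RULES`, the helper names, and the sample frames `DB::MergeTask::execute` and `DB::Aggregator::execute` are hypothetical and not part of the skill. Rules are checked in order, so an initiating operation (dictionary loading, merges) takes precedence over the allocating code (aggregation, arena).

```python
from collections import defaultdict

# Hypothetical rules: (substring searched anywhere in the stack, subsystem label).
# The first matching rule wins, so initiating operations are listed first.
SUBSYSTEM_RULES = [
    ("HashedDictionary::loadData", "Dictionary Loading"),
    ("Merge", "Merges"),
    ("AggregatedDataVariants", "Aggregation"),
]

def classify(stack):
    """Return the label of the first rule whose substring occurs in the stack."""
    for needle, label in SUBSYSTEM_RULES:
        if needle in stack:
            return label
    return "Other"

def group_by_subsystem(traces):
    """Sum trace values per subsystem; traces are (value, stack) pairs."""
    totals = defaultdict(int)
    for value, stack in traces:
        totals[classify(stack)] += value
    return dict(totals)

# Made-up traces in the same (value, stack) shape the agent scripts use.
traces = [
    (300, "DB::HashedDictionary::loadData;DB::AggregatedDataVariants::init"),
    (200, "DB::MergeTask::execute;DB::Arena::addMemoryChunk"),
    (100, "DB::Aggregator::execute;DB::AggregatedDataVariants::init"),
]
print(group_by_subsystem(traces))
```

Here the first trace lands in "Dictionary Loading" even though its leaf is `AggregatedDataVariants::init`, which is the distinction the paragraph above draws.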
Your report should include:
do_rallocx / PODArray::realloc heavy)
After presenting the summary, use AskUserQuestion:
Question: "What would you like to do next?"
Option 1: "Drill into a specific subsystem"
Description: "Show all stack traces for a chosen component (e.g., MergeTree, SystemLog)"
→ Ask which subsystem with a follow-up AskUserQuestion
→ Launch Task (subagent_type=Bash) in the background (run_in_background: true) with a Python script that prints each matching trace in the form:
MB (pct%) | frame1 ← frame2 ← ... ← frameN
→ Pass the output to a general-purpose Task agent for a concise summary
Option 2: "Show full stacks for top N traces"
Description: "Print complete call stacks for the largest N allocations"
→ Ask N with a follow-up AskUserQuestion (suggest 10 as default)
→ Launch Task (subagent_type=Bash) in the background with a Python script that prints the complete call stacks for the N largest traces
→ Pass the output to a general-purpose Task agent for a concise narrative summary
Option 3: "Search for a keyword in stacks"
Description: "Filter traces containing a specific function or class name"
→ Ask for the keyword via AskUserQuestion
→ Launch two Task agents in parallel (run_in_background: true):
First agent (subagent_type=Bash): filter and aggregate all matching traces: sum total, count, top 20 by size, full stacks for top 5
Second agent (subagent_type=Bash): find related keywords by scanning all frames containing the keyword and extracting their neighboring frames (co-occurring functions), to suggest related call paths
→ Use TaskOutput (both), then pass the combined output to a general-purpose Task agent for synthesis
Option 4: "Generate flamegraph SVG"
Description: "Render an SVG flamegraph using flamegraph.pl (must be installed)"
→ Launch Task (subagent_type=Bash) in the background:
flamegraph.pl --title "Allocation Profile" --countname bytes --width 1800 \
PATH_TO_FILE > /tmp/alloc_flamegraph.svg
→ Use TaskOutput to wait for completion
→ Report the output path /tmp/alloc_flamegraph.svg and remind user to open it in a browser
Option 5: "Done"
Description: "Exit without further analysis"
IMPORTANT: For every drill-down option (1–4):
Launch the Task with run_in_background: true and wait with TaskOutput
Pass the output to a general-purpose Task agent for a concise, human-readable summary before showing it to the user
Repeat drill-down (return to the AskUserQuestion) until the user selects "Done".
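For the keyword search in Option 3, the two background scripts could look roughly like the combined sketch below. It is an assumption about one reasonable shape, not the skill's actual script: the function name `search_keyword`, the byte-weighted neighbor counting, and the sample traces are all hypothetical. It filters traces containing the keyword, sums their values, and counts the frames immediately adjacent to each matching frame as candidate related call paths.

```python
from collections import Counter

def search_keyword(traces, keyword, top=5):
    """Return (total value, matching traces sorted by size, top neighboring frames)."""
    matches = [(v, s) for v, s in traces if keyword in s]
    neighbors = Counter()
    for value, stack in matches:
        frames = stack.split(";")
        for i, frame in enumerate(frames):
            if keyword not in frame:
                continue
            # Count the frames directly before and after the match, weighted by value.
            for j in (i - 1, i + 1):
                if 0 <= j < len(frames) and keyword not in frames[j]:
                    neighbors[frames[j]] += value
    matches.sort(reverse=True)
    return sum(v for v, _ in matches), matches, neighbors.most_common(top)

# Made-up traces in the same (value, stack) shape the agent scripts use.
sample_traces = [
    (500, "a;MergeTask;alloc"),
    (200, "b;MergeTask;alloc"),
    (100, "c;Other;alloc"),
]
total_bytes, matched, related = search_keyword(sample_traces, "MergeTask")
print(total_bytes, len(matched), related)
```

Weighting neighbors by value rather than by occurrence count keeps a frequently seen but cheap call path from outranking a rare but expensive one; either choice is defensible.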
For mangled symbols, use jeprof --demangle or pipe through c++filt
python3 -
/alloc-profile: Find .collapsed files and prompt for selection
/alloc-profile jemalloc-profile-2026-02-19T13-08-59-825Z.collapsed: Analyze a specific file
/alloc-profile /tmp/prod-heap-dump.collapsed: Analyze an absolute path