Run, benchmark, and serve local GGUF / llama.cpp models on AMD Strix Halo systems such as Framework Desktop using kyuz0 toolboxes, unified-memory sizing, and reproducible podman commands. Use this skill for Strix Halo local LLM work, especially llama.cpp setup, memory-fit or performance tuning, toolbox benchmarking, and podman-based serving.
Use this skill to help users run, benchmark, and serve GGUF models on AMD Strix Halo hardware.
The core pattern is: use kyuz0 toolboxes for interactive experimentation and benchmarking, and plain `podman run` for reproducible serving. This keeps experimentation and operations separate.
Start by choosing the right backend.
| Backend | When to prefer it | Notes |
|---|---|---|
| `vulkan-radv` | Default recommendation | Best balance of compatibility and simplicity. Use this unless the user specifically needs max BF16 throughput. |
| `rocm-7.2` (or similar ROCm toolbox) | User wants the fastest BF16 path | More moving parts than Vulkan, but generally better BF16 throughput on Strix Halo. |
| `vulkan-amdvlk` | Only if the user explicitly wants to try it | Can be fast, but large models may fail because of the single-buffer allocation limit. |
If the user is unsure, recommend vulkan-radv first.
Follow this sequence.
Figure out whether the user wants:

- a quick one-off test or benchmark
- memory-fit or context-size guidance
- a reproducible, long-running server

If the user says "optimal", clarify whether they mean quality, latency, or operational simplicity.
Recommend the toolbox workflow for one-off checks and tuning:

```shell
toolbox create llama-vulkan-radv \
  --image docker.io/kyuz0/amd-strix-halo-toolboxes:vulkan-radv \
  -- --device /dev/dri --group-add video --security-opt seccomp=unconfined

toolbox enter llama-vulkan-radv
```
If `toolbox enter` has terminal issues, use non-interactive commands instead:

```shell
toolbox run -c llama-vulkan-radv llama-cli --list-devices
```
Use toolbox for:

- `llama-cli --list-devices`
- `gguf-vram-estimator.py`
- `llama-bench`
- `llama-cli` tests

Do not position toolbox as the final serving story if the user wants something reproducible. For that, prefer `podman run`.
Run:

```shell
toolbox run -c llama-vulkan-radv llama-cli --list-devices
```
Important interpretation notes:

- `uma: 1` is expected and good on Strix Halo
- `bf16: 0` on Vulkan RADV does not mean BF16 weights cannot run

Always use the estimator before recommending large contexts or BF16 on big models:
```shell
toolbox run -c llama-vulkan-radv \
  gguf-vram-estimator.py /path/to/model-or-first-shard.gguf --contexts 16384 32768 65536 131072
```
Rules and safe defaults: treat the estimator output as a lower bound, account for weights plus KV cache plus compute buffers, and keep the total comfortably below available unified memory before recommending a context size.
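As a sanity check on why context size dominates memory planning, the KV-cache footprint can be sketched with simple arithmetic. All model dimensions below are hypothetical placeholders, not any specific model's specs; `gguf-vram-estimator.py` remains the authoritative tool.

```shell
# Back-of-envelope KV-cache sizing. The layer/head/dim numbers are
# hypothetical placeholders -- read real values from the GGUF metadata.
n_layers=48
n_kv_heads=8
head_dim=128
ctx=32768
bytes_per_elem=2   # F16 K and V entries

# KV bytes = 2 (K and V) * layers * kv_heads * head_dim * context * element size
kv_bytes=$((2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem))
kv_gib=$((kv_bytes / 1024 / 1024 / 1024))
echo "Approximate KV cache at ctx=${ctx}: ${kv_gib} GiB"
```

Doubling the context roughly doubles this figure, which is why the estimator should gate any recommendation above the default context sizes.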
Use llama-bench to measure both short generation and prefilled-context behaviour.
If the benchmark is likely to run for a while, or if you want to poll output safely without blocking the main session, invoke the tmux skill if available and run the benchmark there.
Short benchmark:

```shell
toolbox run -c llama-vulkan-radv \
  llama-bench -m /path/to/model.gguf -p 512 -n 128 -ngl 999 -fa 1 -mmp 0 -r 1 -o md
```
Long-context benchmark:

```shell
toolbox run -c llama-vulkan-radv \
  llama-bench -m /path/to/model.gguf -p 2048 -n 32 -d 16384 -ngl 999 -fa 1 -mmp 0 -r 1 -o md
```
Use these numbers to explain trade-offs clearly.
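One way to make the trade-off concrete is to turn the two `llama-bench` figures (prompt-processing and generation tokens/sec) into an end-to-end latency estimate. The throughput numbers below are invented placeholders, not measured results:

```shell
# Hypothetical throughput numbers for illustration only -- substitute the
# pp/tg values reported by llama-bench on the actual hardware.
pp_tps=450        # prompt-processing tokens/sec
tg_tps=11         # generation tokens/sec
prompt_tokens=16384
gen_tokens=512

# Total latency ~= prompt_tokens/pp_tps + gen_tokens/tg_tps
total_s=$(awk -v pp="$pp_tps" -v tg="$tg_tps" -v p="$prompt_tokens" -v g="$gen_tokens" \
  'BEGIN { printf "%.0f", p / pp + g / tg }')
echo "Estimated time to first full response: ${total_s}s"
```

At long contexts, generation throughput often dominates perceived latency even when prefill looks fast, which is why both benchmarks matter.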
Benchmarking rules:

- do not leave a `llama-server` process running while starting another heavy benchmark unless the user explicitly wants that

For reproducible serving, prefer a container like this:
```shell
podman run -d \
  --restart=always \
  --name=my-model \
  --device /dev/dri \
  --group-add video \
  --security-opt seccomp=unconfined \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v ~/.cache/llama.cpp:/root/.cache/llama.cpp \
  -p 8080:8080 \
  docker.io/kyuz0/amd-strix-halo-toolboxes:vulkan-radv \
  llama-server \
    -hf user/model:QUANT \
    --host 0.0.0.0 \
    --ctx-size 16384 \
    --no-mmap \
    -ngl 999 \
    -fa on \
    --jinja \
    -a my-model
```
Add model-specific flags as needed.
`llama-server` supports `--api-key KEY` and `--api-key-file FNAME`. The built-in Web UI uses `Authorization: Bearer <key>` and stores the key in browser localStorage; `x-api-key` appears in Anthropic-compatible API examples, not in the built-in Web UI.
These are the baseline defaults to reach for on Strix Halo:

- `--no-mmap`
- `-ngl 999`
- `-fa on`
- `--ctx-size 16384` or `32768` unless the estimator supports more comfortably

Treat these as the default baseline for large GGUFs on this hardware.
If the repo exposes an mmproj file but the user only wants text generation, add:
```shell
--no-mmproj
```
This avoids unnecessary memory use and makes the deployment intent explicit.
Keep the main skill general. For concrete model recipes and worked examples, read the relevant reference file when needed.
- `references/qwen3-5.md` — use this reference when the user explicitly asks about Qwen3.5, Qwen3.5 GGUFs, thinking toggles, or wants a concrete Strix Halo command for that model family.
After the first download, prefer an explicit cached model path over -hf when the user wants deterministic local runs or offline usage:
```shell
-m ~/.cache/huggingface/hub/.../model.gguf
```
This avoids surprises from repo preset lookups and makes it obvious which exact shard set is being loaded.
Use these heuristics when explaining trade-offs.
Treat BF16 on `vulkan-radv` as functional but slow and memory-hungry relative to quantized variants.
From the tested setup behind this skill, `unsloth/Qwen3.5-35B-A3B-GGUF:BF16` on Vulkan RADV was about 10 to 11 tok/s, while the exact same model family in Q4_K_M was roughly 4x faster and much smaller in memory footprint. Treat that as a useful reference example for explaining BF16 vs quantized trade-offs on Strix Halo. See `references/qwen3-5.md` for the full worked example.
Do not overgeneralize those exact numbers. Use the pattern: BF16 runs but is typically several times slower and far larger than a good quantization, so quantized variants are the practical default.
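To ground the size half of that pattern, a rough weight-only comparison can be computed from bits per parameter. BF16 is exactly 16 bits/param; the ~4.8 bits/param used here for Q4_K_M is an approximation, and real GGUF file sizes vary:

```shell
# Weight-size-only comparison for a hypothetical 35B-parameter model.
# 4.8 bits/param for Q4_K_M is an approximation; actual files differ.
params_billions=35
bf16_gib=$(awk -v p="$params_billions" 'BEGIN { printf "%.0f", p * 1e9 * 16  / 8 / 1073741824 }')
q4_gib=$(awk  -v p="$params_billions" 'BEGIN { printf "%.0f", p * 1e9 * 4.8 / 8 / 1073741824 }')
echo "BF16 weights: ~${bf16_gib} GiB; Q4_K_M weights: ~${q4_gib} GiB"
```

KV cache and compute buffers come on top of these figures for both variants, so the estimator still has the final word on fit.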
If the user wants maximum BF16 throughput, tell them ROCm is worth considering. Phrase it as a trade-off: better BF16 throughput in exchange for more moving parts than the Vulkan toolbox.
For very large models, success depends not just on model size but also on host unified-memory tuning.
Do not hardcode kernel or firmware advice unless you have current confirmation.
Instead:

- point the user at the `kyuz0/amd-strix-halo-toolboxes` README for the latest host configuration guidance

Keep answers practical and command-first:

- give a concrete `podman run` command
- include `--no-mmap`, `-fa on`, `-ngl 999`, `--no-mmproj` if applicable, and any model-specific toggles
- use the tmux skill if available for long runs, and avoid concurrent heavy runs