Run, benchmark, and serve local GGUF / llama.cpp models on AMD Strix Halo systems such as Framework Desktop using kyuz0 toolboxes, unified-memory sizing, and reproducible podman commands. Use this skill for Strix Halo local LLM work, especially llama.cpp setup, memory-fit or performance tuning, toolbox benchmarking, and podman-based serving.
Use this skill to help users run, benchmark, and serve GGUF models on AMD Strix Halo hardware.
The core pattern is: use kyuz0 toolboxes for interactive experimentation and benchmarking, and plain `podman run` for reproducible serving. This keeps experimentation and operations separate.
Start by choosing the right backend.
| Backend | When to prefer it | Notes |
|---|---|---|
| `vulkan-radv` | Default recommendation | Best balance of compatibility and simplicity. Use this unless the user specifically needs max BF16 throughput. |
| `rocm-7.2` (or similar ROCm toolbox) | User wants the fastest BF16 path | More moving parts than Vulkan, but generally better BF16 throughput on Strix Halo. |
| `vulkan-amdvlk` | Only if the user explicitly wants to try it | Can be fast, but large models may fail because of the single-buffer allocation limit. |
If the user is unsure, recommend vulkan-radv first.
Follow this sequence.
Figure out whether the user wants:

- a quick one-off test or benchmark
- memory-fit or context-size guidance
- a reproducible, long-running server

If the user says "optimal", clarify whether they mean quality, latency, or operational simplicity.
Recommend the toolbox workflow for one-off checks and tuning:

```shell
toolbox create llama-vulkan-radv \
  --image docker.io/kyuz0/amd-strix-halo-toolboxes:vulkan-radv \
  -- --device /dev/dri --group-add video --security-opt seccomp=unconfined

toolbox enter llama-vulkan-radv
```
If `toolbox enter` has terminal issues, use non-interactive commands instead:

```shell
toolbox run -c llama-vulkan-radv llama-cli --list-devices
```
Use toolbox for:

- `llama-cli --list-devices`
- `gguf-vram-estimator.py`
- `llama-bench`
- `llama-cli` tests

Do not position toolbox as the final serving story if the user wants something reproducible. For that, prefer `podman run`.
Run:

```shell
toolbox run -c llama-vulkan-radv llama-cli --list-devices
```
Important interpretation notes:

- `uma: 1` is expected and good on Strix Halo
- `bf16: 0` on Vulkan RADV does not mean BF16 weights cannot run

Always use the estimator before recommending large contexts or BF16 on big models:
```shell
toolbox run -c llama-vulkan-radv \
  gguf-vram-estimator.py /path/to/model-or-first-shard.gguf --contexts 16384 32768 65536 131072
```
Rules and safe defaults: treat the estimator output as a lower bound, account for weights plus KV cache plus compute buffers, and keep the total comfortably below available unified memory before recommending a context size.
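As a sanity check on why context size dominates memory planning, the KV-cache footprint can be sketched with simple arithmetic. All model dimensions below are hypothetical placeholders, not any specific model's specs; `gguf-vram-estimator.py` remains the authoritative tool.

```shell
# Back-of-envelope KV-cache sizing. The layer/head/dim numbers are
# hypothetical placeholders -- read real values from the GGUF metadata.
n_layers=48
n_kv_heads=8
head_dim=128
ctx=32768
bytes_per_elem=2   # F16 K and V entries

# KV bytes = 2 (K and V) * layers * kv_heads * head_dim * context * element size
kv_bytes=$((2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem))
kv_gib=$((kv_bytes / 1024 / 1024 / 1024))
echo "Approximate KV cache at ctx=${ctx}: ${kv_gib} GiB"
```

Doubling the context roughly doubles this figure, which is why the estimator should gate any recommendation above the default context sizes.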
Use llama-bench to measure both short generation and prefilled-context behaviour.
If the benchmark is likely to run for a while, or if you want to poll output safely without blocking the main session, invoke the tmux skill if available and run the benchmark there.
Short benchmark:

```shell
toolbox run -c llama-vulkan-radv \
  llama-bench -m /path/to/model.gguf -p 512 -n 128 -ngl 999 -fa 1 -mmp 0 -r 1 -o md
```
Long-context benchmark:

```shell
toolbox run -c llama-vulkan-radv \
  llama-bench -m /path/to/model.gguf -p 2048 -n 32 -d 16384 -ngl 999 -fa 1 -mmp 0 -r 1 -o md
```
Use these numbers to explain trade-offs clearly.
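One way to make the trade-off concrete is to turn the two `llama-bench` figures (prompt-processing and generation tokens/sec) into an end-to-end latency estimate. The throughput numbers below are invented placeholders, not measured results:

```shell
# Hypothetical throughput numbers for illustration only -- substitute the
# pp/tg values reported by llama-bench on the actual hardware.
pp_tps=450        # prompt-processing tokens/sec
tg_tps=11         # generation tokens/sec
prompt_tokens=16384
gen_tokens=512

# Total latency ~= prompt_tokens/pp_tps + gen_tokens/tg_tps
total_s=$(awk -v pp="$pp_tps" -v tg="$tg_tps" -v p="$prompt_tokens" -v g="$gen_tokens" \
  'BEGIN { printf "%.0f", p / pp + g / tg }')
echo "Estimated time to first full response: ${total_s}s"
```

At long contexts, generation throughput often dominates perceived latency even when prefill looks fast, which is why both benchmarks matter.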
Benchmarking rules:

- do not leave a `llama-server` process running while starting another heavy benchmark unless the user explicitly wants that

For reproducible serving, prefer a container like this:
```shell
podman run -d \
  --restart=always \
  --name=my-model \
  --device /dev/dri \
  --group-add video \
  --security-opt seccomp=unconfined \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v ~/.cache/llama.cpp:/root/.cache/llama.cpp \
  -p 8080:8080 \
  docker.io/kyuz0/amd-strix-halo-toolboxes:vulkan-radv \
  llama-server \
    -hf user/model:QUANT \
    --host 0.0.0.0 \
    --ctx-size 16384 \
    --no-mmap \
    -ngl 999 \
    -fa on \
    --jinja \
    -a my-model
```
Add model-specific flags as needed.
`llama-server` supports `--api-key KEY` and `--api-key-file FNAME`. The built-in Web UI uses `Authorization: Bearer <key>` and stores the key in browser localStorage; `x-api-key` appears in Anthropic-compatible API examples, not in the built-in Web UI.
These are the baseline defaults to reach for on Strix Halo:

- `--no-mmap`
- `-ngl 999`
- `-fa on`
- `--ctx-size 16384` or `32768` unless the estimator supports more comfortably

Treat these as the default baseline for large GGUFs on this hardware.
If the repo exposes an mmproj file but the user only wants text generation, add:
```shell
--no-mmproj
```
This avoids unnecessary memory use and makes the deployment intent explicit.
Keep the main skill general. For concrete model recipes and worked examples, read the relevant reference file when needed.
- `references/qwen3-5.md` — use this reference when the user explicitly asks about Qwen3.5, Qwen3.5 GGUFs, thinking toggles, or wants a concrete Strix Halo command for that model family.
After the first download, prefer an explicit cached model path over -hf when the user wants deterministic local runs or offline usage:
```shell
-m ~/.cache/huggingface/hub/.../model.gguf
```
This avoids surprises from repo preset lookups and makes it obvious which exact shard set is being loaded.
Use these heuristics when explaining trade-offs.
Treat BF16 on `vulkan-radv` as functional but slow and memory-hungry relative to quantized variants.
From the tested setup behind this skill, `unsloth/Qwen3.5-35B-A3B-GGUF:BF16` on Vulkan RADV was about 10 to 11 tok/s, while the exact same model family in Q4_K_M was roughly 4x faster and much smaller in memory footprint. Treat that as a useful reference example for explaining BF16 vs quantized trade-offs on Strix Halo. See `references/qwen3-5.md` for the full worked example.
Do not overgeneralize those exact numbers. Use the pattern: BF16 runs but is typically several times slower and far larger than a good quantization, so quantized variants are the practical default.
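To ground the size half of that pattern, a rough weight-only comparison can be computed from bits per parameter. BF16 is exactly 16 bits/param; the ~4.8 bits/param used here for Q4_K_M is an approximation, and real GGUF file sizes vary:

```shell
# Weight-size-only comparison for a hypothetical 35B-parameter model.
# 4.8 bits/param for Q4_K_M is an approximation; actual files differ.
params_billions=35
bf16_gib=$(awk -v p="$params_billions" 'BEGIN { printf "%.0f", p * 1e9 * 16  / 8 / 1073741824 }')
q4_gib=$(awk  -v p="$params_billions" 'BEGIN { printf "%.0f", p * 1e9 * 4.8 / 8 / 1073741824 }')
echo "BF16 weights: ~${bf16_gib} GiB; Q4_K_M weights: ~${q4_gib} GiB"
```

KV cache and compute buffers come on top of these figures for both variants, so the estimator still has the final word on fit.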
If the user wants maximum BF16 throughput, tell them ROCm is worth considering. Phrase it as a trade-off: better BF16 throughput in exchange for more moving parts than the Vulkan toolbox.
For very large models, success depends not just on model size but also on host unified-memory tuning.
Do not hardcode kernel or firmware advice unless you have current confirmation.
Instead:

- point the user at the `kyuz0/amd-strix-halo-toolboxes` README for the latest host configuration guidance

Keep answers practical and command-first:

- give a concrete `podman run` command
- include `--no-mmap`, `-fa on`, `-ngl 999`, `--no-mmproj` if applicable, and any model-specific toggles
- use the tmux skill if available for long runs, and avoid concurrent heavy runs