Upgrade vllm-omni NPU model runners (OmniNPUModelRunner, NPUARModelRunner, NPUGenerationModelRunner) to align with the latest vllm-ascend NPUModelRunner while preserving omni-specific logic.
This skill guides the process of upgrading vllm-omni's NPU model runners to align with the latest vllm-ascend codebase while preserving omni-specific enhancements. The NPU runners are designed to run omni multimodal models (like Qwen3-Omni, Bagel, MiMoAudio) on Ascend NPUs.
```
vllm-omni/vllm_omni/platforms/npu/worker/
├── __init__.py
├── npu_model_runner.py             # OmniNPUModelRunner (base class)
├── npu_ar_model_runner.py          # NPUARModelRunner (autoregressive)
├── npu_ar_worker.py                # AR worker
├── npu_generation_model_runner.py  # NPUGenerationModelRunner (diffusion/non-AR)
└── npu_generation_worker.py        # Generation worker
```

```
vllm-omni/vllm_omni/worker/
├── __init__.py
├── gpu_model_runner.py             # OmniGPUModelRunner
├── gpu_ar_model_runner.py          # GPUARModelRunner
├── gpu_ar_worker.py
├── gpu_generation_model_runner.py
├── gpu_generation_worker.py
├── mixins.py
├── base.py
└── gpu_memory_utils.py
```

```
vllm-ascend/vllm_ascend/worker/
├── model_runner_v1.py              # NPUModelRunner (base class to copy from)
├── npu_input_batch.py
├── block_table.py
├── pcp_utils.py
└── worker.py
```
```
                    GPUModelRunner (vllm)
                            |
           +----------------+----------------+
           |                                 |
  OmniGPUModelRunner               NPUModelRunner (vllm-ascend)
  (vllm_omni/worker)               (vllm_ascend/worker)
           |                                 |
           +------- OmniNPUModelRunner ------+
                 (multiple inheritance)
                            |
            +---------------+---------------+
            |                               |
   NPUARModelRunner            NPUGenerationModelRunner
   (autoregressive)         (non-autoregressive/diffusion)
```
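The multiple inheritance above determines which implementation wins when both parents define the same method. The stand-in sketch below (class bodies are placeholders, not the real vllm code) shows Python's C3 MRO putting the omni overrides ahead of the vllm-ascend base, so cooperative `super()` calls flow Omni → Ascend → vllm:

```python
# Stand-in classes mirroring the real hierarchy; the bodies are
# placeholders, not the actual vllm implementations.
class GPUModelRunner:                        # vllm
    def load_model(self):
        return "base"

class OmniGPUModelRunner(GPUModelRunner):    # vllm_omni/worker
    def load_model(self):
        return "omni+" + super().load_model()

class NPUModelRunner(GPUModelRunner):        # vllm-ascend
    def load_model(self):
        return "npu+" + super().load_model()

class OmniNPUModelRunner(OmniGPUModelRunner, NPUModelRunner):
    pass

# MRO consults the omni override first, then the Ascend base:
# OmniNPUModelRunner -> OmniGPUModelRunner -> NPUModelRunner
#   -> GPUModelRunner -> object
# so load_model() resolves to "omni+npu+base".
```

This is why re-inserting omni logic as overrides (rather than forking the Ascend base) keeps both codebases upgradable independently.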
Omni-specific logic is marked with comment blocks:
```python
# -------------------------------------- Omni-new -------------------------------------------------
# ... omni-specific code ...
# -------------------------------------- Omni-new -------------------------------------------------
```

Or simpler variations:

```python
# -------------------------------------- Omni-new -------------------------------------------------
# ------------------------------------------------------------------------------------------------
```
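A small helper makes these marker spans explicit when auditing a runner file. The function below is an illustrative sketch, not part of the repo; it pairs lines that contain the marker, so the simpler dashed-closer variation would need its closing line handled separately:

```python
def find_omni_blocks(source: str, marker: str = "Omni-new"):
    """Return 1-based (start, end) line spans between paired marker comments.

    Only pairs of comment lines that both contain `marker` are detected;
    the dashed-only closer variation is not matched by this sketch.
    """
    spans, open_line = [], None
    for lineno, line in enumerate(source.splitlines(), start=1):
        if marker in line and line.lstrip().startswith("#"):
            if open_line is None:
                open_line = lineno                  # opening marker
            else:
                spans.append((open_line, lineno))   # closing marker
                open_line = None
    return spans
```

Running it over `npu_model_runner.py` gives the same spans you would find with `grep -n "Omni-new"`, but already paired into blocks.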
Important: the reference file (`references/omni-specific-blocks.md`) may not be up-to-date. Always grep for `Omni-new` in the GPU implementations to find the authoritative list of omni-specific blocks.

OmniNPUModelRunner (`npu_model_runner.py`):

| Method | Description | Omni-Specific Logic |
|---|---|---|
| `load_model` | Load the model and initialize `talker_mtp` | Uses `ACLGraphWrapper` instead of `CUDAGraphWrapper`; initializes talker buffers |
| `_dummy_run` | Warmup/profiling run | `talker_mtp` dummy forward, `extract_multimodal_outputs` |
| `_model_forward` | Forward-pass wrapper | Injects `model_kwargs_extra`, wraps output in `OmniOutput`, NPU-specific graph updates |
| `_talker_mtp_forward` | Talker MTP forward for Qwen3-Omni | Uses `set_ascend_forward_context` |
NPUARModelRunner (`npu_ar_model_runner.py`):

| Method | Description | Omni-Specific Logic |
|---|---|---|
| `__init__` | Initialize with KV transfer manager | `OmniKVTransferManager` setup |
| `execute_model` | Main inference entry | KV transfer handling, `_update_states` override, `extract_multimodal_outputs` |
| `sample_tokens` | Token sampling | Hidden-states extraction, multimodal outputs processing, `OmniModelRunnerOutput` |
| `_resolve_global_request_id` | Request ID resolution | For disaggregated inference |
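Porting a method like `execute_model` follows a copy-and-splice pattern: the upstream vllm-ascend body is copied, then the fenced `Omni-new` blocks are re-inserted at the right points. The runnable miniature below shows the resulting shape; every class and method body is a stub, not the real vllm code:

```python
class AscendNPUModelRunner:
    """Stub standing in for vllm-ascend's NPUModelRunner."""
    def _model_forward(self, scheduler_output):
        return {"hidden_states": scheduler_output}

class NPUARModelRunnerSketch(AscendNPUModelRunner):
    def extract_multimodal_outputs(self, hidden):
        # Stub for the omni multimodal extraction helper.
        return {"audio": None}

    def execute_model(self, scheduler_output):
        # ... body copied verbatim from the upstream runner ...
        hidden = self._model_forward(scheduler_output)

        # ----- Omni-new -----
        multimodal = self.extract_multimodal_outputs(hidden)
        # ----- Omni-new -----

        # ... remainder of the copied upstream body ...
        return hidden, multimodal
```

Keeping the fences around every spliced block is what makes the next upgrade greppable.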
NPUGenerationModelRunner (`npu_generation_model_runner.py`):

| Method | Description | Omni-Specific Logic |
|---|---|---|
| `_update_request_states` | Update request states for async chunk | `async_chunk` handling |
| `execute_model` | Generation forward | `async_chunk`, `seq_token_counts`, `_run_generation_model` |
| `sample_tokens` | Output processing | Multimodal output packaging into `OmniModelRunnerOutput` |
| `_dummy_run` | Dummy run override | `model_kwargs` initialization, multimodal extraction |
| `_run_generation_model` | Run generation model | Calls `_model_forward` with sampler |
Identify target versions (use the gh CLI to check):
Check GPU-side changes (since last release):
```shell
cd /root/vllm-workspace/vllm-omni
git log --oneline --since="<last-release-date>" -- vllm_omni/worker/
```
Read the latest vllm-ascend code:

```
/root/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py
```

For each NPU model runner file:
Extract existing omni-specific blocks:

```shell
grep -n "Omni-new" vllm_omni/platforms/npu/worker/npu_model_runner.py
```
Document each omni block:
Note: Always check the GPU implementation gpu_model_runner.py for any new omni logic not yet documented in references.
Read the latest vllm-ascend NPUModelRunner.load_model
Copy the method, keeping the structure
Re-insert omni-specific logic (check GPU gpu_model_runner.py for authoritative list):
- Replace `CUDAGraphWrapper` with `ACLGraphWrapper`

Update `_dummy_run`:
- Check the GPU `_dummy_run` for omni-specific blocks
- Re-insert `Omni-new` marked code from the GPU version

Update `_model_forward`:
Compare with GPU gpu_ar_model_runner.py for any new omni features
Copy execute_model from vllm-ascend
Re-insert omni blocks (reference references/omni-specific-blocks.md, but note it may be incomplete):
- Grep `gpu_ar_model_runner.py` for all `Omni-new` marked code blocks
- Cross-check `references/omni-specific-blocks.md`

Update `sample_tokens` (also compare with GPU implementation):
- Compare with `gpu_ar_model_runner.py`'s `sample_tokens` method
- Re-insert `Omni-new` marked code blocks

Note: the generation model runner may have unique omni logic for diffusion/non-AR models.
Compare with GPU gpu_generation_model_runner.py - grep for all Omni-new blocks
Update execute_model:
- Re-insert the `seq_token_counts` injection

Update `_dummy_run`:
- Port the omni logic in `_dummy_run`, if it exists

Check and update imports at the top of each file:
```python
# Common vllm-ascend imports
from vllm_ascend.ascend_forward_context import get_forward_context, set_ascend_forward_context
from vllm_ascend.attention.attention_v1 import AscendAttentionState
from vllm_ascend.attention.utils import using_paged_attention
from vllm_ascend.compilation.acl_graph import ACLGraphWrapper, update_full_graph_params
from vllm_ascend.ops.rotary_embedding import update_cos_sin
from vllm_ascend.utils import enable_sp, lmhead_tp_enable
from vllm_ascend.worker.model_runner_v1 import SEQ_LEN_WITH_MAX_PA_WORKSPACE, NPUModelRunner

# Omni-specific imports
from vllm_omni.model_executor.models.output_templates import OmniOutput
from vllm_omni.worker.gpu_model_runner import OmniGPUModelRunner
from vllm_omni.outputs import OmniModelRunnerOutput
from vllm_omni.distributed.omni_connectors.kv_transfer_manager import OmniKVTransferManager
```
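This import list drifts between releases, so a quick importability check before the full import test saves a round trip. The `missing_modules` helper below is illustrative, not part of the repo:

```python
import importlib

def missing_modules(names):
    """Return the subset of dotted module names that fail to import."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing

# Example usage against the modules listed above (in an environment
# where vllm-ascend and vllm-omni are installed):
# missing_modules(["vllm_ascend.compilation.acl_graph", "vllm_omni.outputs"])
```

An empty return value means every listed module still resolves after the upgrade.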
Check recent GPU worker changes:
```shell
git diff <from-tag>..<to-tag> -- vllm_omni/worker/gpu_model_runner.py
git diff <from-tag>..<to-tag> -- vllm_omni/worker/gpu_ar_model_runner.py
```
Identify new omni features that need to be ported to NPU
Apply corresponding changes to NPU runners
Run a syntax check (`py_compile` verifies syntax, not types):

```shell
cd /root/vllm-workspace/vllm-omni
python -m py_compile vllm_omni/platforms/npu/worker/npu_model_runner.py
python -m py_compile vllm_omni/platforms/npu/worker/npu_ar_model_runner.py
python -m py_compile vllm_omni/platforms/npu/worker/npu_generation_model_runner.py
```
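The same check can be scripted so newly added runner files are not missed. The `compile_check` helper below is an illustrative sketch, assuming the worker directory layout shown earlier:

```python
import py_compile
from pathlib import Path

def compile_check(worker_dir: str):
    """py_compile every .py file under worker_dir; return the files that fail.

    Each failure is a (path, error message) pair; an empty list means
    all files passed the syntax check.
    """
    failures = []
    for path in sorted(Path(worker_dir).glob("*.py")):
        try:
            py_compile.compile(str(path), doraise=True)
        except py_compile.PyCompileError as err:
            failures.append((str(path), str(err)))
    return failures

# Example usage:
# compile_check("vllm_omni/platforms/npu/worker")
```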
Run import test:
```shell
python -c "from vllm_omni.platforms.npu.worker import *"
```
Run model serving test (if hardware available):
```shell
vllm serve <model-path> --trust-remote-code
```
Key GPU→NPU API differences:
- `set_forward_context` → `set_ascend_forward_context`
- `CUDAGraphWrapper` → `ACLGraphWrapper`
- `_make_buffer` returns a different structure
- `AscendCommonAttentionMetadata`
- `AscendSampler`

Verification checklist:
- No `CUDAGraphWrapper` references in NPU code
- `set_ascend_forward_context` used instead of `set_forward_context`
- `ACLGraphWrapper` used for `talker_mtp` wrapping

When upgrading, keep these files open for reference:
- /root/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py
- /root/vllm-workspace/vllm/vllm/v1/worker/gpu_model_runner.py
- /root/vllm-workspace/vllm-omni/vllm_omni/worker/gpu_model_runner.py
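The checklist of leftover GPU-only symbols can be automated with a simple scan. The `leftover_gpu_symbols` helper below is a hypothetical sketch; note that `set_ascend_forward_context` and `ACLGraphWrapper` do not trigger it, because neither contains the forbidden names as a substring:

```python
# GPU-only symbols that must not survive in the NPU runners after the upgrade.
FORBIDDEN = ("CUDAGraphWrapper", "set_forward_context")

def leftover_gpu_symbols(source: str):
    """Return (lineno, symbol) pairs where a GPU-only symbol still appears."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for sym in FORBIDDEN:
            if sym in line:
                hits.append((lineno, sym))
    return hits

# Example usage:
# leftover_gpu_symbols(
#     open("vllm_omni/platforms/npu/worker/npu_model_runner.py").read())
```

An empty result for all three NPU runner files closes out the verification checklist.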