Upgrade vllm-omni NPU model runners (OmniNPUModelRunner, NPUARModelRunner, NPUGenerationModelRunner) to align with the latest vllm-ascend NPUModelRunner while preserving omni-specific logic.
This skill guides the process of upgrading vllm-omni's NPU model runners to align with the latest vllm-ascend codebase while preserving omni-specific enhancements. The NPU runners are designed to run omni multimodal models (like Qwen3-Omni, Bagel, MiMoAudio) on Ascend NPUs.
```
vllm-omni/vllm_omni/platforms/npu/worker/
├── __init__.py
├── npu_model_runner.py             # OmniNPUModelRunner (base class)
├── npu_ar_model_runner.py          # NPUARModelRunner (autoregressive)
├── npu_ar_worker.py                # AR worker
├── npu_generation_model_runner.py  # NPUGenerationModelRunner (diffusion/non-AR)
└── npu_generation_worker.py        # Generation worker
```

```
vllm-omni/vllm_omni/worker/
├── __init__.py
├── gpu_model_runner.py             # OmniGPUModelRunner
├── gpu_ar_model_runner.py          # GPUARModelRunner
├── gpu_ar_worker.py
├── gpu_generation_model_runner.py
├── gpu_generation_worker.py
├── mixins.py
├── base.py
└── gpu_memory_utils.py
```

```
vllm-ascend/vllm_ascend/worker/
├── model_runner_v1.py              # NPUModelRunner (base class to copy from)
├── npu_input_batch.py
├── block_table.py
├── pcp_utils.py
└── worker.py
```
```
                    GPUModelRunner (vllm)
                            |
           +----------------+----------------+
           |                                 |
  OmniGPUModelRunner               NPUModelRunner (vllm-ascend)
  (vllm_omni/worker)               (vllm_ascend/worker)
           |                                 |
           +------- OmniNPUModelRunner ------+
                 (multiple inheritance)
                            |
            +---------------+---------------+
            |                               |
   NPUARModelRunner            NPUGenerationModelRunner
   (autoregressive)         (non-autoregressive/diffusion)
```
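The multiple inheritance above determines which implementation wins when both parents define the same method. The stand-in sketch below (class bodies are placeholders, not the real vllm code) shows Python's C3 MRO putting the omni overrides ahead of the vllm-ascend base, so cooperative `super()` calls flow Omni → Ascend → vllm:

```python
# Stand-in classes mirroring the real hierarchy; the bodies are
# placeholders, not the actual vllm implementations.
class GPUModelRunner:                        # vllm
    def load_model(self):
        return "base"

class OmniGPUModelRunner(GPUModelRunner):    # vllm_omni/worker
    def load_model(self):
        return "omni+" + super().load_model()

class NPUModelRunner(GPUModelRunner):        # vllm-ascend
    def load_model(self):
        return "npu+" + super().load_model()

class OmniNPUModelRunner(OmniGPUModelRunner, NPUModelRunner):
    pass

# MRO consults the omni override first, then the Ascend base:
# OmniNPUModelRunner -> OmniGPUModelRunner -> NPUModelRunner
#   -> GPUModelRunner -> object
# so load_model() resolves to "omni+npu+base".
```

This is why re-inserting omni logic as overrides (rather than forking the Ascend base) keeps both codebases upgradable independently.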
Omni-specific logic is marked with comment blocks:
```python
# -------------------------------------- Omni-new -------------------------------------------------
# ... omni-specific code ...
# -------------------------------------- Omni-new -------------------------------------------------
```

Or simpler variations:

```python
# -------------------------------------- Omni-new -------------------------------------------------
# ------------------------------------------------------------------------------------------------
```
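A small helper makes these marker spans explicit when auditing a runner file. The function below is an illustrative sketch, not part of the repo; it pairs lines that contain the marker, so the simpler dashed-closer variation would need its closing line handled separately:

```python
def find_omni_blocks(source: str, marker: str = "Omni-new"):
    """Return 1-based (start, end) line spans between paired marker comments.

    Only pairs of comment lines that both contain `marker` are detected;
    the dashed-only closer variation is not matched by this sketch.
    """
    spans, open_line = [], None
    for lineno, line in enumerate(source.splitlines(), start=1):
        if marker in line and line.lstrip().startswith("#"):
            if open_line is None:
                open_line = lineno                  # opening marker
            else:
                spans.append((open_line, lineno))   # closing marker
                open_line = None
    return spans
```

Running it over `npu_model_runner.py` gives the same spans you would find with `grep -n "Omni-new"`, but already paired into blocks.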
Important: the reference file (`references/omni-specific-blocks.md`) may not be up-to-date. Always grep for `Omni-new` in the GPU implementations to find the authoritative list of omni-specific blocks.

OmniNPUModelRunner (`npu_model_runner.py`):

| Method | Description | Omni-Specific Logic |
|---|---|---|
| `load_model` | Load the model and initialize `talker_mtp` | Uses `ACLGraphWrapper` instead of `CUDAGraphWrapper`; initializes talker buffers |
| `_dummy_run` | Warmup/profiling run | `talker_mtp` dummy forward, `extract_multimodal_outputs` |
| `_model_forward` | Forward-pass wrapper | Injects `model_kwargs_extra`, wraps output in `OmniOutput`, NPU-specific graph updates |
| `_talker_mtp_forward` | Talker MTP forward for Qwen3-Omni | Uses `set_ascend_forward_context` |
NPUARModelRunner (`npu_ar_model_runner.py`):

| Method | Description | Omni-Specific Logic |
|---|---|---|
| `__init__` | Initialize with KV transfer manager | `OmniKVTransferManager` setup |
| `execute_model` | Main inference entry | KV transfer handling, `_update_states` override, `extract_multimodal_outputs` |
| `sample_tokens` | Token sampling | Hidden-states extraction, multimodal outputs processing, `OmniModelRunnerOutput` |
| `_resolve_global_request_id` | Request ID resolution | For disaggregated inference |
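Porting a method like `execute_model` follows a copy-and-splice pattern: the upstream vllm-ascend body is copied, then the fenced `Omni-new` blocks are re-inserted at the right points. The runnable miniature below shows the resulting shape; every class and method body is a stub, not the real vllm code:

```python
class AscendNPUModelRunner:
    """Stub standing in for vllm-ascend's NPUModelRunner."""
    def _model_forward(self, scheduler_output):
        return {"hidden_states": scheduler_output}

class NPUARModelRunnerSketch(AscendNPUModelRunner):
    def extract_multimodal_outputs(self, hidden):
        # Stub for the omni multimodal extraction helper.
        return {"audio": None}

    def execute_model(self, scheduler_output):
        # ... body copied verbatim from the upstream runner ...
        hidden = self._model_forward(scheduler_output)

        # ----- Omni-new -----
        multimodal = self.extract_multimodal_outputs(hidden)
        # ----- Omni-new -----

        # ... remainder of the copied upstream body ...
        return hidden, multimodal
```

Keeping the fences around every spliced block is what makes the next upgrade greppable.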
NPUGenerationModelRunner (`npu_generation_model_runner.py`):

| Method | Description | Omni-Specific Logic |
|---|---|---|
| `_update_request_states` | Update request states for async chunk | `async_chunk` handling |
| `execute_model` | Generation forward | `async_chunk`, `seq_token_counts`, `_run_generation_model` |
| `sample_tokens` | Output processing | Multimodal output packaging into `OmniModelRunnerOutput` |
| `_dummy_run` | Dummy run override | `model_kwargs` initialization, multimodal extraction |
| `_run_generation_model` | Run generation model | Calls `_model_forward` with sampler |
Identify target versions (use the gh CLI to check):
Check GPU-side changes (since last release):
```shell
cd /root/vllm-workspace/vllm-omni
git log --oneline --since="<last-release-date>" -- vllm_omni/worker/
```
Read the latest vllm-ascend code:

```
/root/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py
```

For each NPU model runner file:
Extract existing omni-specific blocks:

```shell
grep -n "Omni-new" vllm_omni/platforms/npu/worker/npu_model_runner.py
```
Document each omni block:
Note: Always check the GPU implementation gpu_model_runner.py for any new omni logic not yet documented in references.
Read the latest vllm-ascend NPUModelRunner.load_model
Copy the method, keeping the structure
Re-insert omni-specific logic (check GPU gpu_model_runner.py for authoritative list):
- Replace `CUDAGraphWrapper` with `ACLGraphWrapper`

Update `_dummy_run`:
- Check the GPU `_dummy_run` for omni-specific blocks
- Re-insert `Omni-new` marked code from the GPU version

Update `_model_forward`:
Compare with GPU gpu_ar_model_runner.py for any new omni features
Copy execute_model from vllm-ascend
Re-insert omni blocks (reference references/omni-specific-blocks.md, but note it may be incomplete):
- Grep `gpu_ar_model_runner.py` for all `Omni-new` marked code blocks
- Cross-check `references/omni-specific-blocks.md`

Update `sample_tokens` (also compare with GPU implementation):
- Compare with `gpu_ar_model_runner.py`'s `sample_tokens` method
- Re-insert `Omni-new` marked code blocks

Note: the generation model runner may have unique omni logic for diffusion/non-AR models.
Compare with GPU gpu_generation_model_runner.py - grep for all Omni-new blocks
Update execute_model:
- Re-insert the `seq_token_counts` injection

Update `_dummy_run`:
- Port the omni logic in `_dummy_run`, if it exists

Check and update imports at the top of each file:
```python
# Common vllm-ascend imports
from vllm_ascend.ascend_forward_context import get_forward_context, set_ascend_forward_context
from vllm_ascend.attention.attention_v1 import AscendAttentionState
from vllm_ascend.attention.utils import using_paged_attention
from vllm_ascend.compilation.acl_graph import ACLGraphWrapper, update_full_graph_params
from vllm_ascend.ops.rotary_embedding import update_cos_sin
from vllm_ascend.utils import enable_sp, lmhead_tp_enable
from vllm_ascend.worker.model_runner_v1 import SEQ_LEN_WITH_MAX_PA_WORKSPACE, NPUModelRunner

# Omni-specific imports
from vllm_omni.model_executor.models.output_templates import OmniOutput
from vllm_omni.worker.gpu_model_runner import OmniGPUModelRunner
from vllm_omni.outputs import OmniModelRunnerOutput
from vllm_omni.distributed.omni_connectors.kv_transfer_manager import OmniKVTransferManager
```
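This import list drifts between releases, so a quick importability check before the full import test saves a round trip. The `missing_modules` helper below is illustrative, not part of the repo:

```python
import importlib

def missing_modules(names):
    """Return the subset of dotted module names that fail to import."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing

# Example usage against the modules listed above (in an environment
# where vllm-ascend and vllm-omni are installed):
# missing_modules(["vllm_ascend.compilation.acl_graph", "vllm_omni.outputs"])
```

An empty return value means every listed module still resolves after the upgrade.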
Check recent GPU worker changes:
```shell
git diff <from-tag>..<to-tag> -- vllm_omni/worker/gpu_model_runner.py
git diff <from-tag>..<to-tag> -- vllm_omni/worker/gpu_ar_model_runner.py
```
Identify new omni features that need to be ported to NPU
Apply corresponding changes to NPU runners
Run a syntax check (`py_compile` verifies syntax, not types):

```shell
cd /root/vllm-workspace/vllm-omni
python -m py_compile vllm_omni/platforms/npu/worker/npu_model_runner.py
python -m py_compile vllm_omni/platforms/npu/worker/npu_ar_model_runner.py
python -m py_compile vllm_omni/platforms/npu/worker/npu_generation_model_runner.py
```
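The same check can be scripted so newly added runner files are not missed. The `compile_check` helper below is an illustrative sketch, assuming the worker directory layout shown earlier:

```python
import py_compile
from pathlib import Path

def compile_check(worker_dir: str):
    """py_compile every .py file under worker_dir; return the files that fail.

    Each failure is a (path, error message) pair; an empty list means
    all files passed the syntax check.
    """
    failures = []
    for path in sorted(Path(worker_dir).glob("*.py")):
        try:
            py_compile.compile(str(path), doraise=True)
        except py_compile.PyCompileError as err:
            failures.append((str(path), str(err)))
    return failures

# Example usage:
# compile_check("vllm_omni/platforms/npu/worker")
```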
Run import test:
```shell
python -c "from vllm_omni.platforms.npu.worker import *"
```
Run model serving test (if hardware available):
```shell
vllm serve <model-path> --trust-remote-code
```
Key GPU→NPU API differences:
- `set_forward_context` → `set_ascend_forward_context`
- `CUDAGraphWrapper` → `ACLGraphWrapper`
- `_make_buffer` returns a different structure
- `AscendCommonAttentionMetadata`
- `AscendSampler`

Verification checklist:
- No `CUDAGraphWrapper` references in NPU code
- `set_ascend_forward_context` used instead of `set_forward_context`
- `ACLGraphWrapper` used for `talker_mtp` wrapping

When upgrading, keep these files open for reference:
- /root/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py
- /root/vllm-workspace/vllm/vllm/v1/worker/gpu_model_runner.py
- /root/vllm-workspace/vllm-omni/vllm_omni/worker/gpu_model_runner.py
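The checklist of leftover GPU-only symbols can be automated with a simple scan. The `leftover_gpu_symbols` helper below is a hypothetical sketch; note that `set_ascend_forward_context` and `ACLGraphWrapper` do not trigger it, because neither contains the forbidden names as a substring:

```python
# GPU-only symbols that must not survive in the NPU runners after the upgrade.
FORBIDDEN = ("CUDAGraphWrapper", "set_forward_context")

def leftover_gpu_symbols(source: str):
    """Return (lineno, symbol) pairs where a GPU-only symbol still appears."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for sym in FORBIDDEN:
            if sym in line:
                hits.append((lineno, sym))
    return hits

# Example usage:
# leftover_gpu_symbols(
#     open("vllm_omni/platforms/npu/worker/npu_model_runner.py").read())
```

An empty result for all three NPU runner files closes out the verification checklist.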