Use when adding a new diffusion model or Diffusers pipeline to SGLang.
Use this skill when adding a new diffusion model or pipeline variant to sglang.multimodal_gen.
Hybrid style: the recommended default for most new models. It uses a three-stage structure:

BeforeDenoisingStage (model-specific) --> DenoisingStage (standard) --> DecodingStage (standard)
Why recommended? Modern diffusion models have highly heterogeneous pre-processing requirements (different text encoders, different latent formats, different conditioning mechanisms). The Hybrid approach keeps pre-processing isolated per model, avoids fragile shared stages with excessive conditional logic, and lets developers port Diffusers reference code quickly.
Modular style: builds the pipeline by composing the framework's fine-grained standard stages (TextEncodingStage, LatentPreparationStage, TimestepPreparationStage, etc.).
This style is appropriate when the model fits the framework's standard patterns: `add_standard_t2i_stages()` or `add_standard_ti2i_stages()` may be all you need. See existing Modular examples: QwenImagePipeline (uses add_standard_t2i_stages), FluxPipeline, WanPipeline.
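For orientation, here is a minimal sketch of what Modular-style composition looks like. Everything below uses stand-in stub classes, not the real SGLang classes (the actual `ComposedPipelineBase` and `add_standard_t2i_stages()` live in the framework, and the stage order shown is an assumption for illustration):

```python
# Illustrative stubs only -- the real classes live in sglang.multimodal_gen.
class StubStage:
    def __init__(self, name):
        self.name = name

class StubComposedPipelineBase:
    """Minimal stand-in for ComposedPipelineBase's composition API."""
    def __init__(self):
        self.stages = []

    def add_stage(self, stage):
        self.stages.append(stage)

    def add_standard_t2i_stages(self):
        # The real helper registers the framework's fine-grained stages;
        # here we just record names to show the resulting composition.
        for name in ("TextEncodingStage", "TimestepPreparationStage",
                     "LatentPreparationStage", "DenoisingStage", "DecodingStage"):
            self.add_stage(StubStage(name))

class MyModularPipeline(StubComposedPipelineBase):
    def create_pipeline_stages(self):
        self.add_standard_t2i_stages()

pipe = MyModularPipeline()
pipe.create_pipeline_stages()
print([s.name for s in pipe.stages])
```

The point is that a Modular pipeline class contains almost no logic of its own; it just declares which standard stages run, in what order.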
| Situation | Recommended Style |
|---|---|
| Model has unique/complex pre-processing (VLM captioning, AR token generation, custom latent packing, etc.) | Hybrid — consolidate into a BeforeDenoisingStage |
| Model fits neatly into standard text-to-image or text+image-to-image pattern | Modular — use add_standard_t2i_stages() / add_standard_ti2i_stages() |
| Porting a Diffusers pipeline with many custom steps | Hybrid — copy the __call__ logic into a single stage |
| Adding a variant of an existing model that shares most logic | Modular — reuse existing stages, customize via PipelineConfig callbacks |
| A specific pre-processing step needs special parallelism or profiling isolation | Modular — extract that step as a dedicated stage |
Key principle (both styles): The stage(s) before DenoisingStage must produce a Req batch object with all the standard tensor fields that DenoisingStage expects (latents, timesteps, prompt_embeds, etc.). As long as this contract is met, the pipeline remains composable regardless of which style you use.
| Purpose | Path |
|---|---|
| Pipeline classes | python/sglang/multimodal_gen/runtime/pipelines/ |
| Model-specific stages | python/sglang/multimodal_gen/runtime/pipelines_core/stages/model_specific_stages/ |
| PipelineStage base class | python/sglang/multimodal_gen/runtime/pipelines_core/stages/base.py |
| Pipeline base class | python/sglang/multimodal_gen/runtime/pipelines_core/composed_pipeline_base.py |
| Standard stages (Denoising, Decoding) | python/sglang/multimodal_gen/runtime/pipelines_core/stages/ |
| Pipeline configs | python/sglang/multimodal_gen/configs/pipeline_configs/ |
| Sampling params | python/sglang/multimodal_gen/configs/sample/ |
| DiT model implementations | python/sglang/multimodal_gen/runtime/models/dits/ |
| VAE implementations | python/sglang/multimodal_gen/runtime/models/vaes/ |
| Encoder implementations | python/sglang/multimodal_gen/runtime/models/encoders/ |
| Scheduler implementations | python/sglang/multimodal_gen/runtime/models/schedulers/ |
| Model/VAE/DiT configs | python/sglang/multimodal_gen/configs/models/dits/, vaes/, encoders/ |
| Central registry | python/sglang/multimodal_gen/registry.py |
Before writing any code, obtain the model's reference implementation or Diffusers pipeline code. You need the actual source code to work from — do not guess or assume the model's architecture. If the user already gave a HuggingFace model ID or repo, inspect that yourself first. Ask the user only when the reference implementation is private, ambiguous, or otherwise unavailable. Typical sources are:
- The `pipeline_*.py` file from the `diffusers` library or the model's HuggingFace repo
- `model_index.json` and the associated pipeline class

Once you have the reference code, study it thoroughly:

- Read `model_index.json` to identify required modules (text_encoder, vae, transformer, scheduler, etc.)
- Trace the `__call__` method end-to-end. Identify the pre-processing steps, the inputs to the denoising loop, and the decoding/post-processing logic.
Before creating any new files, check whether an existing pipeline or stage can be reused or extended. Only create new pipelines/stages when the existing ones would require extensive modifications or when no similar implementation exists.
Specifically:
- An existing `BeforeDenoisingStage` may work with minor parameter differences
- `add_standard_t2i_stages()` / `add_standard_ti2i_stages()` / `add_standard_ti2v_stages()` may cover the model if it fits a standard pattern
- Browse `runtime/pipelines_core/stages/` and `stages/model_specific_stages/`. If an existing stage handles 80%+ of what the new model needs, extend it rather than duplicating it.
- The framework already ships common VAEs (e.g. `AutoencoderKL`), text encoders (CLIP, T5), and schedulers. Reuse these directly instead of re-implementing.

Rule of thumb: Only create a new file when the existing implementation would need substantial structural changes to accommodate the new model, or when no architecturally similar implementation exists.
Adapt or implement the model's core components in the appropriate directories.
DiT/Transformer (runtime/models/dits/{model_name}.py):
# python/sglang/multimodal_gen/runtime/models/dits/my_model.py
import torch
import torch.nn as nn
from sglang.multimodal_gen.runtime.layers.layernorm import (
LayerNormScaleShift,
RMSNormScaleShift,
)
from sglang.multimodal_gen.runtime.layers.attention.selector import (
get_attn_backend,
)
class MyModelTransformer2DModel(nn.Module):
"""DiT model for MyModel.
Adapt from the Diffusers/reference implementation. Key points:
- Use SGLang's fused LayerNorm/RMSNorm ops (see `existing-fast-paths.md` under the benchmark/profile skill)
- Use SGLang's attention backend selector
- Keep the same parameter naming as Diffusers for weight loading compatibility
"""
def __init__(self, config):
super().__init__()
# ... model layers ...
def forward(
self,
hidden_states: torch.Tensor,
encoder_hidden_states: torch.Tensor,
timestep: torch.Tensor,
# ... model-specific kwargs ...
) -> torch.Tensor:
# ... forward pass ...
return output
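Since weight loading relies on parameter names matching the Diffusers checkpoint, a quick key-diff check catches renames early. This is a framework-independent sketch; the parameter names below are made up for illustration (in practice, compare `model.state_dict().keys()` against the checkpoint's keys, e.g. from safetensors):

```python
def diff_state_dict_keys(model_keys, checkpoint_keys):
    """Report parameter names present on one side but not the other."""
    model_keys, checkpoint_keys = set(model_keys), set(checkpoint_keys)
    return {
        "missing_from_model": sorted(checkpoint_keys - model_keys),
        "unexpected_in_model": sorted(model_keys - checkpoint_keys),
    }

# Hypothetical names: a renamed q-projection shows up on both sides of the diff.
ckpt = ["blocks.0.attn.to_q.weight", "blocks.0.attn.to_k.weight"]
model = ["blocks.0.attn.q_proj.weight", "blocks.0.attn.to_k.weight"]
report = diff_state_dict_keys(model, ckpt)
print(report)
```

An empty diff in both directions is a necessary (not sufficient) condition for clean weight loading.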
Tensor Parallel (TP) and Sequence Parallel (SP): For multi-GPU deployment, add TP/SP support to the DiT model. This can be done incrementally after the single-GPU implementation is verified. Reference existing implementations and adapt them to your model's architecture:
Wan video DiT (`runtime/models/dits/wanvideo.py`) — Full TP + SP reference:

- TP: `ColumnParallelLinear` for Q/K/V projections, `RowParallelLinear` for output projections, attention heads divided by `tp_size`
- SP: `get_sp_world_size()`, padding for alignment, `sequence_model_parallel_all_gather` for aggregation
- Cross-attention skips sequence parallelism (`skip_sequence_parallel=is_cross_attention`)

Qwen-Image DiT (`runtime/models/dits/qwen_image.py`) — SP + USPAttention reference:

- `USPAttention` (Ulysses + Ring Attention), configured via `--ulysses-degree` / `--ring-degree`
- `MergedColumnParallelLinear` for QKV (with Nunchaku quantization), `ReplicatedLinear` otherwise

Important: These are references only — each model has its own architecture and parallelism requirements. At minimum, check whether attention head counts divide evenly by `tp_size` and whether sequence lengths need padding to align with the SP world size.
Key imports for distributed support:
from sglang.multimodal_gen.runtime.distributed import (
divide,
get_sp_group,
get_sp_world_size,
get_tp_world_size,
sequence_model_parallel_all_gather,
)
from sglang.multimodal_gen.runtime.layers.linear import (
ColumnParallelLinear,
RowParallelLinear,
ReplicatedLinear,
)
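The core arithmetic behind these imports is simple and worth sanity-checking before wiring real layers: heads must divide evenly across TP ranks, and sequence lengths are padded up to a multiple of the SP world size before scattering. A framework-free sketch (these functions mirror, but are not, the real `divide` helper and padding logic):

```python
def divide(numerator, denominator):
    """Integer division that insists on exact divisibility, as TP sharding does."""
    assert numerator % denominator == 0, (
        f"{numerator} is not divisible by {denominator}"
    )
    return numerator // denominator

def padded_seq_len(seq_len, sp_world_size):
    """Pad a sequence length up to the next multiple of the SP world size."""
    remainder = seq_len % sp_world_size
    return seq_len if remainder == 0 else seq_len + (sp_world_size - remainder)

# e.g. 24 attention heads sharded across 4 TP ranks -> 6 heads per rank
heads_per_rank = divide(24, 4)
# e.g. a 4097-token sequence on 8 SP ranks -> padded to 4104
padded = padded_seq_len(4097, 8)
print(heads_per_rank, padded)
```

If `divide` fails for your model's head count at the target `tp_size`, the model needs a different sharding scheme (or a smaller TP degree).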
VAE (runtime/models/vaes/{model_name}.py): Implement if the model uses a non-standard VAE. Many models reuse existing VAEs.
Encoders (runtime/models/encoders/{model_name}.py): Implement if the model uses custom text/image encoders.
Schedulers (runtime/models/schedulers/{scheduler_name}.py): Implement if the model requires a custom scheduler not available in Diffusers.
DiT Config (configs/models/dits/{model_name}.py):
# python/sglang/multimodal_gen/configs/models/dits/mymodel.py
from dataclasses import dataclass, field
from sglang.multimodal_gen.configs.models.dits.base import DiTConfig
@dataclass
class MyModelDitConfig(DiTConfig):
arch_config: dict = field(default_factory=lambda: {
"in_channels": 16,
"num_layers": 24,
"patch_size": 2,
# ... model-specific architecture params ...
})
VAE Config (configs/models/vaes/{model_name}.py):
from dataclasses import dataclass, field
from sglang.multimodal_gen.configs.models.vaes.base import VAEConfig
@dataclass
class MyModelVAEConfig(VAEConfig):
vae_scale_factor: int = 8
# ... VAE-specific params ...
Sampling Params (configs/sample/{model_name}.py):
from dataclasses import dataclass
from sglang.multimodal_gen.configs.sample.base import SamplingParams
@dataclass
class MyModelSamplingParams(SamplingParams):
num_inference_steps: int = 50
guidance_scale: float = 7.5
height: int = 1024
width: int = 1024
# ... model-specific defaults ...
The PipelineConfig holds static model configuration and defines callback methods used by the standard DenoisingStage and DecodingStage.
# python/sglang/multimodal_gen/configs/pipeline_configs/my_model.py
import torch
from dataclasses import dataclass, field
from sglang.multimodal_gen.configs.pipeline_configs.base import (
    ImagePipelineConfig,  # for image generation
    # SpatialImagePipelineConfig,  # alternative base
    # VideoPipelineConfig,  # for video generation
)
from sglang.multimodal_gen.configs.models.dits.base import DiTConfig
from sglang.multimodal_gen.configs.models.dits.mymodel import MyModelDitConfig
from sglang.multimodal_gen.configs.models.vaes.base import VAEConfig
from sglang.multimodal_gen.configs.models.vaes.mymodel import MyModelVAEConfig
# Also import ModelTaskType from wherever it is defined in your tree.
@dataclass
class MyModelPipelineConfig(ImagePipelineConfig):
"""Pipeline config for MyModel.
This config provides callbacks that the standard DenoisingStage and
DecodingStage use during execution. The BeforeDenoisingStage handles
all model-specific pre-processing independently.
"""
task_type: ModelTaskType = ModelTaskType.T2I
vae_precision: str = "bf16"
should_use_guidance: bool = True
vae_tiling: bool = False
enable_autocast: bool = False
dit_config: DiTConfig = field(default_factory=MyModelDitConfig)
vae_config: VAEConfig = field(default_factory=MyModelVAEConfig)
# --- Callbacks used by DenoisingStage ---
def get_freqs_cis(self, batch, device, rotary_emb, dtype):
"""Prepare rotary position embeddings for the DiT."""
# Model-specific RoPE computation
...
return freqs_cis
def prepare_pos_cond_kwargs(self, batch, latent_model_input, t, **kwargs):
"""Build positive conditioning kwargs for each denoising step."""
return {
"hidden_states": latent_model_input,
"encoder_hidden_states": batch.prompt_embeds[0],
"timestep": t,
# ... model-specific kwargs ...
}
def prepare_neg_cond_kwargs(self, batch, latent_model_input, t, **kwargs):
"""Build negative conditioning kwargs for CFG."""
return {
"hidden_states": latent_model_input,
"encoder_hidden_states": batch.negative_prompt_embeds[0],
"timestep": t,
# ... model-specific kwargs ...
}
# --- Callbacks used by DecodingStage ---
def get_decode_scale_and_shift(self):
"""Return (scale, shift) for latent denormalization before VAE decode."""
return self.vae_config.latents_std, self.vae_config.latents_mean
def post_denoising_loop(self, latents, batch):
"""Optional post-processing after the denoising loop finishes."""
return latents.to(torch.bfloat16)
def post_decoding(self, frames, server_args):
"""Optional post-processing after VAE decoding."""
return frames
Important: The prepare_pos_cond_kwargs / prepare_neg_cond_kwargs methods define what the DiT receives at each denoising step. These must match the DiT's forward() signature.
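To make the contract concrete, here is a framework-free sketch of how a denoising loop consumes these callbacks: the dict returned by `prepare_pos_cond_kwargs` is splatted straight into the transformer, so any key that is not a `forward()` parameter fails immediately with a TypeError. All names below are stand-ins, not the real DenoisingStage:

```python
def fake_transformer_forward(hidden_states, encoder_hidden_states, timestep):
    """Stand-in for the DiT forward(); returns a dummy 'denoised' latent."""
    return [h - 0.1 * timestep for h in hidden_states]

class FakeConfig:
    def prepare_pos_cond_kwargs(self, latents, embeds, t):
        # Keys here MUST match fake_transformer_forward's parameter names.
        return {
            "hidden_states": latents,
            "encoder_hidden_states": embeds,
            "timestep": t,
        }

config = FakeConfig()
latents, embeds = [1.0, 2.0], [0.5]
for t in (1.0, 0.5):  # toy "timestep schedule"
    kwargs = config.prepare_pos_cond_kwargs(latents, embeds, t)
    latents = fake_transformer_forward(**kwargs)
print(latents)
```

A misspelled key (say, `hidden_state`) would raise at the first denoising step, which is exactly the failure mode the "must match" rule above is warning about.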
This is the heart of the Hybrid pattern. Create a single stage that handles ALL pre-processing.
# python/sglang/multimodal_gen/runtime/pipelines_core/stages/model_specific_stages/my_model.py
import torch
from typing import List, Optional, Union
from sglang.multimodal_gen.runtime.pipelines_core.schedule_batch import Req
from sglang.multimodal_gen.runtime.pipelines_core.stages.base import PipelineStage
from sglang.multimodal_gen.runtime.server_args import ServerArgs
from sglang.multimodal_gen.runtime.distributed import get_local_torch_device
from sglang.multimodal_gen.runtime.utils.logging_utils import init_logger
logger = init_logger(__name__)
class MyModelBeforeDenoisingStage(PipelineStage):
"""Monolithic pre-processing stage for MyModel.
Consolidates all logic before the denoising loop:
- Input validation
- Text/image encoding
- Latent preparation
- Timestep/sigma computation
This stage produces a Req batch with all fields required by
the standard DenoisingStage.
"""
def __init__(self, vae, text_encoder, tokenizer, transformer, scheduler):
super().__init__()
self.vae = vae
self.text_encoder = text_encoder
self.tokenizer = tokenizer
self.transformer = transformer
self.scheduler = scheduler
# ... other initialization (image processors, scale factors, etc.) ...
# --- Internal helper methods ---
# Copy/adapt directly from the Diffusers reference pipeline.
# These are private to this stage; no need to make them reusable.
def _encode_prompt(self, prompt, device, dtype):
"""Encode text prompt into embeddings."""
# ... model-specific text encoding logic ...
return prompt_embeds, negative_prompt_embeds
def _prepare_latents(self, batch_size, height, width, dtype, device, generator):
"""Create initial noisy latents."""
# ... model-specific latent preparation ...
return latents
def _prepare_timesteps(self, num_inference_steps, device):
"""Compute the timestep/sigma schedule."""
# ... model-specific timestep computation ...
return timesteps, sigmas
# --- Main forward method ---
@torch.no_grad()
def forward(self, batch: Req, server_args: ServerArgs) -> Req:
"""Execute all pre-processing and populate batch for DenoisingStage.
This method mirrors the first half of a Diffusers pipeline __call__,
up to (but not including) the denoising loop.
"""
device = get_local_torch_device()
dtype = torch.bfloat16
generator = torch.Generator(device=device).manual_seed(batch.seed)
# 1. Encode prompt
prompt_embeds, negative_prompt_embeds = self._encode_prompt(
batch.prompt, device, dtype
)
# 2. Prepare latents
latents = self._prepare_latents(
batch_size=1,
height=batch.height,
width=batch.width,
dtype=dtype,
device=device,
generator=generator,
)
# 3. Prepare timesteps
timesteps, sigmas = self._prepare_timesteps(
batch.num_inference_steps, device
)
# 4. Populate batch with everything DenoisingStage needs
batch.prompt_embeds = [prompt_embeds]
batch.negative_prompt_embeds = [negative_prompt_embeds]
batch.latents = latents
batch.timesteps = timesteps
batch.num_inference_steps = len(timesteps)
batch.sigmas = sigmas
batch.generator = generator
batch.raw_latent_shape = latents.shape
# batch.height / batch.width already come from the request; no need to reassign
return batch
Key fields that DenoisingStage expects on the batch (set these in your forward):
| Field | Type | Description |
|---|---|---|
batch.latents | torch.Tensor | Initial noisy latent tensor |
batch.timesteps | torch.Tensor | Timestep schedule |
batch.num_inference_steps | int | Number of denoising steps |
batch.sigmas | list[float] | Sigma schedule (as a list, not numpy) |
batch.prompt_embeds | list[torch.Tensor] | Positive prompt embeddings (wrapped in list) |
batch.negative_prompt_embeds | list[torch.Tensor] | Negative prompt embeddings (wrapped in list) |
batch.generator | torch.Generator | RNG generator for reproducibility |
batch.raw_latent_shape | tuple | Original latent shape before any packing |
batch.height / batch.width | int | Output dimensions |
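A small duck-typed validator can catch contract violations before they surface as cryptic DenoisingStage errors. This is illustrative, not part of the framework: field names match the table above, and the type checks are simplified (plain `list` checks) to keep the sketch torch-free:

```python
from types import SimpleNamespace

REQUIRED_FIELDS = (
    "latents", "timesteps", "num_inference_steps", "sigmas",
    "prompt_embeds", "negative_prompt_embeds", "generator",
    "raw_latent_shape", "height", "width",
)

def validate_batch_contract(batch):
    """Return a list of problems with the pre-denoising batch (empty = OK)."""
    problems = [f"missing field: {f}" for f in REQUIRED_FIELDS
                if getattr(batch, f, None) is None]
    if not isinstance(getattr(batch, "sigmas", None), list):
        problems.append("sigmas must be a Python list (use .tolist())")
    for f in ("prompt_embeds", "negative_prompt_embeds"):
        if not isinstance(getattr(batch, f, None), list):
            problems.append(f"{f} must be a list of tensors (wrap with [tensor])")
    return problems

# A batch with one deliberate mistake: sigmas as a tuple instead of a list.
bad = SimpleNamespace(latents=object(), timesteps=object(),
                      num_inference_steps=50, sigmas=(1.0, 0.5),
                      prompt_embeds=[object()], negative_prompt_embeds=[object()],
                      generator=object(), raw_latent_shape=(1, 16, 64, 64),
                      height=1024, width=1024)
print(validate_batch_contract(bad))
```

Calling something like this at the end of `BeforeDenoisingStage.forward()` during development is cheap insurance; remove or gate it once the pipeline is verified.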
The pipeline class is minimal -- it just wires the stages together.
# python/sglang/multimodal_gen/runtime/pipelines/my_model.py
from sglang.multimodal_gen.runtime.pipelines_core import LoRAPipeline
from sglang.multimodal_gen.runtime.pipelines_core.composed_pipeline_base import (
ComposedPipelineBase,
)
from sglang.multimodal_gen.runtime.pipelines_core.stages import DenoisingStage
from sglang.multimodal_gen.runtime.pipelines_core.stages.model_specific_stages.my_model import (
MyModelBeforeDenoisingStage,
)
from sglang.multimodal_gen.runtime.server_args import ServerArgs
class MyModelPipeline(LoRAPipeline, ComposedPipelineBase):
pipeline_name = "MyModelPipeline" # Must match model_index.json _class_name
_required_config_modules = [
"text_encoder",
"tokenizer",
"vae",
"transformer",
"scheduler",
# ... list all modules from model_index.json ...
]
def create_pipeline_stages(self, server_args: ServerArgs):
# 1. Monolithic pre-processing (model-specific)
self.add_stage(
MyModelBeforeDenoisingStage(
vae=self.get_module("vae"),
text_encoder=self.get_module("text_encoder"),
tokenizer=self.get_module("tokenizer"),
transformer=self.get_module("transformer"),
scheduler=self.get_module("scheduler"),
),
)
# 2. Standard denoising loop (framework-provided)
self.add_stage(
DenoisingStage(
transformer=self.get_module("transformer"),
scheduler=self.get_module("scheduler"),
),
)
# 3. Standard VAE decoding (framework-provided)
self.add_standard_decoding_stage()
# REQUIRED: This is how the registry discovers the pipeline
EntryClass = [MyModelPipeline]
In python/sglang/multimodal_gen/registry.py, register your configs:
register_configs(
model_family="my_model",
sampling_param_cls=MyModelSamplingParams,
pipeline_config_cls=MyModelPipelineConfig,
hf_model_paths=[
"org/my-model-name", # HuggingFace model ID(s)
],
)
The EntryClass in your pipeline file is automatically discovered by the registry's _discover_and_register_pipelines() function -- no additional registration needed for the pipeline class itself.
After implementation, you must verify that the generated output is not noise. A noisy or garbled output image/video is the most common sign of an incorrect implementation. Common causes include:
- Wrong latent denormalization (`get_decode_scale_and_shift` returning wrong values)
- Conditioning kwargs that do not match the DiT's `forward()` signature
- Incorrect VAE handling (wrong `vae_scale_factor`, missing denormalization)
- RoPE mismatch (`is_neox_style` set incorrectly)

If the output is noise, the implementation is incorrect — do not ship it. Debug by:
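One lightweight way to localize a divergence is to dump summary statistics of each stage's output tensors and compare them against the same point in a Diffusers reference run; pure noise usually shows up as a sudden blow-up in mean or std at one specific stage. A framework-free sketch of that comparison (in practice you would flatten real tensors with `.flatten().tolist()`; the tolerance is an arbitrary illustrative choice):

```python
import math

def tensor_stats(values):
    """Mean/std summary of a flattened tensor (given as a list of floats)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return {"mean": mean, "std": math.sqrt(var)}

def stats_close(ours, reference, rtol=0.05):
    """Loose check: per-stage stats should roughly match the reference run."""
    return all(abs(ours[k] - reference[k]) <= rtol * (abs(reference[k]) + 1e-8)
               for k in ours)

ours = tensor_stats([0.1, -0.2, 0.3, 0.0])
ref = tensor_stats([0.1, -0.2, 0.3, 0.0])
print(stats_close(ours, ref))
```

Walking stage by stage (text encoding, latent prep, each denoising step, decode) and finding the first point where `stats_close` fails pinpoints which component to inspect.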
| Model | Pipeline | BeforeDenoisingStage | PipelineConfig |
|---|---|---|---|
| GLM-Image | runtime/pipelines/glm_image.py | stages/model_specific_stages/glm_image.py | configs/pipeline_configs/glm_image.py |
| Qwen-Image-Layered | runtime/pipelines/qwen_image.py (QwenImageLayeredPipeline) | stages/model_specific_stages/qwen_image_layered.py | configs/pipeline_configs/qwen_image.py (QwenImageLayeredPipelineConfig) |
| Model | Pipeline | Notes |
|---|---|---|
| Qwen-Image (T2I) | runtime/pipelines/qwen_image.py | Uses add_standard_t2i_stages() — standard text encoding + latent prep fits this model |
| Qwen-Image-Edit | runtime/pipelines/qwen_image.py | Uses add_standard_ti2i_stages() — standard image-to-image flow |
| Flux | runtime/pipelines/flux.py | Uses add_standard_t2i_stages() with custom prepare_mu |
| Wan | runtime/pipelines/wan_pipeline.py | Uses add_standard_ti2v_stages() |
Before submitting, verify:
Common (both styles):
- Pipeline class at `runtime/pipelines/{model_name}.py` with `EntryClass`
- PipelineConfig at `configs/pipeline_configs/{model_name}.py`
- SamplingParams at `configs/sample/{model_name}.py`
- DiT implementation at `runtime/models/dits/{model_name}.py`
- DiT config at `configs/models/dits/{model_name}.py`
- VAE: reuse an existing one (e.g. `AutoencoderKL`) or create new at `runtime/models/vaes/`
- VAE config at `configs/models/vaes/{model_name}.py`
- Configs registered in `registry.py` via `register_configs()`
- `pipeline_name` matches the Diffusers `model_index.json` `_class_name`
- `_required_config_modules` lists all modules from `model_index.json`
- PipelineConfig callbacks (`prepare_pos_cond_kwargs`, `get_freqs_cis`, etc.) match the DiT's `forward()` signature
- Fused ops used where applicable (see `existing-fast-paths.md` under the benchmark/profile skill)
- TP/SP support considered (see `wanvideo.py` for TP+SP, `qwen_image.py` for USPAttention)

Hybrid style only:
- BeforeDenoisingStage at `stages/model_specific_stages/{model_name}.py`
- `BeforeDenoisingStage.forward()` populates all fields needed by `DenoisingStage`
- `batch.sigmas` must be a Python list, not a numpy array. Use `.tolist()` to convert.
- `batch.prompt_embeds` is a list of tensors (one per encoder), not a single tensor. Wrap with `[tensor]`.
- `batch.raw_latent_shape` is set -- `DecodingStage` uses it to unpack latents.
- `is_neox_style=True` = split-half rotation, `is_neox_style=False` = interleaved. Check the reference model carefully.
- Check the reference VAE's dtype and set `vae_precision` in the PipelineConfig accordingly.

After the model produces non-noise output, read
references/testing-and-accuracy.md before
adding GPU cases, component-accuracy skips/hooks, suite entries, or benchmark
claims. That reference tracks the current gpu_cases.py / testcase_configs.py
/ run_suite.py split and the component-accuracy decision rules.