Add a new image or video generative model to the Draw Things app / CLI with a compile-first, end-to-end workflow across SwiftDiffusion, tokenizer plumbing, text encoder, fixed encoder, UNet / DiT runtime, VAE, converter and quantizer tooling, and CLI validation.
Use this when adding a new image or video generative model, or a new model version, to the Draw Things app / CLI, especially when most of the surrounding components already exist and the task is mainly integration.
This skill is optimized for the current SwiftDiffusion / Draw Things layout:
architecture builder in Libraries/SwiftDiffusion/Sources/Models
weight-loading logic returned as ModelWeightMapper from model builders and helper blocks
text path in Libraries/SwiftDiffusion/Sources/TextEncoder.swift
fixed encoder path in Libraries/SwiftDiffusion/Sources/UNetFixedEncoder.swift
runtime UNet / DiT path in Libraries/SwiftDiffusion/Sources/Models/UNetProtocol.swift
VAE path in Libraries/SwiftDiffusion/Sources/FirstStage.swift
In this repo, UNet in names such as UNetProtocol, UNetFixedEncoder, and UNetExtractConditions is the legacy name for the main diffusion model integration boundary. The underlying architecture may be a DiT or another non-UNet model.
Related Skills
Default Approach
Get the new generative model compiling end-to-end before optimizing or slicing.
Prefer minimal, explicit integration over generic abstractions.
If a new generative model reuses an existing tokenizer, text encoder, or VAE, hook that up first instead of creating a new variant.
For compile sweeps, it is acceptable to add case .newModel: fatalError() placeholders first, then fill them in.
For bring-up, temporarily prefer strict loading such as read(model: "model", strict: true, ...) to surface missing or mismatched keys early. Do not ship that debugging behavior unintentionally.
If text-conditioning boundary placement is ambiguous, choose the boundary that avoids passing large intermediate tensors across module boundaries.
Prefer the repo's usual UNetFixedEncoder / UNetProtocol integration split as the long-term structure, but do not force that split into the first implementation if it makes bring-up materially harder.
Keep the known-good unsplit / unsliced path as the release baseline until any later fixed split or cache path proves output parity.
Do not introduce a model-specific config bag just to build one model graph; follow nearby builders and pass the needed parameters explicitly.
If a reference implementation uses a different tensor layout from the app runtime, adapt layout at the model boundary first before debugging higher-level plumbing.
Prefer the partner runtime path over converter or export harnesses when they disagree about runtime behavior.
Treat text-conditioning contract bugs as first-class integration bugs. Wrong padding, masking, unconditional handling, prompt templates, or adapter boundary placement can preserve tensor shapes while destroying prompt adherence.
Do not assume all runtime side inputs have the same rank. CFG splitting and extracted-condition logic must respect the actual tensor ranks.
If a model removes an external mask or padding input, only synthesize an internal zero mask when the architecture really allows it.
If a model checkpoint needs to flow through an existing asset slot to reach the right subsystem, prefer the existing pass-through pattern over inventing a new file-plumbing path.
If partner implementations exist, prefer the one that matches the shipped model behavior over a loosely related upstream base repo.
If a model supports multiple SDPA scaling modes, keep that as an enum-like runtime choice end to end instead of collapsing it to a boolean.
If a tiled path uses spatial rotary embeddings, generate the full-image rotary tensor once and slice tiles from it. Do not regenerate tile-local rotary tensors unless the partner runtime does that explicitly.
If a fixed split is introduced later, keep self-attention and cross-attention weight naming distinct and assert fixed-output count and ordering so silent misloads fail early.
A first successful CLI run is not enough validation. Run a real sample, inspect the image, and pin --seed when comparing semantic fixes.
Temporary debug prints, env toggles, and strict-load hooks are bring-up tools only. Remove them before handoff and validate the cleaned tree again.
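The compile-sweep placeholder pattern above can be sketched as follows. This is an illustrative Swift sketch only: the enum, its cases, and the function are hypothetical stand-ins, not the repo's actual types.

```swift
// Hypothetical sketch: ModelVersion and textEncoderName are illustrative,
// not the repo's real declarations.
enum ModelVersion { case v1, sdxl, newModel }

func textEncoderName(for version: ModelVersion) -> String {
  switch version {
  case .v1, .sdxl:
    return "clip_vit_l14"
  case .newModel:
    // Acceptable during a compile sweep so every switch stays exhaustive;
    // replace with a real implementation before the model can run.
    fatalError("newModel text encoder not wired up yet")
  }
}
```

Because the switch has no default clause, the compiler surfaces every site that still needs a real newModel arm, which is the point of the sweep.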
Workflow
1. Add the main model builder and weight mapping
Add the new diffusion builder in Libraries/SwiftDiffusion/Sources/Models/<Model>.swift.
Follow the structure used by nearby large-model integrations:
prefer the eventual UNetFixedEncoder / UNetProtocol integration split structurally
do not slice the model yet
do not extract AdaLN or KV-precompute paths yet
usually keep the fixed part unextracted and wire the full model through so end-to-end generation can run first
only pull work into UNetFixedEncoder early if it is already simple, obvious, and low-risk
Only add slicing later if profiling or architecture constraints require it.
make the unsliced path work first
if a later fixed / sliced split changes semantic output, keep the unsplit path as the release baseline until parity is proven
do not leave a speculative fixed split half-wired into the normal runtime path just because it compiles
confirm the runtime model input order matches the actual UNetProtocol call contract
if CFG splitting is enabled, do not assume every side input is rank-3
if UNetExtractConditions is used, only slice tensors that are actually timestep-major extracted conditions
do not slice or index batched text context by sampler step unless the data is explicitly laid out that way
if technical execution succeeds but samples remain noise, compare the attention and residual-dtype path against the partner implementation before changing higher-level app plumbing
if a partner implementation keeps the residual stream in higher precision than attention / FFN, mirror that split if the lower-precision path produces NaNs or semantic collapse
if a split fixed path is added, keep the unsplit path as the output-parity baseline until the split path is proven on real images
if the split path precomputes cross-attention KV or modulation terms, assert the expected number of returned tensors and keep their ordering explicit
if the model has repeated self-attention and cross-attention blocks, do not let their checkpoint keys collide by sharing a naming pattern that only differs by enumeration position
if tiled diffusion is supported, feed full-image rotary into the shared slice path instead of rebuilding tile-local rotary tensors
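The distinct self-attention / cross-attention naming rule can be sketched like this. Everything here is hypothetical: WeightMapper is a simplified stand-in for the repo's ModelWeightMapper, and all key names are invented for illustration.

```swift
// Hypothetical sketch: maps each graph parameter name to the checkpoint
// keys it loads from. Keeping self_attn / cross_attn prefixes distinct
// means repeated blocks can never collide on a key that differs only by
// enumeration position.
typealias WeightMapper = (Int) -> [String: [String]]

func blockMapper(prefix: String) -> WeightMapper {
  return { i in
    [
      // Distinct per attention kind; only the block index is shared.
      "\(prefix).blocks.\(i).self_attn.qkv.weight":
        ["blocks.\(i).attn1.to_qkv.weight"],
      "\(prefix).blocks.\(i).cross_attn.q.weight":
        ["blocks.\(i).attn2.to_q.weight"],
    ]
  }
}
```

With this shape, asserting the mapper's returned tensor count and ordering per block makes a silent misload fail at load time rather than as noise in samples.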
6. Sweep converter and quantizer tools
If the model is user-facing or importable, extend the non-runtime tools too:
Apps/ModelConverter/Converter.swift
Apps/LoRAConverter/Converter.swift
Apps/ModelQuantizer/Quantizer.swift
Common integration misses:
runtime code builds, but converter or quantizer switches are non-exhaustive
tool help text omits the new model version
quantization policy silently uses the wrong fallback because the new model family is unhandled
checkpoint key remap shims are kept after the converted checkpoint has been regenerated with correct names
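The quantization-policy miss above can be sketched as an exhaustiveness problem. The enum, cases, and policy strings here are illustrative assumptions, not the actual quantizer code.

```swift
// Hypothetical sketch: a non-exhaustive switch with a default clause would
// silently route a new model family to the wrong fallback. Omitting
// `default:` forces a compile error when a case is added.
enum ModelFamily { case sd, sdxl, flux, newModel }

func quantizationPolicy(for family: ModelFamily) -> String {
  switch family {  // no default: adding a case must update this switch
  case .sd, .sdxl: return "q6p"
  case .flux:      return "q8p"
  case .newModel:  return "q8p"  // explicit choice, not an inherited fallback
  }
}
```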
7. Hook VAE through the existing first-stage path
If the new model reuses an existing latent contract:
reuse the existing first-stage path in Libraries/SwiftDiffusion/Sources/FirstStage.swift
avoid creating a new VAE branch unless the latent contract actually proves incompatible
Goal:
encode/decode works by routing through the minimal existing first-stage behavior the model is compatible with
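The reuse decision can be thought of as a latent-contract comparison. The struct and the values below are illustrative only; the real routing lives in FirstStage.swift and keys off the model version, not a descriptor like this.

```swift
// Hypothetical latent-contract descriptor for deciding first-stage reuse.
struct LatentContract: Equatable {
  var channels: Int
  var scaleFactor: Float
  var shiftFactor: Float
}

let existingVAE = LatentContract(channels: 4, scaleFactor: 0.13025, shiftFactor: 0)
let newModelVAE = LatentContract(channels: 4, scaleFactor: 0.13025, shiftFactor: 0)

// Only branch to a new VAE path when the contract actually differs.
let reuseExistingFirstStage = (newModelVAE == existingVAE)
```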
8. Validate end-to-end before cleanup
Preferred validation order:
compile the affected diffusion library target
compile the app or CLI path that exercises the model
run an end-to-end generation through bazel run //Apps:DrawThingsCLI -- ...
end-to-end CLI execution may require permission to use the GPU
prefer bazel run //Apps:DrawThingsCLI -- ... directly over swift run in this repo
if the CLI does not expose a runtime knob directly, pass the value through --config-json
if the model is flow-matching or uses a non-default objective/discretization, verify those ModelZoo values explicitly instead of assuming the nearest existing model is correct
when GPU approval is needed for repeated generation comparisons, ask once for a fixed command shape with:
a fixed output image path
a fixed log path
after each run completes, move those generic files to a run-specific name yourself to preserve progress history
this keeps the approved command prefix stable across iterations and avoids re-asking for every output filename change
after the model is working, remove temporary model-specific env toggles and debug prints before handoff, then rerun the validation sequence above on the cleaned tree
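The fixed-command-shape pattern for GPU-approved runs can be sketched as a small shell routine. The paths, flag names, and the bazel invocation shown in comments are examples only, not a prescribed setup.

```shell
# Illustrative run-archiving pattern; paths and flags are examples only.
# In practice the approved fixed command shape would look something like:
#   bazel run //Apps:DrawThingsCLI -- --seed 42 ...
# writing to a fixed output image path and a fixed log path:
out=/tmp/dt-out.png
log=/tmp/dt-run.log
printf 'placeholder image\n' > "$out"   # stands in for the CLI's real output
printf 'placeholder log\n' > "$log"     # stands in for the CLI's real log
# After each run completes, move the generic files to run-specific names so
# the approved command prefix stays stable while history is preserved.
run_id="$(date +%Y%m%d-%H%M%S)"
mv "$out" "/tmp/dt-out-${run_id}.png"
mv "$log" "/tmp/dt-run-${run_id}.log"
echo "archived run ${run_id}"
```

Keeping the generic paths in the approved command and doing the renaming yourself avoids re-requesting approval for every output filename change.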