Build Stable Diffusion pipelines with HuggingFace Diffusers. Use when generating images from text, performing img2img, inpainting, using ControlNet, loading LoRAs, fine-tuning with DreamBooth, or deploying diffusion models. Triggers: 'Stable Diffusion', 'SDXL', 'SD 3', 'Flux', 'Diffusers', 'text-to-image', 'inpainting', 'ControlNet', 'LoRA training', 'DreamBooth'.
Claude already knows Diffusers API basics. This skill adds expert judgment for non-obvious choices.
| Need | Model | Why |
|---|---|---|
| General purpose, fast | SD 1.5 | Smallest VRAM, largest LoRA ecosystem |
| High quality, 1024px native | SDXL | Best quality/speed tradeoff today |
| Best text rendering | SD 3.0 / SD 3.5 | T5 encoder understands text layout |
| Fastest single-step | SDXL + LCM LoRA | 4 steps, guidance_scale=1.0 — MUST use LCM scheduler |
| Best open-source overall | Flux.1 [dev] | Superior prompt following, but ~24GB VRAM |
Before choosing, ask:
- How much VRAM do you have? (use enable_model_cpu_offload() to roughly halve the requirement)

Claude knows the scheduler list. Here's what it doesn't know:
- Use LCMScheduler with LCM LoRAs.

guidance_scale calibration:

| Model | Sweet spot | Too low | Too high |
|---|---|---|---|
| SD 1.5 | 7-9 | <5: ignores prompt | >12: oversaturated, artifacts |
| SDXL | 5-8 | <3: unfocused | >10: harsh contrast |
| SD 3.x | 4-7 | <2: random | >8: burnt highlights |
| Flux | 3.5 (fixed) | N/A | N/A — Flux uses guidance embedding, not CFG |
| LCM | 1.0 (fixed) | N/A | >2: destroys output completely |
Gotchas:

- Never use guidance_scale > 2 with LCM — it will produce noise/artifacts. LCM was trained with guidance_scale=1.0.
- Never use height/width that aren't multiples of 8 — silent misalignment causes border artifacts.
- Set pipe.vae = pipe.vae.to(dtype=torch.float32) when getting black images — the VAE is numerically unstable in fp16 on many models (especially SDXL).
- Never use num_inference_steps < 15 without LCM/Turbo — standard schedulers need a minimum of ~20 steps for coherent output.
- Call .fuse_lora() or set the scale after loading a LoRA — without this, the LoRA has zero effect and you'll debug for hours.
- Avoid enable_sequential_cpu_offload() in production — it's ~3x slower than enable_model_cpu_offload(). Use it only when the latter still OOMs.
- For prompt emphasis, use (word:1.5) instead of repetition.

ControlNet selection:

| I want to preserve... | ControlNet | Preprocessor | Key gotcha |
|---|---|---|---|
| Overall structure | canny | Canny edge detection | Low/high thresholds matter hugely — default (100,200) often too aggressive |
| Human pose only | openpose | OpenPose | Fails silently on non-human subjects |
| 3D spatial layout | depth | MiDaS / Zoe | conditioning_scale > 1.0 causes depth map to override prompt entirely |
| Architectural lines | mlsd | M-LSD | Only detects straight lines — useless for organic subjects |
| Rough concept | scribble | HED / pidinet | Most forgiving — good starting point when unsure |
strength calibration (img2img):

- 0.3-0.4: Color/style shift, structure preserved (touch-up)
- 0.5-0.7: Significant transformation, composition kept (style transfer)
- 0.8-1.0: Near-complete regeneration (only vague shapes kept)

Apply in this order, stop when VRAM fits:
1. torch_dtype=torch.float16 (always do this — 50% savings, no quality loss)
2. pipe.enable_model_cpu_offload() (moderate slowdown, large savings)
3. pipe.enable_vae_slicing() + pipe.enable_vae_tiling() (needed for >1024px)
4. pipe.enable_attention_slicing() (small extra savings)
5. pipe.enable_xformers_memory_efficient_attention() (if xformers available)
6. pipe.enable_sequential_cpu_offload() (last resort — very slow)

Load these only when needed for the specific task:
- Advanced Usage — MANDATORY before: building custom pipelines from components, writing custom denoising loops, fine-tuning (DreamBooth/LoRA/Textual Inversion), IP-Adapter, SDXL Refiner, T2I-Adapter, quantization, production deployment, callbacks, multi-GPU. Do NOT load for standard generation, img2img, inpainting, or ControlNet — the SKILL.md body is sufficient.
- Troubleshooting — MANDATORY when: encountering errors (CUDA OOM, black images, package conflicts, LoRA loading failures, ControlNet not conditioning, hub download issues). Do NOT load proactively — only when a specific error needs debugging.