Create a reduced (fewer-layer) version of a large HuggingFace model for fast testing and debugging. Use this skill whenever the user wants to reduce a model's layer count, create a small test model, download only part of a model's weights, debug quantization on a smaller model, speed up the testing-verification loop for large LLMs, or create a lightweight version of any HF model for rapid iteration. Trigger on phrases like "reduce model", "small test model", "fewer layers", "download partial model", "debug with smaller model", "shrink model for testing", even if the user doesn't use the exact term "reduce".
Create a reduced (fewer-layer) version of any HuggingFace model for fast end-to-end testing. The reduced model produces meaningless text but proves the full pipeline (download → load → generate) works with no runtime errors.
This is invaluable when working with large LLMs (hundreds of GB) where the testing-verification loop is painfully slow. Instead of downloading and loading the full model, you work with a 4-layer version that's a fraction of the size.
The process has 6 phases. Each phase builds on the previous one — don't skip ahead.
Phase 1: Understand the model

Before writing any code, you need to understand the model and the environment.
Identify the model ID (e.g., Qwen/Qwen3-30B-A3B). From HuggingFace, fetch:

- The model card — to learn whether the model requires trust_remote_code, and the minimum transformers version
- config.json — to find num_hidden_layers, model_type, and architecture-specific fields. See references/model-patterns.md for common patterns across model families
- model.safetensors.index.json — to map which layers live in which shard files. This is how you figure out what to download

From the local environment, check:

- Available disk space (df -h)
- Whether the hf CLI is available (which hf)
- GPU availability (nvidia-smi)

Phase 2: Set up the environment

The reduced model needs a working Python environment with the right packages. Use uv if available (faster), otherwise python -m venv.
uv venv .venv
source .venv/bin/activate
uv pip install "transformers>=<version-from-model-card>" torch safetensors accelerate
The accelerate package is easy to forget but required whenever you use device_map in from_pretrained. Install it upfront to avoid a confusing ValueError later.
The minimum transformers version matters — newer model architectures (like qwen3_moe) need recent transformers or you'll get a KeyError on the model type. Check the model card.
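A minimal sketch of such a version check, assuming plain X.Y.Z version strings (use packaging.version for anything fancier); the 4.51.0 minimum below is a placeholder, not taken from any real model card:

```python
from importlib.metadata import PackageNotFoundError, version

def parse(v: str) -> tuple:
    # Naive: handles plain "X.Y.Z" strings only.
    return tuple(int(p) for p in v.split(".")[:3])

def meets_minimum(installed: str, required: str) -> bool:
    return parse(installed) >= parse(required)

def transformers_is_new_enough(required: str = "4.51.0") -> bool:
    try:
        return meets_minimum(version("transformers"), required)
    except PackageNotFoundError:
        return False  # not installed yet

print(meets_minimum("4.46.2", "4.51.0"))  # False: too old for newer architectures
print(meets_minimum("4.55.0", "4.51.0"))  # True
```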
Phase 3: Download metadata

Download just the config, tokenizer, and index files — not the multi-GB weight files yet. This is the two-phase approach: get the metadata first, patch it, then download only the shards you actually need.
hf download <MODEL_ID> \
--local-dir <OUTPUT_DIR> \
--include "*.json" --include "*.txt" --include "*.model" --include "*.tiktoken"
For models that use trust_remote_code=True (like DeepSeek-R1), also download *.py files since they contain custom model code:
hf download <MODEL_ID> \
--local-dir <OUTPUT_DIR> \
--include "*.json" --include "*.txt" --include "*.model" --include "*.tiktoken" --include "*.py"
Important gotcha: Each glob pattern needs its own --include flag. Writing --include "*.json" "*.txt" silently downloads the wrong files. Always use --include "*.json" --include "*.txt" --include ....
This typically downloads ~10-20MB: config.json, generation_config.json, model.safetensors.index.json, tokenizer files, and vocabulary files.
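With the index on disk, you can preview which shard holds which layers before patching anything. A sketch, assuming the standard model.layers.N.* naming (check references/model-patterns.md for other families); -1 stands in for non-layer weights such as embeddings, the final norm, and lm_head:

```python
import json
import re
from collections import defaultdict

def layers_per_shard(index_path: str) -> dict:
    """Map each shard filename to the sorted layer indices it contains.

    -1 marks non-layer weights (embeddings, final norm, lm_head)."""
    with open(index_path) as f:
        weight_map = json.load(f)["weight_map"]
    shards = defaultdict(set)
    for name, shard in weight_map.items():
        m = re.match(r"^model\.layers\.(\d+)\.", name)
        shards[shard].add(int(m.group(1)) if m else -1)
    return {shard: sorted(layers) for shard, layers in sorted(shards.items())}

# Example usage:
# for shard, layers in layers_per_shard("out/model.safetensors.index.json").items():
#     print(shard, layers)
```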
Phase 4: Patch the metadata

Write a Python script (patch_model.py) that modifies two files in-place:

In config.json:

- Set num_hidden_layers to the target layer count (e.g., 48 → 4).
- If max_window_layers is present, cap its value to be no more than the new num_hidden_layers.

In model.safetensors.index.json:

- Filter the weight_map dictionary so it keeps only model.embed_tokens.*, model.layers.{0..N-1}.*, model.norm.*, and lm_head.*
- Leave metadata.total_size as-is (it's informational only; transformers ignores it)
- Filter by the weight names in weight_map, not by filename numbering

The script should print the list of unique .safetensors filenames still referenced to stdout. The shell script captures this to know what to download next.
Example output for Qwen3-30B-A3B with 4 layers:
model-00001-of-00016.safetensors # layers 0-1 + embeddings
model-00002-of-00016.safetensors # layers 2-3
model-00016-of-00016.safetensors # lm_head + model.norm
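A minimal sketch of patch_model.py under these assumptions: the model uses the standard sharded-safetensors layout, and layer weights are named model.layers.N.* (other families may differ; see references/model-patterns.md). The CLI flags match the reduce.sh driver in this document:

```python
import argparse
import json
import os
import re
import sys

LAYER_RE = re.compile(r"^model\.layers\.(\d+)\.")

def patch_config(cfg: dict, num_layers: int) -> dict:
    """Return a copy of config.json contents with the layer count reduced."""
    cfg = dict(cfg)
    cfg["num_hidden_layers"] = num_layers
    if "max_window_layers" in cfg:
        # Qwen-family field: must not exceed the new layer count.
        cfg["max_window_layers"] = min(cfg["max_window_layers"], num_layers)
    return cfg

def filter_weight_map(weight_map: dict, num_layers: int) -> dict:
    """Keep embeddings/norm/lm_head plus layers 0..num_layers-1."""
    def keep(name: str) -> bool:
        m = LAYER_RE.match(name)
        return m is None or int(m.group(1)) < num_layers
    return {k: v for k, v in weight_map.items() if keep(k)}

def main(argv=None):
    ap = argparse.ArgumentParser()
    ap.add_argument("--model-dir", required=True)
    ap.add_argument("--num-layers", type=int, default=4)
    args = ap.parse_args(argv)

    cfg_path = os.path.join(args.model_dir, "config.json")
    with open(cfg_path) as f:
        cfg = json.load(f)
    with open(cfg_path, "w") as f:
        json.dump(patch_config(cfg, args.num_layers), f, indent=2)

    idx_path = os.path.join(args.model_dir, "model.safetensors.index.json")
    with open(idx_path) as f:
        index = json.load(f)
    index["weight_map"] = filter_weight_map(index["weight_map"], args.num_layers)
    with open(idx_path, "w") as f:
        json.dump(index, f, indent=2)  # metadata.total_size left untouched

    # Print the shards still referenced; the shell driver captures this list.
    for shard in sorted(set(index["weight_map"].values())):
        print(shard)

if __name__ == "__main__" and len(sys.argv) > 1:  # no-op when imported/arg-less
    main()
```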
Phase 5: Download shards

Using the shard list from Phase 4, download each file individually:
for shard in $SHARDS; do
hf download <MODEL_ID> --local-dir <OUTPUT_DIR> --include "$shard"
done
This is where the big savings happen. For Qwen3-30B-A3B, this downloads ~8.7GB instead of ~60GB. The ratio depends on the model — models with more experts per layer have larger per-shard files.
Phase 6: Smoke test

Write a Python script (test_generation.py) that:

- Loads the model with torch_dtype="auto", device_map="cpu" (or "auto" for GPU), and trust_remote_code=True if the model card says so
- Generates a short completion for a prompt like "The capital of France is" using apply_chat_template
- Prints a SUCCESS marker

The output text will be meaningless from a 4-layer model — that's expected. The point is proving there are no runtime errors: no missing weights, no shape mismatches, no import failures.
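A sketch of test_generation.py along those lines. The flags and the SUCCESS marker match the reduce.sh driver in this document; the loading details (dtype, device_map, chat-template arguments) are assumptions to adjust per model and hardware:

```python
import argparse
import sys

def build_parser() -> argparse.ArgumentParser:
    ap = argparse.ArgumentParser()
    ap.add_argument("--model-dir", required=True)
    ap.add_argument("--max-new-tokens", type=int, default=20)  # don't rely on model defaults
    ap.add_argument("--device-map", default="cpu")  # "auto" to spread across GPUs
    return ap

def main(argv=None):
    args = build_parser().parse_args(argv)
    # Deferred imports: fail here with a clear traceback if packages are missing.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained(args.model_dir, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        args.model_dir,
        torch_dtype="auto",
        device_map=args.device_map,  # requires the accelerate package
        trust_remote_code=True,
    )
    messages = [{"role": "user", "content": "The capital of France is"}]
    inputs = tok.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_tensors="pt",
        # enable_thinking=False,  # uncomment for Qwen3-style chat templates
    ).to(model.device)
    out = model.generate(inputs, max_new_tokens=args.max_new_tokens)
    # The completion will be gibberish from a 4-layer model; that's expected.
    print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
    print("SUCCESS")

if __name__ == "__main__" and len(sys.argv) > 1:  # no-op when imported/arg-less
    main()
```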
Model-specific considerations for the smoke test:
- Models whose chat template supports enable_thinking (like Qwen3) — set it to False for simpler output
- Models that need trust_remote_code=True — always check the model card

Tie it all together with a shell script (reduce.sh) that runs phases 3-6 in sequence:
#!/bin/bash
set -euo pipefail
MODEL_ID="<model-id>"
OUTPUT_DIR="<output-dir>"
NUM_LAYERS=4
SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
# Activate venv
source "$SCRIPT_DIR/.venv/bin/activate"
# Phase 3: Download metadata
hf download "$MODEL_ID" --local-dir "$OUTPUT_DIR" \
--include "*.json" --include "*.txt" --include "*.model" --include "*.tiktoken"
# Phase 4: Patch
SHARDS=$(python "$SCRIPT_DIR/patch_model.py" --model-dir "$OUTPUT_DIR" --num-layers "$NUM_LAYERS")
# Phase 5: Download shards
INCLUDE_ARGS=()
for shard in $SHARDS; do
INCLUDE_ARGS+=(--include "$shard")
done
hf download "$MODEL_ID" --local-dir "$OUTPUT_DIR" "${INCLUDE_ARGS[@]}"
# Phase 6: Smoke test
python "$SCRIPT_DIR/test_generation.py" --model-dir "$OUTPUT_DIR"
After the workflow completes, verify:
- config.json shows the reduced num_hidden_layers value
- model.safetensors.index.json has no entries for layers ≥ N
- The smoke test prints SUCCESS

Troubleshooting

| Problem | Cause | Fix |
|---|---|---|
| hf download only fetches 1 file | Glob patterns space-separated after a single --include | Use a separate --include flag for each pattern |
| ValueError: device_map requires accelerate | accelerate package missing | pip install accelerate |
| KeyError: '<model_type>' | transformers version too old for this model | Upgrade to the version specified on the model card |
| ImportError: cannot import name 'is_flash_attn_...' | Model's custom code (e.g., DeepSeek-R1) incompatible with transformers 5.x | Use transformers 4.x (pip install "transformers>=4.46,<5.0"). Also download *.py files in Phase 3 for models with trust_remote_code=True |
| Shape mismatch errors during loading | Kept wrong layers or missed embedding/norm layers | Check weight_map filtering — must keep embed_tokens, norm, lm_head |
| Model generates nothing / hangs | Too few layers for the generation config | Set max_new_tokens=20 explicitly; don't rely on model defaults |
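The verification list above can also be scripted. A sketch assuming the standard HF file layout inside the output directory (verify_reduction is a hypothetical helper, not one of the workflow's scripts):

```python
import json
import os

def verify_reduction(model_dir: str, num_layers: int) -> list:
    """Return a list of problems; an empty list means the reduction looks correct."""
    problems = []
    with open(os.path.join(model_dir, "config.json")) as f:
        cfg = json.load(f)
    if cfg.get("num_hidden_layers") != num_layers:
        problems.append(
            f"config.json has num_hidden_layers={cfg.get('num_hidden_layers')}, "
            f"expected {num_layers}")
    with open(os.path.join(model_dir, "model.safetensors.index.json")) as f:
        weight_map = json.load(f)["weight_map"]
    stale = [k for k in weight_map
             if k.startswith("model.layers.")
             and int(k.split(".")[2]) >= num_layers]
    if stale:
        problems.append(
            f"index still references {len(stale)} weights in layers >= {num_layers}")
    return problems
```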
references/model-patterns.md — Architecture patterns for common model families (dense vs MoE, layer naming conventions, which config fields to check). Read this when working with an unfamiliar model architecture.