Plans and executes end-to-end LLM creation from architecture design through pretraining, instruction tuning, and alignment. Covers scaling laws, compute budgets, tokenizer training, distributed training infrastructure, and evaluation checkpoints. Use when building a new language model from scratch.
Plan and execute the full pipeline for creating an LLM from scratch — architecture design, scaling law calculations, tokenizer training, distributed pretraining, instruction tuning, alignment, evaluation checkpoints, and release documentation.
Use this skill when:
- fine-tuning or instruction-tuning)
- inference-kernel-optimization)
- eval-dataset-design

Model size guide:
- 1B: num_layers=24, hidden_dim=2048, num_heads=16, context=2048
- 7B: num_layers=32, hidden_dim=4096, num_heads=32, context=4096
- 13B: num_layers=40, hidden_dim=5120, num_heads=40, context=4096
- 70B: num_layers=80, hidden_dim=8192, num_heads=64, context=8192
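
As a sanity check on the size guide above, the parameter count implied by each configuration can be approximated. The sketch below assumes the common decoder-only rule of thumb of roughly 12 × num_layers × hidden_dim² parameters in the transformer blocks plus one embedding matrix. The function name `estimate_params` and the 32,000-entry vocabulary are illustrative assumptions, not values from this guide, and the estimate ignores GQA savings and any wider-than-usual MLP.

```python
# Rough parameter estimate for a decoder-only transformer.
# Assumption: ~12 * L * d^2 parameters in the blocks (attention + MLP),
# plus a single vocab_size * d embedding matrix. Real models deviate
# from this (GQA, MLP width, tied or untied output embeddings).

def estimate_params(num_layers: int, hidden_dim: int, vocab_size: int = 32_000) -> int:
    """Return an approximate parameter count; vocab_size is a hypothetical default."""
    block_params = 12 * num_layers * hidden_dim ** 2
    embedding_params = vocab_size * hidden_dim
    return block_params + embedding_params


# (num_layers, hidden_dim) pairs copied from the size guide.
SIZE_GUIDE = {
    "1B": (24, 2048),
    "7B": (32, 4096),
    "13B": (40, 5120),
    "70B": (80, 8192),
}

if __name__ == "__main__":
    for name, (layers, dim) in SIZE_GUIDE.items():
        approx = estimate_params(layers, dim)
        print(f"{name}: ~{approx / 1e9:.1f}B parameters (rule-of-thumb estimate)")
```

The estimate is only meant to confirm that a configuration lands in the intended size class; the architecture spec should report the exact count from the instantiated model.
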
Use RoPE positional embeddings, SwiGLU activation, RMSNorm, and, for ≥13B models, GQA (grouped-query attention).

- Compute budget: C ≈ 6 × N × D, where C = total training FLOPs, N = parameters, and D = tokens. Budget GPU-hours as C / (GPU_TFLOPS × 10^12 × utilization × 3600), where GPU_TFLOPS is the peak per-GPU throughput and utilization is the MFU (model FLOPs utilization); target 40–50% MFU.
- Tokenizer training: include special tokens (`<|begin_of_text|>`, `<|end_of_text|>`, chat role markers).
- Distributed pretraining: `FullyShardedDataParallel` with mixed precision (bf16) and activation checkpointing.
- Post-training: instruction-tune (delegate to instruction-tuning skill), then align with DPO or RLHF (delegate to preference-optimization skill).

Outputs:
- Architecture spec: model config (layers, dims, heads, context, vocab), parameter count
- Compute budget: FLOPs estimate, GPU type/count, estimated wall-clock time, cost
- Data plan: corpus composition, token counts, deduplication and filtering pipeline
- Training config: optimizer, LR schedule, batch size ramp, parallelism strategy
- Evaluation schedule: checkpoints, benchmarks, and pass/fail criteria at each stage
- Model card: standard model card with architecture, data, benchmarks, limitations

Related skills:
- model-architecture
- data-cleaning-labeling
- instruction-tuning
- fine-tuning
- preference-optimization
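
The compute-budget rule given earlier (C ≈ 6 × N × D, then dividing by sustained per-GPU throughput) can be sketched as a small calculator. This is a minimal illustration, not a definitive budgeting tool: the function names are hypothetical, and the example inputs (7B parameters, 1T tokens, 312 peak TFLOPS per GPU, 45% utilization, 256 GPUs) are assumed values chosen only to exercise the formula.

```python
# Minimal compute-budget sketch based on C ~= 6 * N * D.
# All function names and example inputs are illustrative assumptions.

def training_flops(num_params: float, num_tokens: float) -> float:
    """Approximate total training FLOPs: C ~= 6 * N * D."""
    return 6.0 * num_params * num_tokens


def gpu_hours(total_flops: float, gpu_tflops: float, utilization: float) -> float:
    """GPU-hours needed at a given peak TFLOPS and utilization (MFU, 0..1)."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    # Convert peak TFLOPS to sustained FLOP/s, then seconds to hours.
    sustained_flops_per_sec = gpu_tflops * 1e12 * utilization
    return total_flops / (sustained_flops_per_sec * 3600.0)


def wall_clock_days(total_gpu_hours: float, num_gpus: int) -> float:
    """Wall-clock days, assuming perfect scaling across num_gpus."""
    return total_gpu_hours / num_gpus / 24.0


if __name__ == "__main__":
    flops = training_flops(num_params=7e9, num_tokens=1e12)  # assumed model and corpus size
    hours = gpu_hours(flops, gpu_tflops=312.0, utilization=0.45)  # assumed hardware and MFU
    days = wall_clock_days(hours, num_gpus=256)  # assumed cluster size
    print(f"Total FLOPs: {flops:.2e}")
    print(f"GPU-hours:   {hours:,.0f}")
    print(f"Wall-clock:  {days:.1f} days on 256 GPUs")
```

Multiplying the GPU-hours figure by an hourly GPU price gives the cost line of the compute budget. The perfect-scaling assumption in `wall_clock_days` is optimistic, so leave headroom for restarts and communication overhead.
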