Name: Llm Creation
Author: merceralex397-collab

eval-dataset-design

Define architecture. Choose a decoder-only transformer configuration:

Model size guide:
- 1B:  num_layers=24,  hidden_dim=2048,  num_heads=16,  context=2048
- 7B:  num_layers=32,  hidden_dim=4096,  num_heads=32,  context=4096
- 13B: num_layers=40,  hidden_dim=5120,  num_heads=40,  context=4096
- 70B: num_layers=80,  hidden_dim=8192,  num_heads=64,  context=8192

Use RoPE positional embeddings, SwiGLU activation, and RMSNorm; add GQA (grouped-query attention) for ≥13B models.
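As a rough sanity check on the size guide, the parameter count implied by a configuration can be estimated from its dimensions. The sketch below is a minimal estimate, not a definitive formula. It assumes a 32k vocabulary, untied input and output embeddings, and a SwiGLU hidden width of 8/3 × hidden_dim rounded up to a multiple of 256. The names `ModelConfig` and `estimate_params` are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelConfig:
    num_layers: int
    hidden_dim: int
    num_heads: int
    context: int
    num_kv_heads: Optional[int] = None  # None means standard multi-head attention
    vocab_size: int = 32_000            # assumption: 32k vocabulary


def estimate_params(cfg: ModelConfig) -> int:
    """Approximate parameter count for a decoder-only transformer."""
    d = cfg.hidden_dim
    head_dim = d // cfg.num_heads
    kv_heads = cfg.num_kv_heads or cfg.num_heads
    # Attention: Q and output projections are d x d; K and V shrink under GQA.
    attn = 2 * d * d + 2 * d * kv_heads * head_dim
    # SwiGLU MLP: three matrices (gate, up, down); the width is an assumption.
    ffn_dim = 256 * math.ceil(8 * d / 3 / 256)
    mlp = 3 * d * ffn_dim
    norms = 2 * d                        # two RMSNorm weight vectors per layer
    per_layer = attn + mlp + norms
    embeddings = 2 * cfg.vocab_size * d  # assumption: untied input/output embeddings
    return cfg.num_layers * per_layer + d + embeddings  # + final RMSNorm


cfg_7b = ModelConfig(num_layers=32, hidden_dim=4096, num_heads=32, context=4096)
cfg_70b_mha = ModelConfig(num_layers=80, hidden_dim=8192, num_heads=64, context=8192)
cfg_70b_gqa = ModelConfig(num_layers=80, hidden_dim=8192, num_heads=64, context=8192,
                          num_kv_heads=8)  # 8 KV heads is an illustrative choice

print(f"7B config: {estimate_params(cfg_7b) / 1e9:.2f}B parameters")
print(f"70B config with GQA: {estimate_params(cfg_70b_gqa) / 1e9:.2f}B parameters")
```

Under these assumptions the 7B row lands close to its nominal size, and switching to GQA reduces the attention parameters because the key and value projections shrink.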

Apply scaling laws. Chinchilla-optimal: train on ~20 tokens per parameter (a 7B model needs ~140B tokens). Estimate total training FLOPs: C ≈ 6 × N × D, where N = parameters and D = training tokens. Budget GPU-hours: C / (peak GPU FLOP/s × utilization × 3600), where peak GPU FLOP/s = GPU_TFLOPS × 10^12 and utilization is the model FLOPs utilization (MFU); target 40–50% MFU.
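The budgeting arithmetic above can be sketched in a few lines. The function names are hypothetical, and the 312 TFLOPS peak figure is an assumed example value; substitute the peak throughput of your own hardware and precision.

```python
def chinchilla_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Training tokens under the ~20 tokens-per-parameter rule of thumb."""
    return tokens_per_param * n_params


def training_flops(n_params: float, n_tokens: float) -> float:
    """Total training compute, C ~= 6 * N * D."""
    return 6.0 * n_params * n_tokens


def gpu_hours(flops: float, peak_tflops: float, mfu: float) -> float:
    """GPU-hours = C / (peak FLOP/s * utilization * 3600)."""
    peak_flops_per_second = peak_tflops * 1e12  # TFLOPS -> FLOP/s
    return flops / (peak_flops_per_second * mfu * 3600.0)


n = 7e9                                   # 7B parameters
d = chinchilla_tokens(n)                  # ~140B tokens
c = training_flops(n, d)                  # ~5.88e21 FLOPs
hours = gpu_hours(c, peak_tflops=312.0, mfu=0.45)  # assumed hardware and MFU
print(f"tokens={d:.3g}, flops={c:.3g}, gpu_hours={hours:,.0f}")
```

With these assumed inputs the estimate is about 11,600 GPU-hours. Dividing by the number of GPUs gives an approximate wall-clock time.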
Train tokenizer. Train a BPE tokenizer (SentencePiece or HuggingFace tokenizers) on a representative corpus sample. Vocab size: 32k–64k. Ensure coverage of code, multilingual text, and special tokens (<|begin_of_text|>, <|end_of_text|>, chat role markers).
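To make the BPE step concrete, here is a toy, pure-Python illustration of how merges are learned. It is a sketch for intuition only: it works on characters rather than bytes, splits on whitespace, and is far too slow for a real corpus, so production training should still go through SentencePiece or HuggingFace tokenizers. The function `train_bpe` and its tie-breaking rule are assumptions made for this example.

```python
from collections import Counter

SPECIAL_TOKENS = ["<|begin_of_text|>", "<|end_of_text|>"]


def train_bpe(corpus, num_merges):
    """Learn up to num_merges BPE merges from an iterable of strings."""
    # Represent each whitespace-separated word as a tuple of symbols.
    words = Counter(tuple(w) for line in corpus for w in line.split())
    base_chars = sorted({ch for word in words for ch in word})
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair wins; ties are broken by the pair itself.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite every word with the chosen pair merged into one symbol.
        merged_words = Counter()
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged_words[tuple(out)] += freq
        words = merged_words
    # Special tokens are reserved first so their ids stay stable.
    vocab = list(dict.fromkeys(SPECIAL_TOKENS + base_chars + [a + b for a, b in merges]))
    return merges, vocab


merges, vocab = train_bpe(["low low low lower lowest"], num_merges=2)
print(merges)
print(vocab)
```

Reserving the special tokens ahead of the learned vocabulary mirrors what the real trainers do when special tokens are passed in at training time.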
Configure distributed pretraining. Choose framework:
- FSDP (PyTorch native): FullyShardedDataParallel with mixed precision (bf16), activation checkpointing.
- DeepSpeed ZeRO Stage 3: partitions optimizer states, gradients, and parameters.
- Megatron-LM: tensor + pipeline parallelism for >70B models.

Learning rate: peak 3e-4, cosine decay to 3e-5, warmup over first 2000 steps. Batch size: ramp from 256 to 4M tokens over warmup.
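The learning-rate schedule described above can be sketched as a plain function of the step index. This is a minimal sketch: the linear warmup shape and the `total_steps` value are assumptions, since the text specifies only the warmup length, the peak, and the final value. The function name `lr_at` is hypothetical.

```python
import math


def lr_at(step: int, total_steps: int, peak_lr: float = 3e-4,
          min_lr: float = 3e-5, warmup_steps: int = 2000) -> float:
    """Linear warmup to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        # Assumption: linear warmup, reaching peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine


total = 100_000  # hypothetical run length
for s in (0, 1999, 2000, 51_000, total):
    print(s, f"{lr_at(s, total):.2e}")
```

The same function can be wrapped for a framework scheduler by dividing its output by the peak rate to obtain a multiplier.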
Monitor training. Log training loss, gradient norm, and learning rate every 10 steps. Evaluate perplexity on held-out validation set every 1000 steps. Run downstream benchmarks (MMLU, HellaSwag, HumanEval) at 25%, 50%, 75%, and 100% of training.
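The monitoring cadence above can be expressed as a small scheduling helper, together with the usual conversion from mean token-level negative log-likelihood (in nats) to perplexity. The helper names are hypothetical, and how the actions are carried out (logger, evaluation harness) is left to the training loop.

```python
import math


def perplexity(mean_nll: float) -> float:
    """Perplexity is exp of the mean per-token negative log-likelihood (nats)."""
    return math.exp(mean_nll)


def benchmark_steps(total_steps: int, fractions=(0.25, 0.5, 0.75, 1.0)):
    """Steps at which to run downstream benchmarks."""
    return [round(total_steps * f) for f in fractions]


def actions_for_step(step: int, total_steps: int,
                     log_every: int = 10, eval_every: int = 1000):
    """Return the monitoring actions due at this step."""
    actions = []
    if step % log_every == 0:
        actions.append("log")              # loss, gradient norm, learning rate
    if step % eval_every == 0:
        actions.append("eval_perplexity")  # held-out validation set
    if step in benchmark_steps(total_steps):
        actions.append("benchmarks")       # MMLU, HellaSwag, HumanEval
    return actions


print(benchmark_steps(100_000))
print(actions_for_step(25_000, 100_000))
print(perplexity(2.0))
```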
Post-training pipeline. After pretraining: instruction-tune with SFTTrainer (delegate to instruction-tuning skill), then align with DPO or RLHF (delegate to preference-optimization skill).
Document and release. Produce a model card: architecture, training data composition, compute used, benchmark results, known limitations, intended use, and license.
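One way to keep model cards complete is to render them from a fixed list of required sections and fail when any is missing. The sketch below assumes a plain-text card layout and a hypothetical `render_model_card` helper; the section names come from the list above, and the example values are placeholders rather than real results.

```python
REQUIRED_SECTIONS = [
    "Architecture",
    "Training data composition",
    "Compute used",
    "Benchmark results",
    "Known limitations",
    "Intended use",
    "License",
]


def render_model_card(name: str, sections: dict) -> str:
    """Render a plain-text model card, refusing to emit an incomplete one."""
    missing = [s for s in REQUIRED_SECTIONS if not sections.get(s)]
    if missing:
        raise ValueError(f"model card is missing sections: {', '.join(missing)}")
    lines = [f"Model card: {name}", ""]
    for section in REQUIRED_SECTIONS:
        lines += [section, sections[section], ""]
    return "\n".join(lines)


# Placeholder content for illustration only.
example = {s: f"(fill in: {s.lower()})" for s in REQUIRED_SECTIONS}
card = render_model_card("example-model", example)
print(card)
```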
