Guide implementation of ML models from paper to working code. Use this skill when the user wants to: implement a model architecture from a paper or description, design experiments and ablation studies, set up a training pipeline, structure ML code for iteration speed, plan which components to build first, or create a baseline before adding complexity. Trigger when the user says things like "implement this", "build this model", "code this up", "how should I structure this", "set up the training loop", "design an experiment", or "what should I ablate". Also trigger when someone has a model idea and needs help turning it into code, even if they don't use the word "implement."
You are helping someone turn an ML idea into running code. The goal is not just "code that runs" but code that produces trustworthy results you can iterate on quickly.
The single most important principle: get something running end-to-end before optimizing anything. A crude training loop that runs for 100 steps and shows decreasing loss within 10 minutes is worth infinitely more than a perfect architecture you haven't run yet.
Before implementing your fancy new idea, get a dumb baseline working. This serves three purposes: it proves the pipeline (data, training loop, evaluation) works end-to-end, it gives you a reference number to compare against, and it separates infrastructure bugs from bugs in your new technique.
For language models, the baseline is usually a small standard transformer. For vision, a ResNet or ViT. For your specific domain, whatever the simplest reasonable model is.
The baseline should be embarrassingly simple. If you're implementing ternary quantization, your baseline is the same architecture with normal float weights. If you're implementing a new attention mechanism, your baseline is standard attention. Same everything else.
Don't implement everything at once. The order matters:
Get the data flowing first. Verify shapes, dtypes, and values at every stage. Print a few samples and eyeball them. Common bugs: off-by-one label misalignment, inputs normalized twice (or not at all), silent dtype casts, and train/test leakage.
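A quick sanity check on the first batch catches most of these. This is a sketch using a stand-in TensorDataset; swap in your real dataset and loader.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset; replace with your real one.
dataset = TensorDataset(torch.randn(128, 16), torch.randint(0, 10, (128,)))
loader = DataLoader(dataset, batch_size=8, shuffle=True)

x, y = next(iter(loader))
print(f"x: shape={tuple(x.shape)}, dtype={x.dtype}, "
      f"min={x.min():.3f}, max={x.max():.3f}")
print(f"y: shape={tuple(y.shape)}, dtype={y.dtype}, unique={y.unique().tolist()}")
assert not torch.isnan(x).any(), "NaNs in inputs"
```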
Build the model and verify the forward pass works:
```python
# Always do this before training
x = torch.randn(batch_size, seq_len, dim)
y = model(x)
print(f"Input: {x.shape}, Output: {y.shape}")
assert y.shape == expected_shape
```
For each new component, test it in isolation before plugging it into the full model. If you're implementing a custom attention layer, verify it produces the same output as a known-good implementation on the same input.
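For attention specifically, PyTorch ships a known-good reference you can compare against. This sketch checks a naive implementation (a stand-in for your custom layer) against torch.nn.functional.scaled_dot_product_attention on identical inputs.

```python
import torch
import torch.nn.functional as F

def my_attention(q, k, v):
    # Naive implementation -- stand-in for your custom layer.
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    return torch.softmax(scores, dim=-1) @ v

q = torch.randn(2, 4, 8, 16)  # (batch, heads, seq, head_dim)
k = torch.randn(2, 4, 8, 16)
v = torch.randn(2, 4, 8, 16)

expected = F.scaled_dot_product_attention(q, k, v)
actual = my_attention(q, k, v)
assert torch.allclose(actual, expected, atol=1e-5), "custom attention diverges"
```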
Verify gradients flow properly:
```python
loss = criterion(model(x), targets)
loss.backward()
for name, param in model.named_parameters():
    if param.grad is None:
        print(f"WARNING: No gradient for {name}")
    elif param.grad.abs().max() == 0:
        print(f"WARNING: Zero gradient for {name}")
```
Before training on the full dataset, overfit on one batch. The model should drive training loss to near-zero on a single repeated batch. If it can't, something is wrong with the model or training loop -- don't proceed until this works.
```python
batch = next(iter(dataloader))
for step in range(200):
    loss = train_step(model, batch)
    if step % 20 == 0:
        print(f"step {step}: loss {loss:.4f}")
# Loss should be near 0 by step 200
```
Train on the full dataset for a small number of steps. Verify: loss decreases, no NaNs or infs appear, memory usage is stable, and throughput (samples or tokens per second) is in the expected range.
Only now do you run the full training. And even here, start with a shorter run (25% of total steps) before committing to the full thing.
Now that the baseline works, add your new technique as a minimal diff:
Change ONE thing at a time. If you change the architecture AND the optimizer AND the learning rate, you won't know which change helped (or hurt). Make one change, run a short experiment, verify it helps, then move on.
Keep the baseline code accessible. Use config flags, not code deletion. You want to be able to switch back to the baseline at any time.
```python
# Good: config flag
if config.use_ternary_quantization:
    weight = quantize_ternary(weight)

# Bad: delete the old code and replace it
```
Log everything. For each experiment, save: the exact config, the git hash, the random seed, the loss curves, and the final metrics and checkpoint.
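A minimal sketch of per-run logging -- the directory layout and field names here are illustrative, not prescriptive.

```python
import json
import pathlib
import subprocess

def log_run(run_name: str, config: dict, metrics: dict,
            log_root: str = "logs") -> pathlib.Path:
    run_dir = pathlib.Path(log_root) / run_name
    run_dir.mkdir(parents=True, exist_ok=True)
    try:
        git_hash = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True).strip()
    except Exception:
        git_hash = "unknown"  # not inside a git repo
    (run_dir / "run.json").write_text(json.dumps(
        {"config": config, "git_hash": git_hash, "metrics": metrics}, indent=2))
    return run_dir

out = log_run("baseline_seed42", {"lr": 3e-4, "seed": 42}, {"val_bpb": 1.15})
```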
An ablation study removes components one at a time to measure their individual contribution. This tells you which parts of your system are actually helping.
Start with your best configuration, then remove/change one thing at a time:
| Experiment | Change | Result | Delta |
|---|---|---|---|
| Full model | (baseline) | 1.15 bpb | -- |
| No XSA | Remove XSA layers | 1.16 bpb | +0.01 |
| No RoPE | Remove partial RoPE | 1.155 bpb | +0.005 |
| relu instead of relu² | Swap activation | 1.17 bpb | +0.02 |
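A table like this can be generated by a small ablation loop. In this sketch, train_short is a hypothetical stand-in for "train with this config for a short budget and return validation loss" -- here it just fakes plausible numbers so the loop structure is visible.

```python
def train_short(config: dict) -> float:
    # Stand-in for a real short training run: pretend each ablated
    # component costs a small amount of loss.
    penalty = 0.01 * (not config["use_xsa"]) + 0.005 * (not config["use_rope"])
    return 1.15 + penalty

base = {"use_xsa": True, "use_rope": True}
ablations = {
    "full": {},
    "no_xsa": {"use_xsa": False},
    "no_rope": {"use_rope": False},
}

results = {name: train_short({**base, **overrides})
           for name, overrides in ablations.items()}
for name, loss in results.items():
    print(f"{name:>8}: {loss:.3f} bpb (delta {loss - results['full']:+.3f})")
```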
If you can't afford full ablations, at least verify: your full model against the plain baseline under otherwise identical settings, and the same comparison with a second seed to gauge run-to-run noise.
When GPU time is expensive, be strategic:
Proxy metrics. Train for 10% of the total steps and use that loss as a proxy for the final loss. This isn't perfect (some techniques help more late in training), but it's usually directionally correct and 10x cheaper.
Binary search hyperparameters. Don't grid search. Pick two extreme values, test both, then test the midpoint. Repeat. Gets you within 90% of optimal in log(n) runs instead of n.
Kill early. If a run is clearly worse than the baseline after 20% of training, kill it and try something else. Don't hope it'll catch up.
Prioritize by expected impact. If you have 5 ideas and compute for 3 experiments, rank them by (expected improvement) × (probability of working) and run the top 3.
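The binary-search idea can be sketched in a few lines. Here short_run is a hypothetical stand-in for "train briefly at this learning rate and return a proxy loss" -- modeled as a toy quadratic in log-lr with its optimum at 3e-4 so the search has something to find.

```python
import math

def short_run(lr: float) -> float:
    # Toy proxy loss, minimized at lr = 3e-4 (stand-in for a real run).
    return (math.log10(lr) - math.log10(3e-4)) ** 2

lo, hi = 1e-5, 1e-2  # two extreme values
for _ in range(6):
    mid = 10 ** ((math.log10(lo) + math.log10(hi)) / 2)
    # Probe either side of the midpoint and keep the better half.
    left, right = short_run(mid * 0.5), short_run(mid * 2.0)
    if left < right:
        hi = mid
    else:
        lo = mid
best = 10 ** ((math.log10(lo) + math.log10(hi)) / 2)
print(f"best lr ~= {best:.2e}")
```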
Organize your ML code for fast iteration:
```
project/
├── config.py          # All hyperparameters in one place
├── model.py           # Model architecture
├── data.py            # Data loading and preprocessing
├── train.py           # Training loop
├── eval.py            # Evaluation
├── utils.py           # Logging, checkpointing, misc
├── experiments/       # One script per experiment variant
│   ├── baseline.sh
│   ├── ternary_v1.sh
│   └── ternary_v2.sh
└── logs/              # Training logs, organized by run
    ├── baseline_seed42/
    └── ternary_v1_seed42/
```
Config should be a single source of truth. Don't scatter hyperparameters across files. One config object, passed everywhere.
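One common way to do this is a single dataclass; the field names below are illustrative.

```python
from dataclasses import dataclass, asdict

@dataclass
class Config:
    dim: int = 512
    n_layers: int = 8
    lr: float = 3e-4
    seed: int = 42
    use_ternary_quantization: bool = False

config = Config(use_ternary_quantization=True)
print(asdict(config))  # log the exact config with every run
```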
Make runs reproducible. Set seeds, log the exact config, save the git hash. You should be able to recreate any previous run exactly.
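A minimal seed-setting helper, assuming a PyTorch stack; extend it for any other sources of randomness in your pipeline (note that full determinism on GPU may additionally require PyTorch's deterministic-algorithms settings).

```python
import random

import numpy as np
import torch

def set_seed(seed: int) -> None:
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

set_seed(42)
a = torch.randn(3)
set_seed(42)
b = torch.randn(3)
assert torch.equal(a, b)  # same seed, same draws
```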
Common pitfalls to check for:
- Forgetting model.eval() or torch.no_grad() during validation
- Accumulating the loss tensor across steps (missing .item() in the training loop)

When something doesn't work: