Calculate which LLM and embedding models fit in available memory at each quantization level. Requires a hardware profile from hardware-profiler. Use after hardware profiling to determine feasible model sizes.
Compute feasible (model_size, quantization) pairs based on available memory.
Read results/phase1/hardware_profile.json. If it doesn't exist, run the
hardware-profiler skill first.
required_memory_gb = (parameters_billions × bits_per_weight) / 8 + overhead
Where:
- parameters_billions = model parameter count (e.g., 7, 13, 34, 70, 235)
- bits_per_weight = quantization bits (4 for Q4, 5 for Q5, 6 for Q6, 8 for Q8, 16 for FP16)
- overhead = 15% of the raw weight size (covers KV-cache, activations, and framework overhead)

Write the results to results/phase1/memory_budget.json and print them as a table.
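The calculation above can be sketched as follows. This is a minimal illustration, not the skill's actual implementation: the field name available_memory_gb in the hardware profile, the specific model sizes, and the table layout are assumptions for the example.

```python
import json
from pathlib import Path

# Assumed quantization levels and model sizes from the description above.
QUANT_BITS = {"Q4": 4, "Q5": 5, "Q6": 6, "Q8": 8, "FP16": 16}
MODEL_SIZES_B = [7, 13, 34, 70, 235]  # parameter counts in billions
OVERHEAD = 0.15  # 15% of raw weight size: KV-cache, activations, framework

def required_memory_gb(params_b: float, bits: int) -> float:
    """required_memory_gb = (parameters_billions * bits_per_weight) / 8 + overhead."""
    raw = params_b * bits / 8  # raw weight size in GB
    return raw * (1 + OVERHEAD)

def feasible_pairs(available_gb: float) -> list[dict]:
    """Return every (model_size, quantization) pair with its memory verdict."""
    rows = []
    for params_b in MODEL_SIZES_B:
        for quant, bits in QUANT_BITS.items():
            need = required_memory_gb(params_b, bits)
            rows.append({
                "model_b": params_b,
                "quant": quant,
                "required_gb": round(need, 2),
                "fits": need <= available_gb,
            })
    return rows

if __name__ == "__main__":
    profile_path = Path("results/phase1/hardware_profile.json")
    if profile_path.exists():
        profile = json.loads(profile_path.read_text())
        available = profile["available_memory_gb"]  # assumed field name
        rows = feasible_pairs(available)
        out_path = Path("results/phase1/memory_budget.json")
        out_path.parent.mkdir(parents=True, exist_ok=True)
        out_path.write_text(json.dumps(rows, indent=2))
        print(f"{'Model (B)':>9}  {'Quant':>5}  {'Req. GB':>8}  Fits")
        for r in rows:
            print(f"{r['model_b']:>9}  {r['quant']:>5}  "
                  f"{r['required_gb']:>8.2f}  {'yes' if r['fits'] else 'no'}")
```

For instance, a 7B model at Q4 needs (7 × 4) / 8 = 3.5 GB of raw weights, or about 4.03 GB after the 15% overhead.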