Expert guidance for Vision-Language-Action (VLA) robot foundation models — covering architecture design, training pipelines, data strategy, deployment, and evaluation. Use when (1) designing or implementing a generalist robot policy (VLA model), (2) setting up pre-training or fine-tuning pipelines for robot manipulation, (3) choosing action representations (flow matching vs. diffusion vs. autoregressive), (4) structuring multi-embodiment robot datasets, (5) evaluating dexterous manipulation tasks, (6) implementing action chunking or high-level policy decomposition. Based on the π0 architecture (Physical Intelligence, 2024).
Expert guidance for building generalist robot policies using Vision-Language-Action (VLA) flow models, based on the π0 architecture.
π0 model = VLM backbone + action expert + flow matching
| Component | Detail |
|---|---|
| VLM backbone | PaliGemma (3B) — provides visual + language understanding |
| Action expert | Separate transformer weights (~300M) for robot state + actions |
| Total params | ~3.3B |
| Action output | Chunks of H = 50 actions, executed at 50 Hz or 20 Hz depending on the robot |
| Inference speed | ~73 ms on RTX 4090 |
See references/architecture.md for full technical details (attention masks, flow matching math, MoE design).
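At inference time, flow matching turns Gaussian noise into an action chunk by integrating a learned velocity field for a handful of Euler steps. A minimal sketch, with a toy stand-in for the network (the dimensions, step count, and `predicted_velocity` function here are illustrative assumptions, not the actual π0 implementation):

```python
import numpy as np

H, D = 50, 18      # chunk length and padded action dimension (values assumed)
NUM_STEPS = 10     # number of Euler integration steps (assumed)

def predicted_velocity(noisy_actions, tau, obs):
    """Stand-in for the action expert's learned velocity field v_theta.

    A real model conditions on images, language, and robot state via the
    VLM backbone; this toy field just flows every chunk toward zero.
    """
    return -noisy_actions

def sample_action_chunk(obs, rng):
    """Integrate the flow from Gaussian noise (tau=0) toward actions (tau=1)."""
    a = rng.standard_normal((H, D))   # start from pure noise
    dt = 1.0 / NUM_STEPS
    tau = 0.0
    for _ in range(NUM_STEPS):
        a = a + dt * predicted_velocity(a, tau, obs)
        tau += dt
    return a

chunk = sample_action_chunk(obs=None, rng=np.random.default_rng(0))
```

Because the whole chunk of 50 actions is produced in one pass of a few integration steps, inference stays fast enough for high-frequency control.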
Two-phase approach (mirrors LLM training):

1. **Pre-training:** a large, diverse, cross-embodiment data mixture to build broad physical competence.
2. **Fine-tuning (post-training):** smaller, high-quality, task-specific demonstrations to specialize the policy.
Key rule: combining both phases outperforms either alone. Pre-training gives robustness; fine-tuning gives precision.
See references/training.md for data mixture ratios, loss functions, and fine-tuning dataset sizing.
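A common way to combine data sources of very different sizes is to temper the raw counts with an exponent below 1, so the largest datasets still dominate but do not drown out small high-quality ones. A sketch of sampling from such a mixture (the dataset names, sizes, and the exponent `ALPHA` are all illustrative assumptions, not the actual π0 ratios):

```python
import numpy as np

# Hypothetical trajectory counts per data source (illustrative numbers).
sizes = {"dexterous_own": 9_000, "oxe_a": 120_000, "oxe_b": 30_000}
ALPHA = 0.5  # tempering exponent < 1 down-weights the largest sources (assumed)

names = list(sizes)
weights = np.array([sizes[n] ** ALPHA for n in names])
probs = weights / weights.sum()

def sample_source(rng):
    """Choose which dataset the next training batch is drawn from."""
    return names[rng.choice(len(names), p=probs)]

rng = np.random.default_rng(0)
counts = {n: 0 for n in names}
for _ in range(10_000):
    counts[sample_source(rng)] += 1
# "oxe_a" still dominates, but well below its raw ~75% share of trajectories.
```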
Use flow matching, not autoregressive discretization: continuous flow outputs preserve the precision that high-frequency dexterous control needs, whereas discretizing actions into tokens loses resolution and makes decoding a 50-step chunk slow.
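The training objective is simple to state: interpolate linearly between noise and the demonstrated action chunk, and regress the network onto the path's constant velocity. A minimal sketch using a standard conditional flow matching formulation (the timestep distribution and dimensions here are generic assumptions, not π0's exact schedule):

```python
import numpy as np

def flow_matching_loss(velocity_model, actions, rng):
    """Conditional flow matching loss for one action chunk.

    Linear path x_tau = (1 - tau) * eps + tau * actions, whose constant
    target velocity is (actions - eps); the network regresses onto it.
    """
    eps = rng.standard_normal(actions.shape)   # noise endpoint of the path
    tau = rng.uniform()                        # random timestep in [0, 1)
    x_tau = (1 - tau) * eps + tau * actions
    target = actions - eps
    pred = velocity_model(x_tau, tau)
    return float(np.mean((pred - target) ** 2))

rng = np.random.default_rng(0)
actions = rng.standard_normal((50, 18))        # one H = 50 action chunk
# A predictor that always outputs zero scores about E[(a - eps)^2] = 2.
loss = flow_matching_loss(lambda x, t: np.zeros_like(x), actions, rng)
```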
Single model handles 7+ robot configurations via:

- A shared state/action space: lower-dimensional embodiments are zero-padded up to the maximum dimensionality.
- Training one set of weights on a cross-embodiment mixture, so every platform shares the same backbone and action expert.
See references/embodiments.md for robot platform specs and action space details.
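The padding idea can be sketched in a few lines: every robot's vectors are widened to a common size so the model always sees one tensor shape (the width `MAX_DIM = 18` and the example DoF counts are assumptions for illustration):

```python
import numpy as np

MAX_DIM = 18  # width of the largest embodiment's action/state space (assumed)

def pad_to_max(vec, max_dim=MAX_DIM):
    """Zero-pad one robot's state or action vector to the shared width.

    Every embodiment then presents the model with the same tensor shape;
    unused trailing dimensions simply stay at zero.
    """
    vec = np.asarray(vec, dtype=np.float32)
    out = np.zeros(max_dim, dtype=np.float32)
    out[: vec.shape[0]] = vec
    return out

single_arm = pad_to_max(np.ones(7))   # e.g. a 6-DoF arm plus gripper
bimanual = pad_to_max(np.ones(14))    # e.g. two arms plus two grippers
```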
For long-horizon tasks, use a two-tier approach:

- A high-level policy (a VLM) decomposes the task into intermediate language subtasks (e.g. "pick up the napkin").
- The low-level π0 policy executes each subtask as chunks of motor actions.
Analogous to SayCan. Intermediate language commands significantly boost performance vs. flat task descriptions.
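The two tiers compose into a simple control loop. A sketch with toy stand-ins for both policies and the environment (all names and interfaces here are hypothetical, not from any π0 codebase):

```python
# Hypothetical interfaces: a high-level policy maps (observation, task) to a
# language subtask; a low-level VLA policy maps (observation, subtask) to a
# chunk of motor actions.

def run_long_horizon(task, env, high_level, low_level, max_subtasks=20):
    """Two-tier control loop: language subtasks on top, action chunks below."""
    for _ in range(max_subtasks):
        obs = env.observe()
        subtask = high_level(obs, task)         # intermediate language command
        if subtask == "done":
            return True
        for action in low_level(obs, subtask):  # e.g. a chunk of actions
            env.step(action)
    return False

# Toy stand-ins to exercise the loop:
class ToyEnv:
    def __init__(self):
        self.items_left = 2
        self._steps = 0
    def observe(self):
        return self.items_left
    def step(self, action):
        # Toy dynamics: each full chunk of 5 actions clears one item.
        self._steps += 1
        if self._steps % 5 == 0:
            self.items_left = max(0, self.items_left - 1)

def toy_high(obs, task):
    return "done" if obs == 0 else "clear one item"

def toy_low(obs, subtask):
    return [0.0] * 5  # a fixed 5-action chunk

solved = run_long_horizon("clear the table", ToyEnv(), toy_high, toy_low)
```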
π0 has been extended and complemented by several key works; see references/related-work.md for the full landscape.
When evaluating a robot manipulation policy:

- Run many trials per task and report success rates with uncertainty, not single rollouts.
- Score long-horizon tasks with partial credit for completed stages, not only binary success.
- Include unseen objects and positions to probe generalization, not just the training setup.
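With the small trial counts typical of real-robot evaluation (often 10-20 rollouts per task), point estimates of success rate are noisy, so interval estimates are worth reporting. A sketch using the standard Wilson score interval (a generic statistical tool, not something prescribed by π0):

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """95% Wilson score interval for a Bernoulli success rate.

    Better behaved than the normal approximation when trials are few
    or the observed rate is near 0 or 1.
    """
    if trials == 0:
        return (0.0, 1.0)
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = (z / denom) * math.sqrt(
        p * (1 - p) / trials + z * z / (4 * trials ** 2)
    )
    return (center - half, center + half)

lo, hi = wilson_interval(7, 10)  # 7 successes in 10 trials
```

For 7/10 successes the interval spans roughly 0.40 to 0.89, a useful reminder of how little 10 trials pin down.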