Test system, CI pipeline, and CI failure investigation for Megatron-LM. Covers test layout, recipe YAML structure, adding unit and functional tests, CI scope labels, triggering internal GitLab CI, pipeline structure, and debugging CI failures.
tests/
├── unit_tests/                # pytest, 1 node × 8 GPUs, torch.distributed runner
├── functional_tests/          # end-to-end shell + training scripts
│   └── test_cases/
│       └── {model}/{test_case}/
│           ├── model_config.yaml                    # training args
│           └── golden_values_{env}_{platform}.json
└── test_utils/
    ├── recipes/
    │   ├── h100/              # YAML recipes for H100 jobs
    │   └── gb200/             # YAML recipes for GB200 jobs
    └── python_scripts/        # helpers (recipe_parser, golden-value download, …)
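For illustration, a functional test case's model_config.yaml holds the environment and training arguments for that run. The fragment below is a hypothetical sketch; the actual keys and argument names come from the real test cases and Megatron-LM's CLI:

```yaml
# Hypothetical model_config.yaml sketch -- keys and values are
# illustrative, not copied from an actual test case.
ENV_VARS:
  CUDA_DEVICE_MAX_CONNECTIONS: 1
MODEL_ARGS:
  --num-layers: 12
  --hidden-size: 512
  --micro-batch-size: 4
  --train-iters: 50
```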
The GitHub Actions runner invokes launch_nemo_run_workload.py, which uses
nemo-run to launch a DockerExecutor container. The repo is bind-mounted
at /opt/megatron-lm; training data is mounted at /mnt/artifacts.
Unit tests are dispatched through torch.distributed.run; logs land in
{assets_dir}/logs/1/ and are uploaded as a GitHub artifact after the run.
Functional tests are driven by
tests/functional_tests/shell_test_utils/run_ci_test.sh. Only rank 0 runs the
pytest validation step; training output from all ranks is uploaded as an artifact.
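The rank-0-only validation can be sketched as below. This is an illustrative pattern, not the actual logic of run_ci_test.sh; the reliance on the RANK environment variable (set per process by torch.distributed) is the one assumption it makes:

```python
import os
import subprocess

def validate_on_rank_zero(test_path: str) -> int:
    """Run the pytest validation step on rank 0 only (sketch).

    Every rank runs training, but only rank 0 compares results against
    golden values. RANK is set by torch.distributed for each process.
    """
    rank = int(os.environ.get("RANK", "0"))
    if rank != 0:
        # Non-zero ranks skip validation; their training output is
        # still collected and uploaded as an artifact.
        return 0
    return subprocess.call(["pytest", "-s", test_path])
```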
Flaky-failure auto-retry: launch_nemo_run_workload.py retries up to
3 times for known transient patterns (NCCL timeout, ECC error, segfault,
HuggingFace connectivity, …) before declaring a genuine failure.
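The retry loop can be sketched as follows; the pattern list, function names, and (ok, log) launch interface are illustrative, not the ones used in launch_nemo_run_workload.py:

```python
import re

# Illustrative subset of transient-failure signatures; the real list in
# launch_nemo_run_workload.py is longer.
TRANSIENT_PATTERNS = [
    r"NCCL timeout",
    r"uncorrectable ECC error",
    r"Segmentation fault",
    r"Connection to huggingface\.co",
]

MAX_ATTEMPTS = 3

def is_transient(log_text: str) -> bool:
    """True if the failure log matches a known flaky signature."""
    return any(re.search(p, log_text, re.IGNORECASE) for p in TRANSIENT_PATTERNS)

def run_with_retries(launch, max_attempts: int = MAX_ATTEMPTS) -> bool:
    """Retry launch() -- returning (ok, log) -- on transient failures only."""
    for attempt in range(1, max_attempts + 1):
        ok, log = launch()
        if ok:
            return True
        if not is_transient(log) or attempt == max_attempts:
            return False  # genuine failure, or retries exhausted
    return False
```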
Recipes live in tests/test_utils/recipes/ and are parsed by
tests/test_utils/python_scripts/recipe_parser.py. Each file expands its
products block into the Cartesian product of individual workload specs:
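The expansion behaves roughly like itertools.product over the axes of each products entry. The recipe shape below is a hypothetical sketch, not recipe_parser.py's actual schema:

```python
import itertools

def expand_products(recipe: dict) -> list[dict]:
    """Expand a recipe's products block into individual workload specs.

    Sketch: each entry in `products` maps axis names to lists of values;
    one spec is emitted per element of their Cartesian product.
    """
    specs = []
    for entry in recipe.get("products", []):
        axes = sorted(entry)  # deterministic ordering of axis names
        for values in itertools.product(*(entry[a] for a in axes)):
            specs.append(dict(zip(axes, values)))
    return specs

# Hypothetical recipe fragment with two axes: 2 x 2 = 4 workload specs.
recipe = {
    "products": [
        {"environment": ["dev", "lts"], "platform": ["h100", "gb200"]},
    ]
}
```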