Validate and use packed sequences and long-context training in Megatron-Bridge. Distinguish offline packed SFT for LLMs from in-batch packing for VLMs, and apply the correct CP constraints. Use when the user asks about packed sequences, sequence packing, long-context training, PackedSequenceSpecs, pack_sequences_in_batch, or CP with packing.
For stable background and recommendation level, see:
docs/training/packed-sequences.md
card.yaml (co-located)

Offline packed SFT for LLM finetuning:
from megatron.bridge.data.datasets.packed_sequence import PackedSequenceSpecs
cfg.train.micro_batch_size = 1
cfg.dataset.seq_length = 4096
cfg.model.seq_length = 4096
cfg.dataset.dataset_kwargs = {"pad_to_max_length": True}
cfg.dataset.packed_sequence_specs = PackedSequenceSpecs(
packed_sequence_size=4096,
pad_seq_to_mult=1,
)
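For intuition, offline packing fits variable-length tokenized examples into fixed-size packs of packed_sequence_size tokens. The sketch below is an illustrative greedy first-fit-decreasing packer, not Megatron-Bridge's actual packing code; the function name is hypothetical.

```python
# Illustrative sketch only (NOT the library's algorithm): greedy
# first-fit-decreasing bin packing of sequence lengths into packs.
def pack_sequences(lengths, pack_size):
    """Place each sequence (longest first) into the first pack with room."""
    packs = []  # each pack is a list of sequence lengths
    for n in sorted(lengths, reverse=True):
        for pack in packs:
            if sum(pack) + n <= pack_size:
                pack.append(n)
                break
        else:
            packs.append([n])  # no existing pack fits; open a new one
    return packs

packs = pack_sequences([1500, 3000, 1000, 2500, 600], pack_size=4096)
# every pack stays within the packed_sequence_size budget
assert all(sum(p) <= 4096 for p in packs)
```

With micro_batch_size = 1, each micro-batch then consumes one such pack, which is why the packed path pins the micro-batch size to 1.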
If CP is enabled:
cfg.model.context_parallel_size = 2
cfg.model.calculate_per_token_loss = True
cfg.ddp.average_in_collective = False
cfg.dataset.packed_sequence_specs.pad_seq_to_mult = cfg.model.context_parallel_size * 2
# If sequence_parallel is also enabled, use lcm(2*CP, CP*TP):
# import math
# cfg.dataset.packed_sequence_specs.pad_seq_to_mult = math.lcm(2 * CP, CP * TP)
# See src/megatron/bridge/training/vlm_step.py for reference logic.
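The two padding rules above can be collapsed into one small helper. This is a sketch under the stated rules (the helper name is hypothetical, not a library API): CP alone requires a multiple of 2*CP, and CP plus sequence parallelism requires lcm(2*CP, CP*TP).

```python
import math

def pad_multiple(cp: int, tp: int, sequence_parallel: bool) -> int:
    """Pad multiple for packed sequences under CP (hypothetical helper)."""
    if sequence_parallel:
        # both the 2*CP and CP*TP divisibility constraints must hold
        return math.lcm(2 * cp, cp * tp)
    return 2 * cp  # CP alone: pad each sequence to a multiple of 2*CP

assert pad_multiple(2, 1, False) == 4   # CP=2, no SP
assert pad_multiple(2, 4, True) == 8    # lcm(4, 8) = 8
```

The result would be assigned to cfg.dataset.packed_sequence_specs.pad_seq_to_mult.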
If CUDA graphs are enabled for this packed path:
cfg.dataset.packed_sequence_specs.pad_cu_seqlens = True
cfg.dataset.dataset_kwargs["pad_to_max_length"] = True
Note: pad_cu_seqlens = True also requires a metadata JSON file alongside
the packed dataset (asserted in src/megatron/bridge/data/datasets/sft.py).
Custom packed datasets that omit the metadata file will hit an assertion at
dataset initialization.
In-batch packing for VLM finetuning:
cfg.dataset.pack_sequences_in_batch = True
cfg.train.micro_batch_size = 2
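For background on what in-batch packing produces: sequences in a micro-batch are concatenated and described by cumulative boundary offsets (cu_seqlens), the format varlen attention kernels consume. The sketch below shows the assumed mechanics only, not the library's implementation.

```python
import itertools

def cu_seqlens(lengths):
    """Cumulative offsets marking where each packed sequence starts/ends
    in the concatenated token stream (illustrative sketch)."""
    return [0] + list(itertools.accumulate(lengths))

# three sequences packed back-to-back into one stream of 1792 tokens
assert cu_seqlens([512, 1024, 256]) == [0, 512, 1536, 1792]
```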
Long-context baseline:
cfg.model.seq_length = 16384
cfg.dataset.seq_length = 16384
cfg.model.context_parallel_size = 2
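Megatron-style context parallelism typically splits each sequence into 2*CP chunks for causal load balancing, so the sequence length should divide evenly under the same rules used for padding above. A hedged sanity-check sketch (the helper name is hypothetical, and the divisibility rule is an assumption from the padding guidance above):

```python
import math

def seq_length_ok(seq_length, cp, tp=1, sequence_parallel=False):
    """Check the assumed divisibility rule for long-context CP configs."""
    divisor = 2 * cp  # CP splits sequences into 2*CP chunks
    if sequence_parallel:
        divisor = math.lcm(divisor, cp * tp)  # add the CP*TP constraint
    return seq_length % divisor == 0

assert seq_length_ok(16384, cp=2)        # 16384 % 4 == 0
assert not seq_length_ok(16383, cp=2)    # odd-sized sequence fails
```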
LLM packed SFT config surface:
if packed_sequence:
dataset_kwargs = {"pad_to_max_length": True}
packed_sequence_specs = PackedSequenceSpecs(packed_sequence_size=seq_length, pad_seq_to_mult=pad_seq_to_mult)