Adds PyTorch FSDP2 (fully_shard) to training scripts with correct init, sharding, mixed precision/offload config, and distributed checkpointing. Use when models exceed single-GPU memory or when you need DTensor-based sharding with DeviceMesh.
# Using FSDP2 (`fully_shard`) correctly in a training script

This skill teaches a coding agent how to add PyTorch FSDP2 to a training loop with correct initialization, sharding, mixed precision/offload configuration, and checkpointing.
FSDP2 in PyTorch is exposed primarily via `torch.distributed.fsdp.fully_shard` and the `FSDPModule` methods it adds in-place to modules. See: `references/pytorch_fully_shard_api.md`, `references/pytorch_fsdp2_tutorial.md`.
Use FSDP2 when:

- the model does not fit in single-GPU memory, or
- you need DTensor-based sharding with `DeviceMesh` (e.g., to compose with other parallelisms).

Avoid (or be careful) if:

- the model trains comfortably on one GPU, where plain DDP is simpler, or
- you depend on FSDP1-specific APIs with no FSDP2 equivalent.

Reference: `references/pytorch_ddp_notes.md`, `references/pytorch_fsdp1_api.md`.
- Launch with `torchrun` and set the CUDA device per process (usually via `LOCAL_RANK`).
- Apply `fully_shard()` bottom-up, i.e., shard submodules (e.g., Transformer blocks) before the root module.
- Call `model(input)`, not `model.forward(input)`, so the FSDP2 hooks run (unless you explicitly `unshard()` or register the forward method).
- Construct the optimizer after sharding (i.e., after `fully_shard`).
- Do not use `torch.save(model.state_dict())` unless you deliberately gather to full tensors.

(Each of these rules is directly described in the official API docs/tutorial; see references.)
Launch with `torchrun --nproc_per_node <gpus_per_node> ...` and ensure `RANK`, `WORLD_SIZE`, and `LOCAL_RANK` are visible.

Reference: `references/pytorch_fsdp2_tutorial.md` (launch commands and setup), `references/pytorch_fully_shard_api.md` (user contract).
Minimal, correct pattern:

- `dist.init_process_group(backend="nccl")`
- `torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))`
- create a `DeviceMesh` to describe the data-parallel group(s)

Reference: `references/pytorch_device_mesh_tutorial.md` (why DeviceMesh exists & how it manages process groups).
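A minimal sketch of this init pattern, assuming a `torchrun` launch that sets `RANK`, `WORLD_SIZE`, and `LOCAL_RANK`; the `init_distributed` helper name is illustrative, not part of any API:

```python
import os

import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh


def init_distributed():
    # torchrun sets RANK/WORLD_SIZE/LOCAL_RANK; NCCL is the GPU backend.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    # A 1-D mesh over all ranks describes the data-parallel group.
    mesh = init_device_mesh("cuda", (dist.get_world_size(),))
    return mesh
```

The mesh is optional for plain data parallelism (`fully_shard` builds a default one), but constructing it explicitly makes later multi-dim parallelism easier.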
For big models, initialize on meta, apply sharding, then materialize weights on GPU:

1. `with torch.device("meta"): model = ...`
2. Apply `fully_shard(...)` on submodules, then `fully_shard(model)`.
3. `model.to_empty(device="cuda")`
4. Run `model.reset_parameters()` (or your init routine).

Reference: `references/pytorch_fsdp2_tutorial.md` (the migration guide shows this flow explicitly).
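The four steps above as a sketch; `Transformer`, `config`, `.layers`, and the `reset_parameters()` init routine are placeholders for your own model:

```python
import torch
from torch.distributed.fsdp import fully_shard

# 1. Build on meta: no real memory is allocated yet.
with torch.device("meta"):
    model = Transformer(config)  # placeholder model class

# 2. Shard submodules first, then the root.
for block in model.layers:
    fully_shard(block)
fully_shard(model)

# 3. Allocate (uninitialized) sharded storage on the local GPU.
model.to_empty(device="cuda")

# 4. Initialize weights; each rank initializes its own shard.
model.reset_parameters()
```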
Apply `fully_shard()` bottom-up (wrapping policy = "apply where needed"). Do not only call `fully_shard` on the topmost module.
Recommended sharding pattern for transformer-like models:

- for each block: `if isinstance(m, TransformerBlock): fully_shard(m, ...)`
- then: `fully_shard(model, ...)`

Why: `fully_shard` forms "parameter groups" for collective efficiency and excludes params already grouped by earlier calls. Bottom-up gives better overlap and lower peak memory.

Reference: `references/pytorch_fully_shard_api.md` (bottom-up requirement and why).
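The pattern as code, assuming the model's blocks are instances of a hypothetical `TransformerBlock` class and `mesh` comes from the init step:

```python
from torch.distributed.fsdp import fully_shard

# Bottom-up: group each block's parameters first, so the final root call
# only picks up the leftovers (embeddings, final norm/head).
for module in model.modules():
    if isinstance(module, TransformerBlock):  # placeholder block class
        fully_shard(module, mesh=mesh)
fully_shard(model, mesh=mesh)
```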
Tune `reshard_after_forward` for memory/perf trade-offs.

Default behavior: `None` means `True` for non-root modules and `False` for the root module (a good default).

Heuristics:

- Memory-constrained: set `True` on many blocks.
- Speed-prioritized with memory to spare: set `False`.
- Pass an `int` to reshard to a smaller mesh after forward (e.g., intra-node) if it's a meaningful divisor.

Reference: `references/pytorch_fully_shard_api.md` (full semantics).
FSDP2 uses:

- `mp_policy=MixedPrecisionPolicy(param_dtype=..., reduce_dtype=..., output_dtype=..., cast_forward_inputs=...)`
- `offload_policy=CPUOffloadPolicy()` if you want CPU offload

Rules of thumb: keep `reduce_dtype` aligned with your gradient-reduction expectations (e.g., reduce in fp32 even when computing in bf16).

Reference: `references/pytorch_fully_shard_api.md` (`MixedPrecisionPolicy` / `OffloadPolicy` classes).
For gradient accumulation, toggle per-step reduction with the `FSDPModule` method `set_requires_gradient_sync` instead of FSDP1's `no_sync()`.

Gradient clipping: the standard `torch.nn.utils.clip_grad_norm_` works on the sharded DTensor gradients.

Reference: `references/pytorch_fsdp2_tutorial.md`.
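A sketch of both patterns together, assuming a sharded `model`, an `optimizer` built after sharding, a `micro_batches` iterable, and a toy loss:

```python
import torch

# Only reduce-scatter gradients on the last micro-batch; earlier
# micro-batches accumulate into the local sharded .grad.
for i, micro_batch in enumerate(micro_batches):
    is_last = i == len(micro_batches) - 1
    model.set_requires_gradient_sync(is_last)  # FSDPModule method
    loss = model(micro_batch).sum()            # toy loss for illustration
    loss.backward()

# Clipping operates directly on the DTensor gradients.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
optimizer.zero_grad()
```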
Two recommended approaches:

A) Distributed Checkpoint (DCP) — best default.

B) Distributed state dict helpers:

- `get_model_state_dict` / `set_model_state_dict` with `StateDictOptions(full_state_dict=True, cpu_offload=True, broadcast_from_rank0=True, ...)`
- `get_optimizer_state_dict` / `set_optimizer_state_dict`

Avoid: plain `torch.save` unless you intentionally convert with `DTensor.full_tensor()` and manage memory carefully.

References:

- `references/pytorch_dcp_overview.md` (DCP behavior and caveats)
- `references/pytorch_dcp_recipe.md` and `references/pytorch_dcp_async_recipe.md` (end-to-end usage)
- `references/pytorch_fsdp2_tutorial.md` (DTensor vs DCP state-dict flows)
- `references/pytorch_examples_fsdp2.md` (working checkpoint scripts)

End-to-end workflow:

1. Launch with `torchrun` and initialize the process group.
2. Set the device from `LOCAL_RANK`; create a `DeviceMesh` if you need multi-dim parallelism.
3. Build the model (on `meta` if needed), apply `fully_shard` bottom-up, then `fully_shard(model)`.
4. Run training steps as `model(inputs)` so hooks run; use `set_requires_gradient_sync` for accumulation.
5. Checkpoint with `torch.distributed.checkpoint` helpers.

Reference: `references/pytorch_fsdp2_tutorial.md`, `references/pytorch_fully_shard_api.md`, `references/pytorch_device_mesh_tutorial.md`, `references/pytorch_dcp_recipe.md`.
DCP save/load flow:

1. Implement `Stateful` or assemble state via `get_state_dict`.
2. Call `dcp.save(...)` from all ranks to a shared path.
3. On resume, call `dcp.load(...)` and restore with `set_state_dict`.

Reference: `references/pytorch_dcp_recipe.md`.
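A sketch of that flow with the `get_state_dict`/`set_state_dict` helpers; the checkpoint path is illustrative, and `model`/`optimizer` come from the training setup:

```python
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint.state_dict import get_state_dict, set_state_dict

# Save: every rank participates; the path must be on shared storage.
model_sd, optim_sd = get_state_dict(model, optimizer)
dcp.save({"model": model_sd, "optim": optim_sd}, checkpoint_id="ckpt/step_100")

# Load: rebuild model/optimizer the same way first, then restore in place.
model_sd, optim_sd = get_state_dict(model, optimizer)
dcp.load({"model": model_sd, "optim": optim_sd}, checkpoint_id="ckpt/step_100")
set_state_dict(
    model,
    optimizer,
    model_state_dict=model_sd,
    optim_state_dict=optim_sd,
)
```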
Troubleshooting checklist:

- Device/launch: check `torch.cuda.set_device(LOCAL_RANK)` and your `torchrun` flags.
- Calling `forward()` directly? Use `model(input)` or explicitly `unshard()` / register the forward method.
- Is `fully_shard()` applied bottom-up?
- Checkpointing: avoid `torch.save` unless you understand the conversions.

Common fixes:

- Call `model(inputs)` (or `unshard()` explicitly) instead of `model.forward(...)`.
- Audit your `fully_shard` calls.
- Apply `fully_shard` bottom-up on submodules before the root.
- For memory pressure, set `reshard_after_forward=True` for more modules.
- Use `set_requires_gradient_sync` instead of FSDP1's `no_sync()`.

Reference: `references/pytorch_fully_shard_api.md`, `references/pytorch_fsdp2_tutorial.md`.
The coding agent should implement a script with these labeled blocks:

- `init_distributed()`: init process group, set device
- `build_model_meta()`: model on meta, apply `fully_shard`, materialize weights
- `build_optimizer()`: optimizer created after sharding
- `train_step()`: forward/backward/step with `model(inputs)` and DTensor-aware patterns
- `checkpoint_save/load()`: DCP or distributed state dict helpers

Concrete examples live in `references/pytorch_examples_fsdp2.md` and the official tutorial reference.
- `references/pytorch_fsdp2_tutorial.md`
- `references/pytorch_fully_shard_api.md`
- `references/pytorch_ddp_notes.md`
- `references/pytorch_fsdp1_api.md`
- `references/pytorch_device_mesh_tutorial.md`
- `references/pytorch_tp_tutorial.md`
- `references/pytorch_dcp_overview.md`
- `references/pytorch_dcp_recipe.md`
- `references/pytorch_dcp_async_recipe.md`
- `references/pytorch_examples_fsdp2.md`
- `references/torchtitan_fsdp_notes.md` (optional, production notes)
- `references/ray_train_fsdp2_example.md` (optional, integration example)