Name: Debug Gradient Flow
Author: deepmodeling

def check_grad(trainer, label_overrides=None):
    trainer.wrapper.train()
    trainer.optimizer.zero_grad(set_to_none=True)
    inp, lab = trainer.get_data(is_train=True)
    lr = trainer.scheduler.get_last_lr()[0]

    # Override labels to isolate a single loss component
    if label_overrides:
        lab.update(label_overrides)

    _, loss, more_loss = trainer.wrapper(**inp, cur_lr=lr, label=lab)
    loss.backward()

    status = {}
    for name, p in trainer.wrapper.named_parameters():
        if p.requires_grad:
            # Convert to a plain bool so the status is not a 0-dim (possibly GPU) tensor
            has_grad = p.grad is not None and bool(p.grad.abs().sum() > 0)
            status[name] = has_grad
    return status

# Label overrides per scenario: a find_* flag of 0.0 switches that loss term off
scenarios = {
    "energy only": {"find_force": 0.0, "find_virial": 0.0},
    "force only": {"find_energy": 0.0, "find_virial": 0.0},
    "virial only": {
        "find_energy": 0.0,
        "find_force": 0.0,
        "virial": torch.randn(nframes, 9, ...),  # inject if data lacks virial
        "find_virial": 1.0,
    },
    "all losses": {
        "virial": torch.randn(nframes, 9, ...),
        "find_virial": 1.0,
    },
}
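
The scenarios can be driven through the probe in one loop and reduced to `with_grad/total` counts. The following is a minimal sketch, not part of the original skill: `summarize`, `run_scenarios`, `fake_check`, and the parameter names `w1`, `w2`, `bias_out` are hypothetical, and `fake_check` is only a stand-in for the real probe so that the snippet runs without a trainer.

```python
def summarize(status):
    """Format a {param_name: bool} map as 'with_grad/total'."""
    return f"{sum(bool(v) for v in status.values())}/{len(status)}"


def run_scenarios(check, scenarios):
    """Call check(overrides) once per scenario and return {scenario: 'n/m'}."""
    return {name: summarize(check(overrides)) for name, overrides in scenarios.items()}


def fake_check(overrides):
    # Stand-in probe for demonstration only (assumption): pretends that the
    # made-up parameter "bias_out" gets a gradient only while the energy
    # loss is active.
    energy_on = overrides.get("find_energy", 1.0) > 0
    return {"w1": True, "w2": True, "bias_out": energy_on}


demo_scenarios = {
    "energy only": {"find_force": 0.0},
    "force only": {"find_energy": 0.0},
}
result = run_scenarios(fake_check, demo_scenarios)
for name, counts in result.items():
    print(f"{name + ':':<20} {counts}")
```

With a real trainer, the stand-in would be replaced by something like `lambda overrides: check_grad(trainer, overrides)`, run once for the uncompiled and once for the compiled model.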

                       Uncompiled  Compiled
energy only:           22/22       22/22
force only:            20/22       16/22    <-- problem
virial only:           20/22       16/22    <-- problem
all losses:            22/22       22/22    <-- OK in practice

print(f"{'Parameter':<60} {'Uncompiled':>10} {'Compiled':>10}")
for name in sorted(status_uncompiled):
    uc = "GRAD" if status_uncompiled[name] else "-"
    cc = "GRAD" if status_compiled[name] else "-"
    marker = " <-- DIFF" if uc != cc else ""
    print(f"{name:<60} {uc:>10} {cc:>10}{marker}")
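
When many parameters differ, grouping them by module prefix can make the affected sub-network easier to spot. A minimal sketch, assuming dotted parameter names as returned by `named_parameters()`; the helper `diff_by_module`, its `depth` argument, and the example parameter names are hypothetical.

```python
from collections import defaultdict


def diff_by_module(status_uncompiled, status_compiled, depth=2):
    """Group parameters whose gradient status differs by the first
    `depth` components of their dotted name."""
    groups = defaultdict(list)
    for name in sorted(status_uncompiled):
        uc = bool(status_uncompiled[name])
        cc = bool(status_compiled.get(name, False))
        if uc != cc:
            prefix = ".".join(name.split(".")[:depth])
            groups[prefix].append(name)
    return dict(groups)


# Made-up statuses, for illustration only.
uncompiled = {
    "model.fitting.layer0.w": True,
    "model.fitting.layer0.b": True,
    "model.descriptor.embed.w": True,
}
compiled = {
    "model.fitting.layer0.w": False,
    "model.fitting.layer0.b": False,
    "model.descriptor.embed.w": True,
}
diffs = diff_by_module(uncompiled, compiled)
for prefix, names in diffs.items():
    print(f"{prefix}: {len(names)} parameter(s) differ")
```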

Layer	What to try	What it tests
`make_fx` only (no `torch.compile`)	Replace `torch.compile(traced, ...)` with just `traced`	Is `make_fx` the problem or `torch.compile`?
Different `torch.compile` backends	Try `eager`, `aot_eager`, `inductor`	Which backend breaks gradients?
`model.train()` vs `model.eval()` during tracing	Toggle training mode before `make_fx`	Does `create_graph=self.training` get the wrong value?
`coord.requires_grad_(True)` placement	Check if coord has grad before entering compiled graph	Is the autograd entry point correct?

import torch
from torch.fx.experimental.proxy_tensor import make_fx

# Test make_fx only (no torch.compile)
traced = make_fx(fn)(ext_coord, ext_atype, nlist, mapping, fparam, aparam)
# Use traced directly instead of torch.compile(traced)

# Test different backends
for backend in ["eager", "aot_eager", "inductor"]:
    compiled = torch.compile(traced, backend=backend, dynamic=False)
    # ... run gradient check
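
The training-mode row of the table concerns `create_graph`. The effect can be reproduced in isolation: a force is the negative derivative of the energy with respect to the coordinates, and a force loss can only reach the parameters if that derivative was computed with `create_graph=True`. The sketch below uses a toy `torch.nn.Sequential` model and a hypothetical helper `force_loss_grad_status`; neither is the model or code discussed here.

```python
import torch

torch.manual_seed(0)

# Toy energy model (assumption for illustration only).
model = torch.nn.Sequential(
    torch.nn.Linear(3, 8),
    torch.nn.Tanh(),
    torch.nn.Linear(8, 1),
)


def force_loss_grad_status(create_graph):
    """Return {param_name: bool} after backpropagating a force-only loss."""
    model.zero_grad(set_to_none=True)
    coord = torch.randn(5, 3, requires_grad=True)
    energy = model(coord).sum()
    # force = -dE/dcoord; create_graph decides whether this derivative
    # stays differentiable with respect to the parameters.
    (grad_coord,) = torch.autograd.grad(energy, coord, create_graph=create_graph)
    force = -grad_coord
    if not force.requires_grad:
        # The force is cut off from the parameters, so a force loss
        # cannot produce any parameter gradient.
        return {name: False for name, _ in model.named_parameters()}
    loss = force.pow(2).mean()
    loss.backward()
    return {
        name: p.grad is not None and bool(p.grad.abs().sum() > 0)
        for name, p in model.named_parameters()
    }


with_graph = force_loss_grad_status(create_graph=True)
without_graph = force_loss_grad_status(create_graph=False)
print("create_graph=True :", with_graph)
print("create_graph=False:", without_graph)
```

In this toy model the output bias (`2.bias`) gets no gradient even with `create_graph=True`, because a constant energy offset does not change the force. A few parameters without gradient under a force-only loss can therefore be legitimate; the compiled and uncompiled counts disagreeing is the stronger signal.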

# Run compiled consistency test
python -m pytest source/tests/pt_expt/test_training.py::TestCompiledConsistency -v
# Run loss consistency test
python -m pytest source/tests/consistent/loss/test_ener.py -v
# Run full training smoke test
python -m pytest source/tests/pt_expt/test_training.py -v

Debug Gradient Flow

Debugging Gradient Flow in Training

When to use

Method: Per-component gradient isolation

Step 1: Write a gradient probe script

Step 2: Run for each loss component in isolation

Step 3: Compare compiled vs uncompiled

Step 4: Identify affected parameters

Step 5: Bisect the cause

Common root causes

1. `create_graph=False` during tracing

2. `torch.compile` inductor backend kills second-order gradients

3. Ghost force contributions discarded

4. Virial RMSE normalization mismatch

Verification
