PINN Training Debugger

Name: PINN Training Debugger
Author: Albatross679

Systematic PINN training debugger. Four-phase diagnostic process for identifying and fixing physics-informed neural network training failures. Use when (1) loss not decreasing or stuck, (2) physics loss dominating or data loss dominating, (3) gradient pathology between loss components, (4) NaN in training, (5) user says "debug PINN", "loss stuck", "why isn't it learning", (6) validate PINN pipeline before real training, (7) run probe PDEs, (8) user mentions "loss ratio", "NTK eigenvalues", "ReLoBRaLo", "spectral bias", "Fourier features", "collocation", "residual", or any PINN diagnostic metric, (9) pre-flight check before PINN training, (10) physics residual debugging, per-component violation analysis, nondimensionalization quality, (11) convergence rate mismatch, causality violation, collocation starvation, (12) ansatz numerical issues, DD-PINN-specific debugging. Works with DD-PINN (train_pinn.py) and physics-regularized surrogates (train_regularized.py).

Albatross679 · 0 stars · 26 Mar 2026

Categories: Machine Learning

Skill content

Systematic 4-phase diagnostic process for physics-informed neural network training failures. Do not tune hyperparameters until bugs are ruled out. Do not look at total loss alone -- total loss hides whether physics or data loss is decreasing.

Phase 1: Probe PDE Validation

Run BEFORE any real training. Catches implementation bugs in seconds using simple PDEs with known analytical solutions.

Probe PDEs location: src/pinn/probe_pdes.py

Probe	PDE	Tests	Pass Criterion	Failure Means
probe1_heat_1d	1D heat: u_t = alpha * u_xx	Data fitting + optimizer	MSE < 1e-4 in ~500 epochs	Broken data loss or backprop
probe2_advection_1d	1D advection: u_t + c * u_x = 0	BC/IC enforcement + PDE residual	Residual < 1e-3, BC error < 1e-4	Broken BC enforcement or residual computation
probe3_burgers_1d	1D Burgers: u_t + u * u_x = nu * u_xx	Nonlinear PDE + loss balancing

from src.pinn.probe_pdes import run_probe_validation
run_probe_validation()

from src.pinn.probe_pdes import analyze_pde_system
print(analyze_pde_system())

python3 -m pytest tests/test_pinn_probes.py -x -q
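
The idea behind a probe PDE can be illustrated without any training: a residual function is trustworthy only if it evaluates to roughly zero on a known analytical solution. The sketch below is not the content of src/pinn/probe_pdes.py. It assumes the 1D heat problem with u(x, 0) = sin(pi x) on [0, 1], an illustrative alpha of 0.1, and hypothetical helper names, and it uses finite differences in place of autograd.

```python
import math

ALPHA = 0.1  # assumed diffusivity, chosen only for this sketch


def u_exact(x, t, alpha=ALPHA):
    """Analytical solution of u_t = alpha * u_xx with u(x, 0) = sin(pi x)."""
    return math.exp(-alpha * math.pi ** 2 * t) * math.sin(math.pi * x)


def heat_residual(u, x, t, alpha=ALPHA, h=1e-3):
    """Central-difference estimate of u_t - alpha * u_xx at (x, t)."""
    u_t = (u(x, t + h) - u(x, t - h)) / (2 * h)
    u_xx = (u(x + h, t) - 2 * u(x, t) + u(x - h, t)) / h ** 2
    return u_t - alpha * u_xx


# Interior grid of (x, t) points; the residual should vanish on all of them.
points = [(0.1 * i, 0.05 * j) for i in range(1, 10) for j in range(1, 10)]
max_residual = max(abs(heat_residual(u_exact, x, t)) for x, t in points)
print(f"max |residual| on exact solution: {max_residual:.2e}")
```

If the same check is run with a mismatched coefficient (for example alpha=0.2 in the residual only), the residual becomes order one, which is the kind of signal a probe is meant to surface.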

Phase 3: Loss Not Decreasing -- Decision Tree

loss_phys >> loss_data (ratio > 100)?
  -> Physics loss dominating. ReLoBRaLo not working or warmup too short.
     Check: Is ReLoBRaLo enabled? Is curriculum_warmup > 0?
     Fix: Increase curriculum_warmup from 0.15 to 0.3, verify ReLoBRaLo alpha,
     try fixed lambda_phys=0.01 as baseline test.

loss_data >> loss_phys (ratio > 100)?
  -> Data overfitting, physics being ignored.
     Check: Is lambda_phys > 0? Is n_collocation sufficient (>= 5000)?
     Fix: Increase lambda_phys, increase n_collocation, check collocation
     point coverage spans the full domain.
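
The two ratio checks above can be folded into one helper that is called on logged loss values. This is a minimal sketch: the function name, the eps guard, and the returned labels are assumptions for illustration, and only the threshold of 100 comes from the text.

```python
def classify_loss_balance(loss_phys, loss_data, threshold=100.0, eps=1e-12):
    """Label the balance between physics and data loss.

    eps guards against division by zero when one loss is exactly 0.
    """
    ratio = (loss_phys + eps) / (loss_data + eps)
    if ratio > threshold:
        return "physics_dominating"
    if ratio < 1.0 / threshold:
        return "data_dominating"
    return "balanced"


print(classify_loss_balance(5.0, 0.01))   # ratio 500
print(classify_loss_balance(1e-4, 0.5))   # ratio 2e-4
print(classify_loss_balance(0.2, 0.1))    # ratio 2
```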

grad_norm_phys >> grad_norm_data (ratio > 100)?
  -> Gradient pathology: physics gradients dominating parameter updates.
     This is the NTK eigenvalue mismatch (Wang, Yu, and Perdikaris, J. Comput. Phys. 2022).
     Fix: Use per-loss gradient clipping. Clip physics gradient norm to
     match data gradient norm magnitude.
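
One way to read "clip physics gradient norm to match data gradient norm" is sketched below on plain lists of floats. In a real training loop the two gradients would come from separate backward passes; the helper names are hypothetical and this is not necessarily how the training scripts implement it.

```python
import math


def l2_norm(grad):
    """Euclidean norm of a flat gradient vector."""
    return math.sqrt(sum(g * g for g in grad))


def clip_to_reference_norm(grad_phys, grad_data, eps=1e-12):
    """Rescale grad_phys so its norm does not exceed the norm of grad_data.

    The direction of grad_phys is preserved; only its length changes.
    """
    n_phys, n_data = l2_norm(grad_phys), l2_norm(grad_data)
    if n_phys <= n_data:
        return list(grad_phys)
    scale = n_data / (n_phys + eps)
    return [g * scale for g in grad_phys]


grad_phys = [300.0, 400.0]   # norm 500
grad_data = [0.6, 0.8]       # norm 1
clipped = clip_to_reference_norm(grad_phys, grad_data)
print(l2_norm(clipped))
```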

grad_norm_phys ~= 0?
  -> Physics loss not producing gradients.
     Check: CosseratRHS computation (is output differentiable?).
     Check: Collocation points have requires_grad=True.
     Check: No accidental .detach() or .item() in residual computation.
     Check: Physics residual is not returning constant zero.

residual_max >> residual_mean (ratio > 100)?
  -> Residual concentrated in specific regions. RAR should help.
     Check: Is use_rar=True?
     Fix: If RAR already enabled, increase rar_fraction from 0.1 to 0.2,
     decrease rar_interval from 20 to 10.
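
The core step of residual-based adaptive refinement (RAR) is to evaluate the residual on a pool of candidate points and keep the worst ones as new collocation points. The sketch below uses a toy residual peaked near x = 0.8. How rar_fraction maps to the number of added points is an assumption here (treated as a fraction of the candidate pool), and the function names are hypothetical.

```python
import math


def rar_select(candidates, residual_fn, n_add):
    """Return the n_add candidate points with the largest |residual|."""
    ranked = sorted(candidates, key=lambda p: abs(residual_fn(p)), reverse=True)
    return ranked[:n_add]


def toy_residual(x):
    """Stand-in for a PDE residual that is concentrated near x = 0.8."""
    return math.exp(-((x - 0.8) / 0.05) ** 2)


candidates = [i / 100 for i in range(101)]
rar_fraction = 0.1  # assumed meaning: fraction of the candidate pool to add
n_add = int(rar_fraction * len(candidates))
new_points = rar_select(candidates, toy_residual, n_add)
print(sorted(new_points))
```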

val_loss decreasing but loss_phys flat?
  -> Data fitting working but physics not being enforced.
     Check: Is curriculum warmup still active? Compare current epoch
     vs warmup_end = n_epochs * curriculum_warmup.
     Fix: If warmup ended and loss_phys still flat, physics residual
     computation may be broken. Run probe PDEs to isolate.

loss_data has sudden jumps?
  -> Learning rate too high, or ReLoBRaLo weight oscillation.
     Check: diagnostics/relobralo_ratio -- is it oscillating?
     Fix: Reduce lr. Increase ReLoBRaLo alpha from 0.999 to 0.9999
     (more exponential smoothing).
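
To see why a larger alpha damps weight oscillation, it helps to look at the exponential moving average on its own. The sketch below is only the smoothing step, not the full ReLoBRaLo update, and the oscillating input is synthetic.

```python
def ema(values, alpha, init):
    """Exponential moving average: state = alpha * state + (1 - alpha) * value."""
    smoothed, state = [], init
    for v in values:
        state = alpha * state + (1 - alpha) * v
        smoothed.append(state)
    return smoothed


def swing(xs):
    """Peak-to-peak variation of a sequence."""
    return max(xs) - min(xs)


# Synthetic loss weight that flips between 1.5 and 0.5 every step.
raw = [1.0 + (0.5 if i % 2 == 0 else -0.5) for i in range(2000)]

swing_999 = swing(ema(raw, 0.999, init=1.0)[1000:])
swing_9999 = swing(ema(raw, 0.9999, init=1.0)[1000:])
print(swing_999, swing_9999)
```

The second value is roughly an order of magnitude smaller, at the cost of the weights reacting more slowly to genuine changes in the loss balance.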

All metrics healthy but accuracy poor?
  -> Spectral bias. Network cannot represent high-frequency solution components.
     Check: Fourier feature sigma and n_fourier.
     Fix: If n_fourier=0, enable Fourier features. If already enabled,
     increase fourier_sigma (10 -> 30), increase n_fourier (128 -> 256).
     Follow Phase 4 sub-tree for detailed physics accuracy analysis.
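
For reference, a random Fourier feature embedding maps an input x to cosines and sines of random projections, with the projection matrix drawn from a Gaussian of standard deviation sigma. The sketch below assumes that n_fourier counts frequencies (so the output has 2 * n_fourier entries) and that fourier_sigma is that standard deviation; the actual conventions in the codebase may differ.

```python
import math
import random


def make_fourier_matrix(n_fourier, in_dim, sigma, seed=0):
    """Draw an n_fourier x in_dim matrix with entries from N(0, sigma^2)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, sigma) for _ in range(in_dim)] for _ in range(n_fourier)]


def fourier_features(x, B):
    """Map input vector x to [cos(2*pi*Bx), sin(2*pi*Bx)]."""
    proj = [2 * math.pi * sum(b * xi for b, xi in zip(row, x)) for row in B]
    return [math.cos(p) for p in proj] + [math.sin(p) for p in proj]


B = make_fourier_matrix(n_fourier=128, in_dim=2, sigma=10.0)
feats = fourier_features([0.3, 0.7], B)
print(len(feats))
```

A larger sigma spreads the sampled frequencies wider, which is why raising it helps with missing high-frequency content and lowering it helps when the fit is noisy.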

Everything NaN?
  -> Division by zero in physics residual (dl=0 in Cosserat bending,
     norm=0 in tangent computation), log(0) in loss computation,
     or inf in state normalization.
     Check: NondimScales -- are reference scales nonzero?
     Check: Normalizer fit data -- was normalizer fitted on valid data?
     Check: Ansatz parameters -- is delta constrained positive (via softplus)?
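
Two of the guards mentioned above can be sketched in a few lines: a numerically stable softplus for keeping delta positive, and a norm floor for tangent normalization. The floor values and helper names are illustrative assumptions.

```python
import math


def softplus(x):
    """Stable log(1 + exp(x)); never overflows, result is >= 0."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def positive_delta(raw, floor=1e-6):
    """Map an unconstrained parameter to a strictly positive decay rate.

    The floor matters because softplus underflows to exactly 0 for very
    negative inputs.
    """
    return softplus(raw) + floor


def safe_normalize(v, eps=1e-8):
    """Normalize v, flooring the norm so a zero vector does not produce NaN."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / max(norm, eps) for c in v]


print(positive_delta(-1000.0), softplus(1000.0), safe_normalize([0.0, 0.0, 0.0]))
```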

Phase 4: Physics Accuracy Poor -- Sub-Tree

Tier 1: PDE Residual Analysis (cheapest to check)

1.1 Per-equation residual decomposition.
    Log residuals separately for kinematic, bending, friction terms.
    One component > 90% of total residual?
      -> That physics term is hardest to learn. Weight it higher.
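
A possible shape for the 90% check is sketched below. It assumes per-term residuals are available as lists of values keyed by term name, and it measures each term's share of the summed mean-square residual; both the data layout and the function names are hypothetical.

```python
def residual_shares(term_residuals):
    """Share of the total mean-square residual contributed by each term."""
    mean_sq = {
        name: sum(r * r for r in vals) / len(vals)
        for name, vals in term_residuals.items()
    }
    total = sum(mean_sq.values())
    return {name: (v / total if total > 0 else 0.0) for name, v in mean_sq.items()}


def dominant_term(shares, threshold=0.9):
    """Name of the term above the threshold share, or None if no term dominates."""
    name, share = max(shares.items(), key=lambda kv: kv[1])
    return name if share > threshold else None


residuals = {
    "kinematic": [0.01, -0.02],
    "bending": [1.0, -1.2],
    "friction": [0.05, 0.03],
}
shares = residual_shares(residuals)
print(dominant_term(shares), shares)
```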

1.2 Residual spatial distribution.
    Plot |residual| vs collocation point position/time.
    Concentrated at boundaries?
      -> BC enforcement weak. Increase boundary collocation points.
    Concentrated at t=0?
      -> IC satisfaction failing. Check ansatz g(a,0)=0 property.
    Concentrated at late times?
      -> Causality violation. Increase curriculum_warmup or add causal weighting.
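
One common formulation of causal weighting splits the time domain into chunks and down-weights each chunk by the accumulated residual loss of all earlier chunks, so late times only start to matter once early times are resolved. The sketch below assumes that formulation; epsilon is an illustrative hyperparameter and not a setting taken from this project.

```python
import math


def causal_weights(chunk_losses, epsilon=1.0):
    """w_i = exp(-epsilon * sum of residual losses in earlier time chunks)."""
    weights, cumulative = [], 0.0
    for loss in chunk_losses:
        weights.append(math.exp(-epsilon * cumulative))
        cumulative += loss
    return weights


# Residual loss per time chunk, ordered from early to late.
weights = causal_weights([0.0, 2.0, 0.5], epsilon=1.0)
print(weights)
```

The first chunk always gets weight 1, and the third chunk here is weighted by exp(-2) because the second chunk is still poorly resolved.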

1.3 Collocation point coverage.
    Plot collocation points vs residual magnitude.
    High-residual regions under-sampled?
      -> RAR should fix this. If already enabled, increase rar_fraction.

1.4 Per-component error analysis.
    RMSE per state component (pos_x, vel_x, yaw, omega_z).
    Which component has highest error in physical units?
      -> Focus debugging on that component's PDE terms.
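
Per-component RMSE is simple to compute once predictions and targets are back in physical units. The sketch below uses plain lists of state vectors and the four component names from the text; the data values are made up for illustration.

```python
import math


def per_component_rmse(pred, target, names):
    """RMSE for each state component; pred and target are lists of state vectors."""
    n = len(pred)
    return {
        name: math.sqrt(sum((p[i] - t[i]) ** 2 for p, t in zip(pred, target)) / n)
        for i, name in enumerate(names)
    }


names = ["pos_x", "vel_x", "yaw", "omega_z"]
pred = [[1.0, 0.0, 0.1, 0.0], [2.0, 1.0, 0.1, 0.5]]
target = [[1.0, 0.0, 0.0, 0.0], [2.0, 0.0, 0.0, 0.5]]

rmse = per_component_rmse(pred, target, names)
worst = max(rmse, key=rmse.get)
print(worst, rmse)
```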

Tier 2: Network Architecture (check second)

2.1 Fourier feature analysis.
    Compute FFT of predicted solution along spatial dimension.
    Missing high-frequency modes?
      -> Increase fourier_sigma or n_fourier.
    High-frequency noise but low-frequency wrong?
      -> Decrease fourier_sigma (spectral bias overcorrected).
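
The spectrum comparison can be sketched with a naive discrete Fourier transform, which keeps the example dependency-free; in practice an FFT routine from a numerical library would replace it. The signals are synthetic: a reference with modes 1 and 8, and a "prediction" that lost mode 8. The thresholds in the comparison are illustrative.

```python
import cmath
import math


def amplitude_spectrum(samples):
    """Naive O(n^2) DFT amplitudes for k = 0..n//2.

    The 2/n scaling gives sinusoid amplitudes for 0 < k < n/2; the k = 0 and
    k = n/2 bins are overestimated by a factor of 2, which does not matter here.
    """
    n = len(samples)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(
            s * cmath.exp(-2j * math.pi * k * i / n) for i, s in enumerate(samples)
        )
        spectrum.append(2 * abs(coeff) / n)
    return spectrum


n = 64
xs = [i / n for i in range(n)]
reference = [math.sin(2 * math.pi * x) + 0.5 * math.sin(2 * math.pi * 8 * x) for x in xs]
predicted = [math.sin(2 * math.pi * x) for x in xs]

ref_spec = amplitude_spectrum(reference)
pred_spec = amplitude_spectrum(predicted)
missing = [
    k for k in range(len(ref_spec))
    if ref_spec[k] > 0.1 and pred_spec[k] < 0.1 * ref_spec[k]
]
print("missing modes:", missing)
```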

2.2 Ansatz verification.
    Evaluate ansatz at t=0 for 1000 random inputs.
    Any |g(a,0)| > 1e-6?
      -> Ansatz IC satisfaction broken. Check DampedSinusoidalAnsatz.
    Evaluate dg/dt at t=0. Reasonable magnitudes?
      -> If too large, ansatz basis functions too aggressive.
         Clamp delta > 0 via softplus, clamp beta to [-100, 100].
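
The t = 0 check can be prototyped on a stand-in ansatz. The functional form below, a sum of alpha * (exp(-delta t) * sin(beta t + gamma) - sin(gamma)), is an assumption about what a damped sinusoidal ansatz looks like and may not match DampedSinusoidalAnsatz in the codebase; the parameter ranges are likewise illustrative.

```python
import math
import random


def damped_sinusoid_ansatz(params, t):
    """Assumed form: sum of alpha * (exp(-delta t) sin(beta t + gamma) - sin(gamma))."""
    return sum(
        alpha * (math.exp(-delta * t) * math.sin(beta * t + gamma) - math.sin(gamma))
        for alpha, delta, beta, gamma in params
    )


def ansatz_dt_at_zero(params):
    """Analytic dg/dt at t = 0 for the assumed form above."""
    return sum(
        alpha * (beta * math.cos(gamma) - delta * math.sin(gamma))
        for alpha, delta, beta, gamma in params
    )


rng = random.Random(0)


def random_params(n_terms=5):
    """Random (alpha, delta, beta, gamma) tuples with delta > 0 and |beta| <= 100."""
    return [
        (rng.uniform(-1, 1), rng.uniform(0.01, 5), rng.uniform(-100, 100),
         rng.uniform(-math.pi, math.pi))
        for _ in range(n_terms)
    ]


worst_ic = max(abs(damped_sinusoid_ansatz(random_params(), 0.0)) for _ in range(1000))
worst_slope = max(abs(ansatz_dt_at_zero(random_params())) for _ in range(1000))
print(f"max |g(a,0)| = {worst_ic:.1e}, max |dg/dt(0)| = {worst_slope:.1f}")
```

With this form the initial condition is satisfied by construction, so any nonzero worst_ic in a real implementation points to a bug, while worst_slope shows how strongly the bound on beta controls the initial rate of change.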

2.3 Network capacity.
    Train on 10% of data. Achieves low training loss?
      Yes -> Network has capacity, training data or physics is the issue.
      No -> Network too small. Increase hidden_dim or n_layers.

Tier 3: Physics Model Fidelity (hardest to fix)

3.1 CosseratRHS verification.
    Run CosseratRHS on known analytical states.
    Compare output to PyElastica reference.
    Discrepancy > 10%?
      -> CosseratRHS parameters wrong. Check elastic modulus, friction coefficients.

3.2 Nondimensionalization check.
    Run: analyze_pde_system() from src/pinn/probe_pdes.py.
    Are all terms in physics residual O(1)?
    Any term >> 1 or << 1?
      -> NondimScales needs adjustment. Adjust L_ref, t_ref, F_ref until
         all per_term_magnitudes are within [0.1, 10].
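
The [0.1, 10] criterion is easy to automate once per-term magnitudes are available. The sketch below assumes they arrive as a dict of term name to magnitude; the actual return format of analyze_pde_system() is not shown in this document, and the term names and numbers are made up.

```python
def out_of_range_terms(per_term_magnitudes, low=0.1, high=10.0):
    """Return the terms whose magnitude falls outside [low, high]."""
    return {
        name: mag
        for name, mag in per_term_magnitudes.items()
        if not (low <= mag <= high)
    }


# Hypothetical magnitudes after nondimensionalization.
magnitudes = {"inertia": 1.2, "bending": 350.0, "friction": 0.02}
flagged = out_of_range_terms(magnitudes)
print(flagged)
```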

3.3 Stiff PDE detection.
    Compute condition number of Jacobian df/dx.
    Condition number > 1e6?
      -> Problem is stiff. Consider implicit time integration or
         sequence-to-sequence training with shorter time windows.
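
As a toy illustration of the stiffness check, the condition number of a 2x2 Jacobian can be computed in closed form from its singular values. Real Jacobians are larger and would need a linear algebra library; the restriction to 2x2 and the example matrix are assumptions made to keep the sketch dependency-free.

```python
import math


def condition_number_2x2(jac):
    """Ratio of largest to smallest singular value of a 2x2 matrix."""
    (a, b), (c, d) = jac
    s = a * a + b * b + c * c + d * d           # squared Frobenius norm
    det = a * d - b * c
    disc = math.sqrt(max(s * s - 4 * det * det, 0.0))
    sigma_max = math.sqrt((s + disc) / 2)
    if sigma_max == 0.0 or det == 0.0:
        return math.inf                         # singular matrix
    sigma_min = abs(det) / sigma_max            # avoids cancellation in s - disc
    return sigma_max / sigma_min


# Two decay rates that differ by seven orders of magnitude.
jac = [[-1.0, 0.0], [0.0, -1e7]]
cond = condition_number_2x2(jac)
print(cond, "stiff" if cond > 1e6 else "not stiff")
```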

Quick Symptom Lookup

Symptom	Start At
Loss flat from epoch 0	Phase 1 (probe PDEs) + Phase 3 (grad_norm_phys ~= 0?)
Physics loss drops, data loss stuck	Phase 3 (loss_phys >> loss_data)
Data loss drops, physics loss stuck	Phase 3 (loss_data >> loss_phys)
Loss oscillates wildly	Phase 3 (loss_data has sudden jumps?)
High loss but smooth convergence	Phase 4 (spectral bias) -- Tier 2 Fourier analysis
NaN after N epochs	Phase 3 (Everything NaN?)
Good loss but bad predictions	Phase 4 Tier 1 (per-component residual analysis)
Validation loss increasing	Overfitting -- reduce network capacity or increase regularization

Outline of the full skill

Phase 1: Probe PDE Validation

Phase 2: Dashboard Diagnostic Metrics

Priority 1: Loss component balance

Priority 2: Gradient health

Priority 3: Residual distribution

Priority 4: ReLoBRaLo weight health

Priority 5: NTK eigenvalue spectrum

Priority 6: Per-component violations

Phase 3: Loss Not Decreasing -- Decision Tree

Phase 4: Physics Accuracy Poor -- Sub-Tree

Tier 1: PDE Residual Analysis (cheapest to check)

Tier 2: Network Architecture (check second)

Tier 3: Physics Model Fidelity (hardest to fix)

Quick Symptom Lookup

Key Principle

Failure Mode Reference
