Use when implementing or running SBC (simulation-based calibration) validation, calibration diagnostics, coverage analysis, or quality threshold checks for BayesFlow models. Triggers on: SBC, simulation-based calibration, validation, coverage, calibration error, C2ST, condition grid, quality thresholds, diagnostic plots, validate model, check calibration.
Simulation-Based Calibration (SBC) validates neural posterior estimators by checking that credible intervals achieve their nominal coverage. A well-calibrated model's HPD ranks should be uniformly distributed.
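For intuition, the standard value-based SBC rank statistic takes a few lines of NumPy (a sketch only; an HPD variant would rank draws by posterior density rather than by value, and `sbc_ranks` is an illustrative name, not a library function):

```python
import numpy as np

def sbc_ranks(theta_true, theta_post):
    """Rank of each true parameter among its posterior draws.

    theta_true: (n_sims, n_params) ground-truth draws from the prior.
    theta_post: (n_sims, n_post_draws, n_params) posterior draws per simulation.
    Returns integer ranks in [0, n_post_draws]; uniformly distributed
    across simulations if the estimator is well calibrated.
    """
    return np.sum(theta_post < theta_true[:, None, :], axis=1)
```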
Test the model across its full design space, not just training conditions:
```python
from bayesflow_hpo import make_condition_grid

# Factorial grid of experimental conditions
conditions = make_condition_grid(
    n_total=[50, 100, 200, 500],
    effect_size=[0.1, 0.3, 0.5, 0.8],
    allocation_ratio=[0.3, 0.5, 0.7],
)
# Returns a list of condition dicts, one per grid point (4 * 4 * 3 = 48)
```
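If `make_condition_grid` is unavailable, the same factorial expansion is a few lines of `itertools` (a sketch of the idea, not the library's implementation):

```python
from itertools import product

def make_condition_grid(**factors):
    """Cartesian product of factor levels -> list of condition dicts."""
    keys = list(factors)
    return [dict(zip(keys, levels)) for levels in product(*factors.values())]
```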
For strict validation (IRT / RCT pattern):
```python
from bayesflow_rct import create_strict_validation_grid

# 144-condition grid (4 * 3 * 3 * 4) for thorough parametric coverage testing
conditions = create_strict_validation_grid(
    n_values=[50, 100, 200, 500],
    effect_values=[0.1, 0.3, 0.5],
    alloc_values=[0.3, 0.5, 0.7],
    prior_df_values=[3, 10, 30, 100],
)
```
Generate validation data ONCE and reuse across HPO trials:
```python
from bayesflow_hpo import (
    ValidationDataset,
    generate_validation_dataset,
    save_validation_dataset,
    load_validation_dataset,
)

# Generate and cache
val_data = generate_validation_dataset(
    simulator=simulator,
    adapter=adapter,
    conditions=conditions,
    n_sims_per_condition=500,
    n_post_draws=500,
    seed=42,
)

# Save / load for cross-session reuse
save_validation_dataset(val_data, "data/validation/")
val_data = load_validation_dataset("data/validation/")
```
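To make the generate-once pattern concrete, a cache-or-generate guard along these lines keeps repeated HPO trials from regenerating the dataset (the `Path` check is illustrative; the save/load helpers are the ones imported above):

```python
from pathlib import Path

cache_dir = Path("data/validation/")
if cache_dir.exists():
    val_data = load_validation_dataset(str(cache_dir))  # reuse cached dataset
else:
    val_data = generate_validation_dataset(
        simulator=simulator, adapter=adapter, conditions=conditions,
        n_sims_per_condition=500, n_post_draws=500, seed=42,
    )
    save_validation_dataset(val_data, str(cache_dir))   # cache for later sessions
```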
For IRT variable-size data:
```python
from bayesflow_irt import build_multicondition_val_dataset

val_dataset = build_multicondition_val_dataset(
    simulator, adapter,
    num_conditions=50,        # Different (I, P) configs
    samples_per_condition=4,  # Samples per config
)
# Use as: fit_kwargs=dict(validation_data=val_dataset)
```
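The `fit_kwargs` above would typically be forwarded to the approximator's Keras-style `fit` call inside each HPO trial. A sketch, assuming `approximator`, `train_dataset`, and `epochs` come from your trial setup (the exact signature depends on your BayesFlow version):

```python
# Inside an HPO objective: forward the cached validation set to fit()
history = approximator.fit(
    dataset=train_dataset,
    epochs=epochs,
    validation_data=val_dataset,  # multi-condition IRT validation set
)
```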
Quality metrics and target thresholds:

| Metric | What it measures | Good value |
|---|---|---|
| Calibration error | Deviation of rank distribution from uniform | < 0.02 |
| C2ST deviation | Classifier two-sample test (ranks vs. uniform) | < 0.52 |
| Coverage error @ 90% | observed coverage - 0.90 | |
| Coverage error @ 95% | observed coverage - 0.95 | |
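A minimal sketch of how the calibration-error and coverage-error metrics can be computed from the SBC ranks and posterior draws above (NumPy only; function names and the binning choice are illustrative, not the library's definitions):

```python
import numpy as np

def calibration_error(ranks, n_post_draws, n_bins=20):
    """Mean absolute deviation of the binned rank histogram from uniform."""
    hist, _ = np.histogram(ranks, bins=n_bins, range=(0, n_post_draws + 1))
    freq = hist / ranks.size
    return np.mean(np.abs(freq - 1.0 / n_bins))

def coverage_error(theta_true, theta_post, level=0.90):
    """Observed central-interval coverage minus the nominal level."""
    lo, hi = np.quantile(theta_post, [(1 - level) / 2, (1 + level) / 2], axis=1)
    observed = np.mean((theta_true >= lo) & (theta_true <= hi))
    return observed - level
```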
Run the full pipeline with:

```python
from bayesflow_hpo import run_validation_pipeline, ValidationResult
```
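The call below is an assumption based only on the names imported above, not a documented API; the keyword arguments and result attributes are hypothetical, so adjust them to the actual `bayesflow_hpo` interface:

```python
# Hypothetical usage: run SBC across the cached validation dataset
result: ValidationResult = run_validation_pipeline(
    approximator=approximator,  # trained neural posterior estimator
    val_data=val_data,          # cached ValidationDataset from above
)

# Check against the thresholds in the table above (attribute names assumed)
assert result.calibration_error < 0.02
assert result.c2st_deviation < 0.52
```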