Audit a metric or pipeline step with pre-conditions, process, and post-conditions — following the research flywheel protocol
Produce a structured audit document for a metric or pipeline step, following the flywheel protocol established in this project. Covers pre-conditions (data integrity, assumptions), process (algorithmic correctness), and post-conditions (verified output values, safe claims for the paper).
Relationship to /implement: /implement produces audit docs proactively during implementation. /audit is for re-auditing existing code (pre-/implement legacy, or when embeddings/checkpoints change and revalidation is needed).
Templates:

- `report/methodological-preconditions-audit.md` (global)
- `report/sprints/week-15/twonn-audit.md` (per-metric template)

`$ARGUMENTS` should name the metric or pipeline step, e.g.:

- `mle estimator` → audits `compute_intrinsic_dimensionality()` in `analysis/metrics.py`
- `gaussian entropy` → audits `compute_gaussian_entropy()`
- `fisher ratio` → audits the per-family SVD and Fisher discriminant ratio
- `embedding extraction` → audits the full `extract_embeddings.py` pipeline

## Step 1: Read the Code

Read the relevant function(s) in `analysis/metrics.py` (or the named script).
Identify its pre-conditions (data and numerical assumptions), its process steps (algorithmic operations), and its post-conditions (expected output patterns across conditions).
## Step 2: Check Prior Audits

Read `report/methodological-preconditions-audit.md` to check whether this metric was already audited there. If so, note the existing status and focus on what has changed or was deferred.
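Before auditing, it helps to hold a reference implementation of the estimator in mind. For the `mle estimator` example above, a minimal TwoNN-style sketch (Facco et al.'s two-nearest-neighbor ratio method) might look like the following; this is a generic illustration, not the repository's `compute_intrinsic_dimensionality()`:

```python
import numpy as np

def twonn_dimension(X: np.ndarray) -> float:
    """TwoNN intrinsic-dimension estimate: d = N / sum(log(r2 / r1)).

    Generic sketch of the estimator family, NOT the project's
    compute_intrinsic_dimensionality() implementation.
    """
    # Pairwise Euclidean distances (fine for small N; use a KD-tree otherwise).
    diffs = X[:, None, :] - X[None, :, :]
    dist = np.linalg.norm(diffs, axis=-1)
    np.fill_diagonal(dist, np.inf)          # exclude self-distances
    nearest = np.sort(dist, axis=1)[:, :2]  # first and second nearest neighbors
    mu = nearest[:, 1] / nearest[:, 0]      # ratio r2 / r1 per point
    return len(X) / np.sum(np.log(mu))
```

An audit of such a function would check, for example, that duplicate points (r1 = 0) are handled and that the estimate is stable across subsamples.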
## Step 3: Verify Embedding Provenance

Before running anything, confirm the embeddings are from best checkpoints:

- Check `results/embedding_analysis/twonn.json` for `checkpoint_path` metadata.
- If `checkpoint_path=N/A` (known gap, issue #117), cross-reference the W15 revalidation (`report/sprints/week-15/embedding-revalidation.md`) to confirm the `.pt` files
were re-extracted from best checkpoints.

## Step 4: Run the Pipeline

Run the metric pipeline against the current embedding files and capture the output.
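Before invoking the pipeline, the best-checkpoint provenance check above can be automated. A minimal sketch, where the JSON layout with a per-condition `checkpoint_path` field is an assumption based on `results/embedding_analysis/twonn.json` and must be adapted to the actual schema:

```python
import json
from pathlib import Path

def flag_unverified_checkpoints(metric_json: Path) -> list[str]:
    """Return conditions whose embeddings lack best-checkpoint provenance.

    Assumes a layout like {"baseline": {"checkpoint_path": "..."}, ...};
    adapt to the actual schema of results/embedding_analysis/twonn.json.
    """
    data = json.loads(metric_json.read_text())
    return [
        cond for cond, entry in data.items()
        if entry.get("checkpoint_path") in (None, "N/A")
    ]
```

Any condition flagged here must be cross-referenced against the W15 revalidation before the audit can mark the provenance pre-condition ✅.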
Use or adapt the pattern from `scripts/compute_twonn.py`:

```bash
cd src/kinship-contrastive
PYTHONPATH=. uv run python scripts/compute_<metric>.py
```
If no dedicated script exists, create one in scripts/ before running — never
audit from notebook output alone (notebooks are for study, pipelines are for
reproducible results).
Capture: output values for all 5 conditions (pretrained, baseline, hcl, random_scl, random_hcl), sample counts, and any diagnostic indicators.
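Such a script typically follows the shape sketched below. The stand-in metric, the loader signature, and the output layout are all assumptions; `scripts/compute_twonn.py` is the real reference:

```python
"""Sketch of a scripts/compute_<metric>.py-style pipeline script."""
import json
from pathlib import Path
import numpy as np

CONDITIONS = ["pretrained", "baseline", "hcl", "random_scl", "random_hcl"]

def compute_metric(emb: np.ndarray) -> float:
    # Stand-in computation; replace with the audited metric.
    return float(np.linalg.norm(emb, axis=1).mean())

def run_pipeline(load_embeddings, out_path: Path) -> dict:
    """load_embeddings(cond) -> (array, checkpoint_path); hypothetical signature."""
    results = {}
    for cond in CONDITIONS:
        emb, ckpt = load_embeddings(cond)
        results[cond] = {
            "value": compute_metric(emb),
            "n_samples": int(emb.shape[0]),  # sample count for the audit
            "checkpoint_path": ckpt,         # provenance metadata
        }
    out_path.write_text(json.dumps(results, indent=2))
    return results
```

Capturing `n_samples` and `checkpoint_path` alongside each value lets the Pre-Conditions and Post-Conditions tables cite a single JSON artifact.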
## Step 5: Cross-Check Reported Values

Search for where this metric appears in reports:

```bash
grep -r "<metric_name>" report/
```
Check if reported values match recomputed values. If they differ, document the discrepancy — do NOT edit the original report. If correction is needed, note it as a finding in the audit.
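The reported-vs-recomputed comparison can be made mechanical. A small sketch, where the flat `{condition: value}` shape and the tolerance are assumptions, and the reported values are transcribed by hand from the grep hits:

```python
def find_discrepancies(reported: dict, recomputed: dict, rtol: float = 1e-6) -> dict:
    """Compare hand-transcribed report values against recomputed ones.

    Returns {condition: (reported, recomputed)} for values that disagree
    beyond a relative tolerance; both inputs are flat {condition: value} maps.
    """
    flagged = {}
    for cond, old in reported.items():
        new = recomputed[cond]
        if abs(new - old) > rtol * max(abs(new), abs(old), 1.0):
            flagged[cond] = (old, new)
    return flagged
```

Flagged conditions become findings in the audit, never silent edits to the original report.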
## Step 6: Write the Audit Document

Write to `report/sprints/week-NN/<metric-slug>-audit.md`, following this template:
# <Metric Name> Audit — Pre-Conditions, Process, Post-Conditions
**Date**: YYYY-MM-DD
**Scope**: `<function_name>()` in `analysis/metrics.py` [or named script]
**Reference**: `report/methodological-preconditions-audit.md` (global),
`report/sprints/week-15/twonn-audit.md` (per-metric template)
**Verified values**: `results/embedding_analysis/<metric>.json` (recomputed YYYY-MM-DD)
---
## Context
[1-2 paragraphs: why this metric exists in the pipeline, what question it answers,
why the audit was triggered now.]
---
## Pre-Conditions
| # | Pre-condition | Status | Evidence |
|---|--------------|--------|----------|
| P1 | Embeddings from best checkpoints | ✅/⚠️/❌ | [source] |
| P2 | [Data assumption] | ✅/⚠️/❌ | [source] |
| P3 | [Numerical assumption] | ✅/⚠️/❌ | [source] |
| ... | | | |
Status key: ✅ Verified | ⚠️ Assumed/partial | ❌ Known gap
---
## Process
| # | Step | Status | Note |
|---|------|--------|------|
| E1 | [Algorithm step] | ✅/⚠️/❌ | [correctness note] |
| E2 | [Edge case handling] | ✅/⚠️/❌ | [note] |
| ... | | | |
---
## Post-Conditions (Verified Results — YYYY-MM-DD)
[Results table: all 5 conditions × relevant output values]
| # | Post-condition | Status | Evidence |
|---|---------------|--------|----------|
| Q1 | [Expected pattern across conditions] | ✅/⚠️/❌ | [value or reference] |
| Q2 | [Consistency check] | ✅/⚠️/❌ | [note] |
| ... | | | |
---
## Additional Diagnostics
[Only present if you ran beyond-template analyses (null models, sensitivity,
confound checks). Open with a scope note: which design principle triggered
each diagnostic; whether it overlaps with deferred diagnostics from prior audits.
If nothing beyond-template was run, omit this section entirely.]
---
## Known Limitations
| # | Limitation | Severity | Action |
|---|-----------|----------|--------|
| L1 | [assumption not tested] | Low/Med/High | [defer/fix/accept] |
| ... | | | |
---
## Safe Claim for SIBGRAPI
> "[One-sentence claim the paper can make, with values.]"
**Do not claim**: [what cannot be asserted and why]
---
## Reproduction
```bash
cd src/kinship-contrastive
PYTHONPATH=. uv run python scripts/compute_<metric>.py
# Output: results/embedding_analysis/<metric>.json
```

**Commit**: `<hash>` (YYYY-MM-DD)
---
## Step 7: Commit and Link
```bash
git add report/sprints/week-NN/<metric>-audit.md
git commit -m "docs: add <metric> audit (WNN)"
```
If the audit reveals a discrepancy with a published report, also create a
GitHub issue with the `investigation` label describing the finding.
If the audit resolves or updates an open item in the global audit
(report/methodological-preconditions-audit.md), update its status row.
Mark resolved items ✅ with a reference to the new audit document.
Notes:

- `results/embedding_analysis/` is the source of truth; the audit certifies it.
- If beyond-template diagnostics were run, include the `## Additional Diagnostics` section after Post-Conditions. Open the section with a scope note: which design principle triggered each diagnostic, and whether any overlaps with deferred diagnostic gaps listed in prior audits (note that they are separate if so).