Core objective

Decide whether a benchmark claim is directly comparable, partially comparable, or not comparable.

Verification workflow

Identify the exact claim to verify.
Trace the claim back to the highest-quality available source.
Capture the comparison context:
- model and version
- dataset and split
- metric
- evaluation harness / script
- prompting or few-shot setup
- filtering or post-processing
- hardware / runtime context when relevant
- date / release / commit context
Compare candidates only after the contexts are aligned.
Return a verdict:
- comparable
- partially comparable
- not comparable
If the benchmark interpretation itself is a critical claim, hand off to verification-protocol for a fuller verification trail.
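
The workflow above can be sketched in code to make the "align contexts first, then judge" order concrete. This is a minimal illustration, not a prescribed implementation. The names `BenchmarkContext`, `BLOCKING_FIELDS`, `SOFT_FIELDS`, and `verdict` are hypothetical. The rule that a dataset or metric mismatch yields "not comparable", while a harness, prompting, or post-processing mismatch yields "partially comparable", is an assumption made for the example, as are the model names and field values.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BenchmarkContext:
    """Context captured for one benchmark claim (field names are illustrative)."""
    model: str                             # model and version
    dataset: str                           # dataset and split
    metric: str
    harness: Optional[str] = None          # evaluation harness / script
    prompting: Optional[str] = None        # prompting or few-shot setup
    post_processing: Optional[str] = None  # filtering or post-processing
    hardware: Optional[str] = None         # recorded only; not compared in this sketch
    release: Optional[str] = None          # date / release / commit; recorded only


# Assumption: a mismatch here means the two numbers measure different things.
BLOCKING_FIELDS = ("dataset", "metric")
# Assumption: a mismatch or an unknown value here weakens the comparison.
SOFT_FIELDS = ("harness", "prompting", "post_processing")


def verdict(a: BenchmarkContext, b: BenchmarkContext) -> str:
    """Return one of the three verdicts for a pair of captured contexts."""
    if any(getattr(a, f) != getattr(b, f) for f in BLOCKING_FIELDS):
        return "not comparable"
    for f in SOFT_FIELDS:
        left, right = getattr(a, f), getattr(b, f)
        # An uncaptured (None) value counts as unverified context.
        if left is None or right is None or left != right:
            return "partially comparable"
    return "comparable"


# Hypothetical claims: same dataset, metric, and harness, different shot count.
claimed = BenchmarkContext("model-a v2", "bench-x test", "accuracy",
                           harness="harness-1", prompting="5-shot",
                           post_processing="none")
baseline = BenchmarkContext("model-b v1", "bench-x test", "accuracy",
                            harness="harness-1", prompting="0-shot",
                            post_processing="none")
print(verdict(claimed, baseline))
```

Treating an uncaptured field as grounds for "partially comparable" rather than "comparable" keeps the verdict conservative when the source does not document its setup. Which fields block a comparison outright is a judgment call that may differ per benchmark.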

Benchmark claims are only useful when their context is verified.
