Run EAROS calibration exercises to validate rubric reliability before production use. Use this skill whenever someone wants to calibrate a rubric, validate inter-rater reliability, compare scores against gold-standard artifacts, measure scoring consistency, or says "calibrate this rubric", "run calibration", "check if the rubric is reliable", "compare my scores to the gold set", "test this profile against examples", "is this rubric ready for production", "what is our kappa", "measure agreement between reviewers", "validate a new profile", or "how well does the rubric score consistently". Calibration is required before any new profile can move from draft to candidate status.
You are running an EAROS calibration exercise. Calibration validates that a rubric produces consistent, reliable scores across reviewers and artifacts before it enters a governance process.
Why calibration matters: A rubric that produces inconsistent scores is not a quality gate — it is noise. Without calibration, two reviewers applying the same rubric will score the same artifact differently, governance decisions will be arbitrary, and the framework loses credibility. Calibration makes the rubric trustworthy by measuring and improving its reproducibility.
Target reliability metrics:
Critical: Do NOT look at gold-set benchmark scores until after completing your independent assessment. True calibration requires independent scoring first.
Read these files:
core/core-meta-rubric.yaml
profiles/
overlays/
calibration/gold-set/: scan for existing reference artifacts and their benchmark scores
calibration/results/: scan for prior calibration runs (to understand trends)
Ask the user:
List available calibration artifacts. For each:
If no gold-set artifacts exist, stop and tell the user:
"Calibration requires at least 3 artifacts: 1 strong (should score ≥3.2), 1 weak (should score <2.4), and 1 ambiguous (borderline case). The spread across quality levels is important — calibration against only strong artifacts doesn't test whether the rubric correctly identifies weaknesses. Please provide these artifacts or their paths."
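The coverage requirement in that message can be checked mechanically. The following is a minimal sketch, assuming each gold-set artifact has a single overall benchmark score and that "ambiguous" means any score between the two quoted thresholds. The function names, artifact IDs, and scores are hypothetical and not part of EAROS.

```python
from typing import Dict, List

STRONG_MIN = 3.2  # "strong" threshold quoted in the message above
WEAK_MAX = 2.4    # "weak" threshold quoted in the message above


def classify(overall_score: float) -> str:
    """Bucket a benchmark overall score into strong / weak / ambiguous."""
    if overall_score >= STRONG_MIN:
        return "strong"
    if overall_score < WEAK_MAX:
        return "weak"
    # Assumption: anything between the two thresholds counts as ambiguous.
    return "ambiguous"


def missing_levels(gold_scores: Dict[str, float]) -> List[str]:
    """Return the quality levels that no gold-set artifact covers."""
    covered = {classify(score) for score in gold_scores.values()}
    return [lvl for lvl in ("strong", "weak", "ambiguous") if lvl not in covered]


# Hypothetical artifact IDs and scores, for illustration only.
example = {"ART-001": 3.6, "ART-002": 2.1}
print(missing_levels(example))  # ['ambiguous']
```

A non-empty result means the gold set lacks the spread the message asks for, so calibration should stop until the missing level is supplied.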
For each calibration artifact, run a full EAROS assessment using the earos-assess skill protocol:
This step cannot be skipped or abbreviated. Independent scoring is the entire point of calibration. If you score after seeing the benchmark, you measure nothing.
For the full assessment protocol, see .agents/skills/earos-assess/SKILL.md.
After completing independent scoring for all artifacts, compare against the gold-set:
artifact_id: [ID]
criterion_id: [ID]
gold_score: [benchmark]
agent_score: [your score]
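
Once every criterion has a comparison record like the one above, agreement can be summarized in a single number. The sketch below computes unweighted Cohen's kappa over such records using only the standard library. It is an illustration, not the EAROS method: the choice of unweighted kappa (rather than a weighted variant), the function name, and the sample records are all assumptions.

```python
from collections import Counter
from typing import Dict, List


def cohen_kappa(records: List[Dict]) -> float:
    """Unweighted Cohen's kappa between gold_score and agent_score."""
    n = len(records)
    if n == 0:
        raise ValueError("no comparison records")
    # Observed agreement: share of records where both scores match exactly.
    p_o = sum(r["gold_score"] == r["agent_score"] for r in records) / n
    gold = Counter(r["gold_score"] for r in records)
    agent = Counter(r["agent_score"] for r in records)
    # Expected chance agreement from the two marginal distributions.
    p_e = sum(gold[s] * agent[s] for s in gold) / (n * n)
    if p_e == 1.0:
        # Kappa is undefined here; returning 1.0 is a convention (assumption).
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical records, for illustration only.
records = [
    {"artifact_id": "ART-001", "criterion_id": "C1", "gold_score": 3, "agent_score": 3},
    {"artifact_id": "ART-001", "criterion_id": "C2", "gold_score": 2, "agent_score": 2},
    {"artifact_id": "ART-002", "criterion_id": "C1", "gold_score": 4, "agent_score": 3},
    {"artifact_id": "ART-002", "criterion_id": "C2", "gold_score": 1, "agent_score": 1},
]
kappa = cohen_kappa(records)
print(round(kappa, 3))  # 0.667
```

Unweighted kappa treats a one-point miss the same as a three-point miss. If the rubric's scores are ordinal, a weighted variant may be the better fit.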