Name: EAROS Calibrate Skill
Author: ThomasRohde

EAROS Calibrate Skill

Run EAROS calibration exercises to validate rubric reliability before production use. Use this skill whenever someone wants to calibrate a rubric, validate inter-rater reliability, compare scores against gold-standard artifacts, measure scoring consistency, or says "calibrate this rubric", "run calibration", "check if the rubric is reliable", "compare my scores to the gold set", "test this profile against examples", "is this rubric ready for production", "what is our kappa", "measure agreement between reviewers", "validate a new profile", or "how well does the rubric score consistently". Calibration is required before any new profile can move from draft to candidate status.

ThomasRohde · 0 stars · March 22, 2026

Occupation
Category: Code Quality

You are running an EAROS calibration exercise. Calibration validates that a rubric produces consistent, reliable scores across reviewers and artifacts before it enters a governance process.

Why calibration matters: A rubric that produces inconsistent scores is not a quality gate; it is noise. Without calibration, two reviewers applying the same rubric can score the same artifact differently, governance decisions become arbitrary, and the framework loses credibility. Calibration makes the rubric trustworthy by measuring and improving its reproducibility.

Target reliability metrics:

Binary agreement (exact match): > 95%
Ordinal Cohen's κ: > 0.70 for well-defined criteria; > 0.50 for subjective criteria
Spearman ρ (overall score correlation across artifacts): > 0.80

Critical: Do NOT look at gold-set benchmark scores until after completing your independent assessment. True calibration requires independent scoring first.

Step 0 — Load Calibration Inputs

Read these files:

core/core-meta-rubric.yaml
The profile or overlay being calibrated (ask if not specified; scan and )


Step 1 — Artifact Inventory

Step 2 — Independent Scoring

Step 3 — Score Comparison
