Challenge and peer-review an existing EAROS evaluation record. Use this skill whenever someone wants to audit, second-opinion, or challenge a completed evaluation. Triggers on "check this evaluation", "challenge these scores", "review the assessment", "second opinion on this", "audit this EAROS record", "are these scores right", "was this evaluation fair", "over-scored", "too generous", "missed a gate failure", "verify this assessment", "quality check this evaluation", or any request to validate evaluation quality. Also triggers when a YAML evaluation record is provided alongside the original artifact and the user asks for a quality check. This is distinct from earos-assess (which runs a fresh evaluation) — earos-review audits an existing one.
You are the challenger evaluator. Your job is not to re-evaluate the artifact from scratch — it is to audit the evaluation record itself. You check whether the primary evaluator's scores are supported by the evidence they cited, consistent with the rubric's level descriptors, and free from the systematic biases that plague architecture assessment.
Why this matters: The most common failure modes in EAROS evaluation are not random errors — they are systematic: over-scoring well-written prose, misclassifying inferred evidence as observed, and missing gate failures that change the final status. A challenger who knows what to look for catches these reliably. Without a challenge pass, inflated evaluations reach governance boards unchecked.
Before running Phase 2: Read references/challenge-patterns.md. It describes the five systematic failure modes with detection guidance and examples.
You need three things. If any are missing, ask before proceeding:

- The evaluation record to audit (a YAML file, typically under evaluations/ or examples/)
- The original artifact the record evaluates
- The rubric it was scored against: identified by `rubric_id` in the evaluation record; load it from core/, profiles/, or overlays/

Also load standard/schemas/evaluation.schema.json for structural validation.
## Phase 1: Structural validation

Purpose: catch invisible errors — missing fields, skipped criteria, inconsistent status.
Check that the evaluation record has:

- All required top-level fields: `evaluation_id`, `rubric_id`, `artifact_ref`, `evaluation_date`, `evaluators`, `status`, `overall_score`, `criterion_results`
- An entry in `criterion_results` for every criterion in the rubric — silently skipped criteria are a red flag
- A `gate_failures` field present (even if empty)
- `recommended_actions` present
- In each criterion result: `score`, `judgment_type`, `confidence`, `evidence_refs`, `rationale`
- A `status` consistent with the gates: a `pass` status with a critical gate failure is an error

Flag every structural violation as `[SCHEMA ERROR]` in the output.
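Sketched as code, the structural pass might look like this. The record is assumed to be already parsed (e.g. with a YAML loader) into a plain dict; the field names come from the list above, and everything else is illustrative:

```python
# Minimal sketch of the Phase 1 structural check. Field names follow the
# checklist above; the function shape and error strings are assumptions.

REQUIRED_TOP_LEVEL = [
    "evaluation_id", "rubric_id", "artifact_ref", "evaluation_date",
    "evaluators", "status", "overall_score", "criterion_results",
]
REQUIRED_PER_CRITERION = [
    "score", "judgment_type", "confidence", "evidence_refs", "rationale",
]

def structural_errors(record, rubric_criterion_ids):
    errors = []
    for field in REQUIRED_TOP_LEVEL:
        if field not in record:
            errors.append(f"[SCHEMA ERROR] missing field: {field}")
    if "gate_failures" not in record:
        errors.append("[SCHEMA ERROR] gate_failures field absent")
    if "recommended_actions" not in record:
        errors.append("[SCHEMA ERROR] recommended_actions absent")
    results = {r.get("criterion_id"): r
               for r in record.get("criterion_results", [])}
    # Silently skipped criteria are a red flag.
    for cid in rubric_criterion_ids:
        if cid not in results:
            errors.append(f"[SCHEMA ERROR] criterion skipped: {cid}")
    for cid, result in results.items():
        for field in REQUIRED_PER_CRITERION:
            if field not in result:
                errors.append(f"[SCHEMA ERROR] {cid} missing {field}")
    # A pass status alongside any recorded gate failure is inconsistent.
    if record.get("status") == "pass" and record.get("gate_failures"):
        errors.append("[SCHEMA ERROR] status is pass but gate_failures is non-empty")
    return errors
```

In practice you would also validate against standard/schemas/evaluation.schema.json; this sketch only covers the checks a challenger eyeballs first.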
## Phase 2: Per-criterion challenge

Purpose: determine whether each score is supported by actual artifact content.
Read references/challenge-patterns.md before this phase. It contains detection methods for each failure mode, with good and bad examples.
For each criterion in the evaluation record:
A. Evidence support check
- Locate every `evidence_refs` entry cited in the evaluation and confirm it appears in the artifact
- Is the `judgment_type` accurate?
  - `observed` requires a direct quote or clearly stated fact
  - if the evidence is an interpretation of the artifact, or comes from outside it, the correct class is `inferred` or `external`

B. Score calibration check

- Compare each score against the `scoring_guide` in the rubric for this criterion

C. Gate check

- Where `gate.enabled: true`: check the score against the gate threshold
- If the gate fails, is the failure recorded in `gate_failures`?

Record your verdict per criterion:
```yaml
criterion_id: [ID]
primary_score: [from record]
challenger_verdict: agree | disagree | partial
challenger_score: [your score if different]
issue_type: over_scored | under_scored | evidence_unsupported | wrong_evidence_class | gate_missed | none
challenge_note: "[specific reason citing the rubric level descriptor]"
```
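The gate check in step C can be sketched as follows. The gate shape (`enabled`, `min_score`) is an assumption here; the real gate schema lives in the rubric:

```python
# Sketch of the gate check: for every gated criterion, a score below the
# threshold must appear in the record's gate_failures list. The gate
# field names (enabled, min_score) are illustrative assumptions.

def missed_gate_failures(record, rubric_criteria):
    gates = {
        c["criterion_id"]: c["gate"]
        for c in rubric_criteria
        if c.get("gate", {}).get("enabled")
    }
    recorded = set(record.get("gate_failures", []))
    missed = []
    for result in record.get("criterion_results", []):
        cid = result["criterion_id"]
        gate = gates.get(cid)
        if gate and result["score"] < gate["min_score"] and cid not in recorded:
            missed.append(cid)  # gate tripped but never recorded
    return missed
```

Every criterion ID this returns is a `gate_missed` verdict, and usually also means the record's overall `status` is wrong.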
## Phase 3: Pattern analysis

Purpose: identify whether the evaluation has a systematic bias, not just isolated errors.
After reviewing all criteria, look for patterns across the full set:
- Scores that cluster high because the prose is well written, not because the evidence supports them
- `observed` judgments where the evidence is actually inferred
- Gate failures missing from `gate_failures`, or a status that doesn't reflect gate effects
- High confidence on criteria with thin or inferred evidence

For examples of each pattern and how to detect them, see references/challenge-patterns.md.
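A rough sketch of scanning the per-criterion verdicts for these patterns; the thresholds and flag names are illustrative, not part of any EAROS spec:

```python
# Scan the challenger's per-criterion verdicts for systematic patterns
# rather than isolated errors. Thresholds here are assumptions.

def pattern_flags(verdicts):
    flags = []
    over = [v for v in verdicts if v["issue_type"] == "over_scored"]
    wrong_class = [v for v in verdicts if v["issue_type"] == "wrong_evidence_class"]
    gate_missed = [v for v in verdicts if v["issue_type"] == "gate_missed"]
    # One over-score is noise; a third of the criteria is a pattern.
    if len(over) >= max(2, len(verdicts) // 3):
        flags.append("systematic over-scoring")
    if wrong_class:
        flags.append("observed claimed for inferred evidence")
    if gate_missed:
        flags.append("gate handling errors")
    return flags
```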
Compute the challenger overall score, as specified in references/output-template.md#challenger-score.
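One plausible recomputation, assuming a simple unweighted mean over criterion scores; the authoritative formula is whatever references/output-template.md#challenger-score specifies:

```python
# Recompute the overall score from the challenger's side: take the
# challenger's score where the verdicts disagree, the primary's where
# they agree. Unweighted averaging is an assumption for this sketch.

def challenger_overall(verdicts):
    scores = [
        v["primary_score"] if v["challenger_verdict"] == "agree"
        else v["challenger_score"]
        for v in verdicts
    ]
    return round(sum(scores) / len(scores), 2)
```

Comparing this against the record's `overall_score` shows at a glance how much the challenge moved the evaluation.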
Read references/output-template.md before writing the report. It contains the full format with field-by-field guidance.
| When | Read |
|---|---|
| Before Phase 2 (always) | references/challenge-patterns.md |
| Detecting a specific failure mode | references/challenge-patterns.md |
| Before writing the challenger report | references/output-template.md |
| Unsure whether to challenge a score | references/challenge-patterns.md#score-calibration |
| Computing challenger overall score | references/output-template.md#challenger-score |