This skill should be used when the user asks for "LLM-as-judge evaluation", "advanced quality assessment", "multi-dimensional scoring", "pairwise comparison", "evaluate with position bias mitigation", "judge this output against criteria", or when high-stakes outputs need a rigorous, multi-pass quality assessment. Extends the base evaluation skill with pairwise comparison, position bias mitigation, self-consistency checks, and calibrated confidence scoring.
High-rigor evaluation framework using LLM-as-judge methodology. Applies systematic bias mitigation, pairwise comparison, and multi-pass scoring to produce reliable, calibrated quality assessments for high-stakes outputs.
Use advanced-evaluation when:
Use evaluation (base skill) for routine quality checks.
Apply the base evaluation rubric from the evaluation skill, but with enhanced evidence requirements:
Evidence Quality Standard:
When evaluating a single output, run TWO passes:
When comparing two outputs (A vs B):
When comparing N outputs:
#### Round-Robin Comparison
| Match | Winner | Margin | Key Differentiator |
|-------|--------|--------|--------------------|
| A vs B | [A/B/tie] | [0-2] | [specific reason] |
| A vs C | [A/C/tie] | [0-2] | [specific reason] |
| B vs C | [B/C/tie] | [0-2] | [specific reason] |
**Ranking**: [1st] > [2nd] > [3rd]
**Consensus**: [High/Medium/Low — based on margin consistency]
After scoring, ask: "If I saw this output cold without knowing the context, would I give the same scores?"
Run a quick adversarial probe:
Use these anchors to calibrate your scores against known examples:
| Score | Anchor |
|---|---|
| 5 | Wikipedia featured article quality; textbook explanation; Paul Graham essay |
| 4 | Good Stack Overflow accepted answer; solid technical blog post |
| 3 | Average README; generic ChatGPT output; first-draft document |
| 2 | Incomplete FAQ; bullet-point notes without synthesis |
| 1 | Wrong, misleading, or incoherent |
In addition to the base dimensions (Accuracy, Completeness, Usefulness, Clarity, Freshness), add:
| Dimension | Weight | Criteria |
|---|---|---|
| Coherence | bonus | Does the output have internal logical consistency? No contradictions? |
| Originality | bonus | Does it add genuine insight beyond summarizing? |
| Calibration | bonus | Are claims appropriately hedged vs stated with false certainty? |
These are bonus dimensions — include when relevant, skip when not applicable.
## Advanced Evaluation Report
**Date**: [today]
**Output evaluated**: [name/description]
**Methodology**: LLM-as-Judge with position bias mitigation
**Passes**: [1 / 2 / N]
---
### Pass 1 Scores (Forward)
| Dimension | Score | Evidence (Level 3) | Improvement |
|-----------|-------|-------------------|-------------|
| Accuracy | [1-5] | "[exact quote]" — [why this is an issue] | [specific fix] |
| Completeness | [1-5] | "[what's missing]" | [what to add] |
| Usefulness | [1-5] | "[does it achieve goal?]" | [how to improve] |
| Clarity | [1-5] | "[structural/language issues]" | [how to clarify] |
| Freshness | [1-5] | "[stale elements]" | [what to update] |
| **Subtotal** | **[X.X/5]** | | |
### Pass 2 Scores (Reverse / Alternative Framing)
[Same table with reverse-order evaluation]
### Position Bias Check
| Dimension | Pass 1 | Pass 2 | Delta | Bias Detected? |
|-----------|--------|--------|-------|----------------|
| Accuracy | | | | Yes/No |
| ... | | | | |
| **Overall bias impact**: [High/Medium/Low/None] |
### Adversarial Probing
**Steelman (case for higher score)**: [strongest argument]
**Devil's advocate (case for lower score)**: [strongest argument]
**Conclusion**: [revised score if warranted, with reasoning]
---
### Final Calibrated Score
| Dimension | Score | Confidence |
|-----------|-------|------------|
| Accuracy | [1-5] | [High/Medium/Low] |
| Completeness | [1-5] | [H/M/L] |
| Usefulness | [1-5] | [H/M/L] |
| Clarity | [1-5] | [H/M/L] |
| Freshness | [1-5] | [H/M/L] |
| **Weighted Total** | **[X.X/5]** | |
| **Grade** | **[A/B/C/D/F]** | |
### Prioritized Improvements
**MUST FIX** (blocks use):
1. [Specific issue with Level 3 evidence] → [exact fix]
**SHOULD FIX** (significant quality gain):
2. [Issue] → [fix]
**NICE TO HAVE**:
3. [Issue] → [fix]
### Recommendation
[Clear, unambiguous recommendation with reasoning]
An advanced evaluation MUST: