Advanced Evaluation (LLM-as-Judge)

Name: Advanced Evaluation (LLM-as-Judge)
Author: nsalvacao

This skill should be used when the user asks for "LLM-as-judge evaluation", "advanced quality assessment", "multi-dimensional scoring", "pairwise comparison", "evaluate with position bias mitigation", "judge this output against criteria", or when high-stakes outputs need a rigorous, multi-pass quality assessment. Extends the base evaluation skill with pairwise comparison, position bias mitigation, self-consistency checks, and calibrated confidence scoring.

nsalvacao · 0 stars · 2026/03/11

Occupation
Category: Machine Learning

Skill Content

High-rigor evaluation framework using LLM-as-judge methodology. Applies systematic bias mitigation, pairwise comparison, and multi-pass scoring to produce reliable, calibrated quality assessments for high-stakes outputs.

When to Use This Skill vs Basic Evaluation

Use advanced-evaluation when:

- Comparing two or more competing outputs (A vs B)
- The output will be used in high-stakes contexts (public launch, investor materials, policy)
- You suspect the evaluation might be influenced by order, framing, or recency bias
- You need confidence intervals, not just point scores
- You want to stress-test an output with adversarial probing

Use evaluation (base skill) for routine quality checks.
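
To make the order-bias concern concrete, here is a minimal sketch of one common mitigation: judge the same pair twice with the positions swapped and accept a winner only when both passes agree. The skill's own procedure is not shown in this excerpt, so the `Judge` callable, the `"first"`/`"second"`/`"tie"` labels, and the fall-back-to-tie rule are all assumptions made for illustration.

```python
from typing import Callable, Tuple

# Hypothetical judge interface (an assumption, not part of the skill):
# it receives the two candidate texts in presentation order and
# returns "first", "second", or "tie".
Judge = Callable[[str, str], str]


def compare_with_swap(judge: Judge, output_a: str, output_b: str) -> Tuple[str, bool]:
    """Judge A vs B in both orders; return (winner, position_bias_detected)."""
    forward = judge(output_a, output_b)   # A is shown first
    reverse = judge(output_b, output_a)   # B is shown first

    # Map positional verdicts back to the candidate labels.
    forward_pick = {"first": "A", "second": "B", "tie": "tie"}[forward]
    reverse_pick = {"first": "B", "second": "A", "tie": "tie"}[reverse]

    if forward_pick == reverse_pick:
        return forward_pick, False
    # Assumed policy: disagreement between orders is treated as a tie
    # and flagged as possible position bias.
    return "tie", True


# Stub judges standing in for a real LLM call.
def longer_wins(first: str, second: str) -> str:
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"


def always_first(first: str, second: str) -> str:
    return "first"


print(compare_with_swap(longer_wins, "short", "a much longer answer"))
print(compare_with_swap(always_first, "short", "a much longer answer"))
```

The second stub always prefers whatever it sees first, so the two passes disagree and the pair is flagged instead of producing a spurious winner.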

Core Methodology

1. Direct Scoring with Rubrics

Apply the base evaluation rubric from the evaluation skill, but with enhanced evidence requirements:

Evidence Quality Standard:

#### Round-Robin Comparison

| Match | Winner | Margin | Key Differentiator |
|-------|--------|--------|--------------------|
| A vs B | [A/B/tie] | [0-2] | [specific reason] |
| A vs C | [A/C/tie] | [0-2] | [specific reason] |
| B vs C | [B/C/tie] | [0-2] | [specific reason] |

**Ranking**: [1st] > [2nd] > [3rd]
**Consensus**: [High/Medium/Low — based on margin consistency]
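
The table above can be turned into a ranking mechanically. The sketch below is one possible aggregation, not the skill's prescribed rule: it assumes each match adds its margin to the winner and subtracts it from the loser, and it maps the spread between the largest and smallest margin to a consensus label. Both the scoring scheme and the spread thresholds are illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (left, right, winner, margin); winner is left, right, or "tie"; margin is 0-2.
Match = Tuple[str, str, str, int]


def rank_round_robin(matches: List[Match]) -> Tuple[List[str], str]:
    """Return (ranking, consensus) from a list of pairwise results."""
    points: Dict[str, int] = defaultdict(int)
    for left, right, winner, margin in matches:
        points[left] += 0    # make sure every candidate appears in the ranking
        points[right] += 0
        if winner == "tie":
            continue
        if winner not in (left, right):
            raise ValueError(f"winner {winner!r} is not in match {left} vs {right}")
        loser = right if winner == left else left
        # Assumed scoring: net margin (win adds, loss subtracts).
        points[winner] += margin
        points[loser] -= margin

    # Highest net margin first; ties broken alphabetically for determinism.
    ranking = sorted(points, key=lambda name: (-points[name], name))

    # Assumed consensus rule: identical margins -> High, spread of 1 -> Medium,
    # anything wider -> Low.
    margins = [margin for _, _, _, margin in matches]
    spread = max(margins) - min(margins) if margins else 0
    consensus = {0: "High", 1: "Medium"}.get(spread, "Low")
    return ranking, consensus


matches = [
    ("A", "B", "A", 2),
    ("A", "C", "A", 1),
    ("B", "C", "C", 1),
]
ranking, consensus = rank_round_robin(matches)
print(" > ".join(ranking), "| consensus:", consensus)
```

Net margin is a deliberately simple choice; a win-count scheme or a pairwise-preference model could be substituted without changing the table format.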

| Score | Anchor |
|-------|--------|
| 5 | Wikipedia featured article quality; textbook explanation; Paul Graham essay |
| 4 | Good Stack Overflow accepted answer; solid technical blog post |
| 3 | Average README; generic ChatGPT output; first-draft document |
| 2 | Incomplete FAQ; bullet-point notes without synthesis |
| 1 | Wrong, misleading, or incoherent |

| Dimension | Weight | Criteria |
|-----------|--------|----------|
| Coherence | bonus | Does the output have internal logical consistency? No contradictions? |
| Originality | bonus | Does it add genuine insight beyond summarizing? |
| Calibration | bonus | Are claims appropriately hedged vs stated with false certainty? |

## Advanced Evaluation Report

**Date**: [today]
**Output evaluated**: [name/description]
**Methodology**: LLM-as-Judge with position bias mitigation
**Passes**: [1 / 2 / N]

---

### Pass 1 Scores (Forward)

| Dimension | Score | Evidence (Level 3) | Improvement |
|-----------|-------|-------------------|-------------|
| Accuracy | [1-5] | "[exact quote]" — [why this is an issue] | [specific fix] |
| Completeness | [1-5] | "[what's missing]" | [what to add] |
| Usefulness | [1-5] | "[does it achieve goal?]" | [how to improve] |
| Clarity | [1-5] | "[structural/language issues]" | [how to clarify] |
| Freshness | [1-5] | "[stale elements]" | [what to update] |
| **Subtotal** | **[X.X/5]** | | |

### Pass 2 Scores (Reverse / Alternative Framing)

[Same table with reverse-order evaluation]

### Position Bias Check

| Dimension | Pass 1 | Pass 2 | Delta | Bias Detected? |
|-----------|--------|--------|-------|----------------|
| Accuracy | | | | Yes/No |
| ... | | | | |

**Overall bias impact**: [High/Medium/Low/None]

### Adversarial Probing

**Steelman (case for higher score)**: [strongest argument]
**Devil's advocate (case for lower score)**: [strongest argument]
**Conclusion**: [revised score if warranted, with reasoning]

---

### Final Calibrated Score

| Dimension | Score | Confidence |
|-----------|-------|------------|
| Accuracy | [1-5] | [High/Medium/Low] |
| Completeness | [1-5] | [H/M/L] |
| Usefulness | [1-5] | [H/M/L] |
| Clarity | [1-5] | [H/M/L] |
| Freshness | [1-5] | [H/M/L] |
| **Weighted Total** | **[X.X/5]** | |
| **Grade** | **[A/B/C/D/F]** | |

### Prioritized Improvements

**MUST FIX** (blocks use):
1. [Specific issue with Level 3 evidence] → [exact fix]

**SHOULD FIX** (significant quality gain):
2. [Issue] → [fix]

**NICE TO HAVE**:
3. [Issue] → [fix]

### Recommendation
[Clear, unambiguous recommendation with reasoning]
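
The Position Bias Check and Subtotal rows of the report can be filled in programmatically once both passes have been scored. The following is a rough sketch under stated assumptions: a dimension is flagged when the two passes differ by at least one point, the overall impact label depends on how many dimensions are flagged, and the subtotal is an unweighted mean. None of these thresholds or weights appear in the source, so treat them as placeholders to tune.

```python
from typing import Dict, List

DIMENSIONS = ("Accuracy", "Completeness", "Usefulness", "Clarity", "Freshness")


def position_bias_check(pass1: Dict[str, int], pass2: Dict[str, int],
                        threshold: int = 1) -> List[dict]:
    """Build one row per dimension: scores from both passes, delta, and a flag."""
    rows = []
    for dim in DIMENSIONS:
        delta = pass2[dim] - pass1[dim]
        rows.append({
            "dimension": dim,
            "pass1": pass1[dim],
            "pass2": pass2[dim],
            "delta": delta,
            # Assumed rule: a gap of `threshold` points or more counts as bias.
            "bias_detected": abs(delta) >= threshold,
        })
    return rows


def overall_bias_impact(rows: List[dict]) -> str:
    """Assumed mapping from the number of flagged dimensions to a label."""
    flagged = sum(1 for row in rows if row["bias_detected"])
    if flagged == 0:
        return "None"
    if flagged * 2 > len(rows):   # more than half of the dimensions
        return "High"
    return "Medium" if flagged > 1 else "Low"


def subtotal(scores: Dict[str, int]) -> float:
    """Unweighted mean over the five dimensions (an assumption; the report's
    final row is a weighted total whose weights are not given here)."""
    return round(sum(scores[dim] for dim in DIMENSIONS) / len(DIMENSIONS), 1)


pass1 = {"Accuracy": 4, "Completeness": 3, "Usefulness": 4, "Clarity": 5, "Freshness": 3}
pass2 = {"Accuracy": 4, "Completeness": 4, "Usefulness": 4, "Clarity": 3, "Freshness": 3}

rows = position_bias_check(pass1, pass2)
for row in rows:
    flag = "Yes" if row["bias_detected"] else "No"
    print(f"| {row['dimension']} | {row['pass1']} | {row['pass2']} | {row['delta']:+d} | {flag} |")
print("Overall bias impact:", overall_bias_impact(rows))
print("Pass 1 subtotal:", subtotal(pass1), "Pass 2 subtotal:", subtotal(pass2))
```

Keeping the per-dimension rows as plain dictionaries makes it straightforward to render them into the markdown table used by the report.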
