This skill should be used when the user asks to "implement LLM-as-judge", "compare model outputs", "create evaluation rubrics", "mitigate evaluation bias", or mentions direct scoring, pairwise comparison, position bias, evaluation pipelines, or automated quality assessment.
This skill covers production-grade techniques for evaluating LLM outputs using LLMs as judges. It synthesizes research from academic papers, industry practices, and practical implementation experience into actionable patterns for building reliable evaluation systems.
Key insight: LLM-as-a-Judge is not a single technique but a family of approaches, each suited to different evaluation contexts. Choosing the right approach and mitigating known biases is the core competency this skill develops.
Activate this skill when the user's request matches any of the triggers listed above.
Select between two primary approaches based on whether ground truth exists:
Direct Scoring — Use when objective criteria exist (factual accuracy, instruction following, toxicity). A single LLM rates one response on a defined scale. Achieves moderate-to-high reliability for well-defined criteria. Watch for score calibration drift and inconsistent scale interpretation.
Pairwise Comparison — Use for subjective preferences (tone, style, persuasiveness). An LLM compares two responses and selects the better one. Achieves higher human-judge agreement than direct scoring for preference tasks (Zheng et al., 2023). Watch for position bias and length bias.
Mitigate these systematic biases in every evaluation system:
Position Bias: First-position responses get preferential treatment. Mitigate by evaluating twice with swapped positions, then applying a majority vote or consistency check (see the sketch after this list).
Length Bias: Longer responses score higher regardless of quality. Mitigate by explicitly prompting to ignore length and applying length-normalized scoring.
Self-Enhancement Bias: Models rate their own outputs higher. Mitigate by using different models for generation and evaluation.
Verbosity Bias: Excessive detail scores higher even when unnecessary. Mitigate with criteria-specific rubrics that penalize irrelevant detail.
Authority Bias: Confident tone scores higher regardless of accuracy. Mitigate by requiring evidence citation and adding a fact-checking layer.
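A minimal sketch of the position-swap mitigation described above. `judgePair` stands in for whatever LLM judge call you use (an assumption, passed in as a parameter); it reports which *position* won, and the wrapper maps positions back to responses and keeps only verdicts that survive both orderings.

```typescript
// Position-bias mitigation: evaluate twice with swapped positions, then apply a consistency check.
type PairVerdict = { winner: "first" | "second" | "tie"; confidence: number };

async function judgeWithPositionSwap(
  prompt: string,
  responseA: string,
  responseB: string,
  judgePair: (prompt: string, first: string, second: string) => Promise<PairVerdict>, // your judge call (assumption)
): Promise<{ winner: "A" | "B" | "tie"; consistent: boolean }> {
  // Pass 1: A in the first position. Pass 2: positions swapped.
  const pass1 = await judgePair(prompt, responseA, responseB);
  const pass2 = await judgePair(prompt, responseB, responseA);

  // Map position labels back to response identities.
  const toResponse = (v: PairVerdict, swapped: boolean): "A" | "B" | "tie" =>
    v.winner === "tie" ? "tie" : (v.winner === "first") !== swapped ? "A" : "B";

  const winner1 = toResponse(pass1, false);
  const winner2 = toResponse(pass2, true);

  // Consistency check: only trust a verdict that both orderings agree on.
  if (winner1 === winner2) return { winner: winner1, consistent: true };
  return { winner: "tie", consistent: false };
}
```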
Match metrics to the evaluation task structure:
| Task Type | Primary Metrics | Secondary Metrics |
|---|---|---|
| Binary classification (pass/fail) | Recall, Precision, F1 | Cohen's kappa |
| Ordinal scale (1-5 rating) | Spearman's rho, Kendall's tau | Cohen's kappa (weighted) |
| Pairwise preference | Agreement rate, Position consistency | Confidence calibration |
| Multi-label | Macro-F1, Micro-F1 | Per-label precision/recall |
Prioritize systematic disagreement patterns over absolute agreement rates: a judge that consistently disagrees with humans on specific criteria is more problematic than one whose errors are random noise.
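As a concrete example of the binary-classification row above, here is a minimal sketch of Cohen's kappa for a pass/fail judge validated against human labels (function name and input shape are illustrative).

```typescript
// Cohen's kappa: agreement between judge and human verdicts, corrected for chance.
function cohensKappa(judge: boolean[], human: boolean[]): number {
  const n = judge.length;
  if (n === 0 || n !== human.length) throw new Error("need matched, non-empty label arrays");

  // Observed agreement: fraction of items where judge and human match.
  const observed = judge.filter((j, i) => j === human[i]).length / n;

  // Expected (chance) agreement from each rater's marginal pass rate.
  const judgePass = judge.filter(Boolean).length / n;
  const humanPass = human.filter(Boolean).length / n;
  const expected = judgePass * humanPass + (1 - judgePass) * (1 - humanPass);

  return expected === 1 ? 1 : (observed - expected) / (1 - expected);
}
```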
Build direct scoring with three components: clear criteria, a calibrated scale, and structured output format.
Criteria Definition Pattern:
Criterion: [Name]
Description: [What this criterion measures]
Weight: [Relative importance, 0-1]
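The same pattern expressed as a typed structure, if the criteria are loaded programmatically. Field names and the example entries are illustrative, not a required schema.

```typescript
// Criteria definition pattern as data: name, description, relative weight.
interface Criterion {
  name: string;        // e.g. "Factual Accuracy"
  description: string; // what this criterion measures
  weight: number;      // relative importance, 0-1
}

const criteria: Criterion[] = [
  { name: "Factual Accuracy", description: "Claims are correct and verifiable", weight: 1.0 },
  { name: "Instruction Following", description: "Response addresses every part of the prompt", weight: 0.8 },
];
```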
Scale Calibration — Choose scale granularity to match the level of detail in the rubric.
Prompt Structure for Direct Scoring:
You are an expert evaluator assessing response quality.
## Task
Evaluate the following response against each criterion.
## Original Prompt
{prompt}
## Response to Evaluate
{response}
## Criteria
{for each criterion: name, description, weight}
## Instructions
For each criterion:
1. Find specific evidence in the response
2. Score according to the rubric (1-{max} scale)
3. Justify your score with evidence
4. Suggest one specific improvement
## Output Format
Respond with structured JSON containing scores, justifications, and summary.
Always require the justification before the score in scoring prompts: research shows this improves reliability by 15-25% compared to score-first approaches.
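A sketch of assembling the direct-scoring prompt above and parsing the judge's structured output. The `callModel` parameter stands in for whatever LLM client you use (an assumption), and the defensive JSON extraction is one common way to handle judges that wrap JSON in prose.

```typescript
// Build the direct-scoring prompt from criteria, call the judge, and parse its JSON.
async function scoreResponse(
  prompt: string,
  response: string,
  criteria: { name: string; description: string; weight: number }[],
  callModel: (judgePrompt: string) => Promise<string>, // your LLM client call (assumption)
  maxScale = 5,
) {
  const criteriaBlock = criteria
    .map((c) => `- ${c.name} (weight ${c.weight}): ${c.description}`)
    .join("\n");

  const judgePrompt = [
    "You are an expert evaluator assessing response quality.",
    "## Task\nEvaluate the following response against each criterion.",
    `## Original Prompt\n${prompt}`,
    `## Response to Evaluate\n${response}`,
    `## Criteria\n${criteriaBlock}`,
    "## Instructions\nFor each criterion:\n1. Find specific evidence in the response\n" +
      `2. Score according to the rubric (1-${maxScale} scale)\n` +
      "3. Justify your score with evidence\n4. Suggest one specific improvement",
    "## Output Format\nRespond with structured JSON containing scores, justifications, and summary.",
  ].join("\n\n");

  // Parse defensively: judges sometimes wrap the JSON in prose or code fences.
  const raw = await callModel(judgePrompt);
  const json = raw.slice(raw.indexOf("{"), raw.lastIndexOf("}") + 1);
  return JSON.parse(json);
}
```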
Apply position bias mitigation in every pairwise evaluation:
Prompt Structure for Pairwise Comparison:
You are an expert evaluator comparing two AI responses.
## Critical Instructions
- Do NOT prefer responses because they are longer
- Do NOT prefer responses based on position (first vs second)
- Focus ONLY on quality according to the specified criteria
- Ties are acceptable when responses are genuinely equivalent
## Original Prompt
{prompt}
## Response A
{response_a}
## Response B
{response_b}
## Comparison Criteria
{criteria list}
## Instructions
1. Analyze each response independently first
2. Compare them on each criterion
3. Determine overall winner with confidence level
## Output Format
JSON with per-criterion comparison, overall winner, confidence (0-1), and reasoning.
Confidence Calibration — Map confidence to position consistency: when both passes agree on the winner, average their confidence values; when they disagree, treat the verdict as unreliable (see the sketch below).
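A minimal sketch of that calibration step, matching the worked pairwise example later in this document. The consistent case (average the two confidences) follows the example; the inconsistent case (fall back to a tie with a discounted confidence) is an assumption, since the source only shows the consistent case.

```typescript
// Combine the two position-swapped passes into a final, calibrated verdict.
type PassResult = { winner: "A" | "B" | "tie"; confidence: number };

function calibrate(firstPass: PassResult, mappedSecondPass: PassResult) {
  const consistent = firstPass.winner === mappedSecondPass.winner;
  return {
    // Consistent passes keep the shared winner; inconsistent passes fall back to a tie (assumption).
    winner: consistent ? firstPass.winner : "tie",
    // Average confidence when consistent; heavily discount it when the passes disagree (assumption).
    confidence: consistent
      ? (firstPass.confidence + mappedSecondPass.confidence) / 2
      : Math.min(firstPass.confidence, mappedSecondPass.confidence) * 0.5,
    positionConsistency: {
      consistent,
      firstPassWinner: firstPass.winner,
      secondPassWinner: mappedSecondPass.winner,
    },
  };
}
```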
Generate rubrics to reduce evaluation variance by 40-60% compared to open-ended scoring.
Include the core rubric components: the criterion name and description, a descriptor for each scale level, and the criterion weight (a sketch of this structure follows below).
Set strictness calibration appropriate to the use case.
Adapt rubrics to the domain — use domain-specific terminology. A code readability rubric mentions variables, functions, and comments. A medical accuracy rubric references clinical terminology and evidence standards.
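One way to represent such a rubric as data, using the code-readability criterion from the final example in this document. The structure and the level descriptors are illustrative assumptions; the key idea is a descriptor per scale level written in domain-specific terms.

```typescript
// A generated rubric: per-level descriptors anchored in domain terminology.
interface Rubric {
  criterion: string;              // e.g. "Code Readability"
  description: string;
  scaleMax: number;
  levels: Record<number, string>; // score -> what a response at that score looks like
}

const codeReadabilityRubric: Rubric = {
  criterion: "Code Readability",
  description: "How easy the code is to understand and maintain",
  scaleMax: 3,
  levels: {
    1: "Unclear variable names, no comments, tangled functions",
    2: "Mostly clear names and structure, sparse comments on non-obvious logic",
    3: "Descriptive names, small focused functions, comments that explain intent",
  },
};
```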
Build production evaluation systems with these layers: Criteria Loader (rubrics + weights) -> Primary Scorer (direct or pairwise) -> Bias Mitigation (position swap, etc.) -> Confidence Scoring (calibration) -> Output (scores + justifications + confidence). See Evaluation Pipeline Diagram for the full visual layout.
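A sketch of those layers as a typed composition. All stage shapes are assumptions; the point is the ordering of concerns: load criteria, score, de-bias, calibrate, then emit the final output.

```typescript
// Evaluation pipeline layers composed as injectable stages.
interface PipelineStages<C, S> {
  loadCriteria: () => Promise<C>;                                             // Criteria Loader: rubrics + weights
  primaryScore: (prompt: string, responses: string[], criteria: C) => Promise<S>; // direct or pairwise
  mitigateBias: (scores: S) => Promise<S>;                                    // position swap, length checks, etc.
  calibrateConfidence: (scores: S) => S & { confidence: number };             // attach confidence estimates
}

async function runPipeline<C, S>(stages: PipelineStages<C, S>, prompt: string, responses: string[]) {
  const criteria = await stages.loadCriteria();
  const scored = await stages.primaryScore(prompt, responses, criteria);
  const debiased = await stages.mitigateBias(scored);
  return stages.calibrateConfidence(debiased); // scores + justifications + confidence
}
```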
Apply this decision tree:
Is there an objective ground truth?
+-- Yes -> Direct Scoring
| Examples: factual accuracy, instruction following, format compliance
|
+-- No -> Is it a preference or quality judgment?
+-- Yes -> Pairwise Comparison
| Examples: tone, style, persuasiveness, creativity
|
+-- No -> Consider reference-based evaluation
Examples: summarization (compare to source), translation (compare to reference)
For high-volume evaluation, apply one of these strategies:
Panel of LLMs (PoLL): Use multiple models as judges and aggregate votes to reduce individual model bias. More expensive but more reliable for high-stakes decisions.
Hierarchical evaluation: Use a fast, cheap model for screening and an expensive model for edge cases. Requires calibration of the screening threshold; see the sketch after these strategies.
Human-in-the-loop: Automate clear cases and route low-confidence decisions to human review. Design feedback loops to improve automated evaluation over time.
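A sketch combining the hierarchical and human-in-the-loop strategies. The judge functions, the `queueForHuman` hook, and the 0.75 threshold are all assumptions to be calibrated for your workload.

```typescript
// Hierarchical evaluation: cheap screen -> strong judge for edge cases -> human review fallback.
type Verdict = { pass: boolean; confidence: number };

async function hierarchicalEvaluate(
  item: string,
  cheapJudge: (item: string) => Promise<Verdict>,
  strongJudge: (item: string) => Promise<Verdict>,
  queueForHuman: (item: string, verdicts: Verdict[]) => void,
  threshold = 0.75, // screening threshold (assumption; requires calibration)
): Promise<Verdict | "needs-human-review"> {
  // Fast, cheap model screens everything first.
  const screen = await cheapJudge(item);
  if (screen.confidence >= threshold) return screen;

  // Edge cases escalate to the stronger, more expensive judge.
  const strong = await strongJudge(item);
  if (strong.confidence >= threshold) return strong;

  // Still uncertain: route to human review and feed the labels back into calibration later.
  queueForHuman(item, [screen, strong]);
  return "needs-human-review";
}
```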
Input:
Prompt: "What causes seasons on Earth?"
Response: "Seasons are caused by Earth's tilted axis. As Earth orbits the Sun,
different hemispheres receive more direct sunlight at different times of year."
Criterion: Factual Accuracy (weight: 1.0)
Scale: 1-5
Output:
{
"criterion": "Factual Accuracy",
"score": 5,
"evidence": [
"Correctly identifies axial tilt as primary cause",
"Correctly explains differential sunlight by hemisphere",
"No factual errors present"
],
"justification": "Response accurately explains the cause of seasons with correct
scientific reasoning. Both the axial tilt and its effect on sunlight distribution
are correctly described.",
"improvement": "Could add the specific tilt angle (23.5 degrees) for completeness."
}
Input:
Prompt: "Explain machine learning to a beginner"
Response A: [Technical explanation with jargon]
Response B: [Simple analogy-based explanation]
Criteria: ["clarity", "accessibility"]
First Pass (A first):
{ "winner": "B", "confidence": 0.8 }
Second Pass (B first):
{ "winner": "A", "confidence": 0.6 }
(Note: the label "A" here refers to the first-position slot, which contains Response B in this swapped pass)
Mapped Second Pass:
{ "winner": "B", "confidence": 0.6 }
Final Result:
{
"winner": "B",
"confidence": 0.7,
"positionConsistency": {
"consistent": true,
"firstPassWinner": "B",
"secondPassWinner": "B"
}
}
Input:
criterionName: "Code Readability"
criterionDescription: "How easy the code is to understand and maintain"