This skill should be used when the user asks to "implement LLM-as-judge", "compare model outputs", "create evaluation rubrics", "mitigate evaluation bias", or mentions direct scoring, pairwise comparison, position bias, evaluation pipelines, or automated quality assessment.
Implement production-grade LLM-as-judge patterns to evaluate model outputs. This skill provides a taxonomy for choosing the right evaluation method and protocols for mitigating systematic biases.
| Resource | Description |
|---|---|
| Evaluation Frameworks | Choosing between Direct Scoring and Pairwise Comparison. |
| Bias Mitigation | Protocols for Position, Length, and Verbosity bias. |
| Rubric Design | Patterns for creating consistent grading standards. |
| Case Studies | Implementation examples for Accuracy, Tone, and Readability. |
**Direct Scoring:** Best for objective criteria (accuracy, format). Rates each output on a fixed scale (1-5 or 1-10). See Evaluation Frameworks for details.
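A minimal direct-scoring sketch, assuming a `call_llm` callable that wraps whatever model client you use; the accuracy rubric and the JSON output contract below are illustrative, not fixed by this skill.

```python
import json
from typing import Callable

DIRECT_SCORING_PROMPT = """\
You are an evaluator. Rate the response for factual accuracy on a 1-5 scale:
1 = mostly incorrect, 3 = minor factual errors, 5 = fully accurate.
Reply with JSON only: {{"reasoning": "<brief>", "score": <1-5>}}

Question: {question}
Response: {response}"""

def direct_score(question: str, response: str,
                 call_llm: Callable[[str], str]) -> int:
    """Score one output against an objective rubric; returns 1-5."""
    raw = call_llm(DIRECT_SCORING_PROMPT.format(question=question,
                                                response=response))
    return int(json.loads(raw)["score"])
```

Anchoring each scale point in the rubric text (rather than asking for a bare number) keeps scores more consistent across runs.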
**Pairwise Comparison:** Best for subjective preference (style, tone, helpfulness). Compares two outputs directly. Required: always apply Bias Mitigation (Position Swap).
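A minimal sketch of the position-swap protocol, assuming the same hypothetical `call_llm` wrapper: the judge is called once per ordering, and a verdict only counts if it survives the swap. The tie-on-disagreement policy is one common convention, not the only option.

```python
from typing import Callable

PAIRWISE_PROMPT = """\
Compare the two responses to the question and choose the more helpful one.
Answer with exactly one letter: "A" or "B".

Question: {question}
Response A: {a}
Response B: {b}"""

def pairwise_judge(question: str, out1: str, out2: str,
                   call_llm: Callable[[str], str]) -> str:
    """Judge twice with positions swapped; return 'out1', 'out2', or 'tie'."""
    first = call_llm(PAIRWISE_PROMPT.format(question=question,
                                            a=out1, b=out2)).strip()
    swapped = call_llm(PAIRWISE_PROMPT.format(question=question,
                                              a=out2, b=out1)).strip()
    # Map both verdicts back to the original outputs.
    v1 = "out1" if first.startswith("A") else "out2"
    v2 = "out1" if swapped.startswith("B") else "out2"
    # If the verdict flips with position, the judge is position-biased
    # on this pair: record a tie rather than a spurious win.
    return v1 if v1 == v2 else "tie"
```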
- context-fundamentals: Prompt structure for evaluation.
- tool-design: Building evaluation infrastructure.

Build reliable evaluation systems by selecting the correct taxonomy and aggressively mitigating LLM biases.