LLM-as-a-Judge evaluation system powered by Langfuse. Uses GPT-5-nano to score AI outputs.
## When to Use
- Evaluating quality of search results or AI responses
- Scoring traces for relevance, accuracy, hallucination detection
- Batch scoring recent unscored traces
- Quality assurance on agent outputs
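The "batch scoring recent unscored traces" use case boils down to selecting which traces still need a judge pass. A minimal sketch of that selection step, using plain dicts in place of real Langfuse trace records (the field names `timestamp` and `scores` are assumptions, not the actual script's schema):

```python
from datetime import datetime, timedelta, timezone

def select_unscored(traces, limit, max_age_hours=24):
    """Return up to `limit` recent traces that have no scores yet.

    Hypothetical sketch: the real backfill command reads traces from
    Langfuse; here they are plain dicts with assumed field names.
    """
    cutoff = datetime.now(timezone.utc) - timedelta(hours=max_age_hours)
    candidates = [
        t for t in traces
        if not t.get("scores") and t["timestamp"] >= cutoff
    ]
    # Score the newest traces first, up to the requested limit.
    candidates.sort(key=lambda t: t["timestamp"], reverse=True)
    return candidates[:limit]
```

Filtering before scoring keeps backfill idempotent: traces that already carry scores are skipped on every run.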
## Usage

```bash
# Test with sample cases
python3 {baseDir}/scripts/evaluator.py test

# Score a specific Langfuse trace
python3 {baseDir}/scripts/evaluator.py score <trace_id>

# Score with a specific evaluator only
python3 {baseDir}/scripts/evaluator.py score <trace_id> --evaluators relevance

# Backfill scores on recent unscored traces
python3 {baseDir}/scripts/evaluator.py backfill --limit 20
```
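Under the hood, an LLM-as-a-Judge evaluator like the one these commands drive typically prompts the judge model for a structured verdict and parses a numeric score out of the reply. A minimal sketch of that loop; the prompt wording, 0.0–1.0 scale, and injected `call_judge` callable are assumptions for illustration, not the actual `evaluator.py` implementation:

```python
import json

# Hypothetical judge prompt: asks for a JSON verdict so the reply is parseable.
JUDGE_PROMPT = """You are an evaluation judge. Rate the RESPONSE for {criterion}
on a scale of 0.0 to 1.0 and explain briefly.

QUERY: {query}
RESPONSE: {response}

Reply with JSON only: {{"score": <float>, "reasoning": "<string>"}}"""

def parse_verdict(raw: str) -> dict:
    """Parse the judge model's JSON reply, clamping the score to [0, 1]."""
    verdict = json.loads(raw)
    score = max(0.0, min(1.0, float(verdict["score"])))
    return {"score": score, "reasoning": str(verdict.get("reasoning", ""))}

def judge(query: str, response: str, criterion: str, call_judge) -> dict:
    """Format the prompt, call the (injected) judge model, parse the result."""
    prompt = JUDGE_PROMPT.format(criterion=criterion, query=query, response=response)
    return parse_verdict(call_judge(prompt))

if __name__ == "__main__":
    # Stub in place of a real GPT-5-nano call so the sketch runs standalone.
    fake_model = lambda prompt: '{"score": 0.85, "reasoning": "On topic."}'
    result = judge("capital of France?", "Paris.", "relevance", fake_model)
    print(result["score"])
```

Injecting the model call as a parameter keeps the scoring logic testable without network access, which is also what makes a `test` subcommand with sample cases cheap to run.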
## Evaluators