A guide to evaluating RAG pipelines using Golden Datasets and LLM-as-a-Judge grading.
How do you know if your RAG system is actually good? "It feels fast" is not a metric.
RAG has two failure points: retrieval (the pipeline fetches the wrong context) and generation (the model answers poorly even with the right context). To measure both, you need a "Golden Set" of 50+ question-answer pairs known to be correct.
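A golden set is typically stored one JSON object per line (JSONL). The records below are a minimal sketch: the field names (`question`, `ground_truth`) and the sample pair are illustrative assumptions, not the exact schema the tooling emits.

```python
import json

# Hypothetical golden-set records; field names and content are
# illustrative assumptions, not the generator's actual schema.
golden_records = [
    {
        "question": "What retention period does the backup policy specify?",
        "ground_truth": "30 days for incrementals, 1 year for full backups.",
    },
]

# Write one JSON object per line (JSONL).
with open("golden_set.jsonl", "w") as f:
    for rec in golden_records:
        f.write(json.dumps(rec) + "\n")

# Read it back to confirm the records round-trip cleanly.
with open("golden_set.jsonl") as f:
    loaded = [json.loads(line) for line in f]
```

JSONL (rather than one big JSON array) keeps the set easy to diff in code review and to stream line-by-line during evaluation.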
Tool: `.agent/scripts/generate-eval-set.py`

```bash
python .agent/scripts/generate-eval-set.py --file ./my_docs.md --count 10
```
This uses GPT-4o to read your docs and generate hard question-answer pairs.
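Conceptually, such a generator is a loop that prompts a strong model for one hard Q/A pair at a time. The sketch below is an assumption about how that loop might look, not the actual script: the prompt wording is invented, and `llm` stands in for any prompt-to-completion callable (e.g. a GPT-4o wrapper).

```python
from typing import Callable

def generate_eval_set(doc_text: str, count: int,
                      llm: Callable[[str], str]) -> list[dict]:
    """Ask an LLM for `count` hard Q/A pairs grounded in doc_text.

    `llm` is any prompt -> completion callable; the prompt wording
    below is illustrative, not the real script's.
    """
    pairs = []
    for _ in range(count):
        prompt = (
            "Read the document below and write ONE hard question that can "
            "only be answered from it, followed by the correct answer.\n"
            "Format: Q: <question>\\nA: <answer>\n\n" + doc_text
        )
        reply = llm(prompt)
        # Split the reply into its question and answer halves.
        q, _, a = reply.partition("\nA:")
        pairs.append({
            "question": q.removeprefix("Q:").strip(),
            "ground_truth": a.strip(),
        })
    return pairs
```

Generating pairs one at a time (rather than asking for all ten in one completion) trades tokens for easier parsing and more consistent difficulty.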
This is LLM-as-a-Judge: we ask a strong judge model (GPT-4o) to grade the pipeline model's (Llama-3's) answer against the golden answer.
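The judging step can be sketched as: build a rubric prompt containing the question, the golden answer, and the candidate answer, then parse a numeric score out of the judge's reply. The rubric wording and the 1-5 scale here are assumptions for illustration; `judge_llm` stands in for any prompt-to-completion callable.

```python
import re
from typing import Callable

def judge_answer(question: str, golden: str, candidate: str,
                 judge_llm: Callable[[str], str]) -> int:
    """Grade `candidate` against `golden` with an LLM judge.

    Returns an integer 1-5. The rubric text and scale are illustrative,
    not a fixed standard.
    """
    prompt = (
        "You are grading a RAG system's answer.\n"
        f"Question: {question}\n"
        f"Reference answer: {golden}\n"
        f"Candidate answer: {candidate}\n"
        "Score the candidate from 1 (wrong) to 5 (fully correct and "
        "faithful to the reference). Reply with ONLY the number."
    )
    reply = judge_llm(prompt)
    # Judges often wrap the number in prose; pull out the first digit 1-5.
    match = re.search(r"[1-5]", reply)
    if match is None:
        raise ValueError(f"Judge reply had no 1-5 score: {reply!r}")
    return int(match.group())
```

Parsing defensively matters: even with "reply with ONLY the number" in the prompt, judge models sometimes add commentary around the score.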
Use tools like Ragas or DeepEval, or our internal script:
```bash
# Conceptual usage
python .agent/scripts/evaluate_model.py --test-set golden_set.jsonl --pipeline my_rag
```
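Whichever runner you use, the output you want is a per-question score plus headline aggregates. A minimal sketch of that aggregation, assuming each result row carries a 1-5 `score` field (the field name and the pass threshold of >= 4 are illustrative choices, not a standard):

```python
from statistics import mean

def summarize(results: list[dict]) -> dict:
    """Roll per-question judge scores (1-5) up into headline metrics.

    Assumes each row has a `score` field; the >= 4 pass threshold
    is an arbitrary illustrative cutoff.
    """
    scores = [r["score"] for r in results]
    return {
        "n": len(scores),
        "mean_score": mean(scores),
        "pass_rate": sum(s >= 4 for s in scores) / len(scores),
    }
```

Tracking pass rate alongside the mean keeps one catastrophic answer from hiding behind several perfect ones.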
| Need | Skill |
|---|---|
| Improve retrieval scores | rag-patterns (§Retrieval Strategies) |
| Improve generation quality | prompt-engineering |
| Fine-tune for domain-specific quality | llm-finetuning |
| Monitor eval metrics in prod | ai-observability |