A guide to evaluating RAG pipelines using Golden Datasets and LLM-as-a-Judge grading.
How do you know if your RAG system is actually good? "It feels fast" is not a metric.
RAG has two failure points: retrieval (the pipeline fetches the wrong context) and generation (the model answers poorly even with the right context). To measure both, you need a "Golden Set" of 50+ question-answer pairs known to be correct.
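A golden set is typically stored one JSON object per line (JSONL). The records below are a minimal sketch: the field names (`question`, `ground_truth`) and the sample pair are illustrative assumptions, not the exact schema the tooling emits.

```python
import json

# Hypothetical golden-set records; field names and content are
# illustrative assumptions, not the generator's actual schema.
golden_records = [
    {
        "question": "What retention period does the backup policy specify?",
        "ground_truth": "30 days for incrementals, 1 year for full backups.",
    },
]

# Write one JSON object per line (JSONL).
with open("golden_set.jsonl", "w") as f:
    for rec in golden_records:
        f.write(json.dumps(rec) + "\n")

# Read it back to confirm the records round-trip cleanly.
with open("golden_set.jsonl") as f:
    loaded = [json.loads(line) for line in f]
```

JSONL (rather than one big JSON array) keeps the set easy to diff in code review and to stream line-by-line during evaluation.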
Tool: `.agent/scripts/generate-eval-set.py`

```bash
python .agent/scripts/generate-eval-set.py --file ./my_docs.md --count 10
```
This uses GPT-4o to read your docs and generate hard question-answer pairs.
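Conceptually, such a generator is a loop that prompts a strong model for one hard Q/A pair at a time. The sketch below is an assumption about how that loop might look, not the actual script: the prompt wording is invented, and `llm` stands in for any prompt-to-completion callable (e.g. a GPT-4o wrapper).

```python
from typing import Callable

def generate_eval_set(doc_text: str, count: int,
                      llm: Callable[[str], str]) -> list[dict]:
    """Ask an LLM for `count` hard Q/A pairs grounded in doc_text.

    `llm` is any prompt -> completion callable; the prompt wording
    below is illustrative, not the real script's.
    """
    pairs = []
    for _ in range(count):
        prompt = (
            "Read the document below and write ONE hard question that can "
            "only be answered from it, followed by the correct answer.\n"
            "Format: Q: <question>\\nA: <answer>\n\n" + doc_text
        )
        reply = llm(prompt)
        # Split the reply into its question and answer halves.
        q, _, a = reply.partition("\nA:")
        pairs.append({
            "question": q.removeprefix("Q:").strip(),
            "ground_truth": a.strip(),
        })
    return pairs
```

Generating pairs one at a time (rather than asking for all ten in one completion) trades tokens for easier parsing and more consistent difficulty.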
This is LLM-as-a-Judge: we ask a strong judge model (GPT-4o) to grade the pipeline model's (Llama-3's) answer against the golden answer.
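The judging step can be sketched as: build a rubric prompt containing the question, the golden answer, and the candidate answer, then parse a numeric score out of the judge's reply. The rubric wording and the 1-5 scale here are assumptions for illustration; `judge_llm` stands in for any prompt-to-completion callable.

```python
import re
from typing import Callable

def judge_answer(question: str, golden: str, candidate: str,
                 judge_llm: Callable[[str], str]) -> int:
    """Grade `candidate` against `golden` with an LLM judge.

    Returns an integer 1-5. The rubric text and scale are illustrative,
    not a fixed standard.
    """
    prompt = (
        "You are grading a RAG system's answer.\n"
        f"Question: {question}\n"
        f"Reference answer: {golden}\n"
        f"Candidate answer: {candidate}\n"
        "Score the candidate from 1 (wrong) to 5 (fully correct and "
        "faithful to the reference). Reply with ONLY the number."
    )
    reply = judge_llm(prompt)
    # Judges often wrap the number in prose; pull out the first digit 1-5.
    match = re.search(r"[1-5]", reply)
    if match is None:
        raise ValueError(f"Judge reply had no 1-5 score: {reply!r}")
    return int(match.group())
```

Parsing defensively matters: even with "reply with ONLY the number" in the prompt, judge models sometimes add commentary around the score.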
Use tools like Ragas or DeepEval, or our internal script:
```bash
# Conceptual usage
python .agent/scripts/evaluate_model.py --test-set golden_set.jsonl --pipeline my_rag
```
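Whichever runner you use, the output you want is a per-question score plus headline aggregates. A minimal sketch of that aggregation, assuming each result row carries a 1-5 `score` field (the field name and the pass threshold of >= 4 are illustrative choices, not a standard):

```python
from statistics import mean

def summarize(results: list[dict]) -> dict:
    """Roll per-question judge scores (1-5) up into headline metrics.

    Assumes each row has a `score` field; the >= 4 pass threshold
    is an arbitrary illustrative cutoff.
    """
    scores = [r["score"] for r in results]
    return {
        "n": len(scores),
        "mean_score": mean(scores),
        "pass_rate": sum(s >= 4 for s in scores) / len(scores),
    }
```

Tracking pass rate alongside the mean keeps one catastrophic answer from hiding behind several perfect ones.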
| Need | Skill |
|---|---|
| Improve retrieval scores | rag-patterns (§Retrieval Strategies) |
| Improve generation quality | prompt-engineering |
| Fine-tune for domain-specific quality | llm-finetuning |
| Monitor eval metrics in prod | ai-observability |