Retrieval-Augmented Generation pipelines — ingestion, chunking, embedding, vector stores, retrieval, evaluation. Use when building a RAG pipeline, choosing chunking strategies or embedding models, debugging retrieval quality or hallucinations, evaluating an existing RAG system, or scaling/migrating vector stores.
A RAG pipeline has two phases: indexing (offline) and retrieval+generation (online). Decisions cascade — bad chunking ruins retrieval no matter how good your embeddings are, and bad retrieval ruins generation no matter how strong your LLM is. Start from the data and work forward, and measure with a golden eval set from day one.
Indexing: Documents -> Parse -> Chunk -> Embed -> Vector DB
Query: Question -> Embed -> Retrieve top-k -> Prompt -> LLM
Parse source documents into clean text with metadata preserved. PDFs: pymupdf4llm or unstructured; preserve headings, tables, page numbers, with an OCR fallback for scanned docs via tesseract. HTML: trafilatura or readability-lxml.

Chunking: pick a strategy based on document structure and query patterns. See references/chunking-strategies.md for code and trade-offs.
Default to recursive splitting: split on \n\n, then \n, then space. Rule of thumb: chunk size should match expected answer granularity. FAQ -> small chunks. Long technical explanations -> larger chunks.
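A minimal recursive splitter can be sketched in a few lines (character counts stand in for token counts; `recursive_split` and its defaults are illustrative, not a library API):

```python
def recursive_split(text, max_len=512, separators=("\n\n", "\n", " ")):
    """Greedily pack text into chunks <= max_len, splitting on the
    coarsest separator first and recursing on finer ones as needed."""
    if len(text) <= max_len:
        return [text]
    for idx, sep in enumerate(separators):
        parts = text.split(sep)
        if len(parts) == 1:
            continue  # separator absent; try a finer one
        chunks, current = [], ""
        for part in parts:
            candidate = current + sep + part if current else part
            if len(candidate) <= max_len:
                current = candidate
                continue
            if current:
                chunks.append(current)
            if len(part) > max_len:
                # this piece alone is still too big: recurse with finer separators
                chunks.extend(recursive_split(part, max_len, separators[idx + 1:]))
                current = ""
            else:
                current = part
        if current:
            chunks.append(current)
        return chunks
    # no separator left: hard split
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]
```

Production splitters (e.g. LangChain's RecursiveCharacterTextSplitter) add overlap and token-based length functions on top of the same idea.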
Pick on domain match, dimensionality, speed, and cost.
| Model | Dims | Context | Notes |
|---|---|---|---|
| text-embedding-3-small | 1536 | 8191 | Good default, cheap |
| text-embedding-3-large | 3072 | 8191 | Better quality, 2x cost |
| nomic-embed-text | 768 | 8192 | Open source, strong MTEB |
| bge-large-en-v1.5 | 1024 | 512 | Open source, good for code |
| voyage-code-2 | 1536 | 16000 | Best for code retrieval |
Always benchmark on your queries before committing. MTEB leaderboard scores don't predict domain-specific performance. For significant gains on a tight domain, fine-tune with synthetic query/passage pairs and contrastive loss — even 1000 pairs helps.
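As a sketch of that benchmark, recall@k over a golden query set reduces to a few lines of numpy (`embed` is a stand-in for whichever embedding model is under test; one known-relevant doc per query is assumed for simplicity):

```python
import numpy as np

def recall_at_k(queries, relevant_ids, doc_ids, doc_vecs, embed, k=5):
    """Fraction of queries whose known-relevant doc appears in the top-k
    by cosine similarity. doc_vecs: (n_docs, dim) array of doc embeddings."""
    hits = 0
    for query, rel_id in zip(queries, relevant_ids):
        q = embed(query)
        sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
        top_k = [doc_ids[i] for i in np.argsort(-sims)[:k]]
        hits += rel_id in top_k
    return hits / len(queries)
```

Run this once per candidate model on the same query set; the model with the best recall@k on your data wins, regardless of MTEB rank.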
For most projects: start with Chroma locally, move to Qdrant or pgvector for production.
Rerank retrieved candidates with a cross-encoder (bge-reranker-v2-m3, Cohere Rerank). See references/retrieval-patterns.md for the architecture diagram and per-pattern code.
The retrieval-to-generation handoff is where many RAG systems fail.
Number chunks in the prompt as [1], [2]; instruct the LLM to cite; verify citations in post-processing.

Key metrics: context precision (are retrieved chunks relevant?), context recall (are all needed chunks retrieved?), faithfulness (is the answer grounded in context?), answer relevance (does it address the question?).
Use the RAGAS framework. Build a golden dataset of 50–100
question/ideal-context/reference-answer triples from your real
documents. See references/evaluation.md for metric details and
debugging workflow.
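For intuition, the two retrieval-side metrics reduce to set overlap when relevance labels are binary — a simplification of what RAGAS computes with LLM judgments, useful for quick sanity checks against the golden dataset:

```python
def context_precision(retrieved, relevant):
    """Share of retrieved chunk IDs that are in the golden relevant set."""
    if not retrieved:
        return 0.0
    return len(set(retrieved) & set(relevant)) / len(retrieved)

def context_recall(retrieved, relevant):
    """Share of golden relevant chunk IDs that were actually retrieved."""
    if not relevant:
        return 1.0
    return len(set(retrieved) & set(relevant)) / len(relevant)
```

Low precision means noise in the context window; low recall means the answer's evidence never reached the LLM. They fail for different reasons and need different fixes.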
Reciprocal Rank Fusion is the standard combiner for hybrid retrieval — no score-weight tuning needed:
RRF(d) = sum( 1 / (k + rank_i(d)) ) for each retrieval method i
(k typically 60)
Hybrid + RRF outperforms either dense or sparse alone on nearly every benchmark and should be the default for production. Vector stores with native hybrid support: Qdrant, Weaviate, Elasticsearch. For others, run both searches and fuse in application code.
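Fusing in application code is short enough to show in full; this follows the formula above, assuming each retrieval method returns a ranked list of document IDs:

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: each method contributes 1/(k + rank) per doc.
    rankings: one ranked list of doc IDs per retrieval method.
    Returns doc IDs sorted by fused score, best first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Note that a doc ranked moderately by both methods beats a doc ranked highly by only one — that is the behavior that makes RRF robust without score normalization.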
Bi-encoders (embedding models) score query and document
independently. Cross-encoders score (query, document) jointly —
much more accurate but too slow for full-corpus search.
Stage 1: Retrieve top-50 with fast method (dense, hybrid)
Stage 2: Rerank to top-5 with cross-encoder
Models: bge-reranker-v2-m3 (open, strong), Cohere Rerank (API,
easy), ms-marco-MiniLM-L-12-v2 (fast, lighter). Latency is
typically 100–500ms for 50 candidates.
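The two-stage pattern itself is model-agnostic; sketched here with the retriever and cross-encoder abstracted as callables (`retrieve` and `score_fn` are illustrative names — in practice `score_fn` would wrap e.g. a bge-reranker `predict` call):

```python
def retrieve_then_rerank(query, retrieve, score_fn, fetch_k=50, final_k=5):
    """Stage 1: cheap retrieval of fetch_k candidates.
    Stage 2: score each (query, doc) pair jointly, keep the best final_k.

    retrieve(query, k) -> list of docs; score_fn(query, doc) -> float.
    """
    candidates = retrieve(query, fetch_k)
    scored = sorted(candidates, key=lambda d: score_fn(query, d), reverse=True)
    return scored[:final_k]
```

The fetch_k/final_k split is the main tuning knob: larger fetch_k improves recall into the reranker at the cost of reranking latency.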
MMR = argmax[ lambda * sim(q, d) - (1 - lambda) * max(sim(d, d_selected)) ]
Greedy: pick the highest-scoring candidate, repeat. Lambda 1.0 = pure relevance, 0.0 = pure diversity. Start at 0.6. Lower lambda when chunks are highly redundant.
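The greedy loop above, implemented directly (`sim` is any similarity function, e.g. cosine over embeddings; returns candidate indices in selection order):

```python
def mmr_select(query_vec, cand_vecs, sim, k=5, lam=0.6):
    """Greedy Maximal Marginal Relevance: each step picks the candidate
    maximizing lam * relevance - (1 - lam) * redundancy vs. already-selected."""
    selected = []
    remaining = list(range(len(cand_vecs)))
    while remaining and len(selected) < k:
        def mmr_score(i):
            relevance = sim(query_vec, cand_vecs[i])
            redundancy = max(
                (sim(cand_vecs[i], cand_vecs[j]) for j in selected), default=0.0
            )
            return lam * relevance - (1 - lam) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected
```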
Embed at 256 tokens (precise matching) but store a mapping to a 1024-token parent (richer context). Match on child embeddings, return the parent. Small chunks find precise matches; large chunks give the LLM enough context to answer well.
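A sketch of the child-to-parent mapping, with character offsets standing in for token counts and `match_fn` standing in for the embedding search (all names illustrative):

```python
def build_parent_index(parents, child_size=256):
    """Split each parent doc into small children; record child -> parent."""
    children, child_to_parent = [], []
    for pid, text in enumerate(parents):
        for start in range(0, len(text), child_size):
            children.append(text[start:start + child_size])
            child_to_parent.append(pid)
    return children, child_to_parent

def retrieve_parents(query, children, child_to_parent, parents, match_fn, k=3):
    """Match on small children, return deduplicated parents, order preserved.
    match_fn(query, children) -> child indices ranked best-first."""
    ranked_child_ids = match_fn(query, children)[:k]
    seen, out = set(), []
    for cid in ranked_child_ids:
        pid = child_to_parent[cid]
        if pid not in seen:
            seen.add(pid)
            out.append(parents[pid])
    return out
```

In a real pipeline the children are what get embedded into the vector store, and `child_to_parent` lives in chunk metadata.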
Use a small/fast LLM (Haiku, GPT-4o-mini) to rewrite the query into 3–5 variants, retrieve top-10 for each, deduplicate by document ID, fuse with RRF. Adds ~200–500ms of latency. Best for ambiguous or broad queries.
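The flow in outline form, with the LLM rewriter, retriever, and fusion step abstracted as callables (all names illustrative):

```python
def multi_query_retrieve(query, rewrite, retrieve, fuse, k=10):
    """Rewrite the query into variants, retrieve for each, then fuse.

    rewrite(query) -> list of rephrasings (a small/fast LLM in practice);
    retrieve(q, k) -> ranked doc IDs; fuse(rankings) -> fused ranking (e.g. RRF).
    """
    variants = [query] + rewrite(query)
    rankings = [retrieve(v, k) for v in variants]
    return fuse(rankings)
```

Fusing the per-variant rankings also handles deduplication, since each document ID collapses to one fused score.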
Pre-filter candidates before or during vector search to reduce search space and improve precision.
```python
results = vector_store.similarity_search(
    query_embedding, k=10,
    filter={"source": "api-docs", "version": ">=3.0"},
)
```
Useful filters: document type, date range, language, author, product version, access control. Pre-filter > post-filter when the index supports it.
Prepend "Title: {title}\nSection: {heading}\n\n" to chunk text before embedding so retrieval matches queries that reference titles or sections.

Quality targets:

| Metric | Target |
|---|---|
| Context precision | > 0.8 |
| Context recall | > 0.7 |
| Faithfulness | > 0.9 (non-negotiable for production) |
| Answer relevancy | > 0.8 |
| Symptom | Likely cause | Fix |
|---|---|---|
| Wrong docs retrieved | Vocabulary mismatch | Add hybrid search |
| Right doc exists, not retrieved | Chunk too large, answer buried | Smaller chunks, parent-doc retrieval |
| Top-1 right, rest noise | No reranking | Add cross-encoder reranker |
| Same info repeated | Overlapping chunks, no dedup | MMR or dedup pass |
| Answer contradicts context | Hallucination / model prior | Stronger grounding prompt, better model |
| Vague generic answer | Context not used effectively | Reorder context, improve template |
| Refuses despite good context | Over-cautious system prompt | Relax prompt, check conflicts |
| Answers part of question | Low context recall | Multi-query, smaller chunks |
Golden rule: never tune the prompt to fix a retrieval problem, and never tune retrieval to fix a prompt problem. Isolate the variable.
Basic RAG over internal docs. Recursive chunking at 512
tokens, text-embedding-3-small, Chroma. Dense retrieval, top-5.
Build an eval set from real user questions. Iterate from this
baseline.
RAG misses relevant info. Diagnose context recall — what was retrieved vs. what should have been? Add hybrid search (BM25 + dense), add cross-encoder reranking, try smaller chunk sizes. Test each change against the eval set.
Benchmarking a RAG system. Build a golden dataset: 50 questions, manually identified ideal context passages, reference answers. Run RAGAS. Identify the weakest metric and focus there. Set up automated eval in CI to catch regressions.