Build retrieval systems as evidence pipelines, not just vector lookups.
Build Sequence
- Define the user questions the system must answer.
- Define the source corpus and refresh policy.
- Design ingestion and chunking strategy.
- Design retrieval strategy.
- Design answer synthesis and citation behavior.
- Define evaluation cases before scale-up.
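One way to make the last step concrete is to write evaluation cases as data and score a retriever against them before indexing the full corpus. The sketch below is illustrative only: `EvalCase`, `hit_rate_at_k`, and the `retrieve` callable (assumed to return document identifiers in ranked order) are hypothetical names, not part of any particular library.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class EvalCase:
    """A user question paired with the documents that should support the answer."""
    question: str
    expected_doc_ids: frozenset[str]


def hit_rate_at_k(
    cases: Sequence[EvalCase],
    retrieve: Callable[[str], Sequence[str]],  # hypothetical: returns ranked doc ids
    k: int = 5,
) -> float:
    """Fraction of cases where at least one expected document is in the top k."""
    if not cases:
        return 0.0
    hits = 0
    for case in cases:
        top_ids = set(retrieve(case.question)[:k])
        if case.expected_doc_ids & top_ids:
            hits += 1
    return hits / len(cases)
```

A small, fixed set of such cases gives a baseline to compare against whenever the chunking or retrieval strategy changes.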
Ingestion Rules
- separate source acquisition from indexing
- normalize formats before chunking
- keep source metadata with each chunk
- store document identifiers, timestamps, and source provenance
- label every corpus with a trust level before combining low-trust and high-trust sources in one index
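The ingestion rules above could be reflected in the record types themselves, so that no chunk can reach the index without its provenance. This is a minimal sketch under assumptions: the names `SourceDocument`, `ChunkRecord`, `Trust`, and `to_chunk_records` are hypothetical, and the document text is assumed to be already normalized to plain text by an earlier acquisition step.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum


class Trust(Enum):
    """Hypothetical two-level trust label; a real system may need more levels."""
    HIGH = "high"
    LOW = "low"


@dataclass(frozen=True)
class SourceDocument:
    doc_id: str
    source_uri: str
    fetched_at: datetime
    trust: Trust
    text: str  # assumed already normalized before chunking


@dataclass(frozen=True)
class ChunkRecord:
    chunk_id: str
    doc_id: str
    source_uri: str
    fetched_at: datetime
    trust: Trust
    text: str


def to_chunk_records(doc: SourceDocument, chunk_texts: list[str]) -> list[ChunkRecord]:
    """Copy the document's identifiers, timestamp, and trust label onto each chunk."""
    return [
        ChunkRecord(
            chunk_id=f"{doc.doc_id}#{i}",
            doc_id=doc.doc_id,
            source_uri=doc.source_uri,
            fetched_at=doc.fetched_at,
            trust=doc.trust,
            text=text,
        )
        for i, text in enumerate(chunk_texts)
    ]
```

Because the trust label travels with every chunk, retrieval can filter or down-weight low-trust material, and citations can point back to `source_uri` without a second lookup.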
Chunking Rules
- chunk around semantic units, not arbitrary token cuts
- preserve headings and local context
- avoid chunks too small to answer a question on their own or too large to retrieve precisely
- test chunking on realistic queries
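As an illustration of the chunking rules above, the sketch below splits text at headings and paragraph boundaries and repeats the current heading at the top of each chunk. It makes several assumptions that may not hold for a given corpus: headings are lines starting with `#`, paragraphs are separated by blank lines, and size is measured in characters rather than tokens. The function name `chunk_by_heading` and the `max_chars` default are hypothetical.

```python
def chunk_by_heading(text: str, max_chars: int = 800) -> list[str]:
    """Group paragraphs under their heading, starting a new chunk at each
    heading or when adding a paragraph would exceed max_chars."""
    chunks: list[str] = []
    heading = ""
    buffer: list[str] = []

    def flush() -> None:
        # Emit the buffered paragraphs as one chunk, prefixed with the heading.
        body = "\n\n".join(buffer).strip()
        if body:
            chunks.append(f"{heading}\n{body}" if heading else body)
        buffer.clear()

    for block in text.split("\n\n"):
        block = block.strip()
        if not block:
            continue
        first_line, _, rest = block.partition("\n")
        if first_line.startswith("#"):
            # A heading closes the previous chunk and sets the new context.
            flush()
            heading = first_line.lstrip("#").strip()
            block = rest.strip()
            if not block:
                continue
        if buffer and sum(len(b) for b in buffer) + len(block) > max_chars:
            flush()
        buffer.append(block)

    flush()
    return chunks
```

A single paragraph longer than `max_chars` is kept whole here rather than cut mid-sentence; whether that trade-off is acceptable is something to check against realistic queries.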