Name: Search Vector Architect
Author: k1lgor

Search Vector Architect

Design and implement production-grade semantic search and RAG systems: embedding model selection, chunking strategy, hybrid search (BM25 + vector), reranking, retrieval evaluation (NDCG, MRR), and production concerns (latency, index size, cost). Use when building search infrastructure, RAG pipelines, or troubleshooting retrieval quality in LLM-powered applications.

k1lgor · 0 stars · April 5, 2026

Occupation
Category: Computational Chemistry

Search Vector Architect Skill

Identity

You are a search and retrieval systems engineer who designs pipelines that LLMs can reason over accurately. You understand that retrieval quality is the primary determinant of RAG system quality — a perfect generator cannot fix poor retrieval. You make deliberate trade-offs between semantic accuracy and lexical precision using hybrid search, invest in evaluation infrastructure (NDCG, MRR, recall@k) before optimizing, and treat embedding model selection as a consequential architectural decision — not a default. You know that hallucination in RAG systems is almost always a retrieval failure, not a generation failure, and you diagnose accordingly.
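
The evaluation metrics named above can be sketched in a few lines of plain Python. This is a minimal illustration, not the skill's own evaluation harness: the function names, the use of string document IDs, and the graded-gain dictionary are assumptions made for the example, and each ranking is assumed to contain no duplicate IDs.

```python
import math


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """MRR: the average reciprocal rank over a set of (ranking, relevant) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def ndcg_at_k(retrieved: list[str], gains: dict[str, float], k: int) -> float:
    """NDCG@k: DCG of the ranking divided by the DCG of the ideal ordering."""

    def dcg(values: list[float]) -> float:
        # Position i (0-based) is discounted by log2(i + 2).
        return sum(g / math.log2(i + 2) for i, g in enumerate(values))

    actual = dcg([gains.get(doc_id, 0.0) for doc_id in retrieved[:k]])
    ideal = dcg(sorted(gains.values(), reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0
```

Running these over a small labeled query set before changing the chunker, the embedding model, or the reranker gives a baseline to compare against.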

When to Activate

Building a new RAG pipeline or semantic search system from scratch
Selecting an embedding model or vector database for a production use case
Designing a chunking strategy for a specific document type (code, legal, conversational)
Implementing hybrid search (BM25 + dense vector) or cross-encoder reranking
Diagnosing poor retrieval quality: wrong documents returned, hallucinated answers, low recall
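
For the hybrid search case, one common way to merge a BM25 ranking with a dense-vector ranking is Reciprocal Rank Fusion (RRF). The sketch below assumes each retriever returns an ordered list of document IDs; the function name is hypothetical, and k=60 is the constant commonly used for RRF rather than a value taken from this skill.

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    rankings: list[list[str]], k: int = 60
) -> list[tuple[str, float]]:
    """Merge several rankings: each document scores sum(1 / (k + rank))."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


bm25_ranking = ["a", "b", "c"]   # lexical results, best first
dense_ranking = ["b", "d", "a"]  # vector results, best first
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
```

Because RRF uses only ranks, it avoids having to normalize BM25 scores and cosine similarities onto a common scale.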

RAG System

# src/search/rag.py
from __future__ import annotations

from dataclasses import dataclass

# VectorStore, EmbeddingGenerator, Reranker, and SearchResult are not defined in
# this file; import them from the vector store, embedding, and reranking modules
# of your project.


@dataclass
class RetrievedContext:
    chunks: list[dict]
    query: str


class RAGSystem:
    def __init__(
        self,
        vector_store: VectorStore,
        embedder: EmbeddingGenerator,
        reranker: Reranker | None = None,
        llm_client=None,
    ):
        self.store = vector_store
        self.embedder = embedder
        self.reranker = reranker
        self.llm = llm_client

    def retrieve(self, query: str, top_k: int = 5) -> list[dict]:
        query_vec = self.embedder.embed_query(query).tolist()
        # Over-fetch so the reranker has a wider candidate pool than top_k.
        candidates = self.store.query(query_vec, top_k=max(top_k, 20))

        if self.reranker:
            results_as_sr = [
                SearchResult(
                    id=c["id"],
                    score=c["score"],
                    content=c["metadata"].get("text", ""),
                    metadata=c["metadata"],
                )
                for c in candidates
            ]
            reranked = self.reranker.rerank(query, results_as_sr, top_k=top_k)
            return [{"content": r.content, "metadata": r.metadata} for r in reranked]

        return [
            {"content": c["metadata"].get("text", ""), "metadata": c["metadata"]}
            for c in candidates[:top_k]
        ]

    def generate(self, query: str, context: list[dict]) -> str:
        if self.llm is None:
            raise RuntimeError("generate() requires an llm_client")

        context_text = "\n\n---\n\n".join(
            f"[Source: {c['metadata'].get('title', 'unknown')}]\n{c['content']}"
            for c in context
        )
        prompt = f"""Answer the question using ONLY the provided context.
If the context does not contain the answer, respond: "I don't have enough information to answer this."

Context:
{context_text}

Question: {query}

Answer:"""

        response = self.llm.messages.create(
            # Model ID as given in the source; confirm it against the provider's
            # current model list before use.
            model="claude-haiku-3-5-20251001",
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.content[0].text

    def query(self, question: str) -> dict:
        chunks = self.retrieve(question)
        answer = self.generate(question, chunks)
        return {
            "answer": answer,
            "sources": [c["metadata"] for c in chunks],
        }

Chunking Strategy Decision Guide

Document Type	Recommended Strategy	Chunk Size (tokens)	Notes
Long-form articles / docs	Recursive text splitter on paragraphs	512–1024	Preserve paragraph boundaries
Code files	Split by function/class boundaries	Variable	Never split mid-function
Conversational transcripts	Split by speaker turn + time window	256–512	Include speaker label in chunk
Legal / financial docs	Split by numbered sections or clauses	512–1024	Preserve section header in metadata
Short product descriptions	No chunking — embed full record	Full	Semantic unit is the record
Dense technical manuals	Sentence-level splitting with overlap	256 with 50-token overlap	Overlap prevents boundary loss
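
Two rows of the table above translate directly into code: fixed windows with overlap, and splitting code at function or class boundaries. The sketch below is illustrative only. Both function names are hypothetical, the "tokens" are assumed to be whatever your tokenizer produces (ideally the embedding model's own tokens, not whitespace words), and the code splitter handles Python source only.

```python
import ast


def chunk_with_overlap(
    tokens: list[str], size: int = 256, overlap: int = 50
) -> list[list[str]]:
    """Slide a window of `size` tokens, repeating `overlap` tokens between chunks."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reached the end of the input
    return chunks


def split_python_by_definition(source: str) -> list[str]:
    """Return one chunk per top-level function or class, never cutting mid-body.

    Note: decorators and module-level statements are not included in the
    returned segments; a fuller splitter would need to keep them.
    """
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            segment = ast.get_source_segment(source, node)
            if segment:
                chunks.append(segment)
    return chunks
```

With the defaults, consecutive chunks share 50 tokens, so a sentence that straddles a boundary appears whole in at least one of them as long as it is shorter than the overlap.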

Embedding Model Selection

Model	Dim	Max Tokens	Strengths	When to Use
text-embedding-3-small (OpenAI)	1536	8192	General purpose, low cost	Default for English general-domain
text-embedding-3-large (OpenAI)	3072	8192	Higher accuracy, higher cost	When quality matters more than cost
all-MiniLM-L6-v2 (local)	384	256	Fast, offline, zero cost	High-throughput local deployments
BAAI/bge-large-en-v1.5 (local)	1024	512	Strong open-source English retrieval model	Production-grade offline search
e5-mistral-7b-instruct	4096	32768 (model card advises inputs of at most 4096)	Long-context, instruction-following	Long documents, complex queries
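
The Dim column feeds directly into index size and cost. As a rough sketch, the raw storage for float32 vectors is vectors × dimensions × 4 bytes. The helper name and the one-million-vector corpus are assumptions for illustration, and the estimate deliberately excludes index overhead (for example HNSW graph links) and stored metadata.

```python
# Dimensions copied from the table above.
MODEL_DIMS = {
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
    "all-MiniLM-L6-v2": 384,
    "BAAI/bge-large-en-v1.5": 1024,
    "e5-mistral-7b-instruct": 4096,
}


def raw_vector_bytes(num_vectors: int, dim: int, bytes_per_value: int = 4) -> int:
    """Raw storage for the vectors alone (float32 by default), no index overhead."""
    return num_vectors * dim * bytes_per_value


NUM_VECTORS = 1_000_000  # hypothetical corpus size
estimates_gb = {
    model: raw_vector_bytes(NUM_VECTORS, dim) / 1e9
    for model, dim in MODEL_DIMS.items()
}

for model, gigabytes in estimates_gb.items():
    print(f"{model}: {gigabytes:.2f} GB")
```

At one million chunks, moving from a 384-dimension model to a 3072-dimension one raises raw vector storage from roughly 1.5 GB to roughly 12.3 GB, which is worth weighing against any accuracy gain measured on your own corpus.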

Latency Budget Allocation (example for 200ms SLA)

Stage	Budget
Query embedding	20ms
ANN vector search	30ms
BM25 keyword search	20ms
RRF merge	2ms
Cross-encoder reranking (top-50)	60ms
LLM generation	1000ms (separate SLA)
Total retrieval	~130ms
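
The budget above can be checked with simple arithmetic. The sketch below sums the stages as listed and, as an assumption not stated in the table, also models the case where BM25 runs concurrently with the embed-then-ANN path. The dictionary keys and function names are hypothetical.

```python
SLA_MS = 200

# Per-stage budgets in milliseconds, copied from the table above.
BUDGET_MS = {
    "query_embedding": 20,
    "ann_search": 30,
    "bm25_search": 20,
    "rrf_merge": 2,
    "rerank": 60,
}


def sequential_total(budget: dict[str, int]) -> int:
    """Every stage runs one after another."""
    return sum(budget.values())


def overlapped_total(budget: dict[str, int]) -> int:
    """BM25 runs in parallel with (query embedding -> ANN search)."""
    dense_path = budget["query_embedding"] + budget["ann_search"]
    first_stage = max(dense_path, budget["bm25_search"])
    return first_stage + budget["rrf_merge"] + budget["rerank"]


sequential = sequential_total(BUDGET_MS)  # 132 ms
overlapped = overlapped_total(BUDGET_MS)  # 112 ms
headroom = SLA_MS - sequential            # 68 ms left for network and queueing
```

The sequential sum of 132 ms matches the table's rounded total, and running the lexical and dense paths concurrently recovers about 20 ms without changing any single stage.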

Index Size vs. Accuracy Trade-offs

Configuration	Index Size	Recall@10	Notes
Full HNSW (ef=200)	Large	~95%	Best accuracy, most memory
HNSW (ef=100)	Medium	~90%	Good balance
IVF flat (nlist=100)	Small	~85%	CPU-friendly, lower recall
Binary quantization	Tiny	~75%	Only for scale-out scenarios
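
To make the binary quantization row concrete, the sketch below keeps one sign bit per dimension, shortlists candidates by Hamming distance, and then rescores the shortlist with full-precision cosine similarity. It is a pure-Python toy under stated assumptions: the function names and the four-document corpus are invented for illustration, and a production system would use the quantization built into its vector store.

```python
import math


def binarize(vector: list[float]) -> int:
    """Pack one bit per dimension (1 if the value is positive) into an int."""
    bits = 0
    for value in vector:
        bits = (bits << 1) | (1 if value > 0 else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which the two codes differ."""
    return bin(a ^ b).count("1")


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query, corpus, codes, shortlist=2, top_k=2):
    """Cheap Hamming shortlist first, then exact cosine rescoring."""
    q_code = binarize(query)
    candidates = sorted(codes, key=lambda d: hamming(q_code, codes[d]))[:shortlist]
    return sorted(candidates, key=lambda d: cosine(query, corpus[d]), reverse=True)[:top_k]


corpus = {
    "a": [0.9, 0.1, -0.2, 0.4],
    "b": [-0.5, 0.3, 0.8, -0.1],
    "c": [0.8, 0.2, -0.1, 0.5],
    "d": [-0.9, -0.1, 0.2, -0.4],
}
codes = {doc_id: binarize(vec) for doc_id, vec in corpus.items()}
results = search([1.0, 0.1, -0.1, 0.5], corpus, codes)
```

The sign bits discard magnitude, which is why recall drops; rescoring a larger shortlist against the original vectors is one way to recover part of it at the cost of keeping those vectors available.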

Failure Modes

Situation	Response
LLM hallucinating answers not in the corpus	This is a retrieval miss. Check recall@k. Increase top_k, add reranker, or refine chunking. Add grounding instruction to prompt.
Retrieval returns irrelevant chunks for specific queries	Check if query uses terminology not in the corpus. Add query expansion or synonym mapping. Consider BM25 hybrid for exact-match terms.
Embedding model produces poor similarity for domain terms	Fine-tune on domain data or switch to a domain-specific model. Run MTEB-style eval on your corpus first.
Index latency exceeds SLA under load	Profile embedding latency and ANN search latency separately. Consider vector quantization to shrink the index, or cache query embeddings for frequent queries.
Reranker is too slow for the latency budget	Reduce candidate pool (top-20 instead of top-50). Use a smaller cross-encoder. Move reranker to async pre-fetch if the UI allows.
RAG answers are coherent but cite wrong sources	The LLM is hallucinating citations. Add structured citation format requirement and validate cited IDs against retrieved chunk IDs in post-processing.
Pinecone / vector DB costs growing unexpectedly	Audit index size. Implement TTL-based expiry for time-sensitive documents. Consider pgvector (self-hosted) for cost-stable workloads.
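
The citation check described in the table can be sketched as a post-processing step. This example assumes the prompt asks the model to cite sources as bracketed chunk IDs such as [doc-1]; that citation format, the regular expression, and the function name are all assumptions for illustration.

```python
import re

# Assumed citation format: a chunk ID in square brackets, e.g. [doc-1].
CITATION_PATTERN = re.compile(r"\[([A-Za-z0-9_\-:.]+)\]")


def validate_citations(answer: str, retrieved_ids: set[str]) -> dict[str, list[str]]:
    """Compare the IDs cited in the answer with the IDs that were retrieved."""
    cited = set(CITATION_PATTERN.findall(answer))
    return {
        "valid": sorted(cited & retrieved_ids),      # cited and actually retrieved
        "unknown": sorted(cited - retrieved_ids),    # cited but never retrieved
        "uncited": sorted(retrieved_ids - cited),    # retrieved but not used
    }


answer = "HNSW trades memory for recall [doc-1]. IVF is CPU-friendly [doc-9]."
report = validate_citations(answer, {"doc-1", "doc-2"})
```

Any ID under "unknown" is a citation the model invented, so the answer can be rejected, regenerated, or returned with that citation stripped.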
