Vector embeddings and vector DB patterns — model choice, similarity metrics, index tuning. Use when choosing an embedding model, picking or migrating between vector databases, optimizing semantic search quality or latency, or building a hybrid (dense + sparse) retrieval pipeline.
Semantic search lives or dies on three choices: the embedding model, the vector store, and the index parameters. Default to cosine similarity, hybrid retrieval (dense + BM25), and HNSW indexes — then benchmark on your own eval set before tuning.
Pick based on domain match, dimensionality, context window, latency, and cost. MTEB scores are general — benchmark on YOUR data. Smaller models (MiniLM) suit real-time; larger ones (e5-mistral) suit batch. Open-source is free to run but needs GPU infrastructure.
Commercial:
| Model | Dims | Max tokens | Strengths |
|---|---|---|---|
| OpenAI text-embedding-3-small | 1536 | 8191 | Good default, low cost |
| OpenAI text-embedding-3-large | 3072 | 8191 | Higher quality, Matryoshka |
| Cohere embed-v3 | 1024 | 512 | Multilingual, search/classify |
| Voyage voyage-3 | 1024 | 32000 | Long context, strong on code |
Open-source:
| Model | Dims | Max tokens | Strengths |
|---|---|---|---|
| nomic-embed-text-v1.5 | 768 | 8192 | Strong MTEB, Matryoshka |
| bge-large-en-v1.5 | 1024 | 512 | English, well-tested |
| e5-mistral-7b-instruct | 4096 | 32768 | Best open quality, high compute |
| all-MiniLM-L6-v2 | 384 | 256 | Tiny, fast, prototyping |
Matryoshka embeddings (OpenAI text-embedding-3, nomic-embed) allow truncating dimensions without retraining: 3072 → 1024 or 512 trades modest quality loss for faster search. Test at each dimension on your eval set before committing. Rule of thumb: 256–512 dims is enough for most retrieval; 1024+ for fine-grained similarity.
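The truncation step above is mechanical: slice the leading dimensions and re-normalize. A minimal sketch (the `truncate_matryoshka` helper name and the random input are illustrative; this is only valid for Matryoshka-trained models):

```python
import numpy as np

def truncate_matryoshka(emb: np.ndarray, dims: int) -> np.ndarray:
    """Keep the first `dims` components and re-normalize to unit length.
    Only meaningful for Matryoshka-trained embeddings."""
    truncated = emb[..., :dims]
    norms = np.linalg.norm(truncated, axis=-1, keepdims=True)
    return truncated / norms

# Illustrative input: 4 fake 3072-dim embeddings.
full = np.random.default_rng(0).normal(size=(4, 3072)).astype(np.float32)
small = truncate_matryoshka(full, 1024)
print(small.shape)  # (4, 1024)
```

Re-normalizing after truncation matters: cosine similarity on the truncated vectors assumes unit length.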
Similarity metrics:
| Metric | Range | Best for |
|---|---|---|
| Cosine similarity | [-1, 1] | Normalized embeddings (most common) |
| Dot product | (-inf, inf) | When magnitude matters |
| Euclidean (L2) | [0, inf) | Spatial clustering |
Default is cosine — most embedding models are trained with it. For unit-length embeddings, cosine equals dot product.
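The cosine/dot-product equivalence for unit-length vectors can be verified directly (random vectors here stand in for real embeddings):

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(42)
a, b = rng.normal(size=384), rng.normal(size=384)

# After normalizing to unit length, cosine and dot product coincide,
# so a vector DB configured for dot product works for cosine search.
a_u, b_u = a / np.linalg.norm(a), b / np.linalg.norm(b)
assert abs(cosine(a_u, b_u) - float(np.dot(a_u, b_u))) < 1e-9
```

This is why many stores normalize at ingest time and use the cheaper dot product internally.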
Vector databases:
| Database | Architecture | Best for | Scaling |
|---|---|---|---|
| FAISS | In-memory library | Prototyping, < 10M | Single machine |
| pgvector | Postgres extension | Postgres shops, joins + filtering | Vertical |
| Chroma | Embedded DB | Local dev, quick experiments | Single, < 1M |
| Qdrant | Rust client-server | Production, advanced filtering | Horizontal |
| Weaviate | Go, built-in vectorizers | Multimodal, auto-vectorization | Horizontal |
| Pinecone | Managed SaaS | Zero-ops, serverless | Fully managed |
| Milvus | Distributed cloud-native | Billions of vectors | Horizontal |
Decision flow: prototyping → Chroma or FAISS. Already on Postgres → pgvector. Production with advanced filtering → Qdrant or Weaviate. Zero ops → Pinecone. Billions of vectors → Milvus.
HNSW is the default for most vector DBs. Key params: M
(connections per node, default 16), ef_construction (build quality,
default 200), ef_search (query quality, default 100). Higher M and
ef = better recall, more memory, slower build. Start with defaults;
tune ef_search up for recall, down for latency.
IVF clusters vectors and searches only relevant clusters. Params:
nlist ≈ sqrt(N), nprobe ≈ nlist/10 to nlist/5. Faster than HNSW
above ~100M vectors.
PQ (Product Quantization) compresses vectors with slight quality loss. Use when the dataset will not fit in memory. Often combined as IVF-PQ.
Combine dense (semantic) and sparse (keyword/BM25) retrieval: dense
catches "automobile" matching "car," sparse catches exact matches
(product IDs, acronyms, proper nouns). Fuse with Reciprocal Rank
Fusion: RRF = sum(1 / (k + rank_i)) with k typically 60. Hybrid
almost always outperforms either method alone. Qdrant and Weaviate
have built-in hybrid; for others, run BM25
(Elasticsearch/OpenSearch) and vector search separately and fuse.
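The RRF formula above is a few lines of code. A sketch of the fusion step, taking pre-computed ranked ID lists from the dense and sparse retrievers (the document IDs are illustrative):

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(doc) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

dense  = ["d3", "d1", "d2"]          # semantic ranking
sparse = ["d9", "d3", "d4"]          # BM25 ranking
print(rrf_fuse([dense, sparse]))     # d3 wins: it appears in both lists
```

Because RRF only uses ranks, it needs no score normalization between BM25 and cosine scores, which is why it is the default fusion method.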
Agent-specific failure modes (provider-neutral pause-and-self-check items):
- Instruction prefixes: some models (e.g. e5-instruct) require instruction prefixes like "Represent this document:" for passages and "Represent this query:" for queries. Using the wrong prefix produces poorer retrieval.
- Metadata: store metadata alongside vectors: source, date, category, tenant_id. Pre-filtering (filter before vector search) is more efficient than post-filtering. For multi-tenancy, use a metadata filter on tenant_id or separate collections per tenant. Index metadata fields used in filters for performance.
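The prefix convention can be wrapped in a small helper so queries and passages can never be embedded with the wrong prefix. The exact strings are model-specific and must come from the model card; the two below mirror the examples in the text:

```python
# Assumption: prefix strings copied from the target model's card.
PREFIXES = {
    "passage": "Represent this document: ",
    "query": "Represent this query: ",
}

def add_prefix(texts: list[str], kind: str) -> list[str]:
    """Prepend the instruction prefix a prefix-trained model expects.
    `kind` must be "passage" (for indexing) or "query" (for search)."""
    return [PREFIXES[kind] + t for t in texts]

docs = add_prefix(["HNSW tuning guide"], "passage")
qs = add_prefix(["how to tune HNSW"], "query")
```

Routing all embedding calls through one helper like this makes the prefix mismatch failure mode impossible to hit silently.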
Starting stack: text-embedding-3-small for cost-effective embedding; pgvector if already on Postgres, otherwise Qdrant. Hybrid search (BM25 + dense). Chunk by section headers at 512 tokens. Build a 50-query eval set with expected results to tune retrieval.
Tuning: adjust ef_search (recall vs speed). Consider Matryoshka 3072 → 1024. Enable metadata pre-filtering to shrink the search space. Benchmark each change against the eval set.
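Benchmarking against the eval set needs a scoring function. A minimal recall@k sketch (the query/document IDs are illustrative; `expected` holds the hand-labeled relevant docs per query, `results` the retriever's ranked output):

```python
def recall_at_k(results: dict[str, list[str]],
                expected: dict[str, set[str]], k: int = 10) -> float:
    """Fraction of expected docs found in the top-k, averaged over queries."""
    total = 0.0
    for query, relevant in expected.items():
        hits = len(set(results.get(query, [])[:k]) & relevant)
        total += hits / len(relevant)
    return total / len(expected)

expected = {"q1": {"d1"}, "q2": {"d2", "d3"}}
results = {"q1": ["d1", "d9"], "q2": ["d3", "d8"]}
print(recall_at_k(results, expected, k=2))  # (1.0 + 0.5) / 2 = 0.75
```

Rerunning this one number after every change (ef_search, dimension truncation, filters) is what turns tuning from guesswork into measurement.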