MongoDB Atlas Vector Search. $vectorSearch aggregation stage, index definition JSON, quantization (scalar, binary), hybrid with Atlas Search text via $rankFusion, dynamic schema benefits, sharding, Atlas Triggers for auto-embedding. USE WHEN: user mentions "Atlas Vector Search", "$vectorSearch", "MongoDB vector", "$rankFusion", "Atlas Triggers embedding", "MongoDB HNSW" DO NOT USE FOR: self-hosted vector stores - use `vector-stores/qdrant-advanced`, `vector-stores/milvus`; MongoDB without Atlas (self-hosted) - vector search is an Atlas-only feature
Vector Search lives inside MongoDB Atlas alongside your operational data, so there is no separate vector database to keep in sync: kill the ETL.
Skip it if you are not on Atlas: community MongoDB has no vector search. Use a dedicated vector store instead.
Atlas Vector Search indexes are JSON documents. Create via the UI, CLI, or Admin API.
{
  "fields": [
    {
      "type": "vector",
      "path": "embedding",
      "numDimensions": 1024,
      "similarity": "cosine",
      "quantization": "scalar"
    },
    { "type": "filter", "path": "tenant_id" },
    { "type": "filter", "path": "source" },
    { "type": "filter", "path": "created_at" }
  ]
}
Fields marked "type": "filter" are indexed for pre-filter lookups. Unspecified fields still work with $match after retrieval but without the performance benefit.
similarity: cosine, euclidean, or dotProduct. quantization: none, scalar (1 byte/component), binary (1 bit/component) — see below.
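Since OpenAI-style embeddings are unit-normalized, cosine and dotProduct produce identical rankings (cosine is just the dot product divided by the constant norms). A quick pure-Python check of that claim:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def normalize(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

q = normalize([0.3, 0.4, 0.5])
docs = [normalize([0.1, 0.9, 0.2]), normalize([0.3, 0.4, 0.6])]

# For unit vectors, both similarities yield the same ordering.
rank_cos = sorted(range(len(docs)), key=lambda i: -cosine(q, docs[i]))
rank_dot = sorted(range(len(docs)), key=lambda i: -dot(q, docs[i]))
```

For euclidean the ordering on unit vectors is also equivalent, but the scores differ; pick cosine or dotProduct and stay consistent.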
# pip install "pymongo>=4.7"
import os

from pymongo import MongoClient
from pymongo.operations import SearchIndexModel

client = MongoClient(os.environ["MONGODB_URI"])
coll = client["app"]["docs"]

coll.create_search_index(SearchIndexModel(
    definition={
        "fields": [
            {"type": "vector", "path": "embedding",
             "numDimensions": 1024, "similarity": "cosine",
             "quantization": "scalar"},
            {"type": "filter", "path": "tenant_id"},
        ],
    },
    name="docs_vector_idx",
    type="vectorSearch",
))
Index builds asynchronously. Check with coll.list_search_indexes().
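One way to wait for the build, with the status check factored out so it can run without a cluster (the queryable flag is what current drivers report per index; verify the field name against your pymongo version):

```python
import time

def index_queryable(index_docs, name):
    """True once the named search index reports itself queryable."""
    return any(d.get("name") == name and d.get("queryable") for d in index_docs)

# Polling sketch (requires a live Atlas collection `coll`):
# while not index_queryable(list(coll.list_search_indexes()), "docs_vector_idx"):
#     time.sleep(5)
```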
results = coll.aggregate([
    {"$vectorSearch": {
        "index": "docs_vector_idx",
        "path": "embedding",
        "queryVector": q_vec.tolist(),
        "numCandidates": 200,  # ANN candidate pool
        "limit": 10,           # final top-k
        "filter": {
            "tenant_id": "acme",
            "source": {"$in": ["kb", "faq"]},
        },
    }},
    {"$project": {"text": 1, "source": 1,
                  "score": {"$meta": "vectorSearchScore"}}},
])
Rule of thumb: numCandidates between 10x and 20x limit. Too low hurts recall; too high adds latency.
Filters inside $vectorSearch are pre-filters (applied before ANN). A post-$match after the stage filters after, with no performance benefit.
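To calibrate numCandidates empirically, run the same query approximately and exactly (Atlas also supports exact nearest-neighbor search via "exact": true in place of numCandidates) and compare top-k overlap. The search helpers in the comments are hypothetical; the metric is the point:

```python
def recall_at_k(ann_ids, exact_ids):
    """Fraction of the exact top-k that the ANN top-k recovered."""
    return len(set(ann_ids) & set(exact_ids)) / len(exact_ids)

# Sketch, assuming hypothetical search()/search_exact() wrappers:
# ann   = search(coll, q_vec, num_candidates=200, limit=10)
# exact = search_exact(coll, q_vec, limit=10)   # "exact": True, no numCandidates
# print(recall_at_k([d["_id"] for d in ann], [d["_id"] for d in exact]))
```

Raise numCandidates until recall plateaus, then stop: past that point you pay latency for nothing.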
| Type | Storage | Recall loss | When |
|---|---|---|---|
| none | fp32, 4 B/dim | 0% | Small collections, max quality |
| scalar | 1 B/dim (4x smaller) | < 1% | Default for production |
| binary | 1 bit/dim (32x smaller) | 2-10% raw, recoverable with rescoring | Billion-scale |
For binary with rescoring, Atlas stores the full float vector alongside the binary; the binary stage shortlists, the full vector rescores the shortlist.
{
  "type": "vector",
  "path": "embedding",
  "numDimensions": 1024,
  "similarity": "cosine",
  "quantization": "binary"
}
Set numCandidates generously when using binary: the binary stage's shortlist is all that rescoring ever sees, so an undersized candidate pool costs recall that no amount of rescoring can recover.
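The shortlist-then-rescore mechanic can be illustrated in plain Python: sign-bit quantization, a Hamming-distance shortlist, then full-precision rescoring of the shortlist. A sketch of the principle, not Atlas internals:

```python
def to_bits(v):
    """1-bit quantization: keep only the sign of each component."""
    return [1 if x >= 0 else 0 for x in v]

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def binary_rescore_search(query, vectors, shortlist_size, k):
    qb = to_bits(query)
    # Stage 1: cheap binary shortlist (what the 1-bit index provides).
    shortlist = sorted(range(len(vectors)),
                       key=lambda i: hamming(qb, to_bits(vectors[i])))[:shortlist_size]
    # Stage 2: rescore the shortlist with the stored full-precision vectors.
    return sorted(shortlist, key=lambda i: -dot(query, vectors[i]))[:k]
```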
Atlas supports Reciprocal Rank Fusion natively since MongoDB 8.1:
results = coll.aggregate([
    {"$rankFusion": {
        "input": {
            "pipelines": {
                "vector": [
                    {"$vectorSearch": {
                        "index": "docs_vector_idx",
                        "path": "embedding",
                        "queryVector": q_vec.tolist(),
                        "numCandidates": 200,
                        "limit": 50,
                    }},
                ],
                "text": [
                    {"$search": {
                        "index": "docs_text_idx",
                        "text": {"query": "oauth refresh token",
                                 "path": ["text", "title"]},
                    }},
                    {"$limit": 50},
                ],
            },
        },
        "combination": {"weights": {"vector": 0.6, "text": 0.4}},
        "scoreDetails": True,  # needed for the scoreDetails projection below
    }},
    {"$limit": 10},
    {"$project": {"text": 1, "source": 1,
                  "score": {"$meta": "scoreDetails"}}},
])
Requires both docs_vector_idx (Vector Search) and docs_text_idx (Atlas Search). $rankFusion is RRF under the hood — weighted by the provided weights.
Before 8.1, implement RRF in application code — see rag/hybrid-search.
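For pre-8.1 clusters, weighted RRF is only a few lines of application code. A minimal version over best-first id lists (60 is the conventional RRF damping constant):

```python
def rrf_merge(ranked_lists, weights, k=60):
    """ranked_lists: {name: [doc_id, ...] best-first}; weights: {name: float}."""
    scores = {}
    for name, ids in ranked_lists.items():
        w = weights.get(name, 1.0)
        for rank, doc_id in enumerate(ids):
            scores[doc_id] = scores.get(doc_id, 0.0) + w / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

fused = rrf_merge(
    {"vector": ["d1", "d2", "d3"], "text": ["d3", "d1", "d4"]},
    {"vector": 0.6, "text": 0.4},
)
```

Run the two pipelines separately (one $vectorSearch aggregate, one $search aggregate), then fuse their _id lists client-side.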
Every document can carry arbitrary metadata. A single Vector Search index works across heterogeneous documents:
from datetime import datetime, timezone

coll.insert_many([
    {"_id": "d1", "embedding": [...], "tenant_id": "acme", "type": "article",
     "tags": ["oauth", "auth"], "created_at": datetime.now(timezone.utc)},
    {"_id": "d2", "embedding": [...], "tenant_id": "acme", "type": "pdf",
     "page": 12, "source_file": "manual.pdf"},
])
Queries filter on whichever fields exist:
{"$vectorSearch": {
    "index": "docs_vector_idx",
    "path": "embedding",
    "queryVector": q_vec.tolist(),
    "numCandidates": 200, "limit": 10,
    "filter": {"tenant_id": "acme", "type": "pdf", "page": {"$lt": 50}},
}}
A Trigger watches inserts/updates and embeds text server-side — so application code only writes text.
// Atlas Trigger function (runs on MongoDB Atlas)
exports = async function(changeEvent) {
  const doc = changeEvent.fullDocument;
  if (!doc.text || doc.embedding) return;
  const resp = await context.http.post({
    url: "https://api.openai.com/v1/embeddings",
    headers: {
      "Authorization": [`Bearer ${context.values.get("openai_key")}`],
      "Content-Type": ["application/json"],
    },
    body: JSON.stringify({model: "text-embedding-3-small", input: doc.text}),
  });
  const embedding = JSON.parse(resp.body.text()).data[0].embedding;
  const coll = context.services.get("mongodb-atlas").db("app").collection("docs");
  await coll.updateOne({_id: doc._id}, {$set: {embedding}});
};
Atlas Triggers avoid dual-write drift: application code never writes embeddings, so text and vector cannot silently diverge. The embedding does land asynchronously, so expect a brief window after insert before it exists.
sh.shardCollection("app.docs", {tenant_id: "hashed"});
$vectorSearch is supported on sharded collections. Vector queries fan out across shards and merge — use a shard key that matches your common filter (tenant_id) to enable targeted queries.
Atlas dedicated tiers automate sharding and scaling; Serverless does not shard the same way — check current docs for limits.
from pymongo import UpdateOne

ops = [
    UpdateOne(
        {"_id": d["id"]},
        {"$set": {"embedding": d["vec"], "text": d["text"],
                  "tenant_id": d["tenant_id"]}},
        upsert=True,
    )
    for d in docs
]
coll.bulk_write(ops, ordered=False)
ordered=False lets MongoDB parallelize; target 1k-10k ops per batch.
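A minimal batching helper to keep each bulk_write call in that range (names are illustrative):

```python
def chunked(seq, size=5000):
    """Yield successive slices of at most `size` items."""
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

# Usage sketch against a live collection:
# for batch in chunked(ops):
#     coll.bulk_write(batch, ordered=False)
```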
$vectorSearch with scalar quantization, 1M vectors, M2 cluster: ~30-80 ms p95. With numCandidates=200 and limit=10, expect 50-150 ms.

| Anti-Pattern | Fix |
|---|---|
| Filtering after the $vectorSearch stage | Put filters inside filter of $vectorSearch to pre-filter |
| numCandidates = limit | Set 10-20x the limit for recall |
| Writing embeddings from app code when a Trigger could | Atlas Trigger avoids dual-write drift |
| similarity: euclidean for OpenAI embeddings | OpenAI vectors are unit-normalized; use cosine or dotProduct |
| Indexing every metadata field as filter | Only add filter paths you actually filter on |
| Ignoring scalar quantization | Scalar is near-free quality; always use it |
| Running Vector Search on community MongoDB | Atlas-only feature; self-hosted needs a dedicated store |
| Sharding without aligning shard key with tenant filter | Use tenant_id (hashed) as shard key for multi-tenant |
Checklist:
- type: filter entries for every filtered field
- scalar quantization enabled by default
- numCandidates calibrated (typically 10-20x limit)
- $rankFusion for hybrid search (8.1+)
- bulk_write(..., ordered=False) for ingest