Explore one or more MongoDB collections to learn their schema, field types, value distributions, and relationships — then persist the results for future query generation.
Target collection(s): $ARGUMENTS
Database: Read from memory/guide.json → database field. If guide.json doesn't exist yet, ask the user which database to use.
If $ARGUMENTS contains more than one collection name (space- or comma-separated), explore all of them in parallel:
1. Read memory/guide.json and call list_collections once upfront. Store the database name and the full collection list.
2. Spawn one subagent per collection (Task tool, general-purpose) with a prompt that contains the full single-collection exploration instructions (Steps 0–10 below), plus the database name and the full collection list (so the subagent skips Step 1 validation and reuses the list for foreign key checks). Run all Task calls in a single message so they execute in parallel.

If only one collection is specified, follow Steps 0–10 directly (no subagent needed).
Track aggregation response times during exploration and adapt query behavior accordingly. This protects production clusters from heavy analytical queries.
| Mode | $sample size | Parallel batch size | Restrictions |
|---|---|---|---|
| normal (default) | 10,000 | 5 | None |
| light | 1,000 | 3 | Skip array element enum detection; skip Step 7 (change frequency) |
| minimal | 500 | 1 (sequential) | Only analyze top-level fields (skip nested dot-notation beyond depth 2) |
Announce the active mode in output as "Performance mode: normal/light/minimal". Observe how long each run_aggregation call takes to return. You don't need precise millisecond timing: if a response is noticeably slow (a long delay before the result appears), treat it as exceeding the threshold and drop to the next lighter mode.
Follow Steps 0–10 in order. Use the MCP tools list_collections, list_indexes, run_aggregation, and run_aggregation_to_file with the database from memory/guide.json. Do NOT skip steps or combine aggregations in ways that could lose detail.
Bounded aggregations: ALL Step 6 and Step 7 aggregations MUST prepend { "$sample": { "size": 10000 } } as the first pipeline stage (or the current performance mode's sample size). This avoids full collection scans. The only exception is Step 4, which uses $sort + $limit to get recent docs.
Before doing any MCP calls, check if memory/schema/$ARGUMENTS.json already exists using Glob.
If it exists, tell the user this collection has already been explored and ask whether to re-explore it (overwriting the file) before continuing.
Call list_collections with the database from guide.json. If $ARGUMENTS is not in the list, show the available collections and ask the user to pick one. Do not proceed until a valid collection is confirmed.
Important: Store the full collection list from this call. You will reuse it in Step 6 for foreign key verification. Do NOT call list_collections again.
Call list_indexes with the database and collection name. Store the raw index definitions for use in Step 9 (Write Schema File).
Skip the _id index (it's always present, no signal). Store each index as:
- name — index name from MongoDB
- key — the key specification (field → direction/type)
- unique — boolean (true if unique index, false otherwise)
- When present, also record: sparse, ttl (expireAfterSeconds), partial (partialFilterExpression)

Run:
[{ "$count": "total" }]
If the count is 0, write a minimal schema file (metadata only, empty fields) to memory/schema/$ARGUMENTS.json and stop.
Store the count as documentCount for later.
Sample 100 recent documents for field discovery. Use run_aggregation_to_file to write results directly to a local file — zero context cost regardless of document size.
Call run_aggregation_to_file with:
- pipeline: [{ "$sort": { "_id": -1 } }, { "$limit": 100 }]
- output file: /tmp/explore_{collection}.json

The tool writes a JSON array to the file and returns only metadata (doc count, file path, file size). No raw documents enter the conversation context.
If the tool returns an error, stop and inform the user.
Use jq commands on the temp file to discover fields without loading documents into context.
First, verify the temp file exists and is non-empty:
test -s /tmp/explore_{collection}.json && echo "ok" || echo "EMPTY"
If empty/missing, stop and tell the user.
Then run these jq commands in parallel:
# All leaf field paths in dot-notation
jq '[.[] | [paths(scalars)] | .[] | join(".")] | unique | sort' /tmp/explore_{collection}.json
# Field types (top-level keys)
jq '[.[] | to_entries[] | {k: .key, t: (.value | type)}] | unique_by(.k)' /tmp/explore_{collection}.json
# Presence rates
jq 'length as $n | [.[] | keys[]] | group_by(.) | map({field: .[0], presence: ((length / $n) * 100 | round / 100)}) | sort_by(.field)' /tmp/explore_{collection}.json
Then peek at one document for orientation:
jq '.[0]' /tmp/explore_{collection}.json
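A field can also appear with more than one type across documents (the unique_by(.k) query above keeps only the first type it sees). A jq variant that surfaces such fields, sketched here against a tiny made-up sample standing in for the Step 4 temp file:

```shell
# In practice, point this at /tmp/explore_{collection}.json from Step 4;
# here a made-up two-document sample stands in for it.
printf '[{"price":10,"tag":"a"},{"price":"10","tag":"b"}]' > /tmp/explore_demo.json

# Fields observed with more than one top-level type in the sample
jq -c '[.[] | to_entries[] | {k: .key, t: (.value | type)}] | unique
  | group_by(.k)
  | map(select(length > 1) | {field: .[0].k, types: map(.t)})' \
  /tmp/explore_demo.json
# → [{"field":"price","types":["number","string"]}]
```

Fields this query reports are candidates for the mixedTypes flag written in Step 9.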
From these results, build the field inventory:
- Mark fields with presence below 1.0 as sparse: true
- Flag fields whose only observed type is "null" (likely deprecated)
- Use dot-notation for nested fields so the final field map is flat: address.city, address.zip, etc. Cap nesting depth at 5 levels.
For each discovered field, run targeted aggregations to understand its values. All aggregations operate on a bounded sample — prepend { "$sample": { "size": <current_mode_sample_size> } } as the first pipeline stage.
For number, date, boolean, and array fields, run separate parallel run_aggregation calls — up to 5 per batch (or the current performance mode's batch size). Wait for each batch to complete before starting the next.
Number fields — one call per field:
[
{ "$sample": { "size": 10000 } },
{ "$group": { "_id": null, "min": { "$min": "$field" }, "max": { "$max": "$field" }, "avg": { "$avg": "$field" } } }
]
Date fields — one call per field:
[
{ "$sample": { "size": 10000 } },
{ "$group": { "_id": null, "min": { "$min": "$field" }, "max": { "$max": "$field" } } }
]
Boolean fields — one call per field:
[
{ "$sample": { "size": 10000 } },
{ "$group": { "_id": "$field", "count": { "$sum": 1 } } }
]
Array fields — one call per field:
[
  { "$sample": { "size": 10000 } },
  { "$match": { "field": { "$type": "array" } } },
  { "$group": { "_id": null, "minLen": { "$min": { "$size": "$field" } }, "maxLen": { "$max": { "$size": "$field" } }, "avgLen": { "$avg": { "$size": "$field" } } } }
]
(The $match guard keeps $size from erroring on documents where the field is missing or not an array.)
Group related fields into batches of up to 5 calls fired in parallel. After each batch completes, re-check response times and announce the result: Performance mode: <current_mode>.
Run up to 5 string enum aggregations in parallel per round. Wait for each round before starting the next:
[
{ "$sample": { "size": 10000 } },
{ "$group": { "_id": "$strField", "count": { "$sum": 1 } } },
{ "$sort": { "count": -1 } }
]
From each result:
- If the field has fewer than 30 distinct values, mark isEnum: true and store all enumValues with counts
- Otherwise mark isEnum: false and store 5 example values

Fire the first batch of Step 6a and the first batch of Step 6b in the same message for maximum parallelism.
For ObjectId fields ending in Id or Ref, or fields ending in _id (excluding _id itself):
- Treat the field as a foreignKey candidate
- Infer the target collection from the field name (organizationId → organizations, user_id → users)
- Verify the inferred target against the collection list cached in Step 1; do NOT call list_collections again.
- Store the result as: { "foreignKey": { "targetCollection": "organizations", "confirmed": true/false } }
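The suffix-stripping inference can be sketched with jq. The field names below are hypothetical, and real pluralization is messier (company → companies), so treat this as the naive heuristic it is:

```shell
# Naive field-name → collection-name guess: strip Id/_id/Ref suffix, append "s"
jq -rn '["organizationId", "user_id", "managerRef"][]
  | sub("(_id|Id|Ref)$"; "")
  | . + "s"'
# → organizations
#   users
#   managers
```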
Set confirmed: true if the target collection exists in the cached list, confirmed: false otherwise. Always store confirmed (boolean), never confidence.

Number fields:
{ "type": "number", "presence": 1.0, "description": "...", "range": { "min": 0, "max": 100, "avg": 42.5 } }
Always use range object, never flat min/max/avg.
Date fields:
{ "type": "date", "presence": 1.0, "description": "...", "range": { "min": "2024-01-01T...", "max": "2026-02-16T..." } }
String fields (enum):
{ "type": "string", "presence": 1.0, "description": "...", "isEnum": true, "enumValues": { "val1": 100, "val2": 50 } }
String fields (non-enum):
{ "type": "string", "presence": 1.0, "description": "...", "isEnum": false, "examples": ["val1", "val2", "val3", "val4", "val5"] }
Boolean fields:
{ "type": "boolean", "presence": 1.0, "description": "...", "distribution": { "true": 800, "false": 200 } }
Array fields:
{ "type": "array", "presence": 1.0, "description": "...", "arrayDetails": { "elementType": "string", "minLength": 0, "maxLength": 10, "avgLength": 3.2 } }
Determine element type from the sample data. If array elements are strings with <30 distinct values, unwind and count them as enum values too (skip this in light or minimal performance mode).
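For the array-element enum count, a pipeline of this shape works. Here tags is a hypothetical string-array field; the $match guard keeps non-array values out of the $unwind:

```json
[
  { "$sample": { "size": 10000 } },
  { "$match": { "tags": { "$type": "array" } } },
  { "$unwind": "$tags" },
  { "$group": { "_id": "$tags", "count": { "$sum": 1 } } },
  { "$sort": { "count": -1 } }
]
```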
Skip this step if in light or minimal performance mode.
Check for timestamp fields in this order:
1. created_at / updated_at (snake_case)
2. createdAt / updatedAt (camelCase)

Use whichever pair exists in the field inventory from Step 5. If neither pair exists, set changeFrequency.hasTimestamps: false and skip.
If a matching pair is found, run (substituting the actual field names):
[
{ "$sample": { "size": 10000 } },
{ "$match": { "<updatedField>": { "$exists": true }, "<createdField>": { "$exists": true } } },
{ "$project": { "diff": { "$subtract": ["$<updatedField>", "$<createdField>"] } } },
{ "$group": {
"_id": null,
"avgDiff": { "$avg": "$diff" },
"minDiff": { "$min": "$diff" },
"maxDiff": { "$max": "$diff" },
"docsWithUpdates": { "$sum": { "$cond": [{ "$gt": ["$diff", 0] }, 1, 0] } },
"total": { "$sum": 1 }
}
}
]
Format avgDiff as human-readable (e.g. "4d 6h 32m") and also store the raw millisecond value.
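The millisecond-to-readable conversion can be done with a small jq expression; the input value here is a made-up example:

```shell
# Convert a raw millisecond diff to "Nd Nh Nm" (368520000 ms is a made-up example)
jq -rn '368520000 as $ms
  | ($ms / 86400000 | floor) as $d
  | (($ms % 86400000) / 3600000 | floor) as $h
  | (($ms % 3600000) / 60000 | floor) as $m
  | "\($d)d \($h)h \($m)m"'
# → 4d 6h 22m
```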
Present a summary table of all discovered fields showing: field name, type, presence %, and whether it's an enum/foreign key.
Then ask the user:
- Are there common aliases or business names for any fields? (e.g. createdAt → "creation date", organizationId → "org")
- Any notes to record: deprecated fields, filters that should always be applied, other gotchas?

Wait for the user's response before proceeding to Step 9. If the user says "skip" or "none", proceed with empty aliases/notes.
Assemble the final schema object and write it to memory/schema/$ARGUMENTS.json.
Use this exact structure:
{
"metadata": {
"collection": "<collection_name>",
"database": "<from guide.json>",
"documentCount": "<number>",
"exploredAt": "<ISO 8601 timestamp>",
"sampleSize": 100
},
"indexes": [
{
"name": "<index_name>",
"key": { "<field>": 1 },
"unique": false
}
],
"fields": {
"<fieldName>": {
"type": "<primary_type>",
"presence": "<0.0–1.0>",
"description": "<auto-generated or null>",
"...type-specific keys from Step 6"
}
},
"aliases": {
"<fieldName>": ["alias1", "alias2"]
},
"foreignKeys": [
{ "field": "<fieldName>", "targetCollection": "<collection>", "confirmed": true }
],
"changeFrequency": {
"hasTimestamps": true,
"...metrics from Step 7"
},
"notes": ["<user-provided notes>"]
}
Instead of setting "description": null on every field, auto-generate descriptions for obvious fields based on name + type + values:
- _id → "MongoDB document ID"
- __v → "Mongoose version key"
- Creation timestamps (created_at/createdAt) → "Document creation timestamp"
- Update timestamps (updated_at/updatedAt) → "Last update timestamp"
- is_* fields → infer from name (e.g., is_deleted → "Soft delete flag", is_active → "Active status flag")
- Fields with a foreignKey → "Reference to {targetCollection}"
- Enum fields → "One of: {top 5 values}"
- Use null only for truly ambiguous fields where the name + type don't reveal purpose

Additional rules for the schema file:
- Store nested fields flat in dot-notation (address.city, not nested objects)
- Mark fields with presence below 1.0 as "sparse": true
- For fields with multiple observed types, set "mixedTypes": true and list all observed types
- Mark always-null fields as "deprecated": true
- Include both confirmed and unconfirmed candidates in the foreignKeys array
- Use confirmed (boolean) for foreign keys, never confidence

After writing the schema file, also update memory/guide.json:
- Read memory/guide.json (or create it if it doesn't exist with { "collections": {} })
- Set collections.$ARGUMENTS with:
- aliases — from user notes (empty array if none)
- description — from user notes (empty string if none)
- documentCount — from the schema metadata
- defaultFilters — from any user-specified "always filter" notes (empty object if none)
- deprecatedFields — list of all fields marked deprecated: true in the schema
- references — from the foreignKeys array, as a field → targetCollection map
- schemaFile — "memory/schema/$ARGUMENTS.json"
- exploredAt — today's date (YYYY-MM-DD)

After writing both files, print a short confirmation summarizing what was written.
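Putting the guide.json fields together, an entry might look like this (all values hypothetical):

```json
{
  "database": "mydb",
  "collections": {
    "orders": {
      "aliases": ["purchases"],
      "description": "Customer orders placed through the storefront",
      "documentCount": 48210,
      "defaultFilters": { "is_deleted": false },
      "deprecatedFields": ["legacyStatus"],
      "references": { "customerId": "customers" },
      "schemaFile": "memory/schema/orders.json",
      "exploredAt": "2025-06-01"
    }
  }
}
```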
Remove the temp file:
rm -f /tmp/explore_{collection}.json