Normalizes messy bank-imported merchant names into clean, consistent labels using aggregate GROUP BY queries. Writes to the normalized_merchant overlay field, preserving original merchant names for provenance. Memory-backed for auto-applying known mappings on future runs.
This skill guides the finance butler through normalizing messy bank-imported merchant
names (e.g., "WHOLEFDS MKT #10456 AUSTIN TX") into clean, consistent labels
(e.g., "Whole Foods"). The canonical form is stored in the normalized_merchant
metadata overlay field; the original merchant field is never modified.
READ THIS SECTION FIRST. These are hard rules, not suggestions.
You MUST NOT call list_transactions or any other tool that reads individual
transaction rows into context at any point during this workflow.
Reason: A typical account has hundreds or thousands of transaction rows. Reading them into context would exhaust the token budget and likely abort the session before any normalization is applied. The unit of work for normalization is the merchant name, not the individual transaction.
The only acceptable data source is list_distinct_merchants, which returns a
compact GROUP BY result — typically 50–200 rows even for a year of transactions.
You MUST NOT attempt to normalize merchants by reading transactions and writing them back one row at a time. This is both token-prohibitive and semantically wrong — normalization is a metadata overlay, not a re-ingestion.
All merchant data used in this workflow comes from list_distinct_merchants. This
tool returns deduplicated merchant names with transaction counts and totals. It is
the correct and only acceptable source for normalization input.
If the first call returns total > 500, you MUST process in paginated batches.
Use limit=500, offset=0, then limit=500, offset=500, etc., until all merchants
are processed. Do not skip pagination — large accounts can have hundreds of
distinct raw merchant strings.
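The pagination loop can be sketched as follows. `fetch_page` is a stand-in for the real `list_distinct_merchants` call, and the `{"total": ..., "merchants": [...]}` response shape is assumed from the tool description above:

```python
def fetch_all_merchants(fetch_page, page_size=500):
    """Drain every page of distinct merchants.

    fetch_page(limit, offset) stands in for list_distinct_merchants;
    it must return {"total": int, "merchants": [dict, ...]}.
    """
    merchants = []
    offset = 0
    while True:
        page = fetch_page(limit=page_size, offset=offset)
        merchants.extend(page["merchants"])
        offset += page_size
        # Stop once we have paged past the reported total.
        if offset >= page["total"]:
            break
    return merchants
```

Because the loop keys off the `total` field rather than an empty page, it also terminates correctly when the very first page reports `total == 0`.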
Normalization MUST write to the normalized_merchant metadata field via
bulk_update_transactions. It MUST NOT modify the original merchant field, and
MUST NOT touch subject, predicate, content, or embedding columns. The
original merchant value is the immutable provenance anchor and deduplication key.
The finance butler's fact layer is write-once on core columns. Normalization uses the metadata overlay pattern instead:
| Field | Role |
|---|---|
| `merchant` | Original raw bank string — never modified, preserved for provenance |
| `normalized_merchant` | Clean canonical name set by this skill — preferred for display and aggregation |
Query and display tools (spending_summary, list_transactions, dashboard) prefer
normalized_merchant when present and fall back to merchant when absent. You do
not need to update any query logic — the overlay is transparently consumed.
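The fallback rule the query tools apply amounts to a one-liner; this sketch assumes transactions are plain dicts with a `merchant` key and an optional `normalized_merchant` overlay:

```python
def display_merchant(txn: dict) -> str:
    # Prefer the overlay when set; fall back to the raw bank string.
    return txn.get("normalized_merchant") or txn["merchant"]
```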
Call list_distinct_merchants to get the compact merchant list:
result = list_distinct_merchants(
unnormalized_only=True, # only merchants without a normalized_merchant overlay
min_count=2, # skip one-off merchants unlikely to need normalization
)
Inspect the response:
- `total` — total number of distinct unnormalized merchants
- `merchants` — list of `{merchant, count, total_amount}` entries
- If `total > 500`, proceed in paginated batches (see CRITICAL CONSTRAINTS)
- If `total == 0`, report "All merchants are already normalized." and stop.
Before running any LLM inference, check whether this butler has seen and mapped any of these merchants before:
memory_search(
query="merchant normalization alias",
tags=["merchant-alias"],
limit=100,
)
Also try targeted recall for high-frequency merchants you see in the list:
memory_recall(topic="merchant_alias:WHOLEFDS")
Memory fact schema for aliases (written in Step 7):
- `subject`: `"merchant_alias:<RAW_PATTERN>"` — the raw bank string or a prefix/pattern
- `predicate`: `"normalizes_to"`
- `content`: `"Whole Foods"` — the clean canonical name
- `tags`: `["merchant-alias", "normalization"]`

Collect all known mappings into a lookup table before proceeding.
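Building the lookup table from recalled memory facts can be sketched like this, assuming each fact is a dict following the alias schema above:

```python
PREFIX = "merchant_alias:"

def build_alias_table(facts):
    """Collect alias facts into {raw_pattern: canonical_name}.

    Skips any fact that does not follow the alias schema
    (subject "merchant_alias:<RAW_PATTERN>", predicate "normalizes_to").
    """
    table = {}
    for fact in facts:
        subject = fact.get("subject", "")
        if subject.startswith(PREFIX) and fact.get("predicate") == "normalizes_to":
            table[subject[len(PREFIX):]] = fact["content"]
    return table
```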
For every merchant in the list that matches a known alias, call
bulk_update_transactions immediately — no LLM review needed:
bulk_update_transactions(updates=[
{
"match": {"merchant_pattern": "WHOLEFDS%"},
"set": {"normalized_merchant": "Whole Foods"},
},
{
"match": {"merchant_pattern": "AMZN MKTP US%"},
"set": {"normalized_merchant": "Amazon"},
},
# ... one entry per known alias
])
Track which merchants from the list have been handled so they are excluded from the LLM review in Step 4.
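One way to track handled merchants is to partition the distinct-merchant list against the alias table's patterns. This sketch assumes SQL `LIKE` semantics for `merchant_pattern` (`%` = any run, `_` = one character), implemented via a regex translation:

```python
import re

def like_to_regex(pattern):
    # Translate SQL LIKE wildcards into regex, escaping everything else.
    out = []
    for ch in pattern:
        if ch == "%":
            out.append(".*")
        elif ch == "_":
            out.append(".")
        else:
            out.append(re.escape(ch))
    return "".join(out)

def like_match(pattern, text):
    return re.fullmatch(like_to_regex(pattern), text) is not None

def partition_merchants(merchants, alias_table):
    # Split distinct merchants into already-known vs needs-LLM-review.
    known, review = [], []
    for entry in merchants:
        if any(like_match(p, entry["merchant"]) for p in alias_table):
            known.append(entry)
        else:
            review.append(entry)
    return known, review
```

Only the `review` list is carried into Step 4, so merchants covered by memory are never re-normalized.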
Present the remaining merchants (those not covered by known aliases) to
yourself for grouping. For each merchant, you have count (transaction frequency)
and total_amount (aggregate spend) as context for prioritization.
Group obvious variants into a single canonical name. Apply these heuristics:
- "WHOLEFDS MKT #10456 AUSTIN TX" → "Whole Foods"
- "SQ *" prefix indicates a Square payment; keep the merchant name after the `*`
- "AMZN" → "Amazon"
- "NETFLIX.COM" → "Netflix", "SPOTIFY AB" → "Spotify"
- "PAYPAL *MERCHANTNAME" and "PAYPAL*MERCHANTNAME" → same merchant
- Do not over-merge distinct businesses: "Blue Bottle Coffee" and "Philz Coffee" are separate merchants
- Preserve official brand capitalization (e.g., "eBay", "YouTube")

Produce a mapping list: `[(raw_pattern, canonical_name), ...]`
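A first-pass cleanup implementing some of these heuristics can be sketched as below. This is only a suggestion generator: abbreviation expansion ("WHOLEFDS" → "Whole Foods") and brand casing ("eBay") still require the LLM review, and the `.title()` default is deliberately rough:

```python
import re

def rough_canonical(raw: str) -> str:
    """Mechanical cleanup of a raw bank string; a sketch, not the final mapping."""
    name = raw
    # Processor prefixes: "SQ *X" / "SQ*X" (Square), "PAYPAL *X" / "PAYPAL*X".
    for prefix in ("SQ", "PAYPAL"):
        if name.startswith(prefix) and "*" in name:
            name = name.split("*", 1)[1].strip()
            break
    # Drop trailing store numbers and locations, e.g. "#10456 AUSTIN TX".
    name = re.sub(r"\s*#\d+.*$", "", name)
    # Drop common domain suffixes: "NETFLIX.COM" -> "NETFLIX".
    name = re.sub(r"\.(COM|NET|ORG)$", "", name, flags=re.IGNORECASE)
    return name.title()
```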
For each group from Step 4, call bulk_update_transactions:
bulk_update_transactions(updates=[
{
"match": {"merchant_pattern": "SQ *BLUE BOTTLE%"},
"set": {"normalized_merchant": "Blue Bottle Coffee"},
},
{
"match": {"merchant_pattern": "NETFLIX%"},
"set": {"normalized_merchant": "Netflix"},
},
# ... one entry per canonical group
])
Pattern guidance for merchant_pattern:
- `%` matches any sequence of characters, `_` matches exactly one character
- Prefer prefix patterns (`"WHOLEFDS%"`) over exact matches to catch minor variant strings
- Watch spacing variants: `"SQ *BLUE BOTTLE%"` and `"SQ*BLUE BOTTLE%"` together cover both space variants
- Avoid overly broad patterns like `"AMAZON%"` if it would also match unrelated merchants; prefer `"AMZN MKTP US%"` or `"AMAZON.COM%"` for precision

Accumulate the `total_matched` and `total_updated` counts returned by each
`bulk_update_transactions` call.
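Tallying those counters across calls can be sketched as follows; `bulk_update` stands in for the real `bulk_update_transactions` tool, assumed to return `total_matched` and `total_updated` as described above:

```python
def apply_and_tally(bulk_update, batches):
    """Run each update batch and sum the counters the tool returns."""
    total_matched = total_updated = 0
    for updates in batches:
        result = bulk_update(updates=updates)
        total_matched += result["total_matched"]
        total_updated += result["total_updated"]
    return {"total_matched": total_matched, "total_updated": total_updated}
```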
After all updates are applied, report a concise summary:
Merchant normalization complete.
Distinct merchants processed: 87
Auto-applied (known aliases): 12 (from memory)
LLM-normalized: 71
Skipped (single-occurrence): 4
Transactions updated: 1,243
If pagination was used, note the batch count:
Processed in 2 batches (1,100 distinct merchants total).
For each new mapping produced by the LLM in Step 4 (i.e., not already in memory), store a memory fact so future runs can auto-apply it:
# Resolve or create the merchant entity first (follow memory-classification skill).
# resolve_or_create_merchant_entity is shorthand for the try/except pattern below —
# use the real MCP calls (as shown in the worked example's Step 7):