Analyze product data structure, generate transform/enrich code, index into Algolia, train Recommend models, and set up Query Suggestions. Use when preparing and indexing products from a JSON/CSV file, downloading from an existing Algolia index, or after scraping with /demo-scrape.
Analyze raw product data, generate transform/enrich code, load into Algolia, and configure search features (Recommend models, Query Suggestions).
Before starting, verify:

- `.env` file has `ALGOLIA_ADMIN_API_KEY`
- `lib/algolia-config.ts` has correct `APP_ID`, `SEARCH_API_KEY`, and `INDEX_NAME`
- dependencies are installed (`pnpm install`)

Ask the user for their data source: a local JSON/CSV file, an existing Algolia index to download from, data scraped via `/demo-scrape` first, or none yet. Follow the path matching the data source:
mkdir -p data && cp <path_to_file> data/products.json
pnpm tsx scripts/download-index.ts
This downloads all records from the source index configured in the script to data/products.json.
Invoke the /demo-scrape skill, which handles the full scraping workflow and saves output to data/products.json.
Skip — inform the user they can add data later and re-run this skill.
Skip this step if:
- `transformRecords()` and `enrichRecords()` in `scripts/index-data.ts` are already populated

Read:

- `data/products.json` (or the user-specified file) — sample the first 3-5 records to understand the structure
- `lib/types/product.ts` for the target `Product` interface
- `scripts/index-data.ts` to see the current transform/enrich state

If the project ran `/demo-discovery`, use its Data Requirements section to prioritize the analysis.

Produce a structured report with 4 sections.
These 6 fields are required for the UI to function. If any are missing, the demo will break.
| Expected Field | Source Field | Status | Action Needed |
|---|---|---|---|
| `objectID` | ? | Present / Mappable / Missing | — / Rename from `id` / Generate UUID |
| `name` | ? | Present / Mappable / Missing | — / Rename from `title` / ... (see name guidance below) |
| `primary_image` | ? | Present / Mappable / Missing | — / Rename from `image` / ... |
| `price.value` | ? | Present / Mappable / Missing | — / Wrap flat number in `{ value: N }` / ... |
| `brand` | ? | Present / Mappable / Missing | — / Rename from `manufacturer` / ... |
| `description` | ? | Present / Mappable / Missing | — / Rename from `body_html` / ... |
For each field, show the actual source field name and a sample value.
Name field guidance: Check ALL candidate fields for name, not just the obvious one. Feeds often have multiple name fields (e.g., `name`, `display_name`, `var_display_name`, `title`, `default_name`). For each candidate, report its actual field name and a sample value.
Prefer marketing-friendly display names over internal names. If a variant-level name exists (e.g., var_display_name with pattern "NAME - COLOR"), note that the color suffix can be stripped. Fall back through candidates in quality order, not just pick the first match.
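The fallback logic above can be sketched as follows. The candidate field names and their order are assumptions to adapt to the actual feed, and the suffix-stripping regex assumes the "NAME - COLOR" pattern noted above:

```typescript
// Hypothetical helper: pick the best display name from candidate fields,
// in quality order, stripping a trailing " - COLOR" variant suffix.
type RawRecord = Record<string, unknown>;

// Ordered by assumed quality: marketing display names first, internal names last.
const NAME_CANDIDATES = ["display_name", "var_display_name", "name", "title", "default_name"];

function pickName(record: RawRecord): string | undefined {
  for (const field of NAME_CANDIDATES) {
    const value = record[field];
    if (typeof value === "string" && value.trim().length > 0) {
      // Strip a variant color suffix like "Air Max 90 - Infrared" -> "Air Max 90"
      return value.replace(/\s+-\s+[^-]+$/, "").trim();
    }
  }
  return undefined;
}
```

Report which candidate was chosen per the analysis template, so the user can override the order if the heuristic picks a poor field.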
Price field guidance: If price data is completely absent from the feed, flag this as a critical gap and present options (e.g., synthesize realistic demo prices, tracked in `_synthetic_fields`).

Use the same format for fields that enhance specific features:
- `hierarchical_categories` (object: `{ lvl0, lvl1, lvl2 }`) — needed for category navigation, Recommend CBF, Query Suggestions
- `color.filter_group` + `color.original_name` — needed for color swatches
- `available_sizes` (string[]) — needed for size filtering
- `image_urls` (string[]) — needed for product page image gallery
- `gender` (string) — needed for faceting, Recommend CBF
- `discount_rate` (number) — needed for sale badges
- `reviews` (`{ bayesian_avg, count, rating }`) — needed for review display and ranking
- `variants` (array) — needed for color variant swatches
- `slug` (string) — URL-safe identifier
- `sku` (string) — product SKU display

If a discovery brief exists, prioritize fields flagged in its data requirements.
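For orientation, the target shape implied by the fields above looks roughly like this. The authoritative definition lives in `lib/types/product.ts`; this sketch only mirrors the fields discussed here and may differ from the real interface:

```typescript
// Assumed shape of the Product interface (see lib/types/product.ts for the
// real definition; optionality below is a guess based on the field lists).
interface Product {
  objectID: string;
  name: string;
  primary_image: string;
  price: { value: number };
  brand: string;
  description: string;
  // Secondary fields
  hierarchical_categories?: { lvl0?: string; lvl1?: string; lvl2?: string };
  color?: { filter_group: string; original_name: string };
  available_sizes?: string[];
  image_urls?: string[];
  gender?: string;
  discount_rate?: number;
  reviews?: { bayesian_avg: number; count: number; rating: number };
  slug?: string;
  sku?: string;
  _synthetic_fields?: string[];
}
```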
Field verification rule: Before mapping any field, verify its actual content — field names are often misleading. Sample 3-5 unique values and confirm they match the expected type. Common traps:
Report any mismatches in the analysis.
These drive custom ranking and merchandising. Often absent in raw data.
- `sales_last_24h` / `sales_last_7d` / `sales_last_30d` / `sales_last_90d` — popularity ranking
- `margin` — profit margin for business-aware ranking
- `product_aov` — average order value

Note: these can be synthesized with deterministic pseudo-random values seeded from `objectID` for demo purposes. When synthesizing any field, its name must be added to the record's `_synthetic_fields` string array (see Step 2.3).
Based on what's present in the data + discovery brief needs, suggest AI-generated fields:
- `keywords` — extract search keywords from name + description + categories
- `semantic_attributes` — natural language product summary for NeuralSearch / semantic search
- `image_description` — describe the primary product image

Gender warning: Do NOT infer gender from product names — it's unreliable and typically produces 60-70% "Unisex" defaults, making the facet nearly useless. Only map gender if the source data has an explicit, verified gender/audience field. If gender is needed but absent, flag it as a gap and ask the user — don't silently infer.
For each enrichment, explain the value it adds to the demo.
Present ALL sections from 2.1, then ask consolidated questions:
Here's my analysis of the data against the Product interface:
[Sections A-D]
Questions:
1. **Transform**: Should I generate field mappings for items marked "Mappable"? (Y/n)
2. **Missing critical fields**: How should I handle these?
- [ ] Synthesize with realistic demo data
- [ ] Leave empty / null (index as-is)
- [ ] Enrich via AI
3. **Business metrics**: Synthesize sales/margin data for custom ranking? (Y/n)
4. **Enrichments** — which would you like? (select all that apply)
- [ ] keywords extraction
- [ ] semantic_attributes generation
- [ ] image_description generation
- [ ] Other: ___
5. **Enrichment source**:
- [ ] OpenAI structured outputs (requires OPENAI_API_KEY in .env)
- [ ] Other API / source: ___
6. **Anything else** to transform or enrich?
WAIT for user response before proceeding.
Populate the `transformRecords()` function in `scripts/index-data.ts` based on the analysis:

- Rename fields (e.g., `record.title` → `name`)
- Wrap flat prices (`price: 29.99` → `price: { value: 29.99 }`)
- Build `hierarchical_categories` from flat category data if needed (e.g., split breadcrumb strings on `>`)
- Build a `color` object from a flat color string if needed (`{ filter_group, original_name }`)
- Generate `slug` from `name` if missing (lowercase, replace spaces with hyphens, strip special chars)
- Generate `sku` from `objectID` if missing
- Synthesize business metrics deterministically, seeded from `objectID`:

```typescript
// Deterministic pseudo-random from objectID for reproducible demo data
function seedRandom(str: string): number {
  let hash = 0;
  for (let i = 0; i < str.length; i++) {
    hash = ((hash << 5) - hash) + str.charCodeAt(i);
    hash |= 0;
  }
  return Math.abs(hash) / 2147483647;
}
```
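The `seedRandom` helper can then be salted per field so each metric gets an independent but reproducible value. The value ranges below are illustrative assumptions, not project conventions (`seedRandom` is repeated so the snippet runs standalone):

```typescript
// Repeated from above so this sketch is self-contained.
function seedRandom(str: string): number {
  let hash = 0;
  for (let i = 0; i < str.length; i++) {
    hash = ((hash << 5) - hash) + str.charCodeAt(i);
    hash |= 0;
  }
  return Math.abs(hash) / 2147483647;
}

// Salting the seed per field keeps metrics independent of each other
// while staying deterministic for a given objectID.
function syntheticMetrics(objectID: string) {
  return {
    sales_last_7d: Math.floor(seedRandom(objectID + ":sales7") * 500),
    margin: Math.round(seedRandom(objectID + ":margin") * 60 * 100) / 100,       // 0-60%
    product_aov: Math.round(seedRandom(objectID + ":aov") * 200 * 100) / 100,    // 0-200
  };
}
```

Because the values derive only from `objectID`, re-running the script produces identical metrics, which keeps demo rankings stable.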
`_synthetic_fields` marker (required): Every synthesized or fabricated field must be tracked. At the end of `transformRecords()`, populate `_synthetic_fields` with the names of all fields that were generated rather than mapped from the source data:

```typescript
// Track which fields are synthetic so they're visible in the Algolia dashboard
const syntheticFields: string[] = [];
if (!sourceHasPrice) syntheticFields.push("price");
if (!sourceHasReviews) syntheticFields.push("reviews");
if (synthesizeBusinessMetrics) {
  syntheticFields.push(
    "sales_last_24h", "sales_last_7d", "sales_last_30d", "sales_last_90d",
    "margin", "product_aov",
  );
}
record._synthetic_fields = syntheticFields;
```
This lets anyone browsing the index immediately understand what's real vs fabricated.
The function must be pure (no async, no external calls) and operate via .map().
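Putting the pieces together, a minimal pure sketch of `transformRecords()` might look like this. The source field names (`title`, a flat numeric `price`, a `"A > B > C"` breadcrumb in `category`) are assumptions to replace with the mappings from the Step 2.1 analysis:

```typescript
type RawRecord = Record<string, any>;

function transformRecords(records: RawRecord[]): RawRecord[] {
  return records.map((r) => {
    const name = r.title ?? r.name;
    // Split an assumed breadcrumb string into category levels
    const levels: string[] = typeof r.category === "string"
      ? r.category.split(">").map((s: string) => s.trim())
      : [];
    return {
      ...r,
      name,
      // Wrap a flat price number into the { value } shape the UI expects
      price: typeof r.price === "number" ? { value: r.price } : r.price,
      hierarchical_categories: {
        lvl0: levels[0],
        lvl1: levels.length > 1 ? levels.slice(0, 2).join(" > ") : undefined,
        lvl2: levels.length > 2 ? levels.slice(0, 3).join(" > ") : undefined,
      },
      // URL-safe slug derived from the name
      slug: r.slug ?? (typeof name === "string"
        ? name.toLowerCase().replace(/[^a-z0-9\s-]/g, "").trim().replace(/\s+/g, "-")
        : undefined),
    };
  });
}
```

Note the function is a pure `.map()` with no async work, matching the constraint above.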
If enrichment was requested:
pnpm add openai zod # if using OpenAI structured outputs
Check that OPENAI_API_KEY is set in .env. If not, prompt the user to add it.
Populate enrichRecords() in scripts/index-data.ts:
```typescript
import { z } from "zod";

const EnrichedFields = z.object({
  keywords: z.array(z.string()).describe("Search keywords extracted from product attributes"),
  semantic_attributes: z.string().describe("Natural language product summary for semantic search"),
  // ... other fields based on user selections
});
```
- Process in batches (50 records at a time, for API rate limits)
- Use OpenAI structured outputs:
```typescript
import OpenAI from "openai";
import { zodResponseFormat } from "openai/helpers/zod";

const openai = new OpenAI(); // reads OPENAI_API_KEY from env

for (let i = 0; i < records.length; i += BATCH_SIZE) {
  const batch = records.slice(i, i + BATCH_SIZE);
  await Promise.all(batch.map(async (record) => {
    const completion = await openai.beta.chat.completions.parse({
      model: "gpt-4o-mini",
      messages: [{
        role: "user",
        content: `Extract structured product data:\n${JSON.stringify(record, null, 2)}`,
      }],
      response_format: zodResponseFormat(EnrichedFields, "enriched_product"),
    });
    record._enriched = completion.choices[0].message.parsed;
    // Promote enriched fields to top level
    Object.assign(record, record._enriched);
  }));
  console.log(`Enriched ${Math.min(i + BATCH_SIZE, records.length)}/${records.length} records...`);
}
```
- Handle errors gracefully — skip failed records with a warning; don't abort the whole batch
- Log progress — `Enriched 50/1794 records...`
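The skip-on-failure pattern can be sketched as follows. `enrichOne` is a hypothetical stand-in for the OpenAI call, not a function that exists in the script:

```typescript
// A failed record is logged and left unenriched; the rest of the batch
// still completes instead of the whole run aborting.
async function enrichBatch<T extends { objectID: string }>(
  batch: T[],
  enrichOne: (record: T) => Promise<object>,
): Promise<{ enriched: number; failed: string[] }> {
  const failed: string[] = [];
  await Promise.all(batch.map(async (record) => {
    try {
      Object.assign(record, await enrichOne(record));
    } catch (err) {
      failed.push(record.objectID);
      console.warn(`Skipping ${record.objectID}: ${(err as Error).message}`);
    }
  }));
  return { enriched: batch.length - failed.length, failed };
}
```

Surfacing the `failed` list at the end of the run lets the user decide whether to retry those records or index without them.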
Alternative enrichment sources: If OpenAI is not the chosen source, adapt the pattern for the user's preferred API. The _enriched namespace and batch processing pattern remain the same.
If new attributes were added via transform or enrich, update the settings object in scripts/index-data.ts.
The base script has sensible defaults, but review and adapt the order for each demo. The order directly affects ranking — attributes higher in the list are more relevant.
Rules (from Algolia docs):
- Use `unordered()` by default. Word position within a value rarely matters. The exception is `name` — keep it ordered so a match at the start of a product name ("Nike Air Max 90") ranks higher than a match in the middle.
- Attributes used only for filtering don't belong here — put them in `attributesForFaceting` only.

Default priority template for e-commerce (tie-breaking):
```typescript
searchableAttributes: [
  // P1: Short, precise text — brand/category/color matches are unambiguous
  "unordered(brand)",
  "unordered(searchable_categories.lvl0), unordered(searchable_categories.lvl1), unordered(searchable_categories.lvl2)",
  "unordered(color.original_name), unordered(gender)",
  // P2: Product name — ordered so matches at the start rank higher
  "name",
  // P3: Exact lookups
  "unordered(sku)",
  // P4: Enriched search terms
  "unordered(keywords)",
  // P5: Long text — catches long-tail but noisy, lowest priority
  "unordered(description)",
  "unordered(semantic_attributes)",
],
```
Adapt this per demo. For example, if the vertical has a strong "material" or "collection" attribute, add it at P1 alongside brand/category. If description is thin or empty, demote it or remove it entirely.
Then update the related settings:

- `attributesForFaceting` — add new filterable attributes
- `attributesToRetrieve` — add any new fields that the frontend needs
- `customRanking` — add synthesized business metrics if applicable

Only modify settings that changed — don't rewrite the entire settings block.
pnpm tsx scripts/index-data.ts [path/to/products.json]
Default path is data/products.json. The script:
- builds `categoryPageId` (a flat array of all ancestor category paths) from `hierarchical_categories`
- sets up the composition (`<INDEX_NAME>_composition`)

After indexing completes, sample 5-10 records from the Algolia index and verify data quality:
Critical field check — for each sampled record, verify:
- `name` is a clean, marketing-friendly product name (not an internal ID or raw code)
- `primary_image` resolves to a valid URL
- `price.value` is a reasonable number (not 0, not absurdly high)
- `brand` is populated and human-readable
- `description` is populated and not HTML/markdown

Synthetic field check — verify `_synthetic_fields` is populated on records that have fabricated data. Every synthetic field should be listed.
Coverage report — for each critical + secondary field, report the population rate:
```
name:          1934/1934 (100%)
primary_image: 1920/1934 (99%)
price.value:   1934/1934 (100%) [SYNTHETIC]
brand:         1934/1934 (100%)
gender:         614/1934 (32%) ⚠️ LOW
reviews:       1934/1934 (100%) [SYNTHETIC]
```
Flag issues — any field with >30% empty/default values gets a warning. Any field where >50% of values are a single default (e.g., "Unisex") gets flagged as potentially unreliable for faceting.
If critical issues are found, ask the user before proceeding to Recommend/QS setup.
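The coverage report above can be computed with a helper along these lines. This is a generic sketch, not code from the indexing script; the dot-path handling covers nested fields like `price.value`:

```typescript
// Population rate for one field across sampled records. Treats null,
// undefined, and empty strings as unpopulated; adjust the emptiness
// check if 0 or [] should also count as missing.
function coverage(records: Record<string, unknown>[], field: string): string {
  const populated = records.filter((r) => {
    // Walk dot-paths so "price.value" reaches the nested number
    const v = field.split(".").reduce<any>((acc, k) => (acc == null ? acc : acc[k]), r);
    return v !== null && v !== undefined && v !== "";
  }).length;
  const pct = records.length === 0 ? 0 : Math.round((populated / records.length) * 100);
  return `${field}: ${populated}/${records.length} (${pct}%)`;
}
```

Running it over each critical and secondary field yields the report lines shown above, ready for the >30% empty and single-default-value checks.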
pnpm tsx scripts/setup-recommend.ts
Trains two Algolia Recommend models (takes minutes to hours depending on data size):
- Related Products — uses `hierarchical_categories.lvl0`, `brand`, `gender`
- Looking Similar — uses `primary_image`

pnpm tsx scripts/setup-query-suggestions.ts
Configures Query Suggestions from index facet data (no event tracking required). Mines facet values from: brand, categories (lvl0/lvl1), gender, color, and combinations thereof.
Creates index: <INDEX_NAME>_query_suggestions.
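For reference, the configuration body that `scripts/setup-query-suggestions.ts` sends likely resembles the following. The shape follows Algolia's Query Suggestions configuration API, but the index names, `minHits` value, and facet amounts here are illustrative assumptions, not values read from the script:

```typescript
// Illustrative Query Suggestions configuration body.
const qsConfig = {
  indexName: "products_query_suggestions", // i.e. <INDEX_NAME>_query_suggestions
  sourceIndices: [
    {
      indexName: "products", // i.e. <INDEX_NAME>
      minHits: 5, // only suggest queries that return at least 5 hits
      facets: [
        { attribute: "brand", amount: 50 },
        { attribute: "hierarchical_categories.lvl0", amount: 20 },
      ],
      // Generate suggestions from facet values and their combinations
      generate: [
        ["brand"],
        ["hierarchical_categories.lvl0"],
        ["brand", "hierarchical_categories.lvl0"],
      ],
    },
  ],
};
```

Because suggestions are mined from facet values rather than analytics, this works on a fresh demo index with no event history.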
This skill does NOT run scripts/setup-agent.ts. Agent setup is a separate concern — use /demo-agent-setup for that.
When done, report ALL of the following to the user:
```
Data indexing complete!

Records: 1,794
Transforms: 12 field mappings applied
Enrichments: 3 AI-generated fields (keywords, semantic_attributes, image_description)
Synthetic: price, reviews, sales_last_24h/7d/30d/90d, margin, product_aov
  (marked in _synthetic_fields on each record)
Image domains: cdn.example.com, images.example.com
Categories: 12 top-level, 47 total
Facets: brand, color, gender, size, price
Validation: ✓ name clean (100%), ✓ images resolve, ⚠️ gender 32% populated
Recommend: Related Products + Looking Similar trained
QS: <INDEX_NAME>_query_suggestions created
```
Include specifically:
- Image domains — these must be added to `DEMO_CONFIG.imageDomains` in `lib/demo-config/index.ts` (via `/demo-branding`).
- `hierarchical_categories` facet values — list every unique category string at all levels:
```
Women
Women > Bottoms
Women > Bottoms > Shorts
Men
Men > Tops
Men > Tops > T-Shirts
```
These are needed by `/demo-categories` to build the category navigation, and are also relevant to `/demo-user-profiles`.

Edge cases:

- If `data/products.json` doesn't exist, tell the user to get data first (`/demo-scrape` or provide a file)
- If `transformRecords()` / `enrichRecords()` don't exist in `scripts/index-data.ts`, add them (they should be there as placeholders)
- Verify `ALGOLIA_ADMIN_API_KEY` is set in `.env`
- Verify `APP_ID` and `INDEX_NAME` in `lib/algolia-config.ts`