Analyze product data structure, generate transform/enrich code, index into Algolia, train Recommend models, and set up Query Suggestions. Use when preparing and indexing products from a JSON/CSV file, downloading from an existing Algolia index, or after scraping with /demo-scrape.
Analyze raw product data, generate transform/enrich code, load into Algolia, and configure search features (Recommend models, Query Suggestions).
Before starting, verify:

- `.env` file has `ALGOLIA_ADMIN_API_KEY`
- `lib/algolia-config.ts` has correct `APP_ID`, `SEARCH_API_KEY`, and `INDEX_NAME`
- dependencies are installed (`pnpm install`)

Ask the user for their data source: a local JSON/CSV file, an existing Algolia index to download from, data scraped via `/demo-scrape` first, or none yet. Follow the path matching the data source:
mkdir -p data && cp <path_to_file> data/products.json
pnpm tsx scripts/download-index.ts
This downloads all records from the source index configured in the script to data/products.json.
Invoke the /demo-scrape skill, which handles the full scraping workflow and saves output to data/products.json.
Skip — inform the user they can add data later and re-run this skill.
Skip this step if:
- `transformRecords()` and `enrichRecords()` in `scripts/index-data.ts` are already populated

Read:

- `data/products.json` (or the user-specified file) — sample the first 3-5 records to understand the structure
- `lib/types/product.ts` for the target `Product` interface
- `scripts/index-data.ts` to see the current transform/enrich state

If the project ran `/demo-discovery`, use its Data Requirements section to prioritize the analysis.

Produce a structured report with 4 sections.
These 6 fields are required for the UI to function. If any are missing, the demo will break.
| Expected Field | Source Field | Status | Action Needed |
|---|---|---|---|
| `objectID` | ? | Present / Mappable / Missing | — / Rename from `id` / Generate UUID |
| `name` | ? | Present / Mappable / Missing | — / Rename from `title` / ... (see name guidance below) |
| `primary_image` | ? | Present / Mappable / Missing | — / Rename from `image` / ... |
| `price.value` | ? | Present / Mappable / Missing | — / Wrap flat number in `{ value: N }` / ... |
| `brand` | ? | Present / Mappable / Missing | — / Rename from `manufacturer` / ... |
| `description` | ? | Present / Mappable / Missing | — / Rename from `body_html` / ... |
For each field, show the actual source field name and a sample value.
Name field guidance: Check ALL candidate fields for name, not just the obvious one. Feeds often have multiple name fields (e.g., `name`, `display_name`, `var_display_name`, `title`, `default_name`). For each candidate, report its actual field name and a sample value.
Prefer marketing-friendly display names over internal names. If a variant-level name exists (e.g., var_display_name with pattern "NAME - COLOR"), note that the color suffix can be stripped. Fall back through candidates in quality order, not just pick the first match.
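The fallback logic above can be sketched as follows. The candidate field names and their order are assumptions to adapt to the actual feed, and the suffix-stripping regex assumes the "NAME - COLOR" pattern noted above:

```typescript
// Hypothetical helper: pick the best display name from candidate fields,
// in quality order, stripping a trailing " - COLOR" variant suffix.
type RawRecord = Record<string, unknown>;

// Ordered by assumed quality: marketing display names first, internal names last.
const NAME_CANDIDATES = ["display_name", "var_display_name", "name", "title", "default_name"];

function pickName(record: RawRecord): string | undefined {
  for (const field of NAME_CANDIDATES) {
    const value = record[field];
    if (typeof value === "string" && value.trim().length > 0) {
      // Strip a variant color suffix like "Air Max 90 - Infrared" -> "Air Max 90"
      return value.replace(/\s+-\s+[^-]+$/, "").trim();
    }
  }
  return undefined;
}
```

Report which candidate was chosen per the analysis template, so the user can override the order if the heuristic picks a poor field.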
Price field guidance: If price data is completely absent from the feed, flag this as a critical gap and present options (e.g., synthesize realistic demo prices, tracked in `_synthetic_fields`).

Use the same format for fields that enhance specific features:
- `hierarchical_categories` (object: `{ lvl0, lvl1, lvl2 }`) — needed for category navigation, Recommend CBF, Query Suggestions
- `color.filter_group` + `color.original_name` — needed for color swatches
- `available_sizes` (string[]) — needed for size filtering
- `image_urls` (string[]) — needed for product page image gallery
- `gender` (string) — needed for faceting, Recommend CBF
- `discount_rate` (number) — needed for sale badges
- `reviews` (`{ bayesian_avg, count, rating }`) — needed for review display and ranking
- `variants` (array) — needed for color variant swatches
- `slug` (string) — URL-safe identifier
- `sku` (string) — product SKU display

If a discovery brief exists, prioritize fields flagged in its data requirements.
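For orientation, the target shape implied by the fields above looks roughly like this. The authoritative definition lives in `lib/types/product.ts`; this sketch only mirrors the fields discussed here and may differ from the real interface:

```typescript
// Assumed shape of the Product interface (see lib/types/product.ts for the
// real definition; optionality below is a guess based on the field lists).
interface Product {
  objectID: string;
  name: string;
  primary_image: string;
  price: { value: number };
  brand: string;
  description: string;
  // Secondary fields
  hierarchical_categories?: { lvl0?: string; lvl1?: string; lvl2?: string };
  color?: { filter_group: string; original_name: string };
  available_sizes?: string[];
  image_urls?: string[];
  gender?: string;
  discount_rate?: number;
  reviews?: { bayesian_avg: number; count: number; rating: number };
  slug?: string;
  sku?: string;
  _synthetic_fields?: string[];
}
```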
Field verification rule: Before mapping any field, verify its actual content — field names are often misleading. Sample 3-5 unique values and confirm they match the expected type. Common traps:
Report any mismatches in the analysis.
These drive custom ranking and merchandising. Often absent in raw data.
- `sales_last_24h` / `sales_last_7d` / `sales_last_30d` / `sales_last_90d` — popularity ranking
- `margin` — profit margin for business-aware ranking
- `product_aov` — average order value

Note: these can be synthesized with deterministic pseudo-random values seeded from `objectID` for demo purposes. When synthesizing any field, its name must be added to the record's `_synthetic_fields` string array (see Step 2.3).
Based on what's present in the data + discovery brief needs, suggest AI-generated fields:
- `keywords` — extract search keywords from name + description + categories
- `semantic_attributes` — natural language product summary for NeuralSearch / semantic search
- `image_description` — describe the primary product image

Gender warning: Do NOT infer gender from product names — it's unreliable and typically produces 60-70% "Unisex" defaults, making the facet nearly useless. Only map gender if the source data has an explicit, verified gender/audience field. If gender is needed but absent, flag it as a gap and ask the user — don't silently infer.
For each enrichment, explain the value it adds to the demo.
Present ALL sections from 2.1, then ask consolidated questions:
Here's my analysis of the data against the Product interface:
[Sections A-D]
Questions:
1. **Transform**: Should I generate field mappings for items marked "Mappable"? (Y/n)
2. **Missing critical fields**: How should I handle these?
- [ ] Synthesize with realistic demo data
- [ ] Leave empty / null (index as-is)
- [ ] Enrich via AI
3. **Business metrics**: Synthesize sales/margin data for custom ranking? (Y/n)
4. **Enrichments** — which would you like? (select all that apply)
- [ ] keywords extraction
- [ ] semantic_attributes generation
- [ ] image_description generation
- [ ] Other: ___
5. **Enrichment source**:
- [ ] OpenAI structured outputs (requires OPENAI_API_KEY in .env)
- [ ] Other API / source: ___
6. **Anything else** to transform or enrich?
WAIT for user response before proceeding.
Populate the `transformRecords()` function in `scripts/index-data.ts` based on the analysis:

- Rename fields (e.g., `record.title` → `name`)
- Wrap flat prices (`price: 29.99` → `price: { value: 29.99 }`)
- Build `hierarchical_categories` from flat category data if needed (e.g., split breadcrumb strings on `>`)
- Build a `color` object from a flat color string if needed (`{ filter_group, original_name }`)
- Generate `slug` from `name` if missing (lowercase, replace spaces with hyphens, strip special chars)
- Generate `sku` from `objectID` if missing
- Synthesize business metrics deterministically, seeded from `objectID`:

```typescript
// Deterministic pseudo-random from objectID for reproducible demo data
function seedRandom(str: string): number {
  let hash = 0;
  for (let i = 0; i < str.length; i++) {
    hash = ((hash << 5) - hash) + str.charCodeAt(i);
    hash |= 0;
  }
  return Math.abs(hash) / 2147483647;
}
```
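The `seedRandom` helper can then be salted per field so each metric gets an independent but reproducible value. The value ranges below are illustrative assumptions, not project conventions (`seedRandom` is repeated so the snippet runs standalone):

```typescript
// Repeated from above so this sketch is self-contained.
function seedRandom(str: string): number {
  let hash = 0;
  for (let i = 0; i < str.length; i++) {
    hash = ((hash << 5) - hash) + str.charCodeAt(i);
    hash |= 0;
  }
  return Math.abs(hash) / 2147483647;
}

// Salting the seed per field keeps metrics independent of each other
// while staying deterministic for a given objectID.
function syntheticMetrics(objectID: string) {
  return {
    sales_last_7d: Math.floor(seedRandom(objectID + ":sales7") * 500),
    margin: Math.round(seedRandom(objectID + ":margin") * 60 * 100) / 100,       // 0-60%
    product_aov: Math.round(seedRandom(objectID + ":aov") * 200 * 100) / 100,    // 0-200
  };
}
```

Because the values derive only from `objectID`, re-running the script produces identical metrics, which keeps demo rankings stable.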
`_synthetic_fields` marker (required): Every synthesized or fabricated field must be tracked. At the end of `transformRecords()`, populate `_synthetic_fields` with the names of all fields that were generated rather than mapped from the source data:

```typescript
// Track which fields are synthetic so they're visible in the Algolia dashboard
const syntheticFields: string[] = [];
if (!sourceHasPrice) syntheticFields.push("price");
if (!sourceHasReviews) syntheticFields.push("reviews");
if (synthesizeBusinessMetrics) {
  syntheticFields.push(
    "sales_last_24h", "sales_last_7d", "sales_last_30d", "sales_last_90d",
    "margin", "product_aov",
  );
}
record._synthetic_fields = syntheticFields;
```
This lets anyone browsing the index immediately understand what's real vs fabricated.
The function must be pure (no async, no external calls) and operate via .map().
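Putting the pieces together, a minimal pure sketch of `transformRecords()` might look like this. The source field names (`title`, a flat numeric `price`, a `"A > B > C"` breadcrumb in `category`) are assumptions to replace with the mappings from the Step 2.1 analysis:

```typescript
type RawRecord = Record<string, any>;

function transformRecords(records: RawRecord[]): RawRecord[] {
  return records.map((r) => {
    const name = r.title ?? r.name;
    // Split an assumed breadcrumb string into category levels
    const levels: string[] = typeof r.category === "string"
      ? r.category.split(">").map((s: string) => s.trim())
      : [];
    return {
      ...r,
      name,
      // Wrap a flat price number into the { value } shape the UI expects
      price: typeof r.price === "number" ? { value: r.price } : r.price,
      hierarchical_categories: {
        lvl0: levels[0],
        lvl1: levels.length > 1 ? levels.slice(0, 2).join(" > ") : undefined,
        lvl2: levels.length > 2 ? levels.slice(0, 3).join(" > ") : undefined,
      },
      // URL-safe slug derived from the name
      slug: r.slug ?? (typeof name === "string"
        ? name.toLowerCase().replace(/[^a-z0-9\s-]/g, "").trim().replace(/\s+/g, "-")
        : undefined),
    };
  });
}
```

Note the function is a pure `.map()` with no async work, matching the constraint above.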
If enrichment was requested:
pnpm add openai zod # if using OpenAI structured outputs
Check that OPENAI_API_KEY is set in .env. If not, prompt the user to add it.
Populate enrichRecords() in scripts/index-data.ts:
```typescript
import { z } from "zod";

const EnrichedFields = z.object({
  keywords: z.array(z.string()).describe("Search keywords extracted from product attributes"),
  semantic_attributes: z.string().describe("Natural language product summary for semantic search"),
  // ... other fields based on user selections
});
```
- Process in batches (50 records at a time, for API rate limits)
- Use OpenAI structured outputs:
```typescript
import OpenAI from "openai";
import { zodResponseFormat } from "openai/helpers/zod";

const openai = new OpenAI(); // reads OPENAI_API_KEY from env

for (let i = 0; i < records.length; i += BATCH_SIZE) {
  const batch = records.slice(i, i + BATCH_SIZE);
  await Promise.all(batch.map(async (record) => {
    const completion = await openai.beta.chat.completions.parse({
      model: "gpt-4o-mini",
      messages: [{
        role: "user",
        content: `Extract structured product data:\n${JSON.stringify(record, null, 2)}`,
      }],
      response_format: zodResponseFormat(EnrichedFields, "enriched_product"),
    });
    record._enriched = completion.choices[0].message.parsed;
    // Promote enriched fields to top level
    Object.assign(record, record._enriched);
  }));
  console.log(`Enriched ${Math.min(i + BATCH_SIZE, records.length)}/${records.length} records...`);
}
```
- Handle errors gracefully — skip failed records with a warning; don't abort the whole batch
- Log progress — `Enriched 50/1794 records...`
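The skip-on-failure pattern can be sketched as follows. `enrichOne` is a hypothetical stand-in for the OpenAI call, not a function that exists in the script:

```typescript
// A failed record is logged and left unenriched; the rest of the batch
// still completes instead of the whole run aborting.
async function enrichBatch<T extends { objectID: string }>(
  batch: T[],
  enrichOne: (record: T) => Promise<object>,
): Promise<{ enriched: number; failed: string[] }> {
  const failed: string[] = [];
  await Promise.all(batch.map(async (record) => {
    try {
      Object.assign(record, await enrichOne(record));
    } catch (err) {
      failed.push(record.objectID);
      console.warn(`Skipping ${record.objectID}: ${(err as Error).message}`);
    }
  }));
  return { enriched: batch.length - failed.length, failed };
}
```

Surfacing the `failed` list at the end of the run lets the user decide whether to retry those records or index without them.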
Alternative enrichment sources: If OpenAI is not the chosen source, adapt the pattern for the user's preferred API. The _enriched namespace and batch processing pattern remain the same.
If new attributes were added via transform or enrich, update the settings object in scripts/index-data.ts.
The base script has sensible defaults, but review and adapt the order for each demo. The order directly affects ranking — attributes higher in the list are more relevant.
Rules (from Algolia docs):
- Use `unordered()` by default. Word position within a value rarely matters. The exception is `name` — keep it ordered so a match at the start of a product name ("Nike Air Max 90") ranks higher than a match in the middle.
- Attributes used only for filtering don't belong here — put them in `attributesForFaceting` only.

Default priority template for e-commerce (tie-breaking):
```typescript
searchableAttributes: [
  // P1: Short, precise text — brand/category/color matches are unambiguous
  "unordered(brand)",
  "unordered(searchable_categories.lvl0), unordered(searchable_categories.lvl1), unordered(searchable_categories.lvl2)",
  "unordered(color.original_name), unordered(gender)",
  // P2: Product name — ordered so matches at the start rank higher
  "name",
  // P3: Exact lookups
  "unordered(sku)",
  // P4: Enriched search terms
  "unordered(keywords)",
  // P5: Long text — catches long-tail but noisy, lowest priority
  "unordered(description)",
  "unordered(semantic_attributes)",
],
```
Adapt this per demo. For example, if the vertical has a strong "material" or "collection" attribute, add it at P1 alongside brand/category. If description is thin or empty, demote it or remove it entirely.
Then update the related settings:

- `attributesForFaceting` — add new filterable attributes
- `attributesToRetrieve` — add any new fields that the frontend needs
- `customRanking` — add synthesized business metrics if applicable

Only modify settings that changed — don't rewrite the entire settings block.
pnpm tsx scripts/index-data.ts [path/to/products.json]
Default path is data/products.json. The script:
- builds `categoryPageId` (a flat array of all ancestor category paths) from `hierarchical_categories`
- sets up the composition (`<INDEX_NAME>_composition`)

After indexing completes, sample 5-10 records from the Algolia index and verify data quality:
Critical field check — for each sampled record, verify:
- `name` is a clean, marketing-friendly product name (not an internal ID or raw code)
- `primary_image` resolves to a valid URL
- `price.value` is a reasonable number (not 0, not absurdly high)
- `brand` is populated and human-readable
- `description` is populated and not HTML/markdown

Synthetic field check — verify `_synthetic_fields` is populated on records that have fabricated data. Every synthetic field should be listed.
Coverage report — for each critical + secondary field, report the population rate:
```
name:          1934/1934 (100%)
primary_image: 1920/1934 (99%)
price.value:   1934/1934 (100%) [SYNTHETIC]
brand:         1934/1934 (100%)
gender:         614/1934 (32%) ⚠️ LOW
reviews:       1934/1934 (100%) [SYNTHETIC]
```
Flag issues — any field with >30% empty/default values gets a warning. Any field where >50% of values are a single default (e.g., "Unisex") gets flagged as potentially unreliable for faceting.
If critical issues are found, ask the user before proceeding to Recommend/QS setup.
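The coverage report above can be computed with a helper along these lines. This is a generic sketch, not code from the indexing script; the dot-path handling covers nested fields like `price.value`:

```typescript
// Population rate for one field across sampled records. Treats null,
// undefined, and empty strings as unpopulated; adjust the emptiness
// check if 0 or [] should also count as missing.
function coverage(records: Record<string, unknown>[], field: string): string {
  const populated = records.filter((r) => {
    // Walk dot-paths so "price.value" reaches the nested number
    const v = field.split(".").reduce<any>((acc, k) => (acc == null ? acc : acc[k]), r);
    return v !== null && v !== undefined && v !== "";
  }).length;
  const pct = records.length === 0 ? 0 : Math.round((populated / records.length) * 100);
  return `${field}: ${populated}/${records.length} (${pct}%)`;
}
```

Running it over each critical and secondary field yields the report lines shown above, ready for the >30% empty and single-default-value checks.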
pnpm tsx scripts/setup-recommend.ts
Trains two Algolia Recommend models (takes minutes to hours depending on data size):
- Related Products — uses `hierarchical_categories.lvl0`, `brand`, `gender`
- Looking Similar — uses `primary_image`

pnpm tsx scripts/setup-query-suggestions.ts
Configures Query Suggestions from index facet data (no event tracking required). Mines facet values from: brand, categories (lvl0/lvl1), gender, color, and combinations thereof.
Creates index: <INDEX_NAME>_query_suggestions.
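For reference, the configuration body that `scripts/setup-query-suggestions.ts` sends likely resembles the following. The shape follows Algolia's Query Suggestions configuration API, but the index names, `minHits` value, and facet amounts here are illustrative assumptions, not values read from the script:

```typescript
// Illustrative Query Suggestions configuration body.
const qsConfig = {
  indexName: "products_query_suggestions", // i.e. <INDEX_NAME>_query_suggestions
  sourceIndices: [
    {
      indexName: "products", // i.e. <INDEX_NAME>
      minHits: 5, // only suggest queries that return at least 5 hits
      facets: [
        { attribute: "brand", amount: 50 },
        { attribute: "hierarchical_categories.lvl0", amount: 20 },
      ],
      // Generate suggestions from facet values and their combinations
      generate: [
        ["brand"],
        ["hierarchical_categories.lvl0"],
        ["brand", "hierarchical_categories.lvl0"],
      ],
    },
  ],
};
```

Because suggestions are mined from facet values rather than analytics, this works on a fresh demo index with no event history.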
This skill does NOT run scripts/setup-agent.ts. Agent setup is a separate concern — use /demo-agent-setup for that.
When done, report ALL of the following to the user:
```
Data indexing complete!

Records: 1,794
Transforms: 12 field mappings applied
Enrichments: 3 AI-generated fields (keywords, semantic_attributes, image_description)
Synthetic: price, reviews, sales_last_24h/7d/30d/90d, margin, product_aov
  (marked in _synthetic_fields on each record)
Image domains: cdn.example.com, images.example.com
Categories: 12 top-level, 47 total
Facets: brand, color, gender, size, price
Validation: ✓ name clean (100%), ✓ images resolve, ⚠️ gender 32% populated
Recommend: Related Products + Looking Similar trained
QS: <INDEX_NAME>_query_suggestions created
```
Include specifically:
- Image domains — these must be added to `DEMO_CONFIG.imageDomains` in `lib/demo-config/index.ts` (via `/demo-branding`).
- `hierarchical_categories` facet values — list every unique category string at all levels:
```
Women
Women > Bottoms
Women > Bottoms > Shorts
Men
Men > Tops
Men > Tops > T-Shirts
```
These are needed by `/demo-categories` to build the category navigation, and are also relevant to `/demo-user-profiles`.

Edge cases:

- If `data/products.json` doesn't exist, tell the user to get data first (`/demo-scrape` or provide a file)
- If `transformRecords()` / `enrichRecords()` don't exist in `scripts/index-data.ts`, add them (they should be there as placeholders)
- Verify `ALGOLIA_ADMIN_API_KEY` is set in `.env`
- Verify `APP_ID` and `INDEX_NAME` in `lib/algolia-config.ts`