Recover a complete product catalog (data + images) from a defunct e-commerce site by URL. Use when the user gives a URL, domain, or comma-separated host list for a dead store and wants the full catalog recovered automatically. Also triggers on mentions of Wayback Machine catalog recovery, CommonCrawl WARC extraction, Shopify CDN archaeology, or rebuilding a product database from a dead website.
Recover the product catalog for: $ARGUMENTS
python3 "${CLAUDE_SKILL_DIR}/../../scripts/bootstrap.py" --input "$ARGUMENTS"
The bootstrap script above has already:
- Queried the Wayback CDX API for `*.{apex}` to enumerate captured subdomains (sample ≤ 5000).
- Detected the platform, including any `.myshopify.com` alias embedded in the HTML.
- Written `<projects-root>/<name>/config.yaml` and saved the plan to `<projects-root>/<name>/plan.json`.

**Where projects land.** The plan JSON has `project_dir` and `config_path` — use those verbatim when running downstream stages. The default `projects-root` is `~/wayback-archive/`; override via the `$WAYBACK_ARCHIVE_ROOT` env var or `--project-root <path>` on bootstrap. Projects intentionally live outside the plugin cache so they survive plugin updates.

Read the JSON plan above. Then do the following, in order:
1. If `dry_run == true`, stop here — the user asked for a preview. Show them a three-line summary (platform, host_count, config_path) and ask what to adjust.
2. If `platform == "unknown"` OR `confidence < 0.6` OR `host_count == 1`, do not run the pipeline yet. Surface the `notes` array to the user and ask them to either (a) confirm the generic config, (b) add missing hosts, or (c) specify the correct platform. Re-run bootstrap.py with the updated input if they add hosts.
3. If `myshopify_domain` is non-null, confirm in your summary that it was added to the domains list.
4. Show the user a compact summary:
Target: `<apex>` · Platform: `<platform>` (conf `<confidence>`) · Hosts: `<host_count>` · Config: `<config_path>`
About to run: cdx_dump → index → filter → fetch → cdn_discover → match → download → normalize → build. Proceed? [Y/n]
Wait for confirmation. If --dry-run was passed in $ARGUMENTS, skip the prompt.
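For orientation, here is a hypothetical `plan.json` built from the field names this document references (`apex`, `platform`, `confidence`, `host_count`, `myshopify_domain`, `dry_run`, `notes`, `project_dir`, `config_path`). The exact shape bootstrap.py emits may differ, and the store name is invented:

```json
{
  "apex": "acme-store.com",
  "platform": "shopify",
  "confidence": 0.85,
  "host_count": 3,
  "myshopify_domain": "acme-store.myshopify.com",
  "dry_run": false,
  "notes": [],
  "project_dir": "~/wayback-archive/acme-store",
  "config_path": "~/wayback-archive/acme-store/config.yaml"
}
```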
python3 scripts/run_stage.py all --config <config_path> --auto
`--auto` implies `--yes`, streams compact progress events to `projects/<name>/.progress.jsonl` (one JSON line per stage start/end), and runs the audit gate at the end. The command's exit code is 0 iff the audit passes (all five integers are zero) and 1 if residuals remain.
Run from the repo root. Do not flood the chat with the raw log stream. If the run is long (CDX dumps can take tens of minutes), launch it with run_in_background: true and poll the progress file:
tail -n 50 projects/<name>/.progress.jsonl
Report only stage transitions and anomalies (status != "ok", circuit-breaker trips, >30s wall time on a non-fetch stage) in one-line updates to the user — no narrative.
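The monitoring rules above can be sketched as a small filter over the progress file. The event field names (`stage`, `status`, `wall_s`) are assumptions — the document only specifies one JSON line per stage start/end:

```python
import json
import pathlib

def anomalies(progress_path, slow_threshold_s=30):
    """Return one-line reports for anomalous events in .progress.jsonl.

    Assumed event shape: {"stage": ..., "status": ..., "wall_s": ...}.
    Flags any non-ok status, and slow wall time on non-fetch stages.
    """
    lines = []
    for raw in pathlib.Path(progress_path).read_text().splitlines():
        ev = json.loads(raw)
        if ev.get("status") not in (None, "ok"):
            lines.append(f"{ev['stage']}: status={ev['status']}")
        elif ev.get("stage") != "fetch" and ev.get("wall_s", 0) > slow_threshold_s:
            lines.append(f"{ev['stage']}: slow ({ev['wall_s']}s)")
    return lines
```

Everything that passes the filter stays out of the chat; only the returned lines are surfaced to the user.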
`--auto` has already written `projects/<name>/audit.json` and exited with 0 (pass) or 1 (residual). Do not re-run the audit yourself unless the file is missing; if it is, run:
python3 scripts/audit.py --config <config_path>
Open `audit.json` and read the `integers` object. The exit code is authoritative — never report "done" on a non-zero exit. If residual, use `exemplars` to enumerate what's missing and either:
- (a) re-run the stage that shrinks the bucket: `unenumerated_hosts > 0` → re-run cdx_dump for the missing hosts; `retry_queue_depth > 0` → re-run fetch with `--proxy dc` or `--fallback-archives archive_today memento`; `index_missing > 0` → re-run download and check `links/<slug>.txt` for each empty exemplar; or
- (b) attach a `terminal_reason` and explain to the user what couldn't be recovered and why. (The ledger refactor — IMPROVEMENT_PLAN phase C3 — will persist these annotations; for now, surface them in your report.)

Repeat steps 3–4 until the audit exits zero, or every residual has a `terminal_reason`.
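A sketch of this triage, assuming `audit.json` nests the counters under an `integers` key and picking the largest non-zero bucket the way the resume subcommand does. The bucket names come from this document; the return shape is illustrative:

```python
# Bucket → stage mapping, as an illustrative argv fragment for run_stage.py.
BUCKET_TO_STAGE = {
    "unenumerated_hosts": ["cdx_dump"],
    "unresolved_slugs": ["fetch"],
    "retry_queue_depth": ["fetch", "--proxy", "dc",
                          "--fallback-archives", "archive_today", "memento"],
    "index_missing": ["download"],
}

def next_stage(audit):
    """Pick the largest non-zero bucket and the stage that would shrink it.

    Returns None when every integer is zero (the audit passes).
    """
    residual = {k: v for k, v in audit["integers"].items() if v > 0}
    if not residual:
        return None
    bucket = max(residual, key=residual.get)
    return bucket, BUCKET_TO_STAGE.get(bucket)
```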
Three lines, nothing more:
- Recovered: N products, M images (read from `audit.json` `raw_counts`)
- Audit: pass (or "N residual items: <breakdown from integers>")
- Output: `projects/<name>/<name>_catalog.json`

**I. Entity-first.** Count products, not files. A saved feed, sitemap, or collection HTML with no downstream expansion is not progress.
**II. Discovery is recursive.** Feeds / sitemaps / collections / JSON endpoints are discovery surfaces, never terminal artifacts. Every parse must emit outlinks before the surface is marked processed.
**III. New host → immediate enumeration.** Any previously unseen hostname observed in any capture triggers a CDX dump and product-URL enumeration. Do not wait for a human prompt. If you see a new host in extracted HTML, re-run bootstrap.py with the expanded host list or append it to the config and re-run cdx_dump.
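Principle III amounts to keeping a seen-host ledger and flagging anything new for enumeration. A minimal sketch (the function and variable names are illustrative, not from the pipeline):

```python
from urllib.parse import urlparse

def new_hosts(extracted_urls, seen_hosts):
    """Return hostnames not yet enumerated; caller re-runs cdx_dump for them.

    Mutates seen_hosts so each host triggers enumeration exactly once.
    """
    found = set()
    for url in extracted_urls:
        host = urlparse(url).hostname
        if host and host not in seen_hosts:
            found.add(host)
            seen_hosts.add(host)
    return found
```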
**IV. No "done" without audit.** Compute and report the five audit integers (unresolved slugs, unexpanded surfaces, index-missing entries, unenumerated hosts, retry-queue depth) before declaring completion. Any non-zero count blocks the claim unless paired with a `terminal_reason`.
**V. Validate before counting.** Extracted strings are candidates, not slugs. Normalize → classify → reject non-product URLs (image assets, CDN paths) → dedupe → report candidates-seen and validated-and-new separately.
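Principle V as a sketch. The `/products/<slug>` pattern assumes a Shopify-style URL scheme, and the regexes are illustrative, not the pipeline's actual classifier:

```python
import re

# Reject obvious asset/CDN paths before classifying.
ASSET_RE = re.compile(r"\.(?:jpg|jpeg|png|gif|webp|css|js)(?:\?|$)", re.I)
# Classify Shopify-style product URLs and capture the slug.
PRODUCT_RE = re.compile(r"/products/([a-z0-9][a-z0-9-]*)/?(?:\?|$)", re.I)

def validate_slugs(candidates, known_slugs):
    """Return (candidates seen, validated-and-new slugs) as separate counts."""
    seen, validated_new = 0, set()
    for url in candidates:
        seen += 1
        url = url.strip().lower()          # normalize
        if ASSET_RE.search(url):           # reject image/CDN assets
            continue
        m = PRODUCT_RE.search(url)         # classify: product URL?
        if not m:
            continue
        slug = m.group(1)
        if slug not in known_slugs:        # dedupe against the ledger
            validated_new.add(slug)
    return seen, validated_new
```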
Drain highest-value surfaces first: `json_api > sitemap > feed > collection > home > search > product`. `products.json?limit=1000` is the holy grail — never let HTML shells starve it.
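The drain order above can be enforced with a priority queue. The ranking is the document's; the queue itself is an illustrative sketch:

```python
import heapq

# Lower number = drained first, per the document's ranking.
PRIORITY = {"json_api": 0, "sitemap": 1, "feed": 2, "collection": 3,
            "home": 4, "search": 5, "product": 6}

class SurfaceQueue:
    def __init__(self):
        self._heap, self._n = [], 0

    def push(self, kind, url):
        self._n += 1  # tie-breaker preserves insertion order within a kind
        heapq.heappush(self._heap, (PRIORITY[kind], self._n, kind, url))

    def pop(self):
        _, _, kind, url = heapq.heappop(self._heap)
        return kind, url
```

With this structure, a late-discovered `products.json` endpoint still jumps ahead of any queued HTML shells.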
Use the `resume` subcommand — it reads `projects/<name>/audit.json`, picks the largest non-zero bucket, and runs only the stage that would shrink it:
python3 scripts/run_stage.py resume --config <cfg> --auto
- Bucket → stage mapping: `unenumerated_hosts` → cdx_dump, `unresolved_slugs` → fetch, `retry_queue_depth` → fetch `--proxy dc --fallback-archives archive_today memento`, `index_missing` → download. Override the auto-pick with `--bucket <name>`.
- Preflight failures are recorded in `projects/<name>/preflight.json`. Missing Oxylabs creds: copy `tools/.env.example` → `tools/.env` and fill it in (auto-sourced, no export needed). Missing deps: `pip install -r requirements.txt`. Low disk: free space or change `project_dir`. Archive.org unreachable: retry or check VPN/DNS.
- Wrong platform template: edit `projects/<name>/config.yaml`, pick a different template from `skills/wayback-archive/configs/_template_*.yaml`, swap the `cdn_patterns` / `url_rules` blocks, and re-run resume or the specific stages from filter onward.
- Interrupted CDX dump: the checkpoint lives at `tools/.<domain>_wayback.ckpt.json`. Resume with `cd tools && python -m wayback_cdx --domain <d> --output ../projects/<name>/<d>_wayback.txt --resume`.
- Blocked or failed fetches: let `resume` pick them up via the `retry_queue_depth` bucket, or run fetch directly with `--proxy dc --fallback-archives archive_today memento`.
- JS-rendered pages: see `references/playwright-wayback.md`.