Recover a complete product catalog (data + images) from a defunct e-commerce site by URL. Use when the user gives a URL, domain, or comma-separated host list for a dead store and wants the full catalog recovered automatically. Also triggers on mentions of Wayback Machine catalog recovery, CommonCrawl WARC extraction, Shopify CDN archaeology, or rebuilding a product database from a dead website.
Recover the product catalog for: $ARGUMENTS
python3 "${CLAUDE_SKILL_DIR}/../../scripts/bootstrap.py" --input "$ARGUMENTS"
The bootstrap script above has already:
- Queried the Wayback CDX API for `*.{apex}` to enumerate captured subdomains (sample ≤ 5000).
- Detected the platform, including any `.myshopify.com` alias embedded in the HTML.
- Written `<projects-root>/<name>/config.yaml` and saved the plan to `<projects-root>/<name>/plan.json`.

**Where projects land.** The plan JSON has `project_dir` and `config_path` — use those verbatim when running downstream stages. The default `projects-root` is `~/wayback-archive/`; override via the `$WAYBACK_ARCHIVE_ROOT` env var or `--project-root <path>` on bootstrap. Projects intentionally live outside the plugin cache so they survive plugin updates.

Read the JSON plan above. Then do the following, in order:
1. If `dry_run == true`, stop here — the user asked for a preview. Show them a three-line summary (platform, host_count, config_path) and ask what to adjust.
2. If `platform == "unknown"` OR `confidence < 0.6` OR `host_count == 1`, do not run the pipeline yet. Surface the `notes` array to the user and ask them to either (a) confirm the generic config, (b) add missing hosts, or (c) specify the correct platform. Re-run bootstrap.py with the updated input if they add hosts.
3. If `myshopify_domain` is non-null, confirm in your summary that it was added to the domains list.
4. Show the user a compact summary:
Target: `<apex>` · Platform: `<platform>` (conf `<confidence>`) · Hosts: `<host_count>` · Config: `<config_path>`
About to run: cdx_dump → index → filter → fetch → cdn_discover → match → download → normalize → build. Proceed? [Y/n]
Wait for confirmation. If --dry-run was passed in $ARGUMENTS, skip the prompt.
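For orientation, here is a hypothetical `plan.json` built from the field names this document references (`apex`, `platform`, `confidence`, `host_count`, `myshopify_domain`, `dry_run`, `notes`, `project_dir`, `config_path`). The exact shape bootstrap.py emits may differ, and the store name is invented:

```json
{
  "apex": "acme-store.com",
  "platform": "shopify",
  "confidence": 0.85,
  "host_count": 3,
  "myshopify_domain": "acme-store.myshopify.com",
  "dry_run": false,
  "notes": [],
  "project_dir": "~/wayback-archive/acme-store",
  "config_path": "~/wayback-archive/acme-store/config.yaml"
}
```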
python3 scripts/run_stage.py all --config <config_path> --auto
`--auto` implies `--yes`, streams compact progress events to `projects/<name>/.progress.jsonl` (one JSON line per stage start/end), and runs the audit gate at the end. The command's exit code is 0 iff the audit passes (all five integers are zero) and 1 if residuals remain.
Run from the repo root. Do not flood the chat with the raw log stream. If the run is long (CDX dumps can take tens of minutes), launch it with run_in_background: true and poll the progress file:
tail -n 50 projects/<name>/.progress.jsonl
Report only stage transitions and anomalies (status != "ok", circuit-breaker trips, >30s wall time on a non-fetch stage) in one-line updates to the user — no narrative.
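The monitoring rules above can be sketched as a small filter over the progress file. The event field names (`stage`, `status`, `wall_s`) are assumptions — the document only specifies one JSON line per stage start/end:

```python
import json
import pathlib

def anomalies(progress_path, slow_threshold_s=30):
    """Return one-line reports for anomalous events in .progress.jsonl.

    Assumed event shape: {"stage": ..., "status": ..., "wall_s": ...}.
    Flags any non-ok status, and slow wall time on non-fetch stages.
    """
    lines = []
    for raw in pathlib.Path(progress_path).read_text().splitlines():
        ev = json.loads(raw)
        if ev.get("status") not in (None, "ok"):
            lines.append(f"{ev['stage']}: status={ev['status']}")
        elif ev.get("stage") != "fetch" and ev.get("wall_s", 0) > slow_threshold_s:
            lines.append(f"{ev['stage']}: slow ({ev['wall_s']}s)")
    return lines
```

Everything that passes the filter stays out of the chat; only the returned lines are surfaced to the user.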
`--auto` has already written `projects/<name>/audit.json` and exited with 0 (pass) or 1 (residual). Do not re-run the audit yourself unless the file is missing; if it is, run:
python3 scripts/audit.py --config <config_path>
Open `audit.json` and read the `integers` object. The exit code is authoritative — never report "done" on a non-zero exit. If residual, use `exemplars` to enumerate what's missing and either:
- (a) re-run the stage that shrinks the bucket: `unenumerated_hosts > 0` → re-run cdx_dump for the missing hosts; `retry_queue_depth > 0` → re-run fetch with `--proxy dc` or `--fallback-archives archive_today memento`; `index_missing > 0` → re-run download and check `links/<slug>.txt` for each empty exemplar; or
- (b) attach a `terminal_reason` and explain to the user what couldn't be recovered and why. (The ledger refactor — IMPROVEMENT_PLAN phase C3 — will persist these annotations; for now, surface them in your report.)

Repeat steps 3–4 until the audit exits zero, or every residual has a `terminal_reason`.
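A sketch of this triage, assuming `audit.json` nests the counters under an `integers` key and picking the largest non-zero bucket the way the resume subcommand does. The bucket names come from this document; the return shape is illustrative:

```python
# Bucket → stage mapping, as an illustrative argv fragment for run_stage.py.
BUCKET_TO_STAGE = {
    "unenumerated_hosts": ["cdx_dump"],
    "unresolved_slugs": ["fetch"],
    "retry_queue_depth": ["fetch", "--proxy", "dc",
                          "--fallback-archives", "archive_today", "memento"],
    "index_missing": ["download"],
}

def next_stage(audit):
    """Pick the largest non-zero bucket and the stage that would shrink it.

    Returns None when every integer is zero (the audit passes).
    """
    residual = {k: v for k, v in audit["integers"].items() if v > 0}
    if not residual:
        return None
    bucket = max(residual, key=residual.get)
    return bucket, BUCKET_TO_STAGE.get(bucket)
```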
Three lines, nothing more:
- Recovered: N products, M images (read from `audit.json` `raw_counts`)
- Audit: pass (or "N residual items: <breakdown from integers>")
- Output: `projects/<name>/<name>_catalog.json`

**I. Entity-first.** Count products, not files. A saved feed, sitemap, or collection HTML with no downstream expansion is not progress.
**II. Discovery is recursive.** Feeds / sitemaps / collections / JSON endpoints are discovery surfaces, never terminal artifacts. Every parse must emit outlinks before the surface is marked processed.
**III. New host → immediate enumeration.** Any previously unseen hostname observed in any capture triggers a CDX dump and product-URL enumeration. Do not wait for a human prompt. If you see a new host in extracted HTML, re-run bootstrap.py with the expanded host list or append it to the config and re-run cdx_dump.
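Principle III amounts to keeping a seen-host ledger and flagging anything new for enumeration. A minimal sketch (the function and variable names are illustrative, not from the pipeline):

```python
from urllib.parse import urlparse

def new_hosts(extracted_urls, seen_hosts):
    """Return hostnames not yet enumerated; caller re-runs cdx_dump for them.

    Mutates seen_hosts so each host triggers enumeration exactly once.
    """
    found = set()
    for url in extracted_urls:
        host = urlparse(url).hostname
        if host and host not in seen_hosts:
            found.add(host)
            seen_hosts.add(host)
    return found
```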
**IV. No "done" without audit.** Compute and report the five audit integers (unresolved slugs, unexpanded surfaces, index-missing entries, unenumerated hosts, retry-queue depth) before declaring completion. Any non-zero count blocks the claim unless paired with a `terminal_reason`.
**V. Validate before counting.** Extracted strings are candidates, not slugs. Normalize → classify → reject non-product URLs (image assets, CDN paths) → dedupe → report candidates-seen and validated-and-new separately.
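Principle V as a sketch. The `/products/<slug>` pattern assumes a Shopify-style URL scheme, and the regexes are illustrative, not the pipeline's actual classifier:

```python
import re

# Reject obvious asset/CDN paths before classifying.
ASSET_RE = re.compile(r"\.(?:jpg|jpeg|png|gif|webp|css|js)(?:\?|$)", re.I)
# Classify Shopify-style product URLs and capture the slug.
PRODUCT_RE = re.compile(r"/products/([a-z0-9][a-z0-9-]*)/?(?:\?|$)", re.I)

def validate_slugs(candidates, known_slugs):
    """Return (candidates seen, validated-and-new slugs) as separate counts."""
    seen, validated_new = 0, set()
    for url in candidates:
        seen += 1
        url = url.strip().lower()          # normalize
        if ASSET_RE.search(url):           # reject image/CDN assets
            continue
        m = PRODUCT_RE.search(url)         # classify: product URL?
        if not m:
            continue
        slug = m.group(1)
        if slug not in known_slugs:        # dedupe against the ledger
            validated_new.add(slug)
    return seen, validated_new
```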
Drain highest-value surfaces first: `json_api > sitemap > feed > collection > home > search > product`. `products.json?limit=1000` is the holy grail — never let HTML shells starve it.
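The drain order above can be enforced with a priority queue. The ranking is the document's; the queue itself is an illustrative sketch:

```python
import heapq

# Lower number = drained first, per the document's ranking.
PRIORITY = {"json_api": 0, "sitemap": 1, "feed": 2, "collection": 3,
            "home": 4, "search": 5, "product": 6}

class SurfaceQueue:
    def __init__(self):
        self._heap, self._n = [], 0

    def push(self, kind, url):
        self._n += 1  # tie-breaker preserves insertion order within a kind
        heapq.heappush(self._heap, (PRIORITY[kind], self._n, kind, url))

    def pop(self):
        _, _, kind, url = heapq.heappop(self._heap)
        return kind, url
```

With this structure, a late-discovered `products.json` endpoint still jumps ahead of any queued HTML shells.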
Use the `resume` subcommand — it reads `projects/<name>/audit.json`, picks the largest non-zero bucket, and runs only the stage that would shrink it:
python3 scripts/run_stage.py resume --config <cfg> --auto
- Bucket → stage mapping: `unenumerated_hosts` → cdx_dump, `unresolved_slugs` → fetch, `retry_queue_depth` → fetch `--proxy dc --fallback-archives archive_today memento`, `index_missing` → download. Override the auto-pick with `--bucket <name>`.
- Preflight failures are recorded in `projects/<name>/preflight.json`. Missing Oxylabs creds: copy `tools/.env.example` → `tools/.env` and fill it in (auto-sourced, no export needed). Missing deps: `pip install -r requirements.txt`. Low disk: free space or change `project_dir`. Archive.org unreachable: retry or check VPN/DNS.
- Wrong platform template: edit `projects/<name>/config.yaml`, pick a different template from `skills/wayback-archive/configs/_template_*.yaml`, swap the `cdn_patterns` / `url_rules` blocks, and re-run resume or the specific stages from filter onward.
- Interrupted CDX dump: the checkpoint lives at `tools/.<domain>_wayback.ckpt.json`. Resume with `cd tools && python -m wayback_cdx --domain <d> --output ../projects/<name>/<d>_wayback.txt --resume`.
- Blocked or failed fetches: let `resume` pick them up via the `retry_queue_depth` bucket, or run fetch directly with `--proxy dc --fallback-archives archive_today memento`.
- JS-rendered pages: see `references/playwright-wayback.md`.