Run this skill FIRST before any website audit. Use when starting an audit, beginning a site crawl, or when the user provides a TARGET_URL. This skill MUST complete before audit-page is used. It builds both queues — discovery (for link extraction) and audit (for compliance checks) — ensuring the full site structure is known before any auditing begins.
This skill executes the DISCOVERY phase of a website audit. It crawls the entire site, extracts links from every page, and populates both queues:
- queue/discovery/: pages that need link extraction
- queue/audit/: pages ready for compliance audit

After this skill completes, the discovery queue should be empty (all pages crawled for links) and the audit queue should be full (all pages ready for auditing).
Input: TARGET_URL, the site's root URL (e.g., https://www.example.pt).

NEVER call init.sh from this skill. The orchestrator already called it and wrote the audit path to .current_audit. Read it from there:
AUDIT_DIR=$(bash "${CLAUDE_PLUGIN_ROOT}/scripts/queue.sh" get-audit-dir)
echo "Using audit directory: $AUDIT_DIR"
If get-audit-dir fails, stop immediately and tell the orchestrator: "No active audit found. The orchestrator must call init.sh before running discover-site."
Verify the directory exists and has the expected structure:
ls "$AUDIT_DIR/queue/discovery/" && ls "$AUDIT_DIR/queue/audit/" || {
echo "Error: audit directory structure is missing — init.sh may not have completed" >&2
exit 1
}
Add the root URL as the first discovery task:
bash "${CLAUDE_PLUGIN_ROOT}/scripts/queue.sh" add-discovery "<audit_dir>" "<target_url>" 0 "root"
Why this matters: Sitemaps list URLs that may not be linked from any page in the navigation. Without checking the sitemap, the crawl misses orphan pages (e.g., landing pages, old blog posts, promotional pages that are only accessible via direct links or search engines).
With the target site open (playwright-cli open "<target_url>", so the relative fetch resolves against its origin), check for a sitemap:
playwright-cli eval "await fetch('/sitemap.xml').then(r=>r.text()).catch(()=>'NOT_FOUND')"
If found, extract every <loc> URL and add to the discovery queue:
# For each URL in sitemap:
bash "${CLAUDE_PLUGIN_ROOT}/scripts/queue.sh" add-discovery "<audit_dir>" "<sitemap_url>" 1 "sitemap.xml"
The queue.sh script handles deduplication — if a URL is already in either queue, it's silently skipped.
If sitemap.xml is not found, log it and continue. Many sites don't have one.
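The <loc> extraction above can be sketched with standard text tools. The sample XML and the grep/sed approach are illustrative assumptions (a real sitemap may be nested, gzipped, or a sitemap index):

```shell
# Illustrative only: the sample XML stands in for the fetched sitemap body.
sitemap='<urlset><url><loc>https://www.example.pt/</loc></url><url><loc>https://www.example.pt/work</loc></url></urlset>'

# Pull out each <loc>...</loc> match, then strip the tags.
urls=$(printf '%s' "$sitemap" \
  | grep -o '<loc>[^<]*</loc>' \
  | sed -e 's|<loc>||' -e 's|</loc>||')
printf '%s\n' "$urls"

# Each extracted URL would then be passed to:
#   queue.sh add-discovery "<audit_dir>" "$url" 1 "sitemap.xml"
```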
Why this matters: robots.txt reveals which paths the site owner explicitly blocks from crawling. Respecting these directives is both ethical and legally relevant (especially under EU law where unauthorized access to restricted paths could be problematic). The disallowed paths are also useful audit findings — they may indicate hidden admin areas, staging environments, or content the owner wants to keep private.
playwright-cli eval "await fetch('/robots.txt').then(r=>r.text()).catch(()=>'NOT_FOUND')"
Log all Disallow paths to the audit directory. Do not crawl disallowed paths, but note them in the report.
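Parsing the Disallow directives out of the fetched robots.txt text can be sketched like this (the sample content is an assumption):

```shell
# Illustrative only: the sample text stands in for the fetched robots.txt body.
robots='User-agent: *
Disallow: /admin/
Disallow: /staging/
Allow: /'

# Keep only Disallow lines, then strip the directive prefix.
disallowed=$(printf '%s\n' "$robots" \
  | grep '^Disallow:' \
  | sed 's/^Disallow:[[:space:]]*//')
printf '%s\n' "$disallowed"

# Log these to the audit directory and exclude them from the crawl, e.g.:
#   printf '%s\n' "$disallowed" >> "<audit_dir>/robots_disallow.txt"
```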
This is the core of discovery. Process every page in the discovery queue:
WHILE discovery queue has pending tasks:
1. Claim next page:
URL = queue.sh claim "<audit_dir>" discovery orchestrator
2. Open and wait for JS rendering:
playwright-cli open "$URL"
playwright-cli eval "await new Promise(r => setTimeout(r, 2000))"
3. Extract ALL internal links:
playwright-cli eval "JSON.stringify([...document.querySelectorAll('a[href]')].map(a=>({href:a.href,text:a.textContent.trim()})).filter(l=>new URL(l.href).hostname===new URL(document.URL).hostname))"
4. Save a discovery snapshot (text only — no screenshots):
playwright-cli snapshot > "<audit_dir>/snapshots/<slug>_discovery.md"
Why snapshots during discovery: They capture the DOM structure, heading hierarchy,
and element references at the time of crawl. This is useful for debugging if link
extraction misses URLs (e.g., links inside shadow DOM or iframes).
5. Add all newly discovered internal links to the discovery queue:
FOR each link WHERE depth <= MAX_DEPTH:
queue.sh add-discovery "<audit_dir>" "<link_url>" <parent_depth+1> "<parent_url>"
The script deduplicates automatically — no need to check if URL exists.
6. Mark this page's discovery as complete:
queue.sh complete "<audit_dir>" discovery "<slug>"
7. Promote this page to the audit queue:
queue.sh promote "<audit_dir>" "<slug>"
Why promote after completing: The page has been crawled for links (discovery done)
and is now ready for compliance checks (audit pending). This two-step flow ensures
no page enters the audit queue before its links have been extracted.
8. Print progress:
queue.sh status "<audit_dir>" both
Why breadth-first order matters: Breadth-first ensures depth-0 and depth-1 pages (the most important, highest-traffic pages) are discovered and promoted first. If the crawl is interrupted, the audit queue already contains the most critical pages.
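The claim → complete → promote flow above can be illustrated with a local, file-based simulation. Plain files stand in for queue.sh's task records here; the real script's storage format is not assumed:

```shell
# Simulation only: files under queue/ model queue.sh's task records.
AUDIT_DIR=$(mktemp -d)
mkdir -p "$AUDIT_DIR/queue/discovery" "$AUDIT_DIR/queue/audit"

# Seed the discovery queue with the root task.
echo "https://www.example.pt/" > "$AUDIT_DIR/queue/discovery/root"

# Drain the discovery queue: claim, "extract links", complete, promote.
while task=$(ls "$AUDIT_DIR/queue/discovery" | head -n 1); [ -n "$task" ]; do
  url=$(cat "$AUDIT_DIR/queue/discovery/$task")
  # ...open "$url", extract links, add-discovery each new one here...
  # Completing + promoting is modeled as a move between queue dirs:
  mv "$AUDIT_DIR/queue/discovery/$task" "$AUDIT_DIR/queue/audit/$task"
done

ls "$AUDIT_DIR/queue/audit"    # prints: root
```

When the loop exits, the discovery queue is empty and every crawled page sits in the audit queue, which is exactly the end state this skill must reach.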
bash "${CLAUDE_PLUGIN_ROOT}/scripts/queue.sh" status "<audit_dir>" discovery
Expected output: discovery_pending: 0, discovery_in_progress: 0
If discovery queue still has tasks, something went wrong — investigate and retry.
bash "${CLAUDE_PLUGIN_ROOT}/scripts/queue.sh" status "<audit_dir>" both
bash "${CLAUDE_PLUGIN_ROOT}/scripts/queue.sh" pacing "<audit_dir>" "<max_pages>"
Print:
📋 Discovery Complete
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Pages discovered: N
Depth 0: X pages
Depth 1: X pages
Depth 2: X pages
Depth 3: X pages
Audit queue: N pages pending
Audit mode: FULL | ADAPTIVE
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Update state.json:
{
"discovery_complete": true,
"total_pages_discovered": N,
"audit_mode": "full|adaptive",
"discovery_completed_at": "<ISO-8601>"
}
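Writing those fields can be sketched as below. Note this version overwrites state.json wholesale; a real update would merge into the existing file (e.g., with jq) to preserve other fields, and the page count here is a placeholder:

```shell
# Sketch: record discovery completion in state.json.
# total_pages_discovered (29) is a placeholder; take the real count
# from `queue.sh status`. A jq merge would preserve existing fields.
AUDIT_DIR=$(mktemp -d)
total=29
cat > "$AUDIT_DIR/state.json" <<EOF
{
  "discovery_complete": true,
  "total_pages_discovered": $total,
  "audit_mode": "full",
  "discovery_completed_at": "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
}
EOF
```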
Each rule exists because of a specific failure in previous audit versions:
Discovery MUST complete before any auditing begins. Why: Without knowing the full site structure, the audit cannot make informed pacing decisions. The Miew v1.2 audit skipped discovery and missed 21 of 29 pages.
Every page must be visited for link extraction — not just the root.
Why: Subpages contain links to other subpages not reachable from the root. For example, /work links to 13 case study pages that the homepage doesn't mention. Extracting links from only the root page misses these entirely.
sitemap.xml is checked DURING discovery, not after. Why: If checked after audit begins, sitemap-only pages are discovered late and may exceed MAX_PAGES budget. Checking during discovery ensures they're in the queue from the start.
No screenshots during discovery. Why: Screenshots are expensive (3 screenshots × 3 viewports ≈ 9 per page). During discovery, we only need link extraction and a text snapshot. Saving screenshots for the audit phase keeps discovery fast (~5-10 seconds per page vs. 1-2 minutes).
Use queue.sh add-discovery for all URLs — never maintain a separate list.
Why: The queue script handles deduplication, depth tracking, and cross-queue checks. Maintaining a parallel list risks duplicates, missed pages, and state inconsistencies.
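A plausible shape for that dedup check, for illustration only: the slug scheme and file layout below are assumptions, not queue.sh's actual implementation.

```shell
# Assumed slug scheme: strip the scheme, replace URL separators with '_',
# drop trailing underscores.
slugify() {
  printf '%s' "$1" | sed -e 's|^https://||' -e 's|^http://||' \
    -e 's|[/?&=.]|_|g' -e 's|_*$||'
}

slug=$(slugify "https://www.example.pt/work/")
echo "$slug"   # www_example_pt_work

# Dedup: skip the URL if a task with this slug exists in either queue, e.g.:
#   [ -e "$AUDIT_DIR/queue/discovery/$slug" ] || [ -e "$AUDIT_DIR/queue/audit/$slug" ]
```

Because the slug is derived deterministically from the URL, the same page always maps to the same task name, so a cross-queue existence check is enough to prevent duplicates.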
Promote pages to audit queue AFTER link extraction is complete. Why: This ensures the audit queue only contains pages that have been fully crawled for links. If a page were promoted before extraction, the orchestrator might dispatch it for audit while its child links are still unknown.
Discovery is complete when:
- queue.sh status discovery shows: pending=0, in_progress=0
- queue.sh status audit shows: pending=N (all discovered pages waiting for audit)
- state.json has discovery_complete: true