Run this skill FIRST before any website audit. Use when starting an audit, beginning a site crawl, or when the user provides a TARGET_URL. This skill MUST complete before audit-page is used. It builds both queues — discovery (for link extraction) and audit (for compliance checks) — ensuring the full site structure is known before any auditing begins.
This skill executes the DISCOVERY phase of a website audit. It crawls the entire site, extracts links from every page, and populates both queues:
- queue/discovery/: pages that need link extraction
- queue/audit/: pages ready for compliance audit

After this skill completes, the discovery queue should be empty (all pages crawled for links) and the audit queue should be full (all pages ready for auditing).
Input: TARGET_URL, the site's root URL (e.g., https://www.example.pt).

NEVER call init.sh from this skill. The orchestrator already called it and wrote the audit path to .current_audit. Read it from there:
AUDIT_DIR=$(bash "${CLAUDE_PLUGIN_ROOT}/scripts/queue.sh" get-audit-dir)
echo "Using audit directory: $AUDIT_DIR"
If get-audit-dir fails, stop immediately and tell the orchestrator: "No active audit found. The orchestrator must call init.sh before running discover-site."
Verify the directory exists and has the expected structure:
ls "$AUDIT_DIR/queue/discovery/" && ls "$AUDIT_DIR/queue/audit/" || {
echo "Error: audit directory structure is missing — init.sh may not have completed" >&2
exit 1
}
Add the root URL as the first discovery task:
bash "${CLAUDE_PLUGIN_ROOT}/scripts/queue.sh" add-discovery "<audit_dir>" "<target_url>" 0 "root"
Why this matters: Sitemaps list URLs that may not be linked from any page in the navigation. Without checking the sitemap, the crawl misses orphan pages (e.g., landing pages, old blog posts, promotional pages that are only accessible via direct links or search engines).
With the target site open (playwright-cli open "<target_url>", so the relative fetch resolves against its origin), check for a sitemap:
playwright-cli eval "await fetch('/sitemap.xml').then(r=>r.text()).catch(()=>'NOT_FOUND')"
If found, extract every <loc> URL and add to the discovery queue:
# For each URL in sitemap:
bash "${CLAUDE_PLUGIN_ROOT}/scripts/queue.sh" add-discovery "<audit_dir>" "<sitemap_url>" 1 "sitemap.xml"
The queue.sh script handles deduplication — if a URL is already in either queue, it's silently skipped.
If sitemap.xml is not found, log it and continue. Many sites don't have one.
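The <loc> extraction above can be sketched with standard text tools. The sample XML and the grep/sed approach are illustrative assumptions (a real sitemap may be nested, gzipped, or a sitemap index):

```shell
# Illustrative only: the sample XML stands in for the fetched sitemap body.
sitemap='<urlset><url><loc>https://www.example.pt/</loc></url><url><loc>https://www.example.pt/work</loc></url></urlset>'

# Pull out each <loc>...</loc> match, then strip the tags.
urls=$(printf '%s' "$sitemap" \
  | grep -o '<loc>[^<]*</loc>' \
  | sed -e 's|<loc>||' -e 's|</loc>||')
printf '%s\n' "$urls"

# Each extracted URL would then be passed to:
#   queue.sh add-discovery "<audit_dir>" "$url" 1 "sitemap.xml"
```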
Why this matters: robots.txt reveals which paths the site owner explicitly blocks from crawling. Respecting these directives is both ethical and legally relevant (especially under EU law where unauthorized access to restricted paths could be problematic). The disallowed paths are also useful audit findings — they may indicate hidden admin areas, staging environments, or content the owner wants to keep private.
playwright-cli eval "await fetch('/robots.txt').then(r=>r.text()).catch(()=>'NOT_FOUND')"
Log all Disallow paths to the audit directory. Do not crawl disallowed paths, but note them in the report.
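Parsing the Disallow directives out of the fetched robots.txt text can be sketched like this (the sample content is an assumption):

```shell
# Illustrative only: the sample text stands in for the fetched robots.txt body.
robots='User-agent: *
Disallow: /admin/
Disallow: /staging/
Allow: /'

# Keep only Disallow lines, then strip the directive prefix.
disallowed=$(printf '%s\n' "$robots" \
  | grep '^Disallow:' \
  | sed 's/^Disallow:[[:space:]]*//')
printf '%s\n' "$disallowed"

# Log these to the audit directory and exclude them from the crawl, e.g.:
#   printf '%s\n' "$disallowed" >> "<audit_dir>/robots_disallow.txt"
```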
This is the core of discovery. Process every page in the discovery queue:
WHILE discovery queue has pending tasks:
1. Claim next page:
URL = queue.sh claim "<audit_dir>" discovery orchestrator
2. Open and wait for JS rendering:
playwright-cli open "$URL"
playwright-cli eval "await new Promise(r => setTimeout(r, 2000))"
3. Extract ALL internal links:
playwright-cli eval "JSON.stringify([...document.querySelectorAll('a[href]')].map(a=>({href:a.href,text:a.textContent.trim()})).filter(l=>new URL(l.href).hostname===new URL(document.URL).hostname))"
4. Save a discovery snapshot (text only — no screenshots):
playwright-cli snapshot > "<audit_dir>/snapshots/<slug>_discovery.md"
Why snapshots during discovery: They capture the DOM structure, heading hierarchy,
and element references at the time of crawl. This is useful for debugging if link
extraction misses URLs (e.g., links inside shadow DOM or iframes).
5. Add all newly discovered internal links to the discovery queue:
FOR each link WHERE depth <= MAX_DEPTH:
queue.sh add-discovery "<audit_dir>" "<link_url>" <parent_depth+1> "<parent_url>"
The script deduplicates automatically — no need to check if URL exists.
6. Mark this page's discovery as complete:
queue.sh complete "<audit_dir>" discovery "<slug>"
7. Promote this page to the audit queue:
queue.sh promote "<audit_dir>" "<slug>"
Why promote after completing: The page has been crawled for links (discovery done)
and is now ready for compliance checks (audit pending). This two-step flow ensures
no page enters the audit queue before its links have been extracted.
8. Print progress:
queue.sh status "<audit_dir>" both
Why breadth-first order matters: Breadth-first ensures depth-0 and depth-1 pages (the most important, highest-traffic pages) are discovered and promoted first. If the crawl is interrupted, the audit queue already contains the most critical pages.
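The claim → complete → promote flow above can be illustrated with a local, file-based simulation. Plain files stand in for queue.sh's task records here; the real script's storage format is not assumed:

```shell
# Simulation only: files under queue/ model queue.sh's task records.
AUDIT_DIR=$(mktemp -d)
mkdir -p "$AUDIT_DIR/queue/discovery" "$AUDIT_DIR/queue/audit"

# Seed the discovery queue with the root task.
echo "https://www.example.pt/" > "$AUDIT_DIR/queue/discovery/root"

# Drain the discovery queue: claim, "extract links", complete, promote.
while task=$(ls "$AUDIT_DIR/queue/discovery" | head -n 1); [ -n "$task" ]; do
  url=$(cat "$AUDIT_DIR/queue/discovery/$task")
  # ...open "$url", extract links, add-discovery each new one here...
  # Completing + promoting is modeled as a move between queue dirs:
  mv "$AUDIT_DIR/queue/discovery/$task" "$AUDIT_DIR/queue/audit/$task"
done

ls "$AUDIT_DIR/queue/audit"    # prints: root
```

When the loop exits, the discovery queue is empty and every crawled page sits in the audit queue, which is exactly the end state this skill must reach.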
bash "${CLAUDE_PLUGIN_ROOT}/scripts/queue.sh" status "<audit_dir>" discovery
Expected output: discovery_pending: 0, discovery_in_progress: 0
If discovery queue still has tasks, something went wrong — investigate and retry.
bash "${CLAUDE_PLUGIN_ROOT}/scripts/queue.sh" status "<audit_dir>" both
bash "${CLAUDE_PLUGIN_ROOT}/scripts/queue.sh" pacing "<audit_dir>" "<max_pages>"
Print:
📋 Discovery Complete
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Pages discovered: N
Depth 0: X pages
Depth 1: X pages
Depth 2: X pages
Depth 3: X pages
Audit queue: N pages pending
Audit mode: FULL | ADAPTIVE
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Update state.json:
{
"discovery_complete": true,
"total_pages_discovered": N,
"audit_mode": "full|adaptive",
"discovery_completed_at": "<ISO-8601>"
}
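Writing those fields can be sketched as below. Note this version overwrites state.json wholesale; a real update would merge into the existing file (e.g., with jq) to preserve other fields, and the page count here is a placeholder:

```shell
# Sketch: record discovery completion in state.json.
# total_pages_discovered (29) is a placeholder; take the real count
# from `queue.sh status`. A jq merge would preserve existing fields.
AUDIT_DIR=$(mktemp -d)
total=29
cat > "$AUDIT_DIR/state.json" <<EOF
{
  "discovery_complete": true,
  "total_pages_discovered": $total,
  "audit_mode": "full",
  "discovery_completed_at": "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
}
EOF
```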
Each rule exists because of a specific failure in previous audit versions:
Discovery MUST complete before any auditing begins. Why: Without knowing the full site structure, the audit cannot make informed pacing decisions. The Miew v1.2 audit skipped discovery and missed 21 of 29 pages.
Every page must be visited for link extraction — not just the root.
Why: Subpages contain links to other subpages not reachable from the root. For example, /work links to 13 case study pages that the homepage doesn't mention. Extracting links from only the root page misses these entirely.
sitemap.xml is checked DURING discovery, not after. Why: If checked after audit begins, sitemap-only pages are discovered late and may exceed MAX_PAGES budget. Checking during discovery ensures they're in the queue from the start.
No screenshots during discovery. Why: Screenshots are expensive (3 screenshots × 3 viewports ≈ 9 per page). During discovery, we only need link extraction and a text snapshot. Saving screenshots for the audit phase keeps discovery fast (~5-10 seconds per page vs. 1-2 minutes).
Use queue.sh add-discovery for all URLs — never maintain a separate list.
Why: The queue script handles deduplication, depth tracking, and cross-queue checks. Maintaining a parallel list risks duplicates, missed pages, and state inconsistencies.
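A plausible shape for that dedup check, for illustration only: the slug scheme and file layout below are assumptions, not queue.sh's actual implementation.

```shell
# Assumed slug scheme: strip the scheme, replace URL separators with '_',
# drop trailing underscores.
slugify() {
  printf '%s' "$1" | sed -e 's|^https://||' -e 's|^http://||' \
    -e 's|[/?&=.]|_|g' -e 's|_*$||'
}

slug=$(slugify "https://www.example.pt/work/")
echo "$slug"   # www_example_pt_work

# Dedup: skip the URL if a task with this slug exists in either queue, e.g.:
#   [ -e "$AUDIT_DIR/queue/discovery/$slug" ] || [ -e "$AUDIT_DIR/queue/audit/$slug" ]
```

Because the slug is derived deterministically from the URL, the same page always maps to the same task name, so a cross-queue existence check is enough to prevent duplicates.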
Promote pages to audit queue AFTER link extraction is complete. Why: This ensures the audit queue only contains pages that have been fully crawled for links. If a page were promoted before extraction, the orchestrator might dispatch it for audit while its child links are still unknown.
Discovery is complete when:
- queue.sh status discovery shows: pending=0, in_progress=0
- queue.sh status audit shows: pending=N (all discovered pages waiting for audit)
- state.json has discovery_complete: true