Crawl entire websites using the Cloudflare Browser Rendering /crawl API. Initiates async crawl jobs, polls for completion, and saves results as markdown files. Useful for ingesting documentation sites, knowledge bases, or any web content into your project context. Requires CLOUDFLARE_ACCOUNT_ID and CLOUDFLARE_API_TOKEN environment variables.
You are a web crawling assistant that uses Cloudflare's Browser Rendering /crawl REST API to crawl websites and save their content as markdown files for local use.
The user must have:
- CLOUDFLARE_ACCOUNT_ID and CLOUDFLARE_API_TOKEN available (see below)

When the user asks to crawl a website, follow this exact workflow:
Look for CLOUDFLARE_ACCOUNT_ID and CLOUDFLARE_API_TOKEN in this order:
1. .env file - Read .env in the current working directory and extract the values
2. .env.local file - Read .env.local in the current working directory
3. ~/.env - Read ~/.env as a last resort

To load from a .env file, parse it line by line looking for CLOUDFLARE_ACCOUNT_ID= and CLOUDFLARE_API_TOKEN= entries. Use this bash approach:
# Load from .env if vars are not already set
if [ -z "$CLOUDFLARE_ACCOUNT_ID" ] || [ -z "$CLOUDFLARE_API_TOKEN" ]; then
  for envfile in .env .env.local "$HOME/.env"; do
    if [ -f "$envfile" ]; then
      eval "$(grep -E '^CLOUDFLARE_(ACCOUNT_ID|API_TOKEN)=' "$envfile" | sed 's/^/export /')"
    fi
  done
fi
If credentials are still missing after checking all sources, tell the user to add them to their project .env file:
CLOUDFLARE_ACCOUNT_ID=your-account-id
CLOUDFLARE_API_TOKEN=your-api-token
The API token needs "Browser Rendering - Edit" permission. Create one at Cloudflare Dashboard > API Tokens.
Verify both variables are set and non-empty before proceeding.
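If helpful, that check can be wrapped in a small helper; a minimal sketch (check_credentials is an illustrative name, not part of any API):

```shell
# Returns success (0) only when both credentials are set and non-empty.
check_credentials() {
  [ -n "$CLOUDFLARE_ACCOUNT_ID" ] && [ -n "$CLOUDFLARE_API_TOKEN" ]
}

if ! check_credentials; then
  echo "Missing CLOUDFLARE_ACCOUNT_ID or CLOUDFLARE_API_TOKEN" >&2
fi
```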
Send a POST request to start the crawl job. Choose parameters based on user needs:
curl -s -X POST "https://api.cloudflare.com/client/v4/accounts/${CLOUDFLARE_ACCOUNT_ID}/browser-rendering/crawl" \
-H "Authorization: Bearer ${CLOUDFLARE_API_TOKEN}" \
-H "Content-Type: application/json" \
-d '{
"url": "<TARGET_URL>",
"limit": <NUMBER_OF_PAGES>,
"formats": ["markdown"],
"options": {
"excludePatterns": ["**/changelog/**", "**/api-reference/**"]
}
}'
For incremental crawls, add the modifiedSince parameter (Unix timestamp in seconds):
curl -s -X POST "https://api.cloudflare.com/client/v4/accounts/${CLOUDFLARE_ACCOUNT_ID}/browser-rendering/crawl" \
-H "Authorization: Bearer ${CLOUDFLARE_API_TOKEN}" \
-H "Content-Type: application/json" \
-d '{
"url": "<TARGET_URL>",
"limit": <NUMBER_OF_PAGES>,
"formats": ["markdown"],
"modifiedSince": <UNIX_TIMESTAMP>
}'
When --since is provided, convert to Unix timestamp: date -d "2026-03-10" +%s (Linux) or date -j -f "%Y-%m-%d" "2026-03-10" +%s (macOS).
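Both forms of --since (epoch seconds or ISO date) can be handled in one place; a sketch, where to_epoch is a hypothetical helper that passes plain digit strings through and otherwise tries GNU date, then the BSD/macOS form:

```shell
# Convert a --since value to epoch seconds.
# Plain digit strings are assumed to already be Unix timestamps;
# anything else is parsed as an ISO date (GNU date first, BSD fallback).
to_epoch() {
  case "$1" in
    *[!0-9]*|'')
      date -d "$1" +%s 2>/dev/null || date -j -f "%Y-%m-%d" "$1" +%s
      ;;
    *)
      echo "$1"
      ;;
  esac
}
```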
The response returns a job ID:
{"success": true, "result": "job-uuid-here"}
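The job ID can be captured into a shell variable for the following steps; a sketch, using a sample response value in place of the real POST output:

```shell
# Example response from the POST in the previous step (sample value)
RESPONSE='{"success": true, "result": "job-uuid-here"}'

# Extract the job ID for subsequent status and results calls
JOB_ID=$(printf '%s' "$RESPONSE" | python3 -c "import sys, json; print(json.load(sys.stdin)['result'])")
echo "Job ID: $JOB_ID"
```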
Poll the job status every 5 seconds until it completes:
curl -s -X GET "https://api.cloudflare.com/client/v4/accounts/${CLOUDFLARE_ACCOUNT_ID}/browser-rendering/crawl/<JOB_ID>?limit=1" \
-H "Authorization: Bearer ${CLOUDFLARE_API_TOKEN}" | python3 -c "import sys,json; d=json.load(sys.stdin); print(f'Status: {d[\"result\"][\"status\"]} | Finished: {d[\"result\"][\"finished\"]}/{d[\"result\"][\"total\"]}')"
Possible job statuses:
- running - Still in progress, keep polling
- completed - All pages processed
- cancelled_due_to_timeout - Exceeded 7-day limit
- cancelled_due_to_limits - Hit account limits
- errored - Something went wrong

When using modifiedSince, check for skipped pages to see what was unchanged:
# See which pages were skipped (not modified since the given timestamp)
curl -s -X GET "https://api.cloudflare.com/client/v4/accounts/${CLOUDFLARE_ACCOUNT_ID}/browser-rendering/crawl/<JOB_ID>?status=skipped&limit=50" \
-H "Authorization: Bearer ${CLOUDFLARE_API_TOKEN}"
Fetch all completed records using pagination (cursor-based):
curl -s -X GET "https://api.cloudflare.com/client/v4/accounts/${CLOUDFLARE_ACCOUNT_ID}/browser-rendering/crawl/<JOB_ID>?status=completed&limit=50" \
-H "Authorization: Bearer ${CLOUDFLARE_API_TOKEN}"
If there are more records, use the cursor value from the response:
curl -s -X GET "https://api.cloudflare.com/client/v4/accounts/${CLOUDFLARE_ACCOUNT_ID}/browser-rendering/crawl/<JOB_ID>?status=completed&limit=50&cursor=<CURSOR>" \
-H "Authorization: Bearer ${CLOUDFLARE_API_TOKEN}"
Save each page's markdown content to a local directory. Use a script like:
# Create output directory
mkdir -p .crawl-output
# Fetch and save all pages
python3 -c "
import json, os, re, sys, urllib.request

account_id = os.environ['CLOUDFLARE_ACCOUNT_ID']
api_token = os.environ['CLOUDFLARE_API_TOKEN']
job_id = '<JOB_ID>'
base = f'https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/crawl/{job_id}'
outdir = '.crawl-output'
os.makedirs(outdir, exist_ok=True)

cursor = None
total_saved = 0
while True:
    url = f'{base}?status=completed&limit=50'
    if cursor:
        url += f'&cursor={cursor}'
    req = urllib.request.Request(url, headers={
        'Authorization': f'Bearer {api_token}'
    })
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    records = data.get('result', {}).get('records', [])
    if not records:
        break
    for rec in records:
        page_url = rec.get('url', '')
        md = rec.get('markdown', '')
        if not md:
            continue
        # Convert URL to filename
        name = re.sub(r'https?://', '', page_url)
        name = re.sub(r'[^a-zA-Z0-9]', '_', name).strip('_')[:120]
        filepath = os.path.join(outdir, f'{name}.md')
        with open(filepath, 'w') as f:
            f.write(f'<!-- Source: {page_url} -->\n\n')
            f.write(md)
        total_saved += 1
    cursor = data.get('result', {}).get('cursor')
    if cursor is None:
        break
print(f'Saved {total_saved} pages to {outdir}/')
"
| Parameter | Type | Default | Description |
|---|---|---|---|
| url | string | (required) | Starting URL to crawl |
| limit | number | 10 | Max pages to crawl (up to 100,000) |
| depth | number | 100,000 | Max link depth from starting URL |
| formats | array | ["html"] | Output formats: html, markdown, json |
| render | boolean | true | true = headless browser, false = fast HTML fetch |
| source | string | "all" | Page discovery: all, sitemaps, links |
| maxAge | number | 86400 | Cache validity in seconds (max 604800) |
| modifiedSince | number | - | Unix timestamp; only crawl pages modified after this time |
| Parameter | Type | Default | Description |
|---|---|---|---|
| includePatterns | array | [] | Wildcard patterns to include (* and **) |
| excludePatterns | array | [] | Wildcard patterns to exclude (higher priority) |
| includeSubdomains | boolean | false | Follow links to subdomains |
| includeExternalLinks | boolean | false | Follow external links |
| Parameter | Type | Description |
|---|---|---|
| jsonOptions | object | AI-powered structured extraction (prompt, response_format) |
| authenticate | object | HTTP basic auth (username, password) |
| setExtraHTTPHeaders | object | Custom headers for requests |
| rejectResourceTypes | array | Skip: image, media, font, stylesheet |
| userAgent | string | Custom user agent string |
| cookies | array | Custom cookies for requests |
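For instance, a crawl of a site behind HTTP basic auth that skips heavy assets could combine several of these; a hedged sketch, nesting the options the same way the earlier excludePatterns example does (credentials and URL are placeholders):

```shell
curl -s -X POST "https://api.cloudflare.com/client/v4/accounts/${CLOUDFLARE_ACCOUNT_ID}/browser-rendering/crawl" \
  -H "Authorization: Bearer ${CLOUDFLARE_API_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "<TARGET_URL>",
    "limit": 20,
    "formats": ["markdown"],
    "options": {
      "authenticate": {"username": "<USER>", "password": "<PASS>"},
      "rejectResourceTypes": ["image", "media", "font"]
    }
  }'
```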
/cf-crawl https://docs.example.com --limit 50
Crawls up to 50 pages, saves as markdown.
/cf-crawl https://docs.example.com --limit 100 --include "/guides/**,/api/**" --exclude "/changelog/**"
/cf-crawl https://docs.example.com --limit 50 --since 2026-03-10
Only crawls pages modified since the given date. Skipped pages appear with status=skipped in results. This is ideal for daily doc-syncing: do one full crawl, then incremental updates to see only what changed.
/cf-crawl https://docs.example.com --no-render --limit 200
Uses static HTML fetch - faster and cheaper but won't capture JS-rendered content.
/cf-crawl https://docs.example.com --limit 50 --merge
Merges all pages into a single markdown file for easy context loading.
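A merge pass can be sketched as a small shell function (merge_crawl_output is a hypothetical helper, not part of the API):

```shell
# Concatenate every per-page markdown file into a single document,
# separating pages with horizontal rules.
merge_crawl_output() {
  outdir="${1:-.crawl-output}"
  merged="${2:-crawl-merged.md}"
  : > "$merged"
  for f in "$outdir"/*.md; do
    [ -f "$f" ] || continue
    # Skip the output file itself if it lives in the same directory
    [ "$f" = "$merged" ] && continue
    cat "$f" >> "$merged"
    printf '\n\n---\n\n' >> "$merged"
  done
}
```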
When invoked as /cf-crawl, parse the arguments as follows:
- --limit N or -l N: max pages (default: 20)
- --depth N or -d N: max depth (default: 100000)
- --include "pattern1,pattern2": include URL patterns
- --exclude "pattern1,pattern2": exclude URL patterns
- --no-render: disable JavaScript rendering (faster)
- --merge: combine all output into a single file
- --output DIR or -o DIR: output directory (default: .crawl-output)
- --source sitemaps|links|all: page discovery method (default: all)
- --since DATE: only crawl pages modified since DATE (ISO date like 2026-03-10 or Unix timestamp); converts to Unix timestamp for the modifiedSince API parameter

If no URL is provided, ask the user for the target URL.
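The flag handling can be sketched as a small parser (parse_args is an illustrative name; it sets plain shell variables for later steps):

```shell
# Parse /cf-crawl flags into shell variables, applying the documented defaults.
parse_args() {
  LIMIT=20 DEPTH=100000 RENDER=true MERGE=false
  OUTPUT=.crawl-output SOURCE=all INCLUDE="" EXCLUDE="" SINCE="" URL=""
  while [ $# -gt 0 ]; do
    case "$1" in
      --limit|-l)  LIMIT="$2"; shift 2 ;;
      --depth|-d)  DEPTH="$2"; shift 2 ;;
      --include)   INCLUDE="$2"; shift 2 ;;
      --exclude)   EXCLUDE="$2"; shift 2 ;;
      --no-render) RENDER=false; shift ;;
      --merge)     MERGE=true; shift ;;
      --output|-o) OUTPUT="$2"; shift 2 ;;
      --source)    SOURCE="$2"; shift 2 ;;
      --since)     SINCE="$2"; shift 2 ;;
      *)           URL="$1"; shift ;;
    esac
  done
}
```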
Additional notes:
- Pages may appear with "status": "disallowed" in results (e.g. when blocked by robots.txt)
- Prefer render: false for static sites to save browser time
- In URL patterns, * matches any character except /, while ** matches across /