Use this when you need to scrape websites, extract page content, download media, or run the ArchiveBox extractors without a full ArchiveBox install. abx-dl can save many kinds of web content including txt, md, html, json, pdf, png, jpg, mp4, mp3, srt, screenshots, favicons, headers, DOM snapshots, mirrored sites, and more using the same plugin ecosystem that powers ArchiveBox.
Use this skill when an agent needs to scrape a page, extract content from a website, download media from a URL, or explain how to run abx-dl.
abx-dl exposes the extractors that power ArchiveBox without requiring a full ArchiveBox install. It is useful for website scraping, page content extraction, media downloading, text extraction, markdown export, screenshots, PDF capture, DOM capture, JSON metadata extraction, and more.
For per-plugin hook details, binary providers, and config schemas, inspect the abx-plugins repo.
When abx-dl plugins <name> is not enough, inspect the plugin's config.json and its on_BinaryRequest__*, on_CrawlSetup__*, and on_Snapshot__* hooks.
uv sync
uv run abx-dl 'https://example.com'
uvx --from abx-dl abx-dl 'https://example.com'
abx-dl plugins
abx-dl install wget ytdlp chrome
abx-dl <url> will auto-install missing dependencies by default. Use --no-install to disable that behavior.
Plugins declare their required binaries in config.json > required_binaries.
on_BinaryRequest__* hooks emit only Binary records.
abx-dl 'https://example.com'
abx-dl --plugins=title,wget,screenshot 'https://example.com'
abx-dl --dir=./runs/example 'https://example.com'
Each run writes an index.jsonl file plus plugin-specific subdirectories such as ./title/, ./wget/, ./screenshot/, and ./pdf/.
Use --dir=... to choose the output directory.
on_CrawlSetup__* hooks do not emit stdout JSONL records.
on_Snapshot__* hooks emit ArchiveResult and may also emit Snapshot and Tag.
Example:
mkdir -p /tmp/abx-run && cd /tmp/abx-run
uvx --from abx-dl abx-dl --plugins=title,wget 'https://example.com'
find . -maxdepth 2 -type f | sort
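After a run like the one above, the output directory can be summarized with standard shell tools. The sketch below is a minimal illustration and not part of abx-dl: summarize_run is a hypothetical helper name, and it assumes only that the run directory holds an index.jsonl file with one record per line plus one subdirectory per plugin.

```shell
# Hypothetical helper (not part of abx-dl): summarize an abx-dl output directory.
# Assumes index.jsonl holds one JSON record per line and that each plugin
# writes into its own subdirectory of the run directory.
summarize_run() {
  run_dir="${1:-.}"
  if [ ! -f "$run_dir/index.jsonl" ]; then
    echo "no index.jsonl in $run_dir" >&2
    return 1
  fi
  # Count non-empty lines, i.e. JSONL records (grep exits 1 on zero matches).
  records=$(grep -c . "$run_dir/index.jsonl" || true)
  echo "records: $records"
  # List plugin subdirectories; the glob expands in sorted order.
  for d in "$run_dir"/*/; do
    [ -d "$d" ] || continue
    echo "plugin dir: $(basename "$d")"
  done
}
```

Calling summarize_run /tmp/abx-run after the example run would print the record count followed by one line per plugin subdirectory.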
Useful CLI flags:
--plugins=title,wget,..., --output=video,text/html, --dir=DIR, --timeout=120, --no-install
Useful env vars:
TIMEOUT, USER_AGENT, CHECK_SSL_VALIDITY
LIB_DIR, PERSONAS_DIR, TMP_DIR
{PLUGIN}_BINARY, {PLUGIN}_ENABLED, {PLUGIN}_TIMEOUT
ABX_PLUGINS_DIR
Examples:
TIMEOUT=120 USER_AGENT='Mozilla/5.0 (abx-dl test)' abx-dl 'https://example.com'
CHROME_BINARY=/usr/bin/chromium abx-dl --plugins=screenshot,pdf 'https://example.com'
LIB_DIR=./.abx/lib PERSONAS_DIR=./.abx/personas abx-dl 'https://example.com'
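The {PLUGIN}_BINARY, {PLUGIN}_ENABLED, and {PLUGIN}_TIMEOUT names can be generated mechanically. The following sketch assumes the {PLUGIN} prefix is the upper-cased plugin name, which matches CHROME_BINARY in the example above but is otherwise an assumption. plugin_var is a hypothetical helper, not an abx-dl command.

```shell
# Hypothetical helper (not part of abx-dl): build a per-plugin env var name.
# Assumption: {PLUGIN} is the plugin name in upper case, e.g. chrome -> CHROME.
plugin_var() {
  plugin="$1"
  suffix="$2"   # BINARY, ENABLED, or TIMEOUT
  upper=$(printf '%s' "$plugin" | tr '[:lower:]' '[:upper:]')
  printf '%s_%s\n' "$upper" "$suffix"
}
```

For instance, env "$(plugin_var wget TIMEOUT)=30" abx-dl --plugins=wget 'https://example.com' would set WGET_TIMEOUT for a single run, provided the naming assumption holds for that plugin.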
User config is stored in ~/.config/abx/config.env; runtime-derived cache entries are stored in ~/.config/abx/derived.env.
config.env is user-owned only.
derived.env stores derived binary cache entries such as resolved *_BINARY paths and ABX_INSTALL_CACHE.
MachineService keeps those layers separate during a run instead of merging them together.
Use abx-dl config to inspect or save defaults:
abx-dl config
abx-dl config --get TIMEOUT
abx-dl config --set TIMEOUT=120
abx-dl config --set WGET_ENABLED=false
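To check a saved value from a script, the config file can also be read directly. This sketch assumes config.env uses plain KEY=VALUE lines, which the .env name suggests but this document does not confirm. env_file_get is a hypothetical helper; abx-dl config --get remains the documented way to read a setting.

```shell
# Hypothetical helper (not part of abx-dl): read one key from an env-style file.
# Assumption: the file contains plain KEY=VALUE lines, one per line.
# Quotes around values, if any, are returned as-is.
env_file_get() {
  file="$1"
  key="$2"
  [ -f "$file" ] || return 1
  # Print the value of the last KEY=... line; exit non-zero if the key is absent.
  awk -v k="$key" '
    index($0, k "=") == 1 { val = substr($0, length(k) + 2); found = 1 }
    END { if (!found) exit 1; print val }
  ' "$file"
}
```

A call such as env_file_get ~/.config/abx/config.env TIMEOUT would print the stored timeout if the file follows the assumed format.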
Use --dir=... to set the output directory.
Run abx-dl through uv run or uvx.
Check abx-dl plugins or run abx-dl install ... if dependency state matters.
Run abx-dl '<url>' with any needed --plugins, --timeout, or env vars.
Inspect index.jsonl and the plugin subdirectories to confirm what was produced.
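The commands in this document can be strung together in a small wrapper. This is a sketch under stated assumptions: run_capture is a hypothetical function, it uses only flags listed earlier (--dir, --plugins, --timeout), and the DRY_RUN=1 switch is an invention of this example that prints the command instead of running it.

```shell
# Hypothetical wrapper (not part of abx-dl): run one capture into its own directory.
# DRY_RUN=1 is specific to this sketch; it prints the command instead of running it.
run_capture() {
  url="$1"
  out_dir="$2"
  plugins="${3:-title,wget}"
  timeout="${4:-120}"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "abx-dl --dir=$out_dir --plugins=$plugins --timeout=$timeout '$url'"
    return 0
  fi
  # Create the directory up front rather than relying on abx-dl to do so.
  mkdir -p "$out_dir"
  abx-dl --dir="$out_dir" --plugins="$plugins" --timeout="$timeout" "$url" || return 1
  # Confirm what was produced.
  find "$out_dir" -maxdepth 2 -type f | sort
}
```

Running DRY_RUN=1 run_capture 'https://example.com' ./runs/example shows the exact command line before anything is downloaded.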