Convert HTML pages to clean, agent-friendly markdown using Readability + Turndown. Strips navigation, ads, footers, cookie banners, social CTAs. Supports URL fetch, local files, stdin, token budgeting, and output flags (--json, --no-links, --no-images). Use when fetching web pages for research, content extraction, or scraping in agent workflows, especially when token budget matters or pages have heavy boilerplate.
Aggressive HTML-to-markdown converter for AI agents. Mozilla Readability isolates main content, Turndown converts to markdown, then heavy post-processing strips remaining noise.
Full flag reference and advanced examples:
references/usage.md
git clone https://github.com/saikatkumardey/html2md ~/.claude/skills/html2md
cd ~/.claude/skills/html2md/scripts && npm install && npm link
Requires Node.js 22+. The npm link step makes html2md available globally on PATH.
Or clone into any directory for standalone use:
git clone https://github.com/saikatkumardey/html2md
cd html2md/scripts && npm install && npm link
html2md https://example.com # fetch + convert
html2md --file page.html # local HTML file
cat page.html | html2md --stdin # pipe from stdin
html2md --max-tokens 2000 https://example.com # budget-aware truncation
html2md --no-links https://example.com # strip hrefs, keep text
html2md --json https://example.com # JSON: {title, url, markdown, tokens}
Falls back to <body> when Readability returns too little content (e.g. HN's table layout).
--max-tokens N keeps all headings, fills the remaining budget in document order, and appends [truncated — N more tokens]. It uses a 1 token ≈ 4 chars heuristic.
Pass --json for programmatic use.

html2md vs. web_fetch:

| Use html2md when | Use web_fetch when |
|---|---|
| Reading pages in cron jobs / sub-agents | Quick one-off fetch in main session |
| Token budget matters (--max-tokens) | Page is a JSON/XML API endpoint |
| Heavy nav/ads/footers to strip | JS rendering not needed |
| Need JSON output | Simple pages |
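
The --max-tokens behaviour described above (keep every heading, fill the rest of the budget in document order, append a truncation marker) can be sketched roughly as follows. This is an illustrative reconstruction, not html2md's actual source: the function names, the blank-line block splitting, and the choice to stop at the first block that does not fit are all assumptions.

```javascript
// Hypothetical sketch of heading-preserving truncation; not html2md's real code.
const CHARS_PER_TOKEN = 4; // the "1 token ≈ 4 chars" heuristic

function estimateTokens(text) {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

function truncateToBudget(markdown, maxTokens) {
  const total = estimateTokens(markdown);
  if (total <= maxTokens) return markdown; // already within budget

  // Assumption: treat blank-line-separated chunks as the unit of truncation.
  const blocks = markdown.split(/\n{2,}/);
  const isHeading = (block) => /^#{1,6}\s/.test(block);

  // Pass 1: always keep headings and charge them against the budget.
  let used = 0;
  const keep = blocks.map((block) => {
    if (!isHeading(block)) return false;
    used += estimateTokens(block);
    return true;
  });

  // Pass 2: fill the remaining budget in document order.
  // Assumption: stop at the first block that does not fit.
  for (let i = 0; i < blocks.length; i++) {
    if (keep[i]) continue;
    const cost = estimateTokens(blocks[i]);
    if (used + cost > maxTokens) break;
    keep[i] = true;
    used += cost;
  }

  const kept = blocks.filter((_, i) => keep[i]).join("\n\n");
  const dropped = total - estimateTokens(kept);
  return `${kept}\n\n[truncated — ${dropped} more tokens]`;
}
```

Because headings are reserved first, a truncated page still shows its full outline, which lets an agent decide whether a second fetch with a larger budget is worthwhile.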
html2md fetches URLs and reads local files by design. If you're passing untrusted input, keep the following in mind:
--file reads any path the process can access. In agent workflows the agent controls the path, so this is equivalent to the agent using cat.
execFileSync (not execSync) is used to avoid shell injection.
# Read a Paul Graham essay within 2000 tokens
html2md --max-tokens 2000 https://paulgraham.com/greatwork.html
# HN front page as clean text, no link noise
html2md --no-links --no-images https://news.ycombinator.com
# Get token count before committing
html2md --json https://example.com | jq .tokens
# Pipe to file
html2md https://docs.example.com/api > api-docs.md
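
For callers that drive html2md from Node rather than a shell, the execFileSync advice and the --json output shape can be combined in a small wrapper. The sketch below is a hypothetical helper, not part of html2md: the names buildArgs, parseResult, and fetchPage are made up for illustration, and it assumes html2md is on PATH after npm link.

```javascript
const { execFileSync } = require("node:child_process");

// Build an argument vector for html2md. Each element reaches the process
// verbatim, so shell metacharacters in the URL are never interpreted.
function buildArgs(url, { maxTokens, noLinks } = {}) {
  const parsed = new URL(url); // throws on malformed input such as "--file"
  if (!["http:", "https:"].includes(parsed.protocol)) {
    throw new Error(`unsupported protocol: ${parsed.protocol}`);
  }
  const args = ["--json"];
  if (Number.isInteger(maxTokens) && maxTokens > 0) {
    args.push("--max-tokens", String(maxTokens));
  }
  if (noLinks) args.push("--no-links");
  args.push(parsed.href);
  return args;
}

// Pick out the documented JSON fields: {title, url, markdown, tokens}.
function parseResult(stdout) {
  const { title, url, markdown, tokens } = JSON.parse(stdout);
  return { title, url, markdown, tokens };
}

// Run html2md without a shell and return the parsed result.
function fetchPage(url, opts) {
  const stdout = execFileSync("html2md", buildArgs(url, opts), {
    encoding: "utf8",
  });
  return parseResult(stdout);
}
```

Validating the URL before building the argument list also guards against option injection, where untrusted input beginning with a dash would otherwise be read as a flag.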