Convert PDF files to Markdown using opendataloader-pdf. Extracts text, tables, headings, lists, and images with correct reading order. Use for PDF parsing, PDF to Markdown conversion, document extraction, and AI-ready data preparation.
- Use uvx opendataloader-pdf to run; no installation required
- Run uvx mdformat on the output to normalize Markdown formatting
- Follow resources/execution-protocol.md step by step.
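The normalization step can be scripted once the converter has written its files. The sketch below is only an illustration: `mdformat_commands` is a hypothetical helper, and it assumes you already know which files the conversion produced. It builds one `uvx mdformat <file>` argument list per Markdown file and skips everything else.

```python
from pathlib import Path
from typing import Iterable, List


def mdformat_commands(paths: Iterable[str]) -> List[List[str]]:
    """Return one `uvx mdformat <file>` argv per Markdown path (hypothetical helper)."""
    # Only .md files are passed to mdformat; JSON, HTML, or image outputs are skipped.
    return [["uvx", "mdformat", p] for p in paths if Path(p).suffix.lower() == ".md"]


if __name__ == "__main__":
    for cmd in mdformat_commands(["output/input.md", "output/input.json"]):
        # To execute for real: subprocess.run(cmd, check=True)
        print(" ".join(cmd))
```

mdformat rewrites the files it is given in place, so run it on a copy if the raw converter output needs to be kept.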
uvx opendataloader-pdf input.pdf
uvx opendataloader-pdf input.pdf --output-dir ./output/
uvx opendataloader-pdf file1.pdf file2.pdf folder/
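When these commands are driven from a script, the argument list can be assembled programmatically. The following is a minimal sketch, not part of opendataloader-pdf: `build_convert_command` is a hypothetical helper, and it uses only the `--output-dir` flag shown above. It builds the argument list without running anything.

```python
import shlex
from typing import List, Optional, Sequence


def build_convert_command(
    inputs: Sequence[str], output_dir: Optional[str] = None
) -> List[str]:
    """Build the argv for a `uvx opendataloader-pdf` run (hypothetical helper)."""
    if not inputs:
        raise ValueError("at least one PDF file or folder is required")
    # Inputs may be individual PDF files or folders, as in the batch example.
    cmd = ["uvx", "opendataloader-pdf", *inputs]
    if output_dir is not None:
        cmd += ["--output-dir", output_dir]
    return cmd


if __name__ == "__main__":
    cmd = build_convert_command(["file1.pdf", "file2.pdf", "folder/"], "./output/")
    # To execute for real: subprocess.run(cmd, check=True)
    print(shlex.join(cmd))
```

Passing the command as a list rather than a single string avoids shell quoting problems with file names that contain spaces.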
OCR requires the hybrid mode server. Start the server first, then run the conversion with --hybrid:
uvx opendataloader-pdf-hybrid --port 5002 --force-ocr --ocr-lang "ko,en"
uvx opendataloader-pdf --hybrid docling-fast input.pdf
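The hybrid workflow has two parts: a server process and a conversion that uses it. The sketch below mirrors the two commands above as argument lists. The helper names are hypothetical, the defaults simply repeat the values shown (port 5002, languages "ko,en", backend docling-fast), and whether the client has to be told the server's port is not stated here, so the sketch leaves that out.

```python
from typing import List, Sequence


def hybrid_server_command(
    port: int = 5002,
    ocr_langs: Sequence[str] = ("ko", "en"),
    force_ocr: bool = True,
) -> List[str]:
    """Argv for starting the hybrid server (hypothetical helper)."""
    cmd = ["uvx", "opendataloader-pdf-hybrid", "--port", str(port)]
    if force_ocr:
        cmd.append("--force-ocr")
    if ocr_langs:
        # Language codes are passed as one comma-separated value, e.g. "ko,en".
        cmd += ["--ocr-lang", ",".join(ocr_langs)]
    return cmd


def hybrid_client_command(pdf: str, backend: str = "docling-fast") -> List[str]:
    """Argv for converting one PDF through the hybrid backend (hypothetical helper)."""
    return ["uvx", "opendataloader-pdf", "--hybrid", backend, pdf]


if __name__ == "__main__":
    print(" ".join(hybrid_server_command()))
    print(" ".join(hybrid_client_command("input.pdf")))
```

In practice the server command would be started in the background (for example with `subprocess.Popen`) and given time to come up before the client command runs.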
uvx opendataloader-pdf input.pdf --image-output embedded --image-format png
uvx opendataloader-pdf input.pdf --use-struct-tree
| Format | Flag | Use case |
|---|---|---|
| Markdown | --format markdown | Default. Clean text for LLM/RAG |
| JSON | --format json | Structured data with bounding boxes |
| HTML | --format html | Web display |
| Text | --format text | Plain text extraction |
| Combined | --format markdown,json | Multiple formats at once |
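The Combined row implies that `--format` takes a comma-separated list. A small sketch of building that argument follows. `format_args` and `KNOWN_FORMATS` are hypothetical names, and the accepted values are taken only from the table above; the CLI may accept others.

```python
from typing import Iterable, List

# Format names as listed in the table above (assumed to be the full set).
KNOWN_FORMATS = ("markdown", "json", "html", "text")


def format_args(formats: Iterable[str]) -> List[str]:
    """Return the `--format` arguments for the requested output formats."""
    chosen: List[str] = []
    for name in formats:
        name = name.strip().lower()
        if name not in KNOWN_FORMATS:
            raise ValueError(f"unknown format: {name!r}")
        if name not in chosen:  # drop duplicates, keep the first-seen order
            chosen.append(name)
    if not chosen:
        return []  # no flag: rely on the default (Markdown)
    return ["--format", ",".join(chosen)]


if __name__ == "__main__":
    print(format_args(["markdown", "json"]))
```

Validating before the call gives an early, readable error instead of a failed conversion partway through a batch.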
Project-specific settings: config/pdf-config.yaml
| Issue | Solution |
|---|---|
| Garbled text in output | Try --use-struct-tree for Tagged PDFs |
| Scanned PDF (no text layer) | Use hybrid mode with --force-ocr |
| Tables not extracted properly | Use hybrid mode for complex/borderless tables |
| Non-English PDF | Add --ocr-lang with appropriate language codes |
| Large PDF (100+ pages) | Process in page ranges or use batch mode |
| Formula not extracted | Use hybrid mode with --enrich-formula |
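For the large-PDF case, splitting the work into page ranges is simple arithmetic. The sketch below only computes the ranges. `page_ranges` is a hypothetical helper, the chunk size of 50 is an arbitrary assumption, and the table does not name the option used to select pages, so passing the ranges to the CLI is left out.

```python
from typing import List, Tuple


def page_ranges(total_pages: int, chunk_size: int = 50) -> List[Tuple[int, int]]:
    """Split pages 1..total_pages into inclusive (start, end) ranges."""
    if total_pages < 0:
        raise ValueError("total_pages must not be negative")
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        (start, min(start + chunk_size - 1, total_pages))
        for start in range(1, total_pages + 1, chunk_size)
    ]


if __name__ == "__main__":
    # A 120-page document in chunks of 50 pages.
    print(page_ranges(120, 50))  # [(1, 50), (51, 100), (101, 120)]
```

Ranges are 1-based and inclusive here; adjust if the tool you feed them to counts pages from zero.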
- resources/execution-protocol.md
- config/pdf-config.yaml
- ../_shared/core/context-loading.md
- ../_shared/core/quality-principles.md