Parse, convert, chunk, and analyze documents using Docling. Use this skill when the user provides a document (PDF, DOCX, PPTX, HTML, image) as a file path or URL and wants to: extract text or structured content, convert to Markdown or JSON, chunk the document for RAG ingestion, analyze document structure (headings, tables, figures, reading order), or run quality evaluation with iterative pipeline tuning. Triggers: "parse this PDF", "convert to markdown", "chunk for RAG", "extract tables", "analyze document structure", "prepare for ingestion", "process document", "evaluate docling output", "improve conversion quality".
Use this skill to parse, convert, chunk, and analyze documents with Docling.
It handles both local file paths and URLs, and outputs either Markdown or
structured JSON (DoclingDocument).
Conversion uses the docling CLI (installed with pip install docling).
The Python API is used only for features the CLI does not expose (chunking,
VLM remote-API endpoint configuration, hybrid force_backend_text mode).
| Task | Covered |
|---|---|
| Parse PDF / DOCX / PPTX / HTML / image | ✅ |
| Convert to Markdown | ✅ |
| Export as DoclingDocument JSON | ✅ |
| Chunk for RAG (hybrid: heading + token) | ✅ (Python API) |
| Analyze structure (headings, tables, figures) | ✅ (Python API) |
| OCR for scanned PDFs | ✅ (auto-enabled) |
| Multi-source batch conversion | ✅ |
Determine whether the user supplied a local path or a URL.
The docling CLI accepts both directly.
docling path/to/file.pdf
docling https://example.com/a.pdf
Docling has two pipeline families. Pick based on document type and hardware.
| Pipeline | CLI flag | Best for | Key tradeoff |
|---|---|---|---|
| Standard (default) | --pipeline standard | Born-digital PDFs, speed | No GPU needed; OCR for scanned pages |
| VLM | --pipeline vlm | Complex layouts, handwriting, formulas | Needs GPU; slower |
See pipelines.md for the full decision matrix, OCR engine table (EasyOCR, RapidOCR, Tesseract, macOS), and VLM model presets.
# Markdown (default output)
docling report.pdf --output /tmp/
# JSON (structured, lossless)
docling report.pdf --to json --output /tmp/
# VLM pipeline
docling report.pdf --pipeline vlm --output /tmp/
# VLM with specific model
docling report.pdf --pipeline vlm --vlm-model granite_docling --output /tmp/
# Custom OCR engine
docling report.pdf --ocr-engine tesserocr --output /tmp/
# Disable OCR or tables for speed
docling report.pdf --no-ocr --output /tmp/
docling report.pdf --no-tables --output /tmp/
# Remote VLM services
docling report.pdf --pipeline vlm --enable-remote-services --output /tmp/
The CLI writes output files to the --output directory, named after the
input file (e.g. report.pdf → report.md or report.json).
CLI reference: https://docling-project.github.io/docling/reference/cli/
Use the Python API when you need features the CLI does not expose:
chunking, VLM remote-API endpoint configuration, or hybrid
force_backend_text mode.
Docling 2.81+ API note: DocumentConverter(format_options=...) expects
dict[InputFormat, FormatOption] (e.g. InputFormat.PDF → PdfFormatOption).
Using string keys like {"pdf": PdfPipelineOptions(...)} fails at runtime with
AttributeError: 'PdfPipelineOptions' object has no attribute 'backend'.
Standard pipeline (default):
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
converter = DocumentConverter()
result = converter.convert("report.pdf")
converter = DocumentConverter(
format_options={
InputFormat.PDF: PdfFormatOption(
pipeline_options=PdfPipelineOptions(do_ocr=True, do_table_structure=True),
),
}
)
result = converter.convert("report.pdf")
VLM pipeline — local (GraniteDocling via HF Transformers):
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import VlmPipelineOptions
from docling.datamodel import vlm_model_specs
from docling.pipeline.vlm_pipeline import VlmPipeline
pipeline_options = VlmPipelineOptions(
vlm_options=vlm_model_specs.GRANITEDOCLING_TRANSFORMERS,
generate_page_images=True,
)
converter = DocumentConverter(
format_options={
InputFormat.PDF: PdfFormatOption(
pipeline_cls=VlmPipeline,
pipeline_options=pipeline_options,
)
}
)
result = converter.convert("report.pdf")
VLM pipeline — remote API (vLLM / LM Studio / Ollama):
This is only available via the Python API; the CLI does not expose endpoint URL, model name, or API key configuration.
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import VlmPipelineOptions
from docling.datamodel.pipeline_options_vlm_model import ApiVlmOptions, ResponseFormat
from docling.pipeline.vlm_pipeline import VlmPipeline
vlm_opts = ApiVlmOptions(
url="http://localhost:8000/v1/chat/completions",
params=dict(model="ibm-granite/granite-docling-258M", max_tokens=4096),
prompt="Convert this page to docling.",
response_format=ResponseFormat.DOCTAGS,
timeout=120,
)
pipeline_options = VlmPipelineOptions(
vlm_options=vlm_opts,
generate_page_images=True,
enable_remote_services=True, # required — gates all outbound HTTP
)
converter = DocumentConverter(
format_options={
InputFormat.PDF: PdfFormatOption(
pipeline_cls=VlmPipeline,
pipeline_options=pipeline_options,
)
}
)
result = converter.convert("report.pdf")
Hybrid mode (force_backend_text) — Python API only:
Uses deterministic PDF text extraction for text regions while routing images and tables through the VLM. Reduces hallucination on text-heavy pages.
pipeline_options = VlmPipelineOptions(
vlm_options=vlm_model_specs.GRANITEDOCLING_TRANSFORMERS,
force_backend_text=True,
generate_page_images=True,
)
result.document is a DoclingDocument object in all cases.
Markdown (default, human-readable):
docling report.pdf --to md --output /tmp/
Or via Python: result.document.export_to_markdown()
JSON / DoclingDocument (structured, lossless):
docling report.pdf --to json --output /tmp/
Or via Python: result.document.export_to_dict()
If the user does not specify a format, ask: "Should I output Markdown or structured JSON (DoclingDocument)?"
Chunking is only available via the Python API.
Default: hybrid chunker — splits first by heading hierarchy, then subdivides oversized sections by token count. This preserves semantic boundaries while respecting model context limits.
The tokenizer API changed in docling-core 2.8.0. Pass a BaseTokenizer
object, not a raw string:
HuggingFace tokenizer (default):
from docling.chunking import HybridChunker
from docling_core.transforms.chunker.tokenizer.huggingface import HuggingFaceTokenizer
tokenizer = HuggingFaceTokenizer.from_pretrained(
model_name="sentence-transformers/all-MiniLM-L6-v2",
max_tokens=512,
)
chunker = HybridChunker(tokenizer=tokenizer, merge_peers=True)
chunks = list(chunker.chunk(result.document))
for chunk in chunks:
embed_text = chunker.contextualize(chunk)
print(chunk.meta.headings) # heading breadcrumb list
print(chunk.meta.origin.page_no) # source page number
OpenAI tokenizer (for OpenAI embedding models):
import tiktoken
from docling_core.transforms.chunker.tokenizer.openai import OpenAITokenizer
tokenizer = OpenAITokenizer(
tokenizer=tiktoken.encoding_for_model("text-embedding-3-small"),
max_tokens=8192,
)
# Requires: pip install 'docling-core[chunking-openai]'
For chunking strategies and tokenizer details, see the Docling documentation
on chunking and HybridChunker.
Use the DoclingDocument object directly to inspect structure:
doc = result.document
for item, level in doc.iterate_items():
if hasattr(item, 'label') and item.label.name == 'SECTION_HEADER':
print(f"{'#' * level} {item.text}")
for table in doc.tables:
print(table.export_to_dataframe()) # pandas DataFrame
print(table.export_to_markdown())
for picture in doc.pictures:
print(picture.caption_text(doc)) # caption if present
For the full API surface, see Docling's structure and table export docs.
After every conversion where the user cares about fidelity (not quick previews), run the bundled evaluator on the JSON export, then refine the pipeline if needed. This is how the agent checks its work and improves the run without guessing.
Step A — Produce JSON and optional Markdown
docling "<source>" --to json --output /tmp/
docling "<source>" --to md --output /tmp/
Step B — Evaluate
python3 scripts/docling-evaluate.py /tmp/<filename>.json --markdown /tmp/<filename>.md
If the user expects tables (invoices, spreadsheets in PDF), add
--expect-tables. Tighten gates with --fail-on-warn in CI-style checks.
The script prints a JSON report to stdout: status (pass | warn | fail),
metrics, issues, and recommended_actions (concrete docling CLI
flags to try next).
Step C — Refinement loop (max 3 attempts unless the user says otherwise)
status is warn or fail, apply one primary change from
recommended_actions (e.g. switch --pipeline vlm, change
--ocr-engine, ensure tables are enabled).docling, re-run scripts/docling-evaluate.py.status is pass, or after 3 iterations — then summarize what
worked and any remaining issues for the user.Step D — Self-improvement log (skill memory)
After a successful pass or after the final iteration, append one entry to improvement-log.md in this skill directory:
issues)status and one line of subjective quality notesThis log is optional for the user to git-ignore; it is for local learning so future runs on similar documents start closer to the right pipeline.
If scripts/docling-evaluate.py cannot run, still verify:
| Check | Action if bad |
|---|---|
| Page count matches source (roughly) | Re-run; try --pipeline vlm if layout is complex |
| Markdown is not near-empty | Enable OCR / VLM |
| Tables missing when visually obvious | Remove --no-tables; try --pipeline vlm |
\ufffd replacement characters | Different --ocr-engine or --pipeline vlm |
| Same line repeated many times | --pipeline vlm or hybrid force_backend_text (Python API) |
| Situation | Handling |
|---|---|
| Scanned / image-only PDF | Standard pipeline with OCR, or --pipeline vlm for best quality |
| Password-protected PDF | --pdf-password PASSWORD; will raise ConversionError if wrong |
| Very large document (500+ pages) | Standard pipeline with --no-tables for speed |
| Complex layout / multi-column | --pipeline vlm; standard may misorder reading flow |
| Handwriting or formulas | --pipeline vlm only — standard OCR will not handle these |
| URL behind auth | Pre-download to temp file; pass local path |
| Tables with merged cells | table.export_to_markdown() handles spans; VLM often more accurate |
| Non-UTF-8 encoding | Docling normalises internally; no special handling needed |
| VLM hallucinating text | force_backend_text=True via Python API for hybrid mode |
| VLM API call blocked | --enable-remote-services (CLI) or enable_remote_services=True (Python) |
| Apple Silicon | --vlm-model granite_docling with MLX backend, or GRANITEDOCLING_MLX preset (Python API) |
Full decision matrix, all OCR engine options, VLM model presets, and API server configuration: pipelines.md
status, top issues, and
which refinement attempt produced the final output.indent=2 unless the user specifies otherwise.pip install docling docling-core
# For OpenAI tokenizer support:
pip install 'docling-core[chunking-openai]'
The docling CLI is included with the docling package — no separate install needed.
Check installed versions (prefer distribution metadata — docling may not set __version__):
from importlib.metadata import version
print(version("docling"), version("docling-core"))