Skill Profile

Docling Document Intelligence Skill

Name: Docling Document Intelligence Skill
Author: docling-project

Parse, convert, chunk, and analyze documents using Docling. Use this skill when the user provides a document (PDF, DOCX, PPTX, HTML, image) as a file path or URL and wants to: extract text or structured content, convert to Markdown or JSON, chunk the document for RAG ingestion, analyze document structure (headings, tables, figures, reading order), or run quality evaluation with iterative pipeline tuning. Triggers: "parse this PDF", "convert to markdown", "chunk for RAG", "extract tables", "analyze document structure", "prepare for ingestion", "process document", "evaluate docling output", "improve conversion quality".

docling-project · 58,091 stars · April 13, 2026

Occupation
Category: Documents

Skill Content

Use this skill to parse, convert, chunk, and analyze documents with Docling. It handles both local file paths and URLs, and outputs either Markdown or structured JSON (DoclingDocument).

Conversion uses the `docling` CLI (installed with `pip install docling`). The Python API is used only for features the CLI does not expose (chunking, VLM remote-API endpoint configuration, hybrid `force_backend_text` mode).

Scope

Task	Covered
Parse PDF / DOCX / PPTX / HTML / image	✅
Convert to Markdown	✅
Export as DoclingDocument JSON	✅
Chunk for RAG (hybrid: heading + token)	✅ (Python API)
Analyze structure (headings, tables, figures)	✅ (Python API)
OCR for scanned PDFs	✅ (auto-enabled)
Multi-source batch conversion	✅

Step-by-Step Instructions


1. Resolve the input

docling path/to/file.pdf
docling https://example.com/a.pdf
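
Before calling the CLI it can help to decide whether a source is a URL or a local file and to fail early on an obviously wrong input. The sketch below is one way to do that with the standard library only. `resolve_source` and `SUPPORTED_SUFFIXES` are hypothetical names, and the image extensions in the set are an assumption, since the skill only says "image".

```python
from pathlib import Path
from urllib.parse import urlparse

# Extensions for the formats this skill lists; the image suffixes are an assumption.
SUPPORTED_SUFFIXES = {".pdf", ".docx", ".pptx", ".html", ".htm",
                      ".png", ".jpg", ".jpeg", ".tiff"}


def resolve_source(source: str) -> tuple[str, str]:
    """Classify `source` as ("url", source) or ("path", absolute_path).

    Hypothetical helper: raises ValueError for an unsupported extension
    and FileNotFoundError for a missing local file.
    """
    parsed = urlparse(source)
    if parsed.scheme in ("http", "https"):
        # URLs are passed through unchanged; docling accepts them directly.
        return "url", source
    path = Path(source).expanduser()
    if path.suffix.lower() not in SUPPORTED_SUFFIXES:
        raise ValueError(f"Unsupported file type: {path.suffix or '(none)'}")
    if not path.is_file():
        raise FileNotFoundError(path)
    return "path", str(path.resolve())
```

The returned string can be handed to `docling` as its positional argument in either case.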

2. Choose a pipeline

Pipeline	CLI flag	Best for	Key tradeoff
Standard (default)	`--pipeline standard`	Born-digital PDFs, speed	No GPU needed; OCR for scanned pages
VLM	`--pipeline vlm`	Complex layouts, handwriting, formulas	Needs GPU; slower
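
The table above can be read as a small decision rule. The following sketch encodes it. `choose_pipeline` and its parameters are hypothetical, and the fallback to the standard pipeline when no GPU is available is an assumption drawn from the "Needs GPU" tradeoff, not something the skill prescribes.

```python
def choose_pipeline(complex_layout: bool = False,
                    handwriting: bool = False,
                    formulas: bool = False,
                    gpu_available: bool = True) -> str:
    """Return a value for `--pipeline` ("standard" or "vlm")."""
    # The table lists complex layouts, handwriting, and formulas as VLM cases.
    needs_vlm = complex_layout or handwriting or formulas
    if needs_vlm and gpu_available:
        return "vlm"
    # Assumption: without a GPU, fall back to the standard pipeline,
    # which needs no GPU and applies OCR to scanned pages.
    return "standard"
```

For example, `choose_pipeline(handwriting=True)` selects the VLM pipeline, while a born-digital PDF with no special traits stays on the default.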

3. Convert the document

CLI (preferred for straightforward conversions)

# Markdown (default output)
docling report.pdf --output /tmp/

# JSON (structured, lossless)
docling report.pdf --to json --output /tmp/

# VLM pipeline
docling report.pdf --pipeline vlm --output /tmp/

# VLM with specific model
docling report.pdf --pipeline vlm --vlm-model granite_docling --output /tmp/

# Custom OCR engine
docling report.pdf --ocr-engine tesserocr --output /tmp/

# Disable OCR or tables for speed
docling report.pdf --no-ocr --output /tmp/
docling report.pdf --no-tables --output /tmp/

# Remote VLM services
docling report.pdf --pipeline vlm --enable-remote-services --output /tmp/
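
When the CLI is driven from a script, building the argument list in one place keeps the flag combinations above consistent. This is a minimal sketch that only assembles the command and uses only the flags shown above. `build_docling_command` and its parameter names are hypothetical.

```python
from typing import Optional


def build_docling_command(source: str,
                          output_dir: str = "/tmp/",
                          to: Optional[str] = None,
                          pipeline: Optional[str] = None,
                          vlm_model: Optional[str] = None,
                          ocr_engine: Optional[str] = None,
                          no_ocr: bool = False,
                          no_tables: bool = False,
                          remote_services: bool = False) -> list[str]:
    """Assemble a `docling` invocation as an argument list."""
    cmd = ["docling", source]
    if to:
        cmd += ["--to", to]                      # "md" or "json"
    if pipeline:
        cmd += ["--pipeline", pipeline]          # "standard" or "vlm"
    if vlm_model:
        cmd += ["--vlm-model", vlm_model]
    if ocr_engine:
        cmd += ["--ocr-engine", ocr_engine]
    if no_ocr:
        cmd.append("--no-ocr")
    if no_tables:
        cmd.append("--no-tables")
    if remote_services:
        cmd.append("--enable-remote-services")
    cmd += ["--output", output_dir]
    return cmd
```

The list can then be passed to `subprocess.run(cmd, check=True)`. Using a list rather than a shell string avoids quoting problems with paths that contain spaces.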

Python API (for advanced features)

from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions

# Standard pipeline with default options
converter = DocumentConverter()
result = converter.convert("report.pdf")

# Standard pipeline with OCR and table-structure recognition set explicitly
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=PdfPipelineOptions(do_ocr=True, do_table_structure=True),
        ),
    }
)
result = converter.convert("report.pdf")

# VLM pipeline with a local model preset (Granite Docling via Transformers)
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import VlmPipelineOptions
from docling.datamodel import vlm_model_specs
from docling.pipeline.vlm_pipeline import VlmPipeline

pipeline_options = VlmPipelineOptions(
    vlm_options=vlm_model_specs.GRANITEDOCLING_TRANSFORMERS,
    generate_page_images=True,
)
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=VlmPipeline,
            pipeline_options=pipeline_options,
        )
    }
)
result = converter.convert("report.pdf")

# VLM pipeline calling a remote chat-completions endpoint
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import VlmPipelineOptions
from docling.datamodel.pipeline_options_vlm_model import ApiVlmOptions, ResponseFormat
from docling.pipeline.vlm_pipeline import VlmPipeline

vlm_opts = ApiVlmOptions(
    url="http://localhost:8000/v1/chat/completions",
    params=dict(model="ibm-granite/granite-docling-258M", max_tokens=4096),
    prompt="Convert this page to docling.",
    response_format=ResponseFormat.DOCTAGS,
    timeout=120,
)
pipeline_options = VlmPipelineOptions(
    vlm_options=vlm_opts,
    generate_page_images=True,
    enable_remote_services=True,  # required — gates all outbound HTTP
)
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=VlmPipeline,
            pipeline_options=pipeline_options,
        )
    }
)
result = converter.convert("report.pdf")

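# Hybrid mode: keep the VLM's structure but take text from the PDF backend
from docling.datamodel import vlm_model_specs
from docling.datamodel.pipeline_options import VlmPipelineOptions

pipeline_options = VlmPipelineOptions(
    vlm_options=vlm_model_specs.GRANITEDOCLING_TRANSFORMERS,
    force_backend_text=True,
    generate_page_images=True,
)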

4. Choose output format

# Markdown (default output)
docling report.pdf --to md --output /tmp/

# JSON (structured, lossless)
docling report.pdf --to json --output /tmp/
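
When the conversion was done through the Python API rather than the CLI, both formats can be written from the same document object. The sketch below assumes `doc` is `result.document` from a prior `converter.convert(...)` call and relies on its `export_to_markdown()` and `export_to_dict()` methods. `save_outputs` and the file naming are hypothetical.

```python
import json
from pathlib import Path


def save_outputs(doc, out_dir: str, stem: str) -> dict[str, Path]:
    """Write `<stem>.md` and `<stem>.json` for a converted document."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    md_path = out / f"{stem}.md"
    json_path = out / f"{stem}.json"
    # Markdown is the readable view of the document.
    md_path.write_text(doc.export_to_markdown(), encoding="utf-8")
    # The dict export carries the full structured document.
    json_path.write_text(
        json.dumps(doc.export_to_dict(), ensure_ascii=False, indent=2),
        encoding="utf-8",
    )
    return {"md": md_path, "json": json_path}
```

Writing both files at once keeps the Markdown and JSON views in step, which is convenient when a later evaluation step reads the two side by side.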

5. Chunk for RAG (hybrid strategy)

from docling.chunking import HybridChunker
from docling_core.transforms.chunker.tokenizer.huggingface import HuggingFaceTokenizer

# Match the tokenizer and token limit to the embedding model
tokenizer = HuggingFaceTokenizer.from_pretrained(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    max_tokens=512,
)
chunker = HybridChunker(tokenizer=tokenizer, merge_peers=True)
chunks = list(chunker.chunk(result.document))

for chunk in chunks:
    embed_text = chunker.contextualize(chunk)  # heading-enriched text to embed
    print(chunk.meta.headings)  # heading breadcrumb list
    # Source page numbers, taken from the provenance of each item in the chunk
    pages = sorted({prov.page_no for item in chunk.meta.doc_items for prov in item.prov})
    print(pages)

# Alternative: tokenizer matched to an OpenAI embedding model
# Requires: pip install 'docling-core[chunking-openai]'
import tiktoken
from docling_core.transforms.chunker.tokenizer.openai import OpenAITokenizer

tokenizer = OpenAITokenizer(
    tokenizer=tiktoken.encoding_for_model("text-embedding-3-small"),
    max_tokens=8192,
)
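
For ingestion, each chunk usually needs to become a plain record holding the text to embed plus its metadata. The sketch below shows one possible shape. `chunks_to_records` and the record keys are hypothetical, and it assumes page numbers are read from the provenance entries of each chunk's `doc_items`.

```python
def chunks_to_records(chunker, chunks) -> list[dict]:
    """Turn chunks into plain dicts ready for embedding and indexing."""
    records = []
    for index, chunk in enumerate(chunks):
        # Collect the distinct pages that the chunk's items came from.
        pages = sorted({
            prov.page_no
            for item in chunk.meta.doc_items
            for prov in item.prov
        })
        records.append({
            "id": index,
            # contextualize() returns the heading-enriched text to embed.
            "text": chunker.contextualize(chunk),
            "headings": list(chunk.meta.headings or []),
            "pages": pages,
        })
    return records
```

Called as `chunks_to_records(chunker, chunks)`, the result is a list that can be serialized to JSON or passed to a vector store client of your choice.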

6. Analyze document structure

doc = result.document

for item, level in doc.iterate_items():
    if hasattr(item, 'label') and item.label.name == 'SECTION_HEADER':
        print(f"{'#' * level} {item.text}")

for table in doc.tables:
    print(table.export_to_dataframe())   # pandas DataFrame
    print(table.export_to_markdown())

for picture in doc.pictures:
    print(picture.caption_text(doc))     # caption if present
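
The same traversal can produce a compact summary instead of printed output, which is easier to report back to a user. This is a minimal sketch. `summarize_structure` and the shape of its return value are hypothetical, and it relies only on `iterate_items()`, `item.label.name`, `doc.tables`, and `doc.pictures` as used above.

```python
from collections import Counter


def summarize_structure(doc) -> dict:
    """Count item labels and collect the heading outline of a document."""
    label_counts = Counter()
    outline = []
    for item, level in doc.iterate_items():
        label = getattr(getattr(item, "label", None), "name", None)
        if label is None:
            continue  # skip items that carry no label
        label_counts[label] += 1
        if label == "SECTION_HEADER":
            outline.append((level, item.text))
    return {
        "labels": dict(label_counts),
        "outline": outline,
        "tables": len(doc.tables),
        "pictures": len(doc.pictures),
    }
```

A summary with zero tables for a document that visibly contains them is a useful early signal that the conversion settings need another look.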

7. Evaluate output and iterate (required for "best effort" conversions)

docling "<source>" --to json --output /tmp/
docling "<source>" --to md --output /tmp/

python3 scripts/docling-evaluate.py /tmp/<filename>.json --markdown /tmp/<filename>.md

8. Agent quality checklist (manual, if script unavailable)

Check	Action if bad
Page count matches source (roughly)	Re-run; try `--pipeline vlm` if layout is complex
Markdown is not near-empty	Enable OCR / VLM
Tables missing when visually obvious	Remove `--no-tables`; try `--pipeline vlm`
`\ufffd` replacement characters	Different `--ocr-engine` or `--pipeline vlm`
Same line repeated many times	`--pipeline vlm` or hybrid `force_backend_text` (Python API)
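
Three of the checks above (near-empty Markdown, replacement characters, repeated lines) can be approximated mechanically. The sketch below is not the skill's evaluation script, only an illustration of those checks. `quick_quality_checks` is a hypothetical name and both thresholds are arbitrary assumptions to tune per corpus.

```python
from collections import Counter


def quick_quality_checks(markdown: str,
                         min_chars: int = 200,
                         max_repeats: int = 5) -> list[str]:
    """Return the names of checklist items that the Markdown fails."""
    issues = []
    # Assumed threshold: very short output suggests OCR or VLM is needed.
    if len(markdown.strip()) < min_chars:
        issues.append("near-empty markdown")
    # U+FFFD marks characters that could not be decoded or recognised.
    if "\ufffd" in markdown:
        issues.append("replacement characters")
    # The same non-blank line appearing many times hints at a looping model.
    lines = [line.strip() for line in markdown.splitlines() if line.strip()]
    if lines:
        _, count = Counter(lines).most_common(1)[0]
        if count > max_repeats:
            issues.append("repeated line")
    return issues
```

An empty list means none of these three checks fired. Page count and missing tables still need a comparison against the source document.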

Common Edge Cases

Situation	Handling
Scanned / image-only PDF	Standard pipeline with OCR, or `--pipeline vlm` for best quality
Password-protected PDF	`--pdf-password PASSWORD`; will raise `ConversionError` if wrong
Very large document (500+ pages)	Standard pipeline with `--no-tables` for speed
Complex layout / multi-column	`--pipeline vlm`; standard may misorder reading flow
Handwriting or formulas	`--pipeline vlm` only — standard OCR will not handle these
URL behind auth	Pre-download to temp file; pass local path
Tables with merged cells	`table.export_to_markdown()` handles spans; VLM often more accurate
Non-UTF-8 encoding	Docling normalises internally; no special handling needed
VLM hallucinating text	`force_backend_text=True` via Python API for hybrid mode
VLM API call blocked	`--enable-remote-services` (CLI) or `enable_remote_services=True` (Python)
Apple Silicon	`--vlm-model granite_docling` with MLX backend, or `GRANITEDOCLING_MLX` preset (Python API)
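
For the "URL behind auth" case, the pre-download step might look like the sketch below, which uses only the standard library. `download_to_temp` is a hypothetical helper, the header contents depend on the service, and the `.bin` fallback suffix is an assumption for URLs whose path has no extension.

```python
import shutil
import tempfile
import urllib.request
from pathlib import Path
from typing import Optional
from urllib.parse import urlparse


def download_to_temp(url: str,
                     headers: Optional[dict] = None,
                     timeout: float = 60.0) -> Path:
    """Fetch `url` with optional auth headers into a temporary file."""
    # Keep the original extension so the file type stays recognisable.
    suffix = Path(urlparse(url).path).suffix or ".bin"
    request = urllib.request.Request(url, headers=headers or {})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp:
            shutil.copyfileobj(response, tmp)
    return Path(tmp.name)
```

The returned path can be passed to `docling` like any other local file. The caller is responsible for deleting it afterwards, since the file is created with `delete=False`.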

Dependencies

pip install docling docling-core
# For OpenAI tokenizer support:
pip install 'docling-core[chunking-openai]'

from importlib.metadata import version
print(version("docling"), version("docling-core"))
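
The version check above raises an error if either package is missing. A slightly more defensive variant, sketched here, reports missing packages as `None` instead. `installed_versions` is a hypothetical helper built on `importlib.metadata`.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages=("docling", "docling-core")) -> dict:
    """Map each package name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None  # not installed in this environment
    return found
```

Any `None` value in the result indicates that the corresponding `pip install` step still needs to run.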

Docling Document Intelligence Skill

Scope

Step-by-Step Instructions

1. Resolve the input

2. Choose a pipeline

3. Convert the document

CLI (preferred for straightforward conversions)

Python API (for advanced features)

4. Choose output format

5. Chunk for RAG (hybrid strategy)

6. Analyze document structure

7. Evaluate output and iterate (required for "best effort" conversions)

8. Agent quality checklist (manual, if script unavailable)

Common Edge Cases

Pipeline reference

Output conventions

Dependencies
