Complex document parsing with PaddleOCR. Intelligently converts complex PDFs and document images into Markdown and JSON files that preserve the original structure.
Use Document Parsing for:
Use Text Recognition instead for:
⛔ MANDATORY RESTRICTIONS - DO NOT VIOLATE ⛔
python scripts/vl_caller.py
If the script execution fails (API not configured, network error, etc.):
Execute document parsing:
python scripts/vl_caller.py --file-url "URL provided by user" --pretty
Or for local files:
python scripts/vl_caller.py --file-path "file path" --pretty
Optional: explicitly set file type:
python scripts/vl_caller.py --file-url "URL provided by user" --file-type 0 --pretty
- --file-type 0: PDF
- --file-type 1: image
Default behavior: save raw JSON to a temp file:
- If --output is omitted, the script saves automatically under the system temp directory
- Default path pattern: <system-temp>/paddleocr/doc-parsing/results/result_<timestamp>_<id>.json
- If --output is provided, it overrides the default temp-file destination
- If --stdout is provided, JSON is printed to stdout and no file is saved
- When a file is saved, the script reports its absolute path: Result saved to: /absolute/path/...
- Use --stdout only when you explicitly want to skip file persistence
The output JSON contains COMPLETE content with all document data:
Input type note:
Extract what the user needs from the output JSON using these fields:
- text
- result[n].markdown
- result[n].prunedResult
CRITICAL: You must display the COMPLETE extracted content to the user based on their needs.
- For full-text requests, show the entire text field
What this means:
- Draw the content from text, result[n].markdown, and result[n].prunedResult, whichever the request calls for
Example - Correct:
User: "Extract all the text from this document"
Agent: I've parsed the complete document. Here's all the extracted text:
[Display entire text field or concatenated regions in reading order]
Document Statistics:
- Total regions: 25
- Text blocks: 15
- Tables: 3
- Formulas: 2
Quality: Excellent (confidence: 0.92)
Example - Incorrect:
User: "Extract all the text"
Agent: "I found a document with multiple sections. Here's the beginning:
'Introduction...' (content truncated for brevity)"
The output JSON uses an envelope wrapping the raw API result:
{
  "ok": true,
  "text": "Full markdown/HTML text extracted from all pages",
  "result": { ... }, // raw provider response
  "error": null
}
Key fields:
- text: extracted markdown text from all pages (use this for quick text display)
- result: raw provider response object
- result[n].prunedResult: structured parsing output for each page (layout/content/confidence and related metadata)
- result[n].markdown: full rendered page output in markdown/HTML
Raw result location (default): the temp-file path printed by the script on stderr
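The envelope and the saved-path message can be consumed programmatically. The following is a minimal sketch, assuming the script writes a stderr line beginning with "Result saved to: " and that the envelope carries the ok, text, and error fields shown above. The helper names parse_saved_path, load_envelope, and envelope_text are hypothetical and are not part of the skill's scripts.

```python
import json
from pathlib import Path
from typing import Optional

# Prefix of the stderr message documented for save mode (assumed to be stable).
SAVED_PREFIX = "Result saved to: "


def parse_saved_path(stderr_text: str) -> Optional[Path]:
    """Return the saved result path found in the script's stderr, or None."""
    for line in stderr_text.splitlines():
        line = line.strip()
        if line.startswith(SAVED_PREFIX):
            return Path(line[len(SAVED_PREFIX):].strip())
    return None


def load_envelope(path: Path) -> dict:
    """Load the JSON envelope that the script saved to disk."""
    with path.open(encoding="utf-8") as fh:
        return json.load(fh)


def envelope_text(envelope: dict) -> str:
    """Return the complete text field, raising if the envelope reports failure."""
    if not envelope.get("ok"):
        raise RuntimeError(f"Document parsing failed: {envelope.get('error')}")
    return envelope.get("text") or ""
```

Checking ok before reading text keeps a failed run from being presented to the user as an empty document.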
Example 1: Extract Full Document Text
python scripts/vl_caller.py \
--file-url "https://example.com/paper.pdf" \
--pretty
Then use:
- text for quick full-text output
- result[n].markdown when page-level output is needed
Example 2: Extract Structured Page Data
python scripts/vl_caller.py \
--file-path "./financial_report.pdf" \
--pretty
Then use:
- result[n].prunedResult for structured parsing data (layout/content/confidence)
- result[n].markdown for rendered page content
Example 3: Print JSON Without Saving
python scripts/vl_caller.py \
--file-url "URL" \
--stdout \
--pretty
Then return:
- text when user asks for full document content
- result[n].prunedResult and result[n].markdown when user needs complete structured page data
You can generally assume that the required environment variables have already been configured. Only when a parsing task fails should you analyze the error message to determine whether it is caused by a configuration issue. If it is indeed a configuration problem, you should notify the user to fix it.
When API is not configured:
The error will show:
CONFIG_ERROR: PADDLEOCR_DOC_PARSING_API_URL not configured. Get your API at: https://paddleocr.com
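Since configuration should only be examined after a failure, it can help to classify the error message first. This is a small sketch that assumes other configuration errors follow the same "CONFIG_ERROR: <NAME> not configured" shape as the message above; only that one message is documented here, and is_config_error and missing_variable are hypothetical helper names.

```python
import re
from typing import Optional

# Pattern modeled on the single documented message; other variants are an assumption.
_CONFIG_ERROR_RE = re.compile(r"CONFIG_ERROR:\s*(?P<name>[A-Z0-9_]+) not configured")


def is_config_error(message: str) -> bool:
    """True when the error message is flagged as a configuration problem."""
    return message.strip().startswith("CONFIG_ERROR")


def missing_variable(message: str) -> Optional[str]:
    """Return the environment variable named in the message, if one is named."""
    match = _CONFIG_ERROR_RE.search(message)
    return match.group("name") if match else None
```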
Configuration workflow:
Show the exact error message to the user (including the URL).
Guide the user to configure securely:
- PADDLEOCR_DOC_PARSING_API_URL
- PADDLEOCR_ACCESS_TOKEN
- Optional: PADDLEOCR_DOC_PARSING_TIMEOUT
If the user provides credentials in chat anyway (accept any reasonable format), for example:
- PADDLEOCR_DOC_PARSING_API_URL=https://xxx.paddleocr.com/layout-parsing, PADDLEOCR_ACCESS_TOKEN=abc123...
- Here's my API: https://xxx and token: abc123
Then parse and validate the values:
- Extract PADDLEOCR_DOC_PARSING_API_URL (look for URLs with paddleocr.com or similar)
- Confirm that PADDLEOCR_DOC_PARSING_API_URL is a full endpoint ending with /layout-parsing
- Extract PADDLEOCR_ACCESS_TOKEN (long alphanumeric string, usually 40+ chars)
Ask the user to confirm the environment is configured.
Retry only after confirmation:
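For the parse-and-validate step of the configuration workflow, the two checks could look roughly like the sketch below. It is an illustration, not the skill's own validation: the function names are hypothetical, and the token check is only a heuristic because the 40+ character length is described as usual rather than required.

```python
import re
from urllib.parse import urlparse


def is_full_endpoint(url: str) -> bool:
    """Check that the URL is an http(s) endpoint whose path ends with /layout-parsing."""
    parsed = urlparse(url.strip())
    return (
        parsed.scheme in ("http", "https")
        and bool(parsed.netloc)
        and parsed.path.endswith("/layout-parsing")
    )


def looks_like_token(token: str, min_length: int = 40) -> bool:
    """Heuristic only: a long, purely alphanumeric string."""
    token = token.strip()
    return len(token) >= min_length and re.fullmatch(r"[A-Za-z0-9]+", token) is not None
```

A value that fails looks_like_token is worth double-checking with the user rather than rejecting outright.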
There is no file size limit for the API. For PDFs, the maximum is 100 pages per request.
Tips for large files:
For very large local files, prefer --file-url over --file-path to avoid base64 encoding overhead:
python scripts/vl_caller.py --file-url "https://your-server.com/large_file.pdf"
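To put the base64 overhead in numbers: standard base64 turns every 3 input bytes into 4 output characters, so an encoded payload is about one third larger than the file. The sketch below assumes standard padded base64; how the script actually packages a --file-path upload is not shown here, and base64_encoded_size is a hypothetical helper.

```python
import base64


def base64_encoded_size(raw_bytes: int) -> int:
    """Length in characters of the padded base64 encoding of raw_bytes bytes."""
    return 4 * ((raw_bytes + 2) // 3)


# A 30 MiB file grows to 40 MiB once encoded.
thirty_mib = 30 * 1024 * 1024
encoded = base64_encoded_size(thirty_mib)  # 41943040

# Cross-check the formula against the standard library on a small payload.
sample = b"x" * 1000
matches_stdlib = base64_encoded_size(len(sample)) == len(base64.b64encode(sample))
```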
If you only need certain pages from a large PDF, extract them first:
# Extract pages 1-5
python scripts/split_pdf.py large.pdf pages_1_5.pdf --pages "1-5"
# Mixed ranges are supported
python scripts/split_pdf.py large.pdf selected_pages.pdf --pages "1-5,8,10-12"
# Then process the smaller file
python scripts/vl_caller.py --file-path "pages_1_5.pdf"
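Before splitting, it can be useful to expand a page specification such as "1-5,8,10-12" and check it against the 100-page limit. The following sketch only illustrates the range syntax used in the commands above. It does not reproduce how split_pdf.py parses its --pages argument, and both function names are hypothetical.

```python
from typing import List

MAX_PAGES_PER_REQUEST = 100  # PDF page limit per request stated in this document


def parse_page_ranges(spec: str) -> List[int]:
    """Expand a spec like "1-5,8,10-12" into a list of 1-based page numbers."""
    pages: List[int] = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start < 1 or end < start:
                raise ValueError(f"Invalid page range: {part!r}")
            pages.extend(range(start, end + 1))
        else:
            page = int(part)
            if page < 1:
                raise ValueError(f"Invalid page number: {part!r}")
            pages.append(page)
    return pages


def fits_single_request(pages: List[int]) -> bool:
    """True when the selection stays within the per-request page limit."""
    return len(pages) <= MAX_PAGES_PER_REQUEST
```

For example, parse_page_ranges("1-5,8,10-12") yields nine pages, which fits in a single request.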
Authentication failed (403):