Build a translated PDF that closely matches the source document by separating extraction, translation, layout rebuilding, asset preservation, and PDF export into distinct steps. Use for legal, compliance, contract, policy, and form-heavy PDFs where a clean translated deliverable should approximate a professionally rebuilt document rather than a raw in-place overlay.
Use this skill when the translated document should read like a proper target-language document and still resemble the source visually.
Do not treat translation and page reconstruction as the same problem.
Translation accuracy and visual fidelity are equally important. Do not optimize one by quietly sacrificing the other.
This skill is deliberately hybrid. Some parts should be scripted, but this is not a 100% scripted workflow. The operator must inspect preview renders, notice when the chosen strategy is wrong, and switch tactics before committing to the final PDF.
The pipeline is:

1. Extraction: pull text, structure, and assets out of the source PDF.
2. Translation: produce target-language text with shared context.
3. Layout rebuilding: reconstruct pages in the target language.
4. Asset preservation: keep logos, signatures, stamps, and page art.
5. PDF export: render the final deliverable and check it.
Produce:
- `source.md`: clean reading-order text in the source language
- `blocks.json`: page-level structured content with bbox, role, alignment, list/table metadata, and inferred style
- `assets/`: extracted logos, signatures, stamps, lines, boxes, and reusable images

Extract semantic blocks, not OCR lines. Merge wrapped lines into paragraphs before translation.
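The wrapped-line merge can be sketched as follows. This is illustrative, not the extractor's actual logic; the blank-line and trailing-hyphen heuristics are assumptions:

```python
def merge_wrapped_lines(lines):
    """Merge extracted/OCR lines into paragraphs.

    Assumed heuristics: a blank line ends a paragraph, and a line
    ending in '-' is a hyphenated word continued on the next line.
    """
    paragraphs, current = [], []
    for line in lines:
        line = line.strip()
        if not line:  # blank line: paragraph boundary
            if current:
                paragraphs.append("".join(current))
                current = []
            continue
        if current and current[-1].endswith("-"):
            current[-1] = current[-1][:-1]  # rejoin hyphenated word
            current.append(line)
        else:
            current.append((" " if current else "") + line)
    if current:
        paragraphs.append("".join(current))
    return paragraphs
```

Translating whole paragraphs instead of physical lines keeps sentence context intact and avoids mid-word breaks in the target text.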
Classify each page early:
- digital: stable native text layer, mostly vector/text content
- scanned: OCR required; the page image is the main truth
- mixed: native text plus significant embedded imagery or page-art

Treat classification as a routing decision, not a label for the notes file.
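A minimal routing sketch over per-page statistics. The threshold values and the two input features are assumptions; tune them per corpus rather than treating them as the pipeline's real cutoffs:

```python
def classify_page(text_chars: int, image_area_ratio: float,
                  min_text_chars: int = 200, max_image_ratio: float = 0.4) -> str:
    """Route a page to 'digital', 'scanned', or 'mixed'."""
    has_text = text_chars >= min_text_chars
    image_heavy = image_area_ratio >= max_image_ratio
    if has_text and not image_heavy:
        return "digital"   # stable native text layer
    if not has_text:
        return "scanned"   # page image is the main truth; OCR required
    return "mixed"         # native text plus significant page-art
```

The return value is then used to pick a handling strategy, not written into the notes file.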
Preferred handling by class:
- digital: preserve the source page geometry and typography cues; overlay or rebuild depending on page-art complexity
- scanned: preserve the rendered page image and replace only the translated text regions
- mixed: preserve the visual template while translating text blocks and keeping image regions untouched

Always render preview images first. Inspect at least the first page, one dense body page, and one page with forms/tables/signatures before choosing a final layout strategy. Do this before making assumptions about whether a document should be rebuilt or overlaid.
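Picking which pages to preview can be mechanized. A sketch, assuming per-page stats with hypothetical field names (`index`, `text_chars`, `has_form_or_table`):

```python
def pick_preview_pages(pages):
    """Return page indices worth rendering first: the first page, the
    densest body page, and the first page with forms/tables/signatures."""
    picks = {pages[0]["index"]}
    picks.add(max(pages, key=lambda p: p["text_chars"])["index"])
    for p in pages:
        if p.get("has_form_or_table"):
            picks.add(p["index"])
            break
    return sorted(picks)
```

Render just these pages before committing to overlay, rebuild, or hybrid for the whole document.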
Choose a document-level fallback font baseline from those preview renders before trusting extracted font metadata:
- Pick `serif` or `sans` and record it as the document's `font_baseline`.
- Pass `--font-baseline serif|sans` to `scripts/extract_document.py`, or to `scripts/run_babel_copy.py` when you have already made the visual call.

Baseline mapping:

- `serif` -> PDF overlay fallback Times-Roman; DOCX rebuild fallback Times New Roman
- `sans` -> PDF overlay fallback helv; DOCX rebuild fallback Arial

Clean OCR noise during extraction:
Do not drop headers, footers, table headers, labels, captions, page titles, or repeated boilerplate just because they recur. If a human reader would read it, it should be translated unless it is clearly a non-translatable identifier.
If the page is a hard scan, preserve both the extracted text blocks and the rendered page image for fallback composition.
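A conservative noise-cleanup sketch. The specific fixes (ligatures, control characters, whitespace runs) are illustrative assumptions; note that it deliberately does not delete repeated lines, so recurring headers and footers survive for translation:

```python
import re
import unicodedata

LIGATURES = {"\ufb01": "fi", "\ufb02": "fl"}  # ﬁ, ﬂ

def clean_ocr_text(text: str) -> str:
    """Normalize OCR noise without dropping recurring headers/footers."""
    for lig, plain in LIGATURES.items():
        text = text.replace(lig, plain)
    text = unicodedata.normalize("NFKC", text)
    # Strip control characters (keeping \t and \n), then collapse runs
    # of spaces/tabs left behind by column layouts.
    text = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f]", "", text)
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()
```

Anything a human reader would read passes through untouched; only encoding debris is removed.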
Primary extractor:
`scripts/extract_document.py`

Run it to create:

- `source.md`
- `blocks.json`
- `assets/`
- `font_baseline` metadata for later fallback-font decisions

Produce:

- `translated.md`
- `translated_blocks.json`

Translate with context:
When the document is complex, it is acceptable to delegate specific components or blocks to codex exec or sub-agents for focused translation or inspection. Use this to speed up work, not to fragment terminology. Keep one shared glossary/context for the full document.
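One way to keep delegated translation consistent is to hand every sub-agent the same glossary plus the block's neighbors. A sketch with assumed field names (`text`) and a simple source-to-target glossary dict:

```python
def build_block_context(blocks, i, glossary, window=1):
    """Assemble a shared translation context for block i so delegated
    workers keep terminology consistent across the document."""
    lo, hi = max(0, i - window), min(len(blocks), i + window + 1)
    neighbors = [b["text"] for j, b in enumerate(blocks[lo:hi], lo) if j != i]
    terms = "; ".join(f"{src} -> {dst}" for src, dst in sorted(glossary.items()))
    return {"block": blocks[i]["text"], "context": neighbors, "glossary": terms}
```

Because every delegated call receives the same `glossary` string, splitting the work does not fragment terminology.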
If no API or local MT backend is available or desired, use a manual phrase-map flow:
- `scripts/babel_copy_manual.py prepare-blocks`
- `scripts/babel_copy_manual.py apply-blocks`
- the older `extract` and `apply` subcommands still exist

Default target: `.docx`
Use Word-style paragraph layout for legal documents unless HTML, React, canvas, or another intermediate layout format is materially better for the specific page class. The rebuilt document should approximate the source, but do not keep dirty scan backgrounds unless they are necessary to preserve meaning.
When a block lacks good font data, or the extracted source font is not embedded / not reusable, rebuild against the document font_baseline instead of blindly trusting style.font_name.
Choose layout strategy per document class:
- source-page overlay: best for branded native PDFs, complex visual templates, scan-heavy pages, arrows/flow charts, or any page where the original is the best layout template
- structural rebuild: best for forms, bordered tables, signature pages, and clean digital reports where rebuilt text improves readability without losing recognizability
- hybrid staging: acceptable when HTML/React/canvas or another intermediate format gives better control before final PDF rendering

Do not force the entire document through one strategy if that strategy is obviously wrong for part of it. Prefer modularity.
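The per-page routing above can be sketched as a small function. The feature flags (`has_form_or_table`, `heavy_page_art`) are assumptions about what extraction records, not the actual `build_final_pdf.py` logic:

```python
def choose_strategy(page_class: str, has_form_or_table: bool,
                    heavy_page_art: bool) -> str:
    """Pick overlay / rebuild / hybrid for a single page."""
    if has_form_or_table:
        return "rebuild"   # forms, bordered tables, signature pages
    if page_class == "scanned" or heavy_page_art:
        return "overlay"   # the original page is the best template
    if page_class == "digital":
        return "rebuild"   # clean digital text reads better rebuilt
    return "hybrid"        # mixed pages may need intermediate staging
```

Because the decision is per page, one document can legitimately mix all three strategies.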
Current bundled rebuild path:
- `scripts/rebuild_docx.py`
- `scripts/export_pdf.py`
- `scripts/build_final_pdf.py`
- `scripts/run_babel_copy.py`

This now supports:

- `.docx` rebuild
- `.docx` rebuild fallback for form/table/signature pages inside the same document
- `codex exec` translation -> hybrid final PDF build -> rendered comparison report -> check notes

It is still not a fully page-faithful legal-document engine, but it now covers the high-leverage structure needed for form-heavy signature pages.
Prefer a clean rebuilt page with selectively preserved assets over blanket overlay on the original scan.
Preserve extracted assets (logos, signatures, stamps, lines, and boxes) rather than redrawing them.

Fall back to keeping the original page image only when reconstruction is too risky.
For branded native PDFs that already have stable page art, prefer the original source page as the visual template and overlay translated semantic blocks onto it. This keeps logos, rules, footers, numbering, and page rhythm intact.
Overlay color must match the source background whenever possible. Sample the source page. If sampling is unreliable because of gradients, scan noise, or uneven tint, normalize the translated region cleanly instead of leaving an obvious mismatch.
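The sampling decision can be sketched as follows. The spread threshold and the white fallback are assumptions; the point is to detect when samples disagree (gradients, scan noise) and normalize cleanly instead of guessing:

```python
def sample_background(pixels, max_spread=30):
    """pixels: list of (r, g, b) samples from the source region.

    Returns the average color when samples agree, else plain white as
    a clean normalization fallback for noisy/gradient backgrounds."""
    n = len(pixels)
    avg = tuple(sum(p[i] for p in pixels) // n for i in range(3))
    spread = max(abs(p[i] - avg[i]) for p in pixels for i in range(3))
    return avg if spread <= max_spread else (255, 255, 255)
```

A uniform off-white page yields its own tint; a noisy scan region gets a deliberate clean fill rather than an obvious color mismatch.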
Always render the final PDF and compare it against the source.
Run a check step before declaring success:
`scripts/compare_rendered_pages.py`

If visual QA reveals overlapping text or any other local layout problem that does not justify changing extraction or renderer logic, use a targeted post-process override pass:
- Edit `translated_blocks.json`, not `blocks.json`.
- Add a `custom_override` only to the affected block(s).
- Numeric values in `custom_override` are treated as deltas by default.
- Deltas apply to `left`, `top`, `right`, `bottom` and to numeric style fields such as `font_size_hint`.
- Use `+12` or `-4` when you want relative movement or expansion; pass a quoted string such as `"533.0"` rather than `533.0` to set an absolute value.
- Re-run `scripts/build_final_pdf.py` and `scripts/compare_rendered_pages.py`.

Use overrides for document-specific cleanup, not for systematic bugs that should be fixed in the pipeline itself.
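A sketch of the assumed override semantics (numbers as deltas, strings as absolutes); this mirrors the rules above but is not the actual `build_final_pdf.py` implementation:

```python
def apply_override(block: dict) -> dict:
    """Apply a block's custom_override under the assumed semantics:
    bare numbers are relative deltas, quoted strings set absolutes."""
    out = dict(block)
    for key, value in block.get("custom_override", {}).items():
        if isinstance(value, str):
            out[key] = float(value)               # "533.0" -> absolute
        elif isinstance(value, (int, float)):
            out[key] = block.get(key, 0) + value  # +12 / -4 -> delta
        else:
            out[key] = value
    return out
```

So `{"left": -4}` nudges a block left by 4 points, while `{"top": "533.0"}` pins it to an exact position.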
Visually inspect the side-by-side comparison output before signing off.
Final notes must state:
- the layout strategy chosen per page or page class (overlay, rebuild, or hybrid)
- whether any `custom_override` adjustments were applied after comparison

This skill now ships its own bundled scripts:
- `scripts/core.py`: local extraction and composition primitives
- `scripts/extract_document.py`: source text, block manifest, and asset extraction
- `scripts/babel_copy_manual.py`: manual extract/apply bootstrap flow
- `scripts/rebuild_docx.py`: minimal `.docx` rebuild from translated blocks
- `scripts/export_pdf.py`: LibreOffice-based PDF export
- `scripts/build_final_pdf.py`: chooses overlay-vs-rebuild final PDF rendering per page
- `scripts/run_babel_copy.py`: preferred non-API workflow runner for full jobs
- `scripts/compare_rendered_pages.py`: side-by-side visual QA helper for review
- `scripts/translate_blocks_codex.py`: block translation through `codex exec`

Current limitation:
- the `.docx` path is not yet fully page-faithful

Read `references/pipeline.md` when setting up or extending the five-stage pipeline. Read `references/block-schema.md` when designing extraction or rebuild artifacts.