Use when converting tabletop RPG PDFs or raw OCR markdown into clean, structured manuscripts. This skill is designed for long, ugly, multi-pass recovery work: column-aware extraction, OCR artifact triage, heading and table reconstruction, paragraph repair, and quality-gated manuscript output. It is explicitly written so weaker agents can follow a disciplined workflow instead of improvising.
This skill is the repo's full recovery workflow for turning RPG PDFs and OCR scrapes into usable markdown manuscripts.
It is not just a converter recipe. It is a processing discipline:
Use this skill when the source is:
- `.raw.md` file produced from a PDF tool

Produce markdown that is:
The target is working manuscript quality, not false perfection.
Weaker agents usually fail OCR recovery in one of four ways:
Do not do that.
Treat OCR repair as forensic editorial work.
Always work in phases.
Never destroy or overwrite the source PDF or raw OCR file.
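A minimal sketch of this non-destructive rule, assuming the `.raw.md` and `.clean.md` naming this skill uses for its outputs. The helper name `start_clean_copy` is hypothetical and is not one of the repo scripts; it copies the raw extraction to a sibling working file and refuses to touch an existing one.

```python
import shutil
from pathlib import Path


def start_clean_copy(raw_path: Path) -> Path:
    """Copy book.raw.md to book.clean.md without modifying the raw file.

    Hypothetical helper: the naming convention is taken from this skill,
    the function itself is only an illustration.
    """
    suffix = ".raw.md"
    if not raw_path.name.endswith(suffix):
        raise ValueError(f"expected a {suffix} file, got {raw_path.name}")
    clean_path = raw_path.with_name(raw_path.name[: -len(suffix)] + ".clean.md")
    if clean_path.exists():
        # Never clobber earlier cleanup work; let the caller decide what to do.
        raise FileExistsError(f"{clean_path} already exists")
    shutil.copyfile(raw_path, clean_path)
    return clean_path
```

All later edits then happen in the `.clean.md` copy, so the raw extraction stays available for comparison.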
Minimum output set:
- `.raw.md` extraction output
- `.clean.md` working manuscript

Optional:

- `.ocr-report.md` audit report

Before rewriting anything substantial:
Use the audit script:
python scripts/ocr_markdown_audit.py path/to/file.raw.md
For PDF extraction:
python scripts/pdf_to_markdown.py path/to/book.pdf path/to/output-dir
Read:
- `references/ocr-artifact-taxonomy.md`
- `references/quality-gates-and-escalation.md`
- `references/triage-worksheet.md`
- `references/document-profiles.md`

Repair these before prose-level cleanup:
If you skip this order, later paragraph cleanup will blur content that should have stayed separated.
Once structure is stable:
Do not silently lore-edit ambiguous words unless you can justify them.
RPG books need domain-aware repair.
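One such repair covered in this skill is the stray leading `E` that stands in for a broken bullet glyph in spell metadata lines such as `E RANK 1` or `E RANGE: Short`. The sketch below shows one way that normalization could look. The function name and the regular expression are assumptions made for illustration, not the repo's implementation, and the pattern only handles single-word ALL-CAPS keys.

```python
import re

# Assumption: a metadata key is one ALL-CAPS word (RANK, RANGE, ...),
# optionally followed by a colon, then whitespace and a value.
_E_BULLET = re.compile(r"^E\s+([A-Z]+):?\s+(\S.*)$")


def normalize_e_bullets(text: str) -> str:
    """Rewrite 'E RANK 1' style lines as '- Rank: 1' markdown bullets."""
    out = []
    for line in text.splitlines():
        match = _E_BULLET.match(line)
        if match:
            key, value = match.groups()
            out.append(f"- {key.capitalize()}: {value}")
        else:
            # Ordinary prose, including words that merely start with 'E',
            # is left untouched.
            out.append(line)
    return "\n".join(out)


print(normalize_e_bullets("E RANK 1\nE RANGE: Short\nEvery mage knows this."))
```

The pattern is kept narrow on purpose: a line is only rewritten when it matches the metadata shape exactly, so ambiguous lines are left for manual review.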
Examples:
- A bullet glyph can be extracted as a stray `E`; in metadata lines like `E RANK 1` or `E RANGE: Short`, treat that leading `E` as a broken bullet marker, not as semantic text

Use:
- `references/table-reconstruction-manual.md`
- `references/repair-playbook.md`
- `references/high-confidence-corrections.md`
- `references/pdf-visual-comparison-and-illustrations.md`

Before you consider the pass done, verify:
Use these references in order:
1. `TODO.md`
   The phased project plan for building and improving this capability
2. `references/ocr-artifact-taxonomy.md`
   What kinds of OCR damage exist and how to recognize them
3. `references/repair-playbook.md`
   Concrete repair methods by artifact class
4. `references/table-reconstruction-manual.md`
   Table-specific heuristics and safe reconstruction rules
5. `references/quality-gates-and-escalation.md`
   When to trust automation, when to stop, and how to review results
6. `references/document-profiles.md`
   How to choose a cleanup profile before processing
7. `references/agent-turn-template.md`
   The standard turn shape weaker agents should follow
8. `references/review-checklist.md`
   How to spot-check a long cleanup pass
9. `references/triage-worksheet.md`
   How to decide automation depth
10. `references/high-confidence-corrections.md`
    Safe correction patterns and repo-specific OCR repairs
11. `references/calibration-examples.md`
    Before-and-after examples to calibrate output quality
12. `references/when-not-to-repair-automatically.md`
    Examples of where automation should stop
13. `references/repo-calibration-corpus.md`
    Real raw-to-clean examples from this repository
14. `references/pdf-visual-comparison-and-illustrations.md`
    How to compare visually against PDFs and preserve illustrations safely

When the user asks to process a document:
- `.clean.md` working file.
- `markdownlint` on the cleaned file.
- `Chapter N` assumptions.
- For `Spells & Sorcerers`, normalize OCR spell metadata that begins with a leading `E` into plain markdown bullets such as `- Rank:` and `- Range:` instead of preserving the broken glyph surrogate.
- `/illustrations` by default, preserve transparency by default, and insert at original source position by default unless the user specifies otherwise.

Preferred extractor:

- `pymupdf4llm`

Why:
Primary script:
python scripts/pdf_to_markdown.py path/to/book.pdf path/to/output-dir --profile supplement
Audit script:
python scripts/ocr_markdown_audit.py path/to/file.raw.md
python scripts/ocr_markdown_audit.py path/to/file.raw.md path/to/file.clean.md
Flattened table helper:
python3 scripts/repair_flattened_tables.py path/to/file.clean.md --write
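To illustrate what a flattened table is, here is a toy sketch. It is not the logic of `repair_flattened_tables.py`, whose heuristics are not shown in this document. It assumes the extractor emitted one cell per line and that the column count is already known; the function name and the sample cells are made up for the example.

```python
def rebuild_table(cells: list[str], n_cols: int) -> str:
    """Regroup a flat list of cells into a markdown table.

    The first n_cols cells are treated as the header row. If the cell
    count does not divide evenly, the input is rejected instead of guessed at.
    """
    if n_cols < 1 or len(cells) < n_cols or len(cells) % n_cols != 0:
        raise ValueError("cell count does not fit the column count; review by hand")
    rows = [cells[i : i + n_cols] for i in range(0, len(cells), n_cols)]
    header, body = rows[0], rows[1:]
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


# Made-up sample: a two-column table that was flattened to one cell per line.
print(rebuild_table(["Rank", "Cost", "1", "5 gp", "2", "20 gp"], 2))
```

Raising on a mismatched cell count reflects the skill's general stance: when the structure is not clear, stop and review instead of forcing a table.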
Section split helper:
python3 scripts/split_markdown_sections.py path/to/file.clean.md output-dir --level 2
python3 scripts/split_markdown_sections.py path/to/file.clean.md output-dir --pattern '^## '
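As a rough picture of what splitting on `'^## '` means, the sketch below cuts a markdown string wherever a line matches that pattern. It is an illustration under assumptions, not the code of `split_markdown_sections.py`: the function name is hypothetical, it works on strings instead of files, and it does not skip headings that appear inside fenced code blocks.

```python
import re


def split_sections(markdown: str, pattern: str = r"^## ") -> list[str]:
    """Split markdown into chunks, starting a new chunk at each matching line."""
    heading = re.compile(pattern)
    sections: list[str] = []
    current: list[str] = []
    for line in markdown.splitlines():
        if heading.match(line) and current:
            # Close the previous chunk before starting the new section.
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))
    return sections


doc = "# Book\nintro\n## One\na\n### Sub\nb\n## Two\nc"
for chunk in split_sections(doc):
    print(repr(chunk))
```

Because the pattern requires a space after exactly two `#` characters at the start of the line, level-3 headings such as `### Sub` stay inside their parent section.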
Rendered PDF comparison:
pdftotext -layout -f START_PAGE -l END_PAGE path/to/book.pdf -
Use smaller page windows when mixed columns or sidebars make a large extract hard to interpret reliably.
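One way to work in smaller page windows is to generate the `START_PAGE` and `END_PAGE` pairs up front and run `pdftotext` once per window. The sketch below assumes `pdftotext` is installed and on the `PATH`; the function names and the window size are illustrative choices, not part of this repo.

```python
import subprocess


def page_windows(start: int, end: int, size: int) -> list[tuple[int, int]]:
    """Return inclusive (first, last) page pairs covering start..end."""
    if size < 1 or start < 1 or end < start:
        raise ValueError("need size >= 1 and 1 <= start <= end")
    return [
        (first, min(first + size - 1, end))
        for first in range(start, end + 1, size)
    ]


def extract_window(pdf_path: str, first: int, last: int) -> str:
    """Run pdftotext in layout mode for one page window and return its text."""
    cmd = ["pdftotext", "-layout", "-f", str(first), "-l", str(last), pdf_path, "-"]
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout


print(page_windows(10, 24, 5))
```

Each window's text can then be compared against the matching stretch of the cleaned manuscript before moving on to the next one.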
Available profiles:
- `default`
- `corebook`
- `supplement`
- `spell-compendium`
- `bestiary`
- `lifepath-generator`

Use:
./.tools/markdownlint/node_modules/.bin/markdownlint path/to/file.clean.md
When time is limited, fix in this order:
A clean manuscript does not mean ornate prose or aggressive rewriting.
It means:
If a weaker agent follows this skill correctly, it should be able to produce a clean working manuscript that a stronger agent can refine instead of rebuild.