Merge scanned book Part PDFs into whole books, apply Extra page replacements, detect split boundaries from the contents and page images, and export per-section or per-unit PDFs with the book name prefixed. Use this when the user wants scanned study books processed from filenames like "<book> Part <n>.pdf" and "<book> Extra <n> Page <xxx>.pdf".
Use this skill for scanned study-book PDFs that follow the Part / Extra filename pattern.
This workflow covers two recurring tasks:
- Merge Part files into one whole-book PDF.
- Apply Extra files as replacements for book pages, then split the whole book into atomic units or sections.
Prefer these local tools when available:
- pdfunite for merging Part PDFs in order
- qpdf for rebuilding PDFs from selected page ranges
- pdfinfo for total page counts
- pdftoppm for rendering pages to images
- tesseract for OCR on contents pages or title pages
If image inspection is needed, render contact sheets and inspect the page images instead of guessing from OCR alone.
Treat scanned-book boundary detection as a visual layout task first and an OCR task second.
The primary way to "read" the first pages of a scanned book is:
1. Render a small range of pages with pdftoppm.
2. Read Contents, divider pages, titles, and visible printed page numbers from the rendered images.
In other words, prefer human-style visual inspection of rendered page images over trying to turn the whole scan into machine-readable text first.
Do not start by OCRing large page ranges. For scanned books, that is usually slower and less reliable than inspecting a small set of rendered page images. The fastest dependable path is usually:
1. Start from pdfinfo, a small render, or a user-provided number you verify.
2. Render PDF pages 1-12.
3. Read the Contents page visually.
If pdftotext returns blank output or mostly form-feed characters, assume the PDF is image-only and pivot immediately to rendered-page inspection.
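The blank-output check can be sketched as a small filter. This is a minimal sketch, not part of the skill itself: text_is_blank is a hypothetical helper, and the 20-character threshold is an arbitrary assumption for "mostly form feeds".

```shell
# Hypothetical helper: succeeds when stdin holds fewer than MIN
# non-whitespace characters. Form feeds count as whitespace, so output
# that is only page breaks is treated as blank.
# MIN defaults to 20, which is an arbitrary assumption.
text_is_blank() {
  min=${1:-20}
  count=$(( $(tr -d '[:space:]' | wc -c) ))
  [ "$count" -lt "$min" ]
}

# Intended use against a real file (not run here):
#   pdftotext -f 1 -l 5 "book.pdf" - | text_is_blank && echo "image-only"

# Simulated pdftotext output: form feeds only, then a page with real text.
if printf '\f\f\f\n' | text_is_blank; then
  echo "blank: switch to rendered pages"
fi
if ! printf 'Contents\nChapter 1 Numbers ........ 1\n' | text_is_blank; then
  echo "has text: embedded text may be usable"
fi
```

Checking only the first few pages keeps the probe cheap; a non-blank result still says nothing about text quality, so spot-check it against a rendered page.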
Use this workflow by default unless the PDF already has high-quality embedded text.
The goal is to find a single, consistent offset
offset = pdf_page - printed_page
for the whole book, then reuse it everywhere (including Extra replacements).
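As an illustration of reusing one offset for Extra replacements, the sketch below maps an Extra file's printed page to the PDF pages it would replace. It assumes the skill's rule that each Extra PDF replaces 2 consecutive printed pages; replacement_plan, the offset 4, and the 320-page total are hypothetical values chosen for the example.

```shell
# Hypothetical helper: print the page ranges needed to rebuild a book with
# one Extra PDF swapped in. Arguments: PRINTED_PAGE OFFSET TOTAL_PAGES.
# Assumes the Extra PDF has exactly 2 pages.
replacement_plan() {
  printed=$1; offset=$2; total=$3
  first=$(( printed + offset ))   # PDF page of the first replaced page
  last=$(( first + 1 ))           # PDF page of the second replaced page
  if [ "$first" -gt 1 ]; then
    echo "whole 1-$(( first - 1 ))"
  fi
  echo "extra 1-2"
  if [ "$last" -lt "$total" ]; then
    echo "whole $(( last + 1 ))-$total"
  fi
  return 0
}

# Extra "Page 187" with offset 4 in a 320-page book:
replacement_plan 187 4 320
# prints:
#   whole 1-190
#   extra 1-2
#   whole 193-320

# The plan corresponds to a qpdf call along these lines (not run here):
#   qpdf --empty --pages "Book.pdf" 1-190 "Book Extra 1 Page 187.pdf" 1-2 \
#     "Book.pdf" 193-320 -- "Book.new.pdf"
```

Because two pages go out and two come in, the rebuilt file should have the same page count as the original.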
In interactive runs, do not run pdftoppm, OCR, or offset math until you have gone through the questions below in the conversation with the user. Describing this section internally is not enough — you must actually ask (or clearly restate what you need and wait for a reply when the task requires user input). If the user has already answered in the same thread, reuse those answers. If there is no reply and the task must proceed, infer the offset using §1b/§1c, then report that fallback and confidence in the final response.
Question A — offset known? Ask whether they already know the offset for this book, using the same definition as everywhere else in this skill:
offset = pdf_page - printed_page
(one number for the whole book). Example phrasing: “Do you already know the printed-page ↔ PDF-page offset for this book (i.e. pdf_page = printed_page + offset)? If yes, what is offset?”
After Question A:
If they give a number: treat it as a hypothesis, not gospel. Verify it with a few spot checks (e.g. find printed page 1 or match a contents line to a rendered page). If checks agree, use their offset. If not, say what failed and only then fall back to inferring (§1b–§1c) or ask them to double-check.
If they do not know the offset: ask Question B before you infer.
Question B — where are page numbers? Ask where the printed page number usually appears on each page (e.g. footer vs header, left/center/right, odd/even pages different, or “no printed numbers on divider pages”). Example phrasing: “Where does the printed page number usually appear in this scan — footer, header, which corner, and does it differ on odd/even pages?”
After Question B:
If they answer: use that layout hint when rendering small ranges — look in the stated region first when matching printed numbers to PDF pages.
If they skip or decline (no answer): infer where page numbers appear by visual inspection of rendered pages (see §2 for rendering; use §1b or §1c for the offset).
For many workbooks, the offset is small (e.g. 2–4 pages of front matter). Instead of OCRing, do this:
Render just the first few PDF pages to images:
pdftoppm -f 1 -l 8 -png "book.pdf" /tmp/book-front
Visually scan those PNGs and find the first page whose printed footer/header says 1.
Let k be that PDF page index (e.g. k = 5 if book-front-005.png shows printed page 1; the amount of zero-padding in the filename can vary with the PDF's page count).
Then the book-wide offset is:
offset = k - 1
Use that same offset to map any printed page p to its PDF page:
pdf_page = p + offset
This is usually faster and more reliable than trying to OCR page numbers.
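The arithmetic above can be sketched in a few lines. page_from_png is a hypothetical helper that assumes the rendered filename ends in a hyphen, the zero-padded page number, and .png.

```shell
# Hypothetical helper: recover the PDF page number from a pdftoppm output
# name such as /tmp/book-front-005.png (the digits after the last hyphen).
page_from_png() {
  name=${1##*/}        # drop the directory
  name=${name%.png}    # drop the extension
  digits=${name##*-}   # keep what follows the last hyphen
  echo "$digits" | sed 's/^0*//'   # strip zero padding
}

# Suppose book-front-005.png is the page that shows printed page 1.
k=$(page_from_png /tmp/book-front-005.png)
offset=$(( k - 1 ))
echo "k=$k offset=$offset"
echo "printed 40 -> PDF page $(( 40 + offset ))"
# prints:
#   k=5 offset=4
#   printed 40 -> PDF page 44
```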
You can also use the contents page together with visible printed page numbers on real content pages:
- Match a contents entry to the PDF page that visibly shows its printed start page p.
- Compute offset = pdf_page - printed_page from that match.
Verify the offset from at least two or three spots, not just the first chapter. Once verified, convert all printed starts with:
pdf_start = printed_start + offset
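One way to make the multi-spot verification mechanical is to record each observation as PDF_PAGE:PRINTED_PAGE and require that all of them imply the same offset. consistent_offset is a hypothetical helper, and the observation values are made up for illustration.

```shell
# Hypothetical helper: each argument is PDF_PAGE:PRINTED_PAGE as read from a
# rendered page. Prints the shared offset, or fails on the first mismatch.
consistent_offset() {
  expected=""
  for pair in "$@"; do
    pdf=${pair%%:*}
    printed=${pair##*:}
    off=$(( pdf - printed ))
    if [ -z "$expected" ]; then
      expected=$off
    elif [ "$off" -ne "$expected" ]; then
      echo "mismatch at PDF page $pdf (printed $printed): offset $off, expected $expected" >&2
      return 1
    fi
  done
  echo "$expected"
}

# Three spots from different parts of the book that agree:
consistent_offset 23:19 44:40 145:141   # prints 4

# A disagreement means the offset is not book-wide; recheck before splitting.
consistent_offset 23:19 44:41 || echo "recheck the observations"
```

A mismatch deep in the book often points to unnumbered inserts or a missing scanned page, so it is worth rendering the pages around the disagreement.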
Start with the total page count and a small render window. Do not render the whole book yet.
pdfinfo "book.pdf"
pdftoppm -f 1 -l 12 -png "book.pdf" /tmp/book-scan
Look through those rendered pages for:
- Preface
- Contents
- Section A
The contents page often gives most or all unit starts immediately. Read it from the rendered page image first.
When the page is legible to the eye but OCR is noisy, continue reading the rendered image directly instead of forcing more OCR. The rendered PNG is the source of truth unless there is a specific reason to extract text.
Do not bulk-OCR the first 20, 50, or 200 pages just because the book is scanned. That usually wastes time and still leaves ambiguity.
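A small sketch of the "page count first, small window second" approach follows. page_count and front_window are hypothetical helpers; the sample text stands in for real pdfinfo output, and the 12-page cap mirrors the earlier render example.

```shell
# Hypothetical helper: read pdfinfo output on stdin, print the page count.
page_count() {
  awk '/^Pages:/ { print $2 }'
}

# Hypothetical helper: print "FIRST LAST" for a first render window of at
# most MAX pages (12 by default), clamped to the size of the book.
front_window() {
  total=$1
  max=${2:-12}
  if [ "$total" -lt "$max" ]; then last=$total; else last=$max; fi
  echo "1 $last"
}

# Sample excerpt standing in for: pdfinfo "book.pdf"
sample='Title:          Example
Pages:          320
Page size:      595 x 842 pts (A4)'

total=$(printf '%s\n' "$sample" | page_count)
window=$(front_window "$total")
echo "pdftoppm -f ${window% *} -l ${window#* } -png \"book.pdf\" /tmp/book-scan"
```

Printing the command instead of running it keeps the sketch independent of any particular PDF; widen the window only if the contents page is not within it.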
Only OCR a specific page when there is a specific reason to extract text from it.
Example targeted OCR:
tesseract /tmp/book-scan-004.png stdout --psm 6
Do not trust a single early match. Render a few deeper candidate start pages from different parts of the book and confirm they really are the expected unit starts.
Example:
pdftoppm -f 23 -l 23 -png "book.pdf" /tmp/check-ch2
pdftoppm -f 44 -l 44 -png "book.pdf" /tmp/check-ch3
pdftoppm -f 145 -l 145 -png "book.pdf" /tmp/check-section-b
This is usually faster and more accurate than OCRing large ranges.
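The spot-check renders can be generated from printed start pages once an offset is known. spot_check_cmds is a hypothetical helper that only prints the commands; the offset 4 and the printed pages 19, 40, and 141 are assumed values that reproduce the PDF pages in the example above.

```shell
# Hypothetical helper: print one single-page pdftoppm command per printed
# start page so each render can be reviewed before it is run.
# Usage: spot_check_cmds PDF OUT_PREFIX OFFSET PRINTED_PAGE...
spot_check_cmds() {
  pdf=$1; prefix=$2; offset=$3
  shift 3
  for printed in "$@"; do
    page=$(( printed + offset ))
    printf 'pdftoppm -f %s -l %s -png "%s" "%s-%s"\n' \
      "$page" "$page" "$pdf" "$prefix" "$page"
  done
}

spot_check_cmds "book.pdf" /tmp/check 4 19 40 141
# prints:
#   pdftoppm -f 23 -l 23 -png "book.pdf" "/tmp/check-23"
#   pdftoppm -f 44 -l 44 -png "book.pdf" "/tmp/check-44"
#   pdftoppm -f 145 -l 145 -png "book.pdf" "/tmp/check-145"
```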
Books often include section title pages such as Section A immediately before the first chapter content page.
Do not assume those divider pages must be split separately. If the printed page numbering and user intent suggest it, group a section divider page with the following chapter or unit.
State that grouping decision explicitly in the final response or manifest.
OCR is a fallback for specific pages, not the default engine for the whole task.
Good OCR use:
- Checking a specific page for a heading such as Answers or Solutions
Poor OCR use:
If pdftotext is blank: stop text extraction and switch to images.
To merge Part files:
1. Group Part files by shared basename before Part <n>.pdf.
2. Merge the Part files in Part-number order.
3. Save the result as <book name>.pdf.
4. Leave Part and Extra files untouched unless the user later asks to trash them.
Use this when files match "<book name> Extra <n> Page <xxx>.pdf".
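Matching the Part and Extra filename patterns can be sketched with plain parameter expansion. classify_pdf is a hypothetical helper, and the tab-separated output format is an assumption made for this example only.

```shell
# Hypothetical helper: classify a filename by the Part / Extra patterns and
# print tab-separated fields. Names matching neither pattern print "other".
#   Part  -> part<TAB>book<TAB>n
#   Extra -> extra<TAB>book<TAB>n<TAB>page
classify_pdf() {
  base=${1##*/}
  base=${base%.pdf}
  case $base in
    *" Extra "*" Page "*)
      book=${base% Extra *}
      rest=${base##* Extra }
      printf 'extra\t%s\t%s\t%s\n' "$book" "${rest%% Page *}" "${rest##* Page }"
      ;;
    *" Part "*)
      printf 'part\t%s\t%s\n' "${base% Part *}" "${base##* Part }"
      ;;
    *)
      echo "other"
      ;;
  esac
}

# Order Part files by book name, then numerically by part number.
tab=$(printf '\t')
for f in "Algebra Part 10.pdf" "Algebra Part 2.pdf" "Algebra Part 1.pdf" \
         "Algebra Extra 1 Page 187.pdf"; do
  classify_pdf "$f"
done | awk -F '\t' '$1 == "part"' | sort -t "$tab" -k2,2 -k3,3n
```

Sorting the part number numerically matters: a plain lexical sort would put Part 10 before Part 2 and merge the book out of order.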
Rules:
- The Page <xxx> number is the first printed book page to replace.
- Each Extra PDF replaces 2 consecutive printed book pages.
- Example: Page 187 replaces book pages 187-188.
Procedure:
0. If the user gave an offset, verify it; if not, infer and verify using contents pages, visible page numbers, or page images (§1b–§1c).
- Leave the Extra source PDFs in place.
- Decide whether Preface + TOC should be split separately or combined.
- For Answers or Solutions, verify the actual transition page visually before splitting.
Merged file:
- <book name>.pdf
Split folder:
- <book name> split
Split files:
- <book name> - 00 Preface + TOC.pdf
- <book name> - 01 ...
Keep zero-padded numeric prefixes so files sort naturally.
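A split plan with zero-padded names can be derived from printed unit starts, the offset, and the total page count. The sketch below is illustrative only: split_plan is a hypothetical helper, it assumes the input is sorted by start page, treats everything before the first unit as Preface + TOC, and lets the last unit run to the end of the book.

```shell
# Hypothetical helper. Usage: split_plan BOOK OFFSET TOTAL < units.tsv
# Each input line is PRINTED_START<TAB>TITLE, sorted by start page.
# Output lines are PDF_RANGE<TAB>FILENAME.
split_plan() {
  awk -F '\t' -v book="$1" -v offset="$2" -v total="$3" '
    { start[NR] = $1 + offset; title[NR] = $2 }
    END {
      if (NR > 0 && start[1] > 1)
        printf "1-%d\t%s - 00 Preface + TOC.pdf\n", start[1] - 1, book
      for (i = 1; i <= NR; i++) {
        stop = (i < NR) ? start[i + 1] - 1 : total
        printf "%d-%d\t%s - %02d %s.pdf\n", start[i], stop, book, i, title[i]
      }
    }'
}

printf '1\tChapter 1\n19\tChapter 2\n40\tChapter 3\n' | split_plan "Algebra" 4 60
# prints (tab-separated):
#   1-4    Algebra - 00 Preface + TOC.pdf
#   5-22   Algebra - 01 Chapter 1.pdf
#   23-43  Algebra - 02 Chapter 2.pdf
#   44-60  Algebra - 03 Chapter 3.pdf

# Each line maps to a qpdf call along these lines (not run here):
#   qpdf "Algebra.pdf" --pages . 5-22 -- "Algebra split/Algebra - 01 Chapter 1.pdf"
```

If section dividers or an Answers section need their own files, add them as extra input lines after confirming their start pages visually.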
After merge or split:
- Confirm page counts with pdfinfo.
- For Extra replacements, verify the inserted pages visually or by rendered-image comparison.
If a book has already been analyzed, save or reuse a lightweight per-book boundary manifest if helpful. Generate a draft automatically when possible; do not require the user to hand-author one unless the structure is ambiguous.
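A draft manifest can be sanity-checked before any PDF is written. The format here (one PDF_RANGE<TAB>NAME line per output file) is an assumption for illustration, and check_manifest is a hypothetical helper that confirms the ranges run from page 1 to the last page with no gaps or overlaps.

```shell
# Hypothetical helper. Usage: check_manifest TOTAL_PAGES < manifest.tsv
# Succeeds only if the ranges are contiguous and cover pages 1..TOTAL.
check_manifest() {
  awk -F '\t' -v total="$1" '
    BEGIN { expected = 1 }
    {
      split($1, r, "-")
      if (r[1] + 0 != expected || r[2] + 0 < r[1] + 0) {
        printf "line %d: bad range %s (expected start %d)\n", NR, $1, expected
        bad = 1
      }
      expected = r[2] + 1
    }
    END {
      if (!bad && expected != total + 1) {
        printf "ranges end at %d, book has %d pages\n", expected - 1, total
        bad = 1
      }
      exit (bad ? 1 : 0)
    }'
}

printf '1-4\tfront\n5-22\tch1\n23-60\tch2\n' | check_manifest 60 \
  && echo "manifest covers pages 1-60"

# A gap (page 5 missing) is reported and the check fails.
printf '1-4\tfront\n6-10\tch1\n' | check_manifest 10 || echo "manifest needs fixing"
```

The total passed in should come from pdfinfo on the merged book, so the same number backs both the manifest check and the post-split page count check.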