Name: Book Pdf Extract
Author: shuff57

Extract content from PDF textbooks into markdown suitable for the bookSHelf pipeline. The primary method is Docling, which provides layout-aware extraction with image support and math formula detection. LiteParse (local, CLI-based) and pdfplumber (fallback) are also supported.

Prerequisites

Python 3.8+
PDF textbook file (.pdf) to process
LlamaParse: LlamaCloud API key for cloud extraction (via the --llamaparse-key flag or the LLAMA_CLOUD_API_KEY environment variable)
Docling: pip install docling (optional, best for complex layouts)
pdfplumber: pip install pdfplumber (fallback, simpler text extraction)
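
Because the local backends are optional installs, it can help to check which ones are importable before choosing a method. The following is a minimal sketch, not part of the skill's own scripts: the function name `available_backends` is hypothetical, and it assumes the import names match the pip package names (`docling`, `pdfplumber`).

```python
import importlib.util
from typing import List

# Import names assumed to match the pip package names listed above.
BACKENDS = ["docling", "pdfplumber"]


def available_backends(candidates: List[str] = BACKENDS) -> List[str]:
    """Return the candidate modules that are importable, keeping preference order."""
    return [
        name
        for name in candidates
        if importlib.util.find_spec(name) is not None
    ]


if __name__ == "__main__":
    found = available_backends()
    print("Installed extraction backends:", found or "none")
```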

LlamaParse API Key Options

You can provide the LlamaParse API key in two ways:

Command-line flag: Pass the key with --llamaparse-key
Environment variable: Set LLAMA_CLOUD_API_KEY in your environment
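
One way a script might resolve the key from the `--llamaparse-key` flag or the `LLAMA_CLOUD_API_KEY` environment variable is sketched below. The function name `resolve_llamaparse_key` is hypothetical, and letting the flag take precedence over the environment variable is an assumption; the source does not state the order.

```python
import argparse
import os
from typing import List, Mapping, Optional


def resolve_llamaparse_key(
    argv: Optional[List[str]] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Return the API key from the flag, then the environment, else None."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--llamaparse-key", default=None)
    # parse_known_args ignores any other options the real script may define.
    args, _unknown = parser.parse_known_args(argv)
    # Assumption: an explicit flag wins over the environment variable.
    return args.llamaparse_key or env.get("LLAMA_CLOUD_API_KEY") or None
```

A `None` result is the signal to skip LlamaParse and use one of the local methods instead.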

Error Handling

Problem	Action
PDF file not found	Report error with file path
File too large (>1GB)	Warn user, ask to proceed or use chunked extraction
API key not set for LlamaParse	Fall back to pdfplumber automatically
Extraction fails with all methods	Report detailed error, suggest manual scraping
Poor quality output	Suggest using docling with better layout handling
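
The error-handling rules above can be sketched as an input check plus a fallback chain. This is an illustrative sketch under stated assumptions, not the skill's actual script: `validate_input`, `extract_with_fallback`, `extract_docling`, and `extract_pdfplumber` are hypothetical names. The two extractor functions rely on the packages' documented entry points (`DocumentConverter().convert(...)` and `pdfplumber.open(...)`) and import them lazily, so the module loads even when a backend is not installed.

```python
import os
from typing import Callable, List, Optional, Tuple

MAX_BYTES = 1024 ** 3  # the 1 GB threshold from the table

Extractor = Callable[[str], str]


def validate_input(path: str) -> Optional[str]:
    """Raise if the PDF is missing; return a warning string if it is too large."""
    if not os.path.isfile(path):
        raise FileNotFoundError(f"PDF file not found: {path}")
    if os.path.getsize(path) > MAX_BYTES:
        return f"{path} is larger than 1 GB; consider chunked extraction"
    return None


def extract_docling(path: str) -> str:
    from docling.document_converter import DocumentConverter  # pip install docling
    return DocumentConverter().convert(path).document.export_to_markdown()


def extract_pdfplumber(path: str) -> str:
    import pdfplumber  # pip install pdfplumber
    with pdfplumber.open(path) as pdf:
        # extract_text() can return None for pages with no text layer.
        return "\n\n".join(page.extract_text() or "" for page in pdf.pages)


def extract_with_fallback(
    path: str, extractors: List[Tuple[str, Extractor]]
) -> Tuple[str, str]:
    """Try each extractor in order; return (method name, text) for the first success."""
    errors = []
    for name, extract in extractors:
        try:
            text = extract(path)
        except Exception as exc:  # a missing package or parse error both trigger fallback
            errors.append(f"{name}: {exc}")
            continue
        if text.strip():
            return name, text
        errors.append(f"{name}: empty output")
    raise RuntimeError("Extraction failed with all methods:\n" + "\n".join(errors))
```

A caller might pass `[("docling", extract_docling), ("pdfplumber", extract_pdfplumber)]` so that the detailed error report lists every method that was attempted.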

Common Mistakes

Mistake	Fix
Not setting LlamaParse API key	Use docling or pdfplumber instead
Using image-heavy PDFs	Recommend OCR approach or manual scraping
Ignoring extraction warnings	Review output, consider alternative sources
Not validating output	Always check extracted markdown quality
Processing corrupted PDFs	Use PDF repair tool first or manual extraction
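
For the "always check extracted markdown quality" advice, a rough automated check can flag obviously bad output before a manual review. The sketch below is a heuristic only: the function name `quality_report` and both thresholds are assumptions chosen for illustration, and it does not replace reading the output.

```python
from typing import List


def quality_report(
    markdown: str, min_chars: int = 200, min_alnum_ratio: float = 0.5
) -> List[str]:
    """Return a list of human-readable problems; an empty list means no red flags."""
    problems = []
    stripped = markdown.strip()
    if len(stripped) < min_chars:
        problems.append("output is very short; the PDF may be image-only")
    non_space = [c for c in stripped if not c.isspace()]
    if non_space:
        # A low share of letters and digits often indicates garbled extraction.
        ratio = sum(c.isalnum() for c in non_space) / len(non_space)
        if ratio < min_alnum_ratio:
            problems.append(f"only {ratio:.0%} of characters are alphanumeric")
    if "\ufffd" in stripped:
        problems.append("contains Unicode replacement characters")
    return problems
```

Any reported problem is a reason to retry with a different method or to consider an OCR-based approach, as the table suggests.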

Book Pdf Extract

Prerequisites

LlamaParse API Key Options

When to Use

When NOT to Use

Guardrails

Extraction Methods

Method 1: LlamaParse (Best for Textbooks)

Method 2: Docling (Primary - Recommended)

Method 3: Pdfplumber (Fallback)

Quick Start

Workflow

Phase 1: Validate Input

Phase 2: Extract Content (Method Selection)

Phase 3: Format Output

Phase 4: Validation

Scripts

Error Handling

Common Mistakes
