Use when extracting content from a PDF textbook into markdown format for processing through the bookSHelf pipeline. Supports LlamaParse for advanced document structure detection, tables, and math expressions.
Extract content from PDF textbooks into markdown format suitable for the bookSHelf pipeline. The primary method is Docling, which provides layout-aware extraction with image support and math formula detection. LiteParse (local, CLI-based) and pdfplumber (fallback) are also supported.
- LlamaParse API key (`--llamaparse-key` or `LLAMA_CLOUD_API_KEY` env variable)
- `pip install docling` (optional, best for complex layouts)
- `pip install pdfplumber` (fallback, simpler text extraction)

You can provide the LlamaParse API key in two ways:
- Set `LLAMA_CLOUD_API_KEY` in your environment
- Pass `--llamaparse-key <your-api-key>` when calling the extraction script

Example:
```bash
# Using environment variable
export LLAMA_CLOUD_API_KEY=your_api_key_here
python pdf_extract.py --pdf input.pdf --output output.md --method llamaparse

# Or using command-line argument
python pdf_extract.py --pdf input.pdf --output output.md --method llamaparse --llamaparse-key your_api_key_here
```
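The two ways of supplying the key can be combined into a single lookup. The following is a minimal sketch, not the actual logic in `pdf_extract.py`: the function name `resolve_llamaparse_key` is hypothetical, and the precedence (command-line argument first, then environment variable) is an assumption.

```python
import os
from typing import Mapping, Optional


def resolve_llamaparse_key(
    cli_key: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Return the LlamaParse API key, or None if neither source provides one.

    Assumption: a key passed via --llamaparse-key takes precedence over
    the LLAMA_CLOUD_API_KEY environment variable.
    """
    env = os.environ if env is None else env
    for candidate in (cli_key, env.get("LLAMA_CLOUD_API_KEY")):
        # Treat empty or whitespace-only values as "not set".
        if candidate and candidate.strip():
            return candidate.strip()
    return None
```

Passing the environment as a parameter keeps the function easy to test without modifying the real process environment.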
⚠️ Must NOT:
- Skip validation of output markdown
- Overwrite existing markdown files without confirmation
- Process corrupted or malformed PDFs without user warning
- Use LlamaParse cloud features without an API key
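The first two rules (validate the output, never overwrite silently) could be enforced at write time. This is a rough sketch under stated assumptions: `validate_markdown` and `write_markdown` are hypothetical helpers, and the specific checks (minimum length, at least one header, no Unicode replacement characters) are illustrative heuristics rather than the pipeline's actual validation.

```python
import re
from pathlib import Path
from typing import List


def validate_markdown(text: str, min_chars: int = 200) -> List[str]:
    """Return a list of problems found in extracted markdown (empty if none)."""
    problems = []
    if len(text.strip()) < min_chars:
        problems.append("output is nearly empty")
    # Look for at least one ATX header such as "## Section".
    if not re.search(r"^#{1,6} \S", text, flags=re.MULTILINE):
        problems.append("no markdown headers found")
    if "\ufffd" in text:
        problems.append("contains replacement characters (likely encoding damage)")
    return problems


def write_markdown(path: Path, text: str, overwrite: bool = False) -> None:
    """Write validated markdown, refusing to replace an existing file by default."""
    if path.exists() and not overwrite:
        raise FileExistsError(f"{path} already exists; confirm before overwriting")
    problems = validate_markdown(text)
    if problems:
        raise ValueError("; ".join(problems))
    path.write_text(text, encoding="utf-8")
```

A caller would set `overwrite=True` only after the user has explicitly confirmed the replacement.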
- Set the `LLAMA_CLOUD_API_KEY` environment variable or use the `--llamaparse-key` argument
- `pip install llama-parse`
- `<!-- image -->` placeholders
- `##` headers
- `pip install docling`
- `docling <file.pdf> --output <file.md>`
- `pip install pdfplumber Pillow`
- `full.md` in extracted directory

| Script | Purpose | Input | Output |
|---|---|---|---|
| `bookSHelf/scripts/workflows/pdf_extract.py` | Default extraction (auto-selects best method, supports llamaparse with `--llamaparse-key`) | PDF path | `full.md` in `source_files/extracted/` |

| Problem | Action |
|---|---|
| PDF file not found | Report error with file path |
| File too large (>1GB) | Warn user, ask to proceed or use chunked extraction |
| API key not set for LlamaParse | Fall back to pdfplumber automatically |
| Extraction fails with all methods | Report detailed error, suggest manual scraping |
| Poor quality output | Suggest docling, which handles complex layouts better |

| Mistake | Fix |
|---|---|
| Not setting LlamaParse API key | Use docling or pdfplumber instead |
| Using image-heavy PDFs | Recommend OCR approach or manual scraping |
| Ignoring extraction warnings | Review output, consider alternative sources |
| Not validating output | Always check extracted markdown quality |
| Processing corrupted PDFs | Use PDF repair tool first or manual extraction |
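Several rows of the Problem/Action table (missing file, oversized file, missing LlamaParse key) can be checked before extraction starts. The sketch below is illustrative only: `preflight` and `choose_method` are hypothetical names, the 1 GB threshold is taken as 10^9 bytes by assumption, and using `"pdfplumber"` as a method value is assumed (only `--method llamaparse` appears in the examples above).

```python
from pathlib import Path
from typing import List, Optional

# Assumption: "1GB" in the table is interpreted as 10**9 bytes.
MAX_BYTES = 1_000_000_000


def choose_method(requested: str, api_key: Optional[str]) -> str:
    """Fall back to pdfplumber when LlamaParse is requested without an API key."""
    if requested == "llamaparse" and not api_key:
        return "pdfplumber"
    return requested


def preflight(pdf_path: Path, size_limit: int = MAX_BYTES) -> List[str]:
    """Raise if the PDF is missing; return warnings the user should confirm."""
    if not pdf_path.is_file():
        # Report the error together with the offending path.
        raise FileNotFoundError(f"PDF file not found: {pdf_path}")
    warnings = []
    size = pdf_path.stat().st_size
    if size > size_limit:
        warnings.append(
            f"{pdf_path} is {size} bytes (limit {size_limit}); "
            "proceed or use chunked extraction?"
        )
    return warnings
```

Returning warnings instead of raising leaves the decision to proceed with a large file to the user, as the table requires.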