PDF reading and text extraction
This skill reads and processes PDFs with the pdf-reader MCP server, which processes pages in parallel (a claimed 5-10x speedup), and falls back to Python libraries and command-line tools when the server is unavailable.
| Task | Method | Tool |
|---|---|---|
| Extract text | MCP or pdfplumber | pdf-reader / Bash |
| Extract tables | pdfplumber | Bash |
| Merge PDFs | pypdf | Bash |
| Split PDFs | pypdf / qpdf | Bash |
| OCR scanned PDFs | pytesseract + pdf2image | Bash |
| Read from URL | Download + Process | WebFetch + Bash |
Extract all pages with the pdf-reader tool:
{
  "sources": [{
    "path": "/path/to/document.pdf"
  }],
  "operation": "extract",
  "pages": "all"
}
Extract pages 1-5 only:
{
  "sources": [{
    "path": "document.pdf"
  }],
  "operation": "extract",
  "pages": "1-5"
}
Read the document metadata:
{
  "sources": [{
    "path": "document.pdf"
  }],
  "operation": "metadata"
}
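The requests above share one shape, so a small helper can build them. This is a minimal sketch: `build_request` is a hypothetical name, and the field names (`sources`, `path`, `operation`, `pages`) are copied from the examples above rather than from the server's schema, which may accept more or different fields.

```python
import json


def build_request(path, operation="extract", pages=None):
    """Build a request dict shaped like the JSON examples above (assumed schema)."""
    request = {"sources": [{"path": path}], "operation": operation}
    # "pages" is only present in the extract examples, so it is left out by default.
    if pages is not None:
        request["pages"] = pages
    return request


# Show the pages 1-5 request as JSON.
print(json.dumps(build_request("document.pdf", pages="1-5"), indent=2))
```

Leaving `pages` out by default keeps the metadata request identical to the one shown above.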
For PDFs hosted at a URL (HubFS, S3, and so on), download the file first, extract from the local copy, then delete it:
curl -L -o /tmp/document.pdf "https://example.com/file.pdf"
{
"sources": [{"path": "/tmp/document.pdf"}],
"operation": "extract"
}
rm /tmp/document.pdf
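The same download, process, clean up sequence can be sketched in Python with only the standard library. `downloaded_pdf` is a hypothetical helper name, not part of the skill. Like `curl -L`, `urllib` follows redirects by default.

```python
import os
import shutil
import tempfile
import urllib.request
from contextlib import contextmanager


@contextmanager
def downloaded_pdf(url):
    """Download url to a temporary .pdf file, yield its path, then delete it."""
    fd, path = tempfile.mkstemp(suffix=".pdf")
    try:
        # Stream the response to disk instead of holding the whole PDF in memory.
        with os.fdopen(fd, "wb") as out, urllib.request.urlopen(url) as response:
            shutil.copyfileobj(response, out)
        yield path
    finally:
        # Mirrors the `rm` step: the file is removed even if processing fails.
        if os.path.exists(path):
            os.remove(path)
```

Inside `with downloaded_pdf("https://example.com/file.pdf") as path:` the local path can be passed to the pdf-reader tool or to any of the Python fallbacks.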
When MCP is unavailable, use Python:
pip install pypdf pdfplumber pandas reportlab pytesseract pdf2image
Extract text page by page:
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    for page in pdf.pages:
        # May be empty for scanned pages that contain only images
        text = page.extract_text()
        print(text)
Extract tables into pandas DataFrames:
import pdfplumber
import pandas as pd

with pdfplumber.open("document.pdf") as pdf:
    for page in pdf.pages:
        tables = page.extract_tables()
        for table in tables:
            # Treat the first row as the header and the rest as data
            df = pd.DataFrame(table[1:], columns=table[0])
            print(df)
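pdfplumber returns each table as a list of rows and uses `None` for empty cells, so the first row is not always a clean header. A minimal sketch of a cleanup step, assuming the first row is the header; `table_to_records` and the `col_N` placeholder names are hypothetical:

```python
def table_to_records(table):
    """Turn a pdfplumber-style table (list of rows) into a list of dicts."""
    # A table needs a header row and at least one data row to be useful.
    if not table or len(table) < 2:
        return []
    # Replace missing or blank header cells with positional placeholder names.
    header = [(cell or "").strip() or f"col_{i}" for i, cell in enumerate(table[0])]
    return [
        {name: (cell or "").strip() for name, cell in zip(header, row)}
        for row in table[1:]
    ]
```

The resulting records can be passed to `pd.DataFrame(records)` in place of the raw rows.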
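OCR scanned PDFs (pytesseract needs the Tesseract binary and pdf2image needs poppler, in addition to the Python packages):
import pytesseract
from pdf2image import convert_from_path

# Render each page to an image, then run OCR on it
images = convert_from_path('scanned.pdf')
for i, image in enumerate(images):
    # 'jpn' selects Japanese and needs that language data installed
    text = pytesseract.image_to_string(image, lang='jpn')
    print(f"Page {i+1}:\n{text}")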
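OCR is much slower than direct text extraction, so one option is to extract text first and OCR only the pages that come back empty. The sketch below assumes a page with fewer than `min_chars` characters is scanned. That threshold and both function names are illustrative, not part of the skill. pdfplumber is imported lazily so the pure helper works without it.

```python
def pages_needing_ocr(page_texts, min_chars=20):
    """Return 1-based page numbers whose extracted text is missing or very short."""
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if len((text or "").strip()) < min_chars
    ]


def extract_page_texts(path):
    """Extract text from every page with pdfplumber (imported on first use)."""
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        return [page.extract_text() for page in pdf.pages]
```

For example, `pages_needing_ocr(extract_page_texts("document.pdf"))` gives the page numbers to render and pass to pytesseract.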
# Install on macOS
brew install poppler
# Extract text preserving layout
pdftotext -layout input.pdf output.txt
# Extract specific pages
pdftotext -f 1 -l 5 input.pdf output.txt
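When `pdftotext` is driven from a script, building the argument list in one place keeps the flags consistent. A minimal sketch using only the flags shown above (`-layout`, `-f`, `-l`); the function names are hypothetical:

```python
import subprocess


def pdftotext_args(src, dst, first=None, last=None, layout=True):
    """Build the pdftotext argument list for the given page range."""
    args = ["pdftotext"]
    if layout:
        args.append("-layout")  # preserve the physical layout of the page
    if first is not None:
        args += ["-f", str(first)]  # first page to convert
    if last is not None:
        args += ["-l", str(last)]  # last page to convert
    return args + [src, dst]


def run_pdftotext(src, dst, **options):
    """Run pdftotext and raise CalledProcessError if it exits with an error."""
    subprocess.run(pdftotext_args(src, dst, **options), check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.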
# Install
brew install qpdf
# Merge PDFs
qpdf --empty --pages file1.pdf file2.pdf -- merged.pdf
# Split pages
qpdf input.pdf --pages . 1-5 -- pages1-5.pdf
For PDFs requiring advanced processing, an Apify actor can do the extraction:
# PDF Text Extractor (jirimoravcik)
npx apify-cli call jirimoravcik/pdf-text-extractor \
-i '{"pdfUrls": ["https://example.com/file.pdf"]}'
See references/apify-actors.md for full Apify integration.
Example: download a PDF from a URL and read it with the pdf-reader tool:
# Download
curl -L -o /tmp/anthropic-guide.pdf \
"https://resources.anthropic.com/hubfs/The-Complete-Guide-to-Building-Skill-for-Claude.pdf"
# Extract with MCP
# Use pdf-reader tool with path: /tmp/anthropic-guide.pdf
Merge PDFs with pypdf:
from pypdf import PdfWriter, PdfReader

writer = PdfWriter()
for pdf_file in ["doc1.pdf", "doc2.pdf", "doc3.pdf"]:
    reader = PdfReader(pdf_file)
    # Append every page of the current file to the output
    for page in reader.pages:
        writer.add_page(page)

with open("merged.pdf", "wb") as output:
    writer.write(output)
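The quick-reference table also lists pypdf for splitting. A minimal sketch of the reverse operation, writing fixed-size chunks of pages to separate files; `chunk_ranges`, `split_pdf`, and the `part-N.pdf` naming are assumptions for illustration. pypdf is imported lazily so the range helper works without it.

```python
def chunk_ranges(total_pages, chunk_size):
    """Return (start, stop) 0-based half-open page ranges covering the document."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        (start, min(start + chunk_size, total_pages))
        for start in range(0, total_pages, chunk_size)
    ]


def split_pdf(path, chunk_size, prefix="part"):
    """Write each chunk of pages to its own file and return the file names."""
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(path)
    outputs = []
    ranges = chunk_ranges(len(reader.pages), chunk_size)
    for index, (start, stop) in enumerate(ranges, start=1):
        writer = PdfWriter()
        for page_number in range(start, stop):
            writer.add_page(reader.pages[page_number])
        name = f"{prefix}-{index}.pdf"
        with open(name, "wb") as output:
            writer.write(output)
        outputs.append(name)
    return outputs
```

The same ranges can also be used to process a large file in batches instead of all at once.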
Add a watermark to every page:
from pypdf import PdfReader, PdfWriter

# The first page of watermark.pdf is stamped onto each page
watermark = PdfReader("watermark.pdf").pages[0]
reader = PdfReader("document.pdf")
writer = PdfWriter()

for page in reader.pages:
    page.merge_page(watermark)
    writer.add_page(page)

with open("watermarked.pdf", "wb") as output:
    writer.write(output)
| Issue | Solution |
|---|---|
| Empty text | The PDF may be scanned; use OCR |
| Garbled characters | Check the encoding or try a different library |
| Tables broken | Use pdfplumber with explicit table settings |
| Large file is slow | Process page ranges or run extraction in parallel |
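For the broken-tables case, pdfplumber accepts a `table_settings` dict on `extract_tables`. The sketch below switches both strategies from ruling lines to text alignment, which sometimes helps with tables that have no drawn borders. The values are a starting point to tune per document, and the constant and function names are hypothetical.

```python
# Detect columns and rows from text alignment instead of drawn lines.
TEXT_ALIGNED_SETTINGS = {
    "vertical_strategy": "text",
    "horizontal_strategy": "text",
    "snap_tolerance": 3,
}


def extract_tables_with_settings(path, settings=None):
    """Return one list of tables per page, extracted with explicit settings."""
    import pdfplumber  # imported on first use

    chosen = settings or TEXT_ALIGNED_SETTINGS
    with pdfplumber.open(path) as pdf:
        return [page.extract_tables(table_settings=chosen) for page in pdf.pages]
```

If the table does have ruling lines, the `"lines"` strategy is usually the better first choice.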