Extract structured FF&E product specs from PDF files — price books, fact sheets, configurator sheets, and spec sheets. Uses PyMuPDF for text extraction and Claude's reasoning to parse wildly varying PDF layouts into a standardized schedule.
The user provides PDFs in one of these ways: a path to a single PDF file, or a directory to scan for .pdf files.

Also ask (or use defaults):
- Mode: expand (one row per variant/SKU, default) or summarize (comma-separated variants in one row)

Products are written to the master Google Sheet — the same 33-column schema used by all product skills, plus PDF-specific extra columns. When writing to CSV, use the same column order.
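The two modes can be sketched in a few lines. This is a minimal illustration with hypothetical field names, not the skill's actual 33-column row schema:

```python
def expand(product: dict, variants: list[str]) -> list[dict]:
    """expand mode: one output row per variant/SKU."""
    return [{**product, "Variant": v} for v in variants]

def summarize(product: dict, variants: list[str]) -> dict:
    """summarize mode: all variants comma-separated in a single row."""
    return {**product, "Variant": ", ".join(variants)}

rows = expand({"Product Name": "Aeron Chair"}, ["Graphite", "Mineral"])
# → two rows, one per variant
one = summarize({"Product Name": "Aeron Chair"}, ["Graphite", "Mineral"])
# → one row with Variant = "Graphite, Mineral"
```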
Read ../../schema/product-schema.md (relative to this SKILL.md) for the full column reference, field formats, and category vocabulary. Read ../../schema/sheet-conventions.md for CRUD patterns with MCP tools.
Skill-specific column values:
- Source: pdf-parser
- Status: saved

PDFs contain fields that don't have dedicated master columns. Append these to Notes using | as delimiter:
- Variant: Diamond, Black
- Price adder: +$130 (PostureFit SL)
- Origin: Sweden
- Source: alphabeta-fact-sheet.pdf

Example Notes cell: Variant: Diamond, Black | Origin: Sweden | Source: alphabeta-fact-sheet.pdf
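The pipe-delimited Notes assembly can be sketched as a simple join. The field names come from the example above; the helper name is hypothetical:

```python
def build_notes(extras: dict) -> str:
    """Join PDF-specific fields into one Notes cell, using | as the delimiter."""
    return " | ".join(f"{key}: {value}" for key, value in extras.items() if value)

notes = build_notes({
    "Variant": "Diamond, Black",
    "Origin": "Sweden",
    "Source": "alphabeta-fact-sheet.pdf",
})
# → "Variant: Diamond, Black | Origin: Sweden | Source: alphabeta-fact-sheet.pdf"
```

Fields with empty values are skipped so the Notes cell never contains dangling delimiters.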
Different PDF types require different approaches:
Parse the user's input to identify PDF file(s) and output preferences (expand vs summarize mode). If given a directory, scan it for .pdf files and report the count. Default to expand unless the user says otherwise.

Use PyMuPDF (fitz) to extract text from each PDF. Run this Python script via Bash:
```python
import fitz  # PyMuPDF
import sys
import json

pdf_path = sys.argv[1]
doc = fitz.open(pdf_path)
pages = []
for i, page in enumerate(doc):
    text = page.get_text()
    pages.append({"page": i + 1, "text": text})
doc.close()
print(json.dumps({"filename": pdf_path.split("/")[-1], "total_pages": len(pages), "pages": pages}))
```
For each PDF, extract all pages and save the JSON output.
Read the extracted text and identify all products, variants, and specifications. This is the core intelligence step — Claude reasons over the text to structure it.
For small PDFs (≤20 pages): Process all pages at once.
For large PDFs (>20 pages): Process in chunks of 10 pages at a time, accumulating parsed products after each chunk.
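The 10-page chunking can be sketched as a slicing helper over the extracted pages list; the function name is illustrative:

```python
def chunk_pages(pages: list, size: int = 10):
    """Yield successive groups of pages so large PDFs are parsed incrementally."""
    for start in range(0, len(pages), size):
        yield pages[start:start + size]

pages = [{"page": i + 1, "text": ""} for i in range(25)]
sizes = [len(chunk) for chunk in chunk_pages(pages)]
# → [10, 10, 5]
```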
Parsing instructions:
Show a summary markdown table with the parsed products. Include:
Ask: "Does this look correct? Should I adjust anything before saving?"
Ask the user (if not already specified): "Where should I save this?"
Options:
- Append to the master Google Sheet
- Save to a local CSV (e.g. ./ffe-pdf-parse-YYYY-MM-DD.csv)

When saving to CSV, use the CSV header from ../../schema/product-schema.md.
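Writing the CSV with the schema's header order can be sketched with the standard csv module. The column list here is an illustrative subset, not the real 33-column header from product-schema.md:

```python
import csv

COLUMNS = ["Product Name", "Manufacturer", "Notes"]  # illustrative subset only

def write_schedule(path: str, rows: list[dict]) -> None:
    """Write parsed products to CSV, emitting the schema header row first."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=COLUMNS)
        writer.writeheader()
        writer.writerows(rows)
```

DictWriter guarantees every row lands in the declared column order regardless of dict key order.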
Append rows to the master Google Sheet using the same 33-column schema. Set Clipped At to current timestamp and Source to pdf-parser. PDF-specific data (variant, price adder, country of origin, source filename) goes in the Notes column.
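Stamping the two skill-specific fields before appending can be sketched as follows. The helper name and timestamp format are assumptions; the sheet append itself goes through the MCP tools:

```python
from datetime import datetime, timezone

def to_row(product: dict) -> dict:
    """Copy the parsed product and stamp Clipped At and Source before appending."""
    row = dict(product)
    row["Clipped At"] = datetime.now(timezone.utc).isoformat(timespec="seconds")
    row["Source"] = "pdf-parser"
    return row
```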
After processing, always report:
```
Parsed: X products from Y PDF(s)
- filename.pdf: N products extracted
- filename2.pdf: M products extracted
Issues: [list any problems]
```