Extracts, validates, and structures data from PDF invoices with automated validation and error correction loops. Use when processing invoice PDFs, extracting billing data, batch processing invoices, or when accuracy is critical.
Invoice Processing:
- [ ] Step 1: Log start time
- [ ] Step 2: Extract PDF text
- [ ] Step 3: Parse invoice fields
- [ ] Step 4: Validate (run validate_invoice.py)
- [ ] Step 5: Fix errors and re-validate if needed
- [ ] Step 6: Save final output AND eval log
Record the start time for eval tracking:
from datetime import datetime
start_time = datetime.now().isoformat()
from pypdf import PdfReader
reader = PdfReader("invoice.pdf")
text = ""
for page in reader.pages:
    # Skip pages that yield no text (for example, image-only scans)
    page_text = page.extract_text()
    if page_text:
        text += page_text + "\n"
Extract from text:
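As a rough illustration of this parsing step, the sketch below pulls the four core output fields (`vendor`, `invoice_number`, `date`, `total`) out of the extracted text with regular expressions. The function name `parse_invoice_fields`, the patterns, and the rule that the vendor is the first non-empty line are all assumptions made for the example; real invoices vary widely and the patterns will likely need tuning.

```python
import re

# Hypothetical patterns for illustration only; adjust them to your invoices.
INVOICE_NO_RE = re.compile(
    r"Invoice\s*(?:No\.?|Number|#)\s*:?\s*([A-Z0-9-]+)", re.IGNORECASE
)
DATE_RE = re.compile(r"\b(\d{4}-\d{2}-\d{2})\b")
TOTAL_RE = re.compile(
    r"\bTotal\s*(?:Due)?\s*:?\s*\$?\s*([\d,]+\.\d{2})", re.IGNORECASE
)


def parse_invoice_fields(text: str) -> dict:
    """Extract core fields from raw invoice text; fields not found are None."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    number = INVOICE_NO_RE.search(text)
    date = DATE_RE.search(text)
    total = TOTAL_RE.search(text)
    return {
        # Assumption: the vendor name is the first non-empty line.
        "vendor": lines[0] if lines else None,
        "invoice_number": number.group(1) if number else None,
        "date": date.group(1) if date else None,
        # Strip thousands separators before converting to a number.
        "total": float(total.group(1).replace(",", "")) if total else None,
    }
```

Returning `None` for a field that was not found, rather than guessing, leaves it to the validation step to flag the gap.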
Run: python scripts/validate_invoice.py output.json
If validation fails:
Common issues: See TROUBLESHOOTING.md
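The validate, fix, re-validate loop from Steps 4 and 5 could be sketched as follows. This assumes `scripts/validate_invoice.py` signals failure with a non-zero exit code, which the text above does not confirm; `validate_with_retries`, `fix_errors`, and `max_attempts` are hypothetical names introduced for the example.

```python
import subprocess
import sys
from typing import Callable


def run_validator(output_path: str) -> subprocess.CompletedProcess:
    """Run the validation script. Assumes a non-zero exit code means failure."""
    return subprocess.run(
        [sys.executable, "scripts/validate_invoice.py", output_path],
        capture_output=True,
        text=True,
    )


def validate_with_retries(
    output_path: str,
    fix_errors: Callable[[str, str], None],
    validate: Callable[[str], subprocess.CompletedProcess] = run_validator,
    max_attempts: int = 3,
) -> bool:
    """Validate, apply fixes on failure, and re-validate up to max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        result = validate(output_path)
        if result.returncode == 0:
            return True
        if attempt < max_attempts:
            # fix_errors is a hypothetical hook that edits the JSON in place,
            # guided by whatever the validator printed.
            fix_errors(output_path, result.stdout + result.stderr)
    return False
```

Capping the number of attempts keeps a persistently invalid invoice from looping forever; a `False` result is the cue to consult TROUBLESHOOTING.md.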
Save two files:
{
"vendor": "...",
"invoice_number": "...",
"date": "YYYY-MM-DD",
"total": 0.00
}
Eval log (appended to eval_results/all_evals.jsonl):
python scripts/collect_eval.py "<task_id>" "<original_task_prompt>" "<output_file>" "<notes>"
Example:
python scripts/collect_eval.py "invoice-basic" "Extract invoice data from invoice.pdf" "output.json" "validation passed on first attempt"
Always append to eval_results/all_evals.jsonl (one JSON object per line). If the file already exists, add to it rather than overwriting it.
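To make the append-only behavior concrete, here is a minimal sketch of writing one record per line to a JSONL file. It is not the implementation of `collect_eval.py`; the helper name `append_eval_record` and the choice to create the file when it is missing are assumptions for illustration, and the record can hold whatever fields the eval log actually uses.

```python
import json
from pathlib import Path


def append_eval_record(
    record: dict, log_path: str = "eval_results/all_evals.jsonl"
) -> None:
    """Append one record as a single JSON line."""
    path = Path(log_path)
    # Assumption: create the directory and file if they do not exist yet.
    path.parent.mkdir(parents=True, exist_ok=True)
    # Mode "a" appends, so existing lines are never rewritten.
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```

Because `json.dumps` emits no raw newlines by default, each call adds exactly one line, which keeps the file parseable line by line.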
{
"vendor": "Company Name",
"invoice_number": "INV-2025-001",
"date": "2025-01-15",
"total": 1250.00,
"currency": "USD",
"line_items": []
}
See VALIDATION.md for complete rules.
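Before running the full validator, a quick sanity check on the output record can catch obvious problems. The sketch below only checks that the four core fields are present, have plausible types, and that `date` parses as YYYY-MM-DD. These checks are assumptions inferred from the example output, not the rules in VALIDATION.md, and `basic_checks` is a hypothetical helper.

```python
from datetime import datetime

# Assumed from the example output; VALIDATION.md remains the authority.
REQUIRED_FIELDS = {
    "vendor": str,
    "invoice_number": str,
    "date": str,
    "total": (int, float),
}


def basic_checks(invoice: dict) -> list[str]:
    """Return a list of problems; an empty list means the basic checks pass."""
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in invoice:
            problems.append(f"missing field: {field}")
        # bool is a subclass of int, so reject it explicitly.
        elif isinstance(invoice[field], bool) or not isinstance(invoice[field], expected):
            problems.append(f"wrong type for {field}")
    date = invoice.get("date")
    if isinstance(date, str):
        try:
            datetime.strptime(date, "%Y-%m-%d")
        except ValueError:
            problems.append("date is not YYYY-MM-DD")
    return problems
```

Collecting every problem in one pass, instead of stopping at the first, gives the fix-and-re-validate loop a complete list to work from.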
Edit PDFs with natural-language instructions using the nano-pdf CLI.