Read a construction invoice PDF once, detect document boundaries, and extract raw structured data from every document. Write extracted.json. Never corrects values — downstream skills handle comparison and judgment.
Raw, faithful extraction. Read every page of the PDF exactly once. For each document, transcribe exactly what it says — amounts, invoice numbers, dates, vendor names — without correcting, reconciling, or choosing between conflicting values. The ingest layer is a transcription layer, not an analysis layer.
You are a camera, not an editor.
Corrections, comparisons, and judgments happen downstream in the matcher and analyzer.
Every page must be read. Every document must be individually extracted. Every amount must be verified against the source. If the PDF has 40 pages with 30 separate invoices, read all 40 and extract all 30.
A path to one or more PDFs, a directory, or a zip file. If a zip, extract it first. If a directory, scan recursively for PDFs, spreadsheets, and context files.
Write a JSON file to the working directory: extracted.json
The schema has two independent sections that are extracted from different parts of the PDF. They must NOT influence each other during extraction.
{
"source_pdf": "/path/to/invoice.pdf",
"page_count": 40,
"ingest_timestamp": "2026-03-31T10:00:00Z",
"parent_invoice": {
"pages": [1, 2],
"vendor": "ABC Construction",
"invoice_number": "Project-MAR26",
"date": "2026-03-28",
"bill_to": "Jane Homeowner",
"line_items": [
{
"id": "li_001",
"supplier": "Apex Drywall",
"description": "Framing and Durock install at fireplace",
"amount": 2530.00,
"invoice_number": "043",
"date": "2026-02-23",
"cost_code": "2900",
"is_credit": false
}
],
"financial_summary": {
"subtotal": 204272.08,
"markup_label": "Builders Comp",
"markup_rate": null,
"markup_amount": 30640.81,
"markup_credits": [
{ "description": "Credit for BC on Elevator Draw", "amount": -10275.00 }
],
"retention": 0.00,
"retention_note": "reached 75% threshold",
"invoice_total": 224637.89,
"prior_balance": 40000.00,
"total_due": 264637.89
}
},
"supporting_documents": [
{
"id": "supp_001",
"pages": [3],
"vendor": "Apex Drywall Specialists",
"invoice_number": "043",
"date": "2026-02-23",
"total_amount": 2530.00,
"amount_due": 2530.00,
"document_type": "standard_invoice",
"po_number": "SILV-5500-1",
"line_items": [
{ "description": "Labor: 52h @ $40", "amount": 2080.00 },
{ "description": "Material", "amount": 450.00 }
],
"math_verified": true,
"notes": ""
}
],
"extraction_warnings": []
}
Each line item in parent_invoice.line_items records exactly what appears on the GC's detail summary page (typically page 2). The fields come from the summary's columns:
supplier — from the Suppliers columndescription — from the Material/Description columnamount — from the Amount column (negative for credits shown in parentheses)invoice_number — from the Invoice # column as the GC wrote itdate — from the Date column as the GC wrote itcost_code — from the LW Act# / Cost Code columnDo NOT look at supporting documents to fill in or correct parent line item fields. The parent line item is a record of what the GC claims.
Each entry in supporting_documents records exactly what appears on that vendor's invoice/receipt. The fields come from the supporting document:
vendor — from the document's letterheadinvoice_number — from the document's invoice number fielddate — from the document's date fieldtotal_amount / amount_due — from the document's totalsDo NOT look at the parent summary to fill in or correct supporting document fields. The supporting document is a record of what the vendor claims.
The matcher's entire job is to compare these two independent records and find mismatches. If ingest "helps" by reconciling them, the matcher has nothing to find.
Example of what goes WRONG if ingest corrects:
Example of what goes RIGHT with raw extraction:
"invoice_number": "1054" (what the GC wrote)"invoice_number": "1034" (what Superdry's invoice says)Read the cover page and detail summary first. Extract:
Read the rest of the PDF in batches. For each page, detect document boundaries and extract data.
Boundary detection signals:
Multi-page grouping:
For each document, extract exactly what it says:
| Field | Required | Notes |
|---|---|---|
| vendor | Yes | Company name from letterhead, exactly as printed |
| invoice_number | If present | Vendor's invoice number, exactly as printed |
| date | Yes | Invoice date from the document |
| total_amount | Yes | Total on the invoice |
| amount_due | Yes | Balance due (may differ from total if prior payments) |
| document_type | Yes | See types below |
| po_number | If present | Purchase order number |
| line_items | Yes | Every line item with description and amount |
Document types:
standard_invoice — formal typed invoicereceipt_photo — photographed retail receiptprogress_billing — shows cumulative contract + incremental drawrental_invoice — equipment rental with period chargespay_application — AIA-style or custom pay requestcredit_memo — all amounts negativelabor_detail — internal timesheet/hours breakdowncredit_card_statement — CC transaction printout (not a proper invoice)Progress billing rule: For progress billings, amount_due is the INCREMENTAL amount (due this pay request), not the cumulative contract total. Verify: amount_due = total_billed_to_date - prior_payments.
Apply ONLY format normalization. These changes make the data machine-readable without altering the actual values:
$, commas. Credits in parentheses (2,149.00) → -2149.00. Round to 2 decimal places.Do NOT normalize:
Write the complete structured data. Include an extraction_warnings array for anything noteworthy during extraction:
Warnings are observations, not corrections. "Parent shows date 6/1/2025 which seems anomalous for a Feb 2026 draw" is a good warning. Changing the date to 2/16/2026 is not — that's the analyzer's job.
extraction_warnings and continue