Parse EPD (Environmental Product Declaration) PDF documents to extract structured environmental impact data — GWP, life cycle stages, certifications, and compliance metrics.
Extract structured environmental impact data from EPD (Environmental Product Declaration) PDF files. Uses PyMuPDF for text extraction and Claude's reasoning to parse varying EPD formats into a standardized 42-column schema.
EPDs follow ISO 14025 / ISO 21930 / EN 15804 and report life cycle environmental impacts of building products. This skill reads those PDFs and structures the data for comparison, specification, and LEED documentation.
The user provides EPD PDFs in one of these ways:
.pdf files)

Also ask (or use defaults):
EPD data uses a 42-column schema, separate from the 33-column FF&E product schema. When writing to CSV, use the same column order.
| Col | Field | Description | Format |
|---|---|---|---|
| A | EPD Link | URL to original EPD document | =HYPERLINK(url, "EPD") or blank for local PDFs |
| B | Manufacturer | Company that makes the product | Title Case |
| C | Product Name | Declared product or product group | Title Case |
| D | Description | Brief product description | Sentence case |
| E | Declared Unit | Functional/declared unit (e.g., "1 m2", "1 kg", "1 m3") | As stated in EPD |
| F | Functional Unit | Functional unit with RSL if different from declared | As stated, blank if same |
| G | CSI Division | MasterFormat division number | 03, 05, 07, 08, 09, etc. |
| H | Material Category | Normalized material type | See vocabulary below |
| Col | Field | Description | Format |
|---|---|---|---|
| I | EPD Registration No. | Unique EPD identifier | As published |
| J | Program Operator | Certifying body | UL, NSF, SCS, IBU, Environdec, etc. |
| K | PCR Reference | Product Category Rule reference | Full citation |
| L | PCR Expiry | PCR expiration date | YYYY-MM-DD |
| M | Standard | Governing standard | ISO 14025, ISO 21930, EN 15804+A2, etc. |
| N | System Boundary | Scope of LCA | Cradle-to-gate, Cradle-to-grave, Cradle-to-gate with options |
| O | Valid From | EPD publication date | YYYY-MM-DD |
| P | Valid To | EPD expiration date | YYYY-MM-DD |
| Col | Field | Description | Unit |
|---|---|---|---|
| Q | GWP-total (A1-A3) | Global Warming Potential, total | kg CO2e |
| R | GWP-fossil (A1-A3) | GWP from fossil sources | kg CO2e |
| S | GWP-biogenic (A1-A3) | GWP from biogenic sources | kg CO2e |
| T | ODP (A1-A3) | Ozone Depletion Potential | kg CFC-11e |
| U | AP (A1-A3) | Acidification Potential | kg SO2e |
| V | EP (A1-A3) | Eutrophication Potential | kg PO4e |
| Col | Field | Description | Unit |
|---|---|---|---|
| W | GWP (A4-A5) | Construction stage GWP | kg CO2e |
| X | GWP (B1-B7) | Use stage GWP | kg CO2e |
| Y | GWP (C1-C4) | End-of-life GWP | kg CO2e |
| Z | GWP (D) | Beyond system boundary GWP | kg CO2e |
| AA | GWP-total (all stages) | Sum of all declared stages | kg CO2e |
| AB | POCP (A1-A3) | Photochemical Ozone Creation Potential | kg C2H4e |
| Col | Field | Description | Unit |
|---|---|---|---|
| AC | PERE (A1-A3) | Primary Energy, Renewable, energy use | MJ |
| AD | PENRE (A1-A3) | Primary Energy, Non-Renewable, energy use | MJ |
| AE | Total Energy (A1-A3) | PERE + PENRE | MJ |
| AF | FW (A1-A3) | Fresh Water Use | m3 |
| AG | Recycled Content | Percentage of recycled content | % |
| AH | Waste (A1-A3) | Total waste generated | kg |
| Col | Field | Description | Format |
|---|---|---|---|
| AI | LEED Eligible | MRc2 compliance flag | Yes, No, Partial |
| AJ | EC3 ID | Building Transparency EC3 identifier | As listed, blank if unknown |
| AK | Plant/Facility | Manufacturing plant or facility name | As stated |
| AL | Country | Manufacturing country | ISO 3166-1 alpha-2 |
| AM | Parsed At | Timestamp of parsing | ISO 8601 |
| AN | Tags | User-assigned tags | Comma-separated |
| AO | Notes | Additional context | Free text — see below |
| AP | Source | Which skill created this row | epd-parser |
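Two columns in the tables above are derived rather than read directly from the EPD: Total Energy (AE) is PERE + PENRE, and GWP-total (all stages) (AA) is the sum of the declared stages. The following is a minimal sketch of that arithmetic, assuming each row is a dict keyed by the field names above and that undeclared values are stored as empty strings. The helper names `to_number`, `sum_declared`, and `fill_derived` are illustrative, not part of the skill. Whether Module D belongs in the all-stages sum is not stated here, so the stage list is a parameter.

```python
def to_number(value):
    """Return a float, or None for blank or non-numeric cells."""
    if value is None or str(value).strip() == "":
        return None
    try:
        return float(value)
    except ValueError:
        return None


def sum_declared(row, fields):
    """Sum only the fields that carry a value; return "" if none do."""
    values = [to_number(row.get(name)) for name in fields]
    declared = [v for v in values if v is not None]
    return sum(declared) if declared else ""


# Assumption: all five GWP stage columns count toward the total by default.
GWP_STAGE_FIELDS = [
    "GWP-total (A1-A3)",
    "GWP (A4-A5)",
    "GWP (B1-B7)",
    "GWP (C1-C4)",
    "GWP (D)",
]


def fill_derived(row, gwp_stage_fields=GWP_STAGE_FIELDS):
    """Return a copy of row with the two derived columns filled in."""
    row = dict(row)
    pere = to_number(row.get("PERE (A1-A3)"))
    penre = to_number(row.get("PENRE (A1-A3)"))
    # Total Energy is left blank unless both inputs are declared.
    if pere is not None and penre is not None:
        row["Total Energy (A1-A3)"] = pere + penre
    else:
        row["Total Energy (A1-A3)"] = ""
    row["GWP-total (all stages)"] = sum_declared(row, gwp_stage_fields)
    return row
```

Leaving a derived cell blank when its inputs are missing keeps the sketch consistent with the rule of not guessing values that the EPD does not declare.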
Use ONE normalized term: Concrete, Steel, Aluminum, Wood/Timber, Insulation, Gypsum, Glass, Ceramic/Tile, Carpet, Resilient Flooring, Roofing Membrane, Sealant, Paint/Coating, Masonry, Stone, Composite Panel, Acoustic, Cladding, Rebar, Cement, Aggregate, Furniture, Other.
EPDs contain fields that don't have dedicated columns. Append these to Notes:
- `Type: Product-specific` or `Type: Industry-average` or `Type: Sector`
- `Verified by: [verifier name]`
- `EN 15804+A1` or `EN 15804+A2` (important for comparability)
- `Source: holcim-readymix-epd.pdf`
- `LCA: GaBi` or `LCA: SimaPro` or `LCA: openLCA`

Example Notes cell: `Type: Product-specific | Verified by: Underwriters Laboratories | EN 15804+A2 | LCA: GaBi | Source: holcim-readymix-epd.pdf`
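One way to assemble the Notes cell from these pieces is sketched below. It assumes the parts are joined with ` | ` in the order used by the example cell and that any missing part is simply skipped. The function name `build_notes` and its parameters are hypothetical.

```python
def build_notes(epd_type=None, verifier=None, en15804_version=None,
                lca_software=None, source_file=None, extra=None):
    """Join the available Notes fragments with ' | ', skipping blanks."""
    parts = []
    if epd_type:
        parts.append(f"Type: {epd_type}")
    if verifier:
        parts.append(f"Verified by: {verifier}")
    if en15804_version:
        # Stored as written in the EPD, e.g. "EN 15804+A2".
        parts.append(en15804_version)
    if lca_software:
        parts.append(f"LCA: {lca_software}")
    if source_file:
        parts.append(f"Source: {source_file}")
    # Any other free-text remarks go last.
    parts.extend(extra or [])
    return " | ".join(parts)
```

Called with the values from the example cell, this reproduces that cell; called with no arguments, it returns an empty string, so the Notes column stays blank.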
Parse the user's input to identify PDF file(s) and output preferences.
.pdf files and report count

Use PyMuPDF (fitz) to extract text from each PDF. Run this Python script via Bash:
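```python
import json
import os
import sys

import fitz  # PyMuPDF

pdf_path = sys.argv[1]
doc = fitz.open(pdf_path)

# Collect the plain text of every page with a 1-based page number.
pages = []
for i, page in enumerate(doc):
    text = page.get_text()
    pages.append({"page": i + 1, "text": text})
doc.close()

# os.path.basename handles both "/" and platform-specific separators.
print(json.dumps({"filename": os.path.basename(pdf_path), "total_pages": len(pages), "pages": pages}))
```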
For each PDF, extract all pages and save the JSON output.
Read the extracted text and identify all environmental impact data. This is the core intelligence step.
For small EPDs (<=30 pages): Process all pages at once.
For large EPDs (>30 pages): Process in chunks of 15 pages. Carry forward context between chunks.
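The size rule above can be sketched as follows, assuming `pages` is the list of `{"page", "text"}` dicts produced by the extraction script. The function name `chunk_pages` and its parameter names are illustrative. How context is carried between chunks is not specified here, so the sketch only does the splitting.

```python
def chunk_pages(pages, single_pass_limit=30, chunk_size=15):
    """Return a list of page chunks.

    EPDs with at most `single_pass_limit` pages come back as one chunk;
    longer ones are split into consecutive chunks of `chunk_size` pages.
    """
    if len(pages) <= single_pass_limit:
        return [pages]
    return [pages[i:i + chunk_size] for i in range(0, len(pages), chunk_size)]
```

A 31-page EPD would yield chunks of 15, 15, and 1 pages, while a 30-page EPD is processed in a single pass.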
Parsing instructions:
Identify the EPD structure — EPDs typically have these sections:
Extract product identity first (manufacturer, product name, declared unit, functional unit). These are usually on pages 1-2.
Extract EPD metadata — registration number, program operator, PCR, standard, dates, system boundary. Usually on page 1 or in a header/sidebar.
Find and parse impact indicator tables — This is the most critical step:
Extract resource use — PERE, PENRE, fresh water, waste. Usually in a separate table following the impact indicators.
Extract additional data — recycled content %, manufacturing plant, country of origin, LCA software, verifier name.
Determine LEED eligibility — based on EPD type and verification status.
Leave fields blank rather than guessing — if a field isn't in the EPD, leave it empty.
Some EPDs declare impacts for multiple products, product groups, or concrete mix designs. Create one row per product/variant:
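A minimal sketch of the one-row-per-variant expansion, assuming the EPD-level fields (manufacturer, program operator, dates, and so on) have been collected once in a `shared` dict and each variant contributes only the fields that differ. The name `expand_variants` is hypothetical.

```python
def expand_variants(shared, variants):
    """Build one row per variant by overlaying it on the shared fields."""
    rows = []
    for variant in variants:
        row = dict(shared)   # copy so the shared dict is never mutated
        row.update(variant)  # variant-specific values win
        rows.append(row)
    return rows
```

For a concrete EPD covering three mix designs, `variants` would hold three dicts, each with its own Product Name and impact values, and the result would be three rows that repeat the same manufacturer and EPD metadata.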
Show a summary table for each parsed EPD:
## EPD Parse Results
### holcim-readymix-epd.pdf
| Field | Value |
|-------|-------|
| Product | ReadyMix Concrete — 4000 PSI |
| Manufacturer | Holcim |
| Declared Unit | 1 m3 |
| GWP (A1-A3) | 312 kg CO2e |
| System Boundary | Cradle-to-gate |
| Program Operator | NSF |
| Valid | 2024-01-15 to 2029-01-15 |
| LEED Eligible | Yes |
Products extracted: 3 (3000 PSI, 4000 PSI, 5000 PSI)
Ask: "Does this look correct? Should I adjust anything before saving?"
Ask the user (if not already specified): "Where should I save this?"
Options:
./epd-data-YYYY-MM-DD.csv)

When saving to CSV, use the 42-column header:
EPD Link,Manufacturer,Product Name,Description,Declared Unit,Functional Unit,CSI Division,Material Category,EPD Registration No.,Program Operator,PCR Reference,PCR Expiry,Standard,System Boundary,Valid From,Valid To,GWP-total (A1-A3),GWP-fossil (A1-A3),GWP-biogenic (A1-A3),ODP (A1-A3),AP (A1-A3),EP (A1-A3),GWP (A4-A5),GWP (B1-B7),GWP (C1-C4),GWP (D),GWP-total (all stages),POCP (A1-A3),PERE (A1-A3),PENRE (A1-A3),Total Energy (A1-A3),FW (A1-A3),Recycled Content,Waste (A1-A3),LEED Eligible,EC3 ID,Plant/Facility,Country,Parsed At,Tags,Notes,Source
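Writing rows under this header could look like the sketch below, which uses the standard `csv` module. It assumes rows are dicts keyed by column name, that missing fields should be written as blanks, and that rows are appended to an existing file rather than overwriting it. The names `default_csv_path` and `append_rows` are illustrative.

```python
import csv
import os
from datetime import date

HEADER = (
    "EPD Link,Manufacturer,Product Name,Description,Declared Unit,Functional Unit,"
    "CSI Division,Material Category,EPD Registration No.,Program Operator,"
    "PCR Reference,PCR Expiry,Standard,System Boundary,Valid From,Valid To,"
    "GWP-total (A1-A3),GWP-fossil (A1-A3),GWP-biogenic (A1-A3),ODP (A1-A3),"
    "AP (A1-A3),EP (A1-A3),GWP (A4-A5),GWP (B1-B7),GWP (C1-C4),GWP (D),"
    "GWP-total (all stages),POCP (A1-A3),PERE (A1-A3),PENRE (A1-A3),"
    "Total Energy (A1-A3),FW (A1-A3),Recycled Content,Waste (A1-A3),"
    "LEED Eligible,EC3 ID,Plant/Facility,Country,Parsed At,Tags,Notes,Source"
).split(",")


def default_csv_path(today=None):
    """Return ./epd-data-YYYY-MM-DD.csv for the given (or current) date."""
    today = today or date.today()
    return f"./epd-data-{today.isoformat()}.csv"


def append_rows(path, rows):
    """Append rows to the CSV, writing the header only for a new file."""
    new_file = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as f:
        # restval="" keeps undeclared fields blank; unknown keys are dropped.
        writer = csv.DictWriter(f, fieldnames=HEADER, restval="",
                                extrasaction="ignore")
        if new_file:
            writer.writeheader()
        writer.writerows(rows)
```

Using `csv.DictWriter` rather than manual string joins matters for the EPD Link column: a `=HYPERLINK(url, "EPD")` formula contains a comma and quotes, which the writer escapes so the row still has 42 fields.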
EXPIRED — valid to YYYY-MM-DD. Still parse the data.

After processing, always report:
Parsed: X products from Y EPD PDF(s)
- filename.pdf: N products extracted
- filename2.pdf: M products extracted
Issues: [list any problems — expired, scanned, missing tables, etc.]
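The report above can be generated from per-file counts, as in this sketch. It assumes `results` is a list of `(filename, product_count)` pairs and `issues` is a list of short strings; the function name `format_report` and the `none` placeholder for an empty issue list are assumptions, not part of the template.

```python
def format_report(results, issues):
    """Render the end-of-run summary in the layout shown above."""
    total = sum(count for _, count in results)
    lines = [f"Parsed: {total} products from {len(results)} EPD PDF(s)"]
    for filename, count in results:
        lines.append(f"- {filename}: {count} products extracted")
    # Assumption: print "none" when there is nothing to flag.
    lines.append("Issues: " + ("; ".join(issues) if issues else "none"))
    return "\n".join(lines)
```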