Reads and extracts structured engineering data from technical documents (PDFs, Word, Excel, CSV) and engineering images/drawings (P&IDs, vendor datasheets, mechanical arrangements, performance maps). USE WHEN: a user provides engineering documents or images — equipment data sheets, technical requirements, design basis, well test reports, P&ID descriptions, inspection reports, standards, vendor drawings, compressor maps, phase envelopes — and needs structured data for process simulation. Covers document classification, extraction patterns by document type, image/figure analysis with view_image, unit normalization, data quality scoring, and output formats.
Extract structured engineering data from technical documents and convert it into formats usable by process simulation, mechanical design, and engineering analysis.
Classify → Extract → Normalize → Validate → Output
- Classify the document type to select the right extraction strategy
- Extract raw data using format-specific tools (PDF, Word, Excel)
- Normalize units, component names, and field names to standard conventions
- Validate extracted data against physical bounds and completeness checks
- Output structured JSON/dict for downstream consumption
When a document is provided, classify it first:
| Document Type | Identifying Features |
|---|
| Key Data to Extract |
|---|
| Equipment Data Sheet | API/ASME header, tag numbers, design conditions table | Design T/P, materials, dimensions, nozzle schedule |
| Technical Requirement (TR) | Numbered clauses, "shall"/"should" language, design factors | Design rules, factors, material constraints, limits |
| Design Basis | Operating envelope, fluid composition tables, flow scheme description | Compositions, T/P ranges, flow rates, design life |
| Heat & Mass Balance | Stream table with T, P, flow, composition columns | Stream conditions, equipment duties, compositions |
| Well Test Report | Choke sizes, flow rates vs time, GOR, water cut | Flow rates, compositions, reservoir P/T, productivity |
| P&ID / PFD Description | Equipment tags, line numbers, instrument tags, connectivity | Topology, instrumentation, control philosophy |
| Piping Specification | Material classes, pressure ratings, wall schedules | Pipe sizes, materials, ratings, corrosion allowance |
| Inspection Report | Thickness measurements, corrosion rates, anomaly locations | Wall thickness, corrosion rate, remaining life |
| Material Certificate | Heat numbers, SMYS/SMTS, chemical composition, Charpy values | Mechanical properties, chemical analysis, grade |
| Standards Document | ISO/API/ASME/DNV/NORSOK header, normative references | Design formulas, factors, limits, test requirements |
| Vendor Quotation | Equipment specs, pricing, delivery, performance curves | Performance data, dimensions, weight, cost |
| Operating Procedure | Step-by-step instructions, setpoints, alarm limits | Operating envelope, control setpoints, trip values |
| Engineering Drawing (P&ID) | Piping & instrumentation diagram — lines, valves, instruments, tags visually laid out | Equipment topology, valve tags, instrument tags, line sizes, piping class, connection types |
| Mechanical Arrangement Drawing | GA/section/plan views with dimensions, elevations, nozzle locations | Physical dimensions, elevations, nozzle orientations, piping run lengths, standpipe geometry |
| Vendor API Datasheet (image) | API 610/614/617/692 format data sheets rendered as images in PDFs | Seal type, operating conditions, leakage rates, gas supply requirements, material specs |
| Performance Map / Curve | Compressor maps, pump curves, phase envelopes, operating windows | Head vs flow, efficiency, surge line, operating point, cricondentherm/bar |
| Process Flow Diagram (image) | Visual PFD with equipment symbols, stream arrows, T/P/flow annotations | Equipment sequence, stream conditions, control valves, key operating parameters |
When document type isn't stated, look for these patterns:
DOCUMENT_SIGNATURES = {
"equipment_data_sheet": [
r"(?i)data\s*sheet", r"(?i)tag\s*no", r"(?i)design\s+press",
r"(?i)operating\s+press", r"(?i)nozzle\s+schedule"
],
"technical_requirement": [
r"(?i)technical\s+requirement", r"(?i)\bTR[\s-]?\d+",
r"(?i)\bshall\b.*\bdesign", r"(?i)design\s+factor"
],
"design_basis": [
r"(?i)design\s+basis", r"(?i)basis\s+of\s+design",
r"(?i)operating\s+envelope", r"(?i)ambient\s+conditions"
],
"heat_mass_balance": [
r"(?i)heat\s+.*mass\s+balance", r"(?i)stream\s+(table|summary)",
r"(?i)mol\s*%|mole\s+frac", r"(?i)vapour\s+fraction"
],
"well_test": [
r"(?i)well\s+test", r"(?i)choke\s+size", r"(?i)GOR",
r"(?i)productivity\s+index", r"(?i)IPR"
],
"inspection_report": [
r"(?i)inspection\s+report", r"(?i)wall\s+thickness",
r"(?i)corrosion\s+rate", r"(?i)remaining\s+life"
],
"material_certificate": [
r"(?i)material\s+cert", r"(?i)heat\s+number",
r"(?i)SMYS|SMTS", r"(?i)charpy|impact\s+test"
],
"piping_spec": [
r"(?i)piping\s+class", r"(?i)material\s+class",
r"(?i)line\s+class", r"(?i)pressure\s+rating"
],
"engineering_drawing_pid": [
r"(?i)piping\s+.*instrument.*diagram", r"(?i)P&ID",
r"(?i)P\s*&\s*I\s*D", r"(?i)instrument\s+diagram"
],
"mechanical_arrangement": [
r"(?i)general\s+arrangement", r"(?i)GA\s+drawing",
r"(?i)section\s+view", r"(?i)plan\s+view",
r"(?i)mechanical\s+arrangement", r"(?i)elevation\s+view"
],
"vendor_api_datasheet_image": [
r"(?i)API\s+6[19][0247]", r"(?i)dry\s+gas\s+seal",
r"(?i)seal\s+gas", r"(?i)vendor\s+data\s*sheet",
r"(?i)mechanical\s+seal", r"(?i)compressor\s+data\s*sheet"
],
"performance_map": [
r"(?i)performance\s+(map|curve)", r"(?i)compressor\s+map",
r"(?i)pump\s+curve", r"(?i)phase\s+envelope",
r"(?i)head\s+vs\s+flow", r"(?i)surge\s+line",
r"(?i)operating\s+point", r"(?i)cricondentherm"
],
}
Recommended library: pdfplumber (table-aware), fallback to pymupdf (fitz)
import pdfplumber
def extract_pdf(filepath):
"""Extract text and tables from a PDF document."""
pages = []
tables = []
with pdfplumber.open(filepath) as pdf:
for i, page in enumerate(pdf.pages):
# Extract text
text = page.extract_text() or ""
pages.append({"page": i + 1, "text": text})
# Extract tables (pdfplumber detects table boundaries)
for table in page.extract_tables():
if table and len(table) > 1:
headers = [str(h).strip() if h else "" for h in table[0]]
rows = []
for row in table[1:]:
rows.append([str(c).strip() if c else "" for c in row])
tables.append({
"page": i + 1,
"headers": headers,
"rows": rows
})
return {"pages": pages, "tables": tables}
Extracting figures and diagrams from PDFs (PREFERRED):
Use devtools/pdf_to_figures.py to convert PDF pages to PNG images for visual
analysis by AI tools (view_image). This is the fastest way to inspect
engineering drawings, P&IDs, charts, tables, and compressor maps:
# CLI — convert all PDFs in the references folder
python devtools/pdf_to_figures.py step1_scope_and_research/references/ --outdir figures/
# CLI — specific pages only (1-indexed)
python devtools/pdf_to_figures.py path/to/document.pdf --pages 1 3 5 --outdir figures/
# Python / Notebook — programmatic use
from devtools.pdf_to_figures import pdf_to_pngs, pdf_folder_to_pngs
pngs = pdf_to_pngs("references/compressor_sketch.pdf", outdir="figures/")
all_pngs = pdf_folder_to_pngs("step1_scope_and_research/references/", outdir="figures/")
Then use view_image on the extracted PNGs to read diagrams, extract data from
charts, identify equipment layouts, and analyze engineering drawings. Requires pymupdf.
Handling scanned PDFs:
extract_text() returns empty, the PDF is likely scanned/image-basedpdf_to_figures.py to render pages as images, then use view_image for AI readingpytesseract + pdf2imageLibrary: python-docx
from docx import Document
def extract_docx(filepath):
"""Extract paragraphs and tables from a Word document."""
doc = Document(filepath)
paragraphs = []
tables = []
for para in doc.paragraphs:
if para.text.strip():
paragraphs.append({
"text": para.text.strip(),
"style": para.style.name,
"level": _heading_level(para.style.name)
})
for i, table in enumerate(doc.tables):
headers = [cell.text.strip() for cell in table.rows[0].cells]
rows = []
for row in table.rows[1:]:
rows.append([cell.text.strip() for cell in row.cells])
tables.append({"index": i, "headers": headers, "rows": rows})
return {"paragraphs": paragraphs, "tables": tables}
def _heading_level(style_name):
"""Return heading level (1-9) or 0 for body text."""
if style_name.startswith("Heading"):
try:
return int(style_name.split()[-1])
except ValueError:
return 0
return 0
Libraries: openpyxl for .xlsx, pandas for both
import pandas as pd
def extract_excel(filepath, sheet_name=None):
"""Extract tables from Excel workbook."""
sheets = {}
xls = pd.ExcelFile(filepath)
target_sheets = [sheet_name] if sheet_name else xls.sheet_names
for name in target_sheets:
df = pd.read_excel(xls, sheet_name=name)
# Drop fully empty rows/columns
df = df.dropna(how="all").dropna(axis=1, how="all")
sheets[name] = {
"headers": list(df.columns),
"rows": df.values.tolist(),
"shape": list(df.shape)
}
return sheets
def extract_csv(filepath):
"""Extract table from CSV file."""
df = pd.read_csv(filepath)
df = df.dropna(how="all").dropna(axis=1, how="all")
return {
"headers": list(df.columns),
"rows": df.values.tolist(),
"shape": list(df.shape)
}
Equipment data sheets follow standard API/ISO layouts. Key sections:
DATASHEET_SECTIONS = {
"general": {
"patterns": [
r"(?i)tag\s*(?:no|number|#)[\s:]+(\S+)",
r"(?i)service[\s:]+(.+?)(?:\n|$)",
r"(?i)quantity[\s:]+(\d+)",
r"(?i)type[\s:]+(.+?)(?:\n|$)",
],
"fields": ["tag_number", "service", "quantity", "equipment_type"]
},
"design_conditions": {
"patterns": [
r"(?i)design\s+press(?:ure)?[\s:]+([0-9.]+)\s*(bar|bara|barg|psi|psig|MPa|kPa)",
r"(?i)design\s+temp(?:erature)?[\s:]+([0-9.+-]+)\s*(°?[CFK]|degC|degF)",
r"(?i)oper(?:ating)?\s+press(?:ure)?[\s:]+([0-9.]+)\s*(bar|bara|barg|psi|psig|MPa|kPa)",
r"(?i)oper(?:ating)?\s+temp(?:erature)?[\s:]+([0-9.+-]+)\s*(°?[CFK]|degC|degF)",
],
"fields": ["design_pressure", "design_temperature",
"operating_pressure", "operating_temperature"]
},
"dimensions": {
"patterns": [
r"(?i)(?:ID|inner\s+diameter|internal\s+diameter)[\s:]+([0-9.]+)\s*(mm|m|in|inch|ft)",
r"(?i)(?:OD|outer\s+diameter|outside\s+diameter)[\s:]+([0-9.]+)\s*(mm|m|in|inch|ft)",
r"(?i)length[\s:]+([0-9.]+)\s*(mm|m|ft)",
r"(?i)(?:t-t|tan.*tan|seam.*seam|height)[\s:]+([0-9.]+)\s*(mm|m|ft)",
r"(?i)wall\s+thick(?:ness)?[\s:]+([0-9.]+)\s*(mm|in|inch)",
],
"fields": ["inner_diameter", "outer_diameter", "length",
"height", "wall_thickness"]
},
"material": {
"patterns": [
r"(?i)(?:shell|body)\s+material[\s:]+(\S+(?:\s+\S+)?)",
r"(?i)material\s+grade[\s:]+(\S+)",
r"(?i)(?:SA|ASTM|A)\s*-?\s*(\d+)(?:\s*-?\s*(?:Gr\.?\s*)?(\S+))?",
],
"fields": ["shell_material", "material_grade"]
}
}
Table extraction for data sheets:
def extract_datasheet_table(table):
"""Parse an equipment data sheet table into structured fields.
Handles common layouts where col 0 is field name, col 1+ are values.
Also handles multi-case tables (normal/upset/design columns).
"""
result = {}
headers = table["headers"]
for row in table["rows"]:
if len(row) >= 2 and row[0]:
field_name = row[0].strip().lower()
# Multi-case: create dict with case names as keys
if len(headers) > 2 and len(row) > 2:
for i, header in enumerate(headers[1:], start=1):
if i < len(row) and row[i]:
case_key = header.strip() if header else f"case_{i}"
if field_name not in result:
result[field_name] = {}
result[field_name][case_key] = _parse_value(row[i])
else:
result[field_name] = _parse_value(row[1])
return result
def _parse_value(text):
"""Parse a text value, attempting numeric conversion."""
text = str(text).strip()
if not text or text == "-" or text.lower() == "n/a":
return None
try:
return float(text.replace(",", ""))
except ValueError:
return text
Stream tables are the most common source of process data. They vary in layout:
Layout A — Streams as columns:
| Property | Stream 1 | Stream 2 | Stream 3 |
|---|---|---|---|
| Temperature (°C) | 40.0 | 15.0 | 120.0 |
| Pressure (bara) | 80.0 | 78.0 | 150.0 |
Layout B — Streams as rows:
| Stream | T (°C) | P (bara) | Flow (kg/h) |
|---|---|---|---|
| Feed | 40.0 | 80.0 | 75000 |
def extract_stream_table(table, layout="auto"):
"""Extract stream data from a heat & mass balance table.
Args:
table: dict with "headers" and "rows" keys
layout: "columns" (streams as columns), "rows" (streams as rows), or "auto"
Returns:
dict mapping stream names to their properties
"""
headers = table["headers"]
rows = table["rows"]
if layout == "auto":
layout = _detect_stream_layout(headers, rows)
streams = {}
if layout == "columns":
# First column is property name, remaining columns are streams
stream_names = [h.strip() for h in headers[1:] if h and h.strip()]
for row in rows:
if not row[0]:
continue
prop_name, prop_unit = _parse_property_header(row[0])
for i, stream_name in enumerate(stream_names):
col_idx = i + 1
if col_idx < len(row) and row[col_idx]:
if stream_name not in streams:
streams[stream_name] = {}
streams[stream_name][prop_name] = {
"value": _parse_value(row[col_idx]),
"unit": prop_unit
}
else:
# First column (or first few) is stream name, rest are properties
for row in rows:
if not row[0]:
continue
stream_name = str(row[0]).strip()
streams[stream_name] = {}
for j, header in enumerate(headers[1:], start=1):
if j < len(row) and row[j]:
prop_name, prop_unit = _parse_property_header(header)
streams[stream_name][prop_name] = {
"value": _parse_value(row[j]),
"unit": prop_unit
}
return streams
def _detect_stream_layout(headers, rows):
"""Heuristic: if first header looks like a property label, it's column layout."""
first_header = str(headers[0]).strip().lower()
property_keywords = ["property", "parameter", "description", "item",
"stream", "temperature", "pressure", "flow"]
if any(kw in first_header for kw in property_keywords[:4]):
return "columns"
return "rows"
def _parse_property_header(text):
"""Split 'Temperature (°C)' into ('temperature', '°C')."""
import re
text = str(text).strip()
match = re.match(r"(.+?)\s*[\(\[]([^\)\]]+)[\)\]]", text)
if match:
return match.group(1).strip().lower(), match.group(2).strip()
return text.lower(), ""
# Common header variations for composition columns
COMPOSITION_HEADERS = {
"component": [
r"(?i)component", r"(?i)species", r"(?i)compound", r"(?i)name"
],
"mole_fraction": [
r"(?i)mol\s*%", r"(?i)mole\s+frac", r"(?i)mol\s+frac",
r"(?i)y\s*[\(\[]", r"(?i)x\s*[\(\[]", r"(?i)z\s*[\(\[]",
r"(?i)composition\s*\(mol", r"(?i)molar"
],
"mass_fraction": [
r"(?i)wt\s*%", r"(?i)mass\s+frac", r"(?i)weight\s+frac",
r"(?i)mass\s*%"
],
"volume_fraction": [
r"(?i)vol\s*%", r"(?i)volume\s+frac", r"(?i)liq\s+vol\s*%"
]
}
# Component name mapping (common variations → NeqSim names)
COMPONENT_NAME_MAP = {
# Hydrocarbons
"c1": "methane", "ch4": "methane", "methane": "methane",
"c2": "ethane", "c2h6": "ethane", "ethane": "ethane",
"c3": "propane", "c3h8": "propane", "propane": "propane",
"ic4": "i-butane", "i-c4": "i-butane", "isobutane": "i-butane",
"nc4": "n-butane", "n-c4": "n-butane",
"ic5": "i-pentane", "i-c5": "i-pentane", "isopentane": "i-pentane",
"nc5": "n-pentane", "n-c5": "n-pentane",
"nc6": "n-hexane", "n-c6": "n-hexane", "hexane": "n-hexane",
"nc7": "n-heptane", "n-c7": "n-heptane", "heptane": "n-heptane",
"nc8": "n-octane", "n-c8": "n-octane", "octane": "n-octane",
"nc9": "n-nonane", "n-c9": "n-nonane",
"nc10": "nC10", "n-c10": "nC10",
"cyclohexane": "cyclohexane", "c-c6": "cyclohexane",
"benzene": "benzene", "toluene": "toluene",
# Non-hydrocarbons
"co2": "CO2", "carbon dioxide": "CO2",
"h2s": "H2S", "hydrogen sulfide": "H2S", "hydrogen sulphide": "H2S",
"n2": "nitrogen", "nitrogen": "nitrogen",
"h2": "hydrogen", "hydrogen": "hydrogen",
"h2o": "water", "water": "water",
"o2": "oxygen", "oxygen": "oxygen",
"ar": "argon", "argon": "argon",
"he": "helium", "helium": "helium",
"co": "CO",
# Glycols
"meg": "MEG", "monoethylene glycol": "MEG",
"deg": "DEG", "diethylene glycol": "DEG",
"teg": "TEG", "triethylene glycol": "TEG",
# Mercaptans
"methyl mercaptan": "methyl-mercaptan", "ch3sh": "methyl-mercaptan",
"ethyl mercaptan": "ethyl-mercaptan", "c2h5sh": "ethyl-mercaptan",
}
def normalize_component_name(raw_name):
"""Map a raw component name to the NeqSim standard name."""
key = raw_name.strip().lower().replace("_", " ").replace("-", " ")
# Direct lookup
if key in COMPONENT_NAME_MAP:
return COMPONENT_NAME_MAP[key]
# Try removing spaces
key_nospace = key.replace(" ", "")
if key_nospace in COMPONENT_NAME_MAP:
return COMPONENT_NAME_MAP[key_nospace]
# Return original (user may need to map manually)
return raw_name.strip()
def extract_composition(table):
"""Extract fluid composition from a table.
Returns dict of {neqsim_name: mole_fraction} and metadata about
the composition basis (mole%, mass%, etc.).
"""
import re
headers = table["headers"]
# Find component column and value column
comp_col = None
value_col = None
basis = "mole_fraction"
for i, h in enumerate(headers):
h_str = str(h).strip()
for pattern in COMPOSITION_HEADERS["component"]:
if re.search(pattern, h_str):
comp_col = i
break
for basis_type in ["mole_fraction", "mass_fraction", "volume_fraction"]:
for pattern in COMPOSITION_HEADERS[basis_type]:
if re.search(pattern, h_str):
value_col = i
basis = basis_type
break
# Fallback: assume col 0 = component, col 1 = value
if comp_col is None:
comp_col = 0
if value_col is None:
value_col = 1
composition = {}
for row in table["rows"]:
if comp_col < len(row) and value_col < len(row):
name = str(row[comp_col]).strip()
value = _parse_value(row[value_col])
if name and value is not None:
neqsim_name = normalize_component_name(name)
composition[neqsim_name] = float(value)
# Normalize: if values sum to ~100, convert to fractions
total = sum(composition.values())
if total > 1.5: # Likely percentages
composition = {k: v / 100.0 for k, v in composition.items()}
total = sum(composition.values())
return {
"components": composition,
"basis": basis,
"total": total,
"normalized": abs(total - 1.0) < 0.02
}
TR documents contain design rules expressed as requirements. Extract:
def extract_requirements(text):
"""Extract numbered requirements from a technical requirements document.
Looks for 'shall' and 'should' statements with associated values.
"""
import re
requirements = []
# Pattern: numbered clause + requirement text
clause_pattern = re.compile(
r"(\d+(?:\.\d+)*)\s+(.*?(?:shall|should|must|may)\s+.*?)(?=\n\d+(?:\.\d+)*\s|\Z)",
re.IGNORECASE | re.DOTALL
)
for match in clause_pattern.finditer(text):
clause_num = match.group(1)
req_text = match.group(2).strip()
# Extract numeric values with units
values = re.findall(
r"(\d+(?:\.\d+)?)\s*(bar[ag]?|°?[CFK]|mm|m|kg|psi|MPa|kPa|%|hr|min)",
req_text
)
# Classify requirement type
if re.search(r"(?i)shall\b", req_text):
level = "mandatory"
elif re.search(r"(?i)should\b", req_text):
level = "recommended"
else:
level = "informative"
requirements.append({
"clause": clause_num,
"text": req_text,
"level": level,
"values": [{"value": float(v), "unit": u} for v, u in values]
})
return requirements
WELL_TEST_FIELDS = {
"flow_rate": [
r"(?i)(?:oil|gas|liquid|water)\s+(?:flow\s+)?rate[\s:]+([0-9.,]+)\s*(Sm3/d|bbl/d|MMSCFD|kg/hr|m3/d)",
],
"gor": [
r"(?i)GOR[\s:]+([0-9.,]+)\s*(Sm3/Sm3|scf/bbl|m3/m3)",
],
"water_cut": [
r"(?i)water\s*cut[\s:]+([0-9.,]+)\s*(%)?",
r"(?i)BSW[\s:]+([0-9.,]+)\s*(%)?",
],
"reservoir_pressure": [
r"(?i)reservoir\s+press(?:ure)?[\s:]+([0-9.,]+)\s*(bar[ag]?|psi[ag]?|MPa|kPa)",
],
"reservoir_temperature": [
r"(?i)reservoir\s+temp(?:erature)?[\s:]+([0-9.,]+)\s*(°?[CFK])",
],
"wellhead_pressure": [
r"(?i)(?:WHP|wellhead\s+press(?:ure)?)[\s:]+([0-9.,]+)\s*(bar[ag]?|psi[ag]?|MPa)",
],
"wellhead_temperature": [
r"(?i)(?:WHT|wellhead\s+temp(?:erature)?)[\s:]+([0-9.,]+)\s*(°?[CFK])",
],
"choke_size": [
r"(?i)choke[\s:]+([0-9.,]+)(?:/64)?[\s]*(mm|inch|in|/64)",
],
"productivity_index": [
r"(?i)(?:PI|productivity\s+index)[\s:]+([0-9.,]+)\s*(Sm3/d/bar|bbl/d/psi)",
],
}
def extract_thickness_data(table):
"""Extract wall thickness measurements from an inspection table.
Common format: Location | Nominal (mm) | Measured (mm) | Min (mm) | Corrosion Rate
"""
result = []
for row in table["rows"]:
entry = {}
for i, header in enumerate(table["headers"]):
h = str(header).strip().lower()
if i < len(row) and row[i]:
val = _parse_value(row[i])
if "location" in h or "position" in h or "point" in h:
entry["location"] = str(row[i]).strip()
elif "nominal" in h or "original" in h:
entry["nominal_mm"] = val
elif "measured" in h or "actual" in h or "remaining" in h:
entry["measured_mm"] = val
elif "min" in h:
entry["minimum_mm"] = val
elif "corrosion" in h and "rate" in h:
entry["corrosion_rate_mm_yr"] = val
if entry:
result.append(entry)
return result
Many engineering documents contain critical data embedded in images rather than
text — P&IDs, mechanical arrangement drawings, vendor datasheets rendered as scanned
PDFs, compressor performance maps, and phase envelopes. Use the view_image tool
for multimodal analysis of these visual documents.
PDF/Image Document → pdf_to_figures.py → PNG pages → view_image → Structured Data
Step 1: Convert PDF pages to images
# Single file
python devtools/pdf_to_figures.py path/to/document.pdf --outdir figures/ --dpi 200
# Specific pages
python devtools/pdf_to_figures.py document.pdf --outdir figures/ --pages 1,3,5
# Batch (all PDFs in a folder)
python devtools/pdf_to_figures.py references/ --outdir figures/
Step 2: View and analyze each page with view_image
Use view_image on each extracted PNG. For each image, systematically extract:
Step 3: Structure the extracted data
Convert visual observations into the standard extraction result JSON format
(Section 6.1), adding an image_analysis block.
| Drawing Type | What to Look For | Key Data to Extract |
|---|---|---|
| P&ID (Piping & Instrumentation) | Equipment symbols, instrument bubbles, line numbers, valve symbols, piping class annotations | Equipment tags + types, valve tags + types (gate/globe/ball/check), instrument tags + functions, line numbers + sizes + ratings, control loops, interlock references |
| Mechanical Arrangement (GA) | Plan/section/elevation views, dimension lines, nozzle positions, structural supports | Overall dimensions (L×W×H), nozzle sizes + orientations, standpipe lengths + bore sizes, access platform locations, foundation bolt patterns |
| Vendor API Datasheet (image) | API format tables (610/614/617/692), operating conditions, seal/bearing details | Design T/P, seal type, shaft speed, gas supply requirements, leakage rates, material specs, utility requirements |
| Compressor/Pump Performance Map | Head vs flow curves, efficiency lines, surge line, stonewall, rated point marker | Rated head/flow/power, surge point, maximum continuous speed, efficiency at rated/off-design, operating window boundaries |
| Phase Envelope | Bubble/dew point curves, cricondentherm, cricondenbar, operating path markers, quality lines | Cricondentherm (°C), cricondenbar (bara), critical point (T,P), operating point position relative to envelope, two-phase region extent |
| Process Flow Diagram (PFD) | Equipment blocks, stream arrows, T/P/flow labels, utility connections | Equipment sequence, stream T/P/flow, utility duties, recycle loops, control valve locations |
| Seal Gas / Utility Piping Schematic | Small-bore piping, filters, orifices, check valves, vent/drain connections | Flow path topology, pipe sizes (often 3/8"–1"), filter positions, orifice sizes, vent/drain locations, instrumentation (dP transmitters, pressure gauges) |
When reading a P&ID image with view_image, extract data in this structured format:
PID_EXTRACTION = {
"document": {
"drawing_number": "SOK-xxxxxx",
"revision": "C2",
"title": "Oil/Seal Gas Piping — Compressor BCL-304/D",
"area": "26",
"unit": "GIC Compressor"
},
"equipment": [
{"tag": "BCL-304/D", "type": "Centrifugal Compressor", "service": "Gas Injection"},
{"tag": "FLT-26110", "type": "Coalescing Filter", "service": "Seal Gas Supply"},
],
"valves": [
{"tag": "VB26-0130", "type": "Ball Valve", "size_inch": 0.75,
"class": "2500 RTJ", "service": "Seal Gas Supply Isolation"},
{"tag": "XV-S26102.05", "type": "Shutdown Valve",
"service": "Primary Seal Gas", "normally": "open"},
],
"instruments": [
{"tag": "PT-S26102.03", "type": "Pressure Transmitter",
"service": "Seal Gas Supply Pressure", "range": "0-200 barg"},
{"tag": "TP-S26102.11", "type": "Temperature Point",
"service": "Primary Vent Temperature"},
{"tag": "FT-S26102.01", "type": "Flow Transmitter",
"service": "Seal Gas Flow"},
],
"piping": [
{"line_number": "26-G1H-001", "size_inch": 0.75,
"piping_class": "G1H", "rating": "2500#",
"from": "FLT-26110", "to": "BCL-304/D NDE Seal"},
],
"connections": [
{"from": "FLT-26110.outlet", "to": "BCL-304/D.seal_gas_inlet",
"type": "seal_gas_supply"},
{"from": "BCL-304/D.primary_vent", "to": "VB26-0180",
"type": "seal_vent", "destination": "Flare/Vent"},
]
}
For vendor datasheets rendered as images (common for API 692 seal datasheets, API 617 compressor datasheets):
VENDOR_DATASHEET_EXTRACTION = {
"equipment_tag": "BCL-304/D",
"vendor": "John Crane / Flowserve / EagleBurgmann",
"api_standard": "API 692",
"data_sections": {
"operating_conditions": {
"suction_pressure_barg": 125.0,
"discharge_pressure_barg": 190.0,
"seal_gas_supply_pressure_barg": 155.0,
"operating_temperature_C": 45.0,
"shaft_speed_rpm": 11000,
},
"seal_data": {
"seal_type": "Tandem dry gas seal",
"seal_arrangement": "API Plan 74 (primary) + API Plan 76 (secondary)",
"primary_leakage_nm3hr": 12.0,
"secondary_leakage_nm3hr": 2.0,
"buffer_gas": "Nitrogen",
},
"gas_supply": {
"required_flow_nm3hr": 25.0,
"required_pressure_above_ref_barg": 3.5,
"filtration_micron": 3,
"gas_temperature_range_C": [-10, 60],
},
"materials": {
"face_material": "Silicon Carbide",
"seat_material": "Carbon",
"o_ring_material": "Viton / FFKM",
"spring_material": "Inconel 718",
}
}
}
When analyzing compressor maps or phase envelopes from images:
PERFORMANCE_MAP_EXTRACTION = {
"chart_type": "compressor_map", # or "phase_envelope", "pump_curve"
"axes": {
"x": {"label": "Inlet Volume Flow", "unit": "m3/hr"},
"y": {"label": "Polytropic Head", "unit": "kJ/kg"}
},
"rated_point": {
"flow": 5000, "head": 85.0, "efficiency_pct": 82.0, "speed_rpm": 11000
},
"surge_point": {
"flow": 3200, "head": 92.0
},
"curves": [
{"speed_rpm": 11000, "points": [
{"flow": 3200, "head": 92.0},
{"flow": 4000, "head": 89.0},
{"flow": 5000, "head": 85.0},
{"flow": 6000, "head": 78.0},
]},
],
"operating_window": {
"min_flow": 3500, "max_flow": 5800,
"min_head": 70.0, "max_head": 95.0
}
}
PHASE_ENVELOPE_EXTRACTION = {
"chart_type": "phase_envelope",
"axes": {
"x": {"label": "Temperature", "unit": "C"},
"y": {"label": "Pressure", "unit": "bar"}
},
"critical_point": {"temperature_C": -82.0, "pressure_bar": 46.0},
"cricondentherm": {"temperature_C": -5.4, "pressure_bar": 38.0},
"cricondenbar": {"temperature_C": -25.0, "pressure_bar": 55.0},
"key_points": [
{"label": "Operating Point", "temperature_C": 25.0, "pressure_bar": 45.0,
"phase_region": "single_phase_gas"},
{"label": "JT Outlet", "temperature_C": -20.0, "pressure_bar": 3.0,
"phase_region": "single_phase_gas"},
],
"retrograde_region": {
"temperature_range_C": [-40, -5],
"pressure_range_bar": [30, 55],
"max_liquid_fraction_pct": 8.0
}
}
For GA drawings showing physical layout and dimensions:
MECHANICAL_ARRANGEMENT_EXTRACTION = {
"drawing_type": "mechanical_arrangement",
"equipment_tag": "BCL-304",
"views": ["plan", "section_A-A", "elevation_north"],
"overall_dimensions": {
"length_mm": 4500,
"width_mm": 2800,
"height_mm": 3200
},
"nozzles": [
{"tag": "N1", "service": "Suction", "size_inch": 16,
"orientation": "horizontal", "elevation_mm": 1500},
{"tag": "N2", "service": "Discharge", "size_inch": 12,
"orientation": "horizontal", "elevation_mm": 1500},
{"tag": "N5", "service": "Seal Gas Supply", "size_inch": 0.75,
"orientation": "radial", "elevation_mm": 1800},
],
"standpipes_and_drains": [
{"service": "Primary Vent Drain", "od_inch": 1.5, "id_mm": 38,
"length_m": 1.5, "volume_liters": 1.70,
"orientation": "vertical_down", "low_point_elevation_mm": 200},
],
"piping_runs": [
{"from": "BCL-304.N5", "to": "FLT-26110", "pipe_size_inch": 0.75,
"material": "SS316", "approx_length_m": 8.0}
]
}
After extracting data from engineering figures during a task analysis, generate discussion blocks for the report. Each figure discussion follows this template:
FIGURE_DISCUSSION_TEMPLATE = {
"figure": "filename.png",
"title": "Descriptive Title of the Figure",
"observation": "What the figure shows, with specific numbers and features identified.",
"mechanism": "The underlying physical or engineering reason for what is observed.",
"implication": "What this means for the design, operation, or safety assessment.",
"recommendation": "Specific actionable engineering recommendation based on this figure.",
"linked_results": ["key_result_1", "key_result_2"],
"insight_question_ref": "Q3"
}
When to generate figure discussions:
| Figure Source | When Discussion is Needed |
|---|---|
| Simulation output plot (T/P profiles, phase envelopes) | Always for decision-critical figures; proportional for others |
| Vendor datasheet image | When extracted data influences design decisions |
| P&ID image | When topology reveals flow paths critical to the analysis |
| Mechanical drawing | When dimensions affect calculations (e.g., standpipe volumes) |
| Performance map | When operating point proximity to limits is relevant |
| Benchmark comparison plot | Always — explains deviations |
Always use pdf_to_figures.py first — render PDF pages to PNG at 200+ DPI
before attempting to read visual content. Direct text extraction misses drawings.
Systematic scanning — when viewing a P&ID or drawing with view_image,
scan systematically: title block → equipment → instruments → piping → notes.
Don't try to extract everything in one pass.
Cross-reference text and images — use text-extracted data (tables, narrative) to validate what you see in images. If a table says "3/4 inch" but the drawing annotation reads "1 inch", flag the conflict.
Dimensional extraction from drawings — when reading dimensions, note:
Performance map reading — when digitizing curves from performance maps:
Scanned/low-resolution images — if image quality is poor:
Multi-page drawings — large P&IDs often span multiple sheets (1/3, 2/3, 3/3). Track which equipment appears on which sheet and merge topology across sheets.
All extracted values must be normalized to consistent engineering units.
| Quantity | Standard Unit | Symbol |
|---|---|---|
| Temperature | Kelvin | K |
| Pressure | bara | bara |
| Mass flow | kg/hr | kg/hr |
| Volumetric flow (gas) | Sm³/d | Sm3/d |
| Volumetric flow (liquid) | m³/hr | m3/hr |
| Length | m | m |
| Diameter | mm | mm |
| Wall thickness | mm | mm |
| Density | kg/m³ | kg/m3 |
| Viscosity | mPa·s (cP) | cP |
| Power | kW | kW |
| Heat duty | kW | kW |
| Heat transfer coeff. | W/(m²·K) | W/m2K |
| Thermal conductivity | W/(m·K) | W/mK |
def convert_temperature(value, from_unit, to_unit="K"):
"""Convert temperature between C, F, K, R."""
unit = from_unit.strip().replace("°", "").replace("deg", "").upper()
# Convert to Kelvin first
if unit in ("C", "CELSIUS"):
kelvin = value + 273.15
elif unit in ("F", "FAHRENHEIT"):
kelvin = (value - 32) * 5.0 / 9.0 + 273.15
elif unit in ("R", "RANKINE"):
kelvin = value * 5.0 / 9.0
else:
kelvin = value # Assume Kelvin
if to_unit.upper() in ("C", "CELSIUS"):
return kelvin - 273.15
elif to_unit.upper() in ("F", "FAHRENHEIT"):
return (kelvin - 273.15) * 9.0 / 5.0 + 32
return kelvin
def convert_pressure(value, from_unit, to_unit="bara"):
"""Convert pressure to bara."""
unit = from_unit.strip().lower()
conversions_to_bara = {
"bara": 1.0,
"barg": lambda v: v + 1.01325,
"bar": 1.0, # Assume absolute unless 'g' suffix
"psia": 0.0689476,
"psig": lambda v: (v + 14.696) * 0.0689476,
"psi": 0.0689476, # Assume absolute
"mpa": 10.0,
"kpa": 0.01,
"atm": 1.01325,
"mmhg": 0.00133322,
"torr": 0.00133322,
}
factor = conversions_to_bara.get(unit, 1.0)
if callable(factor):
return factor(value)
return value * factor
def convert_length(value, from_unit, to_unit="mm"):
"""Convert length to mm."""
unit = from_unit.strip().lower()
to_mm = {
"mm": 1.0, "cm": 10.0, "m": 1000.0, "km": 1e6,
"in": 25.4, "inch": 25.4, "inches": 25.4,
"ft": 304.8, "feet": 304.8, "foot": 304.8,
"yd": 914.4,
}
mm_val = value * to_mm.get(unit, 1.0)
to_target = {"mm": 1.0, "m": 0.001, "in": 1.0 / 25.4, "ft": 1.0 / 304.8}
return mm_val * to_target.get(to_unit.lower(), 1.0)
def convert_flow(value, from_unit, to_unit="kg/hr"):
"""Convert mass/volumetric flow rates. Approximate — exact conversion needs density."""
unit = from_unit.strip().lower().replace(" ", "")
# Direct mass flow conversions to kg/hr
mass_to_kghr = {
"kg/hr": 1.0, "kg/h": 1.0, "kg/s": 3600.0,
"t/hr": 1000.0, "t/h": 1000.0, "t/d": 1000.0 / 24.0,
"lb/hr": 0.453592, "lb/h": 0.453592,
}
if unit in mass_to_kghr:
return value * mass_to_kghr[unit]
return value # Return as-is with warning
Rate extracted data on completeness for downstream simulation:
def score_extraction_quality(extracted_data, document_type):
"""Score the quality and completeness of extracted data (0-100)."""
scores = {}
if document_type == "design_basis":
required = ["fluid_composition", "temperature", "pressure", "flow_rate"]
optional = ["water_cut", "gor", "h2s_content", "co2_content"]
elif document_type == "equipment_data_sheet":
required = ["tag_number", "design_pressure", "design_temperature", "material"]
optional = ["dimensions", "nozzle_schedule", "weight"]
elif document_type == "heat_mass_balance":
required = ["stream_names", "temperatures", "pressures", "flows"]
optional = ["compositions", "densities", "viscosities"]
else:
required = []
optional = []
found_required = sum(1 for r in required if r in extracted_data and extracted_data[r])
found_optional = sum(1 for o in optional if o in extracted_data and extracted_data[o])
req_score = (found_required / max(len(required), 1)) * 70
opt_score = (found_optional / max(len(optional), 1)) * 30
return {
"total_score": round(req_score + opt_score),
"required_found": found_required,
"required_total": len(required),
"optional_found": found_optional,
"optional_total": len(optional),
"missing_required": [r for r in required if r not in extracted_data or not extracted_data[r]],
"missing_optional": [o for o in optional if o not in extracted_data or not extracted_data[o]],
}
PHYSICAL_BOUNDS = {
"temperature_K": (50.0, 2000.0), # 50 K to 2000 K
"pressure_bara": (0.001, 10000.0), # Near vacuum to ultra-high
"mole_fraction": (0.0, 1.0),
"flow_kg_hr": (0.0, 1e9),
"wall_thickness_mm": (0.1, 500.0),
"diameter_mm": (1.0, 100000.0), # 1 mm to 100 m
"density_kg_m3": (0.01, 25000.0),
"viscosity_cP": (0.001, 1e6),
"corrosion_rate_mm_yr": (0.0, 50.0),
}
def validate_physical_bounds(field_name, value, unit=None):
"""Check if a value is within physically reasonable bounds."""
key = field_name.lower().replace(" ", "_")
for bound_key, (low, high) in PHYSICAL_BOUNDS.items():
if bound_key in key:
if value < low or value > high:
return {
"valid": False,
"field": field_name,
"value": value,
"bounds": (low, high),
"message": f"{field_name}={value} outside bounds [{low}, {high}]"
}
return {"valid": True, "field": field_name, "value": value}
return {"valid": True, "field": field_name, "value": value, "note": "no bounds defined"}
Every extraction produces this standard JSON structure:
{
"source": {
"filename": "design_basis_rev3.pdf",
"format": "pdf",
"pages": 42,
"document_type": "design_basis",
"confidence": 0.85
},
"metadata": {
"document_title": "Design Basis for Gas Processing Facility",
"revision": "Rev 3",
"date": "2024-06-15",
"document_number": "DOC-12345"
},
"fluid_data": {
"compositions": {
"feed_gas": {
"components": {"methane": 0.85, "ethane": 0.07, "propane": 0.03, "CO2": 0.02},
"basis": "mole_fraction",
"total": 1.0
}
},
"conditions": {
"temperature": {"value": 313.15, "unit": "K", "original": "40 °C"},
"pressure": {"value": 80.0, "unit": "bara", "original": "80 bara"}
}
},
"equipment_data": [
{
"tag": "V-100",
"type": "Separator",
"service": "HP Separator",
"design_pressure": {"value": 95.0, "unit": "bara"},
"design_temperature": {"value": 373.15, "unit": "K"},
"material": "SA-516-70"
}
],
"process_data": {
"streams": {},
"operating_envelope": {}
},
"requirements": [],
"quality": {
"total_score": 78,
"missing_required": ["water_cut"],
"warnings": ["Composition sums to 0.97 — 3% unaccounted"]
}
}
The extraction result can be converted to NeqSim process JSON:
def extraction_to_neqsim_json(extraction_result):
"""Convert extraction result to NeqSim ProcessSystem.fromJson() format.
This bridges the document reader output to the process extraction pipeline.
"""
fluid_data = extraction_result.get("fluid_data", {})
compositions = fluid_data.get("compositions", {})
conditions = fluid_data.get("conditions", {})
# Pick the first/primary composition
comp_name = list(compositions.keys())[0] if compositions else None
if not comp_name:
return None
comp = compositions[comp_name]
result = {
"fluid": {
"model": "SRK",
"temperature": conditions.get("temperature", {}).get("value", 298.15),
"pressure": conditions.get("pressure", {}).get("value", 1.01325),
"mixingRule": "classic",
"components": comp["components"]
},
"process": [],
"autoRun": True
}
return result
When multiple documents describe the same system, merge data with priority rules:
def merge_extractions(extractions):
"""Merge multiple extraction results, respecting priority.
Args:
extractions: list of (extraction_result, priority) tuples,
sorted by priority (highest first)
Returns:
Merged extraction result
"""
merged = {
"sources": [],
"fluid_data": {"compositions": {}, "conditions": {}},
"equipment_data": [],
"process_data": {"streams": {}},
"requirements": [],
"conflicts": []
}
seen_tags = {} # tag -> source for conflict detection
for extraction, priority in sorted(extractions, key=lambda x: -x[1]):
source = extraction.get("source", {})
merged["sources"].append(source)
# Merge compositions (higher priority wins)
for name, comp in extraction.get("fluid_data", {}).get("compositions", {}).items():
if name not in merged["fluid_data"]["compositions"]:
merged["fluid_data"]["compositions"][name] = comp
# Merge equipment (detect conflicts on same tag)
for equip in extraction.get("equipment_data", []):
tag = equip.get("tag", "")
if tag in seen_tags:
merged["conflicts"].append({
"tag": tag,
"field": "equipment",
"source_1": seen_tags[tag],
"source_2": source.get("filename", "unknown")
})
else:
seen_tags[tag] = source.get("filename", "unknown")
merged["equipment_data"].append(equip)
# Merge requirements (accumulate all)
merged["requirements"].extend(extraction.get("requirements", []))
return merged
| Challenge | Solution |
|---|---|
| Merged cells in PDF tables | Use pdfplumber with custom table settings: table_settings={"snap_tolerance": 5} |
| Multi-line cell values | Join with space, then re-parse |
| Header row detection | Look for bold formatting or known keywords |
| Unicode issues (°, ², ³, µ) | Normalize with unicodedata.normalize("NFKD", text) |
| Tables spanning multiple pages | Detect continued tables by matching column count |
| Rotated text in PDFs | Use page.extract_text(layout=True) or OCR |
| Mixed number formats (1,000.5 vs 1.000,5) | Detect locale from document language |
| Empty/missing values | Use sentinel: None (not 0, not empty string) |
pdfplumber>=0.9.0 # PDF table extraction
python-docx>=0.8.11 # Word document reading
openpyxl>=3.1.0 # Excel reading
pandas>=1.5.0 # Tabular data handling
Optional (for advanced scenarios):
pymupdf>=1.22.0 # PREFERRED for PDF figure extraction (devtools/pdf_to_figures.py uses this)
pytesseract>=0.3.10 # OCR for scanned PDFs
pdf2image>=1.16.0 # Convert PDF pages to images for OCR
tabula-py>=2.7.0 # Alternative PDF table extraction (Java-based)
camelot-py>=0.11.0 # Another PDF table tool (good for bordered tables)