Retrieves engineering documents (compressor curves, mechanical drawings, data sheets, vendor docs) from document management systems for use in NeqSim engineering tasks. Supports local directories, manual upload, and pluggable retrieval backends (e.g., stidapi for STID). USE WHEN: a task needs vendor performance data, mechanical drawings, or as-built documentation for process equipment.
Retrieve engineering documents (compressor curves, mechanical drawings, data sheets, vendor reports) for use in NeqSim task-solving workflows.
ALL downloaded documents — STID drawings, PI historian exports,
vendor datasheets, P&IDs, literature PDFs — MUST be saved to
step1_scope_and_research/references/ within the task folder.
NEVER download or save task-related files to workspace-level directories like
output/, figures/, or any path outside task_solve/YYYY-MM-DD_slug/.
```python
# CORRECT — saves inside the task folder:
TASK_DIR = "task_solve/YYYY-MM-DD_slug"
out_dir = os.path.join(TASK_DIR, "step1_scope_and_research", "references")

# WRONG — saves outside the task folder:
out_dir = os.path.join(os.path.dirname(__file__), "..", "figures", "stid_docs")  # NEVER
out_dir = "output/stid_docs"  # NEVER
```
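To make this rule hard to violate, a small path guard can reject any output directory that escapes the task folder. This is an illustrative sketch, not part of devtools; `safe_out_dir` and the placeholder `TASK_DIR` are assumptions:

```python
import os

TASK_DIR = "task_solve/YYYY-MM-DD_slug"  # placeholder task folder

def safe_out_dir(*parts):
    """Join parts under TASK_DIR and refuse any path that escapes it."""
    root = os.path.abspath(TASK_DIR)
    path = os.path.abspath(os.path.join(TASK_DIR, *parts))
    if not path.startswith(root + os.sep):
        raise ValueError(f"Refusing to write outside the task folder: {path}")
    os.makedirs(path, exist_ok=True)
    return path

refs = safe_out_dir("step1_scope_and_research", "references")  # OK
# safe_out_dir("..", "output") would raise ValueError
```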
For PDF-to-PNG conversion, write the output to the task's `figures/` directory:

```bash
python devtools/pdf_to_figures.py task_solve/YYYY-MM-DD_slug/step1_scope_and_research/references/ \
  --outdir task_solve/YYYY-MM-DD_slug/figures/
```
This rule ensures every task is self-contained and portable.
This skill is backend-agnostic — it works with any document source. The task solver checks these sources in order:

1. Local files already placed in `step1_scope_and_research/references/`
2. A retrieval backend configured via `devtools/doc_retrieval_config.yaml` (this file is gitignored)

This means the workflow works for everyone: drop documents into `references/` and the same pipeline runs.

Place documents in the task's references folder:
```
task_solve/YYYY-MM-DD_task_slug/
└── step1_scope_and_research/
    └── references/
        ├── compressor_curves.pdf
        ├── mechanical_drawing.pdf
        └── equipment_datasheet.pdf
```
Or point to an existing directory when creating the task:

```bash
python devtools/new_task.py "compressor analysis" --type B \
  --refs-dir "/path/to/existing/docs"
```
The task solver will automatically:

- pick up documents placed in `references/`
- convert PDFs to PNG figures (`devtools/pdf_to_figures.py`)
- read the extracted images with `view_image`

When a retrieval backend is configured via `devtools/doc_retrieval_config.yaml`
(gitignored — never committed), the task solver can auto-fetch documents
by equipment tag. See the config template below for setup instructions.
Use devtools/stid_download.py to download STID documents directly into a
task folder. This ensures all documents end up in the right place:
```bash
# Download documents by tag — saves to task's references/ folder
python devtools/stid_download.py --task-dir task_solve/2026-04-16_my_task \
  --inst MYINST --tags 30PT0001 30PT0002 33AI0001

# Download + convert to PNG for AI analysis
python devtools/stid_download.py --task-dir task_solve/2026-04-16_my_task \
  --inst MYINST --tags 30PT0001 --convert-png

# Download specific document numbers
python devtools/stid_download.py --task-dir task_solve/2026-04-16_my_task \
  --inst MYINST --docs E001-AS-P-XB-00001-01 E001-AS-BI000-DS-00001
```
The helper:

- saves documents to `step1_scope_and_research/references/` inside the task folder
- writes `stid_retrieval_manifest.json` for traceability
- with `--convert-png`, renders pages into the task's `figures/` directory

```python
# Generic retrieval interface used by the task solver:
from devtools.doc_retriever import retrieve_documents

docs = retrieve_documents(
    tags=['35-KA001A'],
    doc_types=['CE', 'AA', 'MD', 'DS'],
    output_dir='step1_scope_and_research/references/'
)
# Returns list of downloaded file paths, or [] if no backend configured
```
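The internals of `devtools.doc_retriever` are not shown here, but a pluggable backend can be as simple as an object with a `fetch` method. The following `LocalDirBackend` is a hypothetical sketch of such a backend, not the real API:

```python
# Hypothetical sketch of a pluggable backend behind retrieve_documents().
# The real devtools.doc_retriever API may differ; names here are illustrative.
import os
import shutil

class LocalDirBackend:
    """Simplest possible backend: copy tag-matching files from a local directory."""

    def __init__(self, source_dir):
        self.source_dir = source_dir

    def fetch(self, tags, doc_types, output_dir):
        os.makedirs(output_dir, exist_ok=True)
        downloaded = []
        for name in sorted(os.listdir(self.source_dir)):
            # Keep any file whose name mentions one of the requested tags
            if any(tag in name for tag in tags):
                dst = os.path.join(output_dir, name)
                shutil.copy(os.path.join(self.source_dir, name), dst)
                downloaded.append(dst)
        return downloaded  # [] when nothing matches, like retrieve_documents
```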
| Code | Type | When Relevant |
|---|---|---|
| CE | Performance Curves / Calculations | Compressor, pump, turbine analysis |
| DS | Data Sheet | Any equipment analysis |
| AA | General Arrangement Drawing | Physical layout, sizing |
| MD | Mechanical Drawing | Detailed dimensions, nozzles |
| RV | Vendor Manual / Report | Operating procedures, maintenance |
| RE | Report | Background reference |
| ER | Assembly / Erection Drawing | Installation, coupling details |
| PL | Parts List | Spare parts, BOM |
| PI | P&ID | Process topology |
| PF | PFD | Process flow overview |
| IN | Instrument Data Sheet | Control system design |
| SP | Specification | Material/piping requirements |
The task solver filters documents by relevance to avoid wasting time on irrelevant content. Only documents above the relevance threshold are extracted and analyzed:
```python
DOC_RELEVANCE = {
    'compressor_analysis': {
        'CE': 1.0,  # Performance curves — essential
        'DS': 0.9,  # Data sheet — essential
        'AA': 0.7,  # General arrangement — useful
        'MD': 0.6,  # Mechanical drawing — useful
        'ER': 0.6,  # Assembly drawing — useful
        'RV': 0.5,  # Vendor manual — background
        'RE': 0.4,  # Report — background
        'PL': 0.2,  # Parts list — skip
        'SP': 0.3,  # Specification — skip
    },
    'heat_exchanger_analysis': {
        'DS': 1.0, 'CE': 0.9, 'AA': 0.7, 'MD': 0.6, 'RV': 0.5,
    },
    'separator_analysis': {
        'DS': 1.0, 'AA': 0.9, 'PI': 0.8, 'MD': 0.6, 'IN': 0.7,
    },
    'pipeline_design': {
        'DS': 1.0, 'SP': 0.9, 'CE': 0.7, 'MD': 0.6,
    },
    'general': {
        'DS': 1.0, 'CE': 0.9, 'AA': 0.7, 'PI': 0.7, 'MD': 0.6,
        'RV': 0.5, 'RE': 0.4, 'ER': 0.4, 'IN': 0.5, 'SP': 0.4,
        'PL': 0.2, 'PF': 0.6,
    },
}


def filter_relevant_docs(doc_list, task_type, min_relevance=0.5):
    """Filter documents by relevance to the task type.

    Args:
        doc_list: List of dicts with at least 'docType' or 'doc_type' key
        task_type: One of the keys in DOC_RELEVANCE
        min_relevance: Minimum score to keep (default 0.5)

    Returns:
        (relevant, filtered_out) — two lists
    """
    relevance_map = DOC_RELEVANCE.get(task_type, DOC_RELEVANCE['general'])
    relevant, filtered_out = [], []
    for doc in doc_list:
        dtype = doc.get('docType') or doc.get('doc_type', '')
        score = relevance_map.get(dtype, 0.0)
        if score >= min_relevance:
            relevant.append({**doc, '_relevance': score})
        else:
            filtered_out.append({**doc, '_relevance': score,
                                 '_reason': f'Below threshold ({score} < {min_relevance})'})
    return relevant, filtered_out
```
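A quick, self-contained illustration of the threshold behavior (the relevance map and filter logic are re-declared here in trimmed form so the snippet runs on its own):

```python
# Self-contained mini-demo of the relevance filter (trimmed map, same logic).
RELEVANCE = {'CE': 0.9, 'DS': 1.0, 'PL': 0.2}

def keep_relevant(doc_list, min_relevance=0.5):
    relevant, filtered_out = [], []
    for doc in doc_list:
        score = RELEVANCE.get(doc.get('doc_type', ''), 0.0)
        bucket = relevant if score >= min_relevance else filtered_out
        bucket.append({**doc, '_relevance': score})
    return relevant, filtered_out

docs = [{'doc_type': 'CE', 'filename': 'curves.pdf'},
        {'doc_type': 'PL', 'filename': 'parts_list.pdf'}]
keep, skip = keep_relevant(docs)
# keep: curves.pdf (0.9); skip: parts_list.pdf (0.2)
```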
After documents are in references/, convert to images for AI analysis:
```python
import os

import fitz  # pymupdf


def pdf_to_pngs(pdf_path, output_dir, dpi=200):
    """Convert PDF pages to numbered PNG images."""
    doc = fitz.open(pdf_path)
    base = os.path.splitext(os.path.basename(pdf_path))[0]
    paths = []
    for i, page in enumerate(doc):
        pix = page.get_pixmap(dpi=dpi)
        out = os.path.join(output_dir, f"{base}_page{i+1}.png")
        pix.save(out)
        paths.append(out)
    doc.close()
    return paths
```
Or use the built-in utility:

```bash
python devtools/pdf_to_figures.py step1_scope_and_research/references/ --outdir figures/
```

Then use `view_image` on extracted PNGs to read compressor curves,
mechanical drawings, and data sheets.
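Note that plain lexicographic sorting puts `_page10.png` before `_page2.png`, so pages can be viewed out of order. A small natural-sort key (an illustrative helper, not part of devtools) keeps the pages in reading order:

```python
import re

def natural_key(name):
    """Sort 'x_page10.png' after 'x_page2.png' by comparing digit runs numerically."""
    return [int(t) if t.isdigit() else t for t in re.split(r"(\d+)", name)]

pngs = ["doc_page10.png", "doc_page2.png", "doc_page1.png"]
assert sorted(pngs, key=natural_key) == ["doc_page1.png", "doc_page2.png", "doc_page10.png"]
```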
After retrieval/classification, create a manifest for traceability:
```python
manifest = {
    "source": "local",  # or "backend" / "manual"
    "retrieval_date": "2026-04-16",
    "task_type": "compressor_analysis",
    "tags_searched": ["35-KA001A", "35-KA001B"],
    "documents_retrieved": [
        {
            "filename": "performance_curves.pdf",
            "doc_type": "CE",
            "title": "Performance Curves Compressor B",
            "relevance": 1.0,
            "pages": 41,
            "used_in_analysis": True
        }
    ],
    "documents_filtered_out": [
        {
            "filename": "parts_list.pdf",
            "doc_type": "PL",
            "title": "Spare Parts List",
            "relevance": 0.2,
            "reason": "Below relevance threshold (0.5)"
        }
    ]
}
# Save as step1_scope_and_research/retrieval_manifest.json
```
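Writing the manifest is a plain `json.dump`; a minimal sketch (manifest abbreviated here, directory layout as described above):

```python
import json
import os

manifest = {"source": "local", "retrieval_date": "2026-04-16",
            "documents_retrieved": [], "documents_filtered_out": []}

out_dir = "step1_scope_and_research"
os.makedirs(out_dir, exist_ok=True)
path = os.path.join(out_dir, "retrieval_manifest.json")
with open(path, "w") as f:
    json.dump(manifest, f, indent=2)
```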
The task solver uses this manifest to record which documents informed the analysis and to report its data sources, for example:
## Data Sources
- **Equipment tags:** 35-KA001A, 35-KA001B (export compressors)
- **Document source:** Local directory / Auto-retrieval / User-provided
- **Key documents used:**
- performance_curves.pdf: Vendor performance maps (41 pages)
- as_built_curves.pdf: Shop test results (4 pages)
- general_arrangement.pdf: GA drawing with dimensions
- **Documents filtered out:** 8 (parts lists, generic specs — below relevance)
```python
import json
from pathlib import Path

TASK_DIR = Path("task_solve/YYYY-MM-DD_slug")

# Load retrieval manifest to know what's available
manifest_path = TASK_DIR / 'step1_scope_and_research' / 'retrieval_manifest.json'
if manifest_path.exists():
    with open(manifest_path) as f:
        manifest = json.load(f)

    # Work only with relevant documents
    curve_docs = [d for d in manifest['documents_retrieved']
                  if d['doc_type'] == 'CE' and d['used_in_analysis']]
    print(f"Analyzing {len(curve_docs)} performance curve documents")
```
```json
{
  "data_sources": {
    "retrieval_method": "local",
    "documents_retrieved": 13,
    "documents_analyzed": 5,
    "documents_filtered_out": 8,
    "key_documents": [
      "performance_curves.pdf — Vendor Performance Maps",
      "as_built_curves.pdf — Shop Test Results"
    ]
  }
}
```
```python
from neqsim import jneqsim

# Create compressor with performance curves from extracted data
compressor = jneqsim.process.equipment.compressor.Compressor("Export Comp", feed)

# If curve data has been digitized from the images:
chart = compressor.getCompressorChart()
chart.setHeadUnit("kJ/kg")
chart.setUseCompressorChart(True)

# Add speed curves (extracted from the performance map);
# curve_data maps speed -> [(flow, head, efficiency), ...]
for speed, points in curve_data.items():
    curve = jneqsim.process.equipment.compressor.CompressorCurve(speed)
    for flow, head, eff in points:
        curve.addCurveDataPoint(flow, head, eff)
    chart.addCurve(curve)
```
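The `curve_data` structure assumed above is not defined by NeqSim itself. One plausible shape, with made-up placeholder numbers, plus a cheap sanity check that head falls as flow rises at each speed (a common sign of digitization errors when it does not):

```python
# Illustrative digitized curve data: speed [rpm] -> [(flow, head, efficiency)].
# All numbers are made-up placeholders, not vendor data.
curve_data = {
    9000.0: [(4000.0, 95.0, 0.74), (5000.0, 88.0, 0.78), (6000.0, 76.0, 0.75)],
    10500.0: [(4500.0, 120.0, 0.73), (5500.0, 110.0, 0.77), (6500.0, 96.0, 0.74)],
}

def heads_decrease_with_flow(points):
    """Digitization sanity check: head should fall monotonically as flow rises."""
    heads = [head for _, head, _ in sorted(points)]
    return all(a > b for a, b in zip(heads, heads[1:]))

assert all(heads_decrease_with_flow(pts) for pts in curve_data.values())
```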
Users can always add documents manually to references/, even when a
retrieval backend is configured. The two approaches coexist:
```
step1_scope_and_research/references/
├── [auto-retrieved] performance_curves_35KA001A.pdf   (from backend)
├── [auto-retrieved] datasheet_35KA001A.pdf            (from backend)
├── [manual] vendor_email_attachment.pdf               (user dropped in)
├── [manual] field_test_report_2025.xlsx               (user dropped in)
└── [manual] photo_nameplate.jpg                       (user dropped in)
```
The retrieval manifest tracks the source of each document:
```json
{
  "documents_retrieved": [
    {"filename": "performance_curves.pdf", "source": "backend", "doc_type": "CE"},
    {"filename": "vendor_email_attachment.pdf", "source": "manual", "doc_type": "RE"},
    {"filename": "field_test_report_2025.xlsx", "source": "manual", "doc_type": "DS"}
  ]
}
```
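One way to pick up manually added files is to diff the folder against the manifest; `find_unregistered` is an illustrative helper, not an existing devtools function:

```python
import os

def find_unregistered(ref_dir, manifest):
    """Files present in ref_dir that the manifest does not yet track (manual drops)."""
    known = {d["filename"] for d in manifest.get("documents_retrieved", [])}
    return sorted(f for f in os.listdir(ref_dir) if f not in known)
```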
Rules:

- Users can add documents to `references/` at any time; the agent should re-scan the folder if it detects new files

The initial retrieval in Step 1 may not cover everything. During Step 2 (analysis), the agent may discover it needs additional documents — for example:
When the agent identifies a data gap during analysis, it follows this