Systematic workflow for finding, downloading, and indexing engineering literature by domain. Covers the full lifecycle: discovery via standards ledger and doc index, web search for open-access PDFs, download script generation, PDF validation, catalogue YAML creation, and handoff to the 7-phase document-index-pipeline for indexing. Use when populating a new engineering domain with reference literature or when a WRK item requires domain-specific standards and textbooks.
Full lifecycle: Discover → Search → Download → Validate → Catalogue → Index
Trigger this skill when:
- populating a new engineering domain with reference literature
- a WRK item requires domain-specific standards and textbooks
| Domain | Literature Path | Key Standards Bodies |
|---|---|---|
| cathodic_protection | /mnt/ace-data/digitalmodel/docs/domains/cathodic_protection/literature/ | DNV, NACE, ISO |
| geotechnical | /mnt/ace-data/digitalmodel/docs/domains/geotechnical/literature/ | API, DNV, ISO |
| hydrodynamics | /mnt/ace-data/digitalmodel/docs/domains/hydrodynamics/literature/ | DNV, ITTC, SNAME |
| naval_architecture | /mnt/ace-data/digitalmodel/docs/domains/naval_architecture/literature/ | ABS, DNV, SNAME, IMO |
| pipeline | /mnt/ace-data/digitalmodel/docs/domains/pipeline/literature/ | DNV, API, ASME, BSEE |
| structural | /mnt/ace-data/digitalmodel/docs/domains/structural/literature/ | AISC, DNV, IIW, API |
| structural-parachute | /mnt/ace-data/digitalmodel/docs/domains/structural-parachute/literature/ | NHRA, SFI, NASA, AISC |
| subsea | /mnt/ace-data/digitalmodel/docs/domains/subsea/literature/ | API, DNV, BSEE |
| metocean | /mnt/ace-data/digitalmodel/docs/domains/metocean/literature/ | DNV, API, ISO, WMO |
Some domain categories map to a different target repo:

| Domain | Typical Target Repo |
|---|---|
| catenary | digitalmodel |
| mooring | digitalmodel |
| risers | digitalmodel |
| drilling | OGManufacturing |
| bsee | worldenergydata |
| economics | worldenergydata |
The og_standards corpus lives at /mnt/ace/docs/_standards/ organized by org:
ABS, API, ASTM, BSI, DNV, ISO, MIL, NEMA, Norsok, OnePetro, Unknown.
Inventory DB: /mnt/ace/O&G-Standards/_inventory.db (SQLite, 6.8 GB).
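The inventory DB's schema is not documented here, so any query has to start by discovering it. This sketch (table and column names are hypothetical; the demo runs against an in-memory stand-in rather than the real 6.8 GB file) shows the probe-then-search pattern:

```python
import sqlite3

def list_tables(con):
    """Discover the schema before assuming table or column names."""
    rows = con.execute(
        "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
    ).fetchall()
    return [r[0] for r in rows]

def search_column(con, table, column, term):
    """Substring match on one column (SQLite LIKE is ASCII case-insensitive)."""
    sql = f"SELECT * FROM {table} WHERE {column} LIKE ?"
    return con.execute(sql, (f"%{term}%",)).fetchall()

if __name__ == "__main__":
    # In practice: con = sqlite3.connect('/mnt/ace/O&G-Standards/_inventory.db')
    # The demo below uses an in-memory stand-in with a made-up schema.
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE documents (org TEXT, title TEXT, path TEXT)")
    con.execute(
        "INSERT INTO documents VALUES "
        "('DNV', 'DNV-RP-F105 Free Spanning Pipelines', '/demo/f105.pdf')"
    )
    print(list_tables(con))
    print(len(search_column(con, "documents", "title", "f105")))
```

Run `list_tables` against the real DB first, then point `search_column` at whatever title/path column the actual schema exposes.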
```bash
DOMAIN="geotechnical"   # ← set your domain
LIT_DIR="/mnt/ace-data/digitalmodel/docs/domains/${DOMAIN}/literature"
mkdir -p "${LIT_DIR}"
ls -la "${LIT_DIR}"
```
If the domain is new, create standard subdirectories:
```bash
mkdir -p "${LIT_DIR}"/{textbooks,standards,course-notes,worked-examples}
```
Find standards already tracked for this domain:
```bash
uv run --no-project python scripts/data/document-index/query-ledger.py \
    --domain ${DOMAIN} --verbose
```
Record each standard's status: gap, done, wrk_captured, reference.
Standards with gap or wrk_captured are download candidates.
Search the 1M+ record index for existing documents in this domain:
```bash
uv run --no-project python -c "
import json
from collections import Counter

matches = []
with open('data/document-index/index.jsonl') as f:
    for line in f:
        rec = json.loads(line)
        path_lower = rec.get('path', '').lower()
        summary_lower = (rec.get('summary') or '').lower()
        if '${DOMAIN}' in path_lower or '${DOMAIN}' in summary_lower:
            matches.append(rec)

print(f'Found {len(matches)} documents')
by_source = Counter(r['source'] for r in matches)
for s, c in by_source.most_common():
    print(f'  {s}: {c}')
"
```
Prioritize og_standards and ace_standards sources — these are already local.
Check what calculations exist vs. gaps in the target repo:
```bash
uv run --no-project python -c "
import yaml

with open('specs/capability-map/digitalmodel.yaml') as f:
    data = yaml.safe_load(f)

for m in data['modules']:
    if '${DOMAIN}' in m['module'].lower():
        print(f\"Module: {m['module']} ({m.get('standards_count', '?')} standards)\")
        for s in m.get('standards', [])[:30]:
            print(f\"  {s['status']:15s} {s['org']:8s} {s['id'][:70]}\")
"
```
Search for freely available PDFs across these source tiers:
Tier 1 — High-value free sources:
Tier 2 — Conference/journal open access:
Tier 3 — Textbooks and course notes:
WAF/paywall notes:
| Site | Issue | Action |
|---|---|---|
| eagle.org (ABS) | Cloudflare WAF blocks wget/curl | Add to pending_manual |
| archive.org borrow | HTTP 403 for borrow-only items | Add to pending_manual |
| IEEE Xplore | Paywalled unless institutional login | Skip or pending_manual |
| ASME Digital Collection | Paywall | Check og_standards DB |
Option A: Use the research-domain.py driver (queries all data sources, generates brief + script):
```bash
uv run --no-project python scripts/data/research-literature/research-domain.py \
    --category ${DOMAIN} --repo digitalmodel --generate-download-script
```
Option B: Manual script creation from template:
```bash
#!/usr/bin/env bash
# ABOUTME: Download open-access ${DOMAIN} literature
# Usage: bash download-literature.sh [--dry-run]
set -uo pipefail

DEST="/mnt/ace-data/digitalmodel/docs/domains/${DOMAIN}/literature"
LOG_DIR="$(git rev-parse --show-toplevel)/.claude/work-queue/assets"
LOG_FILE="${LOG_DIR}/download-${DOMAIN}.log"
DRY_RUN=false
[[ "${1:-}" == "--dry-run" ]] && DRY_RUN=true

mkdir -p "${DEST}"/{textbooks,standards,course-notes,worked-examples}
mkdir -p "${LOG_DIR}"

# shellcheck source=scripts/lib/download-helpers.sh
source "$(git rev-parse --show-toplevel)/scripts/lib/download-helpers.sh"

log "=== ${DOMAIN} Literature Download ==="
log "Destination: ${DEST}"
log "Dry run: ${DRY_RUN}"

# ─── TEXTBOOKS ────────────────────────────────
log "--- Textbooks ---"
download \
  "https://example.org/textbook.pdf" \
  "${DEST}/textbooks" \
  "Author-Year-Short-Title.pdf"

# ─── STANDARDS ────────────────────────────────
log "--- Standards ---"
download \
  "https://rules.dnv.com/docs/pdf/dnvpm/codes/docs/..." \
  "${DEST}/standards" \
  "DNV-RP-XXXX-Title-Year.pdf" || true

log "=== Download complete ==="
total=$(find "${DEST}" -name "*.pdf" | wc -l)
log "    Total PDFs: ${total}"
```
Save script to: /mnt/ace-data/digitalmodel/docs/domains/${DOMAIN}/literature/download-literature.sh
Key script patterns:
- `source scripts/lib/download-helpers.sh` for the `download` and `log` functions
- `set -uo pipefail` (NOT `set -e`): download failures should log, not abort
- Guard fallible downloads with `|| true` or `|| log "NOTE: ..."`
- Filename convention: `Author-Year-Short-Title.pdf`
- The `download` function auto-skips existing files (resume-safe)
- Run with `--dry-run` first to preview

```bash
# Dry run first
bash download-literature.sh --dry-run
# Execute
bash download-literature.sh
```
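The helpers sourced by the template live in `scripts/lib/download-helpers.sh`, whose contents are not shown here. A minimal sketch of what a `log` and a resume-safe, PDF-validating `download` might look like (the real implementation may differ; the curl flags and magic-byte check are assumptions):

```shell
# Hypothetical sketch of scripts/lib/download-helpers.sh; the real file may differ.
log() {
  printf '%s %s\n' "$(date -u +%FT%TZ)" "$*" | tee -a "${LOG_FILE:-/dev/null}"
}

download() {
  # download URL DEST_DIR FILENAME: skip existing, honor DRY_RUN, verify magic bytes.
  url="$1"; dest_dir="$2"; name="$3"
  out="${dest_dir}/${name}"
  if [ -e "$out" ]; then
    log "skip (exists): $name"
    return 0
  fi
  if [ "${DRY_RUN:-false}" = true ]; then
    log "dry-run: $url -> $out"
    return 0
  fi
  curl -fsSL --retry 3 -A "Mozilla/5.0" -o "$out" "$url" || { log "FAIL: $url"; return 1; }
  # A real PDF starts with the magic bytes %PDF; anything else is a WAF/HTML page.
  if ! head -c 4 "$out" | grep -q '%PDF'; then
    log "NOT A PDF (likely WAF/HTML): $name"
    mkdir -p "${dest_dir}/_failed"
    mv "$out" "${dest_dir}/_failed/${name}"
    return 1
  fi
  log "ok: $name"
}
```

The skip-if-exists branch is what makes reruns resume-safe, and the dry-run branch is why previewing before executing costs nothing.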
```bash
# Validate all PDFs are real PDFs (not HTML/WAF responses)
find "${LIT_DIR}" -name "*.pdf" -exec file {} \; | grep -v "PDF document"
```
Any file that `file` reports as "HTML document" or "ASCII text" instead of "PDF document"
is a WAF response, not a real PDF. Move it to a `_failed/` directory and add it to `pending_manual`.
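The quarantine step can be automated by checking magic bytes directly. This is a hypothetical helper (flat directory only; combine with `find` for the nested `textbooks/`, `standards/`, etc. tree):

```shell
quarantine_non_pdfs() {
  # Move any top-level "*.pdf" whose content is not a real PDF into _failed/.
  lit_dir="$1"
  mkdir -p "${lit_dir}/_failed"
  for f in "${lit_dir}"/*.pdf; do
    [ -e "$f" ] || continue
    # Real PDFs start with the magic bytes %PDF.
    if ! head -c 4 "$f" | grep -q '%PDF'; then
      echo "quarantine: $f"
      mv "$f" "${lit_dir}/_failed/"
    fi
  done
}
```

Usage: `quarantine_non_pdfs "${LIT_DIR}/standards"`, then log each quarantined file under `pending_manual`.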
Write knowledge/seeds/${DOMAIN}-resources.yaml:
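The exact seed schema is not specified in this section, so the sketch below is a hypothetical illustration of what the catalogue YAML might contain; every field name is an assumption to be replaced with the pipeline's actual schema:

```yaml
# Hypothetical structure: field names are assumptions, not the pipeline's real schema.
domain: geotechnical
literature_path: /mnt/ace-data/digitalmodel/docs/domains/geotechnical/literature/
resources:
  - title: Example Textbook Title
    category: textbooks          # textbooks | standards | course-notes | worked-examples
    file: textbooks/Author-Year-Short-Title.pdf
    status: downloaded           # downloaded | pending_manual | failed
  - title: DNV-RP-XXXX Example Standard
    category: standards
    status: pending_manual
    note: Cloudflare WAF blocks automated download
pending_manual: []
```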