The standard Dark Horse document ingest pipeline — PDF/scan/HTML/video/social post → GCS → OCR → chunking → contextual prefix → self-hosted embedding → pgvector + FTS → claim extraction, with full provenance.
Use this flow for every document that enters the corpus: court filings, campaign finance PDFs, news articles, hearing transcripts, social posts, PFDs, contracts, anything. Scrapers should call ingestDocument() rather than reimplementing any stage themselves.
import { ingestDocument } from "@/lib/ingest/document-pipeline";

await ingestDocument({
  sourceUrl: string,       // canonical URL — required for provenance
  sourceSystem: string,    // e.g. "la_ethics" | "fec" | "courtlistener" | "nola_news"
  title?: string,
  publishedAt?: Date,
  buffer?: Buffer,         // for binary sources (PDF, image, audio, video)
  textContent?: string,    // for plain text (social posts, RSS articles)
  metadata?: Record<string, unknown>,
});
The pipeline returns the created (or existing) Document with all chunks + claims attached.
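For orientation, a hypothetical sketch of the resolved shape. The field names below are assumptions inferred from this description, not the actual Prisma schema:

```typescript
// Hypothetical result shape; the real types come from the Prisma schema.
interface IngestedChunk {
  index: number;
  text: string;
  contextualPrefix?: string; // prefix prepended before embedding
}

interface IngestedClaim {
  text: string;
  chunkIndex: number; // chunk the claim was extracted from
}

interface IngestedDocument {
  id: string;
  hash: string;           // sha256 content hash used for dedupe
  sourceUrl: string;
  archivedUrl?: string;   // Wayback snapshot, when archiving succeeded
  chunks: IngestedChunk[];
  claims: IngestedClaim[];
}

// A constructed example, purely illustrative:
const example: IngestedDocument = {
  id: "doc_1",
  hash: "abc123",
  sourceUrl: "https://example.com/filing.pdf",
  chunks: [{ index: 0, text: "First chunk of the filing." }],
  claims: [{ text: "An extracted claim.", chunkIndex: 0 }],
};
```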
lib/ingest/dedupe.ts computes sha256(buffer ?? textContent) and checks it against Document.hash. If a match exists, the existing row is returned and nothing is re-ingested.
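A minimal sketch of that hashing step, assuming Node's built-in crypto module. The helper name contentHash is illustrative, not the actual export of lib/ingest/dedupe.ts:

```typescript
import { createHash } from "node:crypto";

// Illustrative sketch of the dedupe hash: hash whichever payload is present.
// The real lib/ingest/dedupe.ts also looks the digest up in Document.hash.
export function contentHash(input: {
  buffer?: Buffer;
  textContent?: string;
}): string {
  // Mirrors `sha256(buffer ?? textContent)` from the description above.
  const payload = input.buffer ?? Buffer.from(input.textContent ?? "", "utf8");
  return createHash("sha256").update(payload).digest("hex");
}
```

Identical bytes always produce the same digest, so the same PDF scraped from a mirror URL still dedupes as long as the file itself is unchanged.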
lib/wayback.ts fires SavePageNow against sourceUrl and stores the result in Document.archivedUrl. This step runs even if the rest of the ingest fails — provenance first.
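A hedged sketch of the archiving call, using the public SavePageNow GET endpoint. The real lib/wayback.ts may use the authenticated SPN2 API instead, and savePageNowUrl / archiveUrl are illustrative names:

```typescript
// Illustrative names; the real module's exports may differ.
export function savePageNowUrl(sourceUrl: string): string {
  // The public SavePageNow endpoint takes the target URL appended to /save/.
  return `https://web.archive.org/save/${sourceUrl}`;
}

export async function archiveUrl(sourceUrl: string): Promise<string | null> {
  try {
    const res = await fetch(savePageNowUrl(sourceUrl), { redirect: "follow" });
    // On success the request redirects to the snapshot; res.url is that
    // archived URL, which the pipeline would store in Document.archivedUrl.
    return res.ok ? res.url : null;
  } catch {
    // Never let an archiving failure abort the ingest itself.
    return null;
  }
}
```

Returning null instead of throwing matches the "provenance first" rule above: archiving is attempted unconditionally, but its failure never blocks the rest of the pipeline.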
lib/gcs.ts writes the buffer (or text) to: