The standard Dark Horse document ingest pipeline — PDF/scan/HTML/video/social post → GCS → OCR → chunking → contextual prefix → self-hosted embedding → pgvector + FTS → claim extraction, with full provenance.
Use this flow for every document that enters the corpus: court filings, campaign finance PDFs, news articles, hearing transcripts, social posts, PFDs, contracts, anything. Scrapers should call ingestDocument() rather than reimplementing any stage themselves.
import { ingestDocument } from "@/lib/ingest/document-pipeline";

await ingestDocument({
  sourceUrl: string,       // canonical URL — required for provenance
  sourceSystem: string,    // e.g. "la_ethics" | "fec" | "courtlistener" | "nola_news"
  title?: string,
  publishedAt?: Date,
  buffer?: Buffer,         // for binary sources (PDF, image, audio, video)
  textContent?: string,    // for plain text (social posts, RSS articles)
  metadata?: Record<string, unknown>,
});
The pipeline returns the created (or existing) Document with all chunks + claims attached.
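For orientation, a hypothetical sketch of the resolved shape. The field names below are assumptions inferred from this description, not the actual Prisma schema:

```typescript
// Hypothetical result shape; the real types come from the Prisma schema.
interface IngestedChunk {
  index: number;
  text: string;
  contextualPrefix?: string; // prefix prepended before embedding
}

interface IngestedClaim {
  text: string;
  chunkIndex: number; // chunk the claim was extracted from
}

interface IngestedDocument {
  id: string;
  hash: string;           // sha256 content hash used for dedupe
  sourceUrl: string;
  archivedUrl?: string;   // Wayback snapshot, when archiving succeeded
  chunks: IngestedChunk[];
  claims: IngestedClaim[];
}

// A constructed example, purely illustrative:
const example: IngestedDocument = {
  id: "doc_1",
  hash: "abc123",
  sourceUrl: "https://example.com/filing.pdf",
  chunks: [{ index: 0, text: "First chunk of the filing." }],
  claims: [{ text: "An extracted claim.", chunkIndex: 0 }],
};
```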
lib/ingest/dedupe.ts computes sha256(buffer ?? textContent) and checks it against Document.hash. If a match exists, the existing row is returned and nothing is re-ingested.
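A minimal sketch of that hashing step, assuming Node's built-in crypto module. The helper name contentHash is illustrative, not the actual export of lib/ingest/dedupe.ts:

```typescript
import { createHash } from "node:crypto";

// Illustrative sketch of the dedupe hash: hash whichever payload is present.
// The real lib/ingest/dedupe.ts also looks the digest up in Document.hash.
export function contentHash(input: {
  buffer?: Buffer;
  textContent?: string;
}): string {
  // Mirrors `sha256(buffer ?? textContent)` from the description above.
  const payload = input.buffer ?? Buffer.from(input.textContent ?? "", "utf8");
  return createHash("sha256").update(payload).digest("hex");
}
```

Identical bytes always produce the same digest, so the same PDF scraped from a mirror URL still dedupes as long as the file itself is unchanged.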
lib/wayback.ts fires SavePageNow against sourceUrl and stores the result in Document.archivedUrl. This step runs even if the rest of the ingest fails — provenance first.
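A hedged sketch of the archiving call, using the public SavePageNow GET endpoint. The real lib/wayback.ts may use the authenticated SPN2 API instead, and savePageNowUrl / archiveUrl are illustrative names:

```typescript
// Illustrative names; the real module's exports may differ.
export function savePageNowUrl(sourceUrl: string): string {
  // The public SavePageNow endpoint takes the target URL appended to /save/.
  return `https://web.archive.org/save/${sourceUrl}`;
}

export async function archiveUrl(sourceUrl: string): Promise<string | null> {
  try {
    const res = await fetch(savePageNowUrl(sourceUrl), { redirect: "follow" });
    // On success the request redirects to the snapshot; res.url is that
    // archived URL, which the pipeline would store in Document.archivedUrl.
    return res.ok ? res.url : null;
  } catch {
    // Never let an archiving failure abort the ingest itself.
    return null;
  }
}
```

Returning null instead of throwing matches the "provenance first" rule above: archiving is attempted unconditionally, but its failure never blocks the rest of the pipeline.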
lib/gcs.ts writes the buffer (or text) to: