Preserve

Deterministic parsing only. Do not add any LLM calls here.
Prefer Unstructured for structure-first parsing, but preserve the current lightweight PDF fallback in parser.py so tests can run without the full inference stack.
Keep chunk_id in the format {doc_id}:{element_id}:{chunk_index} and locator in the format {doc_id}:page{page}.
Preserve the current normalized element types: title, heading, paragraph, list_item, table, figure, caption.
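The identifier formats and the element type list above can be pinned down in a few lines. This is a minimal sketch, not the code in parser.py: the helper names (make_chunk_id, make_locator, validate_element_type) and the ELEMENT_TYPES constant are hypothetical and used only for illustration.

```python
# Hypothetical helpers illustrating the formats described above.
# None of these names are taken from parser.py.

# The seven normalized element types listed in the notes.
ELEMENT_TYPES = frozenset(
    {"title", "heading", "paragraph", "list_item", "table", "figure", "caption"}
)


def make_chunk_id(doc_id: str, element_id: str, chunk_index: int) -> str:
    """Build a chunk_id as {doc_id}:{element_id}:{chunk_index}."""
    return f"{doc_id}:{element_id}:{chunk_index}"


def make_locator(doc_id: str, page: int) -> str:
    """Build a locator as {doc_id}:page{page}."""
    return f"{doc_id}:page{page}"


def validate_element_type(element_type: str) -> str:
    """Return the type unchanged if it is normalized, else raise ValueError."""
    if element_type not in ELEMENT_TYPES:
        raise ValueError(f"unknown element type: {element_type!r}")
    return element_type
```

A check like validate_element_type could run wherever a parser backend's output is mapped to the normalized types, so that a new backend cannot silently introduce an eighth type.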

PPTX Ingestion