extraction pipeline patterns
Kreuzberg's format detection -> extraction -> fallback orchestration for 75+ file formats
The extraction pipeline (crates/kreuzberg/src/core/pipeline.rs, crates/kreuzberg/src/extraction/) orchestrates format detection, extractor routing, fallback handling, and post-processing (core/pipeline.rs).
Location: crates/kreuzberg/src/core/mime.rs, crates/kreuzberg/src/core/formats.rs
Pattern: detect via magic bytes, validate extension alignment (prevent spoofing), route to extractor. Multiple extractors for same format -> choose highest confidence/specificity.
```rust
// Pseudocode: core/mime.rs
match (magic_bytes(content), extension) {
    (Some(fmt), Some(ext)) if aligned -> Ok(fmt),
    (Some(fmt), Some(ext)) if misaligned -> Err(FormatMismatch),
    (Some(fmt), None) -> Ok(fmt), // magic bytes only
    (None, Some(ext)) -> Ok(from_extension(ext)),
    _ -> Err(UnknownFormat),
}
```
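As a runnable counterpart to the pseudocode, here is a minimal sketch of the detection rule. The `Format` and `DetectError` types, the three signatures checked, and the strict `from_extension(ext) == Some(fmt)` alignment test are assumptions made for illustration, not the definitions in core/mime.rs. In particular, container formats such as DOCX (a ZIP archive) would need a looser alignment rule than the one shown.

```rust
/// Hypothetical subset of formats, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Format {
    Pdf,
    Png,
    Zip,
    Text,
}

#[derive(Debug, PartialEq, Eq)]
pub enum DetectError {
    /// Content signature and file extension disagree (possible spoofing).
    FormatMismatch { detected: Format, extension: String },
    UnknownFormat,
}

/// Identify a format from well-known leading byte signatures.
fn magic_bytes(content: &[u8]) -> Option<Format> {
    if content.starts_with(b"%PDF-") {
        Some(Format::Pdf)
    } else if content.starts_with(b"\x89PNG\r\n\x1a\n") {
        Some(Format::Png)
    } else if content.starts_with(b"PK\x03\x04") {
        Some(Format::Zip)
    } else {
        None
    }
}

/// Map a file extension (case-insensitive) to a format.
fn from_extension(ext: &str) -> Option<Format> {
    match ext.to_ascii_lowercase().as_str() {
        "pdf" => Some(Format::Pdf),
        "png" => Some(Format::Png),
        "zip" => Some(Format::Zip),
        "txt" => Some(Format::Text),
        _ => None,
    }
}

/// Prefer magic bytes, reject a conflicting extension, and fall back to
/// the extension only when no signature is recognised.
pub fn detect_format(content: &[u8], extension: Option<&str>) -> Result<Format, DetectError> {
    match (magic_bytes(content), extension) {
        (Some(fmt), Some(ext)) if from_extension(ext) == Some(fmt) => Ok(fmt),
        (Some(fmt), Some(ext)) => Err(DetectError::FormatMismatch {
            detected: fmt,
            extension: ext.to_string(),
        }),
        (Some(fmt), None) => Ok(fmt),
        (None, Some(ext)) => from_extension(ext).ok_or(DetectError::UnknownFormat),
        (None, None) => Err(DetectError::UnknownFormat),
    }
}
```

In this sketch an unrecognised extension with no matching signature yields `UnknownFormat` rather than a guessed format, which keeps the fallback branch from silently accepting arbitrary input.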
| Category | Extractors | Key Modules |
|---|---|---|
| Office | DOCX, XLSX, XLSM, XLSB, XLS, PPTX, ODP, ODS | extraction/{docx,excel,pptx}.rs |
| PDF | Standard + encrypted, password attempts | pdf/ subdirectory (13 files) |
| Images | PNG, JPG, TIFF, WebP, JP2, SVG (OCR-enabled) | extraction/image.rs + ocr/ |
| Web | HTML, XHTML, XML, SVG (DOM parsing) | extraction/html.rs (67KB - complex table handling) |
| Email | EML, MSG (headers, body, attachments, threading) | extraction/email.rs |
| Archives | ZIP, TAR, GZ, 7Z (recursive extraction) | extraction/archive.rs (31KB) |
| Markdown | MD, TXT, RST, Org Mode, RTF | extraction/markdown.rs |
| Academic | LaTeX, BibTeX, JATS, Jupyter, DocBook | extraction/{structured,xml}.rs |
```rust
// Pseudocode: extraction/mod.rs
let format = detect_format(source.bytes, source.extension);
let result = match format {
    Pdf -> extract_pdf(source, config),
    Docx -> extract_docx(source, config),
    Image -> extract_image_with_ocr_fallback(source, config),
    Archive -> extract_archive_recursive(source, config),
    _ -> extract_with_plugin(format, source, config),
};
run_pipeline(result, config) // post-processing always runs
```
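The dispatch-then-post-process shape can be sketched in compilable form as follows. Everything here is a stand-in: `Extraction`, `PostProcessor`, `stub_extract`, and `trim_whitespace` are hypothetical names, and the stub extractors only decode bytes as text. The point is that every successful branch funnels through the same `run_pipeline` call.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Format {
    Pdf,
    Docx,
    Image,
    Archive,
    Unknown,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Extraction {
    pub content: String,
    /// Name of the extractor that produced the content.
    pub extractor: &'static str,
}

/// A post-processing step (validator, hook, cleanup) in this sketch.
pub type PostProcessor = fn(Extraction) -> Extraction;

/// Placeholder for a real extractor: decodes the bytes as lossy UTF-8.
fn stub_extract(extractor: &'static str, bytes: &[u8]) -> Result<Extraction, String> {
    Ok(Extraction {
        content: String::from_utf8_lossy(bytes).into_owned(),
        extractor,
    })
}

/// Example post-processor: strip leading and trailing whitespace.
pub fn trim_whitespace(mut extraction: Extraction) -> Extraction {
    extraction.content = extraction.content.trim().to_string();
    extraction
}

/// Apply every post-processing step in order.
pub fn run_pipeline(mut extraction: Extraction, steps: &[PostProcessor]) -> Extraction {
    for step in steps {
        extraction = step(extraction);
    }
    extraction
}

/// Route to an extractor by format, then run post-processing on the result.
pub fn extract(format: Format, bytes: &[u8], steps: &[PostProcessor]) -> Result<Extraction, String> {
    let extraction = match format {
        Format::Pdf => stub_extract("pdf", bytes),
        Format::Docx => stub_extract("docx", bytes),
        Format::Image => stub_extract("image+ocr", bytes),
        Format::Archive => stub_extract("archive", bytes),
        Format::Unknown => Err("no extractor or plugin registered".to_string()),
    }?;
    Ok(run_pipeline(extraction, steps))
}
```

Keeping `run_pipeline` outside the `match` means a new extractor branch cannot accidentally skip post-processing.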
On failure, is_encrypted=true is recorded in the metadata.
Location: crates/kreuzberg/src/core/config.rs, crates/kreuzberg/src/core/config_validation.rs
ExtractionConfig holds format-specific configs (pdf, image, html, office), fallback orchestration (fallback), and post-processing (postprocessor, chunking, keywords). See struct definition in config.rs.
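To make that layout concrete, here is a hedged sketch of how such a config and its validation might be shaped. Only the field names `pdf`, `fallback`, `chunking`, and `keywords` come from the description above. The field types, the nested structs, and the rules in `validate` are assumptions for illustration; the actual definitions are in config.rs and config_validation.rs.

```rust
/// Hypothetical PDF options (assumed shape).
#[derive(Debug, Clone, Default)]
pub struct PdfConfig {
    /// Passwords to try on encrypted documents, in order.
    pub passwords: Vec<String>,
}

/// Hypothetical chunking options (assumed shape).
#[derive(Debug, Clone)]
pub struct ChunkingConfig {
    pub max_chars: usize,
    pub overlap: usize,
}

/// Sketch of a top-level config: format-specific sections are optional,
/// so an unset section means "use defaults".
#[derive(Debug, Clone, Default)]
pub struct ExtractionConfig {
    pub pdf: Option<PdfConfig>,
    pub fallback: bool,
    pub chunking: Option<ChunkingConfig>,
    pub keywords: bool,
}

/// Reject combinations that cannot work before any extraction starts.
pub fn validate(config: &ExtractionConfig) -> Result<(), String> {
    if let Some(chunking) = &config.chunking {
        if chunking.max_chars == 0 {
            return Err("chunking.max_chars must be greater than zero".to_string());
        }
        if chunking.overlap >= chunking.max_chars {
            return Err("chunking.overlap must be smaller than chunking.max_chars".to_string());
        }
    }
    Ok(())
}
```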
Location: crates/kreuzberg/src/plugins/
Plugin registry loaded at startup, cached for zero-cost lookup.
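One common way to get a load-once, cached registry in Rust is `std::sync::OnceLock`. The sketch below assumes a registry keyed by MIME type and a simple function-pointer plugin type. Both are hypothetical and chosen for brevity; they do not reflect the actual types in plugins/.

```rust
use std::collections::HashMap;
use std::sync::OnceLock;

/// Hypothetical plugin signature: raw bytes in, extracted text out.
pub type ExtractorFn = fn(&[u8]) -> String;

fn extract_plain(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}

/// Built on first access, then shared for the rest of the process.
static REGISTRY: OnceLock<HashMap<&'static str, ExtractorFn>> = OnceLock::new();

/// Return the cached registry, initialising it exactly once.
pub fn registry() -> &'static HashMap<&'static str, ExtractorFn> {
    REGISTRY.get_or_init(|| {
        let mut map: HashMap<&'static str, ExtractorFn> = HashMap::new();
        map.insert("text/plain", extract_plain as ExtractorFn);
        map
    })
}

/// Look up a plugin by MIME type without rebuilding the registry.
pub fn lookup(mime: &str) -> Option<ExtractorFn> {
    registry().get(mime).copied()
}
```

After the first call, `registry()` only returns the existing reference, so each lookup costs a single hash-map read.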
Location: Cargo.toml (workspace), crates/kreuzberg/Cargo.toml, FEATURE_MATRIX.md
20+ features across 9 language bindings. Key feature groups:
| Group | Features | Notes |
|---|---|---|
| OCR | tesseract (default), tesseract-static, ocr-minimal | Treat as mutually exclusive: enable only one |
| Formats | pdf, pdf-minimal, office, office-minimal | |
| AI/ML | embeddings (requires ONNX), keywords-yake, keywords-rake, language-detection | |
| Server | api (Axum), mcp, tokio-runtime, lite-runtime | |
| Bindings | python-bindings, ruby-bindings, php-bindings, node-bindings, wasm | |
Conditional compilation: modules gated with #[cfg(feature = "...")]. Runtime validate_config() warns if requested feature not compiled in.
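A minimal sketch of that pairing follows: a module gated at compile time, plus a runtime check that reports requested features missing from the build. The feature names are taken from the table above. The signature and return type of `validate_config` here are assumptions, and the real function may report warnings differently.

```rust
/// Only compiled when the `pdf` feature is enabled.
#[cfg(feature = "pdf")]
pub mod pdf {
    pub fn extract(_bytes: &[u8]) -> String {
        String::from("pdf text")
    }
}

/// List the features this binary was actually built with.
pub fn compiled_features() -> Vec<&'static str> {
    let mut features = Vec::new();
    if cfg!(feature = "pdf") {
        features.push("pdf");
    }
    if cfg!(feature = "office") {
        features.push("office");
    }
    if cfg!(feature = "embeddings") {
        features.push("embeddings");
    }
    features
}

/// Return one warning per requested feature that is not compiled in.
pub fn validate_config(requested: &[&str]) -> Vec<String> {
    let compiled = compiled_features();
    let mut warnings = Vec::new();
    for &name in requested {
        if !compiled.iter().any(|&c| c == name) {
            warnings.push(format!("feature `{name}` requested but not compiled in"));
        }
    }
    warnings
}
```

Returning the warnings instead of printing them leaves the caller free to log them or to treat them as errors.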
Enabling ocr-minimal together with tesseract should error at compile time.
Validators and hooks run through run_pipeline().
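The compile-time conflict check for the OCR features can be expressed with `compile_error!` behind a `cfg(all(...))` guard. This is a sketch of the general technique, not a copy of the crate's code: the error message and the `ocr_backend` helper are hypothetical.

```rust
// Fails the build when both conflicting features are enabled together.
#[cfg(all(feature = "ocr-minimal", feature = "tesseract"))]
compile_error!("features `ocr-minimal` and `tesseract` are mutually exclusive; enable only one");

/// Report which OCR backend this build selected (hypothetical helper).
pub fn ocr_backend() -> &'static str {
    if cfg!(feature = "tesseract") {
        "tesseract"
    } else if cfg!(feature = "ocr-minimal") {
        "ocr-minimal"
    } else {
        "none"
    }
}
```

Because the guard is evaluated during compilation, a conflicting feature set never produces a binary, so no runtime check is needed for it.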