Extract text, tables, metadata, and images from 91+ document formats (PDF, Office, images, HTML, email, archives, academic) using Kreuzberg. Use when writing code that calls Kreuzberg APIs in Python, Node.js/TypeScript, Rust, or CLI. Covers installation, extraction (sync/async), configuration (OCR, chunking, output format), batch processing, error handling, and plugins.
Kreuzberg is a high-performance document intelligence library with a Rust core and native bindings for Python, Node.js/TypeScript, Ruby, Go, Java, C#, PHP, and Elixir. It extracts text, tables, metadata, and images from 91+ file formats including PDF, Office documents, images (with OCR), HTML, email, archives, and academic formats.
Use this skill when writing code that calls Kreuzberg APIs in Python, Node.js/TypeScript, Rust, or from the CLI.
pip install kreuzberg
# Optional OCR backends:
pip install kreuzberg[easyocr] # EasyOCR
npm install @kreuzberg/node
# Cargo.toml
[dependencies]
kreuzberg = { version = "4", features = ["tokio-runtime"] }
# features: tokio-runtime (required for sync + batch), pdf, ocr, chunking,
# embeddings, language-detection, keywords-yake, keywords-rake
# Download from GitHub releases, or:
cargo install kreuzberg-cli
from kreuzberg import extract_file
result = await extract_file("document.pdf")
print(result.content) # extracted text
print(result.metadata) # document metadata
print(result.tables) # extracted tables
from kreuzberg import extract_file_sync
result = extract_file_sync("document.pdf")
print(result.content)
import { extractFile } from '@kreuzberg/node';
const result = await extractFile('document.pdf');
console.log(result.content);
console.log(result.metadata);
console.log(result.tables);
import { extractFileSync } from '@kreuzberg/node';
const result = extractFileSync('document.pdf');
use kreuzberg::{extract_file, ExtractionConfig};
#[tokio::main]
async fn main() -> kreuzberg::Result<()> {
let config = ExtractionConfig::default();
let result = extract_file("document.pdf", None, &config).await?;
println!("{}", result.content);
Ok(())
}
// Requires the tokio-runtime feature.
use kreuzberg::{extract_file_sync, ExtractionConfig};
fn main() -> kreuzberg::Result<()> {
let config = ExtractionConfig::default();
let result = extract_file_sync("document.pdf", None, &config)?;
println!("{}", result.content);
Ok(())
}
kreuzberg extract document.pdf
kreuzberg extract document.pdf --format json
kreuzberg extract document.pdf --output-format markdown
All languages use the same configuration structure with language-appropriate naming conventions.
from kreuzberg import (
ExtractionConfig, OcrConfig, TesseractConfig,
PdfConfig, ChunkingConfig, extract_file,
)
config = ExtractionConfig(
ocr=OcrConfig(
backend="tesseract",
language="eng",
tesseract_config=TesseractConfig(psm=6, enable_table_detection=True),
),
pdf_options=PdfConfig(passwords=["secret123"]),
chunking=ChunkingConfig(max_chars=1000, max_overlap=200),
output_format="markdown",
)
result = await extract_file("document.pdf", config=config)
import { extractFile, type ExtractionConfig } from '@kreuzberg/node';
const config: ExtractionConfig = {
ocr: { backend: 'tesseract', language: 'eng' },
pdfOptions: { passwords: ['secret123'] },
chunking: { maxChars: 1000, maxOverlap: 200 },
outputFormat: 'markdown',
};
const result = await extractFile('document.pdf', null, config);
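The Python and Node.js configurations above carry the same keys and differ only in casing (`pdf_options` vs `pdfOptions`, `max_chars` vs `maxChars`). A minimal sketch of that mapping, which may help when one config dict has to feed both bindings. `to_camel` and `camelize_keys` are hypothetical helpers, not part of Kreuzberg:

```python
def to_camel(name: str) -> str:
    """Convert a snake_case key such as 'max_chars' to camelCase ('maxChars')."""
    head, *rest = name.split("_")
    return head + "".join(part.capitalize() for part in rest)


def camelize_keys(value):
    """Recursively rename dict keys; lists and scalars pass through unchanged."""
    if isinstance(value, dict):
        return {to_camel(k): camelize_keys(v) for k, v in value.items()}
    if isinstance(value, list):
        return [camelize_keys(v) for v in value]
    return value


# Same values as the Python config shown earlier, written as a plain dict.
python_style = {
    "ocr": {"backend": "tesseract", "language": "eng"},
    "pdf_options": {"passwords": ["secret123"]},
    "chunking": {"max_chars": 1000, "max_overlap": 200},
    "output_format": "markdown",
}
node_style = camelize_keys(python_style)
print(node_style)
```

The Rust example uses different field names (`max_characters`, `overlap`), so this mapping covers only the Python and Node.js shapes shown here.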
use kreuzberg::{extract_file, ExtractionConfig, OcrConfig, ChunkingConfig, OutputFormat};
let config = ExtractionConfig {
ocr: Some(OcrConfig {
backend: "tesseract".into(),
language: "eng".into(),
..Default::default()
}),
chunking: Some(ChunkingConfig {
max_characters: 1000,
overlap: 200,
..Default::default()
}),
output_format: OutputFormat::Markdown,
..Default::default()
};
let result = extract_file("document.pdf", None, &config).await?;
output_format = "markdown"
[ocr]
backend = "tesseract"
language = "eng"
[chunking]
max_chars = 1000
max_overlap = 200
[pdf_options]
passwords = ["secret123"]
# CLI: auto-discovers kreuzberg.toml in current/parent directories
kreuzberg extract doc.pdf
# or explicit:
kreuzberg extract doc.pdf --config kreuzberg.toml
kreuzberg extract doc.pdf --config-json '{"ocr":{"backend":"tesseract","language":"deu"}}'
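Hand-writing the JSON for `--config-json` gets error-prone once quoting is involved. A small sketch that builds the same command line from a Python dict, using only `json` and `shlex` from the standard library. `extract_command` is a hypothetical helper, and it uses only the subcommand and flag shown above:

```python
import json
import shlex


def extract_command(path: str, overrides: dict) -> str:
    """Build a `kreuzberg extract` command line with inline JSON config."""
    # Compact separators keep the payload on one short line.
    payload = json.dumps(overrides, separators=(",", ":"))
    # shlex.join quotes each argument so the JSON survives the shell intact.
    return shlex.join(["kreuzberg", "extract", path, "--config-json", payload])


cmd = extract_command("doc.pdf", {"ocr": {"backend": "tesseract", "language": "deu"}})
print(cmd)
```

When launching the process from Python directly, passing the argument list to `subprocess.run` avoids shell quoting altogether; the string form is mainly useful for logging or for pasting into a terminal.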
from kreuzberg import batch_extract_files, batch_extract_files_sync
# Async
results = await batch_extract_files(["doc1.pdf", "doc2.docx", "doc3.xlsx"])
# Sync
results = batch_extract_files_sync(["doc1.pdf", "doc2.docx"])
for result in results:
print(f"{len(result.content)} chars extracted")
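The batch functions above take a list of paths, so a common first step is gathering that list from a directory tree. A minimal sketch using `pathlib`; `collect_documents` and its default suffix list are assumptions made for illustration, not part of Kreuzberg:

```python
from pathlib import Path


def collect_documents(root: str, suffixes=(".pdf", ".docx", ".xlsx")) -> list[str]:
    """Recursively list files under `root` whose extension is in `suffixes`."""
    wanted = {s.lower() for s in suffixes}
    return sorted(
        str(p)
        for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() in wanted
    )
```

The returned list can then be handed to the batch call, for example `batch_extract_files_sync(collect_documents("docs"))`. Sorting keeps the order of results reproducible between runs.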
import { batchExtractFiles } from '@kreuzberg/node';
const results = await batchExtractFiles(['doc1.pdf', 'doc2.docx']);
// Requires the tokio-runtime feature.
use kreuzberg::{batch_extract_file, ExtractionConfig};
let config = ExtractionConfig::default();
let paths = vec!["doc1.pdf", "doc2.docx"];
let results = batch_extract_file(paths, &config).await?;
kreuzberg batch *.pdf --format json
kreuzberg batch docs/*.docx --output-format markdown
OCR runs automatically for images and scanned PDFs. Tesseract is the default backend (native binding, no external install required).
EasyOCR: pip install kreuzberg[easyocr]. Pass easyocr_kwargs={"gpu": True}.
PaddleOCR: pass paddleocr_kwargs={"use_angle_cls": True}.
GutenOcrBackend.
config = ExtractionConfig(ocr=OcrConfig(language="eng")) # English
config = ExtractionConfig(ocr=OcrConfig(language="eng+deu")) # Multiple
config = ExtractionConfig(ocr=OcrConfig(language="all")) # All installed
config = ExtractionConfig(force_ocr=True) # OCR even if text is extractable
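When the set of languages comes from user input or a settings file, the `eng+deu` form shown above can be assembled programmatically. A small sketch; `ocr_language` is a hypothetical helper, and it does not check that the codes are actually installed:

```python
def ocr_language(*codes: str) -> str:
    """Join language codes into the 'eng+deu' form used above."""
    cleaned = [c.strip().lower() for c in codes if c.strip()]
    if not cleaned:
        raise ValueError("at least one language code is required")
    # dict.fromkeys preserves order while dropping duplicates.
    return "+".join(dict.fromkeys(cleaned))
```

The result is a plain string, so it can be passed straight through as `OcrConfig(language=ocr_language("eng", "deu"))`.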
| Field | Python | Node.js | Rust | Description |
|---|---|---|---|---|
| Text content | result.content | result.content | result.content | Extracted text (str/String) |
| MIME type | result.mime_type | result.mimeType | result.mime_type | Input document MIME type |
| Metadata | result.metadata | result.metadata | result.metadata | Document metadata (dict/object/HashMap) |
| Tables | result.tables | result.tables | result.tables | Extracted tables with cells + markdown |
| Languages | result.detected_languages | result.detectedLanguages | result.detected_languages | Detected languages (if enabled) |
| Chunks | result.chunks | result.chunks | result.chunks | Text chunks (if chunking enabled) |
| Images | result.images | result.images | result.images | Extracted images (if enabled) |
| Elements | result.elements | result.elements | result.elements | Semantic elements (if element_based format) |
| Pages | result.pages | result.pages | result.pages | Per-page content (if page extraction enabled) |
| Keywords | result.keywords | result.keywords | result.keywords | Extracted keywords (if enabled) |
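Several of the fields in the table are populated only when the matching feature is enabled, so code that reads them should tolerate missing or empty values. A minimal sketch of a defensive summary function for the Python field names. It assumes disabled fields may be absent or `None`, which the table does not state explicitly, and it uses a `SimpleNamespace` stand-in rather than a real extraction result:

```python
from types import SimpleNamespace


def summarize(result) -> dict:
    """Collect a few headline numbers from an extraction result."""
    return {
        "mime_type": getattr(result, "mime_type", None),
        "characters": len(result.content),
        # Optional fields: treat a missing attribute or None as empty.
        "tables": len(getattr(result, "tables", None) or []),
        "chunks": len(getattr(result, "chunks", None) or []),
    }


# Stand-in object for illustration only; real results come from extract_file
# or extract_file_sync.
fake = SimpleNamespace(
    content="hello world",
    mime_type="application/pdf",
    tables=[],
    chunks=None,
)
print(summarize(fake))
```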
from kreuzberg import (
extract_file_sync, KreuzbergError, ParsingError,
OCRError, ValidationError, MissingDependencyError,
)