Chunking & Embeddings
Text splitting strategies, embedding generation with FastEmbed, RAG pipeline integration
Location: crates/kreuzberg/src/chunking/, crates/kreuzberg/src/embeddings.rs
Extracted Text
|
[1. Normalization] -> Clean whitespace, remove control chars
|
[2. Chunk Strategy Selection] -> Fixed-size, semantic, syntax-aware, recursive
|
[3. Overlap Management] -> Control context window overlap
|
[4. Optional Embedding] -> Generate vectors with FastEmbed
|
Output: Vec<Chunk> with text, vectors, metadata
Location: crates/kreuzberg/src/chunking/mod.rs
| Strategy | Pattern | Best For |
|---|---|---|
| Fixed-Size | Sliding window with configurable overlap | Uniform chunks for embedding models with fixed token limits |
| Semantic | Split by sentences, merge/split by similarity threshold | Smart context preservation for LLM consumption and semantic search |
| Syntax-Aware | Split by paragraph/section/heading/code-block structure | Preserving document structure (sections, code blocks) in RAG |
| Recursive (LangChain pattern) | Try separators in order: \n\n, \n, " ", "" | Best general-purpose chunking; auto-finds optimal split points |
Key config fields per strategy (see struct definitions in chunking/mod.rs):

- Fixed-Size: chunk_size, overlap, trim_whitespace
- Semantic: target_chunk_size, min/max_chunk_size, semantic_threshold, use_sentence_boundaries
- Syntax-Aware: chunk_by (Paragraph/Section/Heading/Sentence/CodeBlock), max_chunk_size, respect_code_blocks
- Recursive: separators[], chunk_size, overlap

Location: crates/kreuzberg/src/chunking/mod.rs
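The Fixed-Size strategy is the easiest to picture. The following is a minimal, std-only sketch of a sliding window with overlap, operating on characters for simplicity; it is an illustration of the technique, not the crate's actual (token-aware) implementation:

```rust
/// Illustrative fixed-size chunker: a sliding window that advances by
/// chunk_size - overlap, so adjacent chunks share `overlap` characters.
/// Sketch only -- the real implementation in chunking/mod.rs is token-aware.
fn fixed_size_chunks(text: &str, chunk_size: usize, overlap: usize) -> Vec<String> {
    assert!(overlap < chunk_size, "overlap must be smaller than chunk_size");
    let chars: Vec<char> = text.chars().collect();
    let step = chunk_size - overlap; // how far the window advances each time
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < chars.len() {
        let end = (start + chunk_size).min(chars.len());
        chunks.push(chars[start..end].iter().collect());
        if end == chars.len() {
            break; // last window reached the end of the text
        }
        start += step;
    }
    chunks
}
```

With chunk_size=4 and overlap=2, "abcdefghij" yields overlapping windows "abcd", "cdef", "efgh", "ghij".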
| Preset | Chunk Size | Overlap | Strategy | Use Case |
|---|---|---|---|---|
| Balanced | 512 tokens | 50 | Semantic | RAG sweet spot |
| Compact | 256 tokens | 32 | Fixed-Size | Dense vectors |
| Extended | 1024 tokens | 100 | Recursive | Full context |
| Minimal | 128 tokens | 16 | (default) | Lightweight embeddings |
Usage: set config.chunking.preset = Some("balanced") in ExtractionConfig.
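As a sketch of preset selection (the exact type of the preset field, e.g. Option<String> versus an enum, is assumed here, not verified against the crate):

```rust
// Sketch only: field types are assumed, not verified against ExtractionConfig.
let mut config = ExtractionConfig::default();
config.chunking.preset = Some("balanced".to_string());
```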
Location: crates/kreuzberg/src/embeddings.rs
| Model | Dimensions | Notes |
|---|---|---|
| BAAI/bge-small-en-v1.5 (default) | 384 | Fast, excellent for RAG |
| BAAI/bge-small-zh-v1.5 | 384 | Chinese-optimized |
| BAAI/bge-base-en-v1.5 | 768 | Better quality, slower |
| jinaai/jina-embeddings-v2-base-en | 768 | Long context (up to 8192 tokens) |
| Custom(path) | varies | Custom ONNX model path |
TextEmbeddingManager provides singleton-cached models per config. Pattern:
- get_or_init_model() -- lazy-loads the ONNX model (downloading it if needed) and caches it in an Arc<RwLock<HashMap>>
- embed_chunks() -- collects chunk texts, calls model.embed(texts, batch_size), and zips the results back into ChunkWithEmbedding

Default config: batch_size=256, device=CPU, parallel_requests=4.
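The embed-and-zip step above can be sketched in plain Rust. The stub embed function below stands in for FastEmbed's batched model call and returns dummy vectors; the struct shapes are simplified, not the crate's real types:

```rust
/// Simplified stand-ins for the crate's chunk types (sketch only).
struct Chunk { text: String }
struct ChunkWithEmbedding { text: String, embedding: Vec<f32> }

/// Placeholder for model.embed(texts, batch_size): maps each text to a
/// dummy 3-dimensional vector so the zipping pattern can be shown.
fn embed(texts: &[&str]) -> Vec<Vec<f32>> {
    texts.iter().map(|t| vec![t.len() as f32, 0.0, 1.0]).collect()
}

/// The embed_chunks() pattern: one batched model call over all chunk texts,
/// then zip each resulting vector back onto its source chunk in order.
fn embed_chunks(chunks: Vec<Chunk>) -> Vec<ChunkWithEmbedding> {
    let texts: Vec<&str> = chunks.iter().map(|c| c.text.as_str()).collect();
    let vectors = embed(&texts); // single call for the whole batch
    chunks
        .into_iter()
        .zip(vectors)
        .map(|(c, embedding)| ChunkWithEmbedding { text: c.text, embedding })
        .collect()
}
```

Batching all texts into one model call amortizes ONNX inference overhead; the zip relies on the embedder returning one vector per input, in input order.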
Embeddings require ONNX Runtime. Feature-gated via:
[features]
embeddings = ["dep:fastembed", "dep:ort"]
Install: brew install onnxruntime (macOS) / apt install libonnxruntime libonnxruntime-dev (Linux). Verify: echo $ORT_DYLIB_PATH.
The full extraction-to-RAG pipeline:
1. extract_file(path, config) -> ExtractionResult
2. result.content -> Vec<Chunk>
3. TextEmbeddingManager::embed_chunks() -> Vec<ChunkWithEmbedding>
4. RagDocument { file_path, metadata, chunks } -- ready for vector DB ingestion

See the ChunkWithEmbedding struct in types.rs: it contains text, embedding: Vec<f32>, dimensions, norm, and metadata.
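Since ChunkWithEmbedding carries a precomputed norm, cosine similarity at query time reduces to a dot product divided by the two stored norms. A std-only sketch (illustrative, not the crate's code):

```rust
/// L2 norm of an embedding vector, as stored in ChunkWithEmbedding's norm field.
fn l2_norm(v: &[f32]) -> f32 {
    v.iter().map(|x| x * x).sum::<f32>().sqrt()
}

/// Cosine similarity using precomputed norms, so repeated queries against the
/// same chunks skip recomputing each vector's magnitude.
fn cosine(a: &[f32], norm_a: f32, b: &[f32], norm_b: f32) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    dot / (norm_a * norm_b)
}
```

This is the typical reason the struct stores norm alongside the raw vector: vector DBs and in-memory rerankers both benefit from the cached magnitude.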