Detailed architecture documentation including design decisions, processing patterns, and system internals. Use when discussing architecture, understanding why something was built a certain way, or planning significant changes.
┌─────────────────────────────────────────────────────┐
│ Claude Desktop / MCP Client │
└────────────────┬────────────────────────────────────┘
│ STDIO Transport (JSON-RPC)
▼
┌─────────────────────────────────────────────────────┐
│ MCP Java SDK (STDIO Transport) │
│ ┌──────────────────────────────────────────────┐ │
│ │ LuceneSearchTools (MCP Tools) │ │
│ └──────────────────────────────────────────────┘ │
└─────────┬───────────────────────────┬───────────────┘
│ │
▼ ▼
┌──────────────────────┐ ┌──────────────────────┐
│ LuceneIndexService │ │ DocumentCrawler │
│ - Search & Index │ │ Service │
│ - NRT Manager │ │ - File Discovery │
│ - Admin Operations │ │ - Content Extraction │
└──────────┬───────────┘ └──────────┬───────────┘
│ │
▼ ▼
┌─────────────────────────────────────────────────────┐
│ Apache Lucene 10.3 + Apache Tika │
└─────────────────────────────────────────────────────┘
`LuceneserverApplication.java` (deployed profile)

The server uses multiple analyzer chains optimized for different search patterns. Each document is indexed with several shadow fields, each using a different analyzer:
- `UnicodeNormalizingAnalyzer`: primary content field with NFKC normalization, diacritic folding, and ligature expansion
- `ReverseUnicodeNormalizingAnalyzer`: reversed tokens for efficient leading-wildcard queries (`*vertrag`)
- `OpenNLPLemmatizingAnalyzer`: dictionary-based lemmatization for German and English (handles irregular forms)
- `GermanTransliteratingAnalyzer`: maps ASCII digraphs to umlauts (Mueller → Müller)

See PIPELINE.md for complete analyzer chain documentation and query pipeline details.
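The real analyzers are Lucene `TokenFilter` chains (documented in PIPELINE.md), but the two core transformations behind the normalizing and reversed fields can be sketched with the JDK alone. This is an illustration of the idea, not the server's implementation:

```java
import java.text.Normalizer;

// Illustrative sketch only: the production analyzers are Lucene filter chains.
// Shown here: NFKC normalization (ligature/compatibility expansion) and the
// token reversal that makes leading wildcards cheap.
public class AnalyzerSketch {
    // NFKC expands ligatures and compatibility forms, e.g. "ﬁ" (U+FB01) -> "fi".
    static String normalize(String token) {
        return Normalizer.normalize(token, Normalizer.Form.NFKC).toLowerCase();
    }

    // Indexing reversed tokens turns a leading wildcard (*vertrag) into an
    // efficient trailing wildcard on the reversed field (gartrev*).
    static String reverse(String token) {
        return new StringBuilder(token).reverse().toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("ﬁnanzen"));   // finanzen
        System.out.println(reverse("mietvertrag")); // gartrevteim
    }
}
```

A query for `*vertrag` is rewritten against the reversed field as the prefix query `gartrev*`, which Lucene can answer from the term dictionary without scanning every term.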
Directory Walkers (N threads) ──> LinkedBlockingQueue ──> Batch Processor (1 thread)
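The walker/queue/processor shape above can be sketched as a standard producer-consumer pipeline. Class names, sizes, and the flush policy here are illustrative assumptions, not the server's actual code; the real pipeline is driven by the `batch-size` and `batch-timeout-ms` settings:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Minimal sketch of the walkers -> LinkedBlockingQueue -> batch-processor shape.
public class BatchPipeline {
    static final int BATCH_SIZE = 4;          // stands in for batch-size
    static final long BATCH_TIMEOUT_MS = 50;  // stands in for batch-timeout-ms

    final LinkedBlockingQueue<String> queue = new LinkedBlockingQueue<>();
    final List<List<String>> flushed = new ArrayList<>();

    // Single consumer: drain until the batch is full or the timeout fires,
    // then flush whatever accumulated (a partial batch is still flushed).
    void processOnce() {
        List<String> batch = new ArrayList<>();
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(BATCH_TIMEOUT_MS);
        while (batch.size() < BATCH_SIZE) {
            long remaining = deadline - System.nanoTime();
            if (remaining <= 0) break;
            String path;
            try {
                path = queue.poll(remaining, TimeUnit.NANOSECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
            if (path == null) break;  // timed out waiting: flush partial batch
            batch.add(path);
        }
        if (!batch.isEmpty()) flushed.add(batch);
    }

    public static void main(String[] args) {
        BatchPipeline p = new BatchPipeline();
        // Simulate N directory-walker threads producing file paths.
        for (int i = 0; i < 6; i++) p.queue.offer("/docs/file" + i + ".pdf");
        p.processOnce();  // full batch of 4
        p.processOnce();  // partial batch of 2 after the timeout
        System.out.println(p.flushed.size());  // 2
    }
}
```

The timeout matters because a quiet crawl would otherwise leave a partial batch stranded in the queue indefinitely.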
Batching is controlled by `batch-size` and `batch-timeout-ms`.

Configuration precedence: Environment Variable > `~/.mcplucene/config.yaml` > `application.yaml`
When the `LUCENE_CRAWLER_DIRECTORIES` environment variable is set, the MCP configuration tools return errors.
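The precedence chain can be sketched as a simple first-match lookup. The maps stand in for parsed YAML and the key name is a hypothetical example; the real server's configuration loading is not shown here:

```java
import java.util.Map;
import java.util.Optional;

// Sketch of the precedence rule: environment variable beats
// ~/.mcplucene/config.yaml, which beats application.yaml defaults.
public class ConfigResolver {
    static String resolve(String key,
                          Map<String, String> env,
                          Map<String, String> userConfig,
                          Map<String, String> appDefaults) {
        return Optional.ofNullable(env.get(key))
                .or(() -> Optional.ofNullable(userConfig.get(key)))
                .or(() -> Optional.ofNullable(appDefaults.get(key)))
                .orElseThrow(() -> new IllegalArgumentException("unknown key: " + key));
    }

    public static void main(String[] args) {
        Map<String, String> env  = Map.of("batch-size", "500");
        Map<String, String> user = Map.of("batch-size", "200", "batch-timeout-ms", "1000");
        Map<String, String> app  = Map.of("batch-timeout-ms", "5000");
        System.out.println(resolve("batch-size", env, user, app));       // 500  (env wins)
        System.out.println(resolve("batch-timeout-ms", env, user, app)); // 1000 (user config wins)
    }
}
```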
`operationId`: `getIndexAdminStatus()`

The document crawler uses a multi-layered architecture:
- `DocumentCrawlerService` - main orchestrator
- `FileContentExtractor` - Apache Tika integration
- `DocumentIndexer` - Lucene document builder
- `DirectoryWatcherService` - file system monitoring
- `CrawlExecutorService` - thread pool management
- `CrawlStatisticsTracker` - progress tracking
- `IndexReconciliationService` - incremental indexing
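As one example of these layers, the kind of thread-safe progress tracking a statistics tracker performs can be sketched with atomic counters. The counter names and `pending()` derivation below are assumptions for illustration; the real `CrawlStatisticsTracker` API is not shown here:

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch of thread-safe crawl progress tracking.
// Counter names are hypothetical, not the real tracker's fields.
public class CrawlStats {
    final AtomicLong discovered = new AtomicLong();
    final AtomicLong indexed = new AtomicLong();
    final AtomicLong failed = new AtomicLong();

    void onDiscovered() { discovered.incrementAndGet(); }
    void onIndexed()    { indexed.incrementAndGet(); }
    void onFailed()     { failed.incrementAndGet(); }

    // Remaining work = discovered minus everything in a terminal state.
    long pending() {
        return discovered.get() - indexed.get() - failed.get();
    }

    public static void main(String[] args) {
        CrawlStats stats = new CrawlStats();
        for (int i = 0; i < 10; i++) stats.onDiscovered();
        for (int i = 0; i < 7; i++) stats.onIndexed();
        stats.onFailed();
        System.out.println(stats.pending()); // 2
    }
}
```

Atomic counters let the N walker threads and the batch processor update statistics without sharing a lock.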
Every byte returned by an MCP tool is a token consumed by the AI client. This is a first-class design constraint — treat response size the same way you would treat latency or memory.
Rules for MCP tool responses:
- Passages are truncated to `max-passage-char-length` (default: 200) using a highlight-centred window that trims irrelevant leading/trailing text while preserving `<em>` tags and surrounding context.
- The default `max-passages` is 3 per document: enough for a relevance judgement, not exhaustive coverage. Users who need more can retrieve the full document via `getDocument`.
- Estimate worst-case response size as (fields per result) × (avg field size) × (max results) × (max passages), and keep it under ~10 KB per tool call as a guideline.
- Example: the passage system was redesigned to reduce the worst-case per-search token load from ~25,000 chars (5 passages × 500 chars × 10 results) to ~6,000 chars (3 passages × 200 chars × 10 results), a 76% reduction with better precision, because passages are now individually extracted and windowed around highlights.
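The core of the highlight-centred windowing can be sketched in a few lines. The server's real windowing is more sophisticated (word boundaries, multiple highlights, tag-safe trimming); this shows only the central idea of clamping a fixed-size window around the first `<em>` highlight:

```java
// Sketch of a highlight-centred passage window: centre a fixed-size window
// on the first <em> highlight, clamped to the passage bounds. Illustrative
// only; not the server's actual passage extractor.
public class PassageWindow {
    static String window(String passage, int maxLen) {
        if (passage.length() <= maxLen) return passage;
        int hl = passage.indexOf("<em>");
        if (hl < 0) return passage.substring(0, maxLen);
        // Centre on the highlight, then clamp so the window stays in bounds.
        int start = Math.max(0, Math.min(hl - maxLen / 2, passage.length() - maxLen));
        return passage.substring(start, start + maxLen);
    }

    public static void main(String[] args) {
        String p = "x".repeat(300) + "<em>Vertrag</em>" + "y".repeat(300);
        String w = window(p, 200);
        System.out.println(w.length());                     // 200
        System.out.println(w.contains("<em>Vertrag</em>")); // true
    }
}
```

With `maxLen` set to the default `max-passage-char-length` of 200, each passage contributes a bounded number of characters regardless of how long the matched document field is.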
| Limitation | Reason | Workaround |
|---|---|---|
| Lexical search only | Simplicity, no ML dependencies | AI generates OR queries for synonyms |
| Single-node only | Target: personal document collections | Vertical scaling |
| STDIO only | Claude Desktop requirement | Could add SSE transport |
| No auth | Single-user desktop deployment | OS-level sandboxing |
See the /improvements skill for the full prioritized roadmap, trade-off analysis, and rejected ideas with reasoning.