Parent/chunk document architecture and hybrid search implementation (BM25 + kNN + RRF) in aithena's Solr schema
Apply this skill when modifying Solr queries, adding schema fields, changing search modes, or reviewing PRs that touch search_service.py, managed-schema.xml, or document-indexer chunking logic. The parent/chunk split is the most common source of correctness bugs in this project.
Parent documents (books):
- id = SHA-256 of the file path (unique per book)
- title_s/t, author_s/t, year_i, category_s, series_s, language_detected_s, file_path_s, folder_path_s, page_count_i, file_size_l
- book_embedding (512D) for book-level similarity
- No parent_id_s field — its absence is how you identify a parent

Chunk documents (text fragments):
- id = {parent_id}_chunk_{index} (index is zero-padded, e.g. {parent_id}_chunk_0000)
- parent_id_s = parent book's id (foreign key)
- chunk_text_t = extracted text (400 words, 50-word overlap, page-aware)
- embedding_v = 512D dense vector (HNSW cosine) — primary kNN search field
- chunk_index_i, page_start_i, page_end_i for positioning

Keyword (BM25):
- EXCLUDE_CHUNKS_FQ = "-parent_id_s:[* TO *]" to return only parent documents

Semantic (kNN):
- Searches chunk documents via their dense vectors (embedding_v)

Hybrid (RRF):
| Field purpose | Add to parent? | Add to chunk? | Why |
|---|---|---|---|
| Book metadata (author, year) | Yes | Copy from parent | Chunks need it for display after kNN |
| Full-text search field | Yes (via Tika) | No (use chunk_text_t) | Tika extracts to parent; chunks have own text |
| Dense vector embedding | Optional (book_embedding) | Yes (embedding_v) | kNN searches chunks, not parents |
| Facet field | Yes | Not needed | Facets come from BM25 leg (parents only) |
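To make the table concrete, here are hypothetical parent and chunk documents; all field values are invented for illustration:

```python
parent_sha = "ab" * 32  # stand-in for sha256(file_path).hexdigest()

parent = {
    "id": parent_sha,
    "title_s": "Example Book",
    "author_s": "Jane Doe",   # book metadata lives on the parent...
    "year_i": 1999,
    # no parent_id_s: its absence is what marks this as a parent
}

chunk = {
    "id": f"{parent_sha}_chunk_0000",
    "parent_id_s": parent_sha,  # foreign key back to the book
    "chunk_text_t": "First 400-word fragment of the book text",
    "author_s": "Jane Doe",     # ...and is copied onto chunks for display after kNN
    "chunk_index_i": 0,
    "page_start_i": 1,
    "page_end_i": 3,
}
```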
- Parent id: hashlib.sha256(file_path.encode()).hexdigest()
- Chunk id: f"{parent_id}_chunk_{chunk_index:04d}" (zero-padded, matching the scheme above)
- Join key: parent_id_s (present on chunks only)
- Never add EXCLUDE_CHUNKS_FQ to kNN queries — this silently returns zero results, since chunks are the only documents with embeddings. (Source: PR #701 incident)
- book_embedding is optional; embedding_v on chunks is the primary vector field.

```python
params = {
    "q": "{!knn f=embedding_v topK=10}[0.5, -0.2, ...]",
    # NO fq excluding chunks — chunks ARE the target
}
```
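Serializing the query vector into that local-parameter string is easy to get wrong; here is a minimal sketch (build_knn_query is a hypothetical helper, not a function from search_service.py):

```python
def build_knn_query(vector, field="embedding_v", top_k=10):
    """Format a dense vector as a Solr {!knn} local-parameter query string."""
    serialized = "[" + ", ".join(f"{v:.6f}" for v in vector) + "]"
    return f"{{!knn f={field} topK={top_k}}}{serialized}"
```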
```python
params = {
    "q": "search terms",
    "defType": "edismax",
    "fq": ["-parent_id_s:[* TO *]"],  # parents only
}
```
```python
solr.delete(q=f'id:"{book_id}" OR parent_id_s:"{book_id}"')
solr.commit()
```
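The id scheme can be sketched as follows; the :04d padding width is an assumption based on the {parent_id}_chunk_0000 example:

```python
import hashlib


def make_parent_id(file_path: str) -> str:
    """SHA-256 of the file path — unique per book."""
    return hashlib.sha256(file_path.encode()).hexdigest()


def make_chunk_id(parent_id: str, chunk_index: int) -> str:
    """Zero-padded chunk suffix, e.g. {parent_id}_chunk_0000."""
    return f"{parent_id}_chunk_{chunk_index:04d}"
```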
Keyword (BM25):
- Queries _text_ (default), phrase boost: title_t^2
- Empty query falls back to *:* (returns everything)
- Excludes chunks (-parent_id_s:[* TO *])

Semantic (kNN):
- Uses {!knn} local-parameter syntax on the embedding_v field
- Query text is embedded via POST /v1/embeddings/

Hybrid (RRF):
- score = sum(1/(k + rank)), k=60

```python
def reciprocal_rank_fusion(keyword_results, semantic_results, k=60):
    scores = {}
    result_map = {}
    for rank, doc in enumerate(keyword_results, start=1):
        scores[doc["id"]] = 1.0 / (k + rank)
        result_map[doc["id"]] = doc
    for rank, doc in enumerate(semantic_results, start=1):
        scores[doc["id"]] = scores.get(doc["id"], 0.0) + 1.0 / (k + rank)
        if doc["id"] not in result_map:
            result_map[doc["id"]] = doc
    # Sort by fused score descending; the RRF score replaces the original score
    fused = []
    for doc_id in sorted(scores, key=scores.get, reverse=True):
        doc = dict(result_map[doc_id])
        doc["score"] = scores[doc_id]
        fused.append(doc)
    return fused
```
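A tiny worked example of the fusion arithmetic (doc ids invented; computed directly from the formula rather than through the service code):

```python
k = 60
keyword_ranks = {"book_a": 1, "book_b": 2}   # BM25 leg
semantic_ranks = {"book_b": 1, "book_c": 2}  # kNN leg

scores = {}
for ranks in (keyword_ranks, semantic_ranks):
    for doc_id, rank in ranks.items():
        scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)

order = sorted(scores, key=scores.get, reverse=True)
# book_b appears in both legs (1/61 + 1/62), so it outranks either leg's top-1
# order == ["book_b", "book_a", "book_c"]
```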
Key properties:
- k is configurable via the RRF_K env var
- Each leg fetches max(page_size * 2, 20) results for adequate fusion

Call pattern:
```python
import httpx

response = httpx.post(EMBEDDINGS_URL, json={"input": query_text}, timeout=EMBEDDINGS_TIMEOUT)
vector = response.json()["data"][0]["embedding"]  # 512-dim float list
```
Fallback chain:
- If the embeddings call fails, hybrid/semantic degrade to keyword
- Final fallback: *:* results (keyword)

Timeout alignment (critical):
- Embeddings calls use a configurable timeout (EMBEDDINGS_TIMEOUT)
- proxy_read_timeout must be >= 1.5x the embeddings timeout (180s)

kNN vectors are 512 floats serialized as JSON arrays — easily >4KB. Combined with filter queries, this exceeds GET URI limits. Always use POST request body for Solr queries. (Source: #706)
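The fallback behavior can be sketched with the search legs stubbed out (search_with_fallback is a hypothetical wrapper; the real flow lives in search_service.py):

```python
def search_with_fallback(query, embed_fn, semantic_fn, keyword_fn):
    """Try the semantic leg; degrade to keyword if embeddings are unavailable."""
    try:
        vector = embed_fn(query)
    except Exception:  # embeddings-server down or timed out
        vector = None
    if vector is not None:
        return semantic_fn(vector)
    # keyword leg; an empty query becomes match-all
    return keyword_fn(query or "*:*")
```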
| Source | keyword | semantic | hybrid |
|---|---|---|---|
| Facets | Solr facet_counts | None | From BM25 leg |
| Highlights | Solr highlighting | None | From BM25 leg |
| Sort | Solr-native | By cosine score | By RRF score |
Facet fields are defined in FACET_FIELDS dict mapping logical names to Solr field tuples. Multi-field facets (e.g., language uses both language_detected_s and language_s) fall back to the first non-empty field.
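A sketch of the multi-field fallback idea; the FACET_FIELDS shape here is illustrative and the real mapping lives in search_service.py:

```python
FACET_FIELDS = {
    "language": ("language_detected_s", "language_s"),  # first non-empty wins
    "author": ("author_s",),
}


def first_nonempty_facet(counts_by_field, logical_name):
    """Pick facet counts from the first candidate field that returned anything."""
    for field in FACET_FIELDS[logical_name]:
        counts = counts_by_field.get(field)
        if counts:
            return counts
    return {}
```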
All facet filter values must be Lucene-escaped before inclusion in fq parameters to prevent Solr query injection. Use the solr_escape() utility function (typically via build_filter_queries, which applies it for you).
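For reference, a minimal version of what Lucene escaping must cover — the real solr_escape() lives in search_service.py and may differ:

```python
# Lucene query syntax characters that must be backslash-escaped in values
LUCENE_SPECIAL = set('+-&|!(){}[]^"~*?:\\/')


def solr_escape(value: str) -> str:
    """Backslash-escape Lucene query syntax characters in a filter value."""
    return "".join("\\" + ch if ch in LUCENE_SPECIAL else ch for ch in value)
```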
- Model: distiluse-base-multilingual-cased-v2 (512 dimensions)
- Served by embeddings-server at POST /v1/embeddings/
- Book-level (book_embedding): computed during indexing, used for similar-books
- Chunk-level (embedding_v): computed per chunk during indexing, used for search
- Query-time: solr-search calling embeddings-server

All search modes support the same filter query parameters:
- fq_author, fq_category, fq_language, fq_year

References:
- docs/architecture/solr-data-model.md — full architecture reference
- src/solr-search/search_service.py — RRF implementation, EXCLUDE_CHUNKS_FQ constant, query builders
- src/solr/books/managed-schema.xml — field definitions
- src/solr-search/README.md — data model summary