Skill File

PageIndex RAG Architecture

Name: PageIndex RAG Architecture
Author: mmtmr

Build reasoning-based RAG systems using PageIndex architecture. Replaces vector databases with hierarchical table-of-contents indices and LLM-driven navigation. Use when (1) implementing RAG for long structured documents (financial reports, legal contracts, technical manuals), (2) improving existing vector-based RAG systems with poor accuracy on structured content, (3) designing document indexing strategies with semantic coherence, (4) explaining PageIndex concepts including reasoning-based retrieval, hierarchical navigation, and cross-reference following, or (5) handling documents with internal references and multi-turn conversations. Focuses on technical architecture, core research insights, and practical implementation patterns rather than using the PageIndex package directly.

mmtmr | 0 stars | Feb 10, 2026

Occupation
Categories: Knowledge Base

Skill Content

PageIndex replaces vector-based similarity search with LLM-driven hierarchical navigation. By reasoning through document structure instead of matching embeddings, it reportedly achieves 98.7% accuracy on financial document benchmarks.

Core Innovation: Why Vector RAG Fails

Query-Knowledge Mismatch: Vector similarity measures surface semantics, not task relevance. A query such as "What are debt trends?" retrieves passages that mention "trends" rather than the sections that actually analyze how debt changes over time.

Hard Chunking: Fixed chunks of 512-1000 tokens split text mid-sentence and break contextual continuity. A financial statement split across chunks loses the relationship between assets and liabilities.

Context Window Deterioration: Retrieving 10-20 chunks creates a needle-in-a-haystack problem in which the relevant information gets buried.

Cross-Reference Blindness: Vector retrieval cannot follow references such as "see Appendix G" or "Section 3.2" without manual preprocessing.
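
The hard-chunking failure is easy to reproduce. The sketch below is illustrative only: it uses whitespace-separated words as a stand-in for tokens, and the sample statement and the `fixed_size_chunks` helper are hypothetical, not part of PageIndex.

```python
def fixed_size_chunks(text: str, chunk_size: int) -> list[str]:
    """Split text into consecutive chunks of chunk_size words (a proxy for tokens)."""
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


# Hypothetical two-sentence statement used only for illustration.
statement = (
    "Total assets were 120 million. "
    "Total liabilities were 80 million, leaving equity of 40 million."
)

chunks = fixed_size_chunks(statement, chunk_size=8)
for chunk in chunks:
    print(repr(chunk))
```

The first chunk ends with "Total liabilities were" and the second begins with "80 million,", so neither chunk on its own states the liabilities figure.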

PageIndex Solution

Replace vector databases with hierarchical tree indices stored as JSON:

{
  "node_id": "section_2_1",
  "name": "Financial Assets",
  "description": "Current and long-term financial assets including marketable securities",
  "start_index": 12,
  "end_index": 15,
  "nodes": [...]
}
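
The later function signatures refer to a `TreeNode` type that this page never defines. One plausible in-memory form, assuming the fields mirror the JSON node above, is a small dataclass. The `from_dict` and `walk` helpers and the parent node in the example are assumptions added for illustration.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class TreeNode:
    """In-memory form of one index node; field names follow the JSON above."""
    node_id: str
    name: str
    description: str
    start_index: int
    end_index: int
    nodes: List["TreeNode"] = field(default_factory=list)

    @classmethod
    def from_dict(cls, data: dict) -> "TreeNode":
        """Build a node (and its children, recursively) from a parsed JSON dict."""
        return cls(
            node_id=data["node_id"],
            name=data["name"],
            description=data.get("description", ""),
            start_index=data["start_index"],
            end_index=data["end_index"],
            nodes=[cls.from_dict(child) for child in data.get("nodes", [])],
        )

    def walk(self) -> Iterator["TreeNode"]:
        """Yield this node and all descendants, depth-first."""
        yield self
        for child in self.nodes:
            yield from child.walk()


index = {
    "node_id": "section_2",  # hypothetical parent node
    "name": "Financial Position",
    "description": "Balance sheet overview",
    "start_index": 10,
    "end_index": 19,
    "nodes": [
        {
            "node_id": "section_2_1",
            "name": "Financial Assets",
            "description": "Current and long-term financial assets including marketable securities",
            "start_index": 12,
            "end_index": 15,
        }
    ],
}
root = TreeNode.from_dict(index)
print([node.node_id for node in root.walk()])
```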

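Implementation Workflow

1. Build Hierarchical Index

from typing import List

# TreeNode (used in the signatures below) is the in-memory form of a JSON index node.

def extract_toc_from_pdf(pdf_path: str, toc_pages: int = 20) -> List[dict]:
    """
    Parse the table of contents from the first N pages.
    Returns: [{title, page, level}, ...]
    """
    # Detect ToC patterns:
    # - Lines with page numbers: "Section 2.1 ..... 42"
    # - Indentation indicating hierarchy
    # - Keywords: Chapter, Section, Appendix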
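
As a rough sketch of the line-level detection described in the comments above, the following parses ToC lines that have already been extracted as text. The `TOC_LINE` pattern, the `parse_toc_lines` name, and the two-spaces-per-level indentation rule are assumptions; real PDFs usually need more robust handling, and the keyword check (Chapter, Section, Appendix) is left out.

```python
import re
from typing import List

# Matches lines such as "Section 2.1 Financial Assets ..... 42":
# optional indentation, a title, a dot leader or a wide gap, then a page number.
TOC_LINE = re.compile(
    r"^(?P<indent>\s*)(?P<title>\S.*?)\s*(?:\.{2,}|\s{2,})\s*(?P<page>\d+)\s*$"
)


def parse_toc_lines(lines: List[str], indent_width: int = 2) -> List[dict]:
    """Return [{title, page, level}, ...] for every line that looks like a ToC entry."""
    entries = []
    for line in lines:
        match = TOC_LINE.match(line)
        if match is None:
            continue  # not a ToC entry (no dot leader or page number)
        # Assumption: each indent_width spaces of indentation is one level deeper.
        level = len(match.group("indent")) // indent_width + 1
        entries.append(
            {
                "title": match.group("title"),
                "page": int(match.group("page")),
                "level": level,
            }
        )
    return entries


sample = [
    "Annual Report 2024",
    "Section 2 Financial Position ..... 40",
    "  Section 2.1 Financial Assets ..... 42",
]
toc = parse_toc_lines(sample)
```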

def build_tree(toc_entries: List[dict]) -> TreeNode:
    """
    Convert flat ToC to nested tree structure
    Assigns node_ids, page ranges, hierarchical relationships
    """
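
One way the flat-to-nested conversion could work is a stack that tracks the current path from the root. This is a sketch under stated assumptions, not the PageIndex implementation: nodes are plain dicts shaped like the JSON index, levels start at 1, `total_pages` is an extra parameter, and each section is assumed to start on a new page so that it ends one page before the next section at the same or a shallower level.

```python
from typing import List


def build_tree(toc_entries: List[dict], total_pages: int) -> dict:
    """Nest flat [{title, page, level}, ...] entries and assign page ranges."""
    root = {"node_id": "root", "name": "Document",
            "start_index": 1, "end_index": total_pages, "nodes": []}
    stack = [(0, root)]  # (level, node) path from the root to the latest entry

    for i, entry in enumerate(toc_entries):
        # A section runs until the next entry at the same or a shallower level.
        end = total_pages
        for later in toc_entries[i + 1:]:
            if later["level"] <= entry["level"]:
                end = max(entry["page"], later["page"] - 1)
                break

        node = {"node_id": f"node_{i + 1}", "name": entry["title"],
                "start_index": entry["page"], "end_index": end, "nodes": []}

        # Pop back up to this entry's parent, then attach the node there.
        while stack[-1][0] >= entry["level"]:
            stack.pop()
        stack[-1][1]["nodes"].append(node)
        stack.append((entry["level"], node))

    return root


entries = [
    {"title": "Section 1", "page": 1, "level": 1},
    {"title": "Section 2", "page": 10, "level": 1},
    {"title": "Section 2.1", "page": 12, "level": 2},
    {"title": "Section 3", "page": 20, "level": 1},
]
tree = build_tree(entries, total_pages=30)
```

Here "Section 2" covers pages 10-19 and contains "Section 2.1" (pages 12-19), while "Section 3" runs to the last page.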

def generate_descriptions(node: TreeNode, doc_path: str):
    """
    LLM creates semantic descriptions per section:
    - Key topics covered
    - Type of information (data, analysis, methodology)
    - Relevant domain concepts
    """

2. Implement Reasoning-Based Retrieval

from typing import List, Optional

def select_relevant_nodes(
    query: str,
    tree: TreeNode,
    conversation_history: Optional[List[str]] = None
) -> List[TreeNode]:
    """
    LLM reasons over tree structure:
    1. What type of information does query require?
    2. Which sections' descriptions indicate relevance?
    3. Consider conversation history (prior focus areas)

    Returns 1-3 most promising nodes
    """
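
A minimal sketch of the selection step follows. It assumes nodes are plain dicts shaped like the JSON index and that the model is reached through a caller-supplied `ask_llm(prompt) -> str` callable, which is a hypothetical stand-in for whatever client is in use. The prompt wording and the JSON-list reply format are also assumptions, and conversation history is omitted for brevity.

```python
import json
from typing import Callable, Dict, List


def render_outline(node: dict, depth: int = 0) -> str:
    """Render the tree as an indented outline the LLM can reason over."""
    line = f"{'  ' * depth}[{node['node_id']}] {node['name']}: {node.get('description', '')}"
    children = [render_outline(child, depth + 1) for child in node.get("nodes", [])]
    return "\n".join([line] + children)


def index_by_id(node: dict) -> Dict[str, dict]:
    """Map every node_id in the tree to its node."""
    found = {node["node_id"]: node}
    for child in node.get("nodes", []):
        found.update(index_by_id(child))
    return found


def select_relevant_nodes(
    query: str,
    tree: dict,
    ask_llm: Callable[[str], str],
    max_nodes: int = 3,
) -> List[dict]:
    """Ask the LLM for the most promising sections and resolve them to nodes."""
    prompt = (
        "Pick the sections most likely to answer the question.\n"
        f"Question: {query}\n\nOutline:\n{render_outline(tree)}\n\n"
        f"Reply with a JSON list of at most {max_nodes} node_id strings."
    )
    try:
        chosen = json.loads(ask_llm(prompt))
    except json.JSONDecodeError:
        return []
    if not isinstance(chosen, list):
        return []
    by_id = index_by_id(tree)
    # Drop ids the model invented and enforce the node limit.
    valid = [by_id[i] for i in chosen if isinstance(i, str) and i in by_id]
    return valid[:max_nodes]
```

Validating the returned ids against the tree keeps a hallucinated section name from propagating into the extraction step.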

def extract_content_range(doc_path: str, start_page: int, end_page: int) -> str:
    """
    Retrieve exact page ranges (preserves semantic boundaries)
    Each node = 5-15 pages typically
    """
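
The range extraction itself is simple once page text is available. The sketch below takes a list of already-extracted page strings instead of a document path, which is a simplification made here so the example stays self-contained. It also assumes page numbers are 1-based and inclusive, and the `[page N]` markers are an added convention, not something the source specifies.

```python
from typing import List


def extract_content_range(pages: List[str], start_page: int, end_page: int) -> str:
    """Return the text of pages start_page..end_page (1-based, inclusive)."""
    if start_page < 1 or end_page < start_page:
        raise ValueError(f"invalid page range: {start_page}-{end_page}")
    selected = pages[start_page - 1:end_page]
    # Label each page so the LLM can cite where a fact came from.
    return "\n\n".join(
        f"[page {start_page + offset}]\n{text}"
        for offset, text in enumerate(selected)
    )


pages = ["cover", "assets table", "liabilities table", "notes"]
section_text = extract_content_range(pages, 2, 3)
```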

def evaluate_sufficiency(query: str, collected_context: str) -> dict:
    """
    LLM meta-reasoning:
    - Does context contain data needed to answer?
    - Are there gaps requiring more information?
    - Does text reference another section?

    Returns: {status: "sufficient" | "insufficient" | "follow_reference"}
    """
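
A sketch of how the sufficiency check might wrap an LLM call is shown below. The `ask_llm` callable, the prompt wording, and the JSON reply format are assumptions. The fallback to "insufficient" on malformed output is a design choice made here so that a bad reply keeps the loop searching instead of ending it early.

```python
import json
from typing import Callable

VALID_STATUSES = ("sufficient", "insufficient", "follow_reference")


def evaluate_sufficiency(
    query: str,
    collected_context: str,
    ask_llm: Callable[[str], str],
) -> dict:
    """Ask the LLM whether the context answers the query; validate its reply."""
    prompt = (
        "Decide whether the context is enough to answer the question.\n"
        f"Question: {query}\n\nContext:\n{collected_context}\n\n"
        'Reply with JSON: {"status": "sufficient" | "insufficient" | "follow_reference"}'
    )
    try:
        result = json.loads(ask_llm(prompt))
    except json.JSONDecodeError:
        result = None
    if not isinstance(result, dict) or result.get("status") not in VALID_STATUSES:
        # Unparseable or unexpected output: keep searching.
        return {"status": "insufficient"}
    return result
```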

def follow_cross_reference(context: str, tree: TreeNode) -> TreeNode:
    """
    Detect patterns: "see Appendix G", "discussed in Section 2.1"
    Navigate tree to referenced node
    """
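
Reference detection can be approximated with a regular expression before involving the LLM. In this sketch the `REFERENCE` pattern, the trigger phrases, and the rule that a node matches when its name begins with the referenced label are all assumptions; nodes are plain dicts shaped like the JSON index, and `None` is returned when nothing resolves.

```python
import re
from typing import Optional

# Captures labels such as "Appendix G" or "Section 2.1" after a trigger phrase.
REFERENCE = re.compile(
    r"\b(?:see|discussed in|refer to)\s+"
    r"((?:Appendix|Section|Chapter)\s+[A-Z0-9]+(?:\.\d+)*)",
    re.IGNORECASE,
)


def find_node_by_label(node: dict, label: str) -> Optional[dict]:
    """Depth-first search for a node whose name begins with exactly this label."""
    # The lookahead stops "Section 2" from matching "Section 2.1" or "Section 21".
    if re.match(re.escape(label) + r"(?![\w.])", node["name"], re.IGNORECASE):
        return node
    for child in node.get("nodes", []):
        found = find_node_by_label(child, label)
        if found is not None:
            return found
    return None


def follow_cross_reference(context: str, tree: dict) -> Optional[dict]:
    """Return the first referenced node that exists in the tree, if any."""
    for match in REFERENCE.finditer(context):
        label = " ".join(match.group(1).split())  # normalize whitespace
        node = find_node_by_label(tree, label)
        if node is not None:
            return node
    return None
```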

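def extract_content(nodes: list, doc_path: str) -> str:
    """Concatenate the page ranges of the given nodes.

    Assumes each TreeNode exposes the start_index/end_index fields of the JSON index.
    """
    return "\n\n".join(
        extract_content_range(doc_path, node.start_index, node.end_index)
        for node in nodes
    )

def retrieve(query: str, tree: TreeNode, doc_path: str, max_iterations: int = 5) -> str:
    context = ""
    for _ in range(max_iterations):
        nodes = select_relevant_nodes(query, tree)
        context += extract_content(nodes, doc_path)

        # Named "result" to avoid shadowing the built-in eval()
        result = evaluate_sufficiency(query, context)
        if result['status'] == 'sufficient':
            return context
        elif result['status'] == 'follow_reference':
            ref_node = follow_cross_reference(context, tree)
            context += extract_content([ref_node], doc_path)

    # Iteration budget exhausted: return whatever was collected
    return context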

Key Configuration Parameters

CONFIG = {
    # Indexing
    'max_pages_per_node': 10,      # 5-15 optimal (too small = overhead, too large = reverts to chunking)
    'max_tokens_per_node': 20000,  # Hard limit on node size
    'toc_check_pages': 20,         # Pages to scan for ToC

    # Retrieval
    'max_iterations': 5,            # Prevent infinite loops
    'max_nodes_per_iteration': 3,   # Sections to check simultaneously

    # LLM
    'model': 'gpt-4o-2024-11-20',  # Or claude-sonnet-4-5
    'temperature': 0.1,             # Low for consistent reasoning
}
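
To illustrate how a limit such as `max_pages_per_node` might be enforced during indexing, the sketch below splits an oversized page range into consecutive sub-ranges. The source does not say how oversized nodes are actually split, so `split_page_range` is a hypothetical helper showing one simple option.

```python
from typing import List, Tuple


def split_page_range(
    start_page: int,
    end_page: int,
    max_pages_per_node: int = 10,
) -> List[Tuple[int, int]]:
    """Split an inclusive page range into consecutive ranges of limited size."""
    if max_pages_per_node < 1:
        raise ValueError("max_pages_per_node must be at least 1")
    ranges = []
    page = start_page
    while page <= end_page:
        last = min(page + max_pages_per_node - 1, end_page)
        ranges.append((page, last))
        page = last + 1
    return ranges


sub_ranges = split_page_range(12, 40, max_pages_per_node=10)
```

A 29-page section starting at page 12 becomes three child ranges: 12-21, 22-31, and 32-40.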

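Integration Examples

from typing import List

from langchain_core.callbacks import CallbackManagerForRetrieverRun
from langchain_core.documents import Document
from langchain_core.retrievers import BaseRetriever

class PageIndexRetriever(BaseRetriever):
    tree: TreeNode
    document_path: str

    # Recent LangChain releases have subclasses override _get_relevant_documents; callers use retriever.invoke(query).
    def _get_relevant_documents(self, query: str, *, run_manager: CallbackManagerForRetrieverRun) -> List[Document]:
        context = retrieve(query, self.tree, self.document_path)
        return [Document(page_content=context)]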

def hybrid_retrieve(query: str, doc_type: str):
    if doc_type == "structured":
        # Financial reports, contracts → PageIndex
        return pageindex_retrieve(query)
    else:
        # Unstructured content → vector search
        return vector_retrieve(query)
